TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

The Other Shoe Drops for GSK in Avandia MDL — Hand Waving on Specific Causation

January 24th, 2011

For GSK, the other shoe dropped in the Avandia multi-district litigation on January 13, 2011, when the presiding judge denied the defense challenge to plaintiffs’ expert witnesses’ specific causation opinions in the first case set for trial.  Burford v. GlaxoSmithKline, PLC, 2011 WL 135017 (E.D. Pa. 2011). 

In the MDL court’s opinion on general causation, In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576 (E.D. Pa. 2011), Judge Rufe determined that she was bound to apply a “Third Circuit” approach to expert witness gatekeeping, which focused on the challenged expert witnesses’ methodology, not their conclusions.  In Burford, Judge Rufe, citing two Third Circuit cases that were decided after Daubert but before Joiner, repeated this basic mistake.  Burford, 2011 WL 135017, at *2.  Remarkably, the court’s opinion in Burford recites the current version of Federal Rule of Evidence 702, which requires the court to analyze whether the challenged testimony is based upon “sufficient facts or data,” as well as whether it is “the product of reliable principles and methods.”  The Rule thus mandates consideration of the reliability and validity of the witness’s conclusions, if those conclusions are part of his testimony.  This Rule, enacted by Congress in 2000, is a statute, and thus supersedes prior case law, although the Advisory Committee Notes explain that the language of the Rule draws heavily from the United States Supreme Court’s decisions in Daubert, Joiner, and Kumho Tire.  The Avandia MDL court ignored both the post-Daubert decisions of the Supreme Court and the controlling language of the statute in its gatekeeping opinions on general and specific causation.

Two expert witnesses on specific causation were the subject of GSK’s challenge in Burford:  Dr. Nicholas DePace and Dr. Judy Melinek.  The court readily dispatched Dr. Melinek, who opined that Mr. Burford’s fatal cardiac event, which she characterized as a heart attack, was caused by Avandia because Avandia causes heart attacks.  The court correctly noted that this inference was improper because risk does not equal causation in a specific case.

As one well-known epidemiologist has put it:

“An elementary but essential principle that epidemiologists must keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between exposure and disease.”

* * *

“In a courtroom, experts are asked to opine whether the disease of a given patient has been caused by a specific exposure.  This approach of assigning causation in a single person is radically different from the epidemiologic approach, which does not attempt to attribute causation in any individual instance.  Rather, the epidemiologic approach is to evaluate the proposition that the exposure is a cause of the disease in a theoretical sense, rather than in a specific person.”

Kenneth Rothman, Epidemiology: An Introduction 44 (Oxford 2002)(emphasis added).

In addressing the admissibility of Dr. DePace’s expert opinion, however, the MDL court was led astray by Dr. DePace’s hand waving about having considered and “ruled out” Mr. Burford’s other risk factors. 

To be sure, Dr. DePace has some ideas about how Avandia may, plausibly, cause heart attacks.  In particular, Dr. DePace identified three plausible mechanisms, each of which would have been accompanied by some biomarker (elevated blood lipids, elevated Lp-PLA2, or hypoglycemia).  This witness, however, could not opine that any of these mechanisms was in operation in producing Mr. Burford’s fatal cardiac event.  Burford, at *3.

Undaunted, Dr. DePace opined that he had ruled out Mr. Burford’s other risk factors, but his opinion, even as recounted in Judge Rufe’s narrative, is clearly hand waving and dissembling.  First, everyone, including every middle-aged man, has a risk of heart attack or cardiac arrest, although that risk may be modified – increased or lowered – by risk or preventive factors.  Mr. Burford had severe diabetes, which, in and of itself, is a risk factor commonly recognized to equal in size the risk from having had a previous heart attack.  So Mr. Burford was not at baseline risk; indeed, he started all his diabetes medications with the equivalent risk of someone who had already had a heart attack.

Dr. DePace apparently opined that Mr. Burford’s diabetes, his blood sugar level, was well controlled.  The court accepted this contention at face value, although the reader of the court’s opinion will know that it is rubbish.  Although the court does not recite any blood sugar levels, its narrative of facts includes the following course of medications for Mr. Burford:

  • June 2004, diagnosed with type II diabetes, and treated with metformin
  • April 2005, dose of metformin doubled
  • August 2005, Avandia added to double dose of metformin
  • December 2005, Avandia dose doubled as well
  • June 2006, metformin dose doubled again
  • October or November 2006, sulfonylurea added to Avandia and metformin

This narrative hardly suggests good control.  Mr. Burford was on a downward spiral of disease, which in a little over two years took him from diagnosis to three medications to try to control his diabetes.  Despite adding Avandia to metformin, doubling the dose of Avandia, and doubling and then quadrupling the dose of metformin, Mr. Burford still required yet another, third medication to achieve glycemic control.  Of course, an expert witness can say anything, but the federal district court is supposed to act as a gatekeeper, to protect juries and parties from expert witnesses’ ipse dixit.  Many opinions will be difficult to evaluate, but here, Dr. DePace’s opinion about glycemic control in Mr. Burford comes with a banner headline, which shouts “bogus.”

The addition of a third medication, a sulfonylurea, known to cause hypoglycemia (dangerously low blood sugar), which in turn can cause cardiac events and myocardial infarction, is particularly troubling.  See “Sulfonylurea,” Wikipedia (Jan. 24, 2011).  Sulfonylureas act by stimulating the pancreas to produce more insulin, and the sudden addition of this medication to an already aggressive regimen of medication clearly had the ability to induce hypoglycemia in Mr. Burford.  Dr. DePace notes that there is no evidence of a hypoglycemic event, which is often true in diabetic patients who experience a sudden death, but the gatekeeping court should have noticed that Dr. DePace’s lack of evidence did not equate to evidence that the risk or actual causal role of hypoglycemia was lacking.  Again, the trial court appeared to be snookered by an expert witness’s hand waving.  Surely gatekeepers must be made of sterner stuff.

Perhaps most wrongheaded is the MDL court’s handling of, or failure to handle, the equation of risk with causation in Dr. DePace’s testimony.

In his deposition, Dr. DePace testified that a heart attack in a 49-year-old man was “very unusual.”  Such a qualitative opinion does not help the finder of fact.  A heart attack is more likely in any 49-year-old man than in any 21-year-old man, although men of both ages can and do suffer heart attacks.  Clearly, a heart attack is more likely in a 49-year-old man who has had diabetes, which has required intensive medication for even a semblance of control, than in a 49-year-old man who has never had diabetes.  Dr. DePace’s opinions fail to show that Mr. Burford had no baseline risk in the absence of one particular medication, or that this baseline risk was not itself operating to produce his alleged heart attack. 

Rather than being in a high-risk group with respect to his Avandia use, according to the FDA’s 2007 meta-analysis, Mr. Burford and other patients on “triple therapy” (Avandia + metformin + sulfonylurea) would have had an odds ratio of 1.1 for any myocardial ischemic event, not statistically significant, as a result of their Avandia use.  Mr. Burford’s additional use of an ACE inhibitor, along with his three diabetes medications, would place him into yet another sub-subgroup.  Whatever modification or interaction this additional medication created in combination with Avandia, the confidence intervals, already wide for the odds ratio of 1.1, would become extremely wide, allowing no meaningful inference.  In any event, the court in Burford does not tell us what the risk was opined to be, or whether there were good data and facts to support such an opinion.  Remarkably absent from the court’s opinion in Burford is any consideration of the actual magnitude of the claimed risk (in terms of a hazard ratio, relative risk, odds ratio, risk difference, etc.) for patients like Mr. Burford.  Also absent is any consideration of whether any study showing a risk has shown that risk to be statistically different from 1.0 (no increased risk at all). 
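To see why such sub-subgroup estimates carry so little information, here is a minimal sketch in Python, using purely invented counts (nothing below reproduces the FDA’s actual data): the same Wald-type calculation that yields a tolerably narrow confidence interval for a pooled odds ratio near 1.1 produces an interval running from roughly 0.4 to nearly 3 once the analysis is confined to a small stratum.

    # Hypothetical 2x2 counts only; illustrates how confidence intervals widen
    # as subgroup analyses slice the data more finely.
    import numpy as np
    from scipy.stats import norm

    def odds_ratio_ci(a, b, c, d, alpha=0.05):
        """Wald confidence interval for an odds ratio from a 2x2 table:
        a = exposed cases, b = exposed non-cases,
        c = unexposed cases, d = unexposed non-cases."""
        odds_ratio = (a * d) / (b * c)
        se_log = np.sqrt(1/a + 1/b + 1/c + 1/d)
        z = norm.ppf(1 - alpha / 2)
        return odds_ratio, odds_ratio * np.exp(-z * se_log), odds_ratio * np.exp(z * se_log)

    # Pooled data (invented): many events, a comparatively tight interval
    print(odds_ratio_ci(a=86, b=7000, c=72, d=7000))   # OR ~1.19, 95% CI ~(0.87, 1.64)

    # A sparse sub-subgroup (invented): few events, and the interval balloons
    print(odds_ratio_ci(a=9, b=700, c=8, d=700))       # OR ~1.13, 95% CI ~(0.43, 2.93)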

As Ted Frank has noted on PointofLaw Forum, the Avandia MDL raises serious questions about the allocation of technical multi-district litigation cases to judges in the federal system.  “It is hard to escape the conclusion that the MDL denied GSK intellectual due process of law” (January 21, 2011).  The Avandia experience also raises questions about the efficacy of the Federal Judicial Center’s program to train judges in the basic analytical, statistical, and scientific disciplines needed in their gatekeeping capacity. 

Although the Avandia MDL court’s assessment that Dr. DePace’s opinion was suboptimal, Burford at * 4, may translate into GSK’s ability to win before a jury, the point of Rule 702 is that a party should not have to stand trial on such shoddy evidence.

Power in the Courts — Part Two

January 21st, 2011

Post hoc calculations of power were once in vogue, but they are now routinely condemned by biostatisticians and epidemiologists for studies that report confidence intervals around estimates of associations, or “effect sizes.”  Power calculations require an alternative hypothesis against which to measure the rejection of the null hypothesis, and the choice of the alternative is subjective and often arbitrary.  Furthermore, the power calculation must make assumptions about the anticipated variance of the data to be obtained.  Once the data are in fact obtained, those assumptions may be shown to be wrong.  In other words, sometimes the investigators are “lucky,” and their data are less variable than anticipated.  The variance of the data actually obtained, rather than hypothesized, can best be appreciated from the confidence interval around the actually measured point estimate of risk.
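To make the contrast concrete, here is a minimal sketch in Python, with invented numbers (none drawn from the Avandia trials): the planning-stage power figure rests entirely on an assumed control-arm event rate and an assumed effect size, while the confidence interval computed afterward rests only on the counts actually observed.

    # Planning-stage power versus after-the-fact precision; all numbers invented.
    import numpy as np
    from scipy.stats import norm

    def planned_power(p0, rr, n_per_arm, alpha=0.05):
        """Approximate power of a two-arm trial to detect relative risk rr,
        given an assumed control-arm event rate p0 (normal approximation)."""
        p1 = p0 * rr
        se = np.sqrt(p0 * (1 - p0) / n_per_arm + p1 * (1 - p1) / n_per_arm)
        return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - (p1 - p0) / se)

    def risk_ratio_ci(a, n1, b, n0, alpha=0.05):
        """Wald confidence interval for a risk ratio from observed counts:
        a events among n1 treated patients, b events among n0 controls."""
        rr = (a / n1) / (b / n0)
        se_log = np.sqrt(1/a - 1/n1 + 1/b - 1/n0)
        z = norm.ppf(1 - alpha / 2)
        return rr, rr * np.exp(-z * se_log), rr * np.exp(z * se_log)

    # Before the trial: the power figure depends on conjectured rates and effects
    print(planned_power(p0=0.03, rr=1.4, n_per_arm=2200))

    # After the trial: fewer events occurred than assumed, and the confidence
    # interval, not the stale power figure, shows what the data can rule out
    print(risk_ratio_ci(a=40, n1=2200, b=38, n0=2200))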

In Part One of “Power in the Courts,” I addressed the misplaced emphasis the Avandia MDL court put upon the concept of statistical power.  The court apparently accepted at face value the plaintiffs’ argument that GSK’s clinical trials were “underpowered,” a claim that was quite misleading.  Power calculations were no doubt done to choose sample sizes for GSK’s clinical trials, but those a priori estimates were based upon assumptions.  In the case of one very large trial, RECORD, many fewer events occurred than anticipated (which is generally a good thing to happen, and not unusual in a clinical trial that gives patients in all arms better healthcare than is available to the general population).  In one sense, plaintiffs’ expert witnesses are correct to say that RECORD was “underpowered,” but once the study is done, the real measure of statistical precision is given by the confidence interval.

Because the Avandia MDL is not the only litigation in which courts and lawyers have mistakenly urged power concepts for studies that have already been completed, I have collected some key statements that reflect the general consensus and reasoning against what the Court did.

To be fair, the Avandia court did not fault the defense for not having analyzed and calculated the post hoc power of the clinical trials, all of which failed to find statistically significant associations between Avandia and heart attacks.  The court, however, did appear to embrace the plaintiffs’ rhetoric that all the Avandia trials were underpowered, without any consideration given to the width and the upper bounds of the confidence intervals around those trials’ estimates of risk ratios for heart attack.  Remarkably, the Avandia court did not present any confidence intervals for any estimates of effect size, although it did present p-values, which it then badly misinterpreted.  Many of the Avandia trials (and the resulting meta-analyses) confidently ruled out risk ratios, for heart attacks, of 2.0 or greater.  The court’s conclusions about power are thus misleading at best.

Several consensus statements address whether considerations of power, after studies are completed and the data are analyzed, are appropriate.  The issue has also been addressed extensively in textbooks and in articles.  I have collected some of the relevant statements, below.  To the extent that the Federal Judicial Center’s Reference Manual on Scientific Evidence appears to urge post hoc power calculations, I hope that the much anticipated  Third Edition will correct the error.

CONSENSUS STATEMENTS

CONSORT

The CONSORT group (Consolidated Standards of Reporting Trials) is a world-wide group that sets quality standards for the reporting of randomized trials of pharmaceuticals.  CONSORT’s lead author is Douglas Altman, a well-respected biostatistician from Oxford University.  The advice of the CONSORT group is clear:

“There is little merit in calculating the statistical power once the results of the trial are known, the power is then appropriately indicated by confidence intervals.”

Douglas Altman, et al., “The Revised CONSORT Statement for Reporting Randomized Trials:  Explanation and Elaboration,” 134 Ann. Intern. Med. 663, 670 (2001).  See also Douglas Altman, et al., “Reporting power calculations is important,” 325 Br. Med. J. 1304 (2002).

STROBE

An effort similar to the CONSORT group has been put together by investigators interested in observational studies, the STROBE group (the Strengthening the Reporting of Observational Studies in Epidemiology).  The STROBE group was made up of leading epidemiologists and biostatisticians, who addressed persistent issues and errors in the reporting of observational studies.  Their advice was equally unequivocal on the issue of post hoc power considerations:

“Do not bother readers with post hoc justifications for study size or retrospective power calculations. From the point of view of the reader, confidence intervals indicate the statistical precision that was ultimately obtained. It should be realized that confidence intervals reflect statistical uncertainty only, and not all uncertainty that may be present in a study (see item 20).”

Vandenbroucke, et al., “Strengthening the reporting of observational studies in epidemiology (STROBE):  Explanation and elaboration,” 18 Epidemiology 805, 815 (2007) (Section 10, sample size).

American Psychological Association

In 1999, a committee of the American Psychological Association met to discuss various statistical issues in psychological research papers.  With respect to power analysis, the committee concluded:

“Once the study is analyzed, confidence intervals replace calculated power in describing the results.”

Wilkinson, Task Force on Statistical Inference, “Statistical methods in psychology journals:  guidelines and explanations,” 54 Am. Psychol. 594-604 (1999)

TEXTBOOKS

Modern Epidemiology

Kenneth Rothman and Sander Greenland are known for many contributions, not the least of which is their textbook on epidemiology.  In the second edition of Modern Epidemiology, the authors explain how and why confidence intervals replace power considerations, once the study is completed and the data are analyzed:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis.  The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect.  * * *  In planning a study, it is reasonable to make conjectures about the magnitude of an effect in order to compute sample-size requirements or power.

In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with sample-size or power calculations (Smith & Bates 1992; Goodman & Berlin 1994). * * * Confidence limits convey much more of the essential information by indicating a range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level).  They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman & Sander Greenland, Modern Epidemiology 192-93 (2d ed. 1998)

And in 2008, with the addition of Timothy Lash as a co-author, Modern Epidemiology continued its guidance on power as only a pre-study consideration:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis. The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect. In planning a study, it is reasonable to make conjectures about the magnitude of an effect to compute study-size requirements or power. In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with study-size or power calculations (Smith and Bates, 1992; Goodman and Berlin, 1994; Hoenig and Heisey, 2001). Confidence limits and (even more so) P-value functions convey much more of the essential information by indicating the range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level), assuming the statistical model is correct. They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman, Sander Greenland, and Timothy Lash, Modern Epidemiology 160 (3d ed. 2008)

A Short Introduction to Epidemiology

Neil Pearce, an epidemiologist, citing Smith & Bates (1992) and Goodman & Berlin (1994), infra, describes the standard method:

“Once a study has been completed, there is little value in retrospectively performing power calculations since the confidence limits of the observed measure of effect provide the best indication of the range of likely value for true association.”

Neil Pearce, A Short Introduction to Epidemiology (2d ed. 2005)

Statistics at Square One

The British Medical Journal publishes a book, Statistics at Square One, which addresses the issue of post hoc power:

“The concept of power is really only relevant when a study is being planned.  After a study has been completed, we wish to make statements not about hypotheses but about the data, and the way to do this is with estimates and confidence intervals.”

T. Swinscow, Statistics at Square One 42 (9th ed. London 1996) (citing a book by Martin Gardner and Douglas Altman, both highly accomplished biostatisticians).

How to Report Statistics in Medicine

Two authors from the Cleveland Clinic, writing in a guidebook published by the American College of Physicians, explain:

“Until recently, authors were urged to provide ‘post hoc power calculations’ for non-significant differences.  That is, if the results of the study were negative, a power calculation was to be performed after the fact to determine the adequacy of the sample size.  Confidence intervals also reflect sample size, however, and are more easily interpreted, so the requirement of a post hoc power calculation for non-statistically significant results has given way to reporting the confidence interval (32).”

Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 58 (2d ed. 2006)(citing to Goodman & Berlin, infra).  See also Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 78 (1st ed. 1996)

Clinical Epidemiology:  The Essentials

The Fletchers, both respected clinical epidemiologists, describe standard method and practice:

Statistical Power Before and After a Study is Done

“Calculation of statistical power based on the hypothesis testing approach is done by the researchers before a study is undertaken to ensure that enough patients will be entered to have a good chance of detecting a clinically meaningful effect if it is present.  However, after the study is completed this approach is no longer relevant.  There is no need to estimate effect size, outcome event rates, and variability among patients; they are now known.

Therefore, for researchers who report the results of clinical research and readers who try to understand their meaning, the confidence interval approach is more relevant.  One’s attention should shift from statistical power for a somewhat arbitrarily chosen effect size, which may be relevant in the planning stage, to the actual effect size observed in the study and the statistical precision of that estimate of the true value.”

R. Fletcher, et al., Clinical Epidemiology: The Essentials at 200 (3d ed. 1996)

The Planning of Experiments

Sir David Cox is one of the leading statisticians in the world.  In his classic 1958 text, The Planning of Experiments, Sir David wrote:

“Power is important in choosing between alternative methods of analyzing data and in deciding on an appropriate size of experiment.  It is quite irrelevant in the actual analysis of data.”

David Cox, The Planning of Experiments 161 (1958)

ARTICLES

Cummings & Rivara (2003)

“Reporting of power calculations makes little sense once the study has been done.  We think that reviewers who request such calculations are misguided.”

* * *

“Point estimates and confidence intervals tell us more than any power calculations about the range of results that are compatible with the data.”

Cummings & Rivara, “Reporting statistical information in medical journal articles,” 157 Arch. Pediatric Adolesc. Med. 321, 322 (2003)

Senn (2002)

“Power is of no relevance in interpreting a completed study.”

* * *

“The definition of a medical statistician is one who will not accept that Columbus discovered America because he said he was looking for India in the trial plan.  Columbus made an error in his power calculation – he relied on an estimate of the size of the Earth that was too small – but he made one none the less, and it turned out to have very fruitful consequences.”

Senn, “Power is indeed irrelevant in interpreting completed studies,” 325 Br. Med. J. 1304 (2002).

Hoenig & Heisey (2001)

“Once we have constructed a C.I., power calculations yield no additional insight.  It is pointless to perform power calculations for hypotheses outside of the C.I. because the data have already told us that these are unlikely values.”  p. 22a

Hoenig & Heisey, “The Abuse of Power:  The Pervasive Fallacy of Power Calculations for Data Analysis,” The American Statistician (2001)

Zumbo & Hubley (1998)

In The Statistician, published by the Royal Statistical Society, these authors roundly condemn post hoc power calculations:

“We suggest that it is nonsensical to make power calculations after a study has been conducted and a statistical decision has been made.  Instead, the focus after a study has been conducted should be on effect size . . . .”

Zumbo & Hubley, “A note on misconceptions concerning prospective and retrospective power,” 47-2 The Statistician 385 (1998)

Goodman & Berlin (1994)

Professor Steven Goodman is a professor of epidemiology at Johns Hopkins University, and the statistical editor for the Annals of Internal Medicine.  Interestingly, Professor Goodman appeared as an expert witness, opposite Sander Greenland, in hearings on Thimerosal.  His article, with Jesse Berlin, has been frequently cited in support of the irrelevance of post hoc power considerations:

“Power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant.”

(p. 200a, ¶1)

“Studies with low statistical power have sample sizes that are too small, producing results that have high statistical variability (low precision).  Confidence intervals are a convenient way to express that variability.”

(p. 200a, ¶2)

“Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected . . . .”

(p. 200 b, top)

“Power is exclusively a pretrial concept; it is the probability of a group of possible results (namely all statistically significant outcomes) under a specified alternative hypothesis.  A study produces only one result.”

(p. 201a, ¶2)

“The perspective after the experiment differs from that before that experiment simply because the result is known.  That may seem obvious, but what is less apparent is that we cannot cross back over the divide and use pre-experiment numbers to interpret the result.  That would be like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit a lottery jackpot (the after-experiment perspective).”

(p. 201a-b)

“For interpretation of observed results, the concept of power has no place, and confidence intervals, likelihood, or Bayesian methods should be used instead.”

(p. 205)

Goodman & Berlin, “The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results,” 121 Ann. Intern. Med. 200, 200, 201, 205 (1994).

Smith & Bates (1992)

This article was published in the journal, Epidemiology, which was founded and edited by Professor Kenneth Rothman:

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.”

“Generally, a negative study with low power will be regarded as providing little evidence against the existence of a causal association.  Often overlooked, however, is that otherwise well-conducted studies of low power can be informative:  the upper bound of the (1 – α)% confidence intervals provides a limit on the likely magnitude of any actual effect.

“The purpose of this paper is to extend this argument to show that the use of traditional power calculations in causal inference (that is, after a study has been carried out) can be misleading and inferior to the use of upper confidence limits of estimates of effect.  The replacement of post-study power calculations with confidence interval estimates is not a new idea.”

(p. 449a)

* * *

“It is clear, then, that the use of the upper confidence limit conveys considerable information for the purposes of causal inference; by contrast, the power calculation can be quite misleading.”

(p. 451b)

* * *

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.  As we have demonstrated, they have little, if any, value.  We propose that, in their place, (1 – α)%  upper confidence limits be calculated.”

(p. 451b)

Smith & Bates, “Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies,” 3 Epidemiology 449-52 (1992)

Greenland (1988)

“the arbitrariness of power specification is of course absent once the data are collected, since the statistical power refers to the probability of obtaining a particular type of data.  It is thus not a property of particular data sets.  Statistical power of collected data, as the probability of heads on a coin toss that has already taken place, can, at best, meaningfully refer only to one’s ignorance of the result and loses all meaning when one examines the result.”

Greenland, “On Sample Size and Power Calculations for Studies Using Confidence Limits,” Am. J. Epidem. 236 (1988)

Simon (1986)

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

***

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

Rothman (1986)

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986)

Makuch & Johnson (1986)

“[the] confidence interval approach, the method we recommend for interpreting completed trials in order to judge the range of true treatment differences that is reasonably consistent with the observed data.”

Robert W. Makuch & Mary F. Johnson, “Some Issues in the Design and Interpretation of ‘Negative’ Clinical Studies,” 146 Arch. Intern. Med. 986, 986 (1986).

Detsky & Sackett (1985)

“Negative clinical trials that conclude that neither of the treatments is superior are often criticized for having enrolled too few patients.  These criticisms usually are based on formal sample size calculations that compute the number of patients required prospectively, as if the trial had not yet been carried out.  We suggest that this ‘prospective’ sample size calculation is incorrect, for once the trial is over we have ‘hard’ data from which to estimate the actual size of the treatment effect.  We can either generate confidence limits around the observed treatment effect or retrospectively compare it with the effect hypothesized before the trial.”

Detsky & Sackett, “When was a ‘negative’ clinical trial big enough?  How many patients you need depends on what you found,” 145 Arch. Intern. Med. 709 (1985).

Power in the Courts — Part One

January 18th, 2011

The Avandia MDL court, in its recent decision to permit plaintiffs’ expert witnesses to testify about general causation, placed substantial emphasis on the statistical concept of power.  Plaintiffs’ key claim is that Avandia causes heart attacks, yet no clinical trial of the oral anti-diabetic medication found a statistically significant increased risk of heart attacks.  Plaintiffs’ expert witnesses argued that all the clinical trials of Avandia were “underpowered,” and thus that the failure to find an increased risk was a Type II (false-negative) error resulting from the small size of the clinical trials:

“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs upon which GSK relies are all underpowered to study cardiac risks.”

In re Avandia Marketing, Sales Practices, and Products Liab. Litig., MDL 1871, Mem. Op. and Order (E.D.Pa. Jan. 3, 2011)(emphasis in original).

The true effect, according to plaintiffs’ expert witnesses, could be seen only through aggregating the data, across clinical trials, in a meta-analysis.  The proper conduct, reporting, and interpretation of meta-analyses were thus crucial issues for the Avandia MDL court, which appeared to have difficulty with statistical concepts.  The court’s difficulty, however, may have had several sources beyond misleading plaintiffs’ expert witness testimony, and the defense’s decision not to call an expert in biostatistics and meta-analysis at the Rule 702 hearing.

Another source of confusion about statistical power may well have come from the very reference work designed to help judges address statistical and scientific evidence in their judicial capacities:  The Reference Manual on Scientific Evidence.

Statistical power is discussed in both the statistics chapter and the epidemiology chapter of The Reference Manual on Scientific Evidence.  The chapter on epidemiology, however, provides misleading guidance on the use of power:

“When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent’s toxicity or is essentially inconclusive with regard to toxicity. The concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.79  The power of a study expresses the probability of finding a statistically significant association of a given magnitude (if it exists) in light of the sample sizes used in the study. The power of a study depends on several factors: the sample size; the level of alpha, or statistical significance, specified; the background incidence of disease; and the specified relative risk that the researcher would like to detect.80 Power curves can be constructed that show the likelihood of finding any given relative risk in light of these factors. Often power curves are used in the design of a study to determine what size the study populations should be.81”

Michael D. Green, D. Michael Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000).  See also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000)

This guidance is misleading in the context of epidemiologic studies because power curves are rarely used any more to assess completed studies.  Power calculations are, of course, used to help determine sample size for a planned study.  After the data are collected, however, the appropriate method to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of risk size.

The authors of the chapter on epidemiology cite a general review paper, id. at 362 n.79, which does indeed address the concept of statistical power, but the lead author, a well-known statistician, addresses the issue primarily in the context of planning a statistical analysis, and of discrimination litigation, where the test result will be expressed in a p-value, without a measure of “effect size,” and, more important, without a “confidence interval” around the estimate of effect size:

“The chance of rejecting the false null hypothesis, under the assumptions of an alternative, is called the power of the test. Simply put, among many ways in which we can test a null hypothesis, we want to select a test that has a large power to correctly distinguish between two alternatives. Generally speaking, the power of a test increases with the size of the sample, and tests have greater power, and therefore perform better, the more extreme the alternative considered becomes.

Often, however, attention is focused on the first type of error and the level of significance. If the evidence, then, is not statistically significant, it may be because the null hypothesis is true or because our test did not have sufficient power to discern a difference between the null hypothesis and an alternative explanation. In employment discrimination cases, for example, separate tests for small samples of employees may not yield statistically significant results because each test may not have the ability to discern the null hypothesis of nondiscriminatory employment from illegal patterns of discrimination that are not extreme. On the other hand, a test may be so powerful, for example, when the sample size is very large, that the null hypothesis may be rejected in favor of an alternative explanation that is substantively of very little difference.  ***

Attention must be paid to both types of errors and the risks of each, the level of significance, and the power. The trier of fact can better interpret the result of a significance test if he or she knows how powerful the test is to discern alternatives. If the power is too low against alternative explanations that are illegal practices, then the test may fail to achieve statistical significance even though the illegal practices may be operating. If the power is very large against a substantively small and legally permissible difference from the null hypothesis, then the test may achieve statistical significance even though the employment practices are legal.”

Stephen E. Fienberg,  Samuel H. Krislov, and Miron L. Straf, “Understanding and Evaluating Statistical Evidence in Litigation,” 36 Jurimetrics J. 1, 22-23 (1995).

Professor Fienberg’s characterization is accurate, but his description of “post hoc” assessment of power was not offered for the context of epidemiologic studies, which today virtually always report confidence intervals around the studies’ estimates of effect size.  These confidence intervals allow a concerned reader to evaluate what can reasonably be ruled out by the data in a given study.  Post hoc power calculations or considerations fail to provide meaningful information because they require a specified alternative hypothesis.  A wily plaintiff’s expert witness can always arbitrarily select a sufficiently low alternative hypothesis, say a relative risk of 1.01, such that any study would have a vanishingly small probability of correctly distinguishing the null and alternative hypotheses.
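The point can be illustrated with a short, admittedly hypothetical Python sketch (the trial size and background event rate are invented): holding the study fixed, the post hoc “power” figure is entirely a function of the alternative the analyst chooses, and against a relative risk of 1.01 even a very large trial has power barely above the significance level itself.

    # Post hoc power depends entirely on the chosen alternative; numbers invented.
    import numpy as np
    from scipy.stats import norm

    def power_against(rr, p0=0.02, n_per_arm=10_000, alpha=0.05):
        """Normal-approximation power of a fixed two-arm study against a
        hypothesized relative risk rr."""
        p1 = p0 * rr
        se = np.sqrt(p0 * (1 - p0) / n_per_arm + p1 * (1 - p1) / n_per_arm)
        return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - (p1 - p0) / se)

    for rr in (2.0, 1.5, 1.2, 1.05, 1.01):
        print(f"alternative RR = {rr}: power ~ {power_against(rr):.2f}")
    # The same study is "well powered" against RR = 2.0 and hopelessly
    # "underpowered" against RR = 1.01; only the analyst's choice changed.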

The Reference Manual is now undergoing a revision, for an anticipated third edition.  A saner appreciation of the concept of power as it is used in epidemiologic studies and clinical trials would be helpful to courts and to lawyers who litigate cases involving this kind of statistical evidence.

Learning to Embrace Flawed Evidence – The Avandia MDL’s Daubert Opinion

January 10th, 2011

If GlaxoSmithKline (GSK) did not have bad luck when it comes to its oral anti-diabetic medication Avandia, it would have no luck at all.

On January 4, 2011, the federal judge who oversees the Avandia multi-district litigation (MDL) in Philadelphia entered an order denying GSK’s motion to exclude the causation opinion testimony of plaintiffs’ expert witnesses.  In re Avandia Marketing, Sales Practices, and Products Liab. Litig., MDL 1871, Mem. Op. and Order (E.D. Pa. Jan. 3, 2011) (Rufe, J.) [cited as “Op.”].  The decision is available on the CBS Interactive Business Network news blog, BNET.

Based largely upon a meta-analysis of randomized clinical trials (RCTs) by Dr. Steven Nissen and Ms. Kathleen Wolski, plaintiffs’ witnesses opined that Avandia (rosiglitazone) causes heart attacks and strokes.  Because meta-analysis has received so little serious judicial attention in connection with Rule 702 or 703 motions, this opinion by the Hon. Cynthia Rufe deserves careful attention by all students of “Daubert” law.  Unfortunately, that attention is likely to be critical — Judge Rufe’s opinion fails to engage the law and facts of the case, while committing serious mistakes on both fronts.

The Law

The reader will know that things are not going well for a sound legal analysis when the trial court begins by misstating the controlling law for decision:

“Under the Third Circuit framework, the focus of the Court’s inquiry must be on the experts’ methods, not their conclusions. Therefore, the fact that Plaintiffs’ experts and defendants’ experts reach different conclusions does not factor into the Court’s assessment of the reliability of their methods.”

Op. at 2 (internal citation omitted).

and

“As noted, the experts are not required to use the best possible methods, but rather are required to use scientifically reliable methods.”

Op. at 26.

Although the United States Supreme Court attempted, in Daubert, to draw a distinction between the reliability of an expert witness’s methodology and conclusion, that Court soon realized that the distinction is flawed.  If an expert witness’s proffered testimony is discordant with regulatory and scientific conclusions, a reasonable, disinterested scientist would be led to question the reliability of the testimony’s methodology and of its inferences from facts and data to its conclusion.  The Supreme Court recognized this connection in General Electric v. Joiner, and the connection between methodology and conclusions was ultimately incorporated into a statute, the revised Federal Rule of Evidence 702:

“[I]f scientific, technical or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training or education, may testify thereto in the form of an opinion or otherwise, if

  1. the testimony is based upon sufficient facts or data,
  2. the testimony is the product of reliable principles and methods; and
  3. the witness has applied the principles and methods reliably to the facts of the case.”

The Avandia MDL court thus ignored the clear mandate of a statute, Rule 702(1), and applied an unspecified “Third Circuit” framework, which is legally invalid to the extent it departs from the statute.

The Avandia court’s ruling, however, goes beyond this clear error in applying the wrong law.  Judge Rufe notes that:

“The experts must use good grounds to reach their conclusions, but not necessarily the best grounds or unflawed methods.”

Op. at 2-3 (internal citations omitted).

Here the trial court’s double negative is confusing.  The court clearly suggests that plaintiffs’ experts must use “good grounds,” but that their methods can be flawed and still survive challenge.  We can certainly hope that the trial court did not intend to depart so far from the statute, scientific method, and common sense, but the court’s own language suggests that it abused its discretion in applying a clearly incorrect standard.

Misstatements of Fact

The apparent errors of the Avandia decision transcend mistaken legal standards, and go to key facts of the case.  Some errors perhaps show inadvertence or inattention, for instance, when the court states that the RECORD trial, an RCT conducted by GSK, set out “specifically to compare the cardiovascular safety of Avandia to that of Actos (a competitor medication in the same class).”  Op. at 4.  In fact, Actos (or pioglitazone) was not involved in the RECORD trial, which involved Avandia, along with two other oral anti-diabetic medications, metformin and sulfonylurea. 

Erroneous Reliance upon p-values to the exclusion of Confidence Intervals

Other misstatements of fact, however, suggest that the trial court did not understand the scientific evidence in the case.  By way of example, the trial court erroneously over-emphasized p-values, and ignored the important interpretative value of the corresponding confidence intervals.  For example, we are told that “[t]he NISSEN meta-analysis combined 42 clinical trials, including the RECORD trial and other RCTs, and found that Avandia increased the risk of myocardial infarction by 43%, a statistically significant result (p = .031).”  Op. at 5.  Ignoring for the moment that the cited meta-analysis did not include the RECORD RCT, the Court should have reported the p-value along with the corresponding two-sided 95% confidence interval:

“the odds ratio for myocardial infarction was 1.43 (95% confidence interval [CI], 1.03 to 1.98; P = 0.03).”

Steven E. Nissen, M.D., and Kathy Wolski, M.P.H., “Effect of Rosiglitazone on the Risk of Myocardial Infarction and Death from Cardiovascular Causes,” 356 New Engl. J. Med. 2457, 2457 (2007).
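The reported interval and p-value are two views of the same computation.  A small Python check, using only the numbers quoted above from the published paper, backs the standard error out of the 95% interval on the log-odds scale and recovers a two-sided p-value of about 0.03; it also makes plain how close the interval’s lower bound of 1.03 sits to 1.0, that is, to no association at all.

    # Recover the two-sided p-value implied by the published 95% CI (1.03 to 1.98)
    # around the odds ratio of 1.43; the arithmetic is done on the log scale.
    import numpy as np
    from scipy.stats import norm

    odds_ratio, lower, upper = 1.43, 1.03, 1.98
    se_log = (np.log(upper) - np.log(lower)) / (2 * norm.ppf(0.975))
    z = np.log(odds_ratio) / se_log
    p_two_sided = 2 * (1 - norm.cdf(abs(z)))
    print(round(p_two_sided, 3))   # ~0.03, consistent with the reported P = 0.03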

The Court repeats this error later in its opinion:

“In 2007, the New England Journal of Medicine published the NISSEN meta-analysis, which combined results from 42 double-blind RCTs and found that patients taking Avandia had a statistically significant 43% increase in myocardial ischemic events. NISSEN used all publicly available data from double-blind RCTs of Avandia in which cardiovascular disease events were recorded, thereby eliminating one major drawback of meta-analysis: the biased selection of studies.”

Op. at 17.  The second time around, however, the Court introduced new factual errors.  The Court erred in suggesting that Nissen used all publicly available data.  There were, in fact, studies available to Nissen and to the public, which met Nissen’s inclusion criteria, but which he failed to include in his meta-analysis.  Nissen’s meta-analysis was thus biased by its failure to conduct a complete, thorough review of the medical literature for qualifying RCTs.  Furthermore, contrary to the Court’s statement, Nissen included non-double-blinded RCTs, as his own published paper makes clear.

Erroneous Interpretation of p-values

The court erred in its interpretation of p-values:

 “The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance). “

Op. at 25 (internal citation omitted).  The p-value is, of course, the probability that results as large as, or larger than, those observed would have occurred, given the truth of the null hypothesis that there is no difference between Avandia and its comparator medications.  The p-value does not permit a probabilistic assessment of the correctness of the null hypothesis; nor does it permit a straightforward probabilistic assessment of the correctness of the alternative hypothesis.

See Federal Judicial Center, Reference Manual on Scientific Evidence 122, 357 (2d ed. 2000).
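A small simulation, offered only as an illustrative sketch with invented trial sizes, shows why the court’s paraphrase misstates the concept: when the null hypothesis is true by construction, and the two arms share an identical event rate, trials will still produce p-values of 0.08 or smaller roughly 8% of the time.  The p-value is computed on the assumption that the null hypothesis is true; it does not state the probability that the observed difference was, or was not, the product of chance.

    # Simulated trials in which the null hypothesis is true by construction;
    # about 8% of them nonetheless yield p <= 0.08.  All numbers are invented.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, p_true, sims = 2500, 0.03, 20_000
    small_p = 0
    for _ in range(sims):
        a = rng.binomial(n, p_true)          # events in the "drug" arm
        b = rng.binomial(n, p_true)          # events in the "comparator" arm
        pooled = (a + b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = ((a - b) / n) / se if se > 0 else 0.0
        p_val = 2 * (1 - norm.cdf(abs(z)))
        small_p += p_val <= 0.08
    print(small_p / sims)                    # roughly 0.08, despite no true effect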

Hand Waving over Statin Use

The Court appeared to have been confused by plaintiffs’ rhetoric that statin use masked a real risk of heart attacks in the Avandia RCTs. 

“It is not clear whether statin use was allowed in the DREAM study.”

Op. at 25.  The problem is that the Court fails to point to any evidence that the use of statins differed between the Avandia and comparator arms of the RCTs.  Statins have been one of the great pharmaceutical success stories of the last 15 years, and it is reasonable to believe that today most diabetic patients (who often have high blood fats) would be taking statins.  At the time of the DREAM study, the prevalence of statin use would have been lower than it is today, but there was no evidence mentioned that the use differed between the Avandia and other arms of the DREAM trial.

Errors in Interpreting RCTs by Intention to Treat Analyses

For unexplained reasons, the court was impressed by what it called a high dropout rate in one of the larger Avandia RCTs:

“The ADOPT study was marred by a very high dropout rate (more than 40% of the subjects did not complete the four year follow up) and the use of statins during the trial.”

Op. at 25.  Talk about being hoisted with one’s own petard!  The high dropout rate in ADOPT resulted from the fact that this RCT was a long-term test of “glycemic control.”  Avandia did better with respect to durable glycemic control than two major, accepted medications, metformin and sulfonylurea, and thus the dropouts came mostly in the comparator arms as patients not taking Avandia required more and stronger medications, or even injected insulin.  The study investigators were obligated to analyze their data in accord with “intention to treat” principles, and so patients removed from the trial due to lack of glycemic control could no longer be counted with respect to any outcome of interest.  Avandia patients thus had longer follow-up time, and more opportunity to have events due to their underlying pathologic physiology (diabetes and diabetic-related heart attacks).
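The arithmetic behind that point can be sketched with invented numbers (nothing below comes from ADOPT itself): if both arms suffer events at exactly the same rate per person-year, the arm whose patients remain in the trial longer will accumulate more raw events, even though the underlying rates are identical.

    # Differential follow-up inflates raw event counts even at identical rates.
    # All figures are hypothetical.
    rate_per_person_year = 0.015                  # same true rate in both arms
    avandia_person_years = 4000 * 3.5             # less dropout, longer follow-up
    comparator_person_years = 4000 * 2.5          # more dropout, shorter follow-up

    avandia_events = rate_per_person_year * avandia_person_years
    comparator_events = rate_per_person_year * comparator_person_years

    print(avandia_events, comparator_events)      # 210 vs. 150 raw events
    print(avandia_events / avandia_person_years,
          comparator_events / comparator_person_years)   # identical rates: 0.015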

Ignoring Defense Arguments

GSK may have hurt itself by electing not to call an expert witness at the Daubert hearing in this MDL.  Still, the following statement by the Court is hard to square with the opening argument given at the hearing:

“GSK points out no specific flaws or limitations in the design or implementation of the NISSEN meta-analysis”

Op. at 6.  If true, then shame on GSK; but somehow this statement seems too incredible to be true.

Ignoring the Difference between myocardial ischemic events and myocardial infarction (MI)

MI occurs when heart muscle dies as a result of a blockage in a blood vessel that supplies oxygenated blood to the heart.  An ischemic event is defined very broadly in GSK’s study:

“To minimize the possibility of missing events of interest, all events coded with broadly inclusive AE terms captured from investigator reports were reviewed. SAEs identified from the trials database included cardiac failure, angina pectoris, acute pulmonary edema, all cases of chest pain without a clear non-cardiac etiology and myocardial infarction/myocardial ischemia.”

Alexander Cobitz MD, PhD, et al., “A retrospective evaluation of congestive heart failure and myocardial ischemia events in 14 237 patients with type 2 diabetes mellitus enrolled in 42 short-term, double-blind, randomized clinical studies with rosiglitazone,” 17 Pharmacoepidem. & Drug Safety 769, 770 (2008).

In its pooled analysis, GSK was clearly erring on the side of safety in creating its composite end point, but the crucial point is that GSK included events that had nothing to do with MI.  The MDL court appears to have accepted uncritically the plaintiffs’ expert witnesses’ claim that the difference between myocardial ischemic events and MI is only a matter of degree.  The Court found “that the experts were able to draw reliable conclusions about myocardial infarction” from a meta-analysis about a different end point, “by virtue of their expertise and the available data.”  Op. at 10.  This is hand waving or medical alchemy.

Uncritical Acceptance of Mechanistic Evidence Related to Increased Congestive Heart Failure (CHF) in Avandia Users

The court noted that plaintiffs’ expert witnesses relied upon a well-established relationship between Avandia and congestive heart failure (CHF).  Op. at 14.  True, true, but immaterial.  Avandia causes fluid retention, but so do other drugs in this class.  Actos causes fluid retention, and carries the same warning for CHF, but there is no evidence that Actos causes MI or stroke.  Although the Court’s desire to have a mechanism of causation is understandable, that desire cannot substitute for actual evidence.

Misuse of Power Analyses

The Avandia MDL Court mistakenly referred to inadequate statistical power in the context of interpreting data of heart attacks in Avandia RCTs. 

“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs upon which GSK relies are all underpowered to study cardiac risks.”

Op. at 5.

The Court might have helped itself by adverting to the Reference Manual on Scientific Evidence:

“Power is the chance that a statistical test will declare an effect when there is an effect to declare. This chance depends on the size of the effect and the size of the sample.”

Federal Judicial Center, Reference Manual on Scientific Evidence 125-26, 357 (2d ed. 2000) (internal citations omitted).  In other words, you cannot assess the power of a study unless you specify, among other things, the size of the association posited by the alternative hypothesis, and the sample size.  It is true that most of the Avandia trials were not powered to detect heart attacks, but the concept of power requires the user to specify at least the alternative hypothesis against which the study is being assessed for power.  Once the studies were completed, and the data became available, there was no longer any need or use for the consideration of power; the statistical precision of the studies’ results was given by their confidence intervals.

Incorrect Use of the Concept of Replication

The MDL court erred in accepting the plaintiffs’ expert witnesses’ bolstering of Nissen’s meta-analytic results by their claim that Nissen’s results had been “replicated”:

“[T]he NISSEN results have been replicated by other researchers. For example, the SINGH meta-analysis pooled data from four long-term clinical trials, and also found a statistically significant increase in the risk of myocardial infarction for patients taking Avandia. GSK and the FDA have also replicated the results of NISSEN through their own meta-analyses.”

Op. at 6 (internal citations omitted).

“The SINGH, GSK and FDA meta-analyses replicated the key findings of the NISSEN study.43”

Op. at 17.

These statements mistakenly suggest that Nissen’s meta-analysis was able to generate a reliable conclusion that there was a statistically significant association between Avandia use and MI.  The Court’s insistence that Nissen was replicated does not become more true for having been stated twice.  Nissen’s meta-analysis was not an observational study in the usual sense.  His publication made very clear what studies were included (and not at all clear what studies were excluded), and what meta-analytic model he used.  Thus, it is trivially true that anyone could have replicated his analysis, and indeed, several researchers did so.  See, e.g., George A. Diamond, MD, et al., “Uncertain Effects of Rosiglitazone on the Risk for Myocardial Infarction and Cardiovascular Death,” 147 Ann. Intern. Med. 578 (2007).

But Nissen’s results were not replicated by Singh, GSK, or the FDA, because these other meta-analyses used different methods, different endpoints (in GSK’s analysis), different inclusion criteria, different data, and different interpretative methods.  Most important, GSK and FDA could not reproduce the statistically significant finding for their summary estimate of association between Avandia and heart attacks.

One definition of replication that the MDL court might have consulted makes clear that replication is a repeat of the same experiment to determine whether the same (or a consistent) result is obtained:

“REPLICATION — The execution of an experiment or survey more than once so as to confirm the findings, increase precision, and obtain a closer estimation of sampling error.  Exact replication should be distinguished from consistency of results on replication.  Exact replication is often possible in the physical sciences, but in the biological and behavioral sciences, to which epidemiology belongs, consistency of results on replication is often the best that can be attained. Consistency of results on replication is perhaps the most important criterion in judgments of causality.”

Miquel Porta, Sander Greenland, and John M. Last, eds., A Dictionary of Epidemiology 214 (5th ed. 2008).  The meta-analyses of Singh, GSK, and the FDA did not, and could not, replicate Nissen’s.  Singh’s meta-analysis obtained a result similar to Nissen’s, but the other meta-analyses, by GSK, the FDA, and Mannucci, failed to yield a statistically significant result for MI.  This is replication only in Wonderland.

It is hard to escape the conclusion that the MDL denied GSK intellectual due process of law.