The Avandia MDL court, in its recent decision to permit plaintiffs’ expert witnesses to testify about general causation, placed substantial emphasis on the statistical concept of power. Plaintiffs’ key claim is that Avandia causes heart attacks, yet no clinical trial of the oral anti-diabetic medication found a statistically significant increased risk of heart attack. Plaintiffs’ expert witnesses argued that all the clinical trials of Avandia were “underpowered,” and thus that the failure to find an increased risk was a Type II (false-negative) error resulting from the small size of the clinical trials:
“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs upon which GSK relies are all underpowered to study cardiac risks.”
In re Avandia Marketing, Sales Practices, and Products Liab. Litig., MDL No. 1871, Mem. Op. and Order (E.D. Pa. Jan. 3, 2011) (emphasis in original).
The true effect, according to plaintiffs’ expert witnesses, could be seen only by aggregating the data across clinical trials in a meta-analysis. The proper conduct, reporting, and interpretation of meta-analyses were thus crucial issues for the Avandia MDL court, which appeared to have difficulty with statistical concepts. The court’s difficulty, however, may have had several sources beyond misleading testimony from plaintiffs’ expert witnesses and the defense’s decision not to call an expert in biostatistics and meta-analysis at the Rule 702 hearing.
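To make the aggregation argument concrete, here is a minimal sketch, in Python, of the standard fixed-effect, inverse-variance approach to pooling odds ratios across trials. The per-trial counts are purely hypothetical illustrations, not the Avandia trial data; the point is only that pooling borrows precision from every trial, which is why an association invisible in each small trial might emerge in a meta-analysis:

```python
import math

# Hypothetical per-trial counts (events_treated, n_treated, events_control, n_control).
# Illustrative numbers only -- these are NOT the Avandia trial data.
trials = [
    (5, 500, 3, 500),
    (2, 250, 1, 250),
    (8, 800, 5, 800),
]

weights_sum = 0.0
weighted_log_or_sum = 0.0
for a, n1, c, n0 in trials:
    b, d = n1 - a, n0 - c
    log_or = math.log((a * d) / (b * c))   # log odds ratio for this trial
    var = 1/a + 1/b + 1/c + 1/d            # approximate variance of the log OR
    w = 1 / var                            # inverse-variance weight
    weights_sum += w
    weighted_log_or_sum += w * log_or

pooled_log_or = weighted_log_or_sum / weights_sum
pooled_se = math.sqrt(1 / weights_sum)     # pooled SE shrinks as trials accumulate
lo = math.exp(pooled_log_or - 1.96 * pooled_se)
hi = math.exp(pooled_log_or + 1.96 * pooled_se)
print(f"pooled OR = {math.exp(pooled_log_or):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The pooled confidence interval is narrower than any single trial’s interval, which is the statistical substance behind the plaintiffs’ claim; whether the particular meta-analysis was properly conducted is, of course, a separate question.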
Another source of confusion about statistical power may well have come from the very reference work designed to help judges address statistical and scientific evidence in their judicial capacities: The Reference Manual on Scientific Evidence.
Statistical power is discussed in both the statistics chapter and the epidemiology chapter of The Reference Manual on Scientific Evidence. The chapter on epidemiology, however, provides misleading guidance on the use of power:
“When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent’s toxicity or is essentially inconclusive with regard to toxicity. The concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.79 The power of a study expresses the probability of finding a statistically significant association of a given magnitude (if it exists) in light of the sample sizes used in the study. The power of a study depends on several factors: the sample size; the level of alpha, or statistical significance, specified; the background incidence of disease; and the specified relative risk that the researcher would like to detect.80 Power curves can be constructed that show the likelihood of finding any given relative risk in light of these factors. Often power curves are used in the design of a study to determine what size the study populations should be.81”
Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000). See also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000).
This guidance is misleading in the context of epidemiologic studies because power curves are rarely used anymore to assess completed studies. Power calculations are, of course, used to help determine sample size for a planned study. After the data are collected, however, the appropriate way to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of risk.
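A short, hypothetical illustration of the point, in Python: the familiar Wald confidence interval around a relative risk from a completed study tells the reader directly which effect sizes the data can and cannot reasonably exclude, which is precisely the information a post-hoc power curve is supposed to supply:

```python
import math

def relative_risk_ci(a, n1, c, n0, z=1.96):
    """Wald 95% confidence interval for the relative risk from a 2x2 table:
    a events among n1 exposed subjects; c events among n0 unexposed subjects."""
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)  # standard error of log(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts, for illustration only:
rr, lo, hi = relative_risk_ci(30, 1000, 20, 1000)
print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# Prints roughly: RR = 1.50, 95% CI (0.86, 2.62).  The interval shows
# directly which relative risks the data do and do not rule out,
# without any need for a power curve.
```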
The authors of the epidemiology chapter cite a general review paper, id. at 362 n.79, which does indeed address the concept of statistical power. The lead author, a well-known statistician, addresses the issue primarily in the context of planning a statistical analysis, and of discrimination litigation, where the test result will be expressed as a p-value, without a measure of “effect size,” and, more important, without a “confidence interval” around the estimate of effect size:
“The chance of rejecting the false null hypothesis, under the assumptions of an alternative, is called the power of the test. Simply put, among many ways in which we can test a null hypothesis, we want to select a test that has a large power to correctly distinguish between two alternatives. Generally speaking, the power of a test increases with the size of the sample, and tests have greater power, and therefore perform better, the more extreme the alternative considered becomes.
Often, however, attention is focused on the first type of error and the level of significance. If the evidence, then, is not statistically significant, it may be because the null hypothesis is true or because our test did not have sufficient power to discern a difference between the null hypothesis and an alternative explanation. In employment discrimination cases, for example, separate tests for small samples of employees may not yield statistically significant results because each test may not have the ability to discern the null hypothesis of nondiscriminatory employment from illegal patterns of discrimination that are not extreme. On the other hand, a test may be so powerful, for example, when the sample size is very large, that the null hypothesis may be rejected in favor of an alternative explanation that is substantively of very little difference. ***
Attention must be paid to both types of errors and the risks of each, the level of significance, and the power. The trier of fact can better interpret the result of a significance test if he or she knows how powerful the test is to discern alternatives. If the power is too low against alternative explanations that are illegal practices, then the test may fail to achieve statistical significance even though the illegal practices may be operating. If the power is very large against a substantively small and legally permissible difference from the null hypothesis, then the test may achieve statistical significance even though the employment practices are legal.”
Stephen E. Fienberg, Samuel H. Krislov, and Miron L. Straf, “Understanding and Evaluating Statistical Evidence in Litigation,” 36 Jurimetrics J. 1, 22-23 (1995).
Professor Fienberg’s characterization is accurate, but his description of “post-hoc” assessment of power was not offered for the context of epidemiologic studies, which today virtually always report confidence intervals around their estimates of effect size. These confidence intervals allow a concerned reader to evaluate what can reasonably be ruled out by the data in a given study. Post-hoc power calculations, by contrast, provide little meaningful information because they require a specified alternative hypothesis. A wily plaintiff’s expert witness can always arbitrarily select a sufficiently low alternative hypothesis, say a relative risk of 1.01, such that any study would have a vanishingly small probability of correctly distinguishing the null and alternative hypotheses.
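A back-of-the-envelope calculation, again in Python and with purely hypothetical numbers, shows how the trick works: the approximate power of a standard two-sample test for a difference in proportions collapses toward the significance level itself as the posited alternative relative risk approaches 1.0, so any study can be declared “underpowered” by construction:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p0, rr, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sample z-test (two-sided alpha = 0.05)
    to detect a true relative risk `rr` against a baseline risk `p0`."""
    p1 = rr * p0
    pbar = (p0 + p1) / 2
    se0 = math.sqrt(2 * pbar * (1 - pbar) / n_per_arm)          # SE under H0
    se1 = math.sqrt((p0*(1-p0) + p1*(1-p1)) / n_per_arm)        # SE under H1
    return norm_cdf((abs(p1 - p0) - z_alpha * se0) / se1)

# Hypothetical: 2% baseline risk of the outcome, 2,000 subjects per arm.
for rr in (2.0, 1.5, 1.2, 1.01):
    print(f"RR = {rr:4.2f}: power ~ {power_two_proportions(0.02, rr, 2000):.3f}")
# Power is high (~0.96) against RR = 2.0 but collapses to roughly the
# significance level against RR = 1.01 -- the arbitrarily chosen alternative
# makes every study "underpowered," regardless of its actual size.
```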
The Reference Manual is now undergoing revision for an anticipated third edition. A saner treatment of the concept of power, as it is used in epidemiologic studies and clinical trials, would be helpful to courts and to lawyers who litigate cases involving this kind of statistical evidence.