The Misuse of Power in the Courts
A claim that a study has low power is meaningless unless both the alternative hypothesis and the level of significance are included in the statement. See Sander Greenland, “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative,” 22 Ann. Epidemiol. 364 (2012); Sander Greenland & Charles Poole, “Problems in common interpretations of statistics in scientific articles, expert reports, and testimony,” 51 Jurimetrics J. 113, 121-22 (2011).
Power can always be made to appear low by selecting an alternative hypothesis sufficiently close to the null. A study using risk ratios that has high power against an alternative hypothesis of 2.0 may have very low power against an alternative of 1.1. Because risk ratios greater than two are often used to attribute specific causation, measuring a study’s power against an alternative hypothesis of a doubling of risk might well be a reasonable approach in some cases. For instance, in Miller v. Pfizer, 196 F. Supp. 2d 1062, 1079 (D. Kan. 2002), aff’d, 356 F.3d 1326 (10th Cir.), cert. denied, 543 U.S. 917 (2004), the trial court’s Rule 706 expert witness calculated that a pooled analysis of suicidality in clinical trial data of an anti-depressant had power exceeding 90% to detect a doubling of risk. Report of John Concato, M.D., 2001 WL 1793169, *9 (D. Kan. 2001). Unless a court is willing to specify the level at which it would find the risk ratio unhelpful or not probative, such as a relative risk greater than two, power analyses of completed studies are not particularly useful.
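To make the point concrete, here is a minimal sketch, with a purely hypothetical standard error standing in for a real study, of how the same data can have high power against a doubling of risk yet almost none against a 10% increase. The normal approximation on the log risk ratio scale, and the standard error of 0.20, are illustrative assumptions only:

```python
# Minimal sketch (hypothetical numbers): power of the same study against
# two different alternative hypotheses, using the normal approximation
# for the log risk ratio and a two-sided test at alpha = 0.05.
import math
from scipy.stats import norm

se = 0.20                  # assumed standard error of the log risk ratio
z_crit = norm.ppf(0.975)   # two-sided critical value, ~1.96

for alt_rr in (2.0, 1.1):
    # approximate power: the chance the estimate clears the critical
    # value when the true risk ratio is alt_rr
    power = norm.cdf(math.log(alt_rr) / se - z_crit)
    print(f"power against RR = {alt_rr}: {power:.2f}")
# power against RR = 2.0: 0.93
# power against RR = 1.1: 0.07
```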
Plaintiffs’ counsel rightly complain when defendants claim that a study with a statistically “non-significant” risk ratio greater than 1.0 has no probative value. Although random error (or bias and confounding) may account for the increased risk, the risk may be real. If studies consistently show an increased risk, even though each study has reported a p-value greater than 5%, meta-analytic approaches may well help rule out chance as a likely explanation for the increased risk. The complaint that a study is underpowered, however, without more, neither helps plaintiffs establish an association nor establishes that the study provides no useful information.
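A minimal sketch of that point, with wholly hypothetical results: three studies, each individually “non-significant,” pooled by standard fixed-effect, inverse-variance weighting into a result that does rule out chance at the conventional level:

```python
# Minimal sketch (hypothetical numbers): three studies, each with
# RR = 1.3 and p > 0.05, pooled by fixed-effect inverse-variance
# weighting into a statistically significant combined estimate.
import math
from scipy.stats import norm

# (log risk ratio, standard error) for three hypothetical studies
studies = [(math.log(1.3), 0.16), (math.log(1.3), 0.18), (math.log(1.3), 0.20)]

for log_rr, se in studies:
    p = 2 * (1 - norm.cdf(abs(log_rr) / se))
    print(f"single-study p = {p:.3f}")        # 0.101, 0.145, 0.190 -- all > 0.05

weights = [1 / se**2 for _, se in studies]    # inverse-variance weights
pooled = sum(w * lr for (lr, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
p_pooled = 2 * (1 - norm.cdf(abs(pooled) / pooled_se))
print(f"pooled RR = {math.exp(pooled):.2f}, p = {p_pooled:.3f}")  # RR = 1.30, p = 0.011
```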
The power of a study depends upon several variables: the size of the effect specified by the alternative hypothesis; the sample size; the expected value and its variance; and the acceptable probability of a false-positive finding, reflected in the pre-specified significance level, α, below which the study’s findings would be interpreted as unlikely to be consistent with the null hypothesis. The lower α is set, the lower the power of a test or study will be, all other things being equal. Conversely, moving from a two-tailed to a one-tailed test of significance will increase power. Courts have acknowledged that both Type I and Type II errors, and the corresponding α and β, are important, but they have overlooked that Type II errors are usually less relevant to the litigation process. See, e.g., DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 948 (3d Cir. 1990). A single study that fails to show a statistically significant difference in the outcome of interest does not support a conclusion that the outcome is not causally related to the exposure under study; and in products liability litigation, defendants are typically not assigned a burden of proving the absence of causation.
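These directional claims can be put compactly under the usual normal approximation, where δ is the effect size specified by the alternative hypothesis and SE is its standard error (a textbook formula, not anything drawn from the cases):

```latex
\[
  \mathrm{Power}(\delta) \;=\; 1 - \beta(\delta) \;\approx\;
  \Phi\!\left(\frac{\delta}{\mathrm{SE}} - z_{1-\alpha/2}\right)
  \ \text{(two-tailed)}, \qquad
  \Phi\!\left(\frac{\delta}{\mathrm{SE}} - z_{1-\alpha}\right)
  \ \text{(one-tailed)}
\]
```

Lowering α raises the critical value and so lowers power; and because the one-tailed critical value is smaller than the two-tailed one, the one-tailed test has greater power against the same alternative.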
In the Avandia litigation, plaintiffs’ key claim is that the medication, an oral anti-diabetic, causes heart attacks, even though none of the several dozen clinical trials found a statistically significant increased risk. Plaintiffs’ expert witnesses argued that all the clinical trials of Avandia were “underpowered,” and thus the failure to find an increased risk was a Type II (false-negative) error that resulted from the small size of the clinical trials. The Avandia MDL court, considering Rule 702 challenges to plaintiffs’ expert witness opinions, accepted this argument:
“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs [randomized controlled trials] upon which GSK relies are all underpowered to study cardiac risks.”
In re Avandia Mktg., Sales Practices & Prods. Liab. Litig., 2011 WL 13576, at *2 (E.D. Pa. 2011) (emphasis in original). The Avandia MDL court failed to realize that the power argument was empty without a specification of an alternative hypothesis. For instance, in one of the larger trials of Avandia, the risk ratio for heart attack was a statistically non-significant 1.14, with a 95% confidence interval spanning 0.80 to 1.63. P.D. Home, et al., “Rosiglitazone Evaluated for Cardiovascular Outcomes in Oral Agent Combination Therapy for Type 2 Diabetes (RECORD),” 373 Lancet 2125 (2009). This trial, standing alone, thus had excellent power against an alternative hypothesis that Avandia doubled the risk of heart attacks; that alternative would clearly be rejected on the RECORD data. An alternative hypothesis of 1.2, on the other hand, would not be. The confidence interval, by quantifying random error, conveys the range of results reasonably compatible with the study’s estimate; the claim of “low power” against an unspecified alternative hypothesis conveys nothing.
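The arithmetic can be reconstructed from the published summary statistics alone. A minimal sketch, assuming only a normal approximation on the log risk ratio scale and the rounded figures reported for RECORD:

```python
# Back-calculating approximate power from RECORD's published result:
# RR = 1.14, 95% CI 0.80 to 1.63.
import math
from scipy.stats import norm

lo, hi = 0.80, 1.63
z_crit = norm.ppf(0.975)                  # two-sided critical value, ~1.96
se = math.log(hi / lo) / (2 * z_crit)     # implied SE of the log RR, ~0.18

for alt_rr in (2.0, 1.2):
    power = norm.cdf(math.log(alt_rr) / se - z_crit)
    print(f"power against RR = {alt_rr}: {power:.2f}")
# power against RR = 2.0: 0.97  (a doubled risk would almost surely be detected)
# power against RR = 1.2: 0.17  (a 20% increase would usually be missed)
```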
Last year, in a hormone therapy breast cancer case, the Eighth Circuit confused power with β, and succumbed to the plaintiff’s expert witness’s argument that he was justified in ignoring several large, well-conducted clinical trials and observational studies because they were “underpowered,” without specifying the alternative hypothesis on which he based his claim:
“Statistical power is ‘the probability of rejecting the null hypothesis in a statistical test when a particular alternative hypothesis happens to be true’. Merriam–Webster Collegiate Dictionary 973 (11th ed. 2003). In other words, it is the probability of observing false negatives. Power analysis can be used to calculate the likelihood of accurately measuring a risk that manifests itself at a given frequency in the general population based on the sample size used in a particular study. Such an analysis is distinguishable from determining which study among several is the most reliable for evaluating whether a correlative or even a causal relationship exists between two variables.”
Kuhn v. Wyeth, Inc., 686 F.3d 618, 622 n.5 (8th Cir. 2012), rev’g In re Prempro Prods. Liab. Litig., 765 F. Supp. 2d 1113 (W.D. Ark. 2011). The Kuhn court’s reformulation, “in other words,” is incorrect. Power is not the probability of observing false negatives; it is the probability of correctly rejecting the null in favor of a specified alternative hypothesis, at a specified level of significance. The court’s further discussion of “accurately measuring” mischievously confuses statistical power, which concerns random variability, with study validity. The Eighth Circuit’s opinion never discusses or discloses what alternative hypothesis the plaintiff’s expert witness had in mind when disavowing certain studies as underpowered. I suspect that none was ever provided, and that the judges missed the significance of the omission. Courts would do better to use the confidence intervals around point estimates to assess the statistical imprecision in the observed data, rather than improper power analyses that fail to specify a legally significant alternative hypothesis.