The Third Edition of the Reference Manual on Scientific Evidence (2011) [RMSE3ed] treats statistical power in three of its chapters, those on statistics, epidemiology, and medical testimony. Unfortunately, the treatment is not always consistent.
The chapter on statistics has consistently been among the best, and the most frequently ignored, content in the three editions of the Reference Manual. The most recent edition offers a good introduction to basic concepts of sampling, random variability, significance testing, and confidence intervals. David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” in RMSE3ed 209 (2011). Kaye and Freedman provide an acceptable non-technical definition of statistical power:
“More precisely, power is the probability of rejecting the null hypothesis when the alternative hypothesis … is right. Typically, this probability will depend on the values of unknown parameters, as well as the preset significance level α. The power can be computed for any value of α and any choice of parameters satisfying the alternative hypothesis. … Frequentist hypothesis testing keeps the risk of a false positive to a specified level (such as α = 5%) and then tries to maximize power. Statisticians usually denote power by the Greek letter beta (β). However, some authors use β to denote the probability of accepting the null hypothesis when the alternative hypothesis is true; this usage is fairly standard in epidemiology. Accepting the null hypothesis when the alternative holds true is a false negative (also called a Type II error, a missed signal, or a false acceptance of the null hypothesis).”
Id. at 254 n.106.
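To make the quoted definitions concrete, the following minimal sketch computes power and the corresponding Type II error rate for a simple two-sided z-test. The effect size, standard deviation, sample size, and significance level are hypothetical choices of my own, not drawn from the Manual, and the sketch assumes SciPy is available:

    # Illustrative sketch only: power of a two-sided one-sample z-test.
    # The effect size, sigma, n, and alpha below are hypothetical choices.
    from scipy.stats import norm

    alpha, mu_alt, sigma, n = 0.05, 0.5, 1.0, 50
    z_crit = norm.ppf(1 - alpha / 2)        # critical value, about 1.96
    shift = mu_alt * (n ** 0.5) / sigma     # standardized effect under the alternative
    power = norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
    beta = 1 - power                        # probability of a Type II error
    print(f"power = {power:.3f}; Type II error rate (beta) = {beta:.3f}")

On these assumed inputs the script reports power of roughly 0.94, and the Type II error rate is simply its complement, illustrating why most texts write power as (1 − β).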
The definition is not, however, without problems. First, it introduces a nomenclature issue likely to confuse judges and lawyers. Kaye and Freedman use β to denote statistical power, but they acknowledge that epidemiologists use β to denote the probability of a Type II error. And indeed, both the epidemiology and the medical testimony chapters use β to denote the Type II error rate, and thus denote power as the complement of β, or (1 − β). See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3ed 549, 582, 626; John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, “Reference Guide on Medical Testimony,” in RMSE3ed 687, 724. This conflict in nomenclature is regrettable, given the difficulty many lawyers and judges seem to have in following discussions of statistical concepts.
Second, the reason for introducing the confusion about β is doubtful. Kaye and Freedman suggest that statisticians usually denote power by β, but they offer no citations. A quick review (not necessarily complete or even a random sample) suggests that many modern statistics texts denote power as (1 − β). See, e.g., Richard D. De Veaux, Paul F. Velleman, and David E. Bock, Intro Stats 545-48 (3d ed. 2012); Rand R. Wilcox, Fundamentals of Modern Statistical Methods 65 (2d ed. 2010). At the end of the day, there is no good reason for the conflicting nomenclature and the confusion it is likely to engender. Indeed, the duplicative handling of statistical power and other concepts suggests that it is time to eliminate the repetitive discussions in favor of one clear, thorough discussion in the statistics chapter.
Third, Kaye and Freedman problematically refer to β as the probability of accepting the null hypothesis, when elsewhere they more carefully instruct that a non-significant finding results in not rejecting the null hypothesis, as opposed to accepting it. Id. at 253. See also Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3ed 303, 321 (describing a p-value > 5% as leading to failing to reject the null hypothesis).
Fourth, Kaye and Freedman’s discussion of power, unlike most of their chapter, offers advice that is controversial and unclear:
“On the other hand, when studies have a good chance of detecting a meaningful association, failure to obtain significance can be persuasive evidence that there is nothing much to be found.”
RMSE3ed at 254. Note that the authors leave open what a legally or clinically meaningful association is, and thus offer no real guidance to judges on how to evaluate power after data are collected and analyzed. As Professor Sander Greenland has argued, in legal contexts this reliance upon observed power (as opposed to power as a guide in determining appropriate sample size at the planning stage of a study) is arbitrary and “unsalvageable as an analytic tool.” See Sander Greenland, “Nonsignificance Plus High Power Does Not Imply Support Over the Alternative,” 22 Ann. Epidemiol. 364, 364 (2012).
The chapter on epidemiology offers similar controversial advice on the use of power:
“When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent’s toxicity or is essentially inconclusive with regard to toxicity.93 The concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.94 The power of a study is the probability of finding a statistically significant association of a given magnitude (if it exists) in light of the sample sizes used in the study. The power of a study depends on several factors: the sample size; the level of alpha (or statistical significance) specified; the background incidence of disease; and the specified relative risk that the researcher would like to detect.95 Power curves can be constructed that show the likelihood of finding any given relative risk in light of these factors. Often, power curves are used in the design of a study to determine what size the study populations should be.96”
Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3ed 549, 582. Although the authors correctly emphasize the need to specify an alternative hypothesis, their discussion offers no advice on how that alternative should be selected in legal contexts. The suggestion that power curves can be constructed is, of course, true, but it is irrelevant unless courts know where on the power curve they should be looking. The authors are also correct that power is used to determine adequate sample size under specified conditions, but the use of power curves in that setting is now rather uncommon. Investigators instead select a level of power corresponding to an acceptable Type II error rate, and an alternative hypothesis that would be clinically meaningful for their research, in order to determine their sample size. Translating clinical meaningfulness into legal meaningfulness is not always straightforward.
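The factors the chapter lists, namely sample size, alpha, background incidence, and the relative risk to be detected, can be illustrated with a short sketch of a power curve. The cohort sizes, background incidence, and relative risks below are hypothetical, the normal approximation used is only one of several ways to compute such a curve, and SciPy is assumed to be available:

    # Illustrative sketch only: an approximate power curve for a cohort study,
    # varying the true relative risk (RR) with sample size, alpha, and
    # background incidence held fixed.  All numbers are hypothetical.
    from math import sqrt
    from scipy.stats import norm

    alpha = 0.05          # two-sided significance level
    n_per_group = 5000    # exposed and unexposed cohort sizes (hypothetical)
    p0 = 0.01             # background incidence among the unexposed
    z_crit = norm.ppf(1 - alpha / 2)

    def approx_power(rr):
        """Normal-approximation power of a two-sided two-proportion z-test."""
        p1 = rr * p0      # incidence among the exposed under the alternative
        se = sqrt(p1 * (1 - p1) / n_per_group + p0 * (1 - p0) / n_per_group)
        shift = (p1 - p0) / se
        return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

    for rr in (1.2, 1.5, 2.0, 3.0):
        print(f"RR = {rr:.1f}: power is approximately {approx_power(rr):.2f}")

The point of the sketch is not the particular numbers but the structural one made above: the curve tells a court nothing unless someone specifies which relative risk along it is legally or clinically meaningful.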
In a footnote, the authors of the epidemiology chapter note that Professor Rothman has been “one of the leaders in advocating the use of confidence intervals and rejecting strict significance testing.” RMSE3ed at 579 n.88. What the chapter fails to mention, however, is that Rothman has also been outspoken in rejecting the post-hoc power calculations that the epidemiology chapter seems to invite:
“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis. The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect. In planning a study, it is reasonable to make conjectures about the magnitude of an effect to compute study-size requirements or power. In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with study-size or power calculations (Smith and Bates, 1992; Goodman and Berlin, 1994; Hoenig and Heisey, 2001). Confidence limits and (even more so) P-value functions convey much more of the essential information by indicating the range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level), assuming the statistical model is correct. They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”
Kenneth Rothman, Sander Greenland, and Timothy Lash, Modern Epidemiology 160 (3d ed. 2008). See also Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986) (“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”).
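Rothman’s preference for interval estimates over after-the-fact power calculations is easy to demonstrate. The sketch below uses entirely hypothetical cohort counts (not data from any study discussed here) to compute a risk ratio and its 95% confidence interval by the standard log-transformation method; the interval itself displays the range of effect sizes reasonably compatible with the data, which is the information a post-hoc power calculation only gestures at. SciPy is assumed to be available:

    # Illustrative sketch only: a risk ratio and 95% confidence interval
    # from hypothetical cohort counts, in the spirit of Rothman's advice.
    from math import log, exp, sqrt
    from scipy.stats import norm

    a, n1 = 40, 5000      # cases and persons among the exposed (hypothetical)
    b, n0 = 45, 5000      # cases and persons among the unexposed (hypothetical)

    rr = (a / n1) / (b / n0)
    se_log_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)   # approximate SE of ln(RR)
    z = norm.ppf(0.975)
    lo, hi = exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)
    print(f"RR = {rr:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

On these made-up counts the point estimate sits just below 1.0 and the interval runs from well below 1.0 to modestly above it; a reader can see at a glance how large an effect the data can and cannot rule out, without any speculation about a hypothesized alternative.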
The selective, incomplete scholarship of the epidemiology chapter on the issue of statistical power is not only unfortunate; it also distorts the authors’ evaluation of the sparse case law on the issue of power. For instance, they note:
“Even when a study or body of studies tends to exonerate an agent, that does not establish that the agent is absolutely safe. See Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767 (N.D. Ohio 2010). Epidemiology is not able to provide such evidence.”
RMSE3ed at 582 n.93; id. at 582 n.94 (“Thus, in Smith v. Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 693 (W.D.N.C. 2003), and Cooley v. Lincoln Electric Co., 693 F. Supp. 2d 767, 773 (N.D. Ohio 2010), the courts recognized that the power of a study was critical to assessing whether the failure of the study to find a statistically significant association was exonerative of the agent or inconclusive.”).
Here Green, Freedman, and Gordis shift the burden to the defendant and make the burden one of absolute certainty in the product’s safety. This is not a legal standard. The cases they cite amplify the error. In Cooley, for instance, the defense expert would have opined that welding fume exposure did not cause parkinsonism or Parkinson’s disease. Although the expert had not conducted a meta-analysis, he had reviewed the confidence intervals around the point estimates of the available studies. Many of the point estimates were at or below 1.0, and in some cases the upper bound of the confidence interval excluded 1.0. The trial court expressed its concern that the expert witness had inferred “evidence of absence” from “absence of evidence.” Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767, 773 (N.D. Ohio 2010). This concern, however, was misguided, given that many studies had tested the claimed association, and that virtually every case-control and cohort study had found risk ratios at or below 1.0, or very close to 1.0. What the court in Cooley and the authors of the epidemiology chapter in RMSE3ed have lost sight of is that when a hypothesis has been tested repeatedly, with failures to reject the null hypothesis, point estimates at or very close to 1.0, and narrow confidence intervals, the claimed association is probably incorrect. See, e.g., Anthony J. Swerdlow, Maria Feychting, Adele C. Green, Leeka Kheifets, David A. Savitz, International Commission for Non-Ionizing Radiation Protection Standing Committee on Epidemiology, “Mobile Phones, Brain Tumors, and the Interphone Study: Where Are We Now?” 119 Envt’l Health Persp. 1534, 1534 (2011) (“Although there remains some uncertainty, the trend in the accumulating evidence is increasingly against the hypothesis that mobile phone use can cause brain tumors in adults.”).
The Cooley court’s comments have some validity when applied to a single study, but not to the impressive body of exculpatory epidemiologic evidence on welding fume and Parkinson’s disease. Shortly after Cooley was decided, a published meta-analysis of studies of welding fume and manganese exposure found a reduced risk of Parkinson’s disease among occupationally exposed workers. James Mortimer, Amy Borenstein, and Lorene Nelson, “Associations of welding and manganese exposure with Parkinson disease: Review and meta-analysis,” 79 Neurology 1174 (2012).