Statistical Power in the Academy

Previously I have written about the concept of statistical power and how it is used and abused in the courts.  See here and here.

Statistical power was discussed in both the statistics and the epidemiology chapters of the Second Edition of The Reference Manual on Scientific Evidence. In my earlier posts, I pointed out that the chapter on epidemiology provided some misleading, outdated guidance on the use of power.  See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000) (recommending use of power curves to assess whether failure to achieve statistical significance is exonerative of the exposure in question).  That chapter suggests that “[t]he concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.” Id.; see also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000).

The fact of the matter is that power curves are rarely if ever used in contemporary epidemiology, and post-hoc power calculations have long been discouraged and severely criticized. After the data are collected, the appropriate way to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of the size of the risk.  The confidence interval allows a concerned reader to evaluate what can reasonably be ruled out (on the basis of random variation alone) by the data in a given study. Post-hoc power calculations, by contrast, provide little meaningful information because they require an arbitrarily specified alternative hypothesis.
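To make the contrast concrete, here is a minimal sketch in Python, using entirely hypothetical 2×2 counts (not drawn from any study discussed in this post), of how a confidence interval reports what the data themselves can rule out, while a post-hoc power figure depends only on a standard error, a chosen alpha, and an analyst-selected alternative hypothesis. The Wald-type interval and the crude normal-approximation power formula are illustrative assumptions, not the only way such calculations can be done.

```python
# A minimal sketch with hypothetical 2x2 counts (not from any study cited in this post),
# contrasting a confidence interval for an odds ratio with a post-hoc power calculation.
import math
from scipy.stats import norm

# Hypothetical counts: exposed cases, exposed controls, unexposed cases, unexposed controls
a, b, c, d = 8, 2, 40, 42

log_or = math.log((a * d) / (b * c))      # point estimate of the log odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Woolf (Wald) standard error of the log odds ratio
z = norm.ppf(0.975)                       # ~1.96 for a two-sided 95% interval

ci_low = math.exp(log_or - z * se)
ci_high = math.exp(log_or + z * se)
print(f"OR = {math.exp(log_or):.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
# The interval shows directly which true odds ratios the data can and cannot
# reasonably rule out, using the results actually obtained.

# A post-hoc power figure, by contrast, requires choosing an alternative hypothesis
# (here OR = 2.0) and uses only the standard error and alpha -- never the observed estimate.
or_alt = 2.0
power = norm.cdf(abs(math.log(or_alt)) / se - z)  # crude normal-approximation power
print(f"Approximate power to detect OR = {or_alt}: {power:.2f}")
```

With these made-up counts the interval is wide, running from below 1.0 to well above it, which is precisely the “inconclusive rather than negative” picture that a confidence interval conveys without any resort to power.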

Twenty-five years ago, the use of post-hoc power was thoughtfully consigned to the dustbin of statistical techniques in the leading clinical medical journal:

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

***

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

An accompanying editorial by Ken Rothman reinforced the guidance given by Simon:

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986).

These two papers must be added to the 20 consensus statements, textbooks, and articles I previously cited.  See Schachtman, Power in the Courts, Part Two (2011).

The danger of the Reference Manual’s misleading advice is illustrated in a recent law review article by Professor Gold, of the Rutgers Law School, who asks “[w]hat if, as is frequently the case, such study is possible but of limited statistical power?”  Steve C. Gold, “The ‘Reshapement’ of the False Negative Asymmetry in Toxic Tort Causation,” 37 William Mitchell L. Rev. 101, 117 (2011) (available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1797826).

Never mind for the moment that Professor Gold offers no empirical evidence to support his assertion that studies of limited statistical power are “frequently” used in litigation.  Gold critically points to Dunn v. Sandoz Pharmaceuticals Corp., 275 F. Supp. 2d 672, 677–81, 684 (M.D.N.C. 2003), a Parlodel case in which the plaintiff relied upon a single case-control study that found an elevated odds ratio (8.4), which was not statistically significant.  Gold at 117.  Gold complains that “a study’s limited statistical power, rather than the absence of a genuine association, may lead to statistically insignificant results that courts treat as disproof of causation, particularly in situations without the large study samples that result from mass exposures.” Id.  Gold goes on to applaud two cases for emphasizing consideration of post-hoc power.  Id. at 117 & nn. 80–81 (citing Smith v. Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 692–93 (W.D.N.C. 2003) (“[T]he concept of power is key because it’s helpful in evaluating whether the study’s outcome . . . is exonerative or inconclusive.”), and Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767, 774 (N.D. Ohio 2010) (prohibiting an expert witness from opining that epidemiologic studies are evidence of no association unless the witness “has performed a methodologically reliable analysis of the studies’ statistical power to support that conclusion”)).

What of Professor Gold’s suggestion that power should be considered in evaluating studies that do not have statistically significant outcomes of interest?  See id. at 117. Not only is Gold’s endorsement at odds with sound scientific and statistical advice, but his approach reveals a potential hypocrisy when considered in the light of his criticisms of significance testing.  Post-hoc power tests ignore the results obtained, including the variance of the actual study results, and they are calculated based upon a predetermined arbitrary measure of Type I error (alpha) that is the focus of so much of Gold’s discomfort with statistical evidence.  Of course, power calculations also are made on the basis of arbitrarily selected alternative hypotheses, but this level of arbitrariness seems not to disturb Gold so much.
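To illustrate the point, here is a short Python sketch using an assumed standard error of 0.82 for a completed study’s log odds ratio (a made-up figure, taken neither from Gold’s article nor from any case), showing that the post-hoc power figure moves with the analyst’s choice of alpha and alternative hypothesis, while the observed estimate never enters the calculation.

```python
# A small sketch (assumed standard error, not drawn from Gold's article or any case)
# showing that post-hoc power depends entirely on alpha and the chosen alternative.
import math
from scipy.stats import norm

se = 0.82  # assumed standard error of the log odds ratio from some completed study

for alpha in (0.05, 0.10):
    z_crit = norm.ppf(1 - alpha / 2)
    for or_alt in (1.5, 2.0, 4.0):
        power = norm.cdf(abs(math.log(or_alt)) / se - z_crit)
        print(f"alpha={alpha:.2f}, alternative OR={or_alt:.1f}: power={power:.2f}")

# The study's actual estimate appears nowhere above: two analysts with the same data
# can report very different "power" simply by picking different alternatives and alphas.
```

The arbitrariness that Gold finds objectionable in a fixed alpha is thus baked twice into any post-hoc power figure, once through alpha and once through the selected alternative.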

Where does the Third Edition of the Reference Manual on Scientific Evidence come out on this issue?  The Third Edition is not yet published, but Professor David Kaye has posted his chapter on statistics on the internet.  David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” chapter 5.  http://www.personal.psu.edu/dhk3/pubs/11-FJC-Ch5-Stat.pdf (David Freedman died in 2008, after the chapter was submitted to the National Academy of Sciences for review; only Professor Kaye responded to the Academy’s reviews).

The chapter essentially continues the Second Edition’s advice:

“When a study with low power fails to show a significant effect, the results may therefore be more fairly described as inconclusive than negative. The proof is weak because power is low. On the other hand, when studies have a good chance of detecting a meaningful association, failure to obtain significance can be persuasive evidence that there is nothing much to be found.”

Chapter 5, at 44-46 (citations and footnotes omitted).

The chapter’s advice is not, of course, limited to epidemiologic studies, in which a risk ratio or a risk difference is typically reported with an appropriate confidence interval.  As a general statement about all statistical tests, some of which do not report a measure of “effect size” or the variability of the sample statistic, the chapter’s advice is fine.  But, as we can see from Professor Gold’s discussion and case review, the advice runs into trouble when measured against the methodological standards for evaluating an epidemiologic study’s results, where confidence intervals are available.  Gold’s assessment of the cases is considerably skewed by his failure to recognize the inappropriateness of post-hoc power assessments of epidemiologic studies.