Power in the Courts — Part Two

Post hoc calculations of power were once in vogue, but biostatisticians and epidemiologists now routinely condemn them for studies that report confidence intervals around their estimates of associations, or “effect sizes.”  Power calculations require an alternative hypothesis against which to measure the rejection of the null hypothesis, and the choice of that alternative is subjective and often arbitrary.  Furthermore, the power calculation must make assumptions about the anticipated variance of the data to be obtained.  Once the data are in fact obtained, those assumptions may turn out to have been wrong.  In other words, sometimes the investigators are “lucky,” and their data are less variable than anticipated.  The variability of the data actually obtained, rather than the variability hypothesized, is best appreciated from the confidence interval around the measured point estimate of risk.
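To make the contrast concrete, here is a minimal sketch, in Python and with hypothetical counts not drawn from any Avandia trial, of how a confidence interval for a risk ratio is computed from the data actually observed (the standard Wald interval on the log scale):

import math

# Hypothetical counts, for illustration only
a, n1 = 30, 1000   # events and patients in the treated arm
b, n0 = 20, 1000   # events and patients in the control arm

rr = (a / n1) / (b / n0)                    # observed risk ratio = 1.50
se = math.sqrt(1/a - 1/n1 + 1/b - 1/n0)     # standard error of ln(RR)
lower = math.exp(math.log(rr) - 1.96 * se)  # lower 95% limit, about 0.86
upper = math.exp(math.log(rr) + 1.96 * se)  # upper 95% limit, about 2.62
print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")

On these made-up numbers, the interval of roughly 0.86 to 2.62 does everything a post hoc power calculation is supposed to do:  it shows at a glance that the result is not statistically significant and, more important, that the data cannot rule out a doubling of risk.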

In Part One of “Power in the Courts,” I addressed the misplaced emphasis the Avandia MDL court put upon the concept of statistical power.  The court apparently accepted at face value the plaintiffs’ argument that GSK’s clinical trials were “underpowered,” a claim that was highly misleading.  Power calculations were no doubt done to choose sample sizes for GSK’s clinical trials, but those a priori estimates were based upon assumptions.  In one very large trial, RECORD, many fewer events occurred than anticipated (which is generally a good thing, and not unusual in a clinical trial that gives patients in all arms better healthcare than is available to the general population).  In one sense, the plaintiffs’ expert witnesses are correct to say that RECORD was “underpowered,” but once a study is done, the real measure of statistical precision is given by the confidence interval.

Because the Avandia MDL is not the only litigation in which courts and lawyers have mistakenly urged power concepts for studies that have already been completed, I have collected some key statements that reflect the general consensus and reasoning against what the Court did.

To be fair, the Avandia court did not fault the defense for not having calculated the post hoc power of the clinical trials, all of which failed to find statistically significant associations between Avandia and heart attacks.  The court, however, did appear to embrace the plaintiffs’ rhetoric that all the Avandia trials were underpowered, without giving any consideration to the width and the upper bounds of the confidence intervals around those trials’ estimates of risk ratios for heart attack.  Remarkably, the Avandia court did not present any confidence intervals for any estimates of effect size, although it did present p-values, which it then badly misinterpreted.  Many of the Avandia trials (and the resulting meta-analyses) had upper confidence bounds that ruled out risk ratios, for heart attacks, of 2.0 or greater.  The court’s conclusions about power are thus misleading at best.

Several consensus statements address whether considerations of power are appropriate after studies are completed and the data are analyzed.  The issue has also been addressed extensively in textbooks and in articles.  I have collected some of the relevant statements, below.  To the extent that the Federal Judicial Center’s Reference Manual on Scientific Evidence appears to urge post hoc power calculations, I hope that the much-anticipated Third Edition will correct the error.

CONSENSUS STATEMENTS

CONSORT

The CONSORT group (Consolidated Standards of Reporting Trials) is a world-wide group that sets quality standards for the reporting of randomized trials in the testing of pharmaceuticals.  CONSORT’s lead author is Douglas Altman, a well-respected biostatistician from Oxford University.  The advice of the CONSORT group is clear:

“There is little merit in calculating the statistical power once the results of the trial are known, the power is then appropriately indicated by confidence intervals.”

Douglas Altman, et al., “The Revised CONSORT Statement for Reporting Randomized Trials:  Explanation and Elaboration,” 134 Ann. Intern. Med. 663, 670 (2001).  See also Douglas Altman, et al., “Reporting power calculations is important,” 325 Br. Med. J. 1304 (2002).

STROBE

An effort similar to CONSORT has been organized by investigators interested in observational studies:  the STROBE group (Strengthening the Reporting of Observational Studies in Epidemiology).  The STROBE group was made up of leading epidemiologists and biostatisticians, who addressed persistent issues and errors in the reporting of observational studies.  Their advice was equally unequivocal on the issue of post hoc power considerations:

“Do not bother readers with post hoc justifications for study size or retrospective power calculations. From the point of view of the reader, confidence intervals indicate the statistical precision that was ultimately obtained. It should be realized that confidence intervals reflect statistical uncertainty only, and not all uncertainty that may be present in a study (see item 20).”

Vandenbroucke, et al., “Strengthening the reporting of observational studies in epidemiology (STROBE):  Explanation and elaboration,” 18 Epidemiology 805, 815 (2007) (Section 10, sample size).

American Psychological Association

In 1999, a committee of the American Psychological Association met to discuss various statistical issues in psychological research papers.  With respect to power analysis, the committee concluded:

“Once the study is analyzed, confidence intervals replace calculated power in describing the results.”

Wilkinson, Task Force on Statistical Inference, “Statistical methods in psychology journals:  guidelines and explanations,” 54 Am. Psychol. 594-604 (1999)

TEXTBOOKS

Modern Epidemiology

Kenneth Rothman and Sander Greenland are known for many contributions, not the least of which is their textbook on epidemiology.  In the second edition of Modern Epidemiology, the authors explain how and why confidence intervals replace power considerations, once the study is completed and the data are analyzed:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis.  The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect.  * * *  In planning a study, it is reasonable to make conjectures about the magnitude of an effect in order to compute sample-size requirements or power.

In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with sample-size or power calculations (Smith & Bates 1992; Goodman & Berlin 1994). * * * Confidence limits convey much more of the essential information by indicating a range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level).  They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman & Sander Greenland, Modern Epidemiology 192-193 (2d ed. 1998)

And in 2008, with the addition of Timothy Lash as a co-author, Modern Epidemiology continued its guidance on power as only a pre-study consideration:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis. The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect. In planning a study, it is reasonable to make conjectures about the magnitude of an effect to compute study-size requirements or power. In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with study-size or power calculations (Smith and Bates, 1992; Goodman and Berlin, 1994; Hoenig and Heisey, 2001). Confidence limits and (even more so) P-value functions convey much more of the essential information by indicating the range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level), assuming the statistical model is correct. They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman, Sander Greenland, and Timothy Lash, Modern Epidemiology 160 (3d ed. 2008)

A Short Introduction to Epidemiology

Neil Pearce, an epidemiologist, citing Smith & Bates 1992, and Goodman & Berlin 1994, infra, describes the standard method:

“Once a study has been completed, there is little value in retrospectively performing power calculations since the confidence limits of the observed measure of effect provide the best indication of the range of likely values for the true association.”

Neil Pearce, A Short Introduction to Epidemiology (2d ed. 2005)

Statistics at Square One

The British Medical Journal publishes a book, Statistics at Square One, which addresses the issue of post hoc power:

“The concept of power is really only relevant when a study is being planned.  After a study has been completed, we wish to make statements not about hypotheses but about the data, and the way to do this is with estimates and confidence intervals.”

T. Swinscow, Statistics at Square One 42 (9th ed. London 1996) (citing to a book by Martin Gardner and Douglas Altman, both highly accomplished biostatisticians).

How to Report Statistics in Medicine

Two authors from the Cleveland Clinic, writing in a guidebook published by the American College of Physicians, advise:

“Until recently, authors were urged to provide ‘post hoc power calculations’ for non-significant differences.  That is, if the results of the study were negative, a power calculation was to be performed after the fact to determine the adequacy of the sample size.  Confidence intervals also reflect sample size, however, and are more easily interpreted, so the requirement of a post hoc power calculation for non-statistically significant results has given way to reporting the confidence interval (32).”

Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 58 (2d ed. 2006) (citing to Goodman & Berlin, infra).  See also Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 78 (1st ed. 1996)

Clinical Epidemiology:  The Essentials

The Fletchers, both respected clinical epidemiologists, describe the standard method and practice:

Statistical Power Before and After a Study is Done

“Calculation of statistical power based on the hypothesis testing approach is done by the researchers before a study is undertaken to ensure that enough patients will be entered to have a good chance of detecting a clinically meaningful effect if it is present.  However, after the study is completed this approach is no longer relevant.  There is no need to estimate effect size, outcome event rates, and variability among patients; they are now known.

Therefore, for researchers who report the results of clinical research and readers who try to understand their meaning, the confidence interval approach is more relevant.  One’s attention should shift from statistical power for a somewhat arbitrarily chosen effect size, which may be relevant in the planning stage, to the actual effect size observed in the study and the statistical precision of that estimate of the true value.”

R. Fletcher, et al., Clinical Epidemiology: The Essentials at 200 (3d ed. 1996)

The Planning of Experiments

Sir David Cox is one of the leading statisticians in the world.  In his classic 1958 text, The Planning of Experiments, Sir David wrote:

“Power is important in choosing between alternative methods of analyzing data and in deciding on an appropriate size of experiment.  It is quite irrelevant in the actual analysis of data.”

David Cox, The Planning of Experiments 161 (1958)

ARTICLES

Cummings & Rivara (2003)

“Reporting of power calculations makes little sense once the study has been done.  We think that reviewers who request such calculations are misguided.”

* * *

“Point estimates and confidence intervals tell us more than any power calculations about the range of results that are compatible with the data.”

Cummings & Rivara, “Reporting statistical information in medical journal articles,” 157 Arch. Pediatric Adolesc. Med. 321, 322 (2003)

Senn (2002)

“Power is of no relevance in interpreting a completed study.”

* * *

“The definition of a medical statistician is one who will not accept that Columbus discovered America because he said he was looking for India in the trial plan.  Columbus made an error in his power calculation – he relied on an estimate of the size of the Earth that was too small, but he made one none the less, and it turned out to have very fruitful consequences.”

Senn, “Power is indeed irrelevant in interpreting completed studies,” 325 Br. Med. J. 1304 (2002).

Hoenig & Heisey (2001)

“Once we have constructed a C.I., power calculations yield no additional insight.  It is pointless to perform power calculations for hypotheses outside of the C.I. because the data have already told us that these are unlikely values.”  (p. 22a)

Hoenig & Heisey, “The Abuse of Power:  The Pervasive Fallacy of Power Calculations for Data Analysis,” 55 Am. Statistician 19, 22 (2001)

Zumbo & Hubley (1998)

In The Statistician, published by the Royal Statistical Society, these authors roundly condemn post hoc power calculations:

“We suggest that it is nonsensical to make power calculations after a study has been conducted and a statistical decision has been made.  Instead, the focus after a study has been conducted should be on effect size . . . .”

Zumbo & Hubley, “A note on misconceptions concerning prospective and retrospective power,” 47-2 The Statistician 385 (1998)

Goodman & Berlin (1994)

Professor Steven Goodman is a professor of epidemiology at Johns Hopkins University, and the Statistical Editor for the Annals of Internal Medicine.  Interestingly, Professor Goodman appeared as an expert witness, opposite Sander Greenland, in hearings on Thimerosal.  His article, with Jesse Berlin, has been frequently cited in support of the irrelevance of post hoc power considerations:

“Power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant.”

(p. 200a, ¶1)

“Studies with low statistical power have sample sizes that are too small, producing results that have high statistical variability (low precision).  Confidence intervals are a convenient way to express that variability.”

(p. 200a, ¶2)

“Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected . . . .”

(p. 200b, top)

“Power is exclusively a pretrial concept; it is the probability of a group of possible results (namely all statistically significant outcomes) under a specified alternative hypothesis.  A study produces only one result.”

(p. 201a, ¶2)

“The perspective after the experiment differs from that before that experiment simply because the result is known.  That may seem obvious, but what is less apparent is that we cannot cross back over the divide and use pre-experiment numbers to interpret the result.  That would be like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit a lottery jackpot (the after-experiment perspective).”

(p. 201a-b)

“For interpretation of observed results, the concept of power has no place, and confidence intervals, likelihood, or Bayesian methods should be used instead.”

(p. 205)

Goodman & Berlin, “The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results,” 121 Ann. Intern. Med. 200, 200, 201, 205 (1994).
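To put Goodman and Berlin’s definition into symbols (a standard textbook formulation, not taken from their article):  for a two-sided test of a risk ratio at significance level α, with an assumed alternative risk ratio RR_A and an anticipated standard error SE for ln(RR), power ≈ Φ(|ln RR_A| / SE − z), where Φ is the standard normal distribution function and z ≈ 1.96 for α = 0.05.  Every quantity on the right-hand side must be specified before any data exist, which is precisely why power is, in their words, “exclusively a pretrial concept.”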

Smith & Bates (1992)

This article was published in the journal Epidemiology, which was founded and edited by Professor Kenneth Rothman:

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.”

“Generally, a negative study with low power will be regarded as providing little evidence against the existence of a causal association.  Often overlooked, however, is that otherwise well-conducted studies of low power can be informative:  the upper bound of the (1 – α)% confidence intervals provides a limit on the likely magnitude of any actual effect.

The purpose of this paper is to extend this argument to show that the use of traditional power calculations in causal inference (that is, after a study has been carried out) can be misleading and inferior to the use of upper confidence limits of estimates of effect.  The replacement of post-study power calculations with confidence interval estimates is not a new idea.”

(p. 449a)

* * *

“It is clear, then, that the use of the upper confidence limit conveys considerable information for the purposes of causal inference; by contrast, the power calculation can be quite misleading.”

(p. 451b)

* * *

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.  As we have demonstrated, they have little, if any, value.  We propose that, in their place, (1 – α)%  upper confidence limits be calculated.”

(p. 451b)

Smith & Bates, “Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies,” 3 Epidemiology 449-52 (1992)
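To illustrate the Smith and Bates approach with made-up numbers (not taken from their paper):  suppose a “negative” study estimates a risk ratio of 1.1 with a standard error of 0.20 on the log scale.  The one-sided 95% upper confidence limit is exp(ln 1.1 + 1.645 × 0.20), or about 1.53, so the study, despite its non-significant result, provides fair reassurance that any true effect is unlikely to be much larger than about a 50% increase in risk.  A post-study power calculation against some arbitrarily chosen alternative conveys none of that information.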

Greenland (1988)

“[T]he arbitrariness of power specification is of course absent once the data are collected, since the statistical power refers to the probability of obtaining a particular type of data.  It is thus not a property of particular data sets.  Statistical power of collected data, as the probability of heads on a coin toss that has already taken place, can, at best, meaningfully refer only to one’s ignorance of the result and loses all meaning when one examines the result.”

Greenland, “On Sample Size and Power Calculations for Studies Using Confidence Limits,” Am. J. Epidem. 236 (1988)

Simon (1986)

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

* * *

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

Rothman (1986)

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986)

Makuch & Johnson (1986)

“[the] confidence interval approach, the method we recommend for interpreting completed trials in order to judge the range of true treatment differences that is reasonably consistent with the observed data.”

Robert W. Makuch & Mary F. Johnson, “Some Issues in the Design and Interpretation of ‘Negative’ Clinical Studies,” 146 Arch. Intern. Med. 986, 986 (1986).

Detsky & Sackett (1985)

“Negative clinical trials that conclude that neither of the treatments is superior are often criticized for having enrolled too few patients.  These criticisms usually are based on formal sample size calculations that compute the number of patients required prospectively, as if the trial had not yet been carried out.  We suggest that this ‘prospective’ sample size calculation is incorrect, for once the trial is over we have ‘hard’ data from which to estimate the actual size of the treatment effect.  We can either generate confidence limits around the observed treatment effect or retrospectively compare it with the effect hypothesized before the trial.”

Detsky & Sackett, “When was a ‘negative’ clinical trial big enough?  How many patients you need depends on what you found,” 145 Arch. Intern. Med. 709 (1985).