Statistical Significance Test Anxiety

Although lawyers are known as a querulous lot, statisticians may not be far behind. The statistician John Wilder Tukey famously remarked that the collective noun for the statistical profession should be a “quarrel” of statisticians.[1]

Recently, philosopher Deborah Mayo, who has written insightfully about the “statistics wars,”[2] published an important article that addressed an attempt by some officers of the American Statistical Association (ASA) to pass off their personal views of statistical significance testing as views of the ASA.[3] This attempt took not only the form of an editorial over the name of the Executive Director, without a disclaimer, but also an email campaign to push journal editors to abandon statistical significance testing. Professor Mayo’s recent article explores the interesting concept of intellectual conflicts of interest arising from journal editors and association leaders who use their positions to advance their personal views. As discussed in some of my own posts, the conflict of interest led another ASA officer to appoint a Task Force on statistical significance testing, whose statement has now, finally, been published in multiple fora.

Last week, on January 11, 2022, Professor Mayo convened a Zoom forum, “Statistical Significance Test Anxiety,” moderated by David Hand, at which she and Yoav Benjamini, an author of the ASA President’s Task Force statement, presented. About 70 statisticians and scientists from around the world attended.

Professor Mayo has hosted several commentaries on her editorial in Conservation Biology, including guest blog posts from:

Brian Dennis
Philip Stark
Kent Staley
Yudi Pawitan
Christian Hennig
Ionides and Ritov
Brian Haig
Daniël Lakens

and my humble post, which is set out in full, below. There are additional posts on “statistical test anxiety” coming; check Professor Mayo’s blog for additional commentaries.

     *     *     *     *     *     *     *     *     *     *     *     *     *     *     *

Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice

The metaphor of law as an “empty vessel” is frequently invoked to describe the law generally, as well as pejoratively to describe lawyers. The metaphor rings true at least in describing how the factual content of legal judgments comes from outside the law. In many varieties of litigation, not only the facts and data, but the scientific and statistical inferences must be added to the “empty vessel” to obtain a correct and meaningful outcome.

Once upon a time, the expertise component of legal judgments came from so-called expert witnesses, who were free to opine about the claims of causality solely by showing that they had more expertise than the lay jurors. In Pennsylvania, for instance, the standard for qualifying witnesses to give “expert opinions” was to show that they had “a reasonable pretense to expertise on the subject.”

In the 19th and the first half of the 20th century, causal claims, whether of personal injuries, discrimination, or whatever, virtually always turned on a conception of causation as necessary and sufficient to bring about the alleged harm. In discrimination claims, plaintiffs pointed to the “inexorable zero,” in cases in which no Black citizen was ever seated on a grand jury, in a particular county, since the demise of Reconstruction. In health claims, the mode of reasoning usually followed something like Koch’s postulates.

The second half of the 20th century was marked by the rise of stochastic models in our understanding of the world. The consequence is that statistical inference made its way into the empty vessel. The rapid introduction of statistical thinking into the law did not always go well. In a seminal discrimination case, Castaneda v. Partida, 430 U.S. 482 (1977), in an opinion by Associate Justice Blackmun, the Court calculated a binomial probability for observing the sample result (rather than a result at least as extreme as such a result), and mislabeled the measurement “standard deviations” rather than standard errors:

“As a general rule for such large samples, if the difference between the expected value and the observed number is greater than two or three standard deviations, then the hypothesis that the jury drawing was random would be suspect to a social scientist. The 11-year data here reflect a difference between the expected and observed number of Mexican-Americans of approximately 29 standard deviations. A detailed calculation reveals that the likelihood that such a substantial departure from the expected value would occur by chance is less than 1 in 10^140.”

Id. at 430 U.S. 482, 496 n.17 (1977). Justice Blackmun was graduated from Harvard College, summa cum laude, with a major in mathematics.
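The arithmetic behind the quoted footnote can be sketched in a few lines of Python. This is only an illustration, not the Court's own calculation: the three input figures (870 persons summoned, 339 of them Mexican-American, and a 79.1% population share) are taken from the opinion's footnote rather than from this post, and the function names are hypothetical. The sketch computes the difference in standard-deviation units, and then contrasts the probability of the exact sample result with the probability of a result at least as extreme, working in logarithms because the numbers are far too small for ordinary floating point.

```python
import math

# Inputs as reported in the opinion's footnote 17 (not restated in this post);
# they are assumptions of this sketch.
N_SUMMONED = 870       # persons summoned as grand jurors over the 11 years
N_OBSERVED = 339       # of those, Mexican-American
P_POPULATION = 0.791   # Mexican-American share of the county population


def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log10_binom_lower_tail(k, n, p):
    """Base-10 log of P(X <= k), summed stably with the log-sum-exp trick."""
    logs = [log_binom_pmf(i, n, p) for i in range(k + 1)]
    biggest = max(logs)
    total = sum(math.exp(x - biggest) for x in logs)
    return (biggest + math.log(total)) / math.log(10)


expected = N_SUMMONED * P_POPULATION
std_dev = math.sqrt(N_SUMMONED * P_POPULATION * (1 - P_POPULATION))
z = (expected - N_OBSERVED) / std_dev

# Probability of exactly the observed count (what the post says was computed).
log10_point = log_binom_pmf(N_OBSERVED, N_SUMMONED, P_POPULATION) / math.log(10)
# Probability of a count at least as low as the observed one (the p-value).
log10_tail = log10_binom_lower_tail(N_OBSERVED, N_SUMMONED, P_POPULATION)

print(f"expected count: {expected:.0f}, standard deviation: {std_dev:.1f}")
print(f"difference in standard deviations: {z:.1f}")
print(f"log10 P(X = {N_OBSERVED}): {log10_point:.1f}")
print(f"log10 P(X <= {N_OBSERVED}): {log10_tail:.1f}")
```

With a disparity this large, the point probability and the tail probability are both vanishingly small, so the distinction made no practical difference in this case; it matters a great deal when the disparity is closer to the “two or three” standard deviations of the Court's general rule.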

Despite the extreme statistical disparity in the 11-year run of grand juries, Justice Blackmun’s opinion provoked a robust rejoinder, not only on the statistical analysis, but on the Court’s failure to account for obvious omitted confounding variables in its simplistic analysis. And then there were the inconvenient facts that Mr. Partida was a rapist, indicted by a grand jury (50% with “Hispanic” names), which was appointed by jury commissioners (3/5 Hispanic). Partida was convicted by a petit jury (7/12 Hispanic), in front of a trial judge who was Hispanic, and he was denied a writ of habeas corpus by Judge Garza, who went on to be a member of the Court of Appeals. In any event, Justice Blackmun’s dictum about “two or three” standard deviations soon shaped the outcome of many thousands of discrimination cases, and was translated into a necessary p-value of 5%.

Beginning in the early 1960s, statistical inference became an important feature of tort cases that involved claims based upon epidemiologic evidence. In such health-effects litigation, the judicial handling of concepts such as p-values and confidence intervals often went off the rails.  In 1989, the United States Court of Appeals for the Fifth Circuit resolved an appeal involving expert witnesses who relied upon epidemiologic studies by concluding that it did not have to resolve questions of bias and confounding because the studies relied upon had presented their results with confidence intervals.[4] Judges and expert witnesses persistently interpreted single confidence intervals from one study as having a 95 percent probability of containing the actual parameter.[5] Similarly, many courts and counsel committed the transposition fallacy in interpreting p-values as posterior probabilities for the null hypothesis.[6]
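The confidence-interval error described above can be made concrete with a small simulation. What follows is a minimal sketch under stated assumptions, not anything drawn from the cited cases: it assumes normally distributed data with a known standard deviation, and the function name, sample size, number of trials, and seed are all hypothetical choices. The 95 percent figure describes how often the interval-generating procedure captures the true parameter over many repeated samples; any single computed interval either contains the parameter or it does not.

```python
import random


def coverage_rate(true_mean=0.0, sigma=1.0, n=25, trials=2000, seed=12345):
    """Share of simulated 95% intervals that contain the true mean.

    Assumes normal data with a known sigma, so each interval is
    sample mean +/- 1.96 * sigma / sqrt(n).
    """
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    half_width = 1.959964 * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        # This particular interval either covers the true mean or it does not.
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


rate = coverage_rate()
print(f"share of intervals containing the true mean: {rate:.3f}")
```

The share printed should land near 0.95, which is a statement about the long-run performance of the method, not a probability that attaches to the one interval reported in a given study.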

Against this backdrop of mistaken and misrepresented interpretation of p-values, the American Statistical Association’s p-value statement was a helpful and understandable restatement of basic principles.[7] Within a few weeks, however, citations to the p-value Statement started to show up in the briefs and examinations of expert witnesses, to support contentions that p-values (or any procedure to evaluate random error) were unimportant, and should be disregarded.[8]

In 2019, Ronald Wasserstein, the ASA executive director, along with two other authors wrote an editorial, which explicitly called for the abandonment of using “statistical significance.”[9] Although the piece was labeled “editorial,” the journal provided no disclaimer that Wasserstein was not speaking ex cathedra.

The absence of a disclaimer provoked a great deal of confusion. Indeed, Brian Tarran, the editor of Significance, published jointly by the ASA and the Royal Statistical Society, wrote an editorial interpreting the Wasserstein editorial as an official ASA “recommendation.” Tarran ultimately retracted his interpretation, but only in response to a pointed letter to the editor.[10] Tarran adverted to a misleading press release from the ASA as the source of his confusion. Inquiring minds might wonder why the ASA allowed such a press release to go out.

In addition to press releases, some people in the ASA started to send emails to journal editors, to nudge them to abandon statistical significance testing on the basis of what seemed like an ASA recommendation. For the most part, this campaign was unsuccessful in the major biomedical journals.[11]

While this controversy was unfolding, then President Karen Kafadar of the ASA stepped into the breach to state definitively that the Executive Director was not speaking for the ASA.[12]  In November 2019, the ASA board of directors approved a motion to create a “Task Force on Statistical Significance and Replicability.”[8] Its charge was “to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA Board.”

Professor Mayo’s editorial has done the world of statistics, as well as the legal world of judges, lawyers, and legal scholars, a service in calling attention to the peculiar intellectual conflicts of interest that played a role in the editorial excesses of some of the ASA’s leadership. From a lawyer’s perspective, it is clear that courts have been misled and distracted by some of the ASA officials who seem to have worked to undermine a consensus position paper on p-values.[13]

Curiously, the Task Force’s report did not find a home in any of the ASA’s several scholarly publications. Instead, “The ASA President’s Task Force Statement on Statistical Significance and Replicability”[14] appeared in The Annals of Applied Statistics, where it is accompanied by an editorial by former ASA President Karen Kafadar.[15] In November 2021, the ASA’s official “magazine,” Chance, also published the Task Force’s Statement.[16]

Judges and litigants who must navigate claims of statistical inference need guidance on the standard of care scientists and statisticians should use in evaluating such claims. Although the Task Force did not elaborate, it advanced five basic propositions, which had been obscured by many of the recent glosses on the ASA 2016 p-value statement, and the 2019 editorial discussed above:

  1. “Capturing the uncertainty associated with statistical summaries is critical.”
  2. “Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data.”
  3. “The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.”
  4. “Thresholds are helpful when actions are required.”
  5. “P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.”

Although the Task Force’s Statement will not end the debate or the “wars,” it will go a long way to correct the contentions made in court about the insignificance of significance testing, while giving courts a truer sense of the professional standard of care with respect to statistical inference in evaluating claims of health effects.


[1] David R. Brillinger, “. . . how wonderful the field of statistics is. . . ,” Chap. 4, 41, 44, in Xihong Lin, et al., eds., Past, Present, and Future of Statistical Science (2014).

[2] Deborah Mayo, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018).

[3] Deborah Mayo, “The Statistics Wars and Intellectual Conflicts of Interest,” Conservation Biology (2021) (in press).

[4] Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989).

[5] Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004) (“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”) (Both authors testify for claimants in cases involving alleged environmental and occupational harms.); Schachtman, “Confidence in Intervals and Diffidence in the Courts” (Mar. 4, 2012) (collecting numerous examples of judicial offenders).

[6] See, e.g., In re Ephedra Prods. Liab. Litig., 393 F.Supp. 2d 181, 191, 193 (S.D.N.Y. 2005) (Rakoff, J.) (credulously accepting counsel’s argument that the use of a critical value of less than 5% of significance probability increased the “more likely than not” burden of proof upon a civil litigant). The decision has been criticized in the scholarly literature, but it is still widely cited without acknowledging its error. See Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 65 (2009).

[7] Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The Am. Statistician 129 (2016); see “The American Statistical Association’s Statement on and of Significance” (March 17, 2016). The commentary beyond the “bold faced” principles was at times less helpful in suggesting that there was something inherently inadequate in using p-values. With the benefit of hindsight, this commentary appears to represent editorializing by the authors, and not the sense of the expert committee that agreed to the six principles.

[8] Schachtman, “The American Statistical Association Statement on Significance Testing Goes to Court, Part I” (Nov. 13, 2018), “Part II” (Mar. 7, 2019).

[9] Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 Am. Statistician S1, S2 (2019); see Schachtman, “Has the American Statistical Association Gone Post-Modern?” (Mar. 24, 2019).

[10] Brian Tarran, “THE S WORD … and what to do about it,” Significance (Aug. 2019); Donald Macnaughton, “Who Said What,” Significance 47 (Oct. 2019).

[11] See, e.g., David Harrington, Ralph B. D’Agostino, Sr., Constantine Gatsonis, Joseph W. Hogan, David J. Hunter, Sharon-Lise T. Normand, Jeffrey M. Drazen, and Mary Beth Hamel, “New Guidelines for Statistical Reporting in the Journal,” 381 New Engl. J. Med. 285 (2019); Jonathan A. Cook, Dean A. Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L. Korn, and Colin B. Begg, “There is still a place for significance testing in clinical trials,” 16 Clin. Trials 223 (2019).

[12] Karen Kafadar, “The Year in Review … And More to Come,” AmStat News 3 (Dec. 2019); see also Kafadar, “Statistics & Unintended Consequences,” AmStat News 3,4 (June 2019).

[13] Deborah Mayo, “The statistics wars and intellectual conflicts of interest,” 36 Conservation Biology (2022) (in-press, online Dec. 2021).

[14] Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xuming He, Xiao-Li Meng, Nancy Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young, and Karen Kafadar, “The ASA President’s Task Force Statement on Statistical Significance and Replicability,” 15 Annals of Applied Statistics (2021) (in press).

[15] Karen Kafadar, “Editorial: Statistical Significance, P-Values, and Replicability,” 15 Annals of Applied Statistics (2021).

[16] Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry I. Graubard, Xuming He, Xiao-Li Meng, Nancy M. Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young & Karen Kafadar, “ASA President’s Task Force Statement on Statistical Significance and Replicability,” 34 Chance 10 (2021).