The Washington Legal Foundation has released a Working Paper, No. 201, by Kirby Griffis, entitled “The Role of Statistical Significance in Daubert / Rule 702 Hearings,” in its Critical Legal Issues Working Paper Series (Mar. 2017) [cited below as Griffis]. I am a fan of many of the Foundation’s Working Papers (having written one some years ago), but this one gives me pause.
Griffis’s paper manages to avoid many of the common errors of lawyers writing about this topic, but it adds little to the statistics chapter in the Reference Manual on Scientific Evidence (3d ed. 2011), and it propagates some new, unfortunate misunderstandings. On the positive side, Griffis studiously avoids the transposition fallacy in defining significance probability, and he notes that multiplicity from subgroups and multiple comparisons often undermines claims of statistical significance. Griffis gets both points right. Committing the transposition fallacy and ignoring multiplicity are woefully common errors, and they deserve the emphasis Griffis gives them in this working paper.
On the negative side, however, Griffis falls into error on several points. Griffis helpfully narrates the Supreme Court’s evolution in Daubert and then in Joiner, but he fails to address the serious mischief and devolution introduced by the Court’s opinion in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27, 131 S.Ct. 1309 (2011). See Schachtman, “The Matrixx – A Comedy of Errors” (April 6, 2011); David Kaye, “Trapped in the Matrixx: The U.S. Supreme Court and the Need for Statistical Significance,” BNA Product Safety & Liability Reporter 1007 (Sept. 12, 2011). With respect to statistical practice, this Working Paper is at times wide of the mark.
Although he avoids the transposition fallacy, Griffis falls into another mistake in interpreting tests of significance; he states that a non-significant result tells us that a hypothesis is “perfectly consistent with mere chance”! Griffis at 9. This is, of course, wrong, or at least seriously misleading. A failure to reject the null hypothesis does not prove the null, and so we cannot say that the “null results” in one study were perfectly consistent with chance. The test may have lacked power to detect an “effect size” of interest. Furthermore, tests of significance cannot rule out systematic bias or confounding, and that limitation alone ensures that Griffis’s interpretation is mistaken. A null result may have resulted from bias or confounding that obscured a measurable association.
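The point about power can be made concrete with a small calculation. What follows is only a sketch, under assumptions that are mine and not Griffis’s: a coin tossed 20 times, a true probability of heads of 0.6, and a two-sided exact binomial test of the null hypothesis of a fair coin at an alpha of 0.05. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(k_obs, n, p_null):
    """Exact two-sided p-value: the total null probability of every
    outcome that is no more probable than the observed one."""
    p_obs = binom_pmf(k_obs, n, p_null)
    return sum(
        binom_pmf(k, n, p_null)
        for k in range(n + 1)
        if binom_pmf(k, n, p_null) <= p_obs * (1 + 1e-9)  # tolerate rounding ties
    )


def power(n, p_null, p_true, alpha):
    """Probability that the test rejects the null when p_true is the truth."""
    rejection_region = [
        k for k in range(n + 1) if two_sided_p_value(k, n, p_null) <= alpha
    ]
    return sum(binom_pmf(k, n, p_true) for k in rejection_region)


# Illustrative assumptions only: 20 tosses, and a coin that truly
# lands heads 60% of the time.
n, alpha = 20, 0.05
size = power(n, 0.5, 0.5, alpha)         # false-alarm rate with a fair coin
power_vs_06 = power(n, 0.5, 0.6, alpha)  # chance of detecting the biased coin

print(f"actual size of the test: {size:.3f}")
print(f"power against p = 0.6:   {power_vs_06:.3f}")
```

On these assumed numbers, the test rejects the null only a small fraction of the time even though the coin really is biased, so a “non-significant” result would be the expected outcome under the alternative as well as under the null.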
Griffis states that p-values are expressed as percentages “usually 95% or 99%, corresponding to 0.05 or 0.01,” but this states things backwards. Griffis at 10. The p-value threshold that is pre-specified as “significant” is a probability or percentage that is low; it is the coefficient of confidence, used to construct a confidence interval, that is the complement of the significance probability. An alpha, or pre-specified statistical significance level, of 5% thus corresponds to a coefficient of confidence of 95% (or 1.0 – 0.05).
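The complement relationship is simple arithmetic, and a few lines make it explicit. This sketch assumes a two-sided test based on the standard normal distribution; the helper names are hypothetical.

```python
from statistics import NormalDist


def confidence_coefficient(alpha):
    """Coefficient of confidence that is the complement of alpha."""
    return 1.0 - alpha


def z_critical(alpha):
    """Two-sided critical value of the standard normal for a given alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for alpha in (0.05, 0.01):
    print(
        f"alpha = {alpha:.2f} -> confidence = {confidence_coefficient(alpha):.0%}, "
        f"critical z = {z_critical(alpha):.3f}"
    )
```

The small number (0.05 or 0.01) is the significance level; the large number (95% or 99%) belongs to the confidence interval.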
The Mid-p Controversy
In discussing the emerging case law, Griffis rightly points to cases that chastise Dr. Nicholas Jewell for the many liberties he has taken in various litigations as an expert witness for the lawsuit industry. One instance cited by Griffis is the Lipitor diabetes litigation, where the MDL court suggested that Jewell switched improperly from a Fisher’s exact test to a mid-p test. Griffis at 18-19. Griffis seems to agree, but as I have explained elsewhere, Fisher’s exact test generates a one-tailed measure of significance probability, and the analyst is left to choose among several ways of calculating a two-tailed test. See “Lipitor Diabetes MDL’s Inexact Analysis of Fisher’s Exact Test” (April 21, 2016). The mid-p is one legitimate approach for asymmetric distributions, and is more favorable to the defense than passing off the one-tailed measure as the result of the test. The mere fact that a statistical software package does not automatically specify the mid-p for a Fisher’s exact analysis does not make invoking this measure into p-hacking or other misconduct. Doubling the attained significance probability of a particular Fisher’s exact test result is generally considered less accurate than a mid-p calculation, even though some software packages use doubling of the attained significance probability as a default. As much as we might dislike bailing Jewell out of Daubert limbo, on this one, limited point, he deserved a better hearing.
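To see how the competing two-tailed conventions can diverge, consider a sketch that computes each from the hypergeometric distribution underlying Fisher’s exact test. The 2x2 table is invented for illustration and has nothing to do with the Lipitor data; the function names are hypothetical; and the two-sided mid-p shown here (twice the one-sided mid-p) is only one of several conventions in use.

```python
from math import comb


def hypergeom_pmf(k, row1, row2, col1):
    """P(top-left cell = k) under the null, with all margins held fixed."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)


def fisher_summaries(a, b, c, d):
    """Several significance probabilities for the 2x2 table [[a, b], [c, d]],
    all computed in the direction of an excess in the top-left cell."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    pmf = {k: hypergeom_pmf(k, row1, row2, col1) for k in range(lo, hi + 1)}

    p_obs = pmf[a]
    upper_tail = sum(p for k, p in pmf.items() if k >= a)
    mid_p_one_sided = upper_tail - 0.5 * p_obs  # count the observed table at half weight
    return {
        "one_tailed": upper_tail,
        "mid_p_two_sided": min(1.0, 2 * mid_p_one_sided),
        "doubled": min(1.0, 2 * upper_tail),
        # Alternative two-sided rule: add up every table no more probable
        # than the observed one (tolerance guards against rounding ties).
        "sum_small_probs": sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9)),
    }


# Hypothetical data: 8 events among 20 exposed, 2 events among 30 unexposed.
results = fisher_summaries(8, 12, 2, 28)
for name, value in results.items():
    print(f"{name:>16}: {value:.4f}")
```

Under this convention, the two-sided mid-p always falls between the one-tailed value and its doubling, which is consistent with the observation that the mid-p is more favorable to the defense than the bare one-tailed figure while being less conservative than doubling.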
In recounting the Bendectin litigation, Griffis refers to the epidemiologic studies of birth defects and Bendectin as “experiments,” Griffis at 7, and then describes such studies as comparing “populations,” when he clearly meant “samples.” Griffis at 8.
Griffis conflates personal bias with bias as a scientific concept of systematic error in research, a confusion usually perpetuated by plaintiffs’ counsel. See Griffis at 9 (“Coins are not the only things that can be biased: scientists can be, too, as can their experimental subjects, their hypotheses, and their manipulations of the data.”). Of course, the term has multiple connotations, but too often an accusation of personal bias, such as conflict of interest, is used to avoid engaging with the merits of a study.
Griffis correctly describes the measure known as “relative risk” as a determination of “the strength of a particular association.” Griffis at 10. The discussion then lapses into using a given relative risk as a measure of the likelihood that an individual with the exposure studied will develop the disease. Sometimes this general-to-specific inference is warranted, but without further analysis, it is impossible to tell whether Griffis lapsed from general to specific, deliberately or inadvertently, in describing the interpretation of relative risk.
Griffis is right in his chief contention that the proper planning, conduct, and interpretation of statistical tests are hugely important to judicial gatekeeping of some expert witness opinion testimony under Federal Rule of Evidence 702 (and under Rule 703, too). Judicial and lawyer aptitude in this area is low, and needs to be bolstered.