P-Values: Pernicious or Perspicacious?

Professor Kingsley R. Browne, of the Wayne State University Law School, recently published a paper that criticizes the use of p-values and significance testing in discrimination litigation. Kingsley R. Browne, “Pernicious P-Values: Statistical Proof of Not Very Much,” 42 Univ. Dayton L. Rev. 113 (2017) (cited below as Browne). Browne amply documents the obvious and undeniable: that judges, lawyers, and even some ill-trained expert witnesses are congenitally unable to describe and interpret p-values properly. Most of Browne’s examples come from the world of anti-discrimination law, but he cites a few from health effects litigation as well. Browne also draws upon many of the criticisms of p-values in the psychology and other social science literature.

Browne’s efforts to correct judicial innumeracy are welcome, but they take a peculiar turn in this law review article. From the well-known state of affairs of widespread judicial refusal or inability to discuss statistical concepts accurately, Browne argues for what seem to be two incongruous, inconsistent responses. Rejecting the glib suggestion of former Judge Posner that evidence law is not “fussy” about evidence, Browne argues that federal evidence law requires courts to be “fussy” about evidence, and that Rule 702 requires courts to exclude expert witnesses whose opinions fail to “employ[] in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.” Browne at 143 (quoting Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999)). Browne tells us, with apparently appropriate intellectual rigor, that “[i]f a disparity that does not provide a p-value of less than 0.05 would not be accepted as meaningful in the expert’s discipline, it is not clear that the expert should be allowed to testify – on the basis of his expertise in that discipline – that the disparity is, in fact, meaningful.” Id.

In a volte-face, Browne then argues that p-values do “not tell us much,” basically because they depend upon sample size. Browne suggests that the quantitative disparity between expected value and observed proportion or average can be assessed without the use of p-values, and that reporting a p-value “adds virtually nothing and just muddies the water.” Id. at 152. The prevalent confusion among judges and lawyers seems sufficient, in Browne’s view, to justify his proposal, as well as his further suggestion that Rule 403 should be invoked to exclude p-values:

The ease with which reported p-values cause a trier of fact to slip into the transposition fallacy and the difficulty of avoiding that lapse of logic, coupled with the relatively sparse information actually provided by the p-value, make p-values prime candidates for exclusion under Federal Rule of Evidence 403. *** If judges, not to mention the statistical experts they rely on, cannot use the information without falling into fallacious reasoning, the likelihood that the jury will misunderstand the evidence is very high. Since the p-value actually provides little useful relevant information, the high risk of misleading the jury greatly exceeds its scant probative value, so it simply should not be presented to the jury.

Id. at 152-53.
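The transposition fallacy that worries Browne is easy enough to state: a p-value gives the probability of data at least as extreme as those observed, assuming the null hypothesis is true; it is not the probability that the null hypothesis is true, given the data. A short simulation sketch makes the gap concrete; the prior on the null, the alternative selection rate, and the sample sizes below are all hypothetical assumptions for illustration, not anything drawn from Browne’s article:

```python
# Sketch: the p-value is P(data | H0), not P(H0 | data). Hypothetical numbers.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)
n_cases = 2_000        # hypothetical "cases" to simulate
prior_null = 0.9       # assumed share of cases in which the null (no disparity) is true
n_obs = 500            # selections observed per case
p_null, p_alt = 0.15, 0.12   # true selection rates under the null and the alternative

null_is_true = rng.random(n_cases) < prior_null
true_rate = np.where(null_is_true, p_null, p_alt)
counts = rng.binomial(n_obs, true_rate)

pvals = np.array([binomtest(int(k), n_obs, p_null).pvalue for k in counts])
significant = pvals < 0.05

# Among the "statistically significant" cases, how often is the null actually true?
print(f"P(H0 true | p < 0.05) is roughly {null_is_true[significant].mean():.2f}")
```

In a configuration like this one, something approaching half of the “significant” results come from cases in which the null hypothesis is in fact true, a far cry from the five percent figure that a transposing reader would infer.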

And yet, elsewhere in the same article, Browne ridicules one court and several expert witnesses who argued in favor of conclusions based upon p-values up to 50%.[1] The concept of p-values cannot be so flexible as to straddle the extremes of having no probative value and yet being capable of rendering an expert witness’s opinions ludicrous. P-values quantify an estimate of random error, even if that estimate varies with sample size. To be sure, the measure of random error depends upon the specified model and the assumption of a null hypothesis, but the crucial point is that an estimate (whether a mean, proportion, risk ratio, risk difference, etc.) is rather meaningless without some accompanying measure of its random variability. Of course, random error is not the only type of error, but the existence of other, systematic errors is hardly a reason to ignore random error.
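Browne’s sample-size point, moreover, can be conceded and illustrated without banishing p-values. A minimal sketch, with hypothetical numbers, shows the same observed disparity moving from “not significant” to highly “significant” as the sample grows:

```python
# Sketch: an identical disparity (13% observed vs. 15% expected) at three
# hypothetical sample sizes. Only n changes; the p-value changes dramatically.
from scipy.stats import binomtest

expected_rate = 0.15
for n in (200, 2_000, 20_000):
    observed = round(0.13 * n)   # same 13% observed rate at every n
    p = binomtest(observed, n, expected_rate).pvalue
    print(f"n={n:>6}  observed rate=0.13  p-value={p:.4f}")
```

That dependence is a reason to report the p-value alongside the size of the disparity, not a reason to suppress it.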

In the science of health effects, many applications of p-values have given way to the use of confidence intervals, which arguably provide a more direct assessment of the sample estimate, along with a range of values reasonably compatible with that estimate. Remarkably, Browne never substantively discusses confidence intervals in his article.
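A minimal sketch suggests what the omission costs; the interval below is the simple Wald (normal-approximation) interval for a proportion, and the counts are hypothetical:

```python
# Sketch: a confidence interval reports the estimate and its random
# variability together. Wald interval; hypothetical counts.
import math
from scipy.stats import norm

def wald_ci(successes: int, n: int, level: float = 0.95):
    p_hat = successes / n
    z = norm.ppf(0.5 + level / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - half_width, p_hat + half_width

# Same observed rate of 13%, very different precision:
for k, n in ((13, 100), (1_300, 10_000)):
    est, lo, hi = wald_ci(k, n)
    print(f"n={n:>6}: estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The small sample is readily compatible with an expected rate of 15%; the large one is not. The interval displays that difference directly, where a bare point estimate of 13% would conceal it.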

Under the heading of other problems with p-values and significance testing, Browne advances four additional putative flaws. First, Browne asserts, with little to no support, that “[t]he null hypothesis is unlikely a priori.” Id. at 155. He fails to tell us why the null hypothesis of no disparity is not a reasonable starting place in the absence of objective evidence supporting a prior estimate. Furthermore, a null hypothesis of no difference will have legal significance in claims of health effects or of unlawful discrimination.

Second, Browne argues that significance testing will lead to “[c]onflation of statistical and practical (or legal) significance” in the minds of judges and jurors. Id. at 156-58. This charge is difficult to sustain. Practical significance, and its separation from statistical significance, is probably the distinction that the actors in legal cases can appreciate most readily. If a large class action showed that the expected value of a minority’s proportion was 15%, and the observed proportion was 14.8%, with p < 0.05, even innumerate judges and jurors would sense that the disparity was unimportant, and that no employer would fine-tune its discriminatory activities so closely as to achieve such a meaningless difference.
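The arithmetic behind that example is easy to exhibit. In the sketch below, the workforce size is an assumption chosen for illustration; with a sample that large, the trivial 0.2 percentage-point disparity is indeed statistically significant:

```python
# Sketch: a practically meaningless 0.2-point disparity (14.8% observed vs.
# 15% expected) reaches p < 0.05 at a hypothetical, very large sample size.
from scipy.stats import binomtest

n = 200_000                   # hypothetical number of selections
observed = round(0.148 * n)   # observed minority proportion of 14.8%
result = binomtest(observed, n, 0.15)
print(f"disparity = 0.2 percentage points, p-value = {result.pvalue:.4f}")
```

The p-value answers only the question whether chance is a plausible explanation of the disparity; it does not, and cannot, say that the disparity matters.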

Third, Browne reminds us that the validity and interpretation of a p-value turn on the assumption that the statistical model is perfectly specified. Id. at 158-59. His reminder is correct, but again, this aspect of p-values (and of confidence intervals) is relatively easy to explain, as well as to defend or challenge. To be sure, there may be legitimate disputes about whether an appropriate model was used (say, binomial versus hypergeometric), but such disputes are hardly the most arcane issues that judges and jurors will face.
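A short sketch, with hypothetical hiring counts, shows the kind of model choice at stake: the binomial model treats each hire as an independent draw from an effectively unlimited pool, while the hypergeometric model treats the applicant pool as fixed and draws from it without replacement:

```python
# Sketch: the same hypothetical hiring data under two model specifications.
from scipy.stats import binomtest, hypergeom

pool = 400            # total applicants
minority = 120        # minority applicants (30% of the pool)
hires = 50            # positions filled
minority_hires = 8    # observed minority hires (expected value: 15)

# Binomial model: each hire is an independent draw with p = 0.30.
p_binom = binomtest(minority_hires, hires, minority / pool,
                    alternative="less").pvalue

# Hypergeometric model: probability of 8 or fewer minority hires when 50
# are drawn without replacement from the fixed pool of 400.
p_hyper = hypergeom.cdf(minority_hires, pool, minority, hires)

print(f"binomial (one-sided) p       = {p_binom:.4f}")
print(f"hypergeometric (one-sided) p = {p_hyper:.4f}")
```

The two models return modestly different p-values for identical data, and the choice between them turns on facts about the selection process that lawyers can develop and courts can evaluate.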

Fourth, Browne claims that “the alternative hypothesis is seldom properly specified.” Id. at 159-62. Unless analysts are focused on measuring pre-test power or type II error, however, they need not advance an alternative hypothesis. Furthermore, it is hardly a flaw of significance testing that it does not account for systematic bias or confounding.
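And when an alternative does matter, specifying one is not especially mysterious. Here is a minimal sketch of a pre-test power calculation for a one-sample proportion test, using the usual normal approximation; the null rate, the alternative rate, and the sample size are all hypothetical:

```python
# Sketch: power of a two-sided 0.05-level test of H0: p = 0.15 against a
# specified alternative p = 0.12, by the normal approximation.
import math
from scipy.stats import norm

p0, p1, n, alpha = 0.15, 0.12, 500, 0.05

se0 = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null
se1 = math.sqrt(p1 * (1 - p1) / n)   # standard error under the alternative
z = norm.ppf(1 - alpha / 2)

# Rejection region on the proportion scale, then its probability under p1.
lower, upper = p0 - z * se0, p0 + z * se0
power = norm.cdf((lower - p1) / se1) + (1 - norm.cdf((upper - p1) / se1))
print(f"power to detect p = {p1:.2f} against H0 p = {p0:.2f} at n = {n}: {power:.2f}")
```

A test with power of roughly one-half, as in this sketch, is itself useful information for a court weighing what a “non-significant” result proves.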

Browne does not offer an affirmative alternative, such as urging courts to adopt a Bayesian program; a Bayesian response to the prevalent blunders in interpreting statistical significance would perhaps introduce even more arcane and hard-to-discern blunders into court proceedings. Browne also leaves courts without any meaningful approach to evaluating random error, other than crude comparisons between two means or proportions. The recommendations in this law review article appear to be a giant step backwards, into an epistemic void.


[1] See Browne at 146, citing In re Photochromic Lens Antitrust Litig., 2014 WL 1338605 (M.D. Fla. April 3, 2014) (reversing magistrate judge’s exclusion of an expert witness who had advanced claims based upon a p-value of 0.50); id. at 147 n.116, citing In re High-Tech Employee Antitrust Litig., 2014 WL 1351040 (N.D. Cal. 2014).