How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance? Inconsistently and at times incoherently.
Professor Berger’s Introduction
In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:
“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”
Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).
This seems rather backwards. Berger’s observation that inconclusive studies do not prove a lack of causation is nothing more than a tautology. And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? The argument is an obvious non sequitur, or perhaps a passage badly in need of an editor.
Berger’s citations in support are curiously inaccurate. Footnote 62 cites the Cook case:
“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”
The expert witness in Cook, Dr. Clapp, did rely upon his own study, which did not obtain a statistically significant result, but contrary to Berger’s parenthetical, the trial court did not exclude his testimony: it denied the Rule 702 challenge to Clapp and permitted him to testify about a statistically non-significant ecological study.
Footnote 63 is no better:
“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”
With respect to statistical significance, the Viagra case centered on the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis. The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies. When the defendant later brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself and granted the motion to exclude McGwin’s testimony. In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009). Berger’s characterization of the review is incorrect, and her failure to cite the subsequent procedural history is disturbing.
Chapter on Statistics
The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction. The authors carefully describe significance probability and p-values, and explain:
“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”
David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (2011). Although the chapter confuses and conflates the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.
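To make the quoted passage concrete, here is a minimal sketch in Python; the counts are invented for illustration, and scipy’s binomtest is simply one convenient way to compute an exact p-value, not anything endorsed by the Reference Manual:

    # A minimal illustration of the p-value / significance-testing machinery that
    # Kaye and Freedman describe. The study counts are hypothetical.
    from scipy.stats import binomtest

    # Null hypothesis: the true event rate is 0.5; we observe 60 events in 100 trials.
    result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")
    print(f"p-value = {result.pvalue:.4f}")   # probability of data at least this extreme,
                                              # computed on the assumption the null is true
    alpha = 0.05                              # the conventional significance level
    print("statistically significant" if result.pvalue < alpha
          else "not statistically significant")

With these invented counts the p-value comes out just above 0.05, a “near miss” that previews the cutoff problem discussed later in this post.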
Even so, Kaye and Freedman do offer an important caveat about the untoward consequences of treating significance testing as a dichotomous decision procedure:
“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.
There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”
Id. at 256-57. This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
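Their warning about multiple looks at the data is easy to demonstrate. The following sketch generates pure simulated noise (the number of comparisons and group sizes are invented) and shows that a diligent enough search will almost always turn up nominally “significant” associations, which a simple Bonferroni adjustment, one of the corrective methods the authors allude to, then deflates:

    # Illustration of the multiple-testing artifact described by Kaye and Freedman:
    # pure noise, searched hard enough, yields "significant" p-values.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(seed=1)
    n_comparisons = 100
    alpha = 0.05

    p_values = []
    for _ in range(n_comparisons):
        exposed = rng.normal(size=50)      # no real difference between the groups
        unexposed = rng.normal(size=50)
        p_values.append(ttest_ind(exposed, unexposed).pvalue)

    naive_hits = sum(p < alpha for p in p_values)
    bonferroni_hits = sum(p < alpha / n_comparisons for p in p_values)
    print(f"nominally significant: {naive_hits} of {n_comparisons}")   # about 5 expected by chance
    print(f"significant after Bonferroni correction: {bonferroni_hits}")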
Chapter on Multiple Regression
The chapter on multiple regression does not add much to the earlier and later discussions. The author asks rhetorically what the appropriate level of statistical significance is, and answers:
“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47”
Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.
Chapter on Epidemiology
The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.
“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”
Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573. The suggestion that a statistically significant study has results unlikely to be the result of random error probably crosses the line into the transpositional fallacy so nicely described and warned against in the chapter on statistics. The problem is that “results” is ambiguous: it may refer to data at least as extreme as those observed, or to the point estimate of the mean or proportion in the sample. Furthermore, the chapter’s statement omits the conditional nature of the probability, which depends upon the assumption that the null hypothesis is correct.
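As for the chapter’s description of confidence intervals, a minimal sketch with invented counts shows what the “range (interval)” amounts to in practice; the calculation below uses the standard large-sample approximation on the log scale, not any method specific to the Reference Manual:

    # A 95% confidence interval for a relative risk, computed with the usual
    # large-sample approximation on the log scale. The counts are hypothetical.
    import math

    cases_exposed, total_exposed = 30, 1000
    cases_unexposed, total_unexposed = 20, 1000

    rr = (cases_exposed / total_exposed) / (cases_unexposed / total_unexposed)
    se_log_rr = math.sqrt(1 / cases_exposed - 1 / total_exposed
                          + 1 / cases_unexposed - 1 / total_unexposed)
    z = 1.96   # two-sided 95% critical value from the standard normal distribution
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
    # If the interval excludes 1.0, the result is statistically significant at the
    # 0.05 level; either way, the interval displays the range of effect sizes
    # reasonably compatible with the data, which a bare p-value does not.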
The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges, who are likely to take “arbitrary” to mean “I will get reversed.” The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting of an age of majority or a voting age is arbitrary. Some young adults, say 17.8 years old, may be better educated, more engaged in politics, and better informed about current events than some 35-year-olds, but the law must set a cutoff. Two-year-olds are demonstrably unfit, and 82-year-olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the observed disparity had a p-value of 0.15.
The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:
“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”
Id. at 573 n.68. Actually, this confusion is not understandable at all, other than as evidence that the cited courts badly misunderstood significance probability and significance testing. The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance. See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).
The epidemiology chapter also chastises courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct:
“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).”
Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”
Id. at 577 n.81. The footnote goes on to explain the difference between the alpha probability and the burden-of-proof probability, but incorrectly asserts that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.” Id. The significance probability does not address the probability that the observed statistic is the result of random chance; rather, it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true. Kaye and Freedman’s chapter on statistics does much better at describing, and avoiding, the transpositional fallacy when describing p-values.
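The distinction matters in practice. A short arithmetic sketch, applying Bayes’ theorem with invented figures for the prior probability and the studies’ power, shows that a result “significant” at the 0.05 level can nonetheless leave the null hypothesis quite probable:

    # A sketch of why a p-value is not the probability that the null hypothesis
    # is true. The prior probability and the test's power are invented numbers
    # chosen only to illustrate the arithmetic of Bayes' theorem.
    prior_null = 0.90          # assumed prior probability that there is no effect
    alpha = 0.05               # P(significant result | null hypothesis is true)
    power = 0.80               # P(significant result | null hypothesis is false)

    p_significant = alpha * prior_null + power * (1 - prior_null)
    posterior_null = (alpha * prior_null) / p_significant
    print(f"P(null hypothesis true | 'significant' result) = {posterior_null:.2f}")
    # About 0.36 under these assumptions: a "significant" finding, by itself,
    # leaves a substantial probability that the null hypothesis is still true.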
When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct. What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.
The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:
“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87
Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”
Id. at 578-79. Mostly true, but again rather unhelpful to judges and lawyers. The authors divide the world into “strict” testers and those critical of “strict” testing. Where is the boundary? Does criticism of “strict” testing imply an embrace of “non-strict” testing, or of no testing at all? I can sympathize with a judge who permits reliance upon a series of studies that all run in the same direction, each with a confidence interval that just misses excluding the null hypothesis. Meta-analysis in such a situation might not just ameliorate concerns about random error; it might eliminate them. But criticism of strict testing certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008). The chapter’s reference to confidence intervals is correct as far as it goes: confidence intervals permit a more refined assessment because they display the extent of random error in terms of the magnitude of association, alongside the point estimate obtained from the sample. Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
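The meta-analysis point can be made concrete. In the following sketch, three invented studies each report a relative risk whose 95% confidence interval just fails to exclude 1.0; a simple fixed-effect (inverse-variance) pooling of the three nonetheless yields a combined interval that does exclude 1.0:

    # A fixed-effect (inverse-variance) meta-analysis sketch. The three studies,
    # each with a confidence interval that just fails to exclude 1.0, are invented.
    import math

    studies = [           # (relative risk, lower 95% limit, upper 95% limit)
        (1.4, 0.95, 2.06),
        (1.3, 0.92, 1.84),
        (1.5, 0.98, 2.30),
    ]

    weights, weighted_logs = [], []
    for rr, lo, hi in studies:
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # recover SE from the CI
        w = 1 / se ** 2                                   # inverse-variance weight
        weights.append(w)
        weighted_logs.append(w * math.log(rr))

    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    lower = math.exp(pooled_log - 1.96 * pooled_se)
    upper = math.exp(pooled_log + 1.96 * pooled_se)
    print(f"pooled RR = {math.exp(pooled_log):.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
    # With these invented inputs the pooled interval excludes 1.0 even though no
    # single study did, illustrating how pooling reduces random error.

The sketch is, of course, only an illustration of why consistent but individually non-significant studies may collectively carry evidentiary weight; it says nothing about bias or confounding, which pooling cannot cure.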
In the final analysis, the epidemiology chapter is unclear and imprecise. I believe it confuses matters more than it clarifies. There is clearly room for improvement in the Fourth Edition.