Judicial Control of the Rate of Error in Expert Witness Testimony

In Daubert, the Supreme Court set out several criteria or factors for evaluating the “reliability” of expert witness opinion testimony. The third factor in the Court’s enumeration was whether the trial court had considered “the known or potential rate of error” in assessing the scientific reliability of the proffered expert witness’s opinion. Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 593 (1993). The Court, speaking through Justice Blackmun, failed to provide much guidance on the nature of the errors subject to gatekeeping, on how to quantify those errors, or on how much error is too much. Rather than provide a taxonomy of error, the Court lumped “accuracy, validity, and reliability” together with a grand pronouncement that these measures were distinguished by no more than a “hen’s kick.” Id. at 590 n.9 (citing and quoting James E. Starrs, “Frye v. United States Restructured and Revitalized: A Proposal to Amend Federal Evidence Rule 702,” 26 Jurimetrics J. 249, 256 (1986)).

The Supreme Court’s failure to elucidate its “rate of error” factor has caused a great deal of mischief in the lower courts. In practice, trial courts have rejected engineering opinions for lacking a stated error rate, as a shorthand for noting that the opinions were bereft of experimental and empirical evidentiary support[1]. For polygraph evidence, courts have used the error rate factor to obscure their policy prejudices against polygraphs, and to exclude test data even when the error rate is known, and rather low compared with what passes for expert witness opinion testimony in many other fields[2]. In the context of forensic evidence, the courts have rebuffed objections to random-match probabilities that would require such probabilities to be modified by the probability of laboratory or other error[3].

When it comes to epidemiologic and other studies that require statistical analyses, lawyers on both sides of the “v” frequently misunderstand p-values or confidence intervals as providing complete measures of error, and ignore the larger errors that result from bias, confounding, study validity (internal and external), inappropriate data synthesis, and the like[4]. Not surprisingly, parties fallaciously argue that the Daubert criterion of “rate of error” is satisfied by expert witnesses’ reliance upon studies that in turn use conventional 95% confidence intervals and measures of statistical significance (p-values below 0.05)[5].

The lawyers who embrace confidence intervals and p-values as their sole measure of error rate fail to recognize that confidence intervals and p-values assess only one kind of error: random sampling error. Given the carelessness of the Supreme Court’s use of technical terms in Daubert, and its failure to engage with the actual evidence at issue in the case, it is difficult to know whether the Court intended to suggest that random error was the error rate it had in mind[6]. The statistics chapter in the Reference Manual on Scientific Evidence helpfully points out that the inferences that can be drawn from data turn on p-values and confidence intervals, as well as on study design, data quality, and the presence or absence of systematic errors, such as bias or confounding. Reference Manual on Scientific Evidence at 240 (3d ed. 2011) [Manual]. Random errors are reflected in the size of p-values or the width of confidence intervals, but these measures of random sampling error ignore systematic errors such as confounding and study biases. Id. at 249 & n.96.
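The point is easy to see in the arithmetic itself. The short sketch below, using entirely hypothetical counts, computes a 95% confidence interval for a risk ratio from a 2×2 table; nothing in the calculation knows anything about confounding, selection, or measurement error, which is precisely why the interval cannot serve as a complete “rate of error.”

```python
# A minimal sketch with hypothetical numbers: the 95% confidence interval for a
# risk ratio is built entirely from the cell counts, so it reflects random
# sampling error only; confounding and other systematic errors appear nowhere
# in the calculation.
import math

def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio and 95% CI from a 2x2 table:
    a/b = exposed cases/noncases, c/d = unexposed cases/noncases."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical study: 40 of 200 exposed and 20 of 200 unexposed develop disease.
rr, lo, hi = risk_ratio_ci(40, 160, 20, 180)
print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# Output: RR = 2.00, 95% CI (1.21, 3.30) -- a statement about chance alone.
```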

The Manual’s chapter on epidemiology takes an even stronger stance: a p-value does not provide a rate of error, or even a probability of error, for an epidemiologic study:

“Epidemiology, however, unlike some other methodologies—fingerprint identification, for example—does not permit an assessment of its accuracy by testing with a known reference standard. A p-value provides information only about the plausibility of random error given the study result, but the true relationship between agent and outcome remains unknown. Moreover, a p-value provides no information about whether other sources of error – bias and confounding – exist and, if so, their magnitude. In short, for epidemiology, there is no way to determine a rate of error.”

Manual at 575. This stance seems not entirely justified, given that there are Bayesian approaches that would produce credibility intervals accounting for both random sampling error and systematic biases. To be sure, such approaches have their own problems, and they have received little to no attention in courtroom proceedings to date.
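For readers who want a concrete sense of what such an analysis can look like, the following sketch, with entirely hypothetical inputs, uses a simple Monte Carlo simulation in the spirit of probabilistic bias analysis: draws for random sampling error are combined with a prior on a confounding bias factor, yielding an interval that reflects both sources of uncertainty rather than chance alone.

```python
# A minimal Monte Carlo sketch, not any particular published method: combine
# random sampling error with a prior belief about confounding bias to obtain
# an interval reflecting both. All inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=1)
n_sim = 100_000

# Conventional result: observed RR of 2.0 with standard error 0.25 on the log scale.
sampling_draws = rng.normal(np.log(2.0), 0.25, n_sim)

# Prior on the bias factor from an unmeasured confounder: centered at 1.2,
# uncertain by roughly 20% on the log scale.
bias_draws = rng.normal(np.log(1.2), 0.2, n_sim)

adjusted_rr = np.exp(sampling_draws - bias_draws)
lo, mid, hi = np.percentile(adjusted_rr, [2.5, 50, 97.5])
print(f"Bias-adjusted RR ~ {mid:.2f} (95% interval {lo:.2f} to {hi:.2f})")
# The interval is wider than the conventional one because it carries the
# uncertainty about confounding as well as the random sampling error.
```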

The authors of the Manual’s epidemiology chapter, who are usually forgiving of judicial error in interpreting epidemiologic studies, point to one United States Court of Appeals case that fallaciously interpreted confidence intervals as magically quantifying bias and confounding in a Bendectin birth defects case. Id. at 575 n.96[7]. The Manual could have gone further to point out that, in the context of multiple studies, of different designs and analyses, the cognitive biases involved in evaluating, assessing, and synthesizing the studies are also ignored by statistical measures such as p-values and confidence intervals. Although the Manual notes that assessing the role of chance in producing a particular set of sample data is “often viewed as essential when making inferences from data,” the Manual never suggests that random sampling error is the only kind of error that must be assessed when interpreting data. The Daubert criterion would appear to encompass all varieties of error, not just random error.

The Manual’s suggestion that epidemiology does not permit an assessment of the accuracy of epidemiologic findings misrepresents the capabilities of modern epidemiologic methods. Courts can, and do, invoke gatekeeping approaches to weed out confounded study findings. See “Sorting Out Confounded Research – Required by Rule 702” (June 10, 2012). The “reverse Cornfield inequality” was an important analysis that helped establish the causal connection between tobacco smoke and lung cancer[8]. Olav Axelson studied and quantified the role of smoking as a confounder in epidemiologic analyses of other putative lung carcinogens.[9] Quantitative methods for identifying confounders have been widely deployed[10].
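By way of illustration only, the sketch below applies the logic of an Axelson-style indirect adjustment with made-up prevalences and rate ratios: if smoking habits differ between a study population and its referent, one can calculate how large a lung cancer rate ratio smoking alone could produce, and thus how much of an observed excess may be confounding rather than the exposure under study.

```python
# A minimal sketch of an Axelson-style indirect adjustment for smoking; all
# prevalences and rate ratios below are hypothetical, chosen only to show the
# arithmetic of the method.
def confounding_rr(prev_study, prev_reference, rr_by_category):
    """Rate ratio attributable solely to differing smoking prevalences."""
    num = sum(p * rr for p, rr in zip(prev_study, rr_by_category))
    den = sum(p * rr for p, rr in zip(prev_reference, rr_by_category))
    return num / den

# Smoking categories: never, moderate, heavy; lung cancer rate ratios 1, 10, 20.
rr_cat = [1.0, 10.0, 20.0]
study_prev = [0.4, 0.4, 0.2]       # hypothetical exposed cohort
reference_prev = [0.5, 0.4, 0.1]   # hypothetical general population

print(f"RR from smoking confounding alone: "
      f"{confounding_rr(study_prev, reference_prev, rr_cat):.2f}")
# An observed rate ratio near this value could be explained by smoking
# differences alone, without any effect of the exposure under study.
```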

A recent study in birth defects epidemiology demonstrates the power of sibling cohorts in addressing the problem of residual confounding in observational population studies with limited information about confounding variables. Researchers looking at various birth defect outcomes among offspring of women who used certain antidepressants in early pregnancy generally found no associations in pooled data from Iceland, Norway, Sweden, Finland, and Denmark. One putative association did appear in the overall analysis: first trimester maternal exposure to selective serotonin reuptake inhibitors was associated with a specific cardiac defect (right ventricular outflow tract obstruction, or RVOTO), with an adjusted odds ratio of 1.48 (95% C.I., 1.15, 1.89). When the analysis was limited to the sibling subcohort, however, the adjusted odds ratio fell to 0.56 (95% C.I., 0.21, 1.49)[11]. This study and many others show how creative analyses can elucidate and quantify the direction and magnitude of confounding effects in observational epidemiology.
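For readers who wish to check such figures, the reported intervals can be unpacked into log-scale standard errors with a few lines of arithmetic; the sketch below simply does that for the two estimates quoted above.

```python
# A small sketch of how to work backward from a reported odds ratio and 95%
# confidence interval to the implied log-scale standard error; the figures are
# the two estimates from the study described above.
import math

def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error of the log odds ratio implied by a 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

estimates = {
    "pooled cohort":     (1.48, 1.15, 1.89),
    "sibling subcohort": (0.56, 0.21, 1.49),
}

for label, (odds_ratio, lo, hi) in estimates.items():
    print(f"{label}: log OR = {math.log(odds_ratio):+.2f}, SE ~ {se_from_ci(lo, hi):.2f}")
# The sibling interval spans 1.0: no detectable association once family-level
# confounders are held constant, at the price of far less precision.
```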

Systematic bias has also begun to succumb to more quantitative approaches. A recent guidance paper by well-known authors encourages the use of quantitative bias analysis to provide estimates of uncertainty due to systematic errors[12].
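A simple, non-probabilistic version of the idea can be stated in a few lines. The sketch below, with every input hypothetical, computes the familiar confounding bias factor for an unmeasured binary confounder and divides it out of an observed risk ratio; the guidance paper cited below discusses far more elaborate, probabilistic variants.

```python
# A minimal sketch of a simple bias analysis for an unmeasured binary
# confounder; all parameter values are hypothetical, chosen only to
# illustrate the arithmetic.
def bias_factor(rr_cd, p1, p0):
    """Confounding bias factor: rr_cd is the confounder-disease risk ratio;
    p1 and p0 are the confounder prevalences among exposed and unexposed."""
    return (rr_cd * p1 + (1 - p1)) / (rr_cd * p0 + (1 - p0))

rr_observed = 1.8
bf = bias_factor(rr_cd=3.0, p1=0.5, p0=0.3)
print(f"Bias factor = {bf:.2f}; bias-adjusted RR = {rr_observed / bf:.2f}")
# Output: Bias factor = 1.25; bias-adjusted RR = 1.44
```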

Although the courts have failed to articulate the nature and consequences of erroneous inference, some authors would reduce all of Rule 702 (and perhaps Rules 704 and 403 as well) to a requirement that proffered expert witnesses “account” for the known and potential errors in their opinions:

“If an expert can account for the measurement error, the random error, and the systematic error in his evidence, then he ought to be permitted to testify. On the other hand, if he should fail to account for any one or more of these three types of error, then his testimony ought not be admitted.”

Mark Haug & Emily Baird, “Finding the Error in Daubert,” 62 Hastings L.J. 737, 739 (2011).

Like most antic proposals to revise Rule 702, this reform vision ignores the full range of Rule 702’s remedial scope. Scientists certainly try to identify potential sources of error, but they are not necessarily very good at it. See Richard Horton, “Offline: What is medicine’s 5 sigma?” 385 Lancet 1380 (2015) (“much of the scientific literature, perhaps half, may simply be untrue”). And as Holmes pointed out[13], certitude is not certainty, and expert witnesses are not likely to be good judges of their own inferential errors[14]. Courts continue to say and do wildly inconsistent things in the course of gatekeeping. Compare In re Zoloft (Sertraline Hydrochloride) Products, 26 F. Supp. 3d 449, 452 (E.D. Pa. 2014) (excluding expert witness) (“The experts must use good grounds to reach their conclusions, but not necessarily the best grounds or unflawed methods.”), with Gutierrez v. Johnson & Johnson, 2006 WL 3246605, at *2 (D.N.J. Nov. 6, 2006) (denying motions to exclude expert witnesses) (“The Daubert inquiry was designed to shield the fact finder from flawed evidence.”).


[1] See, e.g., Rabozzi v. Bombardier, Inc., No. 5:03-CV-1397 (NAM/DEP), 2007 U.S. Dist. LEXIS 21724, at *7, *8, *20 (N.D.N.Y. Mar. 27, 2007) (excluding testimony from civil engineer about boat design, in part because witness failed to provide rate of error); Sorto-Romero v. Delta Int’l Mach. Corp., No. 05-CV-5172 (SJF) (AKT), 2007 U.S. Dist. LEXIS 71588, at *22–23 (E.D.N.Y. Sept. 24, 2007) (excluding engineering opinion that defective wood-carving tool caused injury because of lack of error rate); Phillips v. Raymond Corp., 364 F. Supp. 2d 730, 732–33 (N.D. Ill. 2005) (excluding biomechanics expert witness who had not reliably tested his claims in a way to produce an accurate rate of error); Roane v. Greenwich Swim Comm., 330 F. Supp. 2d 306, 309, 319 (S.D.N.Y. 2004) (excluding mechanical engineer, in part because witness failed to provide rate of error); Nook v. Long Island R.R., 190 F. Supp. 2d 639, 641–42 (S.D.N.Y. 2002) (excluding industrial hygienist’s opinion in part because witness was unable to provide a known rate of error).

[2] See, e.g., United States v. Microtek Int’l Dev. Sys. Div., Inc., No. 99-298-KI, 2000 U.S. Dist. LEXIS 2771, at *2, *10–13, *15 (D. Or. Mar. 10, 2000) (excluding polygraph data based upon a showing that the claimed error rate came from highly controlled situations, and that “real world” situations led to much higher (10%) false positive error rates); Meyers v. Arcudi, 947 F. Supp. 581 (D. Conn. 1996) (excluding polygraph in civil action).

[3] See, e.g., United States v. Ewell, 252 F. Supp. 2d 104, 113–14 (D.N.J. 2003) (rejecting defendant’s objection to government’s failure to quantify laboratory error rate); United States v. Shea, 957 F. Supp. 331, 334–45 (D.N.H. 1997) (rejecting objection to government witness’s providing separate match and error probability rates).

[4] For a typical judicial misstatement, see In re Zoloft Products, 26 F. Supp. 3d 449, 454 (E.D. Pa. 2014) (“A 95% confidence interval means that there is a 95% chance that the ‘true’ ratio value falls within the confidence interval range.”).

[5] From my experience, this fallacious argument is advanced by both plaintiffs’ and defendants’ counsel and expert witnesses. See also Mark Haug & Emily Baird, “Finding the Error in Daubert,” 62 Hastings L.J. 737, 751 & n.72 (2011).

[6] See David L. Faigman et al., eds., Modern Scientific Evidence: The Law and Science of Expert Testimony § 6:36, at 359 (2007–08) (“it is easy to mistake the p-value for the probability that there is no difference”).

[7] Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989), modified, 884 F.2d 166 (5th Cir. 1989), cert. denied, 494 U.S. 1046 (1990). As with any error of this sort, there is always the question whether the judges were entrapped by the parties or their expert witnesses, or whether the judges came up with the fallacy on their own.

[8] See Joel B. Greenhouse, “Commentary: Cornfield, Epidemiology and Causality,” 38 Internat’l J. Epidem. 1199 (2009).

[9] Olav Axelson & Kyle Steenland, “Indirect methods of assessing the effects of tobacco use in occupational studies,” 13 Am. J. Indus. Med. 105 (1988); Olav Axelson, “Confounding from smoking in occupational epidemiology,” 46 Brit. J. Indus. Med. 505 (1989); Olav Axelson, “Aspects on confounding in occupational health epidemiology,” 4 Scand. J. Work Envt’l Health 85 (1978).

[10] See, e.g., David Kriebel, Ariana Zeka, Ellen A. Eisen, and David H. Wegman, “Quantitative evaluation of the effects of uncontrolled confounding by alcohol and tobacco in occupational cancer studies,” 33 Internat’l J. Epidem. 1040 (2004).

[11] Kari Furu, Helle Kieler, Bengt Haglund, Anders Engeland, Randi Selmer, Olof Stephansson, Unnur Anna Valdimarsdottir, Helga Zoega, Miia Artama, Mika Gissler, Heli Malm, and Mette Nørgaard, “Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design,” 350 Brit. Med. J. 1798 (2015).

[12] Timothy L. Lash, Matthew P. Fox, Richard F. MacLehose, George Maldonado, Lawrence C. McCandless, and Sander Greenland, “Good practices for quantitative bias analysis,” 43 Internat’l J. Epidem. 1969 (2014).

[13] Oliver Wendell Holmes, Jr., Collected Legal Papers at 311 (1920) (“Certitude is not the test of certainty. We have been cock-sure of many things that were not so.”).

[14] See, e.g., Amos Tversky & Daniel Kahneman, “Judgment under Uncertainty: Heuristics and Biases,” 185 Science 1124 (1974).