Reference Manual – Desiderata for 4th Edition – Part IV – Confidence Intervals

Putting aside the idiosyncratic chapter by the late Professor Berger, most of the third edition of the Reference Manual presented guidance on many important issues. To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges. On several issues, especially those statistical in nature, the fourth edition could benefit from an editor to ensure that the individual chapters, written by different authors, actually agree on key concepts. One such example is the third edition’s treatment of confidence intervals.[1]

The “DNA Identification” chapter noted that the meaning of a confidence interval is subtle,[2] but I doubt that the authors, David Kaye and George Sensabaugh, actually found it subtle or difficult. In the third edition’s chapter on statistics, David Kaye and co-author, the late David A. Freedman, gave a reasonable definition of confidence intervals in their glossary:

“confidence interval. An estimate, expressed as a range, for a parameter. For estimates such as averages or rates computed from large samples, a 95% confidence interval is the range from about two standard errors below to two standard errors above the estimate. Intervals obtained this way cover the true value about 95% of the time, and 95% is the confidence level or the confidence coefficient.”[3]

Note the plural: intervals, not the interval, which is correct. This chapter made clear that it was the procedure of obtaining multiple samples, each with its own interval, that yielded the 95% coverage. In the substance of their chapter, Kaye and Freedman are explicit about how intervals are constructed, and that:

“the confidence level does not give the probability that the unknown parameter lies within the confidence interval.”[4]
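The repeated-sampling reading can be checked with a short simulation. The sketch below is illustrative only: the normal population, its mean and standard deviation, the sample size, the number of repetitions, and the function name coverage_rate are all assumptions chosen for the example, not anything taken from the Reference Manual. It draws many samples, builds an interval of about two standard errors around each estimate, and counts how often the intervals cover the fixed true value.

```python
import random
import statistics


def coverage_rate(true_mean=10.0, sd=2.0, n=100, repetitions=2000, seed=1):
    """Share of repeated-sample 95% intervals that cover true_mean.

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    covered = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        estimate = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
        # Each realized interval either covers the fixed parameter or it does not.
        covered += lower <= true_mean <= upper
    return covered / repetitions


rate = coverage_rate()
print(f"share of intervals covering the true mean: {rate:.3f}")
```

Under these assumptions the printed share should land close to 0.95. The figure describes the long-run performance of the procedure; inside the loop, every individual interval simply does or does not contain the true mean.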

Importantly, the authors of the statistics chapter named names; that is, they cited some cases that butchered the concept of the confidence interval.[5] The fourth edition will have a more difficult job because, despite the care taken in the statistics chapter, many more decisions have misstated or misrepresented the meaning of a confidence interval.[6] Citing more cases perhaps will disabuse federal judges of their reliance upon case law for the meaning of statistical concepts.

The third edition’s chapter on multiple regression defined confidence interval in its glossary:

“confidence interval. An interval that contains a true regression parameter with a given degree of confidence.”[7]

The chapter avoided saying anything obviously wrong only by giving a very circular definition. When the chapter substantively described a confidence interval, it ended up giving an erroneous one:

“In general, for any parameter estimate b, the expert can construct an interval around b such that there is a 95% probability that the interval covers the true parameter. This 95% confidence interval is given by: b ± 1.96 (SE of b).”[8]

The formula provided is correct, but the interpretation of a 95% probability that the interval covers the true parameter is unequivocally wrong.[9]
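For readers who want to see the arithmetic behind the formula, here is a minimal sketch. The estimate of 0.50 and the standard error of 0.10 are hypothetical values, and normal_interval is a helper name invented for this example. The multiplier 1.96 is recovered as the 97.5th percentile of the standard normal distribution.

```python
from statistics import NormalDist


def normal_interval(b, se, level=0.95):
    """Return (lower, upper) for b ± z * se, with z from the standard normal."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    return b - z * se, b + z * se


# Hypothetical coefficient estimate and standard error, for illustration only.
lower, upper = normal_interval(b=0.50, se=0.10)
print(round(lower, 3), round(upper, 3))  # prints: 0.304 0.696
```

The calculation yields the endpoints and nothing more. It does not attach a 95% probability to the particular range from 0.304 to 0.696.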

The third edition’s chapter by Shari Seidman Diamond on survey research, on the other hand, gave an anodyne example and a definition:

“A survey expert could properly compute a confidence interval around the 20% estimate obtained from this sample. If the survey were repeated a large number of times, and a 95% confidence interval was computed each time, 95% of the confidence intervals would include the actual percentage of dentists in the entire population who would believe that Goldgate was manufactured by the makers of Colgate.

                 *  *  *  *

Traditionally, scientists adopt the 95% level of confidence, which means that if 100 samples of the same size were drawn, the confidence interval expected for at least 95 of the samples would be expected to include the true population value.”[10]

Similarly, the third edition’s chapter on epidemiology correctly defined the confidence interval operationally, in terms of a process of repeated studies in which 95% of the resulting intervals cover the true value:

“A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”[11]

Not content to leave it well said, the chapter’s authors returned to the confidence interval and provided another, more problematic definition, a couple of pages later in the text:

“A confidence interval is a range of possible values calculated from the results of a study. If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[12]

The first sentence refers to “a study”; that is, one study, one range of values. The second sentence then tells us that “the range” (singular, presumably referring back to the single “a study”), will capture 95% of the results from many resamplings from the same population. Now the definition is not framed with respect to the true population parameter, but the results from many other samples. The authors seem to have given the first sample’s confidence interval the property of including 95% of all future studies, and that is incorrect. From reviewing the case law, courts remarkably have gravitated to the second, incorrect definition.
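A small simulation suggests why the second definition fails. The sketch below rests on assumptions made only for illustration: a normal population with a known true mean, 400 hypothetical studies of 50 observations each, and an invented function name, capture_shares. Each study in turn plays the role of the first study, and the code records what share of the other studies’ point estimates fall inside that study’s 95% interval.

```python
import random
import statistics


def capture_shares(true_mean=10.0, sd=2.0, n=50, studies=400, seed=2):
    """For each simulated study, the share of the other studies' estimates
    that fall inside its 95% interval. All values are hypothetical."""
    rng = random.Random(seed)
    results = []
    for _ in range(studies):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        results.append((mean, mean - 1.96 * se, mean + 1.96 * se))

    shares = []
    for i, (_, lower, upper) in enumerate(results):
        others = [m for j, (m, _, _) in enumerate(results) if j != i]
        shares.append(sum(lower <= m <= upper for m in others) / len(others))
    return shares


shares = capture_shares()
print(f"average share captured: {statistics.fmean(shares):.2f}")
print(f"lowest share captured:  {min(shares):.2f}")
```

Under these assumptions the average share tends to fall well short of 95%, somewhere in the low 80s percent, and an interval from a study that happened to land far from the true mean captures far fewer of the later estimates. The 95% figure belongs to coverage of the true parameter by the procedure, not to the capture of future results by any one interval.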

The glossary to the third edition’s epidemiology chapter clearly, however, runs into the ditch:

“confidence interval. A range of values calculated from the results of a study within which the true value is likely to fall; the width of the interval reflects random error. Thus, if a confidence level of .95 is selected for a study, 95% of similar studies would result in the true relative risk falling within the confidence interval.”[13]

Note that the sentence before the semicolon talked of “a study” with “a range of values,” and that there is a likelihood of that range including the “true value.” This definition thus used the singular to describe the study and to describe the range of values.  The definition seemed to be saying, clearly but wrongly, that a single interval from a single study has a likelihood of containing the true value. The second full sentence ascribed a probability, 95%, to the true relative risk’s falling within “the interval.” To point out the obvious, “the interval,” is singular, and refers back to “a study,” also singular. At best, this definition was confusing; at worst, it was wrong.

The Reference Manual has a problem beyond its own inconsistencies, and the refractory resistance of the judiciary to statistical literacy. There are any number of law professors and even scientists who have held out incorrect definitions and interpretations of confidence intervals. It would be helpful for the fourth edition to alert its readers, both bench and bar, to the prevalent misunderstandings.

Here, for instance, is an example of a well-credentialed statistician, who gave a murky definition in a declaration filed in federal court:

“If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[14]

The expert witness correctly identifies the repeated sampling, but assigns a 95% probability to “the range,” which leaves unclear whether it is the range of all intervals or “a 95% confidence interval,” which is in the antecedent of the statement.

Much worse was a definition proffered in a recent law review article by well-known, respected authors:

“A 95% confidence interval, in contrast, is a one-sided or two-sided interval from a data sample with 95% probability of bounding a fixed, unknown parameter, for which no nondegenerate probability distribution is conceived, under specified assumptions about the data distribution.”[15]

It is unclear whether the phrase “for which no nondegenerate probability distribution is conceived” refers to the confidence interval or to the unknown parameter. The phrase seems to modify the noun closest to it in the sentence, the “fixed, unknown parameter,” which suggests that these authors were simply trying to emphasize that they were giving a frequentist interpretation and not conceiving of the parameter as a random variable as Bayesians would. The phrase “no nondegenerate” appears to be a triple negative, since a degenerate distribution is one that does not have a variation. The phrase makes the definition obscure, and raises questions about what is being excluded.

The more concerning aspect of the quoted footnote is its obfuscation of the important distinction between the procedure of repeatedly calculating confidence intervals (which procedure has a 95% success rate in the long run) and the probability that any given instance of the procedure, in a single confidence interval, contains the parameter. The latter probability is either zero or one.

The definition’s reference to “a” confidence interval, based upon “a” data sample, leaves the reader with no way to understand the definition as referring to the repeated process of sampling, and the set of resulting intervals. The upper and lower interval bounds are themselves random variables that need to be taken into account, but by referencing a single interval from a single data sample, the authors misrepresent the confidence interval and invite a Bayesian interpretation.[16]

Sadly, there is a long tradition of scientists and academics giving errant definitions and interpretations of the confidence interval.[17] Their error is not harmless because they invite the attribution of a high level of probability to the claim that the “true” population measure is within the reported confidence interval. The error encourages readers to believe that the confidence interval is not conditioned upon the single sample result, and it misleads readers into believing that not only random error, but systematic and data errors are accounted for in the posterior probability.[18]


[1] “Confidence in Intervals and Diffidence in the Courts” (Mar. 4, 2012).

[2] David H. Kaye & George Sensabaugh, “Reference Guide on DNA Identification Evidence” 129, 165 n.76.

[3] David H. Kaye & David A. Freedman, “Reference Guide on Statistics” 211, 284-5 (Glossary).

[4] Id. at 247.

[5] Id. at 247 n.91 & 92 (citing DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993); SmithKline Beecham Corp. v. Apotex Corp., 247 F. Supp. 2d 1011, 1037 (N.D. Ill. 2003), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005); In re Silicone Gel Breast Implants Prods. Liab. Litig., 318 F. Supp. 2d 879, 897 (C.D. Cal. 2004) (“a margin of error between 0.5 and 8.0 at the 95% confidence level . . . means that 95 times out of 100 a study of that type would yield a relative risk value somewhere between 0.5 and 8.0.”)).

[6] See, e.g., Turpin v. Merrell Dow Pharm., Inc., 959 F.2d 1349, 1353–54 & n.1 (6th Cir. 1992) (erroneously describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”); American Library Ass’n v. United States, 201 F.Supp. 2d 401, 439 & n.11 (E.D.Pa. 2002), rev’d on other grounds, 539 U.S. 194 (2003); Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc., 482 F.Supp. 2d 478, 495 (D.N.J.2007) (“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”) (apparently relying upon a party’s expert witness’s report), aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd., 344 Fed.Appx. 595 (Fed. Cir. 2009); Eli Lilly & Co. v. Teva Pharms, USA, 2008 WL 2410420, *24 (S.D. Ind. 2008) (stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”); Benavidez v. City of Irving, 638 F.Supp. 2d 709, 720 (N.D. Tex. 2009) (interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”); Pritchard v. Dow Agro Sci., 705 F. Supp. 2d 471, 481, 488 (W.D. Pa. 2010) (excluding Dr. Bennet Omalu who assigned a 90% probability that an 80% confidence interval excluded relative risk of 1.0), aff’d, 430 F. App’x 102 (3d Cir.), cert. denied, 132 S. Ct. 508 (2011); Estate of George v. Vermont League of Cities and Towns, 993 A.2d 367, 378 n.12 (Vt. 2010) (erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”); Garcia v. Tyson Foods, 890 F. Supp. 2d 1273, 1285 (D. Kan. 2012) (quoting expert witness Robert G. Radwin, who testified that a 95% confidence interval in a study means “if I did this study over and over again, 95 out of a hundred times I would expect to get an average between that interval.”); In re Chantix (Varenicline) Prods. Liab. Litig., 889 F. Supp. 2d 1272, 1290 n.17 (N.D. Ala. 2012); In re Zoloft Products, 26 F. Supp. 3d 449, 454 (E.D. Pa. 2014) (“A 95% confidence interval means that there is a 95% chance that the ‘‘true’’ ratio value falls within the confidence interval range.”), aff’d, 858 F.3d 787 (3d Cir. 2017); Duran v. U.S. Bank Nat’l Ass’n, 59 Cal. 4th 1, 36, 172 Cal. Rptr. 3d 371, 325 P.3d 916 (2014) (“Statisticians typically calculate margin of error using a 95 percent confidence interval, which is the interval of values above and below the estimate within which one can be 95 percent certain of capturing the ‘true’ result.”); In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832, 842 (2017) (correctly quoting an incorrect definition from the third edition at p.580), rev’d on other grounds, 235 N.J. 229, 194 A.3d 503 (2018); In re Testosterone Replacement Therapy Prods. Liab., No. 14 C 1748, MDL No. 2545, 2017 WL 1833173, *4 (N.D. Ill. May 8, 2017) (“A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time.”); Maldonado v. Epsilon Plastics, Inc., 22 Cal. App. 5th 1308, 1330, 232 Cal. Rptr. 3d 461 (2018) (“The 95 percent ‘confidence interval’, as used by statisticians, is the ‘interval of values above and below the estimate within which one can be 95 percent certain of capturing the “true” result’.”); Escheverria v. Johnson & Johnson, 37 Cal. App. 5th 292, 304, 249 Cal. Rptr. 3d 642 (2019) (quoting uncritically and with approval one of plaintiff’s expert witnesses, Jack Siemiatycki, who gave the jury an example of a study with a relative risk of 1.2, with a “95 percent probability that the true estimate is between 1.1 and 1.3.” According to the court, Siemiatycki went on to explain that this was “a pretty tight interval, and we call that a confidence interval. We call it a 95 percent confidence interval when we calculate it in such a way that it covers 95 percent of the underlying relative risks that are compatible with this estimate from this study.”); In re Viagra (Sildenafil Citrate) & Cialis (Tadalafil) Prods. Liab. Litig., 424 F.Supp.3d 781, 787 (N.D. Cal. 2020) (“For example, a given study could calculate a relative risk of 1.4 (a 40 percent increased risk of adverse events), but show a 95 percent “confidence interval” of .8 to 1.9. That confidence interval means there is 95 percent chance that the true value—the actual relative risk—is between .8 and 1.9.”); Rhyne v. United States Steel Corp., 74 F. Supp. 3d 733, 744 (W.D.N.C. 2020) (relying upon, and quoting, one of the more problematic definitions given in the third edition at p.580: “If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the population.”); Wilant v. BNSF Ry., C.A. No. N17C-10-365 CEB, (Del. Super. Ct. May 13, 2020) (citing third edition at p.573, “a confidence interval provides ‘a range (interval) within which the risk likely would fall if the study were repeated numerous times’.”; “[s]o a 95% confidence interval indicates that the range of results achieved in the study would be achieved 95% of the time when the study is replicated from the same population.”); Germaine v. Sec’y Health & Human Servs., No. 18-800V, (U.S. Fed. Ct. Claims July 29, 2021) (giving an incorrect definition directly from the third edition, at p.621; “[a] “confidence interval” is “[a] range of values … within which the true value is likely to fall[.]”).

[7] Daniel Rubinfeld, “Reference Guide on Multiple Regression” 303, 352.

[8] Id. at 342.

[9] See Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman, “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” 31 Eur. J. Epidemiol. 337, 343 (2016).

[10] Shari Seidman Diamond, “Reference Guide on Survey Research” 359, 381.

[11] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” 549, 573.

[12] Id. at 580.

[13] Id. at 621.

[14] In re Testosterone Replacement Therapy Prods. Liab. Litig., Declaration of Martin T. Wells, Ph.D., at 2-3 (N.D. Ill., Oct. 30, 2016). 

[15] Joseph Sanders, David Faigman, Peter Imrey, and A. Philip Dawid, “Differential Etiology: Inferring Specific Causation in the Law from Group Data in Science,” 63 Arizona L. Rev. 851, 898 n.173 (2021).

[16] The authors are well-credentialed lawyers and scientists. Peter Imrey was trained in, and has taught, mathematical statistics, biostatistics, and epidemiology. He is a professor of medicine in the Cleveland Clinic Lerner College of Medicine. A. Philip Dawid is a distinguished statistician, an Emeritus Professor of Statistics, Cambridge University, Darwin College, and a Fellow of the Royal Society. David Faigman is the Chancellor & Dean, and the John F. Digardi Distinguished Professor of Law at the University of California Hastings College of the Law. Joseph Sanders is the A.A. White Professor, at the University of Houston Law Center. I have previously pointed out this problem in these authors’ article. “Differential Etiologies – Part One – Ruling In” (June 19, 2022).

[17] See, e.g., Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004) (“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”); Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 60-61 n. 17 (2007) (quoting Clapp and Ozonoff with obvious approval); Déirdre Dwyer, The Judicial Assessment of Expert Evidence 154-55 (Cambridge Univ. Press 2008) (“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”); Frank C. Woodside, III & Allison G. Davis, “The Bradford Hill Criteria: The Forgotten Predicate,” 35 Thomas Jefferson L. Rev. 103, 110 (2013) (“A confidence interval provides both the relative risk found in the study and a range (interval) within which the risk would likely fall if the study were repeated numerous times.”); Christopher B. Mueller, “Daubert Asks the Right Questions: Now Appellate Courts Should Help Find the Right Answers,” 33 Seton Hall L. Rev. 987, 997 (2003) (describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 Judicature 12, 19 (2003) (“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”) (incorrectly citing the first edition of Rothman & Greenland, Modern Epidemiology 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New Reference Manual on Scientific Evidence,” 74 N.C.L.Rev. 1183, 1212 n.172 (1996) (“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

[18] See Brock v. Merrell Dow Pharm., Inc., 874 F.2d 307, 311–12 (5th Cir. 1989) (incorrectly stating that the court need not resolve questions of bias and confounding because “the studies presented to us incorporate the possibility of these factors by the use of a confidence interval”). Bayesian credible intervals can similarly be misleading when the interval simply reflects sample results and sample variance, but not the myriad other ways the estimate may be wrong.