TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

A Bayesian Toehold in the New Reference Guide to Epidemiology

April 4th, 2026

The epidemiology chapter in the most recent edition of the Reference Manual distinguishes more carefully between Bayesian and frequentist approaches to statistical analysis than did its previous iterations. In past editions, the authors conflated confidence intervals and credible intervals, an error that the text of the fourth edition’s epidemiology chapter studiously avoids.[1]

The chapter acknowledges that “most published research does not” use Bayesian credible intervals or posterior probabilities. The authors then offer a largely unsupported conclusion about a “toehold”:

“Epidemiologic studies assessed by Bayesian statistical analyses have begun to gain a toehold in litigation, although court opinions are still dominated by discussion of traditional significance testing.”[2]

The authors do not define what a toehold is; nor do they specify whether it is a big toe or pinky toe. The new chapter cites three cases, which, out of the universe of cases, seems like a tiny toe. The three cases cited by the Reference Manual as a toehold raise serious questions about the legitimacy of using Bayesian analyses, at least to date.

  1. Langrell.

In Langrell,[3] one of the three cases cited by the Manual, an expert witness claimed to have used a “Bayesian approach,” but in reality no Bayesian statistics were involved. The Manual describes the result in Langrell as admitting the testimony of an expert witness who had used a Bayesian approach to opine on the specific causation of a cancer “so rare that it was ‘unlikely or impossible for epidemiological studies to be performed’.”[4]

Citing Langrell for the stated proposition was questionable scholarship at best. The case was one of several cancer claims against railroad employers in which Robert Peter Gale served as an expert witness. Dr. Gale is a well-credentialed clinician whose career has focused on lymphopoietic cancers.[5] He has no apparent expertise in statistics or epidemiology.

In one reported decision, Byrd, Dr. Gale attempted to offer a “Bayesian” opinion that railroad yard exposures caused a worker’s lung cancer. The claimant had also been a two-pack-per-day smoker for many years.[6] The published opinion refers to Dr. Gale’s having used Bayesian methods, but nothing in the opinion suggests that any such methods were actually used.[7] Gale appeared to equate Bayesian analysis with a non-quantitative differential etiology. Given the claimant’s extensive smoking history, the trial court excluded Dr. Gale’s proffered opinion on the cause of the claimant’s lung cancer as unreliable.

In another railroad case brought by Saul Hernandez, Gale also claimed to use Bayesian methods to assess the causation of the claimant’s stomach cancer. There is only one mention, however, of Bayes in Gale’s report:

“My opinion is based in Bayesian probabilities which consider the interdependence of individual probabilities. This process is sometimes referred to as differential diagnosis or differential causation determination or differential etiology. Differential diagnosis is a method of reasoning widely-accepted in medicine.”[8]

To be explicit, there was no discussion of prior or posterior probabilities or odds, and no discussion of likelihood ratios or Bayes factors. There was absolutely nothing in Dr. Gale’s report that would warrant his claim that he had done a Bayesian analysis of specific causation, or of the “interdependence of individual probabilities” of putative specific causes. The court excluded Dr. Gale’s proffered opinion in Hernandez, with its scant reference to a Bayesian analysis.[9]
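For contrast, a genuine Bayesian analysis of specific causation would have to quantify a prior probability that the exposure caused the disease, a likelihood ratio measuring how strongly the claimant’s evidence favors the causal hypothesis, and the resulting posterior probability. A minimal sketch of that arithmetic, with all inputs hypothetical rather than drawn from the Gale cases:

```python
# A minimal sketch of the arithmetic that a Bayesian specific-causation
# opinion would have to contain. All inputs are hypothetical placeholders,
# not data drawn from the Gale cases.

def posterior_probability(prior_prob: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical inputs: a 10% prior that the occupational exposure, rather
# than smoking or another cause, produced the cancer, and case-specific
# evidence three times more probable under the causal hypothesis than not.
print(posterior_probability(prior_prob=0.10, likelihood_ratio=3.0))
# 0.25: even with favorable evidence, the posterior stays below 50%.
```

Nothing of this sort, with or without numbers, appeared in Gale’s reports.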

The third instance of Gale’s purported use of a Bayesian analysis occurred in the Langrell case, cited by the Manual. The authors of the new Manual do not specify what kind of rare cancer was involved in Langrell. For the record, Mr. Langrell developed squamous cell carcinoma of the tonsils, the most common type of oropharyngeal cancer, and one that has been studied for many decades. Alcohol, tobacco, and human papillomavirus (HPV) have long been associated with the occurrence of such cancers. Mr. Langrell had a history of exposure to all three risk factors. Contrary to Gale’s poor-mouthing about a lack of data, there are many large cohort studies of railroad yard workers with diesel fume exposure.[10]

The full extent of the district court’s exposition of Gale’s “Bayesian” method was the following:

“He testified he used a Bayesian approach, allowing him to ‘consider interdependence of individual probabilities’ and to render an opinion as to ‘whether the weight of the evidence indicates it is more likely than not to a reasonable degree of medical probability that exposure to the carcinogens discussed was a cause of tonsil cancer in Mr. Langrell’.”[11]

There is no evidence that Dr. Gale had the competence to conduct a Bayesian analysis, or that he actually did one. Dr. Gale’s participation in the Langrell, Byrd, and Hernandez cases seems like poor evidence of a toehold for Bayesian methods. Not even a pinky toe.

We might forgive the credulity of the judicial officers in these cases, but why would Dr. Gale state that he had done a Bayesian analysis? The only reason that suggests itself is that Dr. Gale was bloviating in order to give his specific causation opinions an aura of scientific and mathematical respectability. Falsus in uno, falsus in omnibus.[12] In two of the three related cases, his opinion was rejected. The Manual cites only the case in which Gale’s opinion was admitted. The cited opinion offers no support for Gale’s having actually conducted a Bayesian analysis of any sort.

  2. In re Abilify.

The second cited example of a toehold was the use of a Bayesian analysis by a statistician, David Madigan, in the Abilify litigation. Madigan has published on Bayesian statistics, but his litigation activities have repeatedly raised questions about whether his Bayesian analyses are reliable.

The Abilify litigation involved claims that the anti-psychotic medication caused impulsive gambling, eating, shopping, and sex. Of course, psychotic behavior itself involves those impulsive behaviors and many others. The Manual cited a decision of the multi-district litigation court that noted that “[n]umerous federal courts have found Dr. Madigan’s methodology of detecting safety signals using a combination of frequentist and Bayesian algorithms to be reliable under Rule 702 and Daubert.”[13]

The “signals” to which the Manual citation refers are suggestions of possible causal associations; they are hypotheses generated from pharmacovigilance studies of adverse event reports, not tests of those hypotheses. Signals are not causes; they may not rise even to the level of associations. The particular analyses proffered by Madigan, for plaintiffs, in Abilify and in many other litigations involve comparing the rate at which specific adverse events are reported for the drug at issue with the reporting rate for all drugs, or for comparator drugs. The outcome of these analyses is a reporting rate ratio, not an incidence ratio.

A standard 2 × 2 table illustrates how adverse event data are used to create “signals” of disproportional reporting: spontaneous reports are cross-classified by drug (the drug of interest versus all other drugs) and by adverse event (the event of interest versus all other events), and a reporting odds ratio is computed from the four cells, as in the sketch below.
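A minimal sketch of the computation, with hypothetical counts, not data from any actual drug:

```python
import math

# Disproportionality analysis on a 2 x 2 table of spontaneous reports.
# All counts are hypothetical, for illustration only.
#
#                       event of interest   all other events
#   drug of interest           a                   b
#   all other drugs            c                   d
a, b = 40, 960           # reports for the drug of interest
c, d = 2_000, 97_000     # reports for all other drugs

ror = (a / b) / (c / d)                        # reporting odds ratio
se_log_ror = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(ROR)
lo = math.exp(math.log(ror) - 1.96 * se_log_ror)
hi = math.exp(math.log(ror) + 1.96 * se_log_ror)
print(f"ROR = {ror:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# The ROR compares reporting rates, not disease incidence; a high ROR
# is a prompt for further study, not evidence of causation.
```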

The FDA provides very clear guidance on the meaning and use of such signal-finding algorithms or disproportionality analyses (DPAs):

“In the context of spontaneous report systems, some authors use the term “signal of disproportionate reporting” (SDR) when discussing associations highlighted by DPA methods. In reality, most SDRs that emerge from spontaneous report databases represent non-causal effects because the reports are associated with treatment indications (i.e., confounding by indication), co-prescribing patterns, co-morbid illnesses, protopathic bias, channeling bias, or other reporting artifacts, or, the reported adverse events are already labeled or are medically trivial.”[14]

Disproportionality analyses are not part of analytical epidemiology, but Madigan has tried to pass them off as such in any number of litigations. More discerning courts have excluded his attempts. In the Accutane litigation in Atlantic County, New Jersey, Judge Johnson conducted an extensive pre-trial hearing on challenges to Madigan’s causation opinions, and found them wanting under the New Jersey analogue of Federal Rule of Evidence 702.[15] On appeal, the New Jersey Supreme Court reviewed and affirmed the exclusion of Madigan’s litigation opinions that isotretinoin causes Crohn’s disease.[16]

The pattern of adverse event report filing in connection with isotretinoin has been carefully studied; it illustrates the FDA’s point about artifacts. One such study of isotretinoin adverse event reporting showed that attorneys reported 87.8% of cases, while physicians reported 6.0%, and consumers only 5.1%. For the entire FAERS database, only 3.6% of reports for all drug reactions during the same time period were submitted by attorneys (p < 0.01).[17]

In other areas less affected by litigation-created reporting bias, the results of DPAs have been compared with analytical epidemiology. A DPA of statin use and bladder cancer suggested a reporting odds ratio of 1.48 (95% CI, 1.36–1.61). The authors, in a peer-reviewed publication, reported the result with clearly inappropriate causal language: “Multi-methodological approaches suggest that statins are associated with an increased risk for bladder cancer.”[18] An appropriate meta-analysis of analytical epidemiologic studies reported an actual odds ratio of 1.07 (95% CI, 0.95–1.21), a finding interpreted as suggesting “that there was no association between statin use and risk of bladder cancer.”[19]
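The divergence between the reporting odds ratio (1.48) and the meta-analytic odds ratio (1.07) is just what stimulated reporting predicts. A hedged, hypothetical illustration of how differential reporting alone can manufacture a “signal” when the true incidence ratio is exactly 1.0:

```python
# Hypothetical illustration: identical true adverse-event incidence for
# drug X and a comparator, but litigation publicity doubles the fraction
# of drug X events that get reported.
true_events_per_100k = 50             # same underlying incidence for both
reporting_rate_x = 0.20               # 20% of drug X events are reported
reporting_rate_comparator = 0.10      # 10% of comparator events are reported

reports_x = true_events_per_100k * reporting_rate_x                    # 10
reports_comparator = true_events_per_100k * reporting_rate_comparator  # 5

print(reports_x / reports_comparator)   # 2.0
# The true incidence ratio is 1.0, yet the reporting ratio is 2.0:
# disproportionate reporting, not disproportionate risk.
```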

Dr. Madigan’s use of Bayesian methods to analyze reporting ratios, and his passing them off as evidence that can support causal inference, is a paradigmatic instance of an inappropriate methodology. His Bayesian analyses of reporting rates seem like poor evidence of a toehold.

  3. In re Testosterone.

The third case cited by the Manual for the toehold proposition arose in the multi-district litigation created for claims against manufacturers of testosterone. This MDL aggregated cases based upon a speculative Public Citizen petition claiming that transdermal testosterone used by men with low testosterone levels causes heart attacks and strokes. The plaintiffs adopted what appeared to be a strategy of deploying complex arguments and analyses to obfuscate and defeat Rule 702 gatekeeping. As part of this strategy, two of the plaintiffs’ expert witnesses conducted a Bayesian “hypothesis test,” by which they took an out-of-date meta-analysis,[20] removed some of the studies that they incorrectly deemed duplicative, and recalculated a credible interval in place of a confidence interval.

This Bayesian hypothesis test came up in several decisions of the MDL court. The Manual cited only to a decision dated August 23, 2018, which it characterized as denying a motion to exclude expert witness testimony that advanced a Bayesian critique of epidemiologic studies.[21]

Looking at the cited decision of August 23, 2018, we see a reference to a previous ruling, in May 2017, when the court held that an expert witness’s failure and inability to “quantify the cardiovascular risk he finds in his Bayesian analysis … is an issue affecting the weight to be accorded to his analysis, not its admissibility.”[22] On its face, this opinion does not quite make sense, given that a Bayesian analysis would necessarily involve a quantification of posterior probability. The referenced May 2017 opinion also demonstrates the court’s failure to understand basic frequentist concepts when it recited incorrect definitions of the p-value and the confidence interval:

“According to conventional statistical practice, such a result—that is, a finding of a positive association between smoking and development of the disease—would be considered statistically significant if there is a 95% probability, also expressed as a “p-value” of <0.05, that the observed association is not the product of chance. If, however, the p-value were greater than 0.05, the observed association would not be regarded as statistically significant, according to prevailing conventions, because there is a greater than 5% probability that the association observed was the result of chance.

* * *

Statistical significance can also be expressed equivalently in terms of a confidence interval. A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time.”[23]

There is, however, also a discussion in the May 2017 decision of the Bayesian hypothesis test, which had been developed by plaintiffs’ expert witnesses, Burt Gerstman and Martin Wells.[24] The new Manual’s citation to the testosterone MDL case seems to be to this Bayesian analysis.

While the testosterone MDL case cited by the Manual refers only obliquely to a putative Bayesian analysis that had no quantification, the May 2017 decision, not cited by the Manual, actually involved a Bayesian analysis that supposedly yielded a posterior probability of 85% that there was some increased risk for a composite of heart attack and stroke outcomes from use of testosterone therapies.

In the May 2017 decision, the MDL court rejected AbbVie’s Rule 702 motion to exclude Gerstman’s opinion based upon the Bayesian hypothesis test. AbbVie’s challenge to the Gerstman-Wells Bayesian analysis seemed to avoid the complexity inherent in the analysis. The AbbVie motion advanced several grounds for excluding the Bayesian analysis, not all of which were discussed in the court’s May 2017 decision, including:

“1) the plaintiffs’ witnesses’ failure to publish their analysis;

2) the challenged witness’s having never published a significant Bayesian analysis previously;

3) the absence of Bayesian analyses in the relevant studies on testosterone;

4) the rarity of Bayesian analyses in product liability cases;

5) the witnesses’ failure to state what the actual risk was, as opposed to the probability that it exceeded 1.0; and

6) the defense expert witness’s calculation that the “Increased [cardiovascular] risk meets only a 70% level of evidence, which is far below the 95% level required.”[25]

Grounds one through four were extremely weak as stated, and ground five did not affect the relevancy of the analysis to general causation. Ground six was the shot in the foot, with the defense’s falling into the trap of conflating the coefficient of confidence (95%) with the posterior probability of a Bayesian analysis.

According to the district court’s opinion, AbbVie challenged Gerstman’s Bayesian analysis on the ground that Gerstman had never used or published on Bayesian statistics, and thus lacked expertise in Bayesian analysis. This part of the challenge was readily dismissed because the threshold for qualification as an expert witness is very low. A somewhat more substantive objection complained that the Bayesian analysis was “inappropriately based on subjective assumptions.”

The MDL court refused to exclude Gerstman’s Bayesian analysis, relying in part upon the suggestion in the statistics chapter of the Reference Manual third edition that Bayesians constitute a “well-established minority” in the field of statistics.[26]

On AbbVie’s claim that Bayesian methods are excessively “subjective,” the court declared that AbbVie had failed to explain how the subjective aspect of Bayesian analysis made the proffered Bayesian analysis “any less reliable than frequentist approaches to statistics, which also involve subjective judgments in interpretation of study results.”

Unfortunately, important issues raised by the plaintiffs’ Bayesian meta-analysis were not raised by counsel or addressed by the MDL court’s initial gatekeeping opinion of May 2017. The court briefly revisited the Bayesian analysis as proffered by Martin Wells, with the same lack of specificity, in August 2018.[27] The Bayesian analysis had been prepared jointly by Gerstman and Wells, and the August 2018 decision followed the earlier decision from 2017, without adding any analysis or explanation.

A third challenge to Wells’ Bayesian analysis was filed in 2019, by a different defendant in the testosterone MDL. This challenge was supported by an expert witness report that carefully identified the invalidity of the proffered Bayesian analysis.

Bayes’ Rule is a theorem that provides a posterior probability for a claim or proposition based upon a prior probability and the strength of the evidence at hand. Unlike frequentist statistics, which treats the population value (a mean or risk ratio) as fixed but unknown, a Bayesian analysis treats the parameter itself as uncertain, described by prior and posterior probability distributions. Every Bayesian analysis must start with a prior probability, and therein lies a serious methodological problem, not addressed by the MDL testosterone court in May 2017.
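Concretely, a Bayesian analysis of a risk ratio starts with a prior distribution over the (log) risk ratio, updates it with the likelihood of the observed data, and reports the posterior distribution. A minimal sketch under the conjugate normal-normal model, with all inputs hypothetical:

```python
from scipy.stats import norm

# Normal-normal conjugate update on the log risk ratio (logRR).
# Prior: logRR ~ N(prior_mean, prior_sd**2); data: an observed logRR
# estimate with a standard error. All numbers are hypothetical.
prior_mean, prior_sd = 0.0, 0.5      # prior centered on RR = 1
obs_logrr, obs_se = 0.3, 0.4         # an observed RR of about 1.35

w_prior, w_data = 1 / prior_sd**2, 1 / obs_se**2
post_var = 1 / (w_prior + w_data)
post_mean = post_var * (w_prior * prior_mean + w_data * obs_logrr)

# Posterior probability that the risk ratio exceeds 1.0 (logRR > 0):
print(norm.sf(0.0, loc=post_mean, scale=post_var**0.5))   # about 0.72
```

As the sketch makes plain, everything downstream depends on the prior that goes in at the top.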

The Bayesian hypothesis test advanced by the plaintiffs’ expert witnesses in the testosterone cases was based on a method described by John Carlin.[28] The analysis invokes a prior risk ratio centered on 1.0, which standing alone might seem like a perfectly fair and disinterested prior. The chosen variance around 1.0, which makes up the prior probability distribution, however, was extremely wide and flat, essentially encompassing no risk at the low end and absolute risk at the high end. Such a flat distribution implies that, as starting points, testosterone’s causing all heart attacks and strokes, its preventing all such outcomes, and its having no effect at all were treated as roughly equally likely. Given our very good understanding that testosterone neither prevents nor causes all heart attacks and strokes, the starting assumptions of the plaintiffs’ meta-analysis were completely unrealistic and counterfactual.
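To see how loaded a “flat” prior can be, consider how much probability a wide normal prior on the log risk ratio assigns to biologically absurd effect sizes. A minimal sketch; the standard deviation below is illustrative, not the value actually used by the plaintiffs’ witnesses:

```python
import math
from scipy.stats import norm

# A "vague" prior on logRR: N(0, sd**2) with a large standard deviation.
# The value below is illustrative, not the variance actually used in the
# testosterone litigation.
prior = norm(loc=0.0, scale=5.0)

print(prior.sf(math.log(10)))     # P(RR > 10)  -> about 0.32
print(prior.cdf(math.log(0.1)))   # P(RR < 0.1) -> about 0.32
# Nearly two-thirds of the prior mass sits on effects more extreme than
# a ten-fold increase or a ten-fold reduction in risk.
```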

Carlin’s method used in the proffered Bayesian meta-analysis in the testosterone cases further assumed a “hierarchical normal model.” Carlin described his assumption as reasonable “as long as the studies are large and observed counts are not too small.”[29] In the dataset used by plaintiffs’ expert witnesses, however, virtually all the studies had very low event counts, often zero or one, in either the TRT or placebo arm, or both. Carlin acknowledged that it was difficult to assess the validity of the normal model, and emphasized that

“[a] study of the sensitivity of conclusions to the choice of prior would be important.”[30]

Subsequent simulation studies of Carlin’s approach have shown that so-called “vague” or “non-informative” priors, such as were used by plaintiffs’ expert witnesses, can exercise an “unintentionally large degree of influence on any inferences.”[31]
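The sensitivity problem is easy to exhibit. When the data are sparse, so that the estimated log risk ratio carries a large standard error, the choice of prior variance can move the posterior probability of harm substantially. A minimal sketch, again under the normal-normal model with hypothetical inputs:

```python
from scipy.stats import norm

# Sparse data: few events yield an imprecise logRR estimate.
obs_logrr, obs_se = 0.8, 0.8      # hypothetical, with a wide standard error

for prior_sd in (0.2, 1.0, 5.0):  # skeptical through "vague" priors, mean 0
    w_prior, w_data = 1 / prior_sd**2, 1 / obs_se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_data * obs_logrr)
    p_harm = norm.sf(0.0, loc=post_mean, scale=post_var**0.5)
    print(f"prior sd = {prior_sd}: P(RR > 1 | data) = {p_harm:.2f}")
# With these hypothetical inputs, the posterior probability of harm runs
# from about 0.60 to about 0.84 depending solely on the choice of prior.
```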

AbbVie’s earlier challenges to Gerstman and Wells failed to note that the witnesses had offered no tests of the validity of Carlin’s method in the context of meta-analyzing clinical trials for sparse safety outcomes. The motion filed in the Martin case, in 2019, attacked the unsupported assumptions of the proffered Bayesian hypothesis test. This Rule 702 challenge pointed out not only the subjectivity of the assumed prior probability distribution, but also its counterfactual nature, and the failure of the proffered Bayesian analysis to comply with the methodological requirements of Carlin’s own method.

There were additional problems with the Bayesian hypothesis test as put forward by plaintiffs’ expert witnesses. First, advancing a causal claim with an 85% posterior probability was bound to be confused with the plaintiffs’ burden of proof of greater than 50%, notwithstanding that the calculated posterior probability did not take into account uncertainty from bias and other non-random errors in the aggregated clinical trial data, which were out of date and which had questionable inclusionary and exclusionary criteria. Second, the posterior probability was based upon a composite end point that combined heart attack and stroke. As a later deposition of one of the Bayesian analysts, Martin Wells, showed, had the Carlin method been applied to the heart attack summary point estimate alone, the posterior probability that TRT causes heart attack would have been less than 50%, which is to say that it was more probable than not that testosterone does not cause heart attack.[32]

Notwithstanding the plaintiffs’ failure to rebut the very specific methodological challenges to their witnesses’ Bayesian analysis, the MDL court denied the third Rule 702 motion to exclude, without meaningful analysis.[33] The case (Martin) was later tried to a jury that returned a verdict for the defense. Neither in Martin nor in any other testosterone case that was tried did plaintiffs actually present their Bayesian analysis to the trier of fact. The likely interpretation of this failure is that the Bayesian analysis was always meant to obfuscate the weaknesses of their causation case and to help deflect Rule 702 challenges.

The ultimate verdict on the plaintiffs’ case, and on the Bayesian hypothesis test with its ill-informed “non-informative” priors, was returned only after most of the MDL cases were tried or had settled. In 2023, a “mega-trial,” a large, well-conducted randomized controlled trial, was concluded and published, with findings of no increased risk of heart attack or stroke after long-term use of TRT in men who resembled the TRT plaintiffs. The trial enrolled over 5,000 men; a primary composite cardiovascular end-point event occurred in 182 men (7.0%) on testosterone therapy and in 190 men (7.3%) receiving placebo, for a hazard ratio below one (HR = 0.96; 95% CI, 0.78–1.17). Neither component of the composite (heart attack, stroke) showed an increased risk.[34]

“Falshood flies, and Truth comes limping after it; so that when Men come to be undeceived, it is too late, the Jest is over, and the Tale has had its Effect: Like a Man who has thought of a good Repartee, when the Discourse is changed, or the Company parted: Or, like a Physician who hath found out an infallible Medicine after the Patient is dead.”[35]

CONCLUSION

The Reference Manual’s chapter on epidemiology claims that Bayesian analyses have gained a toehold in litigation. The authors cited three cases, all involving the evaluation of health effects. One of the cases (Langrell) involved a claim of specific causation, and the cited opinion showed no evidence of an actual Bayesian analysis. That case was one of three in which the same expert witness, Dr. Gale, claimed to use Bayesian analysis. The other two cases, not cited, rejected the admissibility of Dr. Gale’s proffered testimony.

The second case cited (In re Abilify) actually involved a Bayesian analysis, but one deployed for a so-called disproportionality analysis, which is a technique for identifying a signal of a possible health effect. The misuse of the analysis by the Bayesian analyst (David Madigan) was overlooked by the court, and by the Reference Manual.

The third case cited by the Manual, In re Testosterone, also involved an actual Bayesian analysis, in the form of a Bayesian hypothesis test. The proffered analysis did, in theory, speak to a material issue of general causation. The Manual’s credulous citation, and the MDL court’s gatekeeping, however, overlooked that the methodology was misspecified and misapplied in multiple ways.

If these three citations are a toehold, then we need a tow-truck for these wrecks!


[1] Steve C. Gold, Michael D. Green, Jonathan Chevrier & Brenda Eskenazi, Reference Guide on Epidemiology, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 939 (4th ed. 2025) [cited as GGCE].

[2] GGCE at 963 n.178.

[3] Langrell v. Union Pac. Ry. Co., No. 8:18CV57, 2020 WL 3037271, at *3 (D. Neb. June 5, 2020).

[4] Id.

[5] See, e.g., Robert Peter Gale, et al., FETAL LIVER TRANSPLANTATION (1987); Robert Peter Gale & Thomas Hauser, CHERNOBYL: THE FINAL WARNING (1988); Kenneth A. Foon, Robert Peter Gale, et al., IMMUNOLOGIC APPROACHES TO THE CLASSIFICATION AND MANAGEMENT OF LYMPHOMAS AND LEUKEMIAS (1988); Eric Lax & Robert Peter Gale, RADIATION: WHAT IT IS, WHAT YOU NEED TO KNOW (2013).

[6] Byrd v. Union Pacific RR, 453 F. Supp. 3d 1260 (D. Neb. 2020).

[7] Id. at 1270 (“Dr. Gale states that his opinion is based on Bayesian probabilities which consider the interdependence of individual probabilities. This process is sometimes referred to as differential diagnosis or differential etiology.”).

[8] Report of Robert Peter Gale in Saul Hernandez at 13 (July 23, 2019)[on file with author]. There was no evidence that Mr. Hernandez was tested for infection by helicobacter pylori.

[9] Hernandez v. Union Pacific RR, No. 8:18CV62 (D. Neb. Aug. 14, 2020).

[10] See, e.g., Monireh Sadat Seyyedsalehi, Giulia Collatuzzo, Federica Teglia & Paolo Boffetta, Occupational exposure to diesel exhaust and head and neck cancer: a systematic review and meta-analysis of cohort studies, 33 EUR. J. CANCER PREV. 435 (2024).

[11] Langrell v. Union Pac. Ry. Co., No. 8:18CV57, 2020 WL 3037271, at *3-4 (D. Neb. June 5, 2020).

[12] Dr. Gale’s testimony has not fared well elsewhere. See, e.g., In re Incretin-Based Therapies Prods. Liab. Litig., 524 F.Supp.3d 1007 (S.D. Cal. 2021) (excluding Gale); Wilcox v. Homestake Mining Co., 619 F. 3d 1165 (10th Cir. 2010); June v. Union Carbide Corp., 577 F. 3d 1234 (10th Cir. 2009) (affirming exclusion of Dr. Gale and entry of summary judgment); Finestone v. Florida Power & Light Co., 272 F. App’x 761 (11th Cir. 2008); In re Rezulin Prods. Liab. Litig., 309 F.Supp.2d 531 (S.D.N.Y. 2004) (excluding Dr. Gale from offering ethical opinions); Cundy v. BNSF Ry, No. 40095-6-III, Wash. Ct. App. (Mar. 5, 2026) (affirming dismissal of case; Gale was one of plaintiff’s expert witnesses); Russo v. Metro-North RR., Index No. 159201/2019, 2025 NY Slip Op 34659(U), N.Y.S.Ct., N.Y. Cty. (Dec. 5, 2025); Saverino v. Metro-North RR, 2024 NY Slip Op 31326(U), Index No. 161353/2019, N.Y. S. Ct., N.Y. Cty. (Apr. 8, 2024).

[13] In re Abilify (Arpiprazole) Prods. Liab. Litig., No. 3:16MD2734, 2021 WL 4951944, at *5 (N.D. Fla. July 15, 2021).

[14] FDA Adverse Event Reporting System (FAERS) (Last updated Sept. 8, 2014), available at <http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm>.

[15] In re Accutane Litig., No. 271(MCL), 2015 WL 753674, at *15 (N.J. Super. Law Div., Feb. 20, 2015) (Hon. Nelson C. Johnson, also known as the author of Boardwalk Empire).

[16] In re Accutane, 234 N.J. 340 (2018) (affirming exclusion of David Madigan).

[17] Derrick J. Stobaugh, et al., Alleged isotretinoin-associated inflammatory bowel disease: Disproportionate reporting by attorneys to the Food and Drug Administration Adverse Event Reporting System, 69 J. AM. ACAD. DERMATOL. 393 (2013).

[18] Mai Fujimoto, et al., Association between Statin Use and Bladder Cancer: Data Mining of a Spontaneous Reporting Database and a Claim Database, 1 J. PHARMACOL. & PHARMACOVIGILANCE 1 (2015).

[19] Xiao-long Zhang, et al., Statin use and risk of bladder cancer: a meta-analysis, 24 CANCER CAUSES & CONTROL 769 (2013).

[20] S. Albert & J. Morley, Testosterone therapy, association with age, initiation and mode of therapy with cardiovascular events: a systematic review, 95 CLIN. ENDOCRINOL. 436 (2016).

[21] GGCE at 963 n.178 (citing In re Testosterone Replacement Therapy Prods. Liab. Litig., No. 14 C 1748, 2018 WL 4030585, at *8 (N.D. Ill. Aug. 23, 2018), and explaining that the court had denied a “motion to exclude testimony of expert ‘whose Bayesian critiques of epidemiological studies’ were similar to those of another expert whose testimony ‘the Court has previously found admissible’.”).

[22] In re Testosterone Replacement Therapy Prods. Liab. Litig., No. 14 C 1748, 2017 WL 1833173, at *4 (N.D. Ill. May 8, 2017).

[23] Id.

[24] This is the same Martin Wells found to be a methodological shapeshifter in the paraquat parkinsonism litigation. In re Paraquat Prods. Liab. Litig., Case No. 3:21-md-3004-NJR, MDL No. 3004, 730 F. Supp. 3d 793, 838 (S.D. Ill. 2024). See also Schachtman, Paraquat Shape-Shifting Expert Witness Quashed, TORTINI (Apr. 24, 2024).

 

[25] Defendants’ Motion to Exclude Plaintiffs’ Expert Testimony on the Issue of Causation, and for Summary Judgment, and Mem. of Law in Support, No. 1:14-CV-01748, MDL 2545, 2017 WL 1104501, at *69–70 (N.D. Ill. Feb. 20, 2017) (citing Reference Manual 259 (3rd ed. 2011), for the proposition that “‘subjective Bayesians are a well-established minority’ of scientists whose methods ‘have rarely been used in court.’”). See also Plaintiffs’ Mem. of Law in Opp. to Motion of AbbVie Defendants to Exclude Plaintiffs’ Expert Testimony on Causation, and for Summary Judgment, MDL No. 2545, Dkt. No. 1753 (N.D. Ill. Mar. 23, 2017).

[26] See David H. Kaye & David Freedman, Reference Guide on Statistics, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 529 (3rd ed. 2011).

[27] In re Testosterone Replacement Therapy Prods. Liab. Litig., MDL No. 2545, 2018 WL 4030585, at *8 (N.D. Ill. Aug. 23, 2018).

[28] John Carlin, Meta-analysis for 2 x 2 tables: a Bayesian approach, 11 STAT. MED. 141 (1992) [cited as Carlin].

[29] Carlin at 157.

[30] Id.

[31] See P. Lambert et al., How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS, 24 STATS. MED. 2401, 2402 (2005). See also Andrew Gelman, Prior distributions for variance parameters in hierarchical models, 1 BAYESIAN ANALYSIS 515 (2006); E. Pullenayegum, An informed reference prior for between-study heterogeneity in meta-analyses of binary outcomes, 30 STATS. MED. 3082 (2010).

[32] Deposition of Martin Wells, in Martin v. Actavis, Inc., No. 15-cv-4292, 2018 WL 7350886 (N.D. Ill. Apr. 2, 2018).

[33] Martin v. Actavis, Inc., Case No. 15 C 4292, MDL No. 2545, 430 F. Supp.3d 516, 534 (2019).

[34] A. Lincoff et al., Cardiovascular Safety of Testosterone-Replacement Therapy, 389 NEW ENGL. J. MED. 107, 114 (2023).

[35] Jonathan Swift, The Examiner No. 14 (Nov. 9, 1710), in THE EXAMINER & OTHER PIECES WRITTEN IN 1710-11 at 8, 11-12 (Herbert Davis, ed. 1966).

How Science Works in the New Reference Manual on Scientific Evidence

March 12th, 2026

The Second and Third Editions of the Reference Manual on Scientific Evidence contained a chapter, “How Science Works,” by Professor David Goodstein. This chapter ambitiously set out to cover philosophy and sociology of science to help orient judges as strangers in a strange land. Goodstein’s chapter had been a useful introduction to scientific methodology, and it countered some of the antic ideas seen in some judicial opinions, as well as in some other chapters of the Manual. Goodstein brought a good deal of experience and expertise to the task. He was a distinguished professor of physics and Vice Provost at the California Institute of Technology, and he had written engagingly about scientific discovery and the pathology of science.[1] Sadly, Goodstein died in April 2024. His death may have had some role in the delayed publication of the Fourth Edition of the Manual,[2] and the improvident replacement of his chapter with a new chapter written by authors less articulate about how science works.

The substitute chapter on “How Science Works” was written by two authors considerably less accomplished than the late Professor Goodstein.[3] Michael Weisberg is a professor of philosophy at the University of Pennsylvania, where he is the deputy director of Perry World House, which “analyzes global policy challenges through the realms of climate, democracy, global justice and human rights, and security.” The connection with Perry World House may explain the new chapter’s heavy reliance upon the development of the chlorofluorocarbon (CFC) connection to ozone layer depletion as an exemplar of scientific discovery and knowledge. The University of Pennsylvania webpage describes Weisberg as “educat[ing] the next generation of environmental leaders in the classroom, at the negotiating table, and in the field, ensuring that their voices have maximal impact on addressing the climate crisis.”[4] So we have a philosopher of advocacy science, as it were. Some readers might think those credentials are not optimal for preparing a nuts-and-bolts description of how science works. Reading sections of the new chapter will not diminish their concerns.

Joining Weisberg on this new version of “How Science Works” is Anastasia Thanukos, who works at the University of California Museum of Paleontology. Thanukos has a master’s degree in integrative biology, and a doctorate in science education.[5]

The new “method” chapter has some virtues. As did Goodstein’s chapter, the new authors put peer review into a realistic perspective that should keep judges from being snookered into admitting weak or bogus evidence merely because it has been published in a peer-reviewed journal.[6] The authors should have gone much farther in pointing out that the rise of predatory and pay-to-play journals, as well as journals controlled by advocacy groups, has undermined much of the publishing model of modern science.

Weisberg and Thanukos discuss “expertise” in a way that is interesting but irrelevant to legal cases. They seem blithely unaware that the standard for qualifying an expert witness is extremely low. Who will disabuse them when they argue that “[i]t is worth evaluating the closeness of a scientist’s disciplinary expertise to a scientific topic on which expert testimony is delivered”?[7] In what emerges as a consistent pattern of giving anti-manufacturing-industry examples, the authors point to Richard Scorer as an accomplished scientist who had no specific expertise in CFC ozone depletion. Notwithstanding the lack of specific expertise, an industry-backed group promoted Scorer’s views that criticized the CFC-ozone depletion hypothesis.[8] Citing Naomi Oreskes, the new Manual chapter states that “[t]he problem of scientists with legitimate expertise in one field weighing in on a scientific question outside their area of expertise is a pernicious one that has affected public acceptance of science and policy on issues such as climate change and tobacco exposure.”[9] Later, when Weisberg and Thanukos discuss the Milward case, they miss the pernicious influence that flowed from allowing Martyn Smith, a toxicologist, to give methodologically muddled opinion testimony on epidemiology. Pernicious is where you find it, and the authors of the new chapter find virtually all untoward instances of poor scientific method and conduct to originate from manufacturing industry.

Weisberg and Thanukos introduce a discussion of the “replication crisis,” a phrase and concept absent from the third edition of the Reference Manual.[10] The authors express some skepticism that there is an actual crisis over replication,[11] but their focus on climate science may mean that they are simply blinded by groupthink in that discipline. Their discussion of retractions omits the steep rise in retraction rates in most scientific disciplines,[12] and the authors ignore the proliferation of poor-quality journals. Positively, the authors introduce a discussion of study preregistration, a notion absent from the third edition of the Manual, and they explain that such preregistration may serve as a bulwark against data dredging and post hoc analyses.[13] Negatively, the authors ignore how frequently preregistered protocols are not used, or are used and then violated.

Weisberg and Thanukos appropriately ignore “weight of the evidence” (WOE) and “inference to the best explanation” (IBE). Readers might (mistakenly) think that the new chapter implicitly rejects WOE, as put forth by Carl Cranor and credulously accepted by the First Circuit in Milward, when the chapter authors insist that 

“the judge’s task requires a deeper examination of the available evidence and methods by which it was arrived at, as well as an assessment of how the community of experts in this area has evaluated or would evaluate the evidence and reasoning in question.”[14]

Contrary to the Milward decision from 2011, the new authors are not shy about stating the obvious: there is good science, and there is bad science. Not all “judgment” about causality is acceptable and fit for submission to juries.[15] Given the judicial resistance to Rule 702, the obvious here requires stating. Weisberg and Thanukos acknowledge that some scientific judgment is unreliable or invalid because it was based upon work that was not carried out in accordance with current standards for scientific investigation and inference.[16] It should not surprise anyone that most of their examples of bad science are the product of manufacturing industry; the authors are oblivious to bad science sponsored by the lawsuit industry or by non-governmental advocacy organizations (NGOs).

Weisberg and Thanukos frame scientific disagreements and debates as governed by both data and ethical norms. Science is not infinitely contestable. There are identifiable norms, including a norm that scientists should “seek relevant information,” and “scrutinize ideas and evidence.”[17] Contrary to Milward’s standard of judicial abstention and credulity in the face of dodgy causal claims, these authors state what should be obvious, that scientific scrutiny involves, among other things, “an evaluation of methods, considering potential biases and oversights.”[18]

The chapter’s authors, non-lawyers, get closer to the heart of the error in Milward’s abstention doctrine with their recognition of what should have been obvious to the authors of the law chapter (Richter & Capra):

“When research relevant to a trial has not yet been scrutinized by a community with the appropriate technical expertise, a judge may be placed in the position of providing or requesting this scrutiny.”[19]  

Rather than some vague, subjective, and content-free WOE standard, Weisberg and Thanukos urge scientists, and by implication judges as well, to engage in serious efforts to “identify and avoid bias” and abide by ethical guidelines.[20] In other (my) words, the new authors agree that there is a standard of care reflected in the norms of science, and consequently there can be deviations from that standard. For Weisberg and Thanukos, compliance with the normative structure of scientific investigations is at the heart of building up accurate and predictive conclusions from data.[21] As part of their communitarian and normative conception of the scientific process, the authors appear to accept the reality and necessity for judges to act as gatekeepers.[22]

And while this recognition of standards and the need to police against deviations from standards is commendable, Weisberg and Thanukos proceed to give an abridgment of scientific method and process that is distorted and erroneous. They steadfastly ignore the concept of a hierarchy of evidence, and thus provide illegitimate cover for levelers of evidence. In discussing randomized controlled trials, for instance, they note that such trials are often taken as “the gold standard,” but then they counter, without citation, support, or argument, that such trials “are just one line of evidence among many.”[23] The authors elide any discussion of what to do when that “just one line of evidence” conflicts with observational studies.

Notwithstanding their helpful comments about the need to evaluate studies for bias and other errors, these authors enter the Milward controversy with an observation that assessing many lines of evidence is required, can be difficult for courts, and has led to “controversy.” Citing papers, including one by the late Margaret Berger from her notorious lawsuit-industry, SKAPP-funded Coronado Conference, Weisberg and Thanukos float the observation that:

“In science, the available evidence (some of which may come from other research programs not designed to test the hypothesis under consideration) is evaluated as a body, along with the strengths, weaknesses, and caveats relating to each type of data, an approach which, some scholars have argued, the judiciary has not always followed.”98[24]

This claim that the available evidence is evaluated as “a body” is presented as a fact about how science works, without any citation or argument. Several comments are in order. First, the claim is at odds with the authors’ own statements that scientific norms require evaluating each study for biases and other disqualifying flaws. Second, the claim is at odds with the authors’ own reference to systematic reviews and meta-analyses,[25] which are governed by protocols with inclusionary and exclusionary criteria for individual studies, and which require consideration of individual study validity before a study enters the “body” of evidence that is quantitatively or qualitatively evaluated. In the authors’ words, “authors delineate both the criteria that studies must meet for inclusion in the review and the methods that will be used to assess the studies.”[26] The Milward case involved an expert witness who had proffered the very opposite of a systematic review in the form of post hoc rejiggering of studies and their data to fit a pre-conceived litigation goal. In the context of addressing the replication crisis, Weisberg and Thanukos correctly observe that “peer review alone cannot ensure that the conclusions of published studies are actually correct, highlighting the responsibility judges bear in evaluating the validity of the methodologies that contributed to a particular piece of research.”[27] Of course, the Milward case involved a hired expert witness whose unprincipled re-analysis of studies was never peer reviewed or published.

Third, the authors could easily have found additional support for the contrary proposition that individual studies must be evaluated before being considered as part of the entire evidentiary display. The IARC Preamble, which roughly describes how that agency arrives at its so-called hazard classifications of human carcinogenicity, specifies that individual studies within each of three streams of evidence are evaluated for validity and soundness before contributing to a sub-conclusion with respect to (1) epidemiology, (2) toxicology, and (3) mechanistic lines of evidence.[28] Each of those three lines of evidence is adjudged “sufficient,” “limited,” or “inadequate,” by specialists in the three respective areas, before an overall evaluation is reached. There is much that is objectionable in the IARC working group procedures, but this division of labor, with its separate consideration of disparate lines of evidence, and of the studies within each line, before any synthesis is attempted, is present in all systematic review methodology. The suggestion from Weisberg and Thanukos that “the available evidence” in science is “evaluated as a body” is not only unsupported, but demonstrably false and misleading.

This claim about holistic evaluation is a fairly transparent but failed attempt to support a claim made in the chapter on the admissibility of expert witness evidence by Liesa Richter and Daniel Capra, who present an exposition of the notorious Milward case, without criticism, in a way that suggests that the case represents appropriate judicial gatekeeping under Rule 702, and that the case is consistent with scientific norms.[29] The chapter on how science works, after having stated a false claim about scientific methodology for synthesizing and integrating disparate lines of evidence, attempts to provide a gloss on the similar and equally benighted claim of Richter and Capra, in footnote 98:

“98. Some scholars have raised concerns that the courts have on occasion unfairly dismissed numerous individual lines of evidence as being flawed or insufficiently conclusive and concluded that evidence is lacking, when in fact the body of evidence, taken as a whole, points to a clear conclusion. For more, see discussion of Milward v. Acuity Specialty Products Group, Inc.; see also Liesa L. Richter & Daniel J. Capra, The Admissibility of Expert Testimony, in this manual; Berger 2005, supra note 97; and Steve C. Gold, A Fitting Vision of Science for the Courtroom, 3 Wake Forest J.L. & Pol’y 1 (2013).”

Some “scholars” have indeed said such things in their more unscholarly moments; some scholars have criticized Milward, but they are not cited in this new methods chapter. The footnote is accurate, but highly misleading by omission. The First Circuit in Milward also said as much, also without support or justification, and Richter and Capra, in their chapter of the fourth edition of the Manual, parrot the Milward case. Weisberg and Thanukos cite to two articles, by Margaret Berger and by Steven Gold, both law professors, not scientists, and both ideologically hostile to Rule 702 gatekeeping. The Berger article was from a lawsuit-industry, SKAPP-funded symposium known as the Coronado Conference, and the Gold paper comes out of a symposium sponsored by the lawsuit industry itself and the Center for Progressive Reform, an advocacy NGO to which one of Mr. Milward’s expert witnesses, Carl Cranor, belongs. So the authors of the new science methodology chapter failed to cite any scientific source, but cited to papers by lawyers in the capture of the lawsuit industry, and a single (infamous) decision that ignored Rules 702 and 703, as well as the extensive literature on systematic reviews. Weisberg and Thanukos could have cited many sources that contradicted their claim, and the claim of the lawsuit-industry-sponsored lawyers, but they did not. This is what biased and subversive scholarship looks like.

Funding Bias – The New McCarthyism

The selective citation to articles sponsored by the lawsuit industry is ironic in the context of what Weisberg and Thanukos have to say elsewhere about the “funding effect.” Some of what the authors say about personal bias is almost reasonable. For instance, they suggest that funding source is a “valid consideration” in evaluating methodologies and conclusions of expert testimony, and presumably of published studies as well, but not a sufficient reason to exclude such testimony or reliance.[30] Interestingly, these authors ignored the funding and the ideological interests of the symposia they cited in support of the repudiated Milward abstention doctrine.

Over three decades ago, Kenneth Rothman, the founder of Epidemiology, the official journal of the International Society for Environmental Epidemiology (ISEE), wrote his protest against the obsession with funding in an article that should have been cited in the new chapter, for balance. Rothman described the fixation on funding as the “new McCarthyism in science,” which manifested as intolerance toward industry-sponsored studies, and strict scrutiny of “conflict-of-interest” (COI) disclosures.[31] The new McCarthyites amplify the gamesmanship over COI disclosures by excusing or justifying non-disclosure of COIs from scientists who have positional conflicts, or who are aligned with advocacy groups or with the lawsuit industry.

This asymmetrical standard for adjudging conflicts is on full display in the Weisberg and Thanukos chapter, when they claim that “in pharmaceuticals, there is a strong tendency for industry-sponsored trials to favor the industry’s product.”[32] The chapter authors, and their cited source, ignore the context in which pharmaceutical industry scientists publish clinical trial results. A successful clinical trial that shows efficacy with minimal adverse events is the result of years of prior research, including phase I and II trials and preclinical testing. If efficacy fails to appear, or unreasonable harm emerges, at any of those earlier stages, the phase III trial is never done, and so never published. If the medication is never licensed, the phase III trial will generally not be published. The selection effects are obvious and overwhelming in determining that the published results of phase III trials will favor the sponsor. A “failed” phase III trial may even result in a securities class action against the pharmaceutical company. In the realm of observational studies, some work commissioned by manufacturing industry has its origins in the poorly conducted, flawed work of environmental zealots and NGOs. Manufacturing industry has an obvious interest in correcting the scientific record, and again, any carefully done study would rebut the zealots’ work and favor the industry sponsor.

Elsewhere, the authors offer a more balanced assessment when they observe that “[a]ll research is potentially influenced by bias, and every funder of research has the potential to introduce a source of bias.”[33] Similarly, the fourth edition chapter notes that “[a]ll scientists have some sort of motivation for their work, and this does not preclude scientific knowledge building, so long as biased methodologies and interpretations are avoided.”[34] Their recognition that motivated reasoning is everywhere suggests that all research should receive scrutiny regardless of apparent or disclosed funding source.[35]

When it comes to providing examples of funding-effect distortions of science, Weisberg and Thanukos seem to blank on instances created by the lawsuit industry or by environmental NGOs. The reader should contrast how readily and stridently the authors point to bias in industry-sponsored research with how the authors tie themselves up with double negatives when making the same point about NGOs:

“That is not to suggest that government- or nongovernmental organization (NGO)-sponsored research is necessarily free from bias.”[36]

The cognitive dissonance is palpable. The only conclusion that could be drawn from such a locution is that Weisberg and Thanukos have not worked very hard to identify and disclose their own biases.

STATISTICS DONE POORLY

When it comes to explaining and discussing the role of statistical methods in the scientific process, Weisberg and Thanukos go off the rails. This part of the new chapter is an unmitigated disaster, which should have been corrected in the peer review and oversight process. The first sign of trouble becomes apparent upon checking the definition of “p-value” in the chapter’s glossary:

p-value. A statistic that gives the calculated probability that the null hypothesis could be true even given the observed differences between conditions.”[37]

This definition is the transposition fallacy on steroids. Obviously, a p-value cannot be the probability that the null hypothesis “could be true,” when the procedure for calculating a p-value must assume that the null hypothesis is true, along with a specified probability model. Equally important, the p-value does not attach a probability to the null hypothesis; it gives the probability of observing data that diverge from the null expectation at least as much as the data in the particular sample do. The statistics chapter in the Manual, by Hall and Kaye, states the meaning correctly. The coverage of statistical concepts by Weisberg and Thanukos should be studiously ignored.
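The correct logic is easy to exhibit in a simulation: the p-value is computed assuming the null hypothesis is true, and under a true null it is uniformly distributed, so it cannot be the probability that the null “could be true.” A minimal sketch:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulate many two-sample experiments in which the null hypothesis is
# TRUE by construction: both groups are drawn from the same population.
pvals = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    pvals.append(ttest_ind(a, b).pvalue)

# Under a true null, p-values are uniformly distributed: about 5% of
# them fall below 0.05 even though the null is true in every trial.
print(np.mean(np.array(pvals) < 0.05))   # about 0.05
# The p-value is Pr(data at least this extreme, assuming the null is
# true); it is not the probability that the null hypothesis is true.
```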

The outrageously incorrect definition of the p-value in the glossary is not an isolated error. The authors are clearly statistically challenged. In the text of their chapter, they describe the p-value incorrectly, consistent with their aberrant glossary entry:

“[In] the commonly used p-value approach, scientists compare a test hypothesis (e.g., that drug X is effective) to a null (e.g., that there is no difference in cure rates between those who took drug X and those who took a placebo). Scientists then calculate the probability that the null hypothesis could be true even with the observed difference between conditions (e.g., the cure rate of patients taking drug X compared to that of those taking a placebo).”[38]

Weisberg and Thanukos thus conflate frequentist and Bayesian statistics. They also obliterate the meaning of the confidence interval, an important concept for judges and lawyers to understand. Here is how the authors describe the confidence interval in their chapter:

Evaluating estimates: In science (and in contrast to their lay meanings), the terms uncertainty and error refer to the variability of a set of data that is intended to estimate a single number. Uncertainty and error are generally expressed as a range, within which we are confident that, if the study were repeated, the new result would fall. Scientists often use a 95% confidence interval for this purpose.”[39]

Describing the confidence interval in the same sentence as “uncertainty and error” is bound to induce uncertainty and error. The confidence interval provides a range of estimates based upon random error, and captures uncertainty only in the form of imprecision in the point estimate. There are, of course, myriad other kinds of uncertainty and error not captured by the confidence interval. The most important of the authors’ errors is their incorrect assertion that the confidence interval provides a range within which the results of repetitions of the study would fall. This is, again, a variant of the transposition fallacy that the authors commit in their definition of the p-value. The confidence interval provides a range of values that would not be rejected, as alternative null hypotheses, by the data in the obtained sample. Because of random error, future samples would give different results, with different confidence intervals, which would not be co-extensive with the first obtained interval. To be sure, the statistics chapter states the matter correctly, and the epidemiology chapter finally gets it correct in its text (after having mangled the concept in the second and third editions), but the epidemiology chapter perpetuates its previous errors in defining confidence intervals in its glossary. This sort of issue, and it is a serious one, could have been eliminated had there been meaningful peer review and editorial oversight for consistency and accuracy of the Manual as a whole.
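The correct frequentist interpretation concerns the procedure, not any single interval: over repeated samples, about 95% of intervals constructed in this way will cover the one fixed, true parameter value. A minimal simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n = 10.0, 2.0, 50   # the parameter is fixed by construction
half_width = 1.96 * sigma / np.sqrt(n)

trials, covered = 10_000, 0
for _ in range(trials):
    sample_mean = rng.normal(true_mean, sigma, size=n).mean()
    covered += (sample_mean - half_width <= true_mean <= sample_mean + half_width)

print(covered / trials)   # about 0.95
# The 95% describes the long-run coverage of the interval-constructing
# procedure; a single computed interval is not a 95% prediction range
# for the results of future studies.
```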

Weisberg and Thanukos address statistical power in a way that may also mislead readers. They tell us that “[p]ower refers to a test’s ability to reject a hypothesis that is indeed false.” W&T at 88. If only it were so. The authors omit that power is the probability that, at a specified level of significance (say, p < 0.05), and under a specified alternative hypothesis, sample size, and probability model, the sample result will lead to rejection of the null hypothesis in favor of the alternative. The authors then confusingly suggest that “[w]ell-designed studies have sufficient power to detect the differences of interest, but it may not be apparent when a test lacks power.”[40]

If the study at issue presents a confidence interval around a point estimate of interest, then it will be clear what alternative null hypotheses are statistically compatible with the sample result at the pre-specified level of alpha (significance). Any point outside the interval would be rejected by such a test of significance, and so even the casual reader will have a rather good idea of what could and could not be rejected by the sample data. And of course, virtually every study will have low power to detect extremely small increased risks, say, a relative risk of 1.00001, and most studies will have high power to detect risk ratios of over 1,000.
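Power, properly defined, can be computed once the significance level, the assumed alternative, and the precision of the estimator are specified. A minimal sketch for a two-sided z-test on a log risk ratio, with hypothetical inputs:

```python
import math
from scipy.stats import norm

def power_two_sided(effect: float, se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test to detect a true effect of the given
    size when the estimator has the given standard error."""
    z_crit = norm.ppf(1 - alpha / 2)
    z = effect / se
    return norm.sf(z_crit - z) + norm.cdf(-z_crit - z)

# Hypothetical: a study whose logRR estimate has SE = 0.15 has ample
# power to detect RR = 2.0, but very little to detect RR = 1.1.
print(power_two_sided(math.log(2.0), 0.15))   # about 1.0
print(power_two_sided(math.log(1.1), 0.15))   # about 0.10
```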

This new chapter on “How Science Works” also propagates some well-known fallacies about statistical significance testing. Implicit in the authors’ committing the transposition fallacy is a conceptual and mathematical confusion between the coefficient of confidence (1 − α) and the posterior probability of a hypothesis.

The authors’ mistake comes in their insistence upon labeling precision in a test result as “certainty.” In the quote below, the authors’ confusion is clear and obvious:

“Note that the 95% and 5% cutoffs are somewhat arbitrary, and a higher degree of confidence might be required if more certainty were desired—for example if an impactful policy decision depended on the conclusion.”[41]

An impactful [sic] policy decision might well call for more certainty, in the sense of a higher posterior probability, but a higher coefficient of confidence will not necessarily map onto the probability of the hypothesis at all. The authors’ conflation of the significance probability (alpha) with a Bayesian posterior probability arises elsewhere in the chapter:

“(1) A p-value lower than 0.05 does not prove that a null hypothesis is false. It is strong evidence, but there is a small chance that the difference observed could be the result of chance alone.

(2) Using a low p-value (e.g., 0.05) as a criterion for significance sets a high bar for rejecting the null hypothesis, minimizing the chance of getting a false positive… .”[42]

Again, a p-value less than five percent is hardly strong evidence in the context of large database studies, especially when there are multiple comparisons and the outcome is not the pre-specified outcome of the analysis. The authors’ confusion is on full display when they discuss the Zoloft birth defects litigation, where the Third Circuit affirmed the exclusion of plaintiffs’ expert witnesses’ causation opinions and the grant of summary judgment to the defendants. According to the authors’ narrative:

“plaintiffs’ expert’s testimony would have argued that multiple, nonsignificant associations between Zoloft use and birth defects indicated a causal relationship. The testimony was excluded because these results were consistent with a weak causal relationship (a small effect size), one that is ‘so weak that one cannot conclude that the risk is greater than that seen in the general population’.”[43]

Of course, in the Zoloft litigation, the excluded plaintiffs’ expert witnesses were caught red-handed at cherry picking, and at attempting to circumvent the lack of significance with a methodologically incorrect meta-analysis.[44]

If the risk of birth defects among children born to mothers who used Zoloft in pregnancy was no greater than that seen in the general population, then there would be no risk, not a risk “so weak” it cannot be seen. Locutions such as the “results were consistent with a weak causal relationship,” when the results were equally consistent with no causal relationship, suggest that the writers cannot bring themselves to say that the causal hypothesis was simply not supported at all. Of course, no study can exclude an increased risk of 0.01 percent, or a relative risk of 1.01, but at some point, when multiple attempts fail to reveal an increased risk, we may conclude that the proponents of the causal claim have failed to make their case.

META-SHMETA-ANALYSIS

Weisberg and Thanukos address meta-analysis incompletely in the context of systematic reviews. The authors do not provide any insights into how meta-analyses are done, and more glaringly, they fail to mention that not all systematic reviews can or should result in quantitative syntheses of estimates of association. On the positive side, they state that meta-analyses are important in litigation, and that the application of rigorous methodologies should be required.[45] With clearly unintended irony, Weisberg and Thanukos offer, as support for their statement, the Paoli Railroad Yard case, “in which the exclusion of a contested meta-analysis was overturned.”[46]

Weisberg and Thanukos have stepped into the wet corner of a pigsty. The issue in the Paoli case arose from a meta-analysis of mortality rates associated with polychlorobiphenyl (PCB) exposures. The district court excluded the proffered meta-analysis, not because it was unreliable, but because it was novel. Holding the case up as support for a statement about requiring rigorous or reliable methodologies was way off the relevant legal point.

The expert witness who proffered the meta-analysis in Paoli was William Nicholson, a physicist with no professional training in epidemiology. For his opinion that PCBs were causally associated with human liver cancer, Nicholson relied upon a non-peer-reviewed, unpublished report he wrote for the Ontario Ministry of Labour.[47] Nicholson described his report as a “study of the data of all the PCB worker epidemiological studies that had been published,” from which he concluded that there was “substantial evidence for a causal association between excess risk of death from cancer of the liver, biliary tract, and gall bladder and exposure to PCBs.”[48]

The defense challenged Nicholson’s opinion, not under Rule 702, but under case law that pre-dated the Daubert decision.[49] The challenge included pointing out the unreliability of Nicholson’s meta-analysis, but also asserted (incorrectly) the novelty of meta-analysis generally. The district court sustained the defense objection on the grounds of “novelty,” without reaching the reliability analysis.[50] The Third Circuit appropriately reversed and remanded for consideration of the reliability of Nicholson’s meta-analysis.[51]

The consideration of Nicholson’s “meta-analysis” never occurred on remand; plaintiffs’ counsel and their expert witnesses withdrew their reliance upon Nicholson’s analysis. Their about-face was highly prudent. Nicholson’s report presented SMRs (standardized mortality ratios); for the all-cancers statistic, he reported an SMR of 95. What Nicholson did, in this analysis and in all other instances, was simply to divide the observed number of deaths by the expected, and multiply by 100. This crude, simplistic calculation fails to produce a standardized mortality ratio, which requires taking into account the age distributions of the exposed and unexposed groups, and a weighting of the contribution of cases within each age stratum. Nicholson’s presentation of data was nothing short of a fraud.
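For readers who have not worked with SMRs, the difference between Nicholson’s crude ratio and a genuine SMR is easy to display. In a proper indirect standardization, the expected deaths are computed stratum by stratum, applying the reference population’s age-specific rates to the cohort’s own person-years. A minimal sketch, with all numbers invented for illustration:

```python
# Hypothetical age-stratified cohort data (all numbers invented).
strata = [
    # (age band, cohort person-years, reference rate per person-year, observed deaths)
    ("40-49", 5000, 0.001, 8),
    ("50-59", 3000, 0.004, 15),
    ("60-69", 1000, 0.012, 14),
]

observed = sum(d for _, _, _, d in strata)
# Expected deaths are age-weighted: each stratum contributes its own
# person-years multiplied by the reference rate for that age band.
expected = sum(py * rate for _, py, rate, _ in strata)
smr = 100 * observed / expected
print(f"observed = {observed}, expected = {expected:.1f}, SMR = {smr:.0f}")
```

Skipping the stratification, as Nicholson did, produces a number that looks like an SMR but is not one.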

Nicholson’s Report was replete with many other methodological sins. He used a composite of three organs (liver, gall bladder, bile duct) without any biological rationale. His analysis combined male and female results, and even then his analysis of the composite outcome was based upon only seven cases. Of those seven cases, some were not confirmed as primary liver cancer, and at least one was confirmed as not being a primary liver cancer.[52]

As noted, Nicholson failed to standardize the analysis for the age distribution of the observed and expected cases, and he failed to present meaningful analysis of random or systematic error. When he did present p-values, he presented one-tailed values, and he made no corrections for his many comparisons from the same set of data.

Finally, and most egregiously, Nicholson’s meta-analysis was meta-analysis in name only. What he had done was simply to add “observed” and “expected” events across studies to arrive at totals, and to recalculate a bogus risk ratio, which he fraudulently called a standardized mortality ratio. Adding events across studies, without weighting by the inverse of study variance, is not a valid meta-analysis; indeed, it is a well-known example of how to generate the error known as Simpson’s Paradox, which can change the direction or magnitude of any association.[53]
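The reversal is easy to reproduce. In the sketch below (all counts invented), each of two studies shows an elevated risk ratio, yet naively adding cases and denominators across the studies yields a pooled ratio below 1.0:

```python
# Two hypothetical studies; each shows an increased risk among the exposed.
studies = [
    # (exposed cases, exposed N, unexposed cases, unexposed N)
    (9, 100, 60, 1000),   # within-study risk ratio = 1.5
    (20, 1000, 1, 100),   # within-study risk ratio = 2.0
]

def risk_ratio(a, n1, c, n0):
    return (a / n1) / (c / n0)

for i, (a, n1, c, n0) in enumerate(studies, 1):
    print(f"study {i}: RR = {risk_ratio(a, n1, c, n0):.2f}")

# Naive pooling: add events and denominators across studies.
a = sum(s[0] for s in studies)
n1 = sum(s[1] for s in studies)
c = sum(s[2] for s in studies)
n0 = sum(s[3] for s in studies)
print(f"pooled-by-addition RR = {risk_ratio(a, n1, c, n0):.2f}")  # 0.48 -- reversed
```

The direction flips because the two studies have very different exposure prevalences, which is exactly the circumstance in which simple pooling misleads.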

In citing to the Paoli case as a reversal of exclusion of a contested meta-analysis, Weisberg and Thanukos give a truncated analysis that misleads readers, judges, and lawyers. There never was a proper consideration of the reliability vel non of Nicholson’s meta-analysis in the Paoli litigation, and in the final analysis, the Paoli plaintiffs abandoned reliance upon Nicholson’s ill-conceived meta-analysis.

VIRTUE SIGNALING

Although there are no land acknowledgments for the property on which the Federal Judicial Center building is located, Weisberg and Thanukos miss few opportunities to let us know that they are woke scholars. There is the gratuitous and triggering “pregnant people,”[54] which begs any number of biological questions. Then there is the authors’ statement that they are limiting their focus to the “Western conception of science,” which begs another question: why would we call any epistemically valid approach, from any corner of the globe, anything other than “science”?[55]

Equally gratuitous are the authors’ endorsements of DEI and “diversity,” with overbroad generalizations that diversity per se advances science,[56] and a claim that “women, people of color, other historically oppressed groups, and non-Western people” are not taken seriously as scientists.[57] In over 40 years of litigating technical and scientific issues, I have never seen a judge or a lawyer disrespect an expert witness based upon sex, race, ethnicity, or national origin. Of course, I have seen expert witnesses treated roughly for propounding bad science, and that seems perfectly appropriate.


[1] See David Goodstein, ON FACT AND FRAUD: CAUTIONARY TALES FROM THE FRONT LINES OF SCIENCE (2010).

[2] Weisberg and Thanukos frequently refer to other chapters in the Manual, which suggests that their chapter was written late in the development of the Fourth Edition, and perhaps contributed to the delayed publication.

[3] Michael Weisberg & Anastasia Thanukos, How Science Works, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 47 (4th ed. 2025) [cited as W&T].

[4] See Michael Weisberg, University of Pennsylvania Philosophy, at https://philosophy.sas.upenn.edu/people/michael-weisberg.

[5] Anna Thanukos, Staff, available at https://ucmp.berkeley.edu/people/anna-thanukos/

[6] W&T at 72-75.

[7] W&T at 81.

[8] W&T at 81.

[9] W&T at 81 & n.85 (emphasis added), citing Naomi Oreskes & Erik M. Conway, MERCHANTS OF DOUBT: HOW A HANDFUL OF SCIENTISTS OBSCURED THE TRUTH ON ISSUES FROM TOBACCO SMOKE TO GLOBAL WARMING (2010).

[10] W&T at 94-96.

[11] W&T at 95 n.120.

[12] Richard Van Noorden, More than 10,000 research papers were retracted in 2023 — a new record, 624 NATURE 479 (2023).

[13] W&T at 95.

[14] W&T at 55.

[15] W&T at 63, 68.

[16] W&T at 68.

[17] W&T at 65.

[18] W&T at 70.

[19] W&T at 71.

[20] W&T at 66.

[21] W&T at 75.

[22] W&T at 49.

[23] W&T at 83.

[24] W&T at 86 (citing Richter and Capra’s discussion of Milward in chapter one of the Manual, and Professor Gold’s article from the lawsuit industry celebratory conference on the Milward case).

[25] W&T at 99-100.

[26] W&T at 99.

[27] W&T at 96 (emphasis added).

[28] IARC MONOGRAPHS ON THE IDENTIFICATION OF CARCINOGENIC HAZARDS TO HUMANS – PREAMBLE (2019), available at https://monographs.iarc.who.int/wp-content/uploads/2019/07/Preamble-2019.pdf

[29] Liesa L. Richter & Daniel J. Capra, The Admissibility of Expert Testimony, National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 1, 32-33 (4th ed. 2025).

[30] W&T at 76.

[31] Kenneth J. Rothman, “Conflict of interest: the new McCarthyism in science,” 269 J. AM. MED. ASS’N 2782 (1993). See Schachtman, The Rhetoric and Challenge of Conflicts of Interest, TORTINI (July 30, 2013).

[32] W&T at 76 & n.67, citing Sergio Sismondo, Pharmaceutical Company Funding and Its Consequences: A Qualitative Systematic Review, 29 CONTEMP. CLINICAL TRIALS 109 (2008).

[33] W&T at 77.

[34] W&T at 59-60.

[35] W&T at 59-60.

[36] W&T at 76.

[37] W&T at 111.

[38] W&T at 87.

[39] W&T at 90.

[40] W&T at 88.

[41] W&T at 90 (emphasis added).

[42] W&T at 88.

[43] W&T at 90 (internal citations omitted).

[44] In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., 26 F. Supp. 3d 449 (E.D. Pa. 2014); No. 12-md-2342, 2015 WL 314149, at *3 (E.D. Pa. Jan. 23, 2015) (rejecting proffered expert witness opinion based upon “cherry-picking of studies and data within studies”), aff’d, 858 F.3d 787 (3rd Cir. 2017).

[45] W&T at 99.

[46] W&T at 99 & n.134, citing In re Paoli R.R. Yard PCB Litig., 916 F.2d 829 (3d Cir. 1990).

[47] William Nicholson, Report to the Workers’ Compensation Board on Occupational Exposure to PCBs and Various Cancers, for the Industrial Disease Standards Panel (IDSP); IDSP Report No. 2 (Toronto Dec. 1987) [Report].

[48] Id. at 373.

[49] See United States v. Downing, 753 F.2d 1224 (3d Cir.1985).

[50] In re Paoli RR Yard Litig., 706 F. Supp. 358, 372-73 (E.D. Pa. 1988).

[51] In re Paoli RR Yard PCB Litig., 916 F.2d 829 (3d Cir. 1990), cert. denied sub nom. General Elec. Co. v. Knight, 499 U.S. 961 (1991).

[52] Report, Table 22.

[53] See James A. Hanley, et al., Simpson’s Paradox in Meta-Analysis, 11 EPIDEMIOLOGY 613 (2000); H. James Norton & George Divine, Simpson’s paradox and how to avoid it, SIGNIFICANCE 40 (Aug. 2015); George Udny Yule, Notes on the theory of association of attributes in statistics, 2 BIOMETRIKA 121 (1903).

[54] W&T at 84.

[55] W&T at 50.

[56] W&T at 71 nn. 52-54.

[57] W&T at 102.

Reference Manual 4th Edition Corrects Some, Not All, Mistakes on Confidence Intervals

January 9th, 2026

So now that the new, fourth edition of the Reference Manual on Scientific Evidence[1] has been released, inquiring minds may want to know whether it has corrected errors in the previous, third edition.[2] The authors of the new edition have had 14 years to ponder and reflect upon errors, and to correct them.

Judges and lawyers look to the Manual for guidance and understanding of basic concepts, and the first three editions contained significant errors in addressing statistical concepts. There is probably no better place to jump in to see whether the new edition has corrected the prevalent mistakes in defining the statistical concept of a confidence interval, which was botched in several chapters in the third edition.[3] The concept of a confidence interval is important in many statistical applications, but it is especially important in the interpretation of epidemiologic studies.

Contrition is good for the soul. The new edition, in places, evinces an awareness that earlier editions had misled readers, and that the fourth edition needed to do better. And in several key places, including in particular the chapter on epidemiology, the fourth edition has improved in its discussion of confidence intervals.

Professor David Kaye has two chapters in the new edition, one on DNA evidence, and another chapter, with Professor Hal Stern, on statistical evidence.[4] Kaye is a careful writer with substantial statistical expertise. His contributions to the third edition were anodyne treatments of statistical concepts, and his chapters in the new edition seem excellent as well upon first reading. In his chapter on DNA evidence, Kaye alludes to the misunderstandings and misrepresentations of the confidence interval,[5] and in his chapter on statistical evidence, Kaye, along with Stern, gives careful definitions and explications of confidence intervals.

Kaye and Stern call out several cases, frequently cited, for having given clearly incorrect definitions of confidence intervals. This sort of candor to the court is necessary if judges, and lawyers, are going to correct bad practices.[6] The statistics chapter in the fourth edition also does not shy away from calling out the authors of another chapter [epidemiology] in the Reference Manual’s third edition for having given erroneous definitions:

“Language from another reference guide in the previous edition of this Reference Manual that is often quoted may inadvertently convey the incorrect impression that a confidence coefficient such as 95% refers to the percentage of results in (hypothetically) repeated studies that would be expected to lie within the interval reported in the study before the court.”[7]

A very gentle criticism indeed; the epidemiology chapter was manifestly incorrect, and we can all agree that its error was negligent, not intentional. The epidemiology chapter from the third edition did not merely convey the incorrect impression; that chapter contained erroneous definitions of confidence intervals.

Kaye and Stern correctly note that a given confidence interval “does not give the probability that the unknown parameter lies within the confidence interval.”[8] And they helpfully point out that there is no tendency for the true value to lie closer to the point estimate at the center of a confidence interval than to any other value within the interval.[9]

The authors of the new edition’s chapter on epidemiology obviously got the message from Professors Kaye and Stern.[10] Fourth time is a charm. The epidemiology chapter in the third edition had been a mess on statistical issues.[11] Without any acknowledgment or confession of error committed in the first three editions, the authors of the epidemiology chapter in the fourth edition now note:

“Just as the p-value does not provide the probability that the risk estimate found in a study is correct, the confidence interval does not provide the range within which the true risk is likely to lie. In other words, it is a misconception to interpret a 95% confidence interval as representing an interval within which the true value has a 95% probability of being found.”[12]

Unfortunately, in the glossary at the end of the new edition’s epidemiology chapter, the erroneous definition of confidence interval was carried forward from the third edition, without change or correction:

confidence interval. A range of values that reflects random error. Thus, if a confidence level of 0.95 is selected for a study, 95% of similar studies would result in the true relative risk falling within the confidence interval.”[13]

What the authors no doubt meant to write was that:

“95% of similar studies would result in the true relative risk falling within the confidence intervals.”

By putting “interval” in the singular, the authors fell into the trap described by Professors Kaye and Stern, and into the error that the epidemiology chapters of the previous editions committed.

The new edition of the Reference Manual appears to suffer, at least on this statistical issue, from the lack of high-level editing across chapters. The interaction between the authors of the statistics and the epidemiology chapters sorted out a serious error, but the error pops up in new chapters. Michael Weisberg and Anastasia Thanukos have an introductory chapter on How Science Works, which crudely and incorrectly describes confidence intervals:

“Uncertainty and error are generally expressed as a range, within which we are confident that, if the study were repeated, the new result would fall. Scientists often use a 95% confidence interval for this purpose.”[14]

Confidence intervals model only random error, and the “range” around one point estimate does not give us “confidence” that the next point estimate would fall into that range.

The chapter on regression analyses in the third edition of the Reference Manual incorrectly defined confidence intervals.[15] Alas, the fourth edition did not auto-correct:

“Loosely speaking, a confidence interval represents an interval of values in which the true value of a regression coefficient falls within some pre-specified probability (where the true value is the estimate that would be obtained from the same model with a very large sample).”[16]

Why the authors of a highly technical chapter chose to speak loosely, rather than accurately, is a mystery. All the authors of the regression chapter had to do was refer to the accurate, helpful definitions in the statistics chapter.

Why should we care about the Reference Manual’s misleading, incorrect definitions of confidence intervals (or p-values for that matter)? The erroneous definitions and misuses typically place a Bayesian interpretation upon the confidence interval by claiming that the coefficient of confidence (typically 95% when alpha is set at 0.05) states the probability that the parameter, the true population measure, falls within the interval around the point estimate. This misinterpretation might suffice for a Bayesian 95% credible interval, but almost invariably the calculation under discussion is the point estimate ± 1.96 standard errors. Good statistics, like good grammar, costs nothing.
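For concreteness, here is the calculation usually at issue, sketched with invented counts: a relative risk with a 95% confidence interval computed on the log scale. Nothing in the arithmetic yields a probability that the true relative risk lies inside the resulting interval; the interval reflects only the sampling variability of the point estimate.

```python
import math

# Invented 2x2 counts: cases and totals among exposed and unexposed.
a, n1 = 30, 1000   # exposed: 30 cases out of 1,000
c, n0 = 20, 1000   # unexposed: 20 cases out of 1,000

rr = (a / n1) / (c / n0)
# Standard error of log(RR), the usual large-sample approximation.
se_log = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)
lo = math.exp(math.log(rr) - 1.96 * se_log)
hi = math.exp(math.log(rr) + 1.96 * se_log)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # RR = 1.50, 95% CI 0.86-2.62
```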

Whether the conflation of confidence intervals with credible intervals results from ignorance or willful efforts to mislead, it is wrong. And the conflation is part of a long-running rhetorical campaign to mislead about the meaning of the burden of proof and statistical significance, in order to abandon statistical tests, and to green-light precautionary principle judgments as “scientific.”[17]

In past posts, I have cited and quoted any number of scientists and lawyers who have engaged in the effort, either intentional or negligent, to mislead readers about the nature of science, by idealizing and falsely elevating the burden of proof in science, and declaring it to be different from the legal and regulatory burden of proof.[18]

To pick one particularly notorious author, consider junk science writer Naomi Oreskes.[19] In her 2010 book, Oreskes declares:

“The 95 percent confidence standard means that there is only 1 chance in 20 that you believe something that isn’t true.

* * * * *

That is a very high bar. It reflects a scientific worldview in which skepticism is a virtue, credulity is not.”[20]

In fact, in statistics, science, and law, the confidence interval has nothing to do with the burden of proof; rather, it reflects the precision of a single point estimate. Truth is a virtue that may be lost on the likes of Naomi Oreskes, but it is essential to litigating scientific issues. Given that many lawyers in the past had cited the Reference Manual’s chapter on epidemiology for its incorrect definitions of the statistical confidence interval, we should rejoice that this one error has been corrected.


[1] National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE (4th ed. 2025) (cited as RMSE 4th ed.)

[2] National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE (3rd ed. 2011) (cited as RMSE 3rd ed.)

[3] See Nathan Schachtman, Reference Manual – Desiderata for 4th Edition – Part IV – Confidence Intervals, TORTINI (Feb. 10, 2023).

[4] In RMSE 3rd ed., Professor Kaye, along with David Freedman, wrote the chapter on statistical evidence; the two gave careful definitions and explications of confidence intervals. Professor Freedman sadly died before the third edition was released, and he was replaced by Hal Stern in the chapter on statistics in the fourth edition.

[5] David H. Kaye, Reference Guide on Human DNA Identification Evidence, in RMSE 4th ed. at 261 (noting that “the meaning of a confidence interval is subtle, and the estimate commonly is misconstrued”).

[6] See Kaye & Stern, RMSE 4th ed. at 511 n.125 (citing Turpin v. Merrell Dow Pharm., Inc., 959 F.2d 1349, 1353 (6th Cir. 1992) (“If a confidence interval of ‘95 percent between 0.8 and 3.10’ is cited, this means that random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10.”); Garcia v. Tyson Foods, Inc., 890 F. Supp. 2d 1273, 1285 (D. Kan. 2012) (“Dr. Radwin testified that his study was conducted within a confidence interval of 95 — that is ‘if I did this study over and over again, 95 out of a hundred times I would expect to get an average between that interval.’”); In re Silicone Gel Breast Implants Prods. Liab. Litig., 318 F. Supp. 2d 879, 897 (C.D. Cal. 2004) (“a margin of error between 0.5 and 8.0 at the 95% confidence level . . . means that 95 times out of 100 a study of that type would yield a relative risk value somewhere between 0.5 and 8.0”)).

[7] See Kaye & Stern, RMSE 4th ed. at 511 n.125 (citing Rhyne v. U.S. Steel Corp., 474 F. Supp. 3d 733, 744 (W.D.N.C. 2020) (“‘If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the population.’ Reference Guide on Epidemiology, at 580.”)).

[8] Kaye & Stern, RMSE 4th ed. at 512 & n.126 (citing additional errant judicial decisions, and Geoff Cumming & Robert Maillardet, Confidence Intervals and Replication: Where Will the Next Mean Fall?, 11 PSYCH. METHODS 217 (2006)).

[9] Id. at 512.

[10] Steve C. Gold, Michael D. Green, Jonathan Chevrier, & Brenda Eskenazi, Reference Guide on Epidemiology, in RMSE 4th ed. at 897.

[11] Michael D. Green, D. Michal Freedman & Leon Gordis, Reference Guide on Epidemiology, 549, 573, 580, in RMSE 3rd ed.

[12] Steve C. Gold, Michael D. Green, Jonathan Chevrier, & Brenda Eskenazi, Reference Guide on Epidemiology, RMSE 4th ed. at 897, 939.

[13] Id. at 1011.

[14] Michael Weisberg & Anastasia Thanukos, How Science Works, in RMSE 4th ed. at 47, 90.

[15] Daniel Rubinfeld, Reference Guide on Multiple Regression, RMSE 3rd ed. at 303, 342, 352.

[16] Daniel Rubinfeld & David Card, Reference Guide on Multiple Regression and Advanced Statistical Models, in RMSE 4th ed. at 577, 613.

[17] Schachtman, Rhetorical Strategy in Characterizing Scientific Burdens of Proof, TORTINI (Nov. 11, 2014).

[18] See, e.g., Kevin C. Elliott & David B. Resnik, Science, Policy, and the Transparency of Values, 122 ENVT’L HEALTH PERSP. 647 (2014) (exemplifying the rhetorical strategy that idealizes and elevates a burden of proof in science, and then declaring it to be different from legal and regulatory burdens of proof).

[19] Schachtman, Playing Dumb on Statistical Significance, TORTINI (Jan. 4, 2015); The Rhetoric of Playing Dumb on Statistical Significance – Further Comments on Oreskes, TORTINI (Jan. 17, 2015).

[20] Naomi Oreskes & Erik M. Conway, MERCHANTS OF DOUBT: HOW A HANDFUL OF SCIENTISTS OBSCURED THE TRUTH ON ISSUES FROM TOBACCO SMOKE TO GLOBAL WARMING at 156-57 (2010).

Judging Science Symposium

May 25th, 2025

While waiting for the much-delayed fourth edition of the Reference Manual on Scientific Evidence, you may want to take a look at a recent law review issue on expert witness issues. Back in November 2024, the Columbia Science & Technology Law Review held its symposium, “Judging Science,” at Columbia Law School. The symposium explored current judicial practice for, and treatment of, scientific expert witness testimony in the United States. Because the symposium took place at Columbia, we can expect any number of antic proposals for reform, as well.

Among the commentators on the presentations were Hon. Jed S. Rakoff, Judge of the Southern District of New York,[1] and the notorious Provost David Madigan, from Northeastern University.[2]

The current issue (vol. 26, no. 2) of the Columbia Science and Technology Law Review, released on May 23, 2025, contains papers originally presented at the symposium:

Edith Beerdsen, “Unsticking Litigation Science.”

Edward Cheng, “Expert Histories.”

Shari Seidman Diamond & Richard Lempert, “How Experts View the Legal System’s Use of Scientific Evidence.”

David Faigman, “Overcoming Judicial Innumeracy.”

Maura Grossman & Paul Grimm, “Judicial Approaches to Acknowledged and Unacknowledged AI-Generated Evidence.”

Valerie Hans, “Juries Judging Science.”

Enjoy the beach reading!


[1] See Schachtman, “Scientific illiteracy among the judiciary,” Tortini (Feb. 29, 2012).

[2] See, e.g., In re Accutane Litig., No. 271(MCL), 2015 WL 753674 (N.J. Super., Law Div., Atlantic Cty., Feb. 20, 2015) (excluding plaintiffs’ expert witness David Madigan); In re Incretin-Based Therapies Prods. Liab. Litig., 524 F. Supp. 3d 1007 (S.D. Cal. 2021), aff’d, No. 21-55342, 2022 WL 898595 (9th Cir. Mar. 28, 2022) (per curiam). Provost Madigan is stepping down from his position next month. Sonel Cutler, Zoe MacDiarmid & Kate Armanini, “Northeastern Provost David Madigan to step down in June,” The Huntington News (Jan. 16, 2025).

Professor Lahav’s Radically Misguided Treatment of Chancy Tort Causation

September 27th, 2024

In the 19th and early 20th century, scientists and lay people usually conceptualized causation as “deterministic.” Their model of science was perhaps what was called Newtonian, in which observations were invariably described in terms of identifiable forces that acted upon antecedent phenomena. The universe was akin to a pool table, with the movement of the billiard balls fully explained by their previous positions, mass, and movements. There was little need for probability to describe events or outcomes in such a universe.

The 20th century ushered in probabilistic concepts and models in physics and biology. Because tort law is so focused on claims of bodily integrity and harms, I am focused here on claimed health effects. Departing from the Koch-Henle postulates and our understanding of pathogen-based diseases, the latter half of the 20th century saw the rise of observational epidemiology and scientific conclusions about stochastic processes and effects that could best be understood in terms of probabilities, with statistical inferences from samples of populations. The language of deterministic physics failed to do justice to epidemiologic evidence or conclusions. Modern medicine and biology invoked notions of base rates for chronic diseases, which rates might be modified by environmental exposures.

In the wake of the emerging science of epidemiology, the law experienced a new horizon on which many claimed tortogens did not involve exposures uniquely tied to the harms alleged. Rather, the harms asserted were often diseases of ordinary life, but with evidence suggesting that the harms were quantitatively more prevalent or incident among people exposed to the alleged tortogen. Of course, the backwaters of tort law saw reactionary world views on trial, as with claims of trauma-induced cancer, which are with us still. Nonetheless, slowly but not always steadily, the law came to grips with probability and statistical evidence.

In law, as in science, a key component of causal attribution is counterfactual analysis. If A causes B, then in the same world, ceteris paribus, without A we do not have B. Counterfactual analysis applies as much to stochastic processes that are causally influenced by rate changes as it applies to the Newtonian world of billiard balls. Some writers in the legal academy, however, would opportunistically use the advent of probabilistic analyses of health effects to dispose of science altogether. No one has more explicitly exploited the opportunity than Professor Alexandra Lahav.

In an essay published in 2022, Professor Lahav advanced extraordinary claims about probabilistic causation, or what she called “chancy causation.”[1] The proffered definition of chancy causation is bumfuzzling. Lahav provides an example of an herbicide that is “associated” with the type of cancer that the heavily exposed plaintiff developed. She tells us that:

“[t]here is a chance that the exposure caused his cancer, and a chance that it did not. Probability follows certain rules, or tendencies, but these regular laws do not abolish chance. This is a common problem in modern life, where much of what we know about medicines, interventions, and the chemicals to which we are exposed is probabilistic. Following the philosophical literature, I call this phenomenon chancy causation.”[2]

So the rules of probability do not abolish chance? It is hard to know what Lahav is trying to say here. Probability quantifies chance, and gives us an understanding of phenomena and their predictability. When we can model an empirical process with a probability distribution, such as one that is independent and identically distributed, we can often make and test quantitative inferences about the anticipated phenomena.

Lahav vaguely acknowledges that her term, “chancy causation” is borrowed, but she does not give credit to the many authors who have used it before.[3] Lahav does note that the concept of probabilistic causation used in modern-day risk factor epidemiology is different from the deterministic causal claims that dominated tort law in the 19th and the first half of the 20th century. Lahav claims that chancy causation is inconsistent with counterfactual analysis, but she cites no support for her claim, which is demonstrably false. If we previously saw the counterfactual of if A then B, as key to causality, we can readily restate the counterfactual as a probability: A probably causes B. On a counterfactual analysis, then if we do not have A as an antecedent, then we probably do not have B. For a classic tortogen such as tobacco smoking, we can say confidently that tobacco smoking probably causes lung cancer. And for a given instance of lung cancer, we can say based upon the entire evidentiary display, that if a person did not smoke tobacco, he would probably not have developed lung cancer. Of course, the correspondence is not 100 percent, which is only to say that it is probabilistic. There are highly penetrant genetic mutations that may be the cause of a given lung cancer case. We know, however, that such mutations do not cause or explain the large majority of lung cancer cases.

Contrary to Lahav’s ipse dixits, tort law can incorporate, and has accommodated, both general and specific causation in terms of probabilistic counterfactuals. The modification requires us, of course, to address the baseline situation as a rate or frequency of events, and the post-exposure world as one with a modified rate or frequency. Without confusion or embarrassment, we can say that the exposure is the cause of the change in event rates. Modern physics has similarly made peace with probability statements in place of the precise, deterministic “billiard ball” physics that is so useful in a game of snooker, but less so in describing the position of sub-atomic particles. In the first half of the 20th century, the biological sciences learned with some difficulty that they must embrace probabilistic models, in genetic science as well as in epidemiology. Many biological causation models are completely stated in terms of probabilities that are modified by specified conditions.

Lahav intends for her rejection of counterfactual causality to do a lot of work in her post-modern program. By falsely claiming that chancy causation has no factual basis, Lahav jumps to the conclusion that what the law calls for is nothing but “policy,”[4] and “normative decision.”[5] Having fabricated the demise of but-for causation in the context of probabilistic relationships, Lahav suggests that tort law can pretend that the causation question is nothing more than a normative analysis of the defendant’s conduct. (Perhaps it is more than a tad revealing that she does not see that the plaintiff’s conduct is involved in the normative judgment.) Of course, tort law already has ample room for policy and normative considerations built into the concepts of duty and breach of duty.

As we saw with the lung cancer example above, the claim that tobacco smoking probably caused the smoker to develop lung cancer can be entirely factual, and supported by a probabilistic judgment. Lahav calls her erroneous move “pragmatic,” although it has no relationship to the philosophical pragmatism of Peirce or Quine. Lahav’s move is a misrepresentation of probability and of epidemiologic science in the name of compensation free-for-alls. Obtaining a heads on the flip of a fair coin has a probability of 50%; that is a fact, not a normative decision, even though it is, to use Lahav’s vocabulary, “chancy.”

Lahav’s argument is not always easy to follow. In one place, she uses “chancy” to refer to the posterior probability of the correctness of the causal claim:

“the counterfactual standard can be successfully defended against by the introduction of chance. The more conflicting studies, the “more chancy” the causation. By that I do not mean proving a lower probability (although this is a good result from a defense point of view) but rather that more, different study results create the impression of irreducible chanciness, which in turn dictates that the causal relation cannot be definitively proven.”[6]

This usage, which clearly refers to the posterior probability of a claim, is not necessarily limited to so-called non-deterministic phenomena. People could refer to any conclusion, based upon conflicting evidence of deterministic phenomena, as “chancy.”

Lurking in her essay is a further confusion between the posterior probability we might assign to a claim, or to an inference from probabilistic evidence, and the probability of random error. In an interview conducted by Felipe Jiménez,[7] Lahav was more transparent in her confusion, and she explicitly committed the transposition fallacy with her suggestion that customary statistical standards (statistical significance) ensure that even small increased risks, say of 30%, are known to a high degree of certainty.

Despite these confusions, it seems fairly clear that Lahav is concerned with stochastic causal processes, and most of her examples evidence that concern. Lahav poses a hypothetical in which epidemiologic studies show that smokers have a 20% increased risk of developing lung cancer compared with non-smokers.[8] Given that typical smoking histories convey relative risks of 20 to 30, or increased risks of 2,000 to 3,000%, Lahav’s hypothetical may make readers think she is shilling for tobacco companies. In any event, in the face of a 20% increased risk (or relative risk of 1.2), Lahav acknowledges that the probability of a smoker’s developing lung cancer is higher than that of a non-smoker, but “in any particular case the question whether a patient’s lung cancer was caused by smoking is uncertain.” This assertion, however, is untrue; the question is not “uncertain.” She has provided a certain quantification of the increased risk. Furthermore, her hypothetical gives us a good deal of information on which we can say that smoking probably did not result in the patient’s lung cancer. Causation may be chancy because it is based upon a probabilistic inference, but the chances are actually known, and they are low.

Lahav posits a more interesting hypothetical when she considers a case in which there is an 80% chance that a person’s lung cancer is attributable to smoking.[9] We can understand this hypothetical better if we reframe it as a classic urn probability problem. In a given (large) population of non-smokers, we expect 100 lung cancers per year. In a population of smokers, otherwise just like the population of non-smokers, we observe 500 lung cancers. Of the observed number, 100 were “expected” because they happen without exposure to the putative causal agent, and 400 are “excess.” The relative risk would be 5, or a 400% increased risk, still well below the actual measure of risk from long-term smoking, but the attributable risk would be [(RR-1)/RR], or 0.8 (or 80%). If we imagine an urn with 100 white “expected” balls, and 400 red “excess” balls added, then any given draw from the urn, with replacement, yields an 80% probability of a red ball, or an excess case. Of course, if we can see the color, we may come to a consensus judgment that the ball is actually red. But on our analogy to discerning the cause of a given lung cancer, we have at present nothing by way of evidence with which to call the question, and so it remains “chancy” or probabilistic. The question is not, however, in any way normative. The answer is different quantitatively in the 20% and in the 400% hypotheticals.
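The urn arithmetic is just the standard attributable-fraction calculation, and a two-line function (using the formula from the text) makes the contrast between Lahav’s two hypotheticals explicit:

```python
def attributable_fraction(rr):
    """Fraction of cases among the exposed that are excess cases:
    AF = (RR - 1) / RR, per the formula in the text."""
    return (rr - 1) / rr

for rr in (1.2, 2.0, 5.0):
    print(f"RR = {rr}: probability a given exposed case is an excess case "
          f"= {attributable_fraction(rr):.0%}")
# RR = 1.2 -> 17%; RR = 2.0 -> 50%; RR = 5.0 -> 80%
```

The difference between the 17% and 80% answers is a matter of fact, driven entirely by the magnitude of the relative risk; no normative judgment enters the computation.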

Lahav asserts that we are in a state of complete ignorance once a smoker has lung cancer.[10] This is not, however, true. We have the basis for a probabilistic judgment that will probably be true. It may well be true that the probability of attribution will be affected by the probability that the relative risk = 5 is correct. If the posterior probability for the claim that smoking causes lung cancer by increasing its risk 400% is only 30%, then of course, we could not make the attribution in a given case with an 80% probability of correctness. In actual litigation, the argument is often framed on an assumption arguendo that the increased risk is greater than two, so that only the probability of attribution is involved. If the posterior probability of the claim that exposure to the tortogen increased risk by 400% or 20,000% was only 0.49, then the plaintiff would lose. If the posterior probability of the increased risk was greater than 0.5, the finder of fact could find that the specific causation claim had been carried if the magnitude of the relative risk, and the attributable risk, were sufficiently large. This inference on specific causation would not be a normative judgment; it would be guided by factual evidence about the magnitude of the relevant increased risk.

Lahav advances a perverse skepticism that any inferences about individuals can be drawn from information about rates or frequencies in groups of similar individuals. Yes, there may always be some debate about what is “similar,” but successive studies may well draw the net tighter around what is the appropriate class. Lahav’s skepticism, and her outright denialism about inferences from general causation to specific causation, are common among some in the legal academy, but they ignore that group-to-individual inferences are drawn in epidemiology in multiple contexts. Regressions for disease prediction are based upon individual data within groups, and the regression equations are then applied to future individuals to help predict those individuals’ probability of future disease (such as heart attack or breast cancer), or their probability of cancer-free survival after a specific therapy. Group-to-individual inferences are, of course, also the basis for prescribing decisions in clinical medicine. These are not normative inferences; they are based upon evidence-based causal thinking about probabilistic inferences.

In the early tobacco litigation, defendants denied that tobacco smoking caused lung cancer, but they argued that even if it did, and the relative risk were 20, the specific causation inference in a given case was still insecure because the epidemiologic study tells us nothing about the particular case. Lahav seems to be channeling the tobacco-company argument, which has long since been rejected in the substantive law of causation. Indeed, as noted, epidemiologists do draw inferences about individual cases from population-based studies when they invoke clinical prediction models such as the Framingham cardiovascular risk model, or the Gail breast cancer prediction model. Physicians base important clinical interventions, both pharmacologic and surgical, for individuals upon population studies. Lahav asserts, without evidence, that the only difference between an intervention based upon an 80% or a 30% probability is a “normative implication.”[11] The difference is starkly factual, not normative, and describes a long-term likelihood of success, as well as an individual probability of success.

Post-Modern Causation

What we have in Lahav’s essay is the ultimate post-modern program, which asserts, without evidence, that when causation is “chancy,” or indeterminate, courts leave the realm of science and step into the twilight zone of “normative decisions.” Lahav suggests that there is an extreme plasticity to the very concept of causation such that causation can be whatever judges want it to be. I for one sincerely doubt it. And if judges make up some Lahav-inspired concept of normative causation, the scientific community would rightfully scoff.

Establishing causation can be difficult, and many so-called mass tort litigations have failed for want of sufficient, valid evidence supporting causal claims. The late Professor Margaret Berger reacted to this difficulty in a more forthright way by arguing for the abandonment of general causation, or cause-in-fact, as an element of tort claims under the law.[12] Berger’s antipathy to requiring causation manifested in her hostility to judicial gatekeeping of the validity of expert witness opinions. Her animus against requiring causation and gatekeeping under Rule 702 was so strong that it exceeded her lifespan. Berger’s chapter in the third edition of the Reference Manual on Scientific Evidence, which came out almost one year after her death, embraced the First Circuit’s notorious anti-Daubert decision in Milward, which also post-dated her passing.[13]

Professor Lahav has previously expressed a disdain for the causation requirement in tort law. In an earlier paper, “The Knowledge Remedy,” Lahav argued for an extreme, radical precautionary principle approach to causation.[14] Lahav believes that the likes of David Michaels have “demonstrated” that manufactured uncertainty is a genuine problem, but not one that affects her main claims. Remarkably, Lahav sees no problem with manufactured certainty in the advocacy science of many authors for the lawsuit industry.[15] In “Chancy Causation,” Lahav thus credulously repeats Michaels’ arguments, and goes so far as to describe Rule 702 challenges to causal claims as having the “negative effect” of producing “incentives to sow doubt about epidemiologic studies using methodological battles, a strategy pioneered by the tobacco companies … .”[16] Lahav’s agenda is revealed by the absence of any corresponding concern about the negative effect of producing incentives to overstate the findings, or the validity of inferences, in order to obtain unwarranted and unsafe verdicts for claimants.


[1] Alexandra D. Lahav, “Chancy Causation in Tort,” 15 J. Tort L. 109 (2022) [hereafter Chancy Causation].

[2] Chancy Causation at 110.

[3] See, e.g., David K. Lewis, Philosophical Papers: Volume 2 175 (1986); Mark Parascandola, “Evidence and Association: Epistemic Confusion in Toxic Tort Law,” 63 Phil. Sci. S168 (1996).

[4] Chancy Causation at 109.

[5] Chancy Causation at 110-11.

[6] Chancy Causation at 129.

[7] Felipe Jiménez, “Alexandra Lahav on Chancy Causation in Tort,” The Private Law Podcast (Mar. 29, 2021).

[8] Chancy Causation at 115.

[9] Chancy Causation at 116-17.

[10] Chancy Causation at 117.

[11] Chancy Causation at 119.

[12] Margaret A. Berger, “Eliminating General Causation: Notes towards a New Theory of Justice and Toxic Torts,” 97 Colum. L. Rev. 2117 (1997).

[13] Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11 (1st Cir. 2011), cert. denied sub nom., U.S. Steel Corp. v. Milward, 132 S. Ct. 1002 (2012).

[14] Alexandra D. Lahav, “The Knowledge Remedy,” 98 Texas L. Rev. 1361 (2020). See “The Knowledge Remedy Proposal,” Tortini (Nov. 14, 2020).

[15] Chancy Causation at 118 (citing plaintiffs’ expert witness David Michaels, The Triumph of Doubt: Dark Money and the Science of Deception (2020), among others).

[16] Chancy Causation at 129.

Zhang’s Glyphosate Meta-Analysis Succumbs to Judicial Scrutiny

August 5th, 2024

Back in March 2015, the International Agency for Research on Cancer (IARC) issued its working group’s monograph on glyphosate weed killer. The report classified glyphosate as a “probable carcinogen,” which is highly misleading. For IARC, “probable” does not mean more likely than not; indeed, for IARC, “probable” has no quantitative meaning at all. The all-important statement of IARC methods, “The Preamble,” makes this clear.[1]

In the case of glyphosate, the IARC working group concluded that the epidemiologic evidence for an association between glyphosate exposure and cancer (specifically non-Hodgkin lymphoma (NHL)) was limited, which is IARC’s euphemism for insufficient. Instead of epidemiology, IARC’s glyphosate conclusion was based largely upon rodent studies, but even the animal evidence relied upon by IARC was dubious. The IARC working group cherry picked a few arguably “positive” rodent study results with increases in tumors, while ignoring exculpatory rodent studies with decreasing tumor yield.[2]

Although the IARC hazard classification was uncritically embraced by the lawsuit industry, most regulatory agencies, even indulging precautionary principle reasoning, rejected the claim of carcinogenicity. The United States Environmental Protection Agency (EPA), the European Food Safety Authority, the Food and Agriculture Organization (in conjunction with the World Health Organization), the European Chemicals Agency, Health Canada, and the German Federal Institute for Risk Assessment, among others, found that the scientific evidence did not support the claim that glyphosate causes NHL. Very quickly after publication, the IARC monograph became the proximate cause of a huge litigation effort by the lawsuit industry against Monsanto.

The personal injury cases against Monsanto, filed in federal court, were aggregated for pre-trial hearing, before Judge Vince Chhabria, of the Northern District of California, as MDL 2741. Judge Chhabria denied Monsanto’s early Rule 702 motions, and thus cases proceeded to trial, with mixed results.

In 2019, the Zhang study, a curious meta-analysis of some of the available glyphosate epidemiologic studies, appeared in Mutation Research / Reviews in Mutation Research, a toxicology journal that seemed an unlikely venue for a meta-analysis of epidemiologic studies. The authors combined selected results from one large cohort study, the Agricultural Health Study, along with five case-control studies, to reach a summary relative risk of 1.41 (95% confidence interval 1.13-1.75).[3] According to the authors, their “current meta-analysis of human epidemiological studies suggests a compelling link between exposures to GBHs [glyphosate-based herbicides] and increased risk for NHL.”

The Zhang meta-analysis was not well reviewed in regulatory and scientific circles. The EPA found that Zhang used inappropriate methods in her meta-analysis.[4] Academic authors also panned the Zhang meta-analysis in both scholarly[5] and popular articles.[6] The senior author of the Zhang paper, Lianne Sheppard, a professor in the University of Washington Departments of Environmental and Occupational Health Sciences, and Biostatistics, attempted to defend the study in Forbes.[7] Professor Geoffrey Kabat very adeptly showed that this defense was futile.[8] Despite the very serious and real objections to the validity of the Zhang meta-analysis, plaintiffs’ expert witnesses, such as U.C.L.A. epidemiologist Beate Ritz, testified that they trusted and relied upon the analysis.[9]

For five years, the Zhang study was a debating point for lawyers and expert witnesses in the glyphosate litigation, without significant judicial gatekeeping. It took the entrance of Luoping Zhang herself as an expert witness in the glyphosate litigation, and the procedural oddity of her placing exclusive reliance upon her own meta-analysis, to bring the meta-analysis into the unforgiving light of judicial scrutiny.

Zhang is a biochemist and toxicologist at the University of California, Berkeley. Along with two other co-authors of her 2019 meta-analysis paper, she had been a member of the EPA’s 2016 scientific advisory panel on glyphosate. After plaintiffs’ counsel disclosed Zhang as an expert witness, she disclosed her anticipated testimony, as required by Federal Rule of Civil Procedure 26, by attaching and adopting by reference the contents of two of her published papers. The first paper was her 2019 meta-analysis; the other paper discussed putative mechanisms. Neither paper concluded that glyphosate causes NHL. Zhang’s disclosure did not add materially to her 2019 published analysis of six epidemiologic studies on glyphosate and NHL.

The defense challenged the validity of Dr. Zhang’s proffered opinions, and her exclusive reliance upon her own 2019 meta-analysis required the MDL court to pay attention to the failings of that paper, which had previously escaped critical judicial scrutiny. In June 2024, after an oral hearing in Bulone v. Monsanto, at which Dr. Zhang testified, Judge Chhabria ruled that Zhang’s proffered testimony, with its reliance upon her own meta-analysis, was “junk science.”[10]

Judge Chhabria, perhaps encouraged by the recent amendment fortifying Rule 702, issued a remarkable opinion that paid close attention to the indicia of validity of an expert witness’s opinion and the underlying meta-analysis. Judge Chhabria quickly spotted the disconnect between Zhang’s published papers and what is required for an admissible causation opinion. The mechanism paper did not address the extant epidemiology, and both sides in the MDL had emphasized that the epidemiology was critically important for determining whether there was, or was not, causation.

Zhang’s meta-analysis did evaluate some, but not all, of the available epidemiology, but the paper’s conclusion stopped considerably short of the needed opinion on causation. Zhang and colleagues had concluded that there was a “compelling link” between exposures to [glyphosate-based herbicides] and increased risk for NHL. In their paper’s key figure, showcasing the summary estimate of relative risk of 1.41 (95% C.I., 1.13-1.75), Zhang and her co-authors concluded only that exposure was “associated with an increased risk of NHL.” According to Judge Chhabria, in incorporating her 2019 paper into her Rule 26 report, Zhang failed to add a proper holistic causation analysis, as other expert witnesses had done in considering the Bradford Hill predicates and considerations.

Judge Chhabria picked up on another problem, one with both legal and scientific implications. A meta-analysis is out of date as soon as a subsequent epidemiologic study that would have satisfied the inclusion criteria becomes available. Since the publication of her meta-analysis in 2019, additional studies had in fact been published. At the hearing, Dr. Zhang acknowledged that several of them would qualify for inclusion in the meta-analysis, per her own stated methods. Her failure to update the meta-analysis made her report incomplete and inadmissible for a court matter in 2024.

Judge Chhabria might have stopped there, but he took a closer look at the meta-analysis to explore whether it was a valid analysis, on its own terms. Much as Chief Judge Nancy Rosenstengel had done with the made-for-litigation meta-analysis concocted by Martin Wells in the paraquat litigation,[11] Judge Chhabria examined whether Zhang had been faithful to her own stated methods. Like Chief Judge Rosenstengel’s analysis, Judge Chhabria’s analysis stands as a strong rebuttal to the uncharitable opinion of Professor Edward Cheng, who has asserted that judges lack the expertise to evaluate the “expert opinions” before them.[12]

Judge Chhabria accepted the intellectual challenge that Rule 702 mandates. With the EPA memorandum lighting the way, Judge Chhabria readily discerned that “the challenged meta-analysis was not reliably performed.” He declared that the Zhang meta-analysis was “junk science,” with “deep methodological problems.”

Zhang claimed that she was basing the meta-analysis on the subgroups of six studies with the heaviest glyphosate exposure. This claim was undermined by the absence of any exposure-response gradient in the study deemed by Zhang to be of the highest quality. Furthermore, of the remaining five studies, three studies failed to provide any exposure-dependent analysis other than a comparison of NHL rates among “ever” versus “never” glyphosate exposure. As a result of this heterogeneity, Zhang used all the data from studies without exposure characterizations, but only limited data from the other studies that analyzed NHL by exposure levels. And because the highest quality study was among those that provided exposure level correlations, Zhang’s meta-analysis used only some of the data from it.

The analytical problems created by Zhang’s meta-analytical approach were compounded by the included studies’ having measured glyphosate exposures differently, with different cut-points for inclusion as heavily exposed. Some of the excluded study participants would have had heavier exposures than some of those included in the summary analysis.

In the universe of included studies, some provided adjusted results from multi-variate analyses that included other pesticide exposures. Other studies reported only unadjusted results. Even though Zhang’s method stated a preference for adjusted analyses, she inexplicably failed to use adjusted data in the case of one study that provided both adjusted and unadjusted results.

As shown in Judge Chhabria’s review, Zhang’s methodological errors created an incoherent analysis, with methods that could not be justified. Even accepting its own stated methodology, the meta-analysis was an exercise in cherry picking. In the court’s terms, it was, without qualification, “junk science.”

After the filing of briefs, Judge Chhabria provided the parties an oral hearing, with an opportunity for viva voce testimony. Dr. Zhang thus had a full opportunity to defend her meta-analysis. The hearing, however, did not go well for her. Zhang could not talk intelligently about the studies included, or how they defined high exposure. Zhang’s lack of familiarity with her own opinion and published paper was yet a further reason for excluding her testimony.

As might be expected, plaintiffs’ counsel attempted to hide behind peer review, suggesting that the trial court had no business digging into validity concerns because Zhang had published her meta-analysis in what was apparently a peer-reviewed journal. Judge Chhabria would have none of it. In his opinion, publication in a peer-reviewed journal could not obscure the glaring methodological defects of the relied-upon meta-analysis. The court observed that “[p]re-publication editorial peer review, just by itself, is far from a guarantee of scientific reliability.”[13] The EPA memorandum was thus a more telling indicator of the validity issues than publication in a nominally peer-reviewed journal.

Contrary to some law professors who now seek to dismantle expert witness gatekeeping as beyond a judge’s competence, Judge Chhabria dismissed the suggestion that he lacked the expertise to adjudicate the validity issues. Indeed, he displayed a better understanding of the meta-analytic process than did Dr. Zhang. As the court observed, one of the goals of MDL assignments is to permit a single trial judge the time to engage with the scientific issues and to develop “fluency” in the relevant scientific studies. When MDL judges have attained that fluency, it would be criminal for them to ignore the Rule 702 and 703 issues before them.

The Bulone opinion should encourage lawyers to get “into the weeds” of expert witness opinions. There is nothing that a little clear thinking – and glyphosate – cannot clear away. Indeed, now that the weeds of Zhang’s meta-analysis are cleared away, it is hard to fathom that any other expert witness can rely upon it without running afoul of both Federal Rules of Evidence 702 and 703.

There were a few issues not addressed in Bulone. As her oral hearing testimony suggested, Zhang probably lacked the qualifications to proffer the meta-analysis. The bar for qualification as an expert witness, however, is sadly very low. Another issue that might well have been addressed was Zhang’s use of a fixed-effect model for her meta-analysis. Considering that she was pooling data from cohort and case-control studies, some with and some without adjustments for confounders, with different measures of exposure, and some with and some without exposure-dependent analyses, Zhang and her co-authors were not justified in using a fixed-effect model to arrive at a summary estimate of relative risk. Admittedly, this error could easily have been lost in the flood of others.
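The difference between the two pooling models is easy to see in miniature. Below is a minimal Python sketch, using invented log odds ratios and standard errors rather than any actual study data, contrasting a fixed-effect summary (which assumes a single true effect common to all studies) with a DerSimonian-Laird random-effects summary (which estimates the between-study variance and widens the interval accordingly):

import math

# Hypothetical study results as (log odds ratio, standard error) pairs;
# invented numbers, not data from any of the six studies at issue.
studies = [(0.80, 0.12), (-0.10, 0.25), (0.55, 0.30),
           (-0.20, 0.20), (0.90, 0.35), (0.10, 0.15)]

def fixed_effect(studies):
    # Inverse-variance weighting; assumes one true effect for all studies.
    weights = [1 / se ** 2 for _, se in studies]
    est = sum(w * y for w, (y, _) in zip(weights, studies)) / sum(weights)
    return est, math.sqrt(1 / sum(weights))

def random_effects(studies):
    # DerSimonian-Laird: estimate the between-study variance (tau^2)
    # and add it to each study's own variance before weighting.
    fe, _ = fixed_effect(studies)
    w = [1 / se ** 2 for _, se in studies]
    q = sum(wi * (y - fe) ** 2 for wi, (y, _) in zip(w, studies))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c)
    w_star = [1 / (se ** 2 + tau2) for _, se in studies]
    est = sum(wi * y for wi, (y, _) in zip(w_star, studies)) / sum(w_star)
    return est, math.sqrt(1 / sum(w_star))

for label, (est, se) in [("fixed effect", fixed_effect(studies)),
                         ("random effects", random_effects(studies))]:
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"{label}: summary OR {math.exp(est):.2f}, "
          f"95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f}")

On these invented numbers, the fixed-effect summary (roughly OR 1.5, with an interval of about 1.3 to 1.7) looks deceptively precise, while the random-effects summary (roughly OR 1.4, with an interval of about 0.9 to 2.1) spans 1.0. A fixed-effect model applied to heterogeneous inputs assumes the heterogeneity away, and thereby manufactures spurious precision.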

Postscript

Glyphosate is not merely a scientific issue. Its manufacturer, Monsanto, is a frequent target of media outlets (such as Telesur) based in autocratic countries like Communist China and its client state, Venezuela.[14]

天安门广场英雄万岁 [Long live the heroes of Tiananmen Square]


[1] “The IARC-hy of Evidence – Incoherent & Inconsistent Classifications of Carcinogenicity,” Tortini (Sept. 19, 2023).

[2] Robert E Tarone, “On the International Agency for Research on Cancer classification of glyphosate as a probable human carcinogen,” 27 Eur. J. Cancer Prev. 82 (2018).

[3] Luoping Zhang, Iemaan Rana, Rachel M. Shaffer, Emanuela Taioli, Lianne Sheppard, “Exposure to glyphosate-based herbicides and risk for non-Hodgkin lymphoma: A meta-analysis and supporting evidence,” 781 Mutation Research/Reviews in Mutation Research 186 (2019).

[4] David J. Miller, Acting Chief Toxicology and Epidemiology Branch Health Effects Division, U.S. Environmental Protection Agency, Memorandum to Christine Olinger, Chief Risk Assessment Branch I, “Glyphosate: Epidemiology Review of Zhang et al. (2019) and Leon et al. (2019) publications for Response to Comments on the Proposed Interim Decision” (Jan. 6, 2020).

[5] Geoffrey C. Kabat, William J. Price, Robert E. Tarone, “On recent meta-analyses of exposure to glyphosate and risk of non-Hodgkin’s lymphoma in humans,” 32 Cancer Causes & Control 409 (2021).

[6] Geoffrey Kabat, “Paper Claims A Link Between Glyphosate And Cancer But Fails To Show Evidence,” Science 2.0 (Feb. 18, 2019).

[7] Lianne Sheppard, “Glyphosate Science is Nuanced. Arguments about it on the Internet? Not so much,” Forbes (Feb. 20, 2020).

[8] Geoffrey Kabat, “EPA Refuted A Meta-Analysis Claiming Glyphosate Can Cause Cancer And Senior Author Lianne Sheppard Doubled Down,” Science 2.0 (Feb. 26, 2020).

[9] Maria Dinzeo, “Jurors Hear of New Study Linking Roundup to Cancer,” Courthouse News Service (April 8, 2019).

[10] Bulone v. Monsanto Co., Case No. 16-md-02741-VC, MDL 2741 (N.D. Cal. June 20, 2024). See Hank Campbell, “Glyphosate legal update: Meta-study used by ambulance-chasing tort lawyers targeting Bayer’s Roundup as carcinogenic deemed ‘junk science nonsense’ by trial judge,” Genetic Literacy Project (June 24, 2024).

[11] In re Paraquat Prods. Liab. Litig., No. 3:21-MD-3004-NJR, 2024 WL 1659687 (S.D. Ill. Apr. 17, 2024) (opinion sur Rule 702 motion), appealed sub nom., Fuller v. Syngenta Crop Protection, LLC, No. 24-1868 (7th Cir. May 17, 2024). See “Paraquat Shape-Shifting Expert Witness Quashed,” Tortini (April 24, 2024).

[12] Edward K. Cheng, “The Consensus Rule: A New Approach to Scientific Evidence,” 75 Vanderbilt L. Rev. 407 (2022). See “Cheng’s Proposed Consensus Rule for Expert Witnesses,” Tortini (Sept. 15, 2022); “Further thoughts on Cheng’s Consensus Rule,” Tortini (Oct. 3, 2022).

[13] Bulone, citing Valentine v. Pioneer Chlor Alkali Co., 921 F. Supp. 666, 674-76 (D. Nev. 1996), for its distinction between “editorial peer review” and “true peer review,” with the latter’s inclusion of post-publication assessment of a paper being the more probative consideration for Rule 702 purposes.

[14] Anne Applebaum, Autocracy, Inc.: The Dictators Who Want to Run the World 66 (2024).

STEM-ing the Tide of Scientific & Mathematical Illiteracy in the Law

July 31st, 2024

I will blame the heat of this summer for reducing my blog posts to a trickle, but I have written elsewhere. The James G. Martin Center for Academic Renewal invited me to write a piece about the need for science and mathematics literacy in law school, and among lawyers and judges. I have touched on the subject before, but I agreed to submit a short essay that is now published as “STEM-ing the Tide of Scientific and Mathematical Illiteracy in the Law: Attorneys need to understand how numbers work. It’s time we teach them,” James G. Martin Center (July 26, 2024).

Although I was delighted to receive the invitation, I was initially skeptical of the organization. The James G. Martin Center for Academic Renewal (previously known as the Pope Center for Higher Education Policy) is a group that has criticized and sought reforms in higher education from a right-of-center perspective. I have given up calling such groups “conservative,” because the term no longer has any clear meaning. While I continue to view Burke and Oakeshott as having something important to say about our current crises, their counsel has no sway among self-styled conservatives in the Republican party, where neo-cons, theo-cons, paleo-cons, fascismo-cons, crypto-cons, techno-cons, ignoratio-cons, and plain ol’ con-cons have pitched one large ignominious tent.

Still, as a group that describes itself as particularly concerned with free markets, limited constitutional government, and personal responsibility, the Martin Center has much to commend it. Although I cannot agree with everything promoted on the Center’s website, which carries articles from many authors, the group’s publications seemed sufficiently heterodox for me to consider it as a publishing venue.

My piece focused on the need for a modicum of scientific and statistical acumen and training among lawyers, and on the ethical lapses that can result from the lack of such training. I chose to use real-world examples of lawyers whose pronouncements in public and in court made them look either untutored or unethical. There was no lack of examples, but perhaps as a test of the Martin Center’s bona fides, I focused on three high-profile “conservative” lawyers: Alan Dershowitz, Ken Paxton, and John Eastman. Of course, the last of these lawyers has now had his license suspended, pending an appeal to the California Supreme Court. I was gratified that the Martin Center received and enthusiastically published my short article, which is now online at the Center’s website.

Paraquat Shape-Shifting Expert Witness Quashed

April 24th, 2024

Another multi-district litigation (MDL) has hit a jarring speed bump. Claims for Parkinson’s disease (PD), allegedly caused by exposure to paraquat dichloride (paraquat), were consolidated, in June 2021, for pre-trial coordination in MDL No. 3004, in the Southern District of Illinois, before Chief Judge Nancy J. Rosenstengel. Like many health-effects litigation claims, the plaintiffs’ claims in these paraquat cases turn on epidemiologic evidence. To make their causation case in the first MDL trial cases, plaintiffs’ counsel nominated a statistician, Martin T. Wells. Last week, Judge Rosenstengel found Wells’ opinion so infected by invalid methodologies and inferences as to be inadmissible under the most recent version of Rule 702.[1] Summary judgment in the trial cases followed.[2]

Back in the 1980s, paraquat gained some legal notoriety in one of the most retrograde Rule 702 decisions.[3] Both the herbicide and Rule 702 survived, however, and both remain in wide use. For the last two decades, there have been widespread challenges to the safety of paraquat, and in particular claims that paraquat can cause PD or parkinsonism under some circumstances. Despite this background, the plaintiffs’ counsel in MDL 3004 began with four problems.

First, paraquat is closely regulated for agricultural use in the United States. Under federal law, paraquat can be used to control the growth of weeds only “by or under the direct supervision of a certified applicator.”[4] The regulatory record created an uphill battle for plaintiffs.[5] Under the Federal Insecticide, Fungicide, and Rodenticide Act (“FIFRA”), the U.S. EPA has regulatory and enforcement authority over the use, sale, and labeling of paraquat.[6] As part of its regulatory responsibilities, in 2019, the EPA systematically reviewed available evidence to assess whether there was an association between paraquat and PD. The agency’s review concluded that “there is limited, but insufficient epidemiologic evidence at this time to conclude that there is a clear associative or causal relationship between occupational paraquat exposure and PD.”[7] In 2021, the EPA issued its Interim Registration Review Decision, and reapproved the registration of paraquat. In doing so, the EPA concluded that “the weight of evidence was insufficient to link paraquat exposure from pesticidal use of U.S. registered products to Parkinson’s disease in humans.”[8]

Second, beyond the EPA’s review, no other published review, systematic or otherwise, had reached the conclusion that paraquat causes PD.[9]

Third, the plaintiffs’ claims faced another serious impediment. Their counsel placed their reliance upon Professor Martin Wells, a statistician on the faculty of Cornell University. Unfortunately for plaintiffs, Wells has been known to operate as a “cherry picker,” and his methodology had previously been reviewed in an unfavorable light. Another MDL court, which examined a review and meta-analysis propounded by Wells, found that his reports “were marred by a selective review of data and inconsistent application of inclusion criteria.”[10]

Fourth, the plaintiffs’ claims were before Chief Judge Nancy J. Rosenstengel, who was willing to do the hard work required under Rule 702, especially as recently amended to clarify and emphasize the gatekeeper’s responsibility to evaluate validity issues in the proffered opinions of expert witnesses. As her 97-page decision evinces, Judge Rosenstengel conducted four days of hearings, which included viva voce testimony from Martin Wells, and she obviously read the underlying papers and reviews, as well as the briefs and the Reference Manual on Scientific Evidence, with great care. What followed did not go well for Wells or the plaintiffs’ claims.[11] Judge Rosenstengel has written an opinion that may be the first careful judicial consideration of the basic requirements of a systematic review.

The court noted that systematic reviewers carefully define a research question and what kinds of empirical evidence will be reviewed, and then collect, summarize, and, if feasible, synthesize the available evidence into a conclusion.[12] The court emphasized that systematic reviewers should “develop a protocol for the review before commencement and adhere to the protocol regardless of the results of the review.”[13]

Wells proffered a meta-analysis, and a “weight of the evidence” (WOE) review from which he concluded that paraquat causes PD and nearly triples the risk of the disease among workers exposed to the herbicide.[14] In his reports, Wells identified a universe of at least 36 studies, but included seven in his meta-analysis. The defense had identified another two studies that were germane.[15]

Chief Judge Rosenstengel’s opinion is noteworthy for its fine attention to detail, detail that matters to the validity of the expert witness’s enterprise. Martin Wells set out to do a meta-analysis, which was all well and good. But with a universe of 36 studies, with sub-findings, alternative analyses, and changing definitions of relevant exposure, the devil lay in the details.

The MDL court was careful to point out that it was not gainsaying Wells’ decision to limit his meta-analysis to case-control studies, or his grading of any particular study as being of low quality. Systematic reviews and meta-analyses are generally accepted techniques that are part of a scientific approach to causal inference, but each has standards, predicates, and requirements for valid use. Expert witnesses must not only use a reliable methodology; under Rule 702(d), they must also reliably apply their chosen methodology to the facts at hand in reaching their conclusions.[16]

The MDL court concluded that Wells’ meta-analysis was not sufficiently reliable under Rule 702 because he failed faithfully and reliably to apply his own articulated methodology. The court followed Wells’ lead in identifying the source and content of his chosen methodology, and simply examined his proffered opinion for compliance with that methodology.[17] The basic principles of validity for conducting meta-analyses were not, in any event, really contested. These principles and requirements were clearly designed to ensure and enhance the reliability of meta-analyses by pre-empting results-driven, reverse-engineered summary estimates of association.

The court found that Wells failed clearly to pre-specify his eligibility criteria. He then proceeded to redefine his exposure definitions, his study inclusion criteria, and his study quality gradings after looking at the evidence. He also inconsistently applied his stated criteria, all in an apparent effort to exclude less favorable study outcomes. These ad hoc steps were among Wells’ deviations from the standards to which he paid lip service.

The court did not exclude Wells because it disagreed with his substantive decisions to include or exclude any particular study, or with his quality grading of any study. Rather, Wells’ meta-analysis did not pass muster under Rule 702 because its methodology was unclear, inconsistently applied, not replicable, and at times transparently reverse-engineered.[18]

The court’s evaluation of Wells was unflinchingly critical. Wells’ proffered opinions “required several methodological contortions and outright violations of the scientific standards he professed to apply.”[19] From his first involvement in this litigation, Wells had violated the basic rules of conducting systematic reviews and meta-analyses.[20] His definition of “occupational” exposure meandered to suit his desire to include one study (with low variance) that might otherwise have been excluded.[21] Rather than pre-specifying his review process, his study inclusion criteria, and his quality scores, Wells engaged in an unwritten “holistic” review process, which he conceded was not objectively replicable. Wells’ approach left him free to include studies he wanted in his meta-analysis, and then provide post hoc justifications.[22] His failure to identify his inclusion/exclusion criteria was a “methodological red flag” in Dr. Wells’ meta-analysis, which suggested his reverse engineering of the whole analysis, the “very antithesis of a systematic review.”[23]

In what the court described as “methodological shapeshifting,” Wells blatantly and inconsistently graded studies he wanted to include, and had already decided to include in his meta-analysis, to be of higher quality.[24] The paraquat MDL court found, unequivocally, that Wells had “failed to apply the same level of intellectual rigor to his work in the four trial selection cases that would be required of him and his peers in a non-litigation setting.”[25]

It was also not lost upon the MDL court that Wells had shifted from a fixed-effect to a random-effects meta-analysis between his principal and rebuttal reports.[26] Basic to the meta-analytic enterprise is a predicate systematic review, properly done, with pre-specification of inclusion and exclusion criteria for the studies that will go into any meta-analysis. The MDL court noted that both sides had cited Borenstein’s textbook on meta-analysis,[27] and that Wells had himself cited the Cochrane Handbook[28] for the basic proposition that objective and scientifically valid study selection criteria should be clearly stated in advance to ensure the objectivity of the analysis.

There was of course legal authority for this basic proposition about prespecification. Given that the selection of studies that go into a systematic review and meta-analysis can be dispositive of its conclusion, undue subjectivity or ad hoc inclusion criteria can easily arrange a desired outcome.[29] Furthermore, meta-analysis carries with it the opportunity to mislead a lay jury with a single (and inflated) risk ratio,[30] obtained by the operator’s manipulation of inclusion and exclusion criteria. This opportunity required the MDL court to examine carefully the methodological rigor of the proffered meta-analysis, to evaluate whether it reflected a valid pooling of data or was concocted to win a case.[31]

Martin Wells had previously acknowledged the dangers of manipulation and subjective selectivity inherent in systematic reviews and meta-analyses. The MDL court quoted from Wells’ testimony in Martin v. Actavis:

QUESTION: You would certainly agree that the inclusion-exclusion criteria should be based upon objective criteria and not simply because you were trying to get to a particular result?

WELLS: No, you shouldn’t load the – sort of cook the books.

QUESTION: You should have prespecified objective criteria in advance, correct?

WELLS: Yes.[32]

The MDL court also picked up on a subtle but important methodological point about which odds ratio to use in a meta-analysis when a study provides multiple analyses of the same association. In his first paraquat deposition, Wells cited the Cochrane Handbook for the proposition that when a study presents both a crude risk ratio and a risk ratio from a multivariate analysis, the adjusted risk ratio (with its corresponding standard error, reflected in its confidence interval) is generally preferable, to reduce the play of confounding.[33] Wells violated this basic principle by ignoring the multivariate analysis in the study that dominated his meta-analysis (Liou) in favor of the unadjusted bivariate analysis. Given that Wells accepted the basic principle, the MDL court found that Wells likely selected the minimally adjusted odds ratio over the multivariate-adjusted odds ratio for inclusion in his meta-analysis in order to obtain the smaller variance (and thus greater weight) of the former. This maneuver was disqualifying under Rule 702.[34]
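The arithmetic behind the court’s point is simple. In an inverse-variance meta-analysis, each study is weighted by the reciprocal of its variance, so the estimate with the smaller standard error counts for more in the pooled result. The following Python sketch, with invented standard errors rather than the actual Liou figures, shows how choosing a study’s minimally adjusted estimate over its multivariate-adjusted estimate can inflate that study’s share of the total weight:

# Invented standard errors, not the actual Liou figures: the same study's
# crude estimate (smaller SE) versus its multivariate-adjusted estimate
# (larger SE), pooled with three other hypothetical studies.
crude_se, adjusted_se = 0.15, 0.30
other_ses = [0.25, 0.28, 0.35]

def weight_share(target_se, other_ses):
    # Inverse-variance weights: weight = 1 / SE^2.
    weights = [1 / se ** 2 for se in [target_se] + other_ses]
    return weights[0] / sum(weights)

print(f"study's weight using adjusted OR: {weight_share(adjusted_se, other_ses):.0%}")
print(f"study's weight using crude OR:    {weight_share(crude_se, other_ses):.0%}")

On these hypothetical numbers, the study’s share of the total weight jumps from roughly 23% to roughly 55%, which is exactly why the choice between adjusted and unadjusted estimates should be pre-specified rather than made after peeking at the data.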

All in all, the paraquat MDL court’s Rule 702 ruling was a convincing demonstration that non-expert generalist judges, with assistance from subject-matter experts, treatises, and legal counsel, can evaluate and identify deviations from methodological standards of care.


[1] In re Paraquat Prods. Liab. Litig., Case No. 3:21-md-3004-NJR, MDL No. 3004, Slip op., ___ F. Supp. 3d ___ (S.D. Ill. Apr. 17, 2024) [Slip op.].

[2] In re Paraquat Prods. Liab. Litig., Op. sur motion for judgment, Case No. 3:21-md-3004-NJR, MDL No. 3004 (S.D. Ill. Apr. 17, 2024). See also Brendan Pierson, “Judge rejects key expert in paraquat lawsuits, tosses first cases set for trial,” Reuters (Apr. 17, 2024); Hailey Konnath, “Trial-Ready Paraquat MDL Cases Tossed After Testimony Axed,” Law360 (Apr. 18, 2024).

[3] Ferebee v. Chevron Chem. Co., 552 F. Supp. 1297 (D.D.C. 1982), aff’d, 736 F.2d 1529 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984). See “Ferebee Revisited,” Tortini (Dec. 28, 2017).

[4] See 40 C.F.R. § 152.175.

[5] Slip op. at 31.

[6] 7 U.S.C. § 136w; 7 U.S.C. § 136a(a); 40 C.F.R. § 152.175. The agency must periodically review the registration of the herbicide. 7 U.S.C. § 136a(g)(1)(A). See Ruckelshaus v. Monsanto Co., 467 U.S. 986, 991-92 (1984).

[7] See Austin Wray & Aaron Niman, Memorandum, Paraquat Dichloride: Systematic review of the literature to evaluate the relationship between paraquat dichloride exposure and Parkinson’s disease at 35 (June 26, 2019).

[8] See also Jeffrey Brent and Tammi Schaeffer, “Systematic Review of Parkinsonian Syndromes in Short- and Long-Term Survivors of Paraquat Poisoning,” 53 J. Occup. & Envt’l Med. 1332 (2011) (“An analysis [of] the world’s entire published experience found no connection between high-dose paraquat exposure in humans and the development of parkinsonism.”).

[9] Douglas L. Weed, “Does paraquat cause Parkinson’s disease? A review of reviews,” 86 Neurotoxicology 180, 180 (2021).

[10] In re Incretin-Based Therapies Prods. Liab. Litig., 524 F. Supp. 3d 1007, 1038, 1043 (S.D. Cal. 2021), aff’d, No. 21-55342, 2022 WL 898595 (9th Cir. Mar. 28, 2022) (per curiam). See “Madigan’s Shenanigans and Wells Quelled in Incretin-Mimetic Cases,” Tortini (July 15, 2022).

[11] The MDL court obviously worked hard to learn the basic principles of epidemiology. The court relied extensively upon the epidemiology chapter in the Reference Manual on Scientific Evidence. Much of that material is very helpful, but its exposition of statistical concepts is at times confused and erroneous. It is unfortunate that courts do not pay more attention to the more precise and accurate exposition in the chapter on statistics. Citing the epidemiology chapter, the MDL court gave an incorrect interpretation of the p-value: “A statistically significant result is one that is unlikely the product of chance.” Slip op. at 17 n.11. And then again, citing the Reference Manual, the court declared that “[a] p-value of .1 means that there is a 10% chance that values at least as large as the observed result could have been the product of random error.” Id. Similarly, the MDL court gave an incorrect interpretation of the confidence interval. In a footnote, the court tells us that “[r]esearchers ordinarily assert a 95% confidence interval, meaning that ‘there is a 95% chance that the “true” odds ratio value falls within the confidence interval range’. In re Zoloft (Sertraline Hydrochloride) Prod. Liab. Litig., MDL No. 2342, 2015 WL 7776911, at *2 (E.D. Pa. Dec. 2, 2015).” Slip op. at 17 n.12. Citing another court for the definition of a statistical concept is a risky business.

[12] Slip op. at 20, citing Lisa A. Bero, “Evaluating Systematic Reviews and Meta-Analyses,” 14 J.L. & Pol’y 569, 570 (2006).

[13] Slip op. at 21, quoting Bero, at 575.

[14] Slip op. at 3.

[15] The nine studies at issue were as follows: (1) H.H. Liou, et al., “Environmental risk factors and Parkinson’s disease; A case-control study in Taiwan,” 48 Neurology 1583 (1997); (2) Caroline M. Tanner, et al., “Rotenone, Paraquat and Parkinson’s Disease,” 119 Envt’l Health Persps. 866 (2011) (a nested case-control study within the Agricultural Health Study (“AHS”)); (3) Clyde Hertzman, et al., “A Case-Control Study of Parkinson’s Disease in a Horticultural Region of British Columbia,” 9 Movement Disorders 69 (1994); (4) Anne-Maria Kuopio, et al., “Environmental Risk Factors in Parkinson’s Disease,” 14 Movement Disorders 928 (1999); (5) Katherine Rugbjerg, et al., “Pesticide exposure and risk of Parkinson’s disease – a population-based case-control study evaluating the potential for recall bias,” 37 Scandinavian J. of Work, Env’t & Health 427 (2011); (6) Jordan A. Firestone, et al., “Occupational Factors and Risk of Parkinson’s Disease: A Population-Based Case-Control Study,” 53 Am. J. of Indus. Med. 217 (2010); (7) Amanpreet S. Dhillon, “Pesticide / Environmental Exposures and Parkinson’s Disease in East Texas,” 13 J. of Agromedicine 37 (2008); (8) Marianne van der Mark, et al., “Occupational exposure to pesticides and endotoxin and Parkinson’s disease in the Netherlands,” 71 J. Occup. & Envt’l Med. 757 (2014); (9) Srishti Shrestha, et al., “Pesticide use and incident Parkinson’s disease in a cohort of farmers and their spouses,” 191 Envt’l Research (2020).

[16] Slip op. at 75.

[17] Slip op. at 73.

[18] Slip op. at 75, citing In re Mirena IUS Levonorgestrel-Related Prod. Liab. Litig. (No. II), 341 F. Supp. 3d 213, 241 (S.D.N.Y. 2018) (“Opinions that assume a conclusion and reverse-engineer a theory to fit that conclusion are . . . inadmissible.”) (internal citation omitted), aff’d, 982 F.3d 113 (2d Cir. 2020); In re Zoloft (Sertraline Hydrochloride) Prod. Liab. Litig., No. 12-md-2342, 2015 WL 7776911, at *16 (E.D. Pa. Dec. 2, 2015) (excluding expert’s opinion where he “failed to consistently apply the scientific methods he articulat[ed], . . . deviated from or downplayed certain well established principles of his field, and . . . inconsistently applied methods and standards to the data so as to support his a priori opinion.”), aff’d, 858 F.3d 787 (3d Cir. 2017).

[19] Slip op. at 35.

[20] Slip op. at 58.

[21] Slip op. at 55.

[22] Slip op. at 41, 64.

[23] Slip op. at 59-60, citing In re Lipitor (Atorvastatin Calcium) Mktg., Sales Pracs. & Prod. Liab. Litig., 892 F.3d 624, 634 (4th Cir. 2018) (“Result-driven analysis, or cherry-picking, undermines principles of the scientific method and is a quintessential example of applying methodologies (valid or otherwise) in an unreliable fashion.”).

[24] Slip op. at 67, 69-70, citing In re Zoloft (Sertraline Hydrochloride) Prod. Liab. Litig., 858 F.3d 787, 795-97 (3d Cir. 2017) (“[I]f an expert applies certain techniques to a subset of the body of evidence and other techniques to another subset without explanation, this raises an inference of unreliable application of methodology.”); In re Bextra and Celebrex Mktg. Sales Pracs. & Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1179 (N.D. Cal. 2007) (excluding an expert witness’s causation opinion because of his result-oriented, inconsistent evaluation of data sources).

[25] Slip op. at 40.

[26] Slip op. at 61 n.44.

[27] Michael Borenstein, Larry V. Hedges, Julian P. T. Higgins, and Hannah R. Rothstein, Introduction to Meta-Analysis (2d ed. 2021).

[28] Jacqueline Chandler, James Thomas, Julian P. T. Higgins, Matthew J. Page, Miranda Cumpston, Tianjing Li, Vivian A. Welch, eds., Cochrane Handbook for Systematic Reviews of Interventions (2d ed. 2023).

[29] Slip op. at 56, citing In re Zimmer Nexgen Knee Implant Prod. Liab. Litig., No. 11 C 5468, 2015 WL 5050214, at *10 (N.D. Ill. Aug. 25, 2015).

[30] Slip op. at 22. The court noted that the Reference Manual on Scientific Evidence cautions that “[p]eople often tend to have an inordinate belief in the validity of the findings when a single number is attached to them, and many of the difficulties that may arise in conducting a meta-analysis, especially of observational studies such as epidemiological ones, may consequently be overlooked.” Id., quoting from Manual, at 608.

[31] Slip op. at 57, citing Deutsch v. Novartis Pharms. Corp., 768 F. Supp. 2d 420, 457-58 (E.D.N.Y. 2011) (“[T]here is a strong risk of prejudice if a Court permits testimony based on an unreliable meta-analysis because of the propensity for juries to latch on to the single number.”).

[32] Slip op. at 64, quoting from Notes of Testimony of Martin Wells, in In re Testosterone Replacement Therapy Prod. Liab. Litig., Nos. 1:14-cv-1748, 15-cv-4292, 15-cv-426, 2018 WL 7350886 (N.D. Ill. Apr. 2, 2018).

[33] Slip op. at 70.

[34] Slip op. at 71-72, citing People Who Care v. Rockford Bd. of Educ., 111 F.3d 528, 537-38 (7th Cir. 1997) (“[A] statistical study that fails to correct for salient explanatory variables . . . has no value as causal explanation and is therefore inadmissible in federal court.”); In re Roundup Prod. Liab. Litig., 390 F. Supp. 3d 1102, 1140 (N.D. Cal. 2018).

Peer Review, Protocols, and QRPs

April 3rd, 2024

In Daubert, the Supreme Court decided a legal question about the proper interpretation of a statute, Rule 702, and then remanded the case to the Ninth Circuit for further proceedings. The Court did, however, weigh in with dicta about several considerations bearing on admissibility decisions. In particular, the Court identified four non-dispositive factors: whether the challenged opinion has been empirically tested; whether it has been published and peer reviewed; whether the underlying scientific technique or method has an acceptable rate of error; and whether it has gained general acceptance.[1]

The context in which peer review was discussed in Daubert is important to understanding why the Court held out peer review as a consideration. One of the bases for the defense challenges to some of the plaintiffs’ expert witnesses’ opinions in Daubert was their reliance upon re-analyses of published studies to suggest that there was indeed an increased risk of birth defects, if only the publication authors had used some other control group or taken some other analytical approach. Re-analyses can be important, but these re-analyses of published Bendectin studies were post hoc, litigation-driven, and obviously result-oriented. The Court’s discussion of peer review reveals that it was not simply creating a box to be checked before a trial court could admit an expert witness’s opinions. Peer review was suggested as a consideration because:

“submission to the scrutiny of the scientific community is a component of “good science,” in part because it increases the likelihood that substantive flaws in methodology will be detected. The fact of publication (or lack thereof) in a peer reviewed journal thus will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”[2]

Peer review, or the lack thereof, for the challenged expert witnesses’ re-analyses was called out because it raised suspicions of lack of validity. Nothing in Daubert, in later decisions, or, more importantly, in Rule 702 itself supports admitting expert witness testimony merely because the witness relied upon peer-reviewed studies, especially when the studies are invalid or are based upon questionable research practices. The Court was careful to point out that peer-reviewed publication was “not a sine qua non of admissibility; it does not necessarily correlate with reliability, … .”[3] The Court thus showed that it was well aware that well-grounded (and thus admissible) opinions may not have been previously published, and that the existence of peer review was simply a potential aid in answering the essential question: whether the proponent of a proffered opinion has shown “the scientific validity of a particular technique or methodology on which an opinion is premised.”[4]

Since 1993, much has changed in the world of bio-science publishing. The wild proliferation of journals, including predatory and “pay-to-play” journals, has disabused most observers of the notion that peer review provides evidence of the validity of a study’s methods. Along with the exponential growth in publications has come an exponential growth in expressions of concern and outright retractions of articles, as chronicled and detailed at Retraction Watch.[5] Some journals encourage authors to nominate the peer reviewers for their manuscripts; some journals let authors block certain scientists from reviewing their submitted manuscripts. If the Supreme Court were writing today, it might well note that peer review is often a feature of bad science, advanced by scientists who know that peer-reviewed publication is the price of admission to the advocacy arena.

Since the Supreme Court decided Daubert, the Federal Judicial Center and the National Academies of Science have provided a Reference Manual on Scientific Evidence, now in its third edition, with a fourth edition on the horizon, to assist judges and lawyers involved in the litigation of scientific issues. Professor Goodstein, in his chapter “How Science Works” in the third edition, provides the Manual’s most extensive discussion of peer review, and emphasizes that peer review “works very poorly in catching cheating or fraud.”[6] Goodstein invokes his own experience as a peer reviewer to note that “peer review referees and editors limit their assessment of submitted articles to such matters as style, plausibility, and defensibility; they do not duplicate experiments from scratch or plow through reams of computer-generated data in order to guarantee accuracy or veracity or certainty.”[7] Indeed, Goodstein’s essay in the Reference Manual characterizes the ability of peer review to warrant study validity as a “myth”:

Myth: The institution of peer review assures that all published papers are sound and dependable.

Fact: Peer review generally will catch something that is completely out of step with majority thinking at the time, but it is practically useless for catching outright fraud, and it is not very good at dealing with truly novel ideas. … It certainly does not ensure that the work has been fully vetted in terms of the data analysis and the proper application of research methods.[8]

Goodstein’s experience as a peer reviewer is hardly idiosyncratic. One standard text on the ethical conduct of research reports that peer review is often ineffective or incompetent, and that it may not catch even simple statistical or methodological errors.[9] According to the authors, Shamoo and Resnik:

“[p]eer review is not good at detecting data fabrication or falsification partly because reviewers usually do not have access to the material they would need to detect fraud, such as the original data, protocols, and standard operating procedures.”[10]

Indeed, without access to protocols, statistical analysis plans, and original data, peer review often cannot identify good-faith or negligent deviations from the standard of scientific care. There is some evidence for this negative assessment of peer review from a test of the counterfactual: reviewers were able to detect questionable, selective reporting when they had access to the study authors’ research protocols.[11]

Study Protocol

The study protocol provides the scientific rationale for a study; clearly defines the research question, the data-collection process, and the key exposure and outcome variables; and describes the methods to be applied, all before data collection commences.[12] The protocol also typically pre-specifies the statistical data analysis. The epidemiology chapter of the current edition of the Reference Manual on Scientific Evidence offers only the bland observation that epidemiologists attempt to minimize bias in observational studies with “data collection protocols.”[13] Epidemiologists and statisticians are much clearer in emphasizing the importance, indeed the necessity, of having a study protocol before commencing data collection. Back in 1988, John Bailar and Frederick Mosteller explained that it was critical, in reporting statistical analyses, to inform readers about how and when the authors devised the study design, and whether they set the design criteria out in writing before they began to collect data.[14]

The necessity of a study protocol is “self-evident,”[15] and a protocol is essential to research integrity.[16] The International Society for Pharmacoepidemiology has issued guidelines for “Good Pharmacoepidemiology Practices,”[17] which call for every study to have a written protocol. Among the requirements set out in the guidelines are descriptions of the research method and study design, operational definitions of exposure and outcome variables, and the projected study sample size. The guidelines provide that a detailed statistical analysis plan may be specified after data collection begins, but before any analysis commences.
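For concreteness, a bare-bones protocol skeleton of the sort these guidelines contemplate (a hypothetical outline, not the ISPE template itself) might include: the research question and scientific rationale; the study design and source population; operational definitions of the exposure and outcome variables; the projected sample size; a pre-specified statistical analysis plan, including the handling of confounders; and a dated record of any amendments made after data collection begins.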

Expert witness opinions on health effects are built upon studies, and so it behooves legal counsel to identify the methodological strengths and weaknesses of key studies by asking whether they had protocols, whether the protocols were methodologically appropriate, and whether the researchers faithfully followed their protocols and statistical analysis plans. Determining the peer-review status of a publication, on the other hand, will often do little to advance a challenge based upon improvident methodology.

In some instances, a published study will have sufficiently detailed descriptions of methods and data that readers, even lawyers, can evaluate their scientific validity or reliability (vel non). In some cases, however, readers will be no better off than the peer reviewers who were deprived of access to protocols, statistical analysis plans, and original data. When a particular study is crucial support for an adversary’s expert witness, a reasonable litigation goal may well be to obtain the protocol and statistical analysis plan, and if need be, the original underlying data. The decision to undertake such discovery is difficult. Discovery of non-party scientists can be expensive and protracted; it will almost certainly be contentious. When expert witnesses rely upon one or a few studies, which telegraph internal validity, this litigation strategy may provide the strongest evidence against the study’s being reasonably relied upon, or its providing “sufficient facts and data” to support an admissible expert witness opinion.


[1] Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593-594 (1993).

[2] Id. at 594 (internal citations omitted) (emphasis added).

[3] Id.

[4] Id. at 593-94.

[5] Retraction Watch, at https://retractionwatch.com/.

[6] Reference Manual on Scientific Evidence at 37, 44-45 (3rd ed. 2011) [Manual].

[7] Id. at 44-45 n.11.

[8] Id. at 48 (emphasis added).

[9] Adil E. Shamoo and David B. Resnik, Responsible Conduct of Research 133 (4th ed. 2022).

[10] Id.

[11] An-Wen Chan, Asbjørn Hróbjartsson, Mette T. Haahr, Peter C. Gøtzsche, and David G. Altman, “Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles,” 291 J. Am. Med. Ass’n 2457 (2004).

[12] Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 477 (2nd ed. 2014).

[13] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 573 (3rd ed. 2011) (“Study designs are developed before they begin gathering data.”).

[14] John Bailar & Frederick Mosteller, “Guidelines for Statistical Reporting in Articles for Medical Journals,” 108 Ann. Intern. Med. 266, 268 (1988).

[15] Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 477 (2nd ed. 2014).

[16] Sandra Alba, et al., “Bridging research integrity and global health epidemiology statement: guidelines for good epidemiological practice,” 5 BMJ Global Health e003236, at p.3 & passim (2020).

[17] See “The ISPE Guidelines for Good Pharmacoepidemiology Practices (GPP),” available at <https://www.pharmacoepi.org/resources/policies/guidelines-08027/>.

Reference Manual – Desiderata for 4th Edition – Part IV – Confidence Intervals

February 10th, 2023

Putting aside the idiosyncratic chapter by the late Professor Berger, most of the third edition of the Reference Manual presented guidance on many important issues.  To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges. On several issues, especially statistical in nature, the fourth edition could benefit from an editor to ensure that the individual chapters, written by different authors, actually agree on key concepts.  One such example is the third edition’s treatment of confidence intervals.[1]

The “DNA Identification” chapter noted that the meaning of a confidence interval is subtle,[2] but I doubt that the authors, David Kaye and George Sensabaugh, actually found it subtle or difficult. In the third edition’s chapter on statistics, David Kaye and co-author, the late David A. Freedman, gave a reasonable definition of confidence intervals in their glossary:

confidence interval. An estimate, expressed as a range, for a parameter. For estimates such as averages or rates computed from large samples, a 95% confidence interval is the range from about two standard errors below to two standard errors above the estimate. Intervals obtained this way cover the true value about 95% of the time, and 95% is the confidence level or the confidence coefficient.”[3]

Intervals, not the interval, which is correct. This chapter made clear that it was the procedure of obtaining multiple samples with intervals that yielded the 95% coverage. In the substance of their chapter, Kaye and Freedman are explicit about how intervals are constructed, and that:

“the confidence level does not give the probability that the unknown parameter lies within the confidence interval.”[4]

Importantly, the authors of the statistics chapter named names; that is, they cited some cases that butchered the concept of the confidence interval.[5] The fourth edition will have a more difficult job because, despite the care taken in the statistics chapter, many more decisions have misstated or misrepresented the meaning of a confidence interval.[6] Citing more cases perhaps will disabuse federal judges of their reliance upon case law for the meaning of statistical concepts.
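A short simulation makes the repeated-sampling interpretation concrete. The Python sketch below, with invented parameters, constructs a 95% confidence interval from each of 10,000 samples and counts how often the intervals contain the true mean; the 95% coverage belongs to the procedure, while any single realized interval either contains the true mean or does not:

import math
import random

TRUE_MEAN, SIGMA, N, TRIALS = 10.0, 2.0, 50, 10_000
random.seed(1)

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    se = SIGMA / math.sqrt(N)  # known-sigma case, for simplicity
    # One realized interval; it either contains TRUE_MEAN or it does not.
    covered += (mean - 1.96 * se) <= TRUE_MEAN <= (mean + 1.96 * se)

print(f"coverage over {TRIALS} intervals: {covered / TRIALS:.1%}")  # about 95%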

The third edition’s chapter on multiple regression defined confidence interval in its glossary:

confidence interval. An interval that contains a true regression parameter with a given degree of confidence.”[7]

The chapter avoided saying anything obviously wrong only by giving a very circular definition. When the chapter substantively described a confidence interval, it ended up giving an erroneous one:

“In general, for any parameter estimate b, the expert can construct an interval around b such that there is a 95% probability that the interval covers the true parameter. This 95% confidence interval is given by: b ± 1.96 (SE of b).”[8]

The formula provided is correct, but the interpretation that there is a 95% probability that the interval covers the true parameter is unequivocally wrong.[9]
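A worked example, with hypothetical numbers, shows where the chapter goes wrong. If b = 1.2 and the SE of b = 0.1, the interval runs from 1.2 − 1.96 × 0.1 ≈ 1.0 to 1.2 + 1.96 × 0.1 ≈ 1.4. The correct frequentist statement is that the procedure that generated this interval would, over many repeated samples, produce intervals covering the true parameter 95% of the time. Whether this particular interval, 1.0 to 1.4, contains the true parameter is not a matter of probability; it either does or it does not.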

The third edition’s chapter by Shari Seidman Diamond on survey research, on the other hand, gave an anodyne example and a definition:

“A survey expert could properly compute a confidence interval around the 20% estimate obtained from this sample. If the survey were repeated a large number of times, and a 95% confidence interval was computed each time, 95% of the confidence intervals would include the actual percentage of dentists in the entire population who would believe that Goldgate was manufactured by the makers of Colgate.

                 *  *  *  *

Traditionally, scientists adopt the 95% level of confidence, which means that if 100 samples of the same size were drawn, the confidence interval expected for at least 95 of the samples would be expected to include the true population value.”[10]

Similarly, the third edition’s chapter on epidemiology correctly defined the confidence interval operationally, in terms of a repeated process that generates intervals which collectively cover the true value 95% of the time:

“A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”[11]

Not content to leave it well said, the chapter’s authors returned to the confidence interval and provided another, more problematic definition, a couple of pages later in the text:

“A confidence interval is a range of possible values calculated from the results of a study. If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[12]

The first sentence refers to “a study”; that is, one study, one range of values. The second sentence then tells us that “the range” (singular, presumably referring back to the single “a study”) will capture 95% of the results from many resamplings from the same population. Now the definition is framed not with respect to the true population parameter, but with respect to the results of many other samples. The authors seem to have endowed the first sample’s confidence interval with the property of including 95% of all future study results, and that is incorrect. Remarkably, a review of the case law shows that courts have gravitated to this second, incorrect definition.

The glossary to the third edition’s epidemiology chapter, however, clearly runs into the ditch:

“confidence interval. A range of values calculated from the results of a study within which the true value is likely to fall; the width of the interval reflects random error. Thus, if a confidence level of .95 is selected for a study, 95% of similar studies would result in the true relative risk falling within the confidence interval.”[13]

Note that the sentence before the semicolon speaks of “a study” with “a range of values,” and asserts a likelihood that the range includes the “true value.” This definition thus used the singular to describe both the study and the range of values. The definition seemed to be saying, clearly but wrongly, that a single interval from a single study has a likelihood of containing the true value. The second full sentence ascribed a probability, 95%, to the true relative risk’s falling within “the interval.” To point out the obvious, “the interval” is singular, and refers back to “a study,” also singular. At best, this definition was confusing; at worst, it was wrong.

The Reference Manual has a problem beyond its own inconsistencies and the refractory resistance of the judiciary to statistical literacy. Any number of law professors, and even scientists, have held out incorrect definitions and interpretations of confidence intervals. It would be helpful for the fourth edition to caution its readers, both bench and bar, about these prevalent misunderstandings.

Here, for instance, is a well-credentialed statistician giving a murky definition in a declaration filed in federal court:

“If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[14]

The expert witness correctly identified the repeated sampling, but ascribed a 95% probability to “the range,” leaving unclear whether the range is the set of all such intervals or the single “95% confidence interval” that is the antecedent of the statement.

Much worse was a definition proffered in a recent law review article by well-known, respected authors:

“A 95% confidence interval, in contrast, is a one-sided or two-sided interval from a data sample with 95% probability of bounding a fixed, unknown parameter, for which no nondegenerate probability distribution is conceived, under specified assumptions about the data distribution.”[15]

The phrase “for which no nondegenerate probability distribution is conceived” is unclear as to whether it refers to the confidence interval or to the unknown parameter. It seems that the phrase modifies the noun closest to it in the sentence, the “fixed, unknown parameter,” which suggests that these authors were simply trying to emphasize that they were giving a frequentist interpretation, and not conceiving of the parameter as a random variable, as Bayesians would. The phrase “no nondegenerate” appears to be a triple negative, since a degenerate distribution is one with no variation. The phrase makes the definition obscure, and raises questions about what it is meant to exclude.

The more concerning aspect of the quoted footnote is its obfuscation of the important distinction between the procedure of repeatedly calculating confidence intervals (which procedure has a 95% success rate in the long run) and the probability that any given instance of the procedure, in a single confidence interval, contains the parameter. The latter probability is either zero or one.

The definition’s reference to “a” confidence interval, based upon “a” data sample, actually leaves the reader with no way of understanding the definition to be referring to the repeated process of sampling, and the set of resulting intervals. The upper and lower interval bounds are themselves random variables that need to be taken into account, but by referencing a single interval from a single data sample, the authors misrepresent the confidence interval and invite a Bayesian interpretation.[16]

Sadly, there is a long tradition of scientists and academics giving errant definitions and interpretations of the confidence interval.[17] Their error is not harmless, because it invites the attribution of a high probability to the claim that the “true” population measure lies within the reported confidence interval. The error encourages readers to believe that the confidence interval is not conditioned upon the single sample result, and it misleads readers into believing that not only random error, but systematic and data errors, are accounted for in the posterior probability.


[1] “Confidence in Intervals and Diffidence in the Courts,” Tortini (Mar. 4, 2012).

[2] David H. Kaye & George Sensabaugh, “Reference Guide on DNA Identification Evidence” 129, 165 n.76.

[3] David H. Kaye & David A. Freedman, “Reference Guide on Statistics” 211, 284-5 (Glossary).

[4] Id. at 247.

[5] Id. at 247 n.91 & 92 (citing DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993); SmithKline Beecham Corp. v. Apotex Corp., 247 F. Supp. 2d 1011, 1037 (N.D. Ill. 2003), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005); In re Silicone Gel Breast Implants Prods. Liab. Litig, 318 F. Supp. 2d 879, 897 (C.D. Cal. 2004) (“a margin of error between 0.5 and 8.0 at the 95% confidence level . . . means that 95 times out of 100 a study of that type would yield a relative risk value somewhere between 0.5 and 8.0.”).

[6] See, e.g., Turpin v. Merrell Dow Pharm., Inc., 959 F.2d 1349, 1353–54 & n.1 (6th Cir. 1992) (erroneously describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”); American Library Ass’n v. United States, 201 F. Supp. 2d 401, 439 & n.11 (E.D. Pa. 2002), rev’d on other grounds, 539 U.S. 194 (2003); Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc., 482 F. Supp. 2d 478, 495 (D.N.J. 2007) (“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”) (apparently relying upon a party’s expert witness’s report), aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd., 344 Fed. Appx. 595 (Fed. Cir. 2009); Eli Lilly & Co. v. Teva Pharms, USA, 2008 WL 2410420, *24 (S.D. Ind. 2008) (stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”); Benavidez v. City of Irving, 638 F. Supp. 2d 709, 720 (N.D. Tex. 2009) (interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”); Pritchard v. Dow Agro Sci., 705 F. Supp. 2d 471, 481, 488 (W.D. Pa. 2010) (excluding Dr. Bennet Omalu, who assigned a 90% probability that an 80% confidence interval excluded a relative risk of 1.0), aff’d, 430 F. App’x 102 (3d Cir.), cert. denied, 132 S. Ct. 508 (2011); Estate of George v. Vermont League of Cities and Towns, 993 A.2d 367, 378 n.12 (Vt. 2010) (erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”); Garcia v. Tyson Foods, 890 F. Supp. 2d 1273, 1285 (D. Kan. 2012) (quoting expert witness Robert G. Radwin, who testified that a 95% confidence interval in a study means “if I did this study over and over again, 95 out of a hundred times I would expect to get an average between that interval.”); In re Chantix (Varenicline) Prods. Liab. Litig., 889 F. Supp. 2d 1272, 1290 n.17 (N.D. Ala. 2012); In re Zoloft Products, 26 F. Supp. 3d 449, 454 (E.D. Pa. 2014) (“A 95% confidence interval means that there is a 95% chance that the ‘true’ ratio value falls within the confidence interval range.”), aff’d, 858 F.3d 787 (3d Cir. 2017); Duran v. U.S. Bank Nat’l Ass’n, 59 Cal. 4th 1, 36, 172 Cal. Rptr. 3d 371, 325 P.3d 916 (2014) (“Statisticians typically calculate margin of error using a 95 percent confidence interval, which is the interval of values above and below the estimate within which one can be 95 percent certain of capturing the ‘true’ result.”); In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832, 842 (2017) (correctly quoting an incorrect definition from the third edition at p. 580), rev’d on other grounds, 235 N.J. 229, 194 A.3d 503 (2018); In re Testosterone Replacement Therapy Prods. Liab., No. 14 C 1748, MDL No. 2545, 2017 WL 1833173, *4 (N.D. Ill. May 8, 2017) (“A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time.”); Maldonado v. Epsilon Plastics, Inc., 22 Cal. App. 5th 1308, 1330, 232 Cal. Rptr. 3d 461 (2018) (“The 95 percent ‘confidence interval’, as used by statisticians, is the ‘interval of values above and below the estimate within which one can be 95 percent certain of capturing the “true” result’.”); Echeverria v. Johnson & Johnson, 37 Cal. App. 5th 292, 304, 249 Cal. Rptr. 3d 642 (2019) (quoting uncritically, and with approval, one of plaintiff’s expert witnesses, Jack Siemiatycki, who gave the jury an example of a study with a relative risk of 1.2, with a “95 percent probability that the true estimate is between 1.1 and 1.3.” According to the court, Siemiatycki went on to explain that this was “a pretty tight interval, and we call that a confidence interval. We call it a 95 percent confidence interval when we calculate it in such a way that it covers 95 percent of the underlying relative risks that are compatible with this estimate from this study.”); In re Viagra (Sildenafil Citrate) & Cialis (Tadalafil) Prods. Liab. Litig., 424 F. Supp. 3d 781, 787 (N.D. Cal. 2020) (“For example, a given study could calculate a relative risk of 1.4 (a 40 percent increased risk of adverse events), but show a 95 percent ‘confidence interval’ of .8 to 1.9. That confidence interval means there is 95 percent chance that the true value—the actual relative risk—is between .8 and 1.9.”); Rhyne v. United States Steel Corp., 74 F. Supp. 3d 733, 744 (W.D.N.C. 2020) (relying upon, and quoting, one of the more problematic definitions given in the third edition at p. 580: “If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the population.”); Wilant v. BNSF Ry., C.A. No. N17C-10-365 CEB (Del. Super. Ct. May 13, 2020) (citing the third edition at p. 573: “a confidence interval provides ‘a range (interval) within which the risk likely would fall if the study were repeated numerous times’”; “[s]o a 95% confidence interval indicates that the range of results achieved in the study would be achieved 95% of the time when the study is replicated from the same population.”); Germaine v. Sec’y Health & Human Servs., No. 18-800V (U.S. Fed. Ct. Claims July 29, 2021) (giving an incorrect definition directly from the third edition, at p. 621: “[a] ‘confidence interval’ is ‘[a] range of values … within which the true value is likely to fall[.]’”).

[7] Daniel Rubinfeld, “Reference Guide on Multiple Regression” 303, 352.

[8] Id. at 342.

[9] See Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman, “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” 31 Eur. J. Epidemiol. 337, 343 (2016).

[10] Shari Seidman Diamond, “Reference Guide on Survey Research” 359, 381.

[11] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” 549, 573.

[12] Id. at 580.

[13] Id. at 621.

[14] In re Testosterone Replacement Therapy Prods. Liab. Litig., Declaration of Martin T. Wells, Ph.D., at 2-3 (N.D. Ill. Oct. 30, 2016).

[15] Joseph Sanders, David Faigman, Peter Imrey, and A. Philip Dawid, “Differential Etiology: Inferring Specific Causation in the Law from Group Data in Science,” 63 Arizona L. Rev. 851, 898 n.173 (2021).

[16] The authors are well-credentialed lawyers and scientists. Peter Imrey was trained in, and has taught, mathematical statistics, biostatistics, and epidemiology; he is a professor of medicine in the Cleveland Clinic Lerner College of Medicine. A. Philip Dawid is a distinguished statistician, an Emeritus Professor of Statistics at Cambridge University, Darwin College, and a Fellow of the Royal Society. David Faigman is the Chancellor & Dean, and the John F. Digardi Distinguished Professor of Law, at the University of California Hastings College of the Law. Joseph Sanders is the A.A. White Professor at the University of Houston Law Center. I have previously pointed out this problem in these authors’ article. See “Differential Etiologies – Part One – Ruling In” (June 19, 2022).

[17] See, e.g., Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004) (“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”); Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 60-61 n.17 (2007) (quoting Clapp and Ozonoff with obvious approval); Déirdre Dwyer, The Judicial Assessment of Expert Evidence 154-55 (Cambridge Univ. Press 2008) (“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”); Frank C. Woodside, III & Allison G. Davis, “The Bradford Hill Criteria: The Forgotten Predicate,” 35 Thomas Jefferson L. Rev. 103, 110 (2013) (“A confidence interval provides both the relative risk found in the study and a range (interval) within which the risk would likely fall if the study were repeated numerous times.”); Christopher B. Mueller, “Daubert Asks the Right Questions: Now Appellate Courts Should Help Find the Right Answers,” 33 Seton Hall L. Rev. 987, 997 (2003) (describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 Judicature 12, 19 (2003) (“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”) (incorrectly citing the first edition of Rothman & Greenland, Modern Epidemiology 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New Reference Manual on Scientific Evidence,” 74 N.C. L. Rev. 1183, 1212 n.172 (1996) (“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

[18] See Brock v. Merrell Dow Pharm., Inc., 874 F.2d 307, 311–12 (5th Cir. 1989) (incorrectly stating that the court need not resolve questions of bias and confounding because “the studies presented to us incorporate the possibility of these factors by the use of a confidence interval”). Bayesian credible intervals can be similarly misleading when the interval reflects only the sample results and sample variance, and not the myriad other ways, such as bias and confounding, in which the estimate may be wrong.
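
The Brock error runs in the other direction as well. A short sketch, again with hypothetical numbers of my own choosing, shows how an interval of either flavor can narrow impressively around a systematically wrong answer once a fixed bias contaminates the data:

    # Hypothetical sketch: a constant bias of +0.5 afflicts every
    # observation; the interval tightens as n grows but keeps missing
    # the true mean, because it measures only sampling variance.
    import numpy as np

    rng = np.random.default_rng(seed=7)
    true_mean, bias, sigma = 1.0, 0.5, 1.0

    for n in (50, 500, 5000):
        sample = rng.normal(true_mean + bias, sigma, n)
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
        print(f"n={n:5d}  interval=({lo:.3f}, {hi:.3f})  "
              f"contains true mean? {lo <= true_mean <= hi}")
    # The interval homes in on 1.5, not 1.0: precision is not accuracy,
    # and no confidence or credible interval computed this way reflects
    # bias or confounding.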