TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

A Bayesian Toehold in the New Reference Guide to Epidemiology

April 4th, 2026

The most recent edition of the Reference Manual on Scientific Evidence distinguishes more carefully between Bayesian and frequentist approaches to statistical analysis than did its previous iterations. In past editions, the authors conflated confidence and credible intervals, an error that is studiously avoided in the text of the epidemiology chapter in the fourth edition.[1]

The chapter acknowledges that “most published research does not” use Bayesian credible intervals or posterior probabilities. The authors then offer a largely unsupported conclusion about a “toehold”:

“Epidemiologic studies assessed by Bayesian statistical analyses have begun to gain a toehold in litigation, although court opinions are still dominated by discussion of traditional significance testing.”[2]

The authors do not define what a toehold is; nor do they specify whether it is a big toe or a pinky toe. The new chapter cites three cases, which, out of the universe of cases, seems like a tiny toe. The three cases cited by the Reference Manual as a toehold raise serious questions about the legitimacy of using Bayesian analyses, at least to date.

  1. Langrell.

In Langrell,[3] one of the three cases cited by the Manual, an expert witness claimed to have used a “Bayesian approach,” but in reality no Bayesian statistics were involved. The Manual describes the result in Langrell as admitting the testimony of a specific causation expert witness who had used a Bayesian approach for specific causation of a cancer “so rare that it was ‘unlikely or impossible for epidemiological studies to be performed’.”[4]

Citing Langrell for the stated proposition was questionable scholarship at best. The case was one of several cancer claims against railroad employers in which Dr. Robert Peter Gale served as an expert witness. Dr. Gale is a well-credentialed clinician whose career has focused on lymphopoietic cancers.[5] He has no apparent expertise in statistics or epidemiology.

In one reported decision, Byrd, Dr. Gale attempted to offer a “Bayesian” opinion that railroad yard exposures caused a worker’s lung cancer. The claimant had also been a two-pack-a-day smoker for many years.[6] The published opinion refers to Dr. Gale’s having used Bayesian methods, but nothing in the opinion suggests that such methods were actually used.[7] Gale appeared to equate Bayesian analysis with a non-quantitative differential etiology. Given the claimant’s extensive smoking history, the trial court excluded Dr. Gale’s proffered opinion on the cause of the claimant’s lung cancer as unreliable.

In another railroad case brought by Saul Hernandez, Gale also claimed to use Bayesian methods to assess the causation of the claimant’s stomach cancer. There is only one mention, however, of Bayes in Gale’s report:

“My opinion is based in Bayesian probabilities which consider the interdependence of individual probabilities. This process is sometimes referred to as differential diagnosis or differential causation determination or differential etiology. Differential diagnosis is a method of reasoning widely-accepted in medicine.”[8]

To be explicit, there was no discussion of prior or posterior probabilities or odds, and no discussion of likelihood ratios or Bayes factors. Nothing in Dr. Gale’s report would warrant his claim that he had done a Bayesian analysis of specific causation or of the “interdependence of individual probabilities” of putative specific causes. The court excluded Dr. Gale’s proffered opinion in Hernandez, with its scant reference to a Bayesian analysis.[9]

The third instance of Gale’s purported use of a Bayesian analysis occurred in the Langrell case, cited by the Manual. The authors of the new Manual do not specify what kind of rare cancer was involved. For the record, Mr. Langrell developed squamous cell carcinoma of the tonsils, the most common type of oropharyngeal cancer, and one that has been studied for many decades. Alcohol, tobacco, and human papillomavirus (HPV) have long been associated with the occurrence of such cancers, and Mr. Langrell had a history of exposure to all three risk factors. Contrary to Gale’s poor-mouthing about a lack of data, there are many large cohort studies of railroad yard workers with diesel fume exposure.[10]

The full extent of the district court’s exposition about Gale’s “Bayesian” method was to state that:

“He testified he used a Bayesian approach, allowing him to ‘consider interdependence of individual probabilities’ and to render an opinion as to ‘whether the weight of the evidence indicates it is more likely than not to a reasonable degree of medical probability that exposure to the carcinogens discussed was a cause of tonsil cancer in Mr. Langrell’.”[11]

There is no evidence that Dr. Gale had the competence to conduct a Bayesian analysis, or that he actually did one. Dr. Gale’s participation in the Langrell, Byrd, and Hernandez cases seems like poor evidence of a toehold for Bayesian methods. Not even a pinky toe.

We might forgive the credulity of the judicial officers in these cases, but why would Dr. Gale state that he had done a Bayesian analysis? The only reason that suggests itself is that Dr. Gale was bloviating in order to give his specific causation opinions an aura of scientific and mathematical respectability. Falsus in uno, falsus in omnibus.[12] In two of the three related cases, his opinion was rejected. The Manual cites only the case in which Gale’s opinion was admitted. The cited opinion offers no support for Gale’s having actually conducted a Bayesian analysis of any sort.

  2. In re Abilify.

The second cited example of a toehold was the use of a Bayesian analysis by a statistician, David Madigan, in the Abilify litigation. Madigan has published on Bayesian statistics, but his litigation activities have repeatedly raised questions about whether his Bayesian analyses are reliable.

The Abilify litigation involved claims that the anti-psychotic medication caused impulsive gambling, eating, shopping, and sex. Of course, psychotic behavior itself involves those impulsive behaviors and many others. The Manual cited a decision of the multi-district litigation court that noted that “[n]umerous federal courts have found Dr. Madigan’s methodology of detecting safety signals using a combination of frequentist and Bayesian algorithms to be reliable under Rule 702 and Daubert.”[13]

The “signals” to which the Manual citation refers are suggestions of possible causal associations; they are hypotheses generated from pharmacovigilance studies of adverse event reports, not tests of those hypotheses. Signals are not causes; they may not rise even to the level of associations. The particular analyses proffered by Madigan for plaintiffs in Abilify, and in many other litigations, involve comparing the rate of reporting specific adverse events for the drug in question with the reporting rate for all drugs, or for comparator drugs. The outcome of these analyses is a reporting rate ratio, not an incidence ratio.

The following 2 x 2 table illustrates how adverse event data are used to create “signals” of disproportional reporting:

                         Event of interest    All other events
   Drug of interest             a                    b
   All other drugs              c                    d

The reporting odds ratio compares the reporting odds for the drug of interest, a/b, with the reporting odds for all other drugs, c/d.
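The arithmetic behind such a signal is simple. The following sketch uses invented counts, not figures from any actual FAERS query, to compute a reporting odds ratio:

```python
# Hypothetical counts for a disproportionality analysis (illustrative only;
# these numbers are not drawn from any actual adverse event database).
#                       event of interest   all other events
# drug of interest            a = 40             b = 960
# all other drugs             c = 2000           d = 97000
a, b, c, d = 40, 960, 2000, 97000

# Reporting odds ratio (ROR): the odds that a report for the drug mentions
# the event of interest, relative to the same odds for all other drugs.
ror = (a / b) / (c / d)  # equivalently (a * d) / (b * c)
print(f"ROR = {ror:.2f}")
```

Note that the ratio compares reporting rates only; even a large ROR says nothing about disease incidence, let alone causation.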

The FDA provides very clear guidance on the meaning and use of such signal-finding algorithms or disproportionality analyses (DPAs):

“In the context of spontaneous report systems, some authors use the term “signal of disproportionate reporting” (SDR) when discussing associations highlighted by DPA methods. In reality, most SDRs that emerge from spontaneous report databases represent non-causal effects because the reports are associated with treatment indications (i.e., confounding by indication), co-prescribing patterns, co-morbid illnesses, protopathic bias, channeling bias, or other reporting artifacts, or, the reported adverse events are already labeled or are medically trivial.”[14]

Disproportionality analyses are not part of analytical epidemiology, but Madigan has tried to pass them off as such in any number of litigations. More discerning courts have excluded his attempts. In the Accutane litigation in Atlantic County, New Jersey, Judge Johnson conducted an extensive pre-trial hearing on challenges to Madigan’s causation opinions, and found them wanting under the New Jersey analogue of Federal Rule of Evidence 702.[15] On appeal, the New Jersey Supreme Court reviewed and affirmed the exclusion of Madigan’s litigation opinions that isotretinoin causes Crohn’s disease.[16]

The pattern of adverse event report filing in connection with isotretinoin has been carefully studied; it illustrates the FDA’s point about artifacts. One study of isotretinoin adverse event reporting showed that attorneys filed 87.8% of the reports, while physicians filed 6.0%, and consumers only 5.1%. For the entire FAERS database during the same period, only 3.6% of reports for all drug reactions came from attorneys (p < .01).[17]

In other areas less affected by litigation-created reporting bias, the results of DPAs have been compared with analytical epidemiology. A DPA of statin use and bladder cancer suggested a reporting odds ratio of 1.48 (95% CI, 1.36–1.61). The authors, in a peer-reviewed publication, reported the result with clearly inappropriate causal language: “Multi-methodological approaches suggest that statins are associated with an increased risk for bladder cancer.”[18] A meta-analysis of analytical epidemiologic studies reported an actual odds ratio of 1.07 (95% CI, 0.95–1.21), a finding interpreted as suggesting “that there was no association between statin use and risk of bladder cancer.”[19]

Dr. Madigan’s use of Bayesian methods to analyze reporting ratios, and his passing them off as evidence that can support causal inference, is a paradigmatic instance of an inappropriate methodology, and poor evidence of a toehold.

  3. In re Testosterone.

The third case cited by the Manual for the toehold proposition arose in the multi-district litigation created for claims against manufacturers of testosterone. This MDL aggregated cases based upon a speculative Public Citizen petition that transdermal testosterone used by men who have low testosterone levels causes heart attacks and strokes. The plaintiffs adopted what appeared to be a strategy of deploying complex arguments and analyses to obfuscate and defeat Rule 702 gatekeeping. As part of this strategy, two of the plaintiffs’ expert witnesses conducted a Bayesian “hypothesis test,” by which they took an out-of-date meta-analysis,[20] removed some studies that they incorrectly deemed duplicative, and recalculated a credible interval in place of a confidence interval.

This Bayesian hypothesis test came up in several decisions of the MDL court. The Manual cited only to a decision dated August 23, 2018, which it characterized as denying a motion to exclude expert witness testimony that advanced a Bayesian critique of epidemiologic studies.[21]

Looking at the cited decision of August 23, 2018, we see a reference to a previous ruling in May 2017, when the court held that an expert witness’s failure and inability to “quantify the cardiovascular risk he finds in his Bayesian analysis … is an issue affecting the weight to be accorded to his analysis, not its admissibility.”[22] On its face, this opinion does not quite make sense given that a Bayesian analysis would necessarily involve a quantification of posterior probability. The referenced May 2017 opinion also demonstrates the court’s failure to understand basic frequentist concepts, when it recited incorrect definitions of p-value and confidence intervals:

“According to conventional statistical practice, such a result—that is, a finding of a positive association between smoking and development of the disease—would be considered statistically significant if there is a 95% probability, also expressed as a “p-value” of <0.05, that the observed association is not the product of chance. If, however, the p-value were greater than 0.05, the observed association would not be regarded as statistically significant, according to prevailing conventions, because there is a greater than 5% probability that the association observed was the result of chance.

* * *

Statistical significance can also be expressed equivalently in terms of a confidence interval. A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time.”[23]
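For contrast with the court’s recitation, the actual frequentist meaning of a 95% confidence interval can be illustrated with a short simulation (parameters invented for illustration): roughly 95% of intervals constructed by the procedure cover the fixed, true parameter over repeated sampling; no single interval has a “95% probability” of containing it.

```python
import math
import random

# Simulate repeated sampling from a population with a fixed, known mean,
# and count how often the standard 95% interval covers that mean.
random.seed(0)
true_mean, sd, n, trials = 10.0, 2.0, 50, 2000
z = 1.96  # normal critical value for 95% coverage

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    half_width = z * sd / math.sqrt(n)  # known-sigma interval, for simplicity
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

# The empirical coverage describes the procedure, not any one interval.
print(f"empirical coverage: {covered / trials:.3f}")
```

The coverage statement is a property of the interval-generating procedure; it is not the probability that chance produced any particular observed association.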

There is, however, also a discussion in the May 2017 decision of the Bayesian hypothesis test, which had been developed by plaintiffs’ expert witnesses, Burt Gerstman and Martin Wells.[24] The new Manual’s citation to the testosterone MDL case seems to be to this Bayesian analysis.

While the testosterone MDL case cited by the Manual refers only obliquely to a putative Bayesian analysis that had no quantification, the May 2017 decision, not cited by the Manual, actually involved a Bayesian analysis that supposedly yielded a posterior probability of 85% that there was some increased risk for a composite of heart attack and stroke outcomes from use of testosterone therapies.

In the May 2017 decision, the MDL court rejected AbbVie’s Rule 702 motion to exclude Gerstman’s opinion based upon the Bayesian hypothesis test. AbbVie’s approach to the challenge to the Gerstman-Wells Bayesian analysis seemed to avoid the complexity inherent in the analysis. The AbbVie motion included several grounds for excluding the Bayesian analysis, not all of which were discussed in the court’s May 2017 decision, including:

“1) the plaintiffs’ witnesses’ failure to publish their analysis;

2) the challenged witness’s having never published a significant Bayesian analysis previously;

3) the absence of Bayesian analyses in the relevant studies on testosterone;

4) the rarity of Bayesian analyses in product liability cases;

5) the witnesses’ failure to state what the actual risk was, as opposed to the probability that it exceeded 1.0; and

6) the defense expert witness’s calculation that the “Increased [cardiovascular] risk meets only a 70% level of evidence, which is far below the 95% level required.”[25]

Grounds one through four were extremely weak as stated, and ground five did not affect the relevancy of the analysis to general causation. Ground six was the shot in the foot, with the defense’s falling into the trap of conflating the coefficient of confidence (95%) with the posterior probability of a Bayesian analysis.

According to the district court’s opinion, AbbVie challenged Gerstman’s Bayesian analysis because Gerstman never used or published on Bayesian statistics, and thus he lacked expertise in Bayesian analysis. This part of the challenge was readily dismissed because the level of qualifications for an expert witness is very low. A somewhat more substantive objection complained that the Bayesian analysis was “inappropriately based on subjective assumptions.”

The MDL court refused to exclude Gerstman’s Bayesian analysis, relying in part upon the suggestion in the statistics chapter of the Reference Manual’s third edition that Bayesians constitute a “well-established minority” in the field of statistics.[26]

On AbbVie’s claim that Bayesian methods are excessively “subjective,” the court declared that AbbVie had failed to explain how the subjective aspect of Bayesian analysis made the proffered Bayesian analysis “any less reliable than frequentist approaches to statistics, which also involve subjective judgments in interpretation of study results.”

Unfortunately, important issues raised by the plaintiffs’ Bayesian meta-analysis were not raised by counsel or addressed by the MDL court’s initial gatekeeping opinion of May 2017. The court briefly revisited the Bayesian analysis as proffered by Martin Wells, with the same lack of specificity, in August 2018.[27] The Bayesian analysis had been prepared jointly by Gerstman and Wells, and the August 2018 decision followed the earlier decision from 2017, without adding any analysis or explanation.

A third challenge to Wells’ Bayesian analysis was filed in 2019, by a different defendant in the testosterone MDL. This challenge was supported by an expert witness report that carefully identified the invalidity of the proffered Bayesian analysis.

Bayes’ Rule is a theorem that provides a posterior probability for a claim or proposition, based upon a prior probability and the strength of the evidence at hand. Unlike frequentist statistics, which treat the population parameter (a mean or risk ratio) as having a fixed but unknown value, Bayesian analyses treat the parameter itself as uncertain, described by prior and posterior probability distributions. Every Bayesian analysis must start with a prior probability, and therein lies a serious methodological problem, not addressed by the testosterone MDL court in May 2017.
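The mechanics of Bayes’ Rule can be shown in a few lines. This sketch uses invented numbers solely to illustrate how a prior and a likelihood ratio combine into a posterior:

```python
# A minimal, hypothetical illustration of Bayes' Rule in odds form:
# posterior odds = prior odds * likelihood ratio. All numbers are invented.

prior_prob = 0.10        # prior probability that the hypothesis is true
likelihood_ratio = 4.0   # P(evidence | H) / P(evidence | not H)

prior_odds = prior_prob / (1 - prior_prob)
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"posterior probability = {posterior_prob:.3f}")
```

Even this toy example makes plain what a genuine Bayesian analysis requires: an explicit prior, explicit likelihoods, and a computed posterior, none of which appeared in the expert reports discussed above.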

The Bayesian hypothesis test advanced by the plaintiffs’ expert witnesses in the testosterone cases was based on a method described by John Carlin.[28] The analysis invoked a prior risk ratio centered on 1.0, which standing alone might seem like a perfectly fair and disinterested prior. The chosen variance around 1.0, which makes up the prior probability distribution, however, was extremely wide and flat, essentially encompassing no risk at the low end and near-certain causation at the high end. Such a flat distribution implies that testosterone’s causing virtually all heart attacks and strokes, preventing virtually all such outcomes, and having no effect at all were roughly equally likely as starting points. Given our very good understanding that testosterone neither causes nor prevents all heart attacks and strokes, the starting assumptions of the plaintiffs’ meta-analysis were unrealistic and counterfactual.
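The distortion worked by such a prior can be quantified. The sketch below assumes, purely for illustration, a normal prior on the log risk ratio centered at zero with a standard deviation of 2, a common “vague” choice; it is not the experts’ actual specification:

```python
import math

# Hypothetical "vague" prior on log(RR): Normal(0, 2). How much prior mass
# does it place on effect sizes that no one considers plausible?

def normal_cdf(x, mu, sigma):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 2.0  # assumed prior parameters, for illustration only

# Prior probability that the true risk ratio exceeds 10:
p_rr_gt_10 = 1 - normal_cdf(math.log(10), mu, sigma)
# Prior probability that the risk ratio is below 0.1 (a 90% protective effect):
p_rr_lt_01 = normal_cdf(math.log(0.1), mu, sigma)

print(f"P(RR > 10) = {p_rr_gt_10:.3f}, P(RR < 0.1) = {p_rr_lt_01:.3f}")
```

Under that assumption, roughly a quarter of the prior probability sits on risk ratios below 0.1 or above 10, effect sizes that no one believes plausible for testosterone and cardiovascular events. A “flat” prior is not a neutral prior.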

Carlin’s method used in the proffered Bayesian meta-analysis in the testosterone cases further assumed a “hierarchical normal model.” Carlin described his assumption as reasonable “as long as the studies are large and observed counts are not too small.”[29] In the dataset used by plaintiffs’ expert witnesses, however, virtually all the studies had very low event counts, often zero or one, in either the TRT or placebo arm, or both. Carlin acknowledged that it was difficult to assess the validity of the normal model, and emphasized that

“[a] study of the sensitivity of conclusions to the choice of prior would be important.”[30]

Subsequent simulation studies of Carlin’s approach have shown that so-called “vague” or “non-informative” priors, such as were used by plaintiffs’ expert witnesses, can exercise an “unintentionally large degree of influence on any inferences.”[31]

AbbVie’s earlier challenges to Gerstman and Wells failed to note that the witnesses had offered no tests of the validity of Carlin’s method in the context of meta-analyzing clinical trials for sparse safety outcomes. The Rule 702 motion filed in the Martin case in 2019, by contrast, attacked the unsupported assumptions of the proffered Bayesian hypothesis test, pointing out not only the subjectivity of the assumed prior probability distribution, but also its counterfactual nature, and the proffered analysis’s failure to comply with the methodological requirements of Carlin’s own method.

There were additional problems with the Bayesian hypothesis test as put forward by plaintiffs’ expert witnesses. First, a causal claim advanced with an 85% posterior probability was bound to be confused with the plaintiffs’ burden of proof of greater than 50%, notwithstanding that the calculated posterior probability did not account for uncertainty from bias and other non-random errors in the aggregated clinical trial data, which were out of date and rested on questionable inclusionary and exclusionary criteria. Second, the posterior probability was based upon a composite end point that combined heart attack and stroke. As a later deposition of one of the Bayesian analysts, Martin Wells, showed, had the Carlin method been applied to the heart attack summary point estimate alone, the posterior probability that TRT causes heart attack would have been less than 50%, and thus the probability that testosterone does not cause heart attack would have been greater than 50%.[32]

Notwithstanding the plaintiffs’ failure to rebut the very specific methodological challenges to their witnesses’ Bayesian analysis, the MDL court denied the third Rule 702 motion to exclude, without meaningful analysis.[33] The case (Martin) was later tried to a jury that returned a verdict for the defense. Neither in Martin nor in any other testosterone case that was tried did plaintiffs actually present their Bayesian analysis to the trier of fact. The likely interpretation of this failure is that the Bayesian analysis was always meant to obfuscate the weaknesses of their causation case and to help deflect Rule 702 challenges.

The ultimate verdict on the plaintiffs’ case, and on the Bayesian hypothesis test with its ill-informed “non-informative” priors, was returned only after most of the MDL cases had been tried or settled. In 2023, a “mega-trial,” a large, well-conducted randomized controlled trial, was concluded and published, with findings of no increased risk of heart attack or stroke after long-term use of TRT in men who resembled the TRT plaintiffs. The trial enrolled over 5,000 men; a primary composite cardiovascular end-point event occurred in 182 men (7.0%) on testosterone therapy and in 190 men (7.3%) receiving placebo, for a hazard ratio below one (HR = 0.96; 95% CI, 0.78–1.17). None of the components of the composite (heart attack, stroke) showed an increased risk.[34]

“Falshood flies, and Truth comes limping after it; so that when Men come to be undeceived, it is too late, the Jest is over, and the Tale has had its Effect: Like a Man who has thought of a good Repartee, when the Discourse is changed, or the Company parted: Or, like a Physician who hath found out an infallible Medicine after the Patient is dead.”[35]

CONCLUSION

The Reference Manual’s chapter on epidemiology claims that Bayesian analyses have gained a toehold in litigation. The authors cited three cases, all involving the evaluation of health effects. The first (Langrell) involved a claim of specific causation, and the cited opinion showed no evidence of an actual Bayesian analysis. It was one of three cases in which the same expert witness, Dr. Gale, claimed to have used Bayesian analysis; the other two, not cited, rejected the admissibility of his proffered testimony.

The second case cited (In re Abilify) actually involved a Bayesian analysis, but one used for a so-called disproportionality analysis, a technique for generating signals of possible health effects. The misuse of that technique by the Bayesian analyst (David Madigan) was overlooked by the court, and by the Reference Manual.

The third case cited by the Manual, In re Testosterone, also involved an actual Bayesian analysis, in the form of a Bayesian hypothesis test. The proffered analysis did, in theory, speak to a material issue of general causation. The Manual’s credulous citation, and the MDL court’s gatekeeping, however, overlooked that the methodology was misspecified and misapplied in multiple ways.

If these three citations are a toehold, then we need a tow-truck for these wrecks!


[1] Steve C. Gold, Michael D. Green, Jonathan Chevrier, & Brenda Eskenazi, Reference Guide on Epidemiology, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 939 (4th ed. 2025) [cited as GGCE].

[2] GGCE at 963 n.178.

[3] Langrell v. Union Pac. Ry. Co., No. 8:18CV57, 2020 WL 3037271, at *3 (D. Neb. June 5, 2020).

[4] Id.

[5] See, e.g., Robert Peter Gale, et al., FETAL LIVER TRANSPLANTATION (1987); Robert Peter Gale & Thomas Hauser, CHERNOBYL: THE FINAL WARNING (1988); Kenneth A. Foon, Robert Peter Gale, et al., IMMUNOLOGIC APPROACHES TO THE CLASSIFICATION AND MANAGEMENT OF LYMPHOMAS AND LEUKEMIAS (1988); Eric Lax & Robert Peter Gale, RADIATION: WHAT IT IS, WHAT YOU NEED TO KNOW (2013).

[6] Byrd v. Union Pacific RR, 453 F. Supp. 3d 1260 (D. Neb. 2020).

[7] Id. at 1270 (“Dr. Gale states that his opinion is based on Bayesian probabilities which consider the interdependence of individual probabilities. This process is sometimes referred to as differential diagnosis or differential etiology.”).

[8] Report of Robert Peter Gale in Saul Hernandez at 13 (July 23, 2019)[on file with author]. There was no evidence that Mr. Hernandez was tested for infection by helicobacter pylori.

[9] Hernandez v. Union Pacific RR, No. 8:18CV62 (D. Neb. Aug. 14, 2020).

[10] See, e.g., Monireh Sadat Seyyedsalehi, Giulia Collatuzzo, Federica Teglia & Paolo Boffetta, Occupational exposure to diesel exhaust and head and neck cancer: a systematic review and meta-analysis of cohort studies, 33 EUR. J. CANCER PREV. 435 (2024).

[11] Langrell v. Union Pac. Ry. Co., No. 8:18CV57, 2020 WL 3037271, at *3-4 (D. Neb. June 5, 2020).

[12] Dr. Gale’s testimony has not fared well elsewhere. See, e.g., In re Incretin-Based Therapies Prods. Liab. Litig., 524 F.Supp.3d 1007 (S.D. Cal. 2021) (excluding Gale); Wilcox v. Homestake Mining Co., 619 F.3d 1165 (10th Cir. 2010); June v. Union Carbide Corp., 577 F.3d 1234 (10th Cir. 2009) (affirming exclusion of Dr. Gale and entry of summary judgment); Finestone v. Florida Power & Light Co., 272 F. App’x 761 (11th Cir. 2008); In re Rezulin Prods. Liab. Litig., 309 F.Supp.2d 531 (S.D.N.Y. 2004) (excluding Dr. Gale from offering ethical opinions); Cundy v. BNSF Ry., No. 40095-6-III, Wash. Ct. App. (Mar. 5, 2026) (affirming dismissal of case; Gale was one of plaintiffs’ expert witnesses); Russo v. Metro-North RR., Index No. 159201/2019, 2025 NY Slip Op 34659(U), N.Y. Sup. Ct., N.Y. Cty. (Dec. 5, 2025); Saverino v. Metro-North RR, Index No. 161353/2019, 2024 NY Slip Op 31326(U), N.Y. Sup. Ct., N.Y. Cty. (Apr. 8, 2024).

[13] In re Abilify (Arpiprazole) Prods. Liab. Litig., No. 3:16MD2734, 2021 WL 4951944, at *5 (N.D. Fla. July 15, 2021).

[14] FDA Adverse Event Reporting System (FAERS) (Last updated Sept. 8, 2014), available at <http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm>.

[15] In re Accutane Litig., No. 271(MCL), 2015 WL 753674, at *15 (N.J. Super. Law Div., Feb. 20, 2015) (Hon. Nelson C. Johnson, also known as the author of Boardwalk Empire).

[16] In re Accutane, 234 N.J. 340 (2018) (affirming exclusion of David Madigan).

[17] Derrick J. Stobaugh, et al., Alleged isotretinoin-associated inflammatory bowel disease: Disproportionate reporting by attorneys to the Food and Drug Administration Adverse Event Reporting System, 69 J. AM. ACAD. DERMATOL. 393 (2013).

[18] Mai Fujimoto, et al., Association between Statin Use and Bladder Cancer: Data Mining of a Spontaneous Reporting Database and a Claim Database, 1 J. PHARMACOL. & PHARMACOVIGILANCE 1 (2015).

[19] Xiao-long Zhang, et al., Statin use and risk of bladder cancer: a meta-analysis, 24 CANCER CAUSES & CONTROL 769 (2013).

[20] S. Albert & J. Morley, Testosterone therapy, association with age, initiation and mode of therapy with cardiovascular events: a systematic review, 95 CLIN. ENDOCRINOL. 436 (2016).

[21] GGCE at 963 n.178 (citing In re Testosterone Replacement Therapy Prods. Liab. Litig., No. 14 C 1748, 2018 WL 4030585, at *8 (N.D. Ill. Aug. 23, 2018), and explaining that the court had denied a “motion to exclude testimony of expert ‘whose Bayesian critiques of epidemiological studies’ were similar to those of another expert whose testimony ‘the Court has previously found admissible’.”).

[22] In re Testosterone Replacement Therapy Prods. Liab. Litig., No. 14 C 1748, 2017 WL 1833173, at *4 (N.D. Ill. May 8, 2017).

[23] Id.

[24] This is the same Martin Wells found to be a methodological shapeshifter in the paraquat parkinsonism litigation. In re Paraquat Prods. Liab. Litig., Case No. 3:21-md-3004-NJR, MDL No. 3004, 730 F.Supp.3d 793, 838 (S.D. Ill. 2024). See also Schachtman, Paraquat Shape-Shifting Expert Witness Quashed, TORTINI (Apr. 24, 2024).


[25] Defendants’ Motion to Exclude Plaintiffs’ Expert Testimony on the Issue of Causation, and for Summary Judgment, and Mem. of Law in Support, No. 1:14-CV-01748, MDL 2545, 2017 WL 1104501, at *69–70 (N.D. Ill. Feb. 20, 2017) (citing Reference Manual 259 (3rd ed. 2011), for the proposition that “‘subjective Bayesians are a well-established minority’ of scientists whose methods ‘have rarely been used in court.’”). See also Plaintiffs’ Mem. of Law in Opp. to Motion of AbbVie Defendants to Exclude Plaintiffs’ Expert Testimony on Causation, and for Summary Judgment, MDL No. 2545, Dkt. No. 1753 (N.D. Ill. Mar. 23, 2017).

[26] See David H. Kaye & David Freedman, Reference Guide on Statistics, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 529 (3rd ed. 2011).

[27] In re Testosterone Replacement Therapy Prods. Liab. Litig., MDL No. 2545, 2018 WL 4030585, at *8 (N.D. Ill. Aug. 23, 2018).

[28] John Carlin, Meta-analysis for 2 x 2 tables: a Bayesian approach, 11 STAT. MED. 141 (1992) [Carlin].

[29] Carlin at 157.

[30] Id.

[31] See P. Lambert et al., How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS, 24 STATS. MED. 2401, 2402 (2005). See also Andrew Gelman, Prior distributions for variance parameters in hierarchical models, 1 BAYESIAN ANALYSIS 515 (2006); E. Pullenayegum, An informed reference prior for between-study heterogeneity in meta-analyses of binary outcomes, 30 STATS. MED. 3082 (2010).

[32] Deposition of Martin Wells, in Martin v. Actavis, Inc., No. 15-cv-4292, 2018 WL 7350886 (N.D. Ill. Apr. 2, 2018).

[33] Martin v. Actavis, Inc., Case No. 15 C 4292, MDL No. 2545, 430 F. Supp.3d 516, 534 (2019).

[34] A. Lincoff et al., Cardiovascular Safety of Testosterone-Replacement Therapy, 389 NEW ENGL. J. MED. 107, 114 (2023).

[35] Jonathan Swift, The Examiner No. 14 (Nov. 9, 1710), in THE EXAMINER & OTHER PIECES WRITTEN IN 1710-11 at 8, 11-12 (Herbert Davis, ed. 1966).

How Science Works in the New Reference Manual on Scientific Evidence

March 12th, 2026

The Second and Third Editions of the Reference Manual on Scientific Evidence contained a chapter, “How Science Works,” by Professor David Goodstein. This chapter ambitiously set out to cover philosophy and sociology of science to help orient judges as strangers in a strange land. Goodstein’s chapter had been a useful introduction to scientific methodology, and it countered some of the antic ideas seen in some judicial opinions, as well as in some other chapters of the Manual. Goodstein brought a good deal of experience and expertise to the task. He was a distinguished professor of physics and Vice Provost at the California Institute of Technology, and he had written engagingly about scientific discovery and the pathology of science.[1] Sadly, Goodstein died in April 2024. His death may have had some role in the delayed publication of the Fourth Edition of the Manual,[2] and the improvident replacement of his chapter with a new chapter written by authors less articulate about how science works.

The substitute chapter on “How Science Works” was written by two authors considerably less accomplished than the late Professor Goodstein.[3] Michael Weisberg is a professor of philosophy at the University of Pennsylvania, where he is the deputy director of Perry World House, which “analyzes global policy challenges through the realms of climate, democracy, global justice and human rights, and security.” The connection with Perry World House may explain the new chapter’s heavy reliance upon the development of the chlorofluorocarbon (CFC) connection to ozone layer depletion as an exemplar of scientific discovery and knowledge. The University of Pennsylvania webpage describes Weisberg as “educat[ing] the next generation of environmental leaders in the classroom, at the negotiating table, and in the field, ensuring that their voices have maximal impact on addressing the climate crisis.”[4] So we have a philosopher of advocacy science, as it were. Some readers might think those credentials are not optimal for preparing a nuts-and-bolts description of how science works. Reading sections of the new chapter will not diminish their concerns.

Joining Weisberg on this new version of “How Science Works” is Anastasia Thanukos, who works at the University of California Museum of Paleontology. Thanukos has her master’s degree in integrative biology, and her doctorate in science education.[5]

The new “method” chapter has some virtues. Like Goodstein’s chapter, the new version puts peer review into a realistic perspective that should keep judges from being snookered into admitting weak or bogus evidence because it had been published in a peer-reviewed journal.[6] The authors should have gone much farther in pointing out that the rise of predatory and pay-to-play journals, as well as journals controlled by advocacy groups, has undermined much of the publishing model of modern science.

Weisberg and Thanukos discuss “expertise” in a way that is interesting but irrelevant to legal cases. They seem blithely unaware that the standard for qualifying an expert witness is extremely low. Who will disabuse them when they argue that “[i]t is worth evaluating the closeness of a scientist’s disciplinary expertise to a scientific topic on which expert testimony is delivered”?[7] In what emerges as a consistent pattern of giving anti-manufacturing industry examples, the authors point to Richard Scorer as an accomplished scientist who had no specific expertise in CFC ozone depletion. Notwithstanding the lack of specific expertise, an industry-backed group promoted Scorer’s views criticizing the CFC-ozone depletion hypothesis.[8] Citing Naomi Oreskes, the new Manual chapter states that “[t]he problem of scientists with legitimate expertise in one field weighing in on a scientific question outside their area of expertise is a pernicious one that has affected public acceptance of science and policy on issues such as climate change and tobacco exposure.”[9] Later, when Weisberg and Thanukos discuss the Milward case, they miss the pernicious influence that flowed from allowing Martyn Smith, a toxicologist, to give methodologically muddled opinion testimony on epidemiology. Pernicious is where you find it, and the authors of the new chapter find virtually all untoward instances of poor scientific method and conduct to originate from manufacturing industry.

Weisberg and Thanukos introduce a discussion of the “replication crisis,” a phrase and concept absent from the third edition of the Reference Manual.[10] The authors express some skepticism that there is an actual crisis over replication,[11] but their focus on climate science may mean that they are simply blinded by groupthink in that discipline. Their discussion of retractions omits the steep rise in retraction rates in most scientific disciplines,[12] and the authors ignore the proliferation of poor quality journals. Positively, the authors introduce a discussion of study preregistration, a notion absent from the third edition of the Manual, and they explain that such preregistration may serve as a bulwark against data dredging and post hoc analyses.[13] Negatively, the authors ignore how frequently preregistered protocols are not used, or are used and then violated.

Weisberg and Thanukos appropriately ignore “weight of the evidence” (WOE) and “inference to the best explanation” (IBE). Readers might (mistakenly) think that the new chapter implicitly rejects WOE, as put forth by Carl Cranor and credulously accepted by the First Circuit in Milward, when the chapter authors insist that 

“the judge’s task requires a deeper examination of the available evidence and methods by which it was arrived at, as well as an assessment of how the community of experts in this area has evaluated or would evaluate the evidence and reasoning in question.”[14]

Contrary to the Milward decision from 2011, the new authors are not shy about stating the obvious; there is good science, and there is bad science. Not all “judgment” about causality is acceptable and fit for submission to juries.[15] Given the judicial resistance to Rule 702, the obvious here requires stating. Weisberg and Thanukos acknowledge that some scientific judgment is unreliable or invalid because it was based upon work that was not carried out in accordance with current standards for scientific investigation and inference.[16] It should not surprise anyone that most of their examples of bad science are the product of manufacturing industry; the authors are oblivious to bad science sponsored by the lawsuit industry or by non-governmental advocacy organizations (NGOs).

Weisberg and Thanukos frame scientific disagreements and debates as governed by both data and ethical norms. Science is not infinitely contestable. There are identifiable norms, including a norm that scientists should “seek relevant information,” and “scrutinize ideas and evidence.”[17] Contrary to Milward’s standard of judicial abstention and credulity in the face of dodgy causal claims, these authors state what should be obvious, that scientific scrutiny involves, among other things, “an evaluation of methods, considering potential biases and oversights.”[18]

The chapter’s authors, non-lawyers, get closer to the heart of the error in Milward’s abstention doctrine with their recognition of what should have been obvious to the authors of the law chapter (Richter & Capra):

“When research relevant to a trial has not yet been scrutinized by a community with the appropriate technical expertise, a judge may be placed in the position of providing or requesting this scrutiny.”[19]  

Rather than some vague, subjective, and content-free WOE standard, Weisberg and Thanukos urge scientists, and by implication judges as well, to engage in serious efforts to “identify and avoid bias” and abide by ethical guidelines.[20] In other (my) words, the new authors agree that there is a standard of care reflected in the norms of science, and consequently there can be deviations from that standard. For Weisberg and Thanukos, compliance with the normative structure of scientific investigations is at the heart of building up accurate and predictive conclusions from data.[21] As part of their communitarian and normative conception of the scientific process, the authors appear to accept the reality and necessity for judges to act as gatekeepers.[22]

And while this recognition of standards and the need to police against deviations from standards is commendable, Weisberg and Thanukos proceed to give an abridgment of scientific method and process that is distorted and erroneous. They steadfastly ignore the concept of hierarchy of evidence, and thus provide illegitimate cover for levelers of evidence. In discussing randomized controlled trials, for instance, they note that such trials are often taken as “the gold standard,” but then they counter, without citation, support, or argument, that such trials “are just one line of evidence among many.”[23] The authors elide discussion and reconciliation of when that “just one line of evidence” conflicts with observational studies.

Notwithstanding their helpful comments about the need to evaluate studies for bias and other errors, these authors enter into the Milward controversy with an observation that assessing many lines of evidence is required, can be difficult for courts, and has led to “controversy.” Citing papers, including one by the late Margaret Berger at her notorious lawsuit-industry, SKAPP-funded Coronado Conference, Weisberg and Thanukos float the observation that:

“In science, the available evidence (some of which may come from other research programs not designed to test the hypothesis under consideration) is evaluated as a body, along with the strengths, weaknesses, and caveats relating to each type of data, an approach which, some scholars have argued, the judiciary has not always followed.98”[24]

This claim that the available evidence is evaluated as “a body” is presented as a fact about how science works, without any citation or argument. Several comments are in order. First, the claim is at odds with the authors’ own statements that scientific norms require evaluating each study for biases and other disqualifying flaws. Second, the claim is at odds with the authors’ own reference to systematic reviews and meta-analyses,[25] which are governed by protocols with inclusionary and exclusionary criteria for individual studies, and which require consideration of individual study validity before it enters the “body” of evidence that is quantitatively or qualitatively evaluated. In the authors’ words, “authors delineate both the criteria that studies must meet for inclusion in the review and the methods that will be used to assess the studies.”[26] The Milward case involved an expert witness who had proffered the very opposite of a systematic review in the form of post hoc rejiggering of studies and their data to fit a pre-conceived litigation goal. In the context of addressing the replication crisis, Weisberg and Thanukos correctly observe “peer review alone cannot ensure that the conclusions of published studies are actually correct, highlighting the responsibility judges bear in evaluating the validity of the methodologies that contributed to a particular piece of research.”[27] Of course, the Milward case involved a hired expert witness whose unprincipled re-analysis of studies was never peer reviewed or published.

Third, the authors could easily have found additional support for the contrary proposition that individual studies must be evaluated before being considered as part of the entire evidentiary display. The IARC Preamble, which roughly describes how that agency arrives at its so-called hazard classifications of human carcinogenicity, specifies that individual studies within each of three streams of evidence are evaluated for validity and soundness before contributing to a sub-conclusion with respect to (1) epidemiology, (2) toxicology, and (3) mechanistic lines of evidence.[28] Each of those three lines of evidence is adjudged “sufficient,” “limited,” or “inadequate,” by specialists in the three respective areas, before an overall evaluation is reached. There is much that is objectionable in the IARC working group procedures, but this division of labor and the need to consider disparate lines of evidence and studies within each line separately before attempting a synthesis, is present in all systematic review methodology. The suggestion from Weisberg and Thanukos that “the available evidence” in science is “evaluated as a body” is not only unsupported, but it is demonstrably false and misleading.

This claim about holistic evaluation is a fairly transparent but failed attempt to support a claim made in the chapter on the admissibility of expert witness evidence by Liesa Richter and Daniel Capra, who present an exposition of the notorious Milward case, without criticism, in a way that suggests that the case represents appropriate judicial gatekeeping under Rule 702, and that the case is consistent with scientific norms.[29] The chapter on how science works, after having stated a false claim about scientific methodology for synthesizing and integrating disparate lines of evidence, attempts to provide a gloss on the similar and equally benighted claim of Richter and Capra, in footnote 98:

“98. Some scholars have raised concerns that the courts have on occasion unfairly dismissed numerous individual lines of evidence as being flawed or insufficiently conclusive and concluded that evidence is lacking, when in fact the body of evidence, taken as a whole, points to a clear conclusion. For more, see discussion of Milward v. Acuity Specialty Products Group, Inc.; see also Liesa L. Richter & Daniel J. Capra, The Admissibility of Expert Testimony, in this manual; Berger 2005, supra note 97; and Steve C. Gold, A Fitting Vision of Science for the Courtroom, 3 Wake Forest J.L. & Pol’y 1 (2013).”

Some “scholars” have indeed said such things in their more unscholarly moments; some scholars have criticized Milward, but they are not cited in this new methods chapter. The footnote is accurate, but highly misleading by omission. The First Circuit in Milward also said as much, also without support or justification, and Richter and Capra, in their chapter of the Manual, fourth edition, parrot the Milward case. Weisberg and Thanukos cite to two articles, by Margaret Berger and by Steven Gold, both law professors, not scientists, and both ideologically hostile to Rule 702 gatekeeping. The Berger article was from a lawsuit-industry SKAPP-funded symposium known as the Coronado Conference, and the Gold paper comes out of a symposium sponsored by the lawsuit industry itself and the Center for Progressive Reform, an advocacy NGO to which one of Mr. Milward’s expert witnesses, Carl Cranor, belongs. So the authors of the new science methodology chapter failed to cite any scientific source, but cited to papers by lawyers in the capture of the lawsuit industry, and a single (infamous) decision that ignored Rules 702 and 703, as well as the extensive literature on systematic reviews. Weisberg and Thanukos could have cited many sources that contradicted their claim, and the claim of the lawsuit industry sponsored lawyers, but they did not. This is what biased and subversive scholarship looks like.

Funding Bias – The New McCarthyism

The selective citation to articles sponsored by the lawsuit industry is ironic in the context of what Weisberg and Thanukos have to say elsewhere about the “funding effect.” Some of what the authors say about personal bias is almost reasonable. For instance, they suggest that funding source is a “valid consideration” in evaluating methodologies and conclusions of expert testimony, and presumably of published studies as well, but not a sufficient reason to exclude such testimony or reliance.[30] Interestingly, these authors ignored the funding and the ideological interests of the symposia they cited in support of the repudiated Milward abstention doctrine.

Over three decades ago, Kenneth Rothman, the founder of Epidemiology, the official journal of the International Society for Environmental Epidemiology (ISEE), wrote his protest against the obsession with funding in an article that should have been cited in the new chapter, for balance. Rothman described the fixation on funding as the “new McCarthyism in science,” which manifested as intolerance toward industry-sponsored studies, and strict scrutiny of “conflict-of-interest” (COI) disclosures.[31] The new McCarthyites amplify the gamesmanship over COI disclosures by excusing or justifying non-disclosure of COIs from scientists who have positional conflicts, or who are aligned with advocacy groups or with the lawsuit industry.

This asymmetrical standard for adjudging conflicts is on full display in the Weisberg and Thanukos chapter, when they claim that “in pharmaceuticals, there is a strong tendency for industry-sponsored trials to favor the industry’s product.”[32] The chapter authors, and their cited source, ignore the context in which pharmaceutical industry scientists publish clinical trial results. A successful clinical trial that shows efficacy with minimal adverse events is the result of years of prior research, including preclinical testing and phase I and II trials. If any of that prior research fails to show efficacy, or shows unreasonable harm, the phase III trial is never done, and so never published. If the medication is never licensed, the phase III trial will generally not be published. The selection effects are obvious and overwhelming in determining that the published results of phase III trials will be work that favors the sponsor. A “failed” phase III trial may instead result in a securities class action against the pharmaceutical company. In the realm of observational studies, some work commissioned by manufacturing industry has its origins in the poorly conducted, flawed work of environmental zealots and NGOs. Manufacturing industry has an obvious interest in correcting the scientific record, and again, any carefully done study would rebut that of the zealots and favor the industry sponsor.

Elsewhere, the authors offer a more balanced assessment when they observe that “[a]ll research is potentially influenced by bias, and every funder of research has the potential to introduce a source of bias.”[33] Similarly, the fourth edition chapter notes that “[a]ll scientists have some sort of motivation for their work, and this does not preclude scientific knowledge building, so long as biased methodologies and interpretations are avoided.”[34] Their recognition that motivated reasoning is everywhere suggests that all research should receive scrutiny regardless of apparent or disclosed funding source.[35]

When it comes to providing examples of funding-effect distortions of science, Weisberg and Thanukos seem to blank on instances created by the lawsuit industry or by environmental NGOs. The reader should contrast how readily and stridently the authors point to bias in industry-sponsored research with how the authors tie themselves up with double negatives when making the same point about NGOs:

“That is not to suggest that government- or nongovernmental organization (NGO)-sponsored research is necessarily free from bias.”[36]

The cognitive dissonance is palpable. The only conclusion that could be drawn from such a locution is that Weisberg and Thanukos have not worked very hard to identify and disclose their own biases.

STATISTICS DONE POORLY

When it comes to explaining and discussing the role of statistical methods in the scientific process, Weisberg and Thanukos go off the rails. The new chapter is an unmitigated disaster, which should have been corrected in the peer review and oversight process. The first sign of trouble became apparent upon checking the definition of “p-value” in the chapter’s glossary:

“p-value. A statistic that gives the calculated probability that the null hypothesis could be true even given the observed differences between conditions.”[37]

This definition is the transposition fallacy on steroids. Obviously, a p-value cannot be the probability that the null hypothesis “could be true,” when the procedure for calculating a p-value must assume that the null hypothesis is true, along with a specified probability model. Equally important, the p-value does not state a probability about the null hypothesis at all; it states the probability, computed on the assumption that the null is true, of observing data that diverge from the null at least as much as the data in the sample at hand. The statistics chapter in the Manual, by Hall and Kaye, states the meaning correctly. The coverage of statistical concepts by Weisberg and Thanukos should be studiously ignored.
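The correct account can be made concrete with a short simulation (all numbers hypothetical): a permutation test computes a p-value by assuming the null hypothesis is true and asking how often data reshuffled under that assumption look at least as extreme as the observed data.

```python
import random
import statistics

def simulated_p_value(sample_a, sample_b, n_sims=10_000, seed=1):
    """Two-sided permutation p-value for a difference in means.

    The null hypothesis (no difference between groups) is ASSUMED true:
    under it the group labels are exchangeable, so we shuffle the labels
    and count how often a relabeled difference is at least as extreme as
    the observed one.  The p-value is thus a statement about the data,
    computed under the null -- not a probability that the null "could
    be true."
    """
    rng = random.Random(seed)
    observed = statistics.mean(sample_a) - statistics.mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    return hits / n_sims

# Hypothetical outcome scores for a drug group and a placebo group.
treated = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3]
control = [6.2, 6.5, 6.1, 6.4, 6.3, 6.6]
p = simulated_p_value(treated, control)  # small p: data improbable under the null
```

Nothing in the calculation yields a probability that the null hypothesis is true; the null is an input to the procedure, never its output.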

The outrageously incorrect definition of p-value in the glossary is not an isolated error. The authors are clearly statistically challenged. In the text of their chapter, they describe the p-value incorrectly, consistent with their aberrant glossary entry:

“[In] the commonly used p-value approach, scientists compare a test hypothesis (e.g., that drug X is effective) to a null (e.g., that there is no difference in cure rates between those who took drug X and those who took a placebo). Scientists then calculate the probability that the null hypothesis could be true even with the observed difference between conditions (e.g., the cure rate of patients taking drug X compared to that of those taking a placebo).”[38]

Weisberg and Thanukos thus conflate frequentist and Bayesian statistics. They also obliterate the meaning of the confidence interval, an important concept for judges and lawyers to understand. Here is how the authors describe the confidence interval in their chapter:

“Evaluating estimates: In science (and in contrast to their lay meanings), the terms uncertainty and error refer to the variability of a set of data that is intended to estimate a single number. Uncertainty and error are generally expressed as a range, within which we are confident that, if the study were repeated, the new result would fall. Scientists often use a 95% confidence interval for this purpose.”[39]

Describing the confidence interval in the same sentence as “uncertainty and error” is bound to induce uncertainty and error. The confidence interval provides a range of estimates based upon random error, and uncertainty only in the form of imprecision in the point estimate. There are of course myriad other kinds of uncertainty and error not captured by the confidence interval. The most important of the authors’ errors is that they assert incorrectly that the confidence interval provides a range within which new results from the study repeated would fall.  This is, again, a variant on the transposition fallacy that the authors commit in their definition of the p-value. The confidence interval provides a range of results that would not be rejected as alternative null hypotheses by the data in the obtained sample. Because of random error, future samples would give different results, with different confidence intervals, which would not be co-extensive with the first obtained confidence interval. To be sure, the statistics chapter states the matter correctly, and the epidemiology chapter finally gets it correct in its text (after having mangled the concept in the second and third editions), but the epidemiology chapter perpetuates its previous errors in defining confidence intervals in its glossary. This sort of issue, and it is a serious one, could have been eliminated had there been meaningful peer review and editorial oversight for consistency and accuracy of the Manual as a whole.
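A small simulation (all parameters hypothetical) illustrates what the 95% actually attaches to: the long-run behavior of the interval-construction procedure, not the location of any future study’s result.

```python
import math
import random
import statistics

def ci_coverage(true_mean=10.0, sigma=2.0, n=30, reps=4000, seed=7):
    """Fraction of 95% confidence intervals that cover the true mean.

    Each repetition draws a fresh sample from a known distribution and
    builds mean +/- 1.96 * SE (sigma treated as known, for simplicity).
    The "95%" describes this long-run coverage of the PROCEDURE; it is
    not a range within which a repeated study's result must fall, and
    any single interval either covers the true value or it does not.
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    covered = 0
    for _ in range(reps):
        sample_mean = statistics.mean(rng.gauss(true_mean, sigma) for _ in range(n))
        lower, upper = sample_mean - 1.96 * se, sample_mean + 1.96 * se
        if lower <= true_mean <= upper:
            covered += 1
    return covered / reps

coverage = ci_coverage()  # close to 0.95 over many repetitions
```

Each repetition also produces a different interval, which is why the first study’s interval cannot serve as a container for future results.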

Weisberg and Thanukos address statistical power in a way that may also mislead readers. They tell us that “[p]ower refers to a test’s ability to reject a hypothesis that is indeed false.” W&T at 88. If only it were so. The authors omit that power is the probability that, given a specified level of significance (say p < 0.05), a specified alternative hypothesis, a sample size, and a probability model, the sample result will reject the null hypothesis in favor of the alternative hypothesis. Then the authors suggest, confusingly, that “[w]ell-designed studies have sufficient power to detect the differences of interest, but it may not be apparent when a test lacks power.”[40]

If the study at issue presents a confidence interval around a point estimate of interest, then it will be clear what alternative null hypotheses are statistically compatible with the sample result at the pre-specified level of alpha (significance). Any point outside the interval would be rejected by such a test of significance, and so the casual reader will have a rather good idea of what could and could not be rejected by the sample data. And of course, virtually every study will have low power to detect extremely small increased risks, say relative risk of 1.00001. And most studies will have high power to detect risk ratios of over 1,000.
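The dependence of power on significance level, effect size, sample size, and probability model can be shown in a few lines (a sketch under a simple two-sample normal model, with hypothetical inputs):

```python
import math
from statistics import NormalDist

def two_sample_power(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Power is not a free-floating "ability to reject": it is a
    probability fixed only once the significance level (alpha), the
    specific alternative (effect size), the sample size, and the
    probability model are all specified -- exactly the inputs this
    function requires.
    """
    z = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_group)
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / se  # standardized effect under the alternative
    # Probability the test statistic falls beyond either critical value.
    return (1.0 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
```

For a vanishingly small effect the power collapses to roughly alpha, and for a very large effect it approaches one, which is the formal counterpart of the observation above about relative risks of 1.00001 versus risk ratios over 1,000.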

This new chapter on “How Science Works” also propagates some well-known fallacies about statistical significance testing. Implicit in the authors’ committing the transposition fallacy, is a conceptual and mathematical confusion between the coefficient of confidence (1-α) and the posterior probability of an hypothesis.
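The gap between the coefficient of confidence and any posterior probability is easy to exhibit with a toy screening calculation (all numbers hypothetical): even at a significance level of 0.05, the fraction of “significant” findings that are false positives depends on the prior plausibility of the tested hypotheses and on power, and it can be far larger than five percent.

```python
def false_positive_fraction(prior_true, alpha, power):
    """Among results declared "significant," the share that are false
    positives.  The 95% confidence coefficient says nothing about this
    posterior quantity, which turns on the prior and on power.
    """
    true_hits = prior_true * power            # real effects, detected
    false_hits = (1.0 - prior_true) * alpha   # true nulls, wrongly rejected
    return false_hits / (true_hits + false_hits)

# If only 10% of tested hypotheses reflect real effects, with 80% power,
# over a third of "significant" results are false positives.
share = false_positive_fraction(prior_true=0.10, alpha=0.05, power=0.80)
```

A 95% confidence coefficient thus coexists comfortably with a much lower probability that any given “significant” hypothesis is real.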

The authors’ mistake comes in their insistence upon labeling precision in a test result as “certainty.” In the quote below, the authors’ confusion is clear and obvious:

“Note that the 95% and 5% cutoffs are somewhat arbitrary, and a higher degree of confidence might be required if more certainty were desired—for example if an impactful policy decision depended on the conclusion.”[41]

An impactful [sic] policy decision might well call for more certainty, or a higher posterior probability, but a higher coefficient of confidence will not necessarily map to hypothesis probability at all. The authors’ confusion and conflation of alpha, the significance level, with the Bayesian posterior probability arises elsewhere within the chapter:

“(1) A p-value lower than 0.05 does not prove that a null hypothesis is false. It is strong evidence, but there is a small chance that the difference observed could be the result of chance alone.

(2) Using a low p-value (e.g., 0.05) as a criterion for significance sets a high bar for rejecting the null hypothesis, minimizing the chance of getting a false positive… .”[42]

Again, a p-value less than five percent is hardly strong evidence in the context of large database studies, especially when there are multiple comparisons and the outcome is not the pre-specified outcome of the analysis. The authors’ confusion is on full display when they discuss the Zoloft birth defects litigation, where the Third Circuit affirmed the exclusion of plaintiffs’ expert witnesses’ causation opinions and the grant of summary judgment to the defendants. According to the authors’ narrative:

“plaintiffs’ expert’s testimony would have argued that multiple, nonsignificant associations between Zoloft use and birth defects indicated a causal relationship. The testimony was excluded because these results were consistent with a weak causal relationship (a small effect size), one that is ‘so weak that one cannot conclude that the risk is greater than that seen in the general population’.”[43]

Of course, in the Zoloft litigation, the excluded plaintiffs’ expert witnesses were caught red-handed at cherry picking, and at attempting to circumvent the lack of significance with a methodologically incorrect meta-analysis.[44]

If the risk of birth defects among children born to mothers who used Zoloft in pregnancy was no greater than the risk seen in the general population, then there would be no increased risk, not a risk “so weak” it cannot be seen. Locutions such as “the results were consistent with a weak causal relationship,” when the results were equally consistent with no causal relationship, suggest that the writers cannot bring themselves to say that the causal hypothesis was simply not supported at all. Of course, no study can exclude an increased risk of 0.01 percent, or a relative risk of 1.01, but at some point, when multiple attempts fail to reveal an increased risk, we may conclude that the proponents of the causal claim have failed to make their case.

META-SHMETA-ANALYSIS

Weisberg and Thanukos address meta-analysis incompletely in the context of systematic reviews. The authors do not provide any insights into how meta-analyses are done, and more glaringly, they fail to mention that not all systematic reviews can or should result in quantitative syntheses of estimates of association. On the positive side, they state that meta-analyses are important in litigation, and that the application of rigorous methodologies should be required.[45] With clearly unintended irony, Weisberg and Thanukos offer, as support for their statement, the Paoli Railroad Yard case, “in which the exclusion of a contested meta-analysis was overturned.”[46]

Weisberg and Thanukos have stepped into the wet corner of a pigsty. The issue in the Paoli case arose from a meta-analysis of mortality rates associated with polychlorinated biphenyl (PCB) exposures. The district court excluded the proffered meta-analysis, not because it was unreliable, but because it was novel. Holding the case up in conjunction with a statement about the application of rigorous or reliable methodologies was way off the relevant legal point.

The expert witness who proffered the meta-analysis in Paoli was William Nicholson, who was a physicist with no professional training in epidemiology. For his opinion that PCBs were causally associated with human liver cancer, Nicholson relied upon a non-peer-reviewed, unpublished report he wrote for the Ontario Ministry of Labor.[47] Nicholson described his report as a “study of the data of all the PCB worker epidemiological studies that had been published,” from which he concluded that there was “substantial evidence for a causal association between excess risk of death from cancer of the liver, biliary tract, and gall bladder and exposure to PCBs.”[48]

The defense challenged Nicholson’s opinion, not on Rule 702, but on case law that pre-dated the Daubert decision.[49] The challenge included pointing out the unreliability of the Nicholson’s meta-analysis, but also asserted (incorrectly) the novelty of meta-analysis generally. The district court sustained the defense objection on the grounds of “novelty,” without reaching the reliability analysis.[50] The Third Circuit appropriately reversed and remanded for consideration of the reliability of Nicholson’s meta-analysis.[51]

The consideration of Nicholson’s “meta-analysis” never occurred on remand; plaintiffs’ counsel and their expert witnesses withdrew their reliance upon Nicholson’s analysis. Their about-face was highly prudent. Nicholson’s report presented SMRs (standardized mortality ratios); for the all-cancers statistic, he reported an SMR of 95. What Nicholson did, in this analysis and in all other instances, was simply divide the observed number of deaths by the expected number, and multiply by 100. This crude, simplistic calculation does not yield a standardized mortality ratio, which requires taking into account the age distributions of the exposed and unexposed groups, and weighting the contribution of cases within each age stratum. Nicholson’s presentation of data was nothing short of a fraud.
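The difference between Nicholson’s crude observed-over-expected ratio and a genuine SMR can be sketched in a few lines (all rates and person-years hypothetical): indirect standardization accumulates expected deaths stratum by stratum from age-specific reference rates, and diverges sharply from the crude shortcut whenever the cohort’s age structure differs from the reference population’s.

```python
def crude_ratio(observed_deaths, person_years, crude_ref_rate):
    """Observed / expected x 100 using one overall reference rate --
    the shortcut criticized above, which ignores the cohort's age mix."""
    expected = crude_ref_rate * sum(person_years.values())
    return 100.0 * observed_deaths / expected

def smr(observed_deaths, person_years, ref_rates):
    """Indirectly standardized mortality ratio: expected deaths are
    accumulated stratum by stratum, applying age-specific reference
    rates to the cohort's own person-years in each age stratum."""
    expected = sum(ref_rates[age] * py for age, py in person_years.items())
    return 100.0 * observed_deaths / expected

# Hypothetical young-heavy worker cohort measured against an older
# reference population whose overall (crude) death rate is 0.006
# deaths per person-year.
person_years = {"<40": 8000, "40-59": 1500, "60+": 500}
ref_rates = {"<40": 0.001, "40-59": 0.005, "60+": 0.02}
crude = crude_ratio(30, person_years, 0.006)     # about 50
standardized = smr(30, person_years, ref_rates)  # about 118
```

On these made-up numbers the crude shortcut reports a mortality deficit while proper standardization reports an excess, which is why the label “standardized” cannot be applied to a bare observed-over-expected division.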

Nicholson’s report was replete with many other methodological sins. He used a composite of three organs (liver, gall bladder, bile duct) without any biological rationale. His analysis combined male and female results, and even so his analysis of the composite outcome was based upon only seven cases. Of those seven, some were not confirmed as primary liver cancer, and at least one was confirmed as not being a primary liver cancer.[52]

As noted, Nicholson failed to standardize the analysis for the age distribution of the observed and expected cases, and he failed to present meaningful analysis of random or systematic error. When he did present p-values, he presented one-tailed values, and he made no corrections for his many comparisons from the same set of data.

Finally, and most egregiously, Nicholson’s meta-analysis was meta-analysis in name only. What he had done was simply to add “observed” and “expected” events across studies to arrive at totals, and to recalculate a bogus risk ratio, which he fraudulently called a standardized mortality ratio. Adding events across studies, without weighting by the inverse of study variance, is not a valid meta-analysis; indeed, it is a well-known example of how to generate the error known as Simpson’s Paradox, which can change the direction or magnitude of any association.[53]
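The point can be demonstrated directly (all counts hypothetical): two studies each showing an elevated risk ratio can, when their raw counts are simply added, yield a pooled ratio below one, while an inverse-variance weighted meta-analysis of the same studies preserves the elevation.

```python
import math

def risk_ratio(events_exp, n_exp, events_unexp, n_unexp):
    """Risk ratio: event rate among the exposed over the unexposed."""
    return (events_exp / n_exp) / (events_unexp / n_unexp)

def pooled_counts_rr(studies):
    """Nicholson-style pooling: add events and denominators across
    studies and recompute one ratio -- the move that invites
    Simpson's paradox."""
    a = sum(s[0] for s in studies); n1 = sum(s[1] for s in studies)
    c = sum(s[2] for s in studies); n0 = sum(s[3] for s in studies)
    return risk_ratio(a, n1, c, n0)

def inverse_variance_rr(studies):
    """Fixed-effect meta-analysis: pool the log risk ratios, each
    weighted by the inverse of its approximate variance, then
    exponentiate the weighted mean."""
    weighted_sum = total_weight = 0.0
    for a, n1, c, n0 in studies:
        log_rr = math.log(risk_ratio(a, n1, c, n0))
        variance = 1/a - 1/n1 + 1/c - 1/n0  # standard log-RR variance
        weight = 1.0 / variance
        weighted_sum += weight * log_rr
        total_weight += weight
    return math.exp(weighted_sum / total_weight)

# Two hypothetical studies with very different baseline risks, each as
# (events_exposed, n_exposed, events_unexposed, n_unexposed):
studies = [(90, 100, 800, 1000), (30, 1000, 2, 100)]
naive = pooled_counts_rr(studies)        # falls below 1.0
weighted = inverse_variance_rr(studies)  # stays above 1.0
```

Here the per-study risk ratios are 1.125 and 1.5, yet the summed-counts ratio drops to roughly 0.15 because the studies differ in baseline risk and in the share of exposed subjects, which is exactly the trap that inverse-variance weighting avoids.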

In citing to the Paoli case as a reversal of exclusion of a contested meta-analysis, Weisberg and Thanukos give a truncated analysis that misleads readers, judges, and lawyers. There never was a proper consideration of the reliability vel non of Nicholson’s meta-analysis in the Paoli litigation, and in the final analysis, the Paoli plaintiffs abandoned reliance upon Nicholson’s ill-conceived meta-analysis.

VIRTUE SIGNALING

Although there are no land acknowledgments for the property on which the Federal Judicial Center building is located, Weisberg and Thanukos miss few opportunities to let us know that they are woke scholars. There is the gratuitous and triggering “pregnant people,”[54] which begs any number of biological questions. Then there is the authors’ statement that they are limiting their focus to the “Western conception of science,” which begs another question: why would we call any other epistemically valid approach, from any corner of the globe, anything other than “science”?[55]

Equally gratuitous are the authors’ endorsements of DEI and “diversity,” with overbroad generalizations that diversity per se advances science,[56] and a claim that “women, people of color, other historically oppressed groups, and non-Western people” are not taken seriously as scientists.[57] In over 40 years of litigating technical and scientific issues, I have never seen a judge or a lawyer disrespect an expert witness based upon sex, race, ethnicity, or national origin. Of course, I have seen expert witnesses treated roughly for propounding bad science, and that seems perfectly appropriate.


[1] See David Goodstein, ON FACT AND FRAUD: CAUTIONARY TALES FROM THE FRONT LINES OF SCIENCE (2010).

[2] Weisberg and Thanukos frequently refer to other chapters in the Manual, which suggests that their chapter was written late in the development of the Fourth Edition, and perhaps contributed to the delayed publication.

[3] Michael Weisberg & Anastasia Thanukos, How Science Works, in National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 47 (4th ed. 2025) [cited as W&T].

[4] See Michael Weisberg, University of Pennsylvania Philosophy, at https://philosophy.sas.upenn.edu/people/michael-weisberg.

[5] Anna Thanukos, Staff, available at https://ucmp.berkeley.edu/people/anna-thanukos/

[6] W&T at 72-75.

[7] W&T at 81.

[8] W&T at 81.

[9] W&T at 81 & n.85 (emphasis added), citing Naomi Oreskes & Erik M. Conway, MERCHANTS OF DOUBT: HOW A HANDFUL OF SCIENTISTS OBSCURED THE TRUTH ON ISSUES FROM TOBACCO SMOKE TO GLOBAL WARMING (2010).

[10] W&T at 94-96.

[11] W&T at 95 n.120.

[12] Richard Van Noorden, More than 10,000 research papers were retracted in 2023 — a new record, 624 NATURE 479 (2023).

[13] W&T at 95.

[14] W&T at 55.

[15] W&T at 63, 68.

[16] W&T at 68.

[17] W&T at 65.

[18] W&T at 70.

[19] W&T at 71.

[20] W&T at 66.

[21] W&T at 75.

[22] W&T at 49.

[23] W&T at 83.

[24] W&T at 86 (citing Richter and Capra’s discussion of Milward in chapter one of the Manual, and Professor Gold’s article from the lawsuit industry celebratory conference on the Milward case).

[25] W&T at 99-100.

[26] W&T at 99.

[27] W&T 96 (emphasis added).

[28] IARC MONOGRAPHS ON THE IDENTIFICATION OF CARCINOGENIC HAZARDS TO HUMANS – PREAMBLE (2019), available at https://monographs.iarc.who.int/wp-content/uploads/2019/07/Preamble-2019.pdf

[29] Liesa L. Richter & Daniel J. Capra, The Admissibility of Expert Testimony, National Academies of Sciences, Engineering, and Medicine & Federal Judicial Center, REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 1, 32-33 (4th ed. 2025).

[30] W&T at 76.

[31] Kenneth J. Rothman, “Conflict of interest: the new McCarthyism in science,” 269 J. AM. MED. ASS’N 2782 (1993). See Schachtman, The Rhetoric and Challenge of Conflicts of Interest, TORTINI (July 30, 2013).

[32] W&T at 76 & n.67, citing Sergio Sismondo, Pharmaceutical Company Funding and Its Consequences: A Qualitative Systematic Review, 29 CONTEMP. CLINICAL TRIALS 109 (2008).

[33] W&T at 77.

[34] W&T at 59-60.

[35] W&T at 59-60.

[36] W&T at 76.

[37] W&T at 111.

[38] W&T at 87.

[39] W&T at 90.

[40] W&T at 88.

[41] W&T at 90 (emphasis added).

[42] W&T at 88.

[43] W&T at 90 (internal citations omitted).

[44] In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., 26 F. Supp. 3d 449 (E.D. Pa. 2014); No. 12-md-2342, 2015 WL 314149, at *3 (E.D. Pa. Jan. 23, 2015) (rejecting proffered expert witness opinion based upon “cherry-picking of studies and data within studies”), aff’d, 858 F.3d 787 (3rd Cir. 2017).

[45] W&T at 99.

[46] W&T at 99 & n.134, citing In re Paoli R.R. Yard PCB Litig., 916 F.2d 829 (3d Cir. 1990).

[47] William Nicholson, Report to the Workers’ Compensation Board on Occupational Exposure to PCBs and Various Cancers, for the Industrial Disease Standards Panel (ODP); IDSP Report No. 2 (Toronto Dec. 1987) [Report].

[48] Id. at 373.

[49] See United States v. Downing, 753 F.2d 1224 (3d Cir.1985).

[50] In re Paoli RR Yard Litig., 706 F. Supp. 358, 372-73 (E.D. Pa. 1988).

[51] In re Paoli RR Yard PCB Litig., 916 F.2d 829 (3d Cir. 1990), cert. denied sub nom. General Elec. Co. v. Knight, 499 U.S. 961 (1991).

[52] Report, Table 22.

[53] See James A. Hanley, et al., Simpson’s Paradox in Meta-Analysis, 11 EPIDEMIOLOGY 613 (2000); H. James Norton & George Divine, Simpson’s paradox and how to avoid it, SIGNIFICANCE 40 (Aug. 2015); George Udny Yule, Notes on the theory of association of attributes in statistics, 2 BIOMETRIKA 121 (1903).

[54] W&T at 84.

[55] W&T at 50.

[56] W&T at 71 nn.52-54.

[57] W&T at 102.

The FDA Expert Panel on Talc – More Malarkey

June 18th, 2025

On May 20, 2025, as announced, FDA Commissioner Martin Makary held his panel discussion on talc in food and medications.[1] The discussion lasted just under two hours, and is available on YouTube for your viewing and perhaps your amusement. Makary opened and closed the event with what could have been the plaintiffs’ opening and closing statements from one of the many talc trials that have clouded courtrooms across the land. He asked rhetorically: “Why don’t we talk about at our oncology meetings the 1993 National Toxicology Program results that found clear evidence of carcinogenic activity of talc in animal studies?” Perhaps because the talc findings were questionable at best, and the asbestos findings with respect to gastrointestinal cancers were exculpatory for talc.

Makary’s introductory remarks were followed by the panelists’ introducing themselves by their training and involvement with talc issues. Other than Makary, the participants were FDA Deputy Commissioner Sara Brenner, George Tidmarsh, John Joseph Godleski, Sandra McDonald, Daniel Cramer, Joellen Schildkraut, Malcolm Sim, Steven Pfeiffer, Nicolas Wentzensen, and Nicole C. Kleinstreuer. Godleski and Cramer have served as plaintiffs’ expert witnesses in ovarian cancer litigation, which was not particularly germane to the panel discussion. In their initial discussions of qualifications and background, neither Godleski nor Cramer disclosed his potential conflicts of interest, or the amount of fees earned. Sandra McDonald described her experience in assisting Godleski, but she did not declare whether she earned any money for consulting services to the lawsuit industry. Later in the panel discussion, when George Tidmarsh stated that no one should be vilified for past practices in using talc, Daniel Cramer jumped in to vilify Johnson & Johnson with the suggestion that the company had somehow surreptitiously arranged for the National Cancer Institute to remove a statement that ovarian cancer “may be associated with talc use” from its website just before he was about to testify in his first talc trial for plaintiffs.

None of the panelists had served as a defense expert witness. Steven Pfeiffer works for a pharmaceutical company, but not one that had any experience with the safety or efficacy of talc as an ingredient in medications.

None of the panelists had participated in any toxicologic or epidemiologic study of talc on cancers or diseases of the digestive organs. None of the panelists made it his or her business to become familiar with the extensive studies of the effects of asbestos and talc on gastrointestinal cancers. The lack of experience, or of specific citations to any study, did not stop Daniel Cramer from suggesting that talc was responsible for inflammatory bowel disease, autoimmune diseases, and gastrointestinal cancers. Like Cramer, epidemiologist Joellen Schildkraut focused on ovarian cancer, and made the false assertion that the relationship between talc and gastrointestinal cancers is understudied. Schildkraut held back from asserting that talc causes ovarian cancer, but she heartily endorsed banning talc on the precautionary principle. All the panelists concurred with the suggestion that talc be eliminated from food and drugs, without waiting for “the epidemiologists to catch up.”

Two issues were grossly misrepresented by the panelists. None of them, however, was well informed enough for the misrepresentations to have been overt lies. The first whopper was that National Toxicology Program (NTP) testing had shown carcinogenicity of talc in its inhalational studies for the lung and other organs. The second whopper was that talc-coated rice was used prevalently in the United States, and that it was responsible for digestive organ cancers. Nicole C. Kleinstreuer, who has worked at the NTP and accurately described its activities, gave a description of its animal talc studies, perhaps a bit slanted, but not too inaccurate. When George Tidmarsh later misrepresented the NTP talc findings, however, Kleinstreuer was silent.

NTP Ingestion Studies

Makary did not identify the NTP studies to which he referred, but Kleinstreuer described a talc inhalation study that has only one referent. The NTP conducted long-term rodent inhalation and ingestion assays for both talc and different kinds of asbestos, in the 1980s and 1990s. For talc, the NTP published, in 1993, only one long-term inhalational study in rats and mice.[2] In mice exposed to talc by inhalation for up to two years, there was no evidence of any “neoplastic” effects. The results in rats were more difficult to interpret. In male rats, exposed for over two years, there was weak evidence of neoplastic effects, based upon an increased incidence of benign or malignant adrenal gland pheochromocytomas. In female rats, the NTP reported “clear evidence” of excess alveolar/bronchiolar (lung) adenomas and carcinomas, and of benign or malignant pheochromocytomas of the adrenal gland. The meaning of these rodent studies obviously varies depending upon whether you are a rat or a mouse of a certain breed; the meaning for humans is even murkier, even for humans that are rodent-like. The multiple comparisons across exposure levels for dozens if not hundreds of outcomes, and the lumping of benign and malignant effects together, certainly make the NTP statistical analyses suspect. The report was marked by significant controversy, and some scientists refused to endorse its findings because the adrenal gland pheochromocytomas were not treatment-related; the maximum-tolerated dose was exceeded for female rats at the higher exposure level, thus violating the study’s protocol; and talc would thus not be expected to cause tumors in rats (and mice) exposed at levels that do not cause “marked chronic lung toxicity.”[3]

One of the lawsuit industry’s, and Makary’s, theories about the harmfulness of ingested talc is based upon the supposition that talc has asbestos contaminants. This theory is as vague as the term asbestos itself, which has no mineralogical meaning; instead, the term was historically used to refer to six different minerals: actinolite, anthophyllite, amosite (cummingtonite-grunerite), chrysotile, crocidolite, and tremolite. All of these minerals, except for chrysotile, are amphiboles. Some of the amphibole minerals occur in both fibrous and non-fibrous forms, and the ill health effects of the amphibole fibers are generally attributed to their resistance to biological degradation and their high aspect ratio. Things get a bit crazy because the federal government, for purposes of standardizing aerosol measurements, set the aspect ratio for counting “fibers” at 3:1. The pathogenicity of “federal fibers,” which are not really fibers, is highly disputed.

The NTP never conducted long-term talc ingestion studies; it did something much better. The NTP tested dietary high-dose, long-term ingestion of various asbestos types in multiple species. The NTP did not leave the exposure issue vague, with “asbestos” as the dietary source. Instead, the NTP was more precise when testing whether ingesting “asbestos” was harmful to rodents. The NTP ran separate ingestion experiments on chrysotile, amosite, and crocidolite, with each form of asbestos making up one percent of the animals’ lifetime diet. Overall, these experiments were “null”; that is, they provided no support for the carcinogenicity of ingested asbestos of the types tested.

The NTP conducted lifetime ingestion studies in male and female rats with a diet of one percent crocidolite asbestos, the most toxic and carcinogenic form of asbestos in human beings. The NTP experiments showed that under these conditions, long-term ingestion of crocidolite asbestos was neither overtly toxic nor carcinogenic in male or in female rats.[4] After crocidolite, amosite asbestos (fibrous cummingtonite-grunerite, named for the Asbestos Mines of South Africa) is the most toxic and carcinogenic of the asbestos fibers. The NTP showed that feeding male and female rats amosite asbestos as one percent of their diet, for their lifetimes, was not overtly toxic, did not affect their survival, and was not carcinogenic.[5] The NTP repeated its lifetime one percent amosite diet in Syrian Golden hamsters, again without a toxic or carcinogenic response in either the male or female hamsters.[6]

Looking at the least toxic and carcinogenic asbestos mineral, chrysotile, the NTP conducted long-term one percent feed studies of both “short range” and “long range” chrysotile (fiber lengths) in Syrian Golden hamsters. Again the results were “null”; that is, there was no treatment-related toxicity or carcinogenicity.[7] There were no increases in adrenal cortical adenomas (benign growths) when compared with concurrent controls, but there was an increase of these benign tumors when compared with pooled control groups from other experiments. Ultimately, the NTP concluded that the biological importance of these benign adrenal growths, in the absence of cancers or tumors of the gastrointestinal tract (which was the target organ), was questionable, at best.

Because of prior research suggesting that carcinogenicity was a function of fiber rigidity and length, the NTP tested ingested chrysotile in rats, at two different fiber lengths. For its experiments, the NTP defined “short-range” (SR) chrysotile as short fibers with a median length of 0.66 microns, and a range of 0.088 to 51.1 microns. “Intermediate-range” (IR) chrysotile fibers had a median length of 0.82 microns, with a range from 0.104 to 783.4 microns. The NTP did not use long-range chrysotile fibers, which are generally greater than 5 microns in length. Male and female F344/N rats ingested a one percent diet of chrysotile, in the two lengths, SR and IR, for a lifetime. There were no neoplastic or non-neoplastic diseases, overt toxicity, or decrease in survival associated with SR chrysotile ingestion, in either the male or the female rats.[8] In the female rats, there was no effect on fertility or litters, and no overt toxicity or carcinogenicity from IR chrysotile ingestion. The male rats also did not show any adverse clinical signs, but they experienced a statistically insignificant increase in benign colonic polyps, which the NTP stretched to characterize as “some” (but not clear) evidence of carcinogenicity.

Rice is Nice, With or Without Talc

The FDA panelists’ inaccurate claims about talc on rice also cry out for rebuttal, which no panelist seemed able or willing to give. Given that the panel was convened with only four days’ notice, and without public comment, it operated in a fact-free zone, mostly as a propaganda exercise. The history of the ingested asbestos and talc controversy goes back over half a century. Some background is needed to understand exactly how outlandish the talc-on-rice claim is.

The causal association between asbestosis and lung cancer was well established by the early 1960s,[9] as was the causal association between crocidolite asbestos exposure and mesothelioma.[10] Some sources carelessly credit Irving Selikoff with these discoveries, but he was not so much a discoverer as a zealous spokesman for the safety of asbestos-exposed workers. Selikoff worked hand-in-hand with various labor unions to publicize and politicize asbestos risks that had been shown by other researchers. Credit for the lung cancer connection properly goes to earlier work done by Sir Richard Doll and others, and the crocidolite-mesothelioma connection was shown by J. Christopher Wagner, in 1960. Where Selikoff deserves credit is in his tireless efforts to expand the scope of asbestos-related diseases beyond lung cancer and mesothelioma, with or without sufficient evidence, and thus to expand the compensability of other diseases of ordinary life in asbestos workers.

In his efforts to extend the scope of compensation, Selikoff did not limit himself to risks that had been scientifically established; he sought to expand the list of asbestos-related diseases. He advanced the unsubstantiated notions that all six kinds of asbestos minerals carried the same risks, that asbestos caused virtually every kind of cancer in humans, that any asbestos in the environment required extreme remedial action, and that asbestos was responsible for a very high percentage of all human cancers.

No doubt Selikoff wanted credit for scientific discoveries, but he also wanted science that would support compensation. Selikoff understood that if the asbestos workers stopped smoking, their risks of lung cancer would fall, and their cancer morbidity and mortality would be more influenced by gastrointestinal cancers, given that colorectal cancer was the leading cause of cancer-related death in non-smoking men, in the 1960s.

By 1950, Selikoff had already become an advocate, who testified and wrote reports as a claimants’ expert witness in many asbestos cases. In the early 1950s, New Jersey lawyer Carl Gelman retained Selikoff to examine 17 workers from the Paterson plant of Union Asbestos and Rubber Company (UNARCO). Gelman filed workers’ compensation claims on behalf of these UNARCO workers, and Selikoff supported Gelman’s claims with reports and testimony. In the early 1950s, Anton Szczesniak, one of the UNARCO claimants, with Selikoff’s support as an expert witness, sought compensation for “intestinal cancer.” In 1965, Selikoff testified to support an asbestos insulator’s claim that asbestos exposure caused his colorectal cancer.[11] In 1974, Selikoff wrote a review article on asbestos exposure and gastrointestinal cancers, without any disclosure of his pro-plaintiff testimonial adventures.[12] Serious epidemiologists such as Sir Richard Doll and Sir Richard Peto pushed back on Selikoff’s exaggerated projections of asbestos-related mortality,[13] and his assertion that asbestos caused digestive system cancers.[14] Forty years after Selikoff testified for the claimant in an asbestos colorectal cancer case, the Institute of Medicine published a systematic review of the evidence available to Selikoff and later evidence, which showed that the evidence was insufficient “to infer a causal relationship between asbestos exposure and pharyngeal, stomach, and colorectal cancers.”[15]

Selikoff’s rent-seeking and fear-mongering spawned many asbestos scares. Some scientists accepted Selikoff’s dogma that a single asbestos fiber, of any variety, could cause any human cancer. The Mt. Sinai jihad against “asbestos” extended to any exposures involving asbestos, or even other minerals that contained “elongated mineral particles” that nominally met the crude definition of asbestos. This jihad led to a prolonged litigation against the Reserve Mining Company, which had held permits since the late 1940s to dump taconite tailings in Lake Superior. Using Selikoff’s claim that “asbestiform” mineral particles had entered the water supply, the U.S. Environmental Protection Agency was able to obtain an injunction against the mining company.[16]

Regulatory overreach, Selikoff’s exaggerated testimony, and the trial judge’s partiality and bias marred the litigation.[17] After decades of research on asbestos in drinking water, there remains no substantial evidence that supports a conclusion that ingested asbestos in drinking water causes gastrointestinal or any other cancer.[18]

Selikoff was the head of an anti-asbestos lobby that promoted the fiction that asbestos was responsible for all manners of human ailments, regardless of dose or route of administration.[19] One of the panics he helped initiate involved the claim that talc-dusted rice was responsible for the high rate of stomach cancer among Japanese in Japan.

Reuben Merliss published an article in Science, in 1971, in which he attempted to attribute the high rate of stomach cancer in Japan to the Japanese custom of dusting rice with talc. Merliss relied upon overall population rates and trends to draw an ecologic inference that the Japanese rice (with talc and any asbestos contaminants) was responsible for the Japanese higher incidence of stomach cancer.[20]

The Merliss hypothesis, inspired by Selikoff, was sunk by a much more careful analysis (which got less media coverage). Two epidemiologists analyzed data about use of talc-coated rice in Japan and Hawaii, and found no support for the claim that talc-coated rice increased the risk of developing stomach cancer.[21]

Their more careful dietary assessment found high rates of stomach cancer among Japanese in Japan who did not consume talc-coated rice, while Japanese in Hawaii, who consumed considerable quantities of talc-coated rice, had intermediate rates of stomach cancer (lower than in Japan). Filipinos in Hawaii had very low rates of gastric cancer, even though they consumed the greatest amounts of talc-coated rice of any of the observed groups. The secular trend in stomach cancer incidence decreased more substantially among the talc-exposed Japanese living in Hawaii than among the non-exposed Japanese living in Japan.

Although the asbestos perpetual motion litigation machine continues to churn, the lawsuit industry has been hampered by the bankruptcy of virtually every company that made an asbestos-containing product, and the reduction of asbestos use and exposures over the last 50 years. The lawsuit industry’s shift to demonize and monetize talc as the next mineral target was predictable. What was not predictable was that we would have a Secretary of Health & Human Services whose sole experience in medicine has been in suing pharmaceutical and other manufacturing industries, perpetuating medieval beliefs in the miasma theory of disease causation,[22] and spreading conspiracies, misinformation, and disinformation. FDA Commissioner Makary has shown himself to be a willing accomplice in advancing the Secretary’s agenda. In his closing remarks, Makary made unsupported assertions, then retreated to the dodge that he was just asking questions. Makary strongly suggested that the recent increase in colorectal cancer among young people has been caused by the use of talc in food and medications. He failed to reference any evidence for his suggestion, which is, in any event, hard to square with the history of use of talc in medications for centuries, and the steady overall decline in the incidence of colorectal cancer in men and women.[23]

The Center for Truth in Science has sponsored rigorous systematic reviews of the evidence on cosmetic talc use and female reproductive cancers,[24] and respiratory cancers.[25] The systematic review of talc and reproductive organ cancers integrated evidence across toxicologic and epidemiologic studies, and found suggestive evidence of no association between the use of perineal talc and ovarian and endometrial cancers. The systematic review of talc use and respiratory cancers similarly integrated the available toxicologic and epidemiologic evidence, and rejected a causal association. The review reached a conclusion of suggestive evidence in the opposite direction – of no association between inhaled talc and mesothelioma or lung cancer.

The FDA talc panel was fool’s gold, and not the promised “gold standard” science. Rather than engaging with the systematic reviews sponsored by the Center, or for that matter with any systematic reviews, Commissioner Makary and his panel wallowed in anecdotes, stories, and isolated study results, without trying to identify and synthesize all the available evidence.


[1] FDA Expert Panel on Talc, “Independent Expert Panel to Evaluate Safety and Necessity of Talc in Food, Drug, and Cosmetic Products,” FDA (May 20, 2025).

[2] NTP Technical Report on the Toxicology and Carcinogenesis Studies of Talc (CAS No. 14807-96-6) in F344/N Rats and B6C3F Mice (Sept. 1993).

[3] Jay I. Goodman, “An Analysis of the National Toxicology Program’s (NTP) Technical Report (NTP TR 421) on the Toxicology and Carcinogenesis Studies of Talc,” 21 Regulatory Toxicol. & Pharmacol. 244 (1995). See also Robyn L. Prueitt, Nicholas L. Drury, Ross A. Shore, Denali N. Boon & Julie E. Goodman, “Talc and human cancer: a systematic review of the experimental animal and mechanistic evidence,” 54 Critical Reviews in Toxicology 359 (2024).

[4] NTP TR-280 Toxicology and Carcinogenesis Studies of Crocidolite Asbestos (CASRN 12001-28-4) In F344/N Rats (Feed Studies) (1988).

[5] NTP TR-279 Toxicology and Carcinogenesis Studies of Amosite Asbestos (CASRN 12172-73-5) in F344/N Rats (Feed Studies) (1990).

[6] NTP TR-249 Lifetime Carcinogenesis Studies of Amosite Asbestos (CASRN 12172-73-5) in Syrian Golden Hamsters (Feed Studies) (1983).

[7] NTP TR-246 Lifetime Carcinogenesis Studies of Chrysotile Asbestos (CASRN 12001-29-5) in Syrian Golden Hamsters (Feed Studies) (1990).

[8] NTP – TR-295 Toxicology and Carcinogenesis Studies of Chrysotile Asbestos (CASRN 12001-29-5) in F344/N Rats (Feed Studies) (1985).

[9] See Richard Doll, “Mortality from Lung Cancer in Asbestos Workers,”  12 Br. J. Indus. Med. 81 (1955).

[10] See J. Christopher Wagner, C.A. Sleggs, and Paul Marchand, “Diffuse pleural mesothelioma and asbestos exposure in the North Western Cape Province,” 17 Br. J. Indus. Med. 260 (1960); J. Christopher Wagner, “The discovery of the association between blue asbestos and mesotheliomas and the aftermath,” 48 Br. J. Indus. Med. 399 (1991).

[11] See “Health Hazard Progress Notes,”16 The Asbestos Worker 13 (May 1966) (“A recent decision has widened the range of compensable diseases for insulation workers even further. A member of Local No. 12. Unfortunately died of a cancer of the colon. Dr. Selikoff reported to the compensation court that his research showed that these cancers of the intestine were at least three times as common among the insulation workers as in men of the same age in the general population. Based upon Dr. Selikoff’s testimony, the Referee gave the family a compensation award, holding that the exposure to many dusts during employment was responsible for the cancer. The insurance company appealed this decision. A special panel of the Workman’s Compensation Board reviewed the matter and agreed with the Referee’s judgment and affirmed the compensation award. This was the first case in which a cancer of the colon was established as compensable and it is likely that this case will become an historical precedent.”).

[12] Irving J. Selikoff, “Epidemiology of Gastrointestinal Cancer,” 9 Envt’l Health Persp. 299 (1974).

[13] Richard Doll & Richard Peto, “The causes of cancer: quantitative estimates of avoidable risks of cancer in the United States today,” 66 J. Nat’l Cancer Instit. 1191 (1981).

[14] Richard Doll and Julian Peto, Asbestos: Effects on Health of Exposure to Asbestos 8 (1985).

[15] Jonathan M. Samet, et al., Asbestos: Selected Cancers – Institute of Medicine (2006).

[16] See Wendy Wriston Adamson, Saving Lake Superior: A Story of environmental action (1974); Frank D. Schaumburg, Judgment Reserved: A Landmark Environmental Case (1976); Robert V. Bartlett, The Reserve Mining Controversy: Science, Technology, and Environmental Quality (1980); Thomas F. Bastow, This Vast Pollution: United States of America v. Reserve Mining Company (1986); Michael E. Berndt & William C. Brice, “The origins of public concern with taconite and human health: Reserve Mining and the asbestos case,” 52 Regulatory Toxicol. & Pharmacol. S31 (2008).

[17] Reserve Mining Co. v. Lord, 529 F.2d 181 (8th Cir. 1976) (removing Judge Lord from case).

[18] See World Health Organization, Asbestos in Drinking Water (4th ed. 2021) (“no causal association between asbestos exposure via drinking-water and cancer development has been reported for any asbestos fibre type”); Jennifer Go, Nawal Farhat, Karen Leingartner, Elvin Iscan Insel, Franco Momoli, Richard Carrier & Daniel Krewski, “Review of epidemiological and toxicological studies on health effects from ingestion of asbestos in drinking water,” 54 Critical Reviews in Toxicology 856 (2024) (“Based on high-quality animal studies, an increased risk for cancer or non-cancer endpoints was not supported, aligning with findings from human studies. Overall, the currently available body of evidence is insufficient to establish a clear link between asbestos contamination in drinking water and adverse health effects.”); Kenneth D. MacRae, “Asbestos in drinking water and cancer,” 22 J. Royal Coll. Physicians 7 (1988).

[19] Francis Douglas Kelly Liddell, “Magic, Menace, Myth and Malice,” 41 Ann. Occup. Hyg. 3, 3 (1997) (“[A]n anti-asbestos lobby, based in the Mount Sinai School of Medicine of the City University of New York, promoted the fiction that asbestos was an all-pervading menace, and trumped up a number of asbestos myths for widespread dissemination, through media eager for bad news.”).

[20] Reuben R. Merliss, “Talc-Treated Rice and Japanese Stomach Cancer,” 173 Science 1141 (1971). The claim persists in the underworld of medical speculation. See E. Whitin Kiritani, “Asbestos and Stomach Cancer in Japan – A Connection?” 33 Medical Hypotheses 159 (1990).

[21] Grant N. Stemmermann & Lawrence N. Kolonel, “Talc-coated rice as a risk factor for stomach cancer,” 31 Am. J. Clin. Nutrition 2017 (1978).

[22] Paul Offit, “Understanding RFK Jr.,” Beyond the Noise (Feb. 11, 2025).

[23] American Cancer Society, “Key Statistics for Colorectal Cancer” (last revised April 28, 2025).

[24] Heather N. Lynch, Daniel J. Lauer, Olivia Messina Leleck, Rachel D. Freid, Justin Collins, Kathleen Chen, William J. Thompson, A. Michael Ierardi, Ania Urban, Paolo Boffetta & Kenneth A. Mundt, “Systematic review of the association between talc and female reproductive tract cancers,” 5 Front. Toxicol. 1157761 (2023).

[25] Heather N. Lynch, Daniel J. Lauer, William J. Thompson, Olivia Leleck, Rachel D. Freid, Justin Collins, Kathleen Chen, A. Michael Ierardi, Ania M. Urban, Michael A. Cappello, Paolo Boffetta & Kenneth A. Mundt, “Systematic review of the scientific evidence of the pulmonary carcinogenicity of talc,” 10 Front. Public Health 989111 (2022).

Professor Lahav’s Radically Misguided Treatment of Chancy Tort Causation

September 27th, 2024

In the 19th and early 20th centuries, scientists and lay people usually conceptualized causation as “deterministic.” Their model of science was perhaps what was called Newtonian, in which observations were invariably described in terms of identifiable forces that acted upon antecedent phenomena. The universe was akin to a pool table, with the movement of the billiard balls fully explained by their previous positions, masses, and movements. There was little need for probability to describe events or outcomes in such a universe.

The 20th century ushered in probabilistic concepts and models in physics and biology. Because tort law is so concerned with claims of bodily integrity and harm, I focus here on claimed health effects. Departing from the Koch-Henle postulates and our understanding of pathogen-based diseases, the latter half of the 20th century saw the rise of observational epidemiology, and of scientific conclusions about stochastic processes and effects that could best be understood in terms of probabilities, with statistical inferences drawn from samples of populations. The language of deterministic physics failed to do justice to epidemiologic evidence or conclusions. Modern medicine and biology invoked notions of base rates for chronic diseases, which rates might be modified by environmental exposures.

In the wake of the emerging science of epidemiology, the law experienced a new horizon, on which many claimed tortogens did not involve exposures uniquely tied to the harms alleged. Rather, the harms asserted were often diseases of ordinary life, with evidence suggesting that those harms were quantitatively more prevalent or incident among people exposed to the alleged tortogen. Of course, the backwaters of tort law saw reactionary world views on trial, as with claims of trauma-induced cancer, which are with us still. Nonetheless, slowly but not always steadily, the law came to grips with probability and statistical evidence.

In law, as in science, a key component of causal attribution is counterfactual analysis. If A causes B, then if in the same world, ceteris paribus, we do not have A, then we don’t have B. Counterfactual analysis applies as much to stochastic processes that are causally influenced by rate changes as it applies to the Newtonian world of billiard balls. Some writers in the legal academy, however, would opportunistically use the advent of probabilistic analyses of health effects to dispose of science altogether. No one has more explicitly exploited the opportunity than Professor Alexandra Lahav.

In an essay published in 2022, Professor Lahav advanced extraordinary claims about probabilistic causation, or what she called “chancy causation.”[1] The proffered definition of chancy causation is bumfuzzling. Lahav provides an example of an herbicide that is “associated” with the type of cancer that the heavily exposed plaintiff developed. She tells us that:

“[t]here is a chance that the exposure caused his cancer, and a chance that it did not. Probability follows certain rules, or tendencies, but these regular laws do not abolish chance. This is a common problem in modern life, where much of what we know about medicines, interventions, and the chemicals to which we are exposed is probabilistic. Following the philosophical literature, I call this phenomenon chancy causation.”[2]

So the rules of probability do not abolish chance? It is hard to know what Lahav is trying to say here. Probability quantifies chance, and gives us an understanding of phenomena and their predictability. When we can model an empirical process with a probability distribution, such as one that is independent and identically distributed, we can often make and test quantitative inferences about the anticipated phenomena.

Lahav vaguely acknowledges that her term, “chancy causation,” is borrowed, but she does not give credit to the many authors who have used it before.[3] Lahav does note that the concept of probabilistic causation used in modern-day risk factor epidemiology is different from the deterministic causal claims that dominated tort law in the 19th and the first half of the 20th century. Lahav claims that chancy causation is inconsistent with counterfactual analysis, but she cites no support for her claim, which is demonstrably false. If we previously saw the counterfactual, “if A then B,” as key to causality, we can readily restate the counterfactual as a probability: A probably causes B. On a counterfactual analysis, if we do not have A as an antecedent, then we probably do not have B. For a classic tortogen such as tobacco smoking, we can say confidently that tobacco smoking probably causes lung cancer. And for a given instance of lung cancer, we can say based upon the entire evidentiary display, that if a person did not smoke tobacco, he would probably not have developed lung cancer. Of course, the correspondence is not 100 percent, which is only to say that it is probabilistic. There are highly penetrant genetic mutations that may be the cause of a given lung cancer case. We know, however, that such mutations do not cause or explain the large majority of lung cancer cases.

Contrary to Lahav’s ipse dixits, tort law can incorporate, and has accommodated, both general and specific causation in terms of probabilistic counterfactuals. The modification requires us, of course, to address the baseline situation as a rate or frequency of events, and the post-exposure world as one with a modified rate or frequency. Without confusion or embarrassment, we can say that the exposure is the cause of the change in event rates. Modern physics similarly addresses whether we must be content with probability statements, rather than precise deterministic “billiard ball” physics, which is so useful in a game of snooker, but less so in describing the position of sub-atomic particles. In the first half of the 20th century, the biological sciences learned with some difficulty that it must embrace probabilistic models, in genetic science, as well as in epidemiology. Many biological causation models are completely stated in terms of probabilities that are modified by specified conditions.

Lahav intends for her rejection of counterfactual causality to do a lot of work in her post-modern program. By falsely claiming that chancy causation has no factual basis, Lahav jumps to the conclusion that what the law calls for is nothing but “policy,”[4] and “normative decision.”[5] Having fabricated the demise of but-for causation in the context of probabilistic relationships, Lahav suggests that tort law can pretend that the causation question is nothing more than a normative analysis of the defendant’s conduct. (Perhaps it is more than a tad revealing that she does not see that the plaintiff’s conduct is involved in the normative judgment.) Of course, tort law already has ample room for policy and normative considerations built into the concepts of duty and breach of duty.

As we saw with the lung cancer example above, the claim that tobacco smoking probably caused the smoker to develop lung cancer can be entirely factual, and supported by a probabilistic judgment. Lahav calls her erroneous move “pragmatic,” although it has no relationship to the philosophical pragmatism of Peirce or Quine. Lahav’s move is a misrepresentation of probability and of epidemiologic science in the name of compensation free-for-alls. Obtaining a heads in the flip of a fair coin has a probability of 50%; that is a fact, not a normative decision, even though it is, to use Lahav’s vocabulary, “chancy.”

Lahav’s argument is not always easy to follow. In one place, she uses “chancy” to refer to the posterior probability of the correctness of the causal claim:

“the counterfactual standard can be successfully defended against by the introduction of chance. The more conflicting studies, the “more chancy” the causation. By that I do not mean proving a lower probability (although this is a good result from a defense point of view) but rather that more, different study results create the impression of irreducible chanciness, which in turn dictates that the causal relation cannot be definitively proven.”[6]

This usage, which clearly refers to the posterior probability of a claim, is not necessarily limited to so-called non-deterministic phenomena. People could refer to any conclusion, based upon conflicting evidence of deterministic phenomena, as “chancy.”

Lurking in her essay is a further confusion between the posterior probability we might assign to a claim, or to an inference from probabilistic evidence, and the probability of random error. In an interview conducted by Felipe Jiménez,[7] Lahav was more transparent in her confusion, and she explicitly committed the transpositional fallacy with her suggestion that customary statistical standards (statistical significance) ensure that even small increased risks, say of 30%, are known to a high degree of certainty.

Despite these confusions, it seems fairly clear that Lahav is concerned with stochastic causal processes, and most of her examples evidence that concern. Lahav poses a hypothetical in which epidemiologic studies show smokers have a 20% increased risk of developing lung cancer compared with non-smokers.[8] Given that typical smoking histories convey relative risks of 20 to 30, or increased risks of 2,000 to 3,000%, Lahav’s hypothetical may make readers think she is shilling for tobacco companies. In any event, in the face of a 20% increased risk (or relative risk of 1.2), Lahav acknowledges that the probability of a smoker’s developing lung cancer is higher than that of a non-smoker, but “in any particular case the question whether a patient’s lung cancer was caused by smoking is uncertain.” This assertion, however, is untrue; the question is not “uncertain.” She has provided a certain quantification of the increased risk. Furthermore, her hypothetical gives us a good deal of information on which we can say that smoking probably did not result in the patient’s lung cancer. Causation may be chancy because it is based upon a probabilistic inference, but the chances are actually known, and they are low.

Lahav posits a more interesting hypothetical when she considers a case in which there is an 80% chance that a person’s lung cancer is attributable to smoking.[9] We can understand this hypothetical better if we reframe it as a classic urn probability problem. In a given (large) population of non-smokers, we expect 100 lung cancers per year. In a population of smokers, otherwise just like the population of non-smokers, we observe 500 lung cancers. Of the observed number, 100 were “expected” because they happen without exposure to the putative causal agent, and 400 are “excess.” The relative risk would be 5, or a 400% increased risk, still well below the actual measure of risk from long-term smoking, but the attributable risk would be [(RR-1)/RR], or 0.8 (or 80%). If we imagine an urn with 100 white “expected” balls, and 400 red “excess” balls added, then any given draw from the urn, with replacement, yields an 80% probability of a red ball, or an excess case. Of course, if we can see the color, we may come to a consensus judgment that the ball is actually red. But on our analogy to discerning the cause of a given lung cancer, we have at present nothing by way of evidence with which to call the question, and so it remains “chancy” or probabilistic. The question is not, however, in any way normative. The answer is different quantitatively in the 20% and in the 400% hypotheticals.
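The urn arithmetic above is easily checked. The following short sketch (an illustration of this post’s point, not anything drawn from Lahav’s essay) computes the attributable fraction from the hypothetical’s numbers and confirms it by simulated draws from the urn:

```python
import random

# Urn model for the 80% hypothetical: 100 "expected" (white) lung cancers
# that would occur without the exposure, plus 400 "excess" (red) cases
# observed among the exposed. Relative risk = 500/100 = 5.
rr = 500 / 100
attributable_fraction = (rr - 1) / rr
print(f"Attributable fraction: {attributable_fraction:.2f}")  # 0.80

# Drawing one case at random from the urn, with replacement, should yield
# a red ("excess") ball about 80% of the time.
random.seed(1)  # arbitrary seed, for reproducibility
urn = ["white"] * 100 + ["red"] * 400
draws = 100_000
share_red = sum(random.choice(urn) == "red" for _ in range(draws)) / draws
print(f"Simulated share of excess cases: {share_red:.3f}")  # near 0.800
```

The simulation adds nothing to the algebra; it simply makes vivid that the 80% figure is a known, quantified probability rather than a normative judgment.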

Lahav asserts that we are in a state of complete ignorance once a smoker has lung cancer.[10] This is not, however, true. We have the basis for a probabilistic judgment that will probably be true. It may well be true that the probability of attribution will be affected by the probability that the relative risk = 5 is correct. If the posterior probability for the claim that smoking causes lung cancer by increasing its risk 400% is only 30%, then of course, we could not make the attribution in a given case with an 80% probability of correctness. In actual litigation, the argument is often framed on an assumption arguendo that the increased risk is greater than two, so that only the probability of attribution is involved. If the posterior probability of the claim that exposure to the tortogen increased risk by 400% or 20,000% were only 0.49, then the plaintiff would lose. If the posterior probability of the increased risk were greater than 0.5, the finder of fact could find that the specific causation claim had been carried if the magnitude of the relative risk, and the attributable risk, were sufficiently large. This inference on specific causation would not be a normative judgment; it would be guided by factual evidence about the magnitude of the relevant increased risk.
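The “relative risk greater than two” framing in the paragraph above falls out of the attributable-fraction formula itself: (RR-1)/RR exceeds 0.5 exactly when RR exceeds 2. A minimal check of that algebra, using this post’s examples rather than anything in Lahav’s text:

```python
def attributable_fraction(rr: float) -> float:
    """Share of exposed cases that are 'excess' cases, given relative risk rr."""
    return (rr - 1) / rr

# AF crosses 0.5 precisely at RR = 2, which is why litigants argue over
# whether the relative risk exceeds two before reaching attribution.
for rr in (1.2, 2.0, 5.0, 20.0):
    print(f"RR = {rr:>4}: AF = {attributable_fraction(rr):.3f}")
# AF is about 0.167, 0.500, 0.800, and 0.950, respectively.
```

Note that at Lahav’s hypothetical RR of 1.2, the attributable fraction is only about one in six, while at the realistic smoking relative risk of 20, it is 95%: the difference is quantitative and factual, not normative.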

Lahav advances a perverse skepticism that any inferences about individuals can be drawn from information about rates or frequencies in groups of similar individuals. Yes, there may always be some debate about what is “similar,” but successive studies may well draw the net tighter around what is the appropriate class. Lahav’s skepticism and her outright denialism about inferences from general causation to specific causation are common among some in the legal academy, but they ignore that group to individual inferences are drawn in epidemiology in multiple contexts. Regressions for disease prediction are based upon individual data within groups, and the regression equations are then applied to future individuals to help predict those individuals’ probability of future disease (such as heart attack or breast cancer), or their probability of cancer-free survival after a specific therapy. Group to individual inferences are, of course, also the basis for prescribing decisions in clinical medicine. These are not normative inferences; they are based upon evidence-based causal thinking about probabilistic inferences.

In the early tobacco litigation, defendants denied that tobacco smoking caused lung cancer, but they argued that even if it did, and the relative risk were 20, then the specific causation inference in any given case was still insecure because the epidemiologic study tells us nothing about the particular case. Lahav seems to be channeling the tobacco-company argument, which has long since been rejected on the substantive law of causation. Indeed, as noted, epidemiologists do draw inferences about individual cases from population-based studies when they invoke clinical prediction models such as the Framingham cardiovascular risk event model, or the Gail breast cancer prediction model. Physicians base important clinical interventions, both pharmacologic and surgical, for individuals upon population studies. Lahav asserts, without evidence, that the only difference between an intervention based upon an 80% or a 30% probability is a “normative implication.”[11] The difference is starkly factual, not normative, and describes a long-term likelihood of success, as well as an individual probability of success.

Post-Modern Causation

What we have in Lahav’s essay is the ultimate post-modern program, which asserts, without evidence, that when causation is “chancy,” or indeterminate, courts leave the realm of science and step into the twilight zone of “normative decisions.” Lahav suggests that there is an extreme plasticity to the very concept of causation such that causation can be whatever judges want it to be. I for one sincerely doubt it. And if judges make up some Lahav-inspired concept of normative causation, the scientific community would rightfully scoff.

Establishing causation can be difficult, and many so-called mass tort litigations have failed for want of sufficient, valid evidence supporting causal claims. The late Professor Margaret Berger reacted to this difficulty in a more forthright way by arguing for the abandonment of general causation, or cause-in-fact, as an element of tort claims under the law.[12] Berger’s antipathy to requiring causation manifested in her hostility to judicial gatekeeping of the validity of expert witness opinions. Her animus against requiring causation and gatekeeping under Rule 702 was so strong that it exceeded her lifespan. Berger’s chapter in the third edition of the Reference Manual on Scientific Evidence, which came out almost one year after her death, embraced the First Circuit’s notorious anti-Daubert decision in Milward, which also post-dated her passing.[13]

Professor Lahav has previously expressed a disdain for the causation requirement in tort law. In an earlier paper, “The Knowledge Remedy,” Lahav argued for an extreme, radical precautionary principle approach to causation.[14] Lahav believes that the likes of David Michaels have “demonstrated” that manufactured uncertainty is a genuine problem, but not one that affects her main claims. Remarkably, Lahav sees no problem with manufactured certainty in the advocacy science of many authors or the lawsuit industry.[15] In “Chancy Causation,” Lahav thus credulously repeats Michaels’ arguments, and goes so far as to describe Rule 702 challenges to causal claims as having the “negative effect” of producing “incentives to sow doubt about epidemiologic studies using methodological battles, a strategy pioneered by the tobacco companies … .”[16] Lahav’s agenda is revealed by the absence of any corresponding concern about the negative effect of producing incentives to overstate the findings, or the validity of inferences, in order to obtain unwarranted and unsafe verdicts for claimants.


[1] Alexandra D. Lahav, “Chancy Causation in Tort,” 15 J. Tort L. 109 (2022) [hereafter Chancy Causation].

[2] Chancy Causation at 110.

[3] See, e.g., David K. Lewis, Philosophical Papers: Volume 2 175 (1986); Mark Parascandola, “Evidence and Association: Epistemic Confusion in Toxic Tort Law,” 63 Phil. Sci. S168 (1996).

[4] Chancy Causation at 109.

[5] Chancy Causation at 110-11.

[6] Chancy Causation at 129.

[7] Felipe Jiménez, “Alexandra Lahav on Chancy Causation in Tort,” The Private Law Podcast (Mar. 29, 2021).

[8] Chancy Causation at 115.

[9] Chancy Causation at 116-17.

[10] Chancy Causation at 117.

[11] Chancy Causation at 119.

[12] Margaret A. Berger, “Eliminating General Causation: Notes towards a New Theory of Justice and Toxic Torts,” 97 Colum. L. Rev. 2117 (1997).

[13] Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11 (1st Cir. 2011), cert. denied sub nom., U.S. Steel Corp. v. Milward, 132 S. Ct. 1002 (2012).

[14] Alexandra D. Lahav, “The Knowledge Remedy,” 98 Texas L. Rev. 1361 (2020). See “The Knowledge Remedy Proposal,” Tortini (Nov. 14, 2020).

[15] Chancy Causation at 118 (citing plaintiffs’ expert witness David Michaels, The Triumph of Doubt: Dark Money and the Science of Deception (2020), among others).

[16] Chancy Causation at 129.

The Role of Peer Review in Rule 702 and 703 Gatekeeping

November 19th, 2023

“There is no expedient to which man will not resort to avoid the real labor of thinking.”
              Sir Joshua Reynolds (1723-92)

Some courts appear to duck the real labor of thinking, and the duty to gatekeep expert witness opinions, by deferring to expert witnesses who advert to their reliance upon peer-reviewed published studies. Does the law really support such deference, especially when problems with the relied-upon studies are revealed in discovery? A careful reading of the Supreme Court’s decision in Daubert, and of the Reference Manual on Scientific Evidence, provides no support for admitting expert witness opinion testimony that relies upon peer-reviewed published studies, when the studies are invalid or are based upon questionable research practices.[1]

In Daubert v. Merrell Dow Pharmaceuticals, Inc.,[2] the Supreme Court suggested that peer review of studies relied upon by a challenged expert witness should be a factor in determining the admissibility of that expert witness’s opinion. In thinking about the role of peer-review publication in expert witness gatekeeping, it is helpful to remember the context of how and why the Supreme Court was talking about peer review in the first place. In the trial court, the Daubert plaintiff had proffered an expert witness opinion that featured reliance upon an unpublished reanalysis of published studies. On the defense motion, the trial court excluded the claimant’s witness,[3] and the Ninth Circuit affirmed.[4] The intermediate appellate court expressed its view that unpublished, non-peer-reviewed reanalyses were deviations from generally accepted scientific discourse, and that other appellate courts, considering the alleged risks of Bendectin, refused to admit opinions based upon unpublished, non-peer-reviewed reanalyses of epidemiologic studies.[5] The Circuit expressed its view that reanalyses are generally accepted by scientists when they have been verified and scrutinized by others in the field. Unpublished reanalyses done solely for litigation would be an insufficient foundation for expert witness opinion.[6]

The Supreme Court, in Daubert, evaded the difficult issues involved in evaluating a statistical analysis that has not been published by deciding the case on the ground that the lower courts had applied the wrong standard. The so-called Frye test, or what I call the “twilight zone” test, comes from the heralded 1923 case excluding opinion testimony based upon a lie detector:

“Just when a scientific principle or discovery crosses the line between the experimental and demonstrable stages is difficult to define. Somewhere in this twilight zone the evidential force of the principle must be recognized, and while the courts will go a long way in admitting expert testimony deduced from a well recognized scientific principle or discovery, the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs.”[7]

The Supreme Court, in Daubert, held that with the promulgation of the Federal Rules of Evidence in 1975, the twilight zone test was no longer legally valid. The guidance for admitting expert witness opinion testimony lay in Federal Rule of Evidence 702, which outlined an epistemic test for “knowledge” that would be helpful to the trier of fact. The Court then proceeded to articulate several non-definitive factors for “good science,” which might guide trial courts in applying Rule 702, such as testability or falsifiability, and a showing of a known or potential error rate. General acceptance, carried over from Frye, remained a consideration.[8] Courts have continued to build on this foundation to identify other relevant considerations in gatekeeping.[9]

One of the Daubert Court’s pertinent considerations was “whether the theory or technique has been subjected to peer review and publication.”[10] The Court, speaking through Justice Blackmun, provided a reasonably cogent, but probably now outdated, discussion of peer review:

 “Publication (which is but one element of peer review) is not a sine qua non of admissibility; it does not necessarily correlate with reliability, see S. Jasanoff, The Fifth Branch: Science Advisors as Policymakers 61-76 (1990), and in some instances well-grounded but innovative theories will not have been published, see Horrobin, “The Philosophical Basis of Peer Review and the Suppression of Innovation,” 263 JAMA 1438 (1990). Some propositions, moreover, are too particular, too new, or of too limited interest to be published. But submission to the scrutiny of the scientific community is a component of “good science,” in part because it increases the likelihood that substantive flaws in methodology will be detected. See J. Ziman, Reliable Knowledge: An Exploration of the Grounds for Belief in Science 130-133 (1978); Relman & Angell, “How Good Is Peer Review?,” 321 New Eng. J. Med. 827 (1989). The fact of publication (or lack thereof) in a peer reviewed journal thus will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”[11]

To the extent that peer review was touted by Justice Blackmun, it was because the peer-review process advanced the ultimate consideration of the scientific validity of the opinion or claim under consideration. Validity was the thing; peer review was just a crude proxy.

If the Court were writing today, it might well have written that peer review is often a feature of bad science, advanced by scientists who know that peer-reviewed publication is the price of admission to the advocacy arena. And of course, the wild proliferation of journals, including the “pay-to-play” journals, facilitates the festschrift.

Reference Manual on Scientific Evidence

Certainly, judicial thinking has evolved since 1993 and the decision in Daubert. Other considerations for gatekeeping have been added. Importantly, Daubert involved the interpretation of a statute, and in 2000, the statute was amended.

Since the Daubert decision, the Federal Judicial Center and the National Academies of Science have weighed in with what is intended to be guidance for judges and lawyers litigating scientific and technical issues. The Reference Manual on Scientific Evidence is currently in a third edition, but a fourth edition is expected in 2024.

How does the third edition[12] treat peer review?

An introduction by now retired Associate Justice Stephen Breyer blandly reports the Daubert considerations, without elaboration.[13]

The most revealing and important chapter in the Reference Manual is the one on scientific method and procedure, and sociology of science, “How Science Works,” by Professor David Goodstein.[14] This chapter’s treatment is not always consistent. In places, the discussion of peer review is trenchant. At other places, it can be misleading. Goodstein’s treatment, at first, appears to be a glib endorsement of peer review as a substitute for critical thinking about a relied-upon published study:

“In the competition among ideas, the institution of peer review plays a central role. Scientific articles submitted for publication and proposals for funding often are sent to anonymous experts in the field, in other words, to peers of the author, for review. Peer review works superbly to separate valid science from nonsense, or, in Kuhnian terms, to ensure that the current paradigm has been respected.11 It works less well as a means of choosing between competing valid ideas, in part because the peer doing the reviewing is often a competitor for the same resources (space in prestigious journals, funds from government agencies or private foundations) being sought by the authors. It works very poorly in catching cheating or fraud, because all scientists are socialized to believe that even their toughest competitor is rigorously honest in the reporting of scientific results, which makes it easy for a purposefully dishonest scientist to fool a referee. Despite all of this, peer review is one of the venerated pillars of the scientific edifice.”[15]

A more nuanced and critical view emerges in footnote 11, from the above-quoted passage, when Goodstein discusses how peer review was framed by some amici curiae in the Daubert case:

“The Supreme Court received differing views regarding the proper role of peer review. Compare Brief for Amici Curiae Daryl E. Chubin et al. at 10, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993) (No. 92-102) (“peer review referees and editors limit their assessment of submitted articles to such matters as style, plausibility, and defensibility; they do not duplicate experiments from scratch or plow through reams of computer-generated data in order to guarantee accuracy or veracity or certainty”), with Brief for Amici Curiae New England Journal of Medicine, Journal of the American Medical Association, and Annals of Internal Medicine in Support of Respondent, Daubert v. Merrell Dow Pharm., Inc., 509 U.S. 579 (1993) (No. 92-102) (proposing that publication in a peer-reviewed journal be the primary criterion for admitting scientific evidence in the courtroom). See generally Daryl E. Chubin & Edward J. Hackett, Peerless Science: Peer Review and U.S. Science Policy (1990); Arnold S. Relman & Marcia Angell, How Good Is Peer Review? 321 New Eng. J. Med. 827–29 (1989). As a practicing scientist and frequent peer reviewer, I can testify that Chubin’s view is correct.”[16]

So, if, as Professor Goodstein attests, Chubin is correct that peer review does not “guarantee accuracy or veracity or certainty,” the basis for veneration is difficult to fathom.

Later in Goodstein’s chapter, in a section entitled “V. Some Myths and Facts about Science,” the gloves come off:[17]

“Myth: The institution of peer review assures that all published papers are sound and dependable.

Fact: Peer review generally will catch something that is completely out of step with majority thinking at the time, but it is practically useless for catching outright fraud, and it is not very good at dealing with truly novel ideas. Peer review mostly assures that all papers follow the current paradigm (see comments on Kuhn, above). It certainly does not ensure that the work has been fully vetted in terms of the data analysis and the proper application of research methods.”[18]

Goodstein is not a post-modern nihilist. He acknowledges that “real” science can be distinguished from “not real science.” He can hardly be seen to have given a full-throated endorsement to peer review as satisfying the gatekeeper’s obligation to evaluate whether a study can be reasonably relied upon, or whether reliance upon such a particular peer-reviewed study can constitute sufficient evidence to render an expert witness’s opinion helpful, or the application of a reliable methodology.

Goodstein cites, with apparent approval, the amicus brief filed by the New England Journal of Medicine, and other journals, which advised the Supreme Court that “good science” requires “a rigorous trilogy of publication, replication and verification before it is relied upon.”[19]

“Peer review’s ‘role is to promote the publication of well-conceived articles so that the most important review, the consideration of the reported results by the scientific community, may occur after publication.’”[20]

Outside of Professor Goodstein’s chapter, the Reference Manual devotes very little ink or analysis to the role of peer review in assessing Rule 702 or 703 challenges to witness opinions or specific studies.  The engineering chapter acknowledges that “[t]he topic of peer review is often raised concerning scientific and technical literature,” and helpfully supports Goodstein’s observations by noting that peer review “does not ensure accuracy or validity.”[21]

The chapter on neuroscience is one of the few chapters in the Reference Manual, other than Professor Goodstein’s, to address the limitations of peer review. The absence of peer review is highly suspicious, but its presence is only the beginning of an evaluation process that continues after publication:

Daubert’s stress on the presence of peer review and publication corresponds nicely to scientists’ perceptions. If something is not published in a peer-reviewed journal, it scarcely counts. Scientists only begin to have confidence in findings after peers, both those involved in the editorial process and, more important, those who read the publication, have had a chance to dissect them and to search intensively for errors either in theory or in practice. It is crucial, however, to recognize that publication and peer review are not in themselves enough. The publications need to be compared carefully to the evidence that is proffered.[22]

The neuroscience chapter goes on to discuss peer review also in the narrow context of functional magnetic resonance imaging (fMRI). The authors note that fMRI, as a medical procedure, has been the subject of thousands of peer-reviewed publications, but those peer reviews do little to validate the use of fMRI as a high-tech lie detector.[23] The mental health chapter notes in a brief footnote that the science of memory is now well accepted and has been subjected to peer review, and that “[c]areful evaluators” use only tests that have had their “reliability and validity confirmed in peer-reviewed publications.”[24]

Echoing other chapters, the engineering chapter also mentions peer review briefly in connection with qualifying as an expert witness, and in validating the value of accrediting societies.[25]  Finally, the chapter points out that engineering issues in litigation are often sufficiently novel that they have not been explored in peer-reviewed literature.[26]

Most of the other chapters of the Reference Manual, third edition, discuss peer review only in the context of qualifications and membership in professional societies.[27] The chapter on exposure science discusses peer review only in the narrow context of a claim that EPA guidance documents on exposure assessment are peer reviewed and are considered “authoritative.”[28]

Other chapters discuss peer review briefly and again only in very narrow contexts. For instance, the epidemiology chapter discusses peer review in connection with two very narrow issues peripheral to Rule 702 gatekeeping. First, the chapter raises the question (without providing a clear answer) whether non-peer-reviewed studies should be included in meta-analyses.[29] Second, the chapter asserts that “[c]ourts regularly affirm the legitimacy of employing differential diagnostic methodology,” to determine specific causation, on the basis of several factors, including the questionable claim that the methodology “has been subjected to peer review.”[30] There appears to be no discussion in this key chapter about whether, and to what extent, peer review of published studies can or should be considered in the gatekeeping of epidemiologic testimony. There is certainly nothing in the epidemiology chapter, or for that matter elsewhere in the Reference Manual, to suggest that reliance upon a peer-reviewed published study pretermits analysis of that study to determine whether it is indeed internally valid or reasonably relied upon by expert witnesses in the field.


[1] See Jop de Vrieze, “Large survey finds questionable research practices are common: Dutch study finds 8% of scientists have committed fraud,” 373 Science 265 (2021); Yu Xie, Kai Wang, and Yan Kong, “Prevalence of Research Misconduct and Questionable Research Practices: A Systematic Review and Meta-Analysis,” 27 Science & Engineering Ethics 41 (2021).

[2] 509 U.S. 579 (1993).

[3]  Daubert v. Merrell Dow Pharmaceuticals, Inc., 727 F.Supp. 570 (S.D.Cal.1989).

[4] 951 F. 2d 1128 (9th Cir. 1991).

[5]  951 F. 2d, at 1130-31.

[6] Id. at 1131.

[7] Frye v. United States, 293 F. 1013, 1014 (D.C. Cir. 1923) (emphasis added).

[8]  Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 590 (1993).

[9] See, e.g., In re TMI Litig. II, 911 F. Supp. 775, 787 (M.D. Pa. 1995) (considering the relationship of the technique to methods that have been established to be reliable, the uses of the method in the actual scientific world, the logical or internal consistency and coherence of the claim, the consistency of the claim or hypothesis with accepted theories, and the precision of the claimed hypothesis or theory).

[10] Id. at 593.

[11] Id. at 593-94.

[12] National Research Council, Reference Manual on Scientific Evidence (3rd ed. 2011) [RMSE].

[13] Id., “Introduction” at 1, 13.

[14] David Goodstein, “How Science Works,” RMSE 37.

[15] Id. at 44-45.

[16] Id. at 44-45 n. 11 (emphasis added).

[17] Id. at 48 (emphasis added).

[18] Id. at 49 n.16 (emphasis added).

[19] David Goodstein, “How Science Works,” RMSE 64 n.45 (citing Brief for the New England Journal of Medicine, et al., as Amici Curiae supporting Respondent, 1993 WL 13006387 at *2, in Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993)).

[20] Id. (citing Brief for the New England Journal of Medicine, et al., 1993 WL 13006387 at *3).

[21] Channing R. Robertson, John E. Moalli, David L. Black, “Reference Guide on Engineering,” RMSE 897, 938 (emphasis added).

[22] Henry T. Greely & Anthony D. Wagner, “Reference Guide on Neuroscience,” RMSE 747, 786.

[23] Id. at 776, 777.

[24] Paul S. Appelbaum, “Reference Guide on Mental Health Evidence,” RMSE 813, 866, 886.

[25] Channing R. Robertson, John E. Moalli, David L. Black, “Reference Guide on Engineering,” RMSE 897, 901, 931.

[26] Id. at 935.

[27] Daniel Rubinfeld, “Reference Guide on Multiple Regression,” RMSE 303, 328 (“[w]ho should be qualified as an expert?”); Shari Seidman Diamond, “Reference Guide on Survey Research,” RMSE 359, 375; Bernard D. Goldstein & Mary Sue Henifin, “Reference Guide on Toxicology,” RMSE 633, 677, 678 (noting that membership in some toxicology societies turns in part on having published in peer-reviewed journals).

[28] Joseph V. Rodricks, “Reference Guide on Exposure Science,” RMSE 503, 508 (noting that EPA guidance documents on exposure assessment often are issued after peer review).

[29] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” RMSE 549, 608.

[30] Id. at 617-18 n.212.

Consensus is Not Science

November 8th, 2023

Ted Simon, a toxicologist and a fellow board member at the Center for Truth in Science, has posted an intriguing piece in which he labels scientific consensus a fool’s errand.[1] Ted begins his piece by channeling the late Michael Crichton, who famously derided consensus in science in his 2003 Caltech Michelin Lecture:

“Let’s be clear: the work of science has nothing whatever to do with consensus. Consensus is the business of politics. Science, on the contrary, requires only one investigator who happens to be right, which means that he or she has results that are verifiable by reference to the real world. In science, consensus is irrelevant. What is relevant is reproducible results. The greatest scientists in history are great precisely because they broke with the consensus.

* * * *

There is no such thing as consensus science. If it’s consensus, it isn’t science. If it’s science, it isn’t consensus. Period.”[2]

Crichton’s (and Simon’s) critique of consensus is worth remembering in the face of recent proposals by Professor Edward Cheng,[3] and others,[4] to make consensus the touchstone for the admissibility of scientific opinion testimony.

Consensus or general acceptance can be a proxy for conclusions drawn from valid inferences, within reliably applied methodologies, based upon sufficient evidence, quantitatively and qualitatively. When expert witnesses opine contrary to a consensus, they raise serious questions regarding how they came to their conclusions. Carl Sagan declaimed that “extraordinary claims require extraordinary evidence,” but his principle was hardly novel. Some authors quote the French polymath Pierre Simon Marquis de Laplace, who wrote in 1810: “[p]lus un fait est extraordinaire, plus il a besoin d’être appuyé de fortes preuves,”[5] but as the Quote Investigator documents,[6] the basic idea is much older, going back at least another century to a church rector who expressed his skepticism of a contemporary’s claim of direct communication with the almighty: “Sure, these Matters being very extraordinary, will require a very extraordinary Proof.”[7]

Ted Simon’s essay is also worth consulting because he notes that many sources of apparent consensus are really faux consensus, nothing more than self-appointed intellectual authoritarians who systematically have excluded some points of view, while turning a blind eye to their own positional conflicts.

Lawyers, courts, and academics should be concerned that Cheng’s “consensus principle” will change the focus from evidence, methodology, and inference, to a surrogate or proxy for validity. And the sociological notion of consensus will then require litigation of whether some group really has announced a consensus. Consensus statements in some areas abound, but inquiring minds may want to know whether they are the result of rigorous, systematic reviews of the pertinent studies, and whether the available studies can support the claimed consensus.

Professor Cheng is hard at work on a book-length explication of his proposal, and some criticism will have to await the event.[8] Perhaps Cheng will overcome the objections placed against his proposal.[9] Some of the examples Professor Cheng has given, however, do not inspire confidence. Consider his dramatic misreading of the American Statistical Association’s 2016 p-value consensus statement, which he takes to represent, in Cheng’s words:

“[w]hile historically used as a rule of thumb, statisticians have now concluded that using the 0.05 [p-value] threshold is more distortive than helpful.”[10]

The 2016 Statement said no such thing, although a few statisticians attempted to distort the statement in the way that Cheng suggests. In 2021, a select committee of leading statisticians, appointed by the President of the ASA, issued a statement to make clear that the ASA had not embraced the Cheng misinterpretation.[11] This one example alone does not bode well for the viability of Cheng’s consensus principle.


[1] Ted Simon, “Scientific consensus is a fool’s errand made worse by IARC” (Oct. 2023).

[2] Michael Crichton, “Aliens Cause Global Warming,” Caltech Michelin Lecture (Jan. 17, 2003).

[3] Edward K. Cheng, “The Consensus Rule: A New Approach to Scientific Evidence,” 75 Vanderbilt L. Rev. 407 (2022) [Consensus Rule].

[4] See Norman J. Shachoy Symposium, The Consensus Rule: A New Approach to the Admissibility of Scientific Evidence (2022), 67 Villanova L. Rev. (2022); David S. Caudill, “The ‘Crisis of Expertise’ Reaches the Courtroom: An Introduction to the Symposium on, and a Response to, Edward Cheng’s Consensus Rule,” 67 Villanova L. Rev. 837 (2022); Harry Collins, “The Owls: Some Difficulties in Judging Scientific Consensus,” 67 Villanova L. Rev. 877 (2022); Robert Evans, “The Consensus Rule: Judges, Jurors, and Admissibility Hearings,” 67 Villanova L. Rev. 883 (2022); Martin Weinel, “The Adversity of Adversarialism: How the Consensus Rule Reproduces the Expert Paradox,” 67 Villanova L. Rev. 893 (2022); Wendy Wagner, “The Consensus Rule: Lessons from the Regulatory World,” 67 Villanova L. Rev. 907 (2022); Edward K. Cheng, Elodie O. Currier & Payton B. Hampton, “Embracing Deference,” 67 Villanova L. Rev. 855 (2022).

[5] Pierre-Simon Laplace, Théorie analytique des probabilités (1812) (“The more extraordinary a fact, the more it needs to be supported by strong proofs.”). See Tressoldi, “Extraordinary Claims Require Extraordinary Evidence: The Case of Non-Local Perception, a Classical and Bayesian Review of Evidences,” 2 Frontiers Psych. 117 (2011); Charles Coulston Gillispie, Pierre-Simon Laplace, 1749-1827: a life in exact science (1997).

[6] “Extraordinary Claims Require Extraordinary Evidence” (Dec. 5, 2021).

[7] Benjamin Bayly, An Essay on Inspiration 362, part 2 (2nd ed. 1708).

[8] The Consensus Principle, under contract with the University of Chicago Press.

[9] See “Cheng’s Proposed Consensus Rule for Expert Witnesses” (Sept. 15, 2022); “Further Thoughts on Cheng’s Consensus Rule” (Oct. 3, 2022); “Consensus Rule – Shadows of Validity” (Apr. 26, 2023).

[10] Consensus Rule at 424 (citing but not quoting Ronald L. Wasserstein & Nicole A. Lazar, “The ASA Statement on p-Values: Context, Process, and Purpose,” 70 Am. Statistician 129, 131 (2016)).

[11] Yoav Benjamini, Richard D. DeVeaux, Bradly Efron, Scott Evans, Mark Glickman, Barry Braubard, Xuming He, Xiao Li Meng, Nancy Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young, and Karen Kafadar, “The ASA President’s Task Force Statement on Statistical Significance and Replicability,” 15 Annals of Applied Statistics 1084 (2021); see also “A Proclamation from the Task Force on Statistical Significance” (June 21, 2021).

Is the Scientific Method Fascist?

June 14th, 2023

Just before the pandemic, when our country seems to have gone tits up, there was a studied effort to equate any emphasis on scientific method, and the valuation of “[o]bjective, rational linear thinking,” “[c]ause and effect relationships,” and “[q]uantitative emphasis,” with white privilege and microaggression against non-white people.

I am not making up this claim; I am not creative enough. Indeed, for a while, the Smithsonian National Museum of African American History & Culture featured a graphic that included “emphasis on scientific method” as an aspect of white culture, and implied it was an unsavory aspect of “white privilege.”[1]

Well, as it turns out, scientific method is not only racist, but fascist as well.

With pretentious citations to Deleuze,[2] Foucault,[3] and Lyotard,[4] a group of Canadian authors[5] set out to decolonize science and medicine from the fascist grip of scientific methodology and organizations such as the Cochrane Group. The grand insight is that the health sciences have been “colonized” by a scientific research “paradigm” that is “outrageously exclusionary and dangerously normative with regards to scientific knowledge.” By excluding “alternative forms of knowledge,” evidence-based medicine acts as a “fascist structure.” The Cochrane Group in particular is singled out for having created an exclusionary and non-egalitarian hierarchy of evidence. Intolerance for non-approved modes of inference and thinking is, in these authors’ view, a “manifestation of fascism,” which is more “pernicious,” even if less brutal, than the fascism practiced by Hitler and Mussolini.[6]

Clutch the pearls!

Never mind that “deconstruction” itself sounds a bit fascoid,[7] not to mention a rather vague concept. The authors seem intent on promoting multiple ways of knowing without epistemic content. Indeed, our antifa authors do not attempt to show that evidence-based medicine leads regularly to incorrect results, or that their unspecified alternatives have greater predictive value. Nonetheless, decolonization of medicine and deconstruction of hierarchical methodology remain key for them to achieve an egalitarian epistemology, by which everyone is equally informed and equally stupid. In the inimitable words of the authors, “many scientists find themselves interpellated by hegemonic discourses and come to disregard all others.”[8]

These epistemic freedom fighters want to divorce the idea of evidence from objective reality, and make evidence bend to “values.”[9] Apparently, the required deconstruction of the “knowing subject” is that the subject is implicitly male, white, Western and heterosexual. Medicine’s fixation on binaries such as normal and pathological, male and female, shows that evidence-based medicine is simply not queer enough. Our intrepid authors must be credited for having outed the “hidden political agenda” of those who pretend simply to find the truth, but who salivate over imposing their “hegemonic norms,” asserted in the “name of ‘truth’.”

These Canadian authors leave us with a battle cry: “scholars have not only a scientific duty, but also an ethical obligation to deconstruct these regimes of power.”[10] Scientists of the world, you have nothing to lose but your socially constructed non-sensical conception of scientific truth.

Although it is easy to make fun of post-modernist pretensions,[11] there is a point about the force of argument and evidence. The word “valid” comes to us from the 16th century French word “valide,” which in turn comes from the Latin validus, meaning strong. Similarly, we describe a well-conducted study with robust findings as compelling our belief.

I recall the late Robert Nozick, back in the 1970s, expressing the view that someone who embraced a contradiction might pop out of existence, the way an electron and a positron might cancel each other. If only it were so, we might have people exercising more care in their thinking and speaking.


[1]Is Your Daubert Motion Racist?” (July 17, 2020). The Smithsonian has since seen fit to remove the chart reproduced here, but we know what they really believe.

[2] Gilles Deleuze and Félix Guattari, Anti-oedipus: Capitalism and Schizophrenia (1980); Gilles Deleuze and Félix Guattari, A Thousand Plateaus: Capitalism and Schizophrenia (1987). This dross enjoyed funding from the Canadian Institutes of Health Research, and the Social Science and Humanities Research Council of Canada.

[3] Michel Foucault, The Birth of the Clinic: An Archaeology of Medical Perception (1973); Michel Foucault, The History of Sexuality, Volume 1: An Introduction (trans. Robert Hurley 1978); Michel Foucault, Society Must Be Defended: Lectures at the Collège de France, 1975–1976 (2003); Michel Foucault, Power/Knowledge: Selected Interviews and Other Writings, 1972–1977 (1980); Michel Foucault, Fearless Speech (2001).

[4] Jean-François Lyotard, The Postmodern Condition: A Report on Knowledge (1984).

[5] Dave Holmes, Stuart J Murray, Amélie Perron, and Geneviève Rail, “Deconstructing the evidence-based discourse in health sciences: truth, power and fascism,” 4 Internat’l J. Evidence-Based Health 180 (2006) [Deconstructing].

[6] Deconstructing at 181.

[7] Pace David Frum.

[8] Deconstructing at 182.

[9] Deconstructing at 183.

[10] Deconstructing at 180-81.

[11] Alan D. Sokal, “Transgressing the Boundaries: Toward a Transformative Hermeneutics of Quantum Gravity,” 46 Social Text 217 (1994).

Consensus Rule – Shadows of Validity

April 26th, 2023

Back in 2011, at a Fourth Circuit Judicial Conference, Chief Justice John Roberts took a cheap shot at law professors and law reviews when he intoned:

“Pick up a copy of any law review that you see, and the first article is likely to be, you know, the influence of Immanuel Kant on evidentiary approaches in 18th Century Bulgaria, or something, which I’m sure was of great interest to the academic that wrote it, but isn’t of much help to the bar.”[1]

Anti-intellectualism is in vogue these days. No doubt, Roberts was jocularly indulging in an over-generalization, but for anyone who tries to keep up with the law reviews, he has a small point. Other judges have rendered similar judgments. Back in 1993, in a cranky opinion piece – in a law review – then Judge Richard A. Posner channeled the liar paradox by criticizing law review articles for “the many silly titles, the many opaque passages, the antic proposals, the rude polemics, [and] the myriad pretentious citations.”[2] In a speech back in 2008, Justice Stephen Breyer noted that “[t]here is evidence that law review articles have left terra firma to soar into outer space.”[3]

The temptation to rationalize, and to advocate for reflective equilibrium between the law as it exists and the law as we think it should be, combine to lead to some silly and harmful efforts to rewrite the law as we know it. Jeremy Bentham, Mr. Nonsense-on-Stilts, who sits stuffed in a hallway of University College London, ushered in a now venerable tradition of rejecting tradition and common sense, in proposing all sorts of law reforms.[4] In the early 1800s, Bentham, without much in the way of actual courtroom experience, deviled the English bench and bar with sweeping proposals to place evidence law on what he thought was a rational foundation. As with his naïve utilitarianism, Bentham’s contributions to jurisprudence often ignored the realities of human experience and decision making. The Benthamite tradition of anti-tradition is certainly alive and well in the law reviews.

Still, I have a soft place in my heart for law reviews. Although not peer reviewed, law reviews provide law students a tremendous opportunity to learn about writing and scholarship through publishing the work of legal scholars, judges, thoughtful lawyers, and other students. Not all law review articles are nonsense on stilts, but we certainly should have our wits about us when we read immodest proposals from the law professoriate.

*   *   *   *   *   *   *   *   *   *

Professor Edward Cheng has written broadly and insightfully about evidence law, and he certainly has the educational training to do so. Recently, Cheng has been bemused by the expert paradox, which asks how lay persons, without expertise, can evaluate and judge issues of the admissibility, validity, and correctness of expert opinion. The paradox has long haunted evidence law, and it is at center stage in the adjudication of expert admissibility issues, as well as the trial of technical cases. Cheng has now proposed a radical overhaul to the law of evidence, which would require that we stop asking courts to act as gatekeepers, and stop asking juries to determine the validity and correctness of expert witness opinions before them. Cheng’s proposal would revert to the nose-counting process of Frye and permit consideration only of whether there is an expert witness consensus to support the proffered opinion for any claim or defense.[5] Or, in Plato’s allegory of the cave, we must learn to be content with shadows on the wall rather than striving to know the real thing.

When Cheng’s proposal first surfaced, I wrote briefly about why it was a bad idea.[6] Since his initial publication, a law review symposium was assembled to address and perhaps to celebrate the proposal.[7] The papers from that symposium are now in print.[8] Unsurprisingly, the papers are both largely sympathetic (but not completely) to Cheng’s proposal, and virtually devoid of references to actual experiences of gatekeeping or trials of technical issues.

Cheng contends that the so-called Daubert framework for addressing the admissibility of expert witness opinion is wrong. He does not argue that the existing law, in the form of Federal Rules of Evidence 702 and 703, fails to call for an epistemic standard, both for admitting opinion testimony and for the fact-finders’ assessments. There is no effort to claim that four Supreme Court cases, and thousands of lower courts, have somehow erroneously viewed the whole process. Rather, Cheng simply asserts that non-expert judges cannot evaluate the reliability (validity) of expert witness opinions, and that non-expert jurors cannot “reach independent, substantive conclusions about specialized facts.”[9] The law must change to accommodate his judgment.

In his symposium contribution, Cheng expands upon his previous articulation of his proposed “consensus rule.”[10] What is conspicuously absent, however, is any example of failed gatekeeping that excluded valid expert witness opinion. One example, the appellate decision in Rosen v. Ciba-Geigy Corporation,[11] which Cheng does give, is illustrative of Cheng’s project. The expert witness, whose opinion was excluded, was on the faculty of the University of Chicago medical school; Richard Posner, the appellate judge who wrote the opinion that affirmed the expert witness’s exclusion, was on the faculty of that university’s law school. Without any discussion of the reports, depositions, hearings, or briefs, Cheng concludes that “the very idea that a law professor would tell medical school colleagues that their assessments were unreliable seems both breathtakingly arrogant and utterly ridiculous.”[12]

Except, of course, very well qualified scientists and physicians advance invalid and incorrect claims all the time. What strikes me as breathtakingly arrogant and utterly ridiculous is the judgment of a law professor, with little to no experience trying or defending Rule 702 and 703 issues, labeling the “very idea” as arrogant and ridiculous. Aside from its being a petitio principii, we could probably add that the reaction is emotive, uninformed, and uninformative, and that it fails to support the author’s suggestion that “Daubert has it all wrong,” and that “[w]e need a different approach.”

Judges and jurors obviously will never fully understand the scientific issues before them. If and when this lack of epistemic competence is problematic, we should honestly acknowledge how we are beyond the realm of the Constitution’s Seventh Amendment. Since Cheng is fantasizing about what the law should be, why not fantasize about not allowing lay people to decide complex scientific issues? Verdicts from jurors who do not have to give reasons for their decisions, and who are not in any sense peers of the scientists whose work they judge, are normatively problematic.

Professor Cheng likens his consensus rule to how the standard of care is decided in medical malpractice litigation. The analogy is interesting, but hardly compelling in that it ignores “two schools of thought” doctrine.[13] In litigation of claims of professional malpractice, the “two schools of thought doctrine” is a complete defense.  As explained by the Pennsylvania Supreme Court,[14] physicians may defend against claims that they deviated from the standard of care, or of professional malpractice, by adverting to support for their treatment by a minority of professionals in their field:

“Where competent medical authority is divided, a physician will not be held responsible if in the exercise of his judgment he followed a course of treatment advocated by a considerable number of recognized and respected professionals in his given area of expertise.”[15]

The analogy to medical malpractice litigation seems inapt.

Professor Cheng advertises that he will be giving full-length book treatment to his proposal, and so perhaps my critique is uncharitable in looking at a preliminary (antic?) law review article. Still, his proposal seems to ignore that “general acceptance” renders consensus, when it truly exists, relevant both to the court’s gatekeeping decisions and to the fact finders’ determination of the facts and issues in dispute. Indeed, I have never seen a Rule 702 hearing that did not involve, to some extent, the assertion of a consensus, or the lack thereof.

To the extent that we remain committed to trials of scientific claims, we can see that judges and jurors often can detect inconsistencies, cherry picking, unproven assumptions, and other aspects of the patho-epistemology of expert witness opinions. It takes a community of scientists and engineers to build a space rocket, but any Twitter moron can determine when a rocket blows up on launch. Judges in particular have (and certainly should have) the competence to determine deviations from the scientific and statistical standards of care that pertain to litigants’ claims.

Cheng’s proposal also ignores how difficult and contentious it is to ascertain the existence, scope, and actual content of scientific consensus. In some areas of science, such as occupational and environmental epidemiology and medicine, faux consensuses are set up by would-be expert witnesses for both claimants and defendants. A search of the word “consensus” in the PubMed database yields over a quarter of a million hits. The race to the bottom is on. Replacing epistemic validity with sociological and survey navel gazing seems like a fool’s errand.

Perhaps the most disturbing aspect of Cheng’s proposal is what happens in the absence of consensus.  Pretty much anything goes, a situation that Cheng finds “interesting,” and I find horrifying:

“if there is no consensus, the legal system’s options become a bit more interesting. If there is actual dissensus, meaning that the community is fractured in substantial numbers, then the non-expert can arguably choose from among the available theories. If the expert community cannot agree, then one cannot possibly expect non-experts to do any better.”[16]

Cheng reports that textbooks and other documents “may be both more accurate and more efficient” evidence of consensus.[17] Maybe; maybe not. Textbooks are typically dated by the time they arrive on the shelves, and contentious scientists are not beyond manufacturing certainty or doubt in the form of falsely claimed consensus.

Of course, often, if not most of the time, there will be no identifiable, legitimate consensus for a litigant’s claim at trial. What would Professor Cheng do in this default situation? Here Cheng, fully indulging the frolic, tells us that we

“should hypothetically ask what the expert community is likely to conclude, rather than try to reach conclusions on their own.”[18]

So the default situation transforms jurors into tea-leaf readers of what an expert community, unknown to them, will do if and when there is evidence of a quantum and quality to support a consensus, or when that community gets around to articulating what the consensus is. Why not just toss claims that lack consensus support?


[1] Debra Cassens Weiss, “Law Prof Responds After Chief Justice Roberts Disses Legal Scholarship,” Am. Bar Ass’n J. (July 7, 2011).

[2] Richard A. Posner, “Legal Scholarship Today,” 45 Stanford L. Rev. 1647, 1655 (1993), quoted in Walter Olson, “Abolish the Law Reviews!” The Atlantic (July 5, 2012); see also Richard A. Posner, “Against the Law Reviews: Welcome to a world where inexperienced editors make articles about the wrong topics worse,” Legal Affairs (Nov. 2004).

[3] Brent Newton, “Scholar’s highlight: Law review articles in the eyes of the Justices,” SCOTUS Blog (April 30, 2012); “Fixing Law Reviews,” Inside Higher Education (Nov. 19, 2012).

[4] “More Antic Proposals for Expert Witness Testimony – Including My Own Antic Proposals” (Dec. 30, 2014).

[5] Edward K. Cheng, “The Consensus Rule: A New Approach to Scientific Evidence,” 75 Vanderbilt L. Rev. 407 (2022).

[6] “Cheng’s Proposed Consensus Rule for Expert Witnesses” (Sept. 15, 2022); “Further Thoughts on Cheng’s Consensus Rule” (Oct. 3, 2022).

[7] Norman J. Shachoy Symposium, The Consensus Rule: A New Approach to the Admissibility of Scientific Evidence (2022), 67 Villanova L. Rev. (2022).

[8] David S. Caudill, “The ‘Crisis of Expertise’ Reaches the Courtroom: An Introduction to the Symposium on, and a Response to, Edward Cheng’s Consensus Rule,” 67 Villanova L. Rev. 837 (2022); Harry Collins, “The Owls: Some Difficulties in Judging Scientific Consensus,” 67 Villanova L. Rev. 877 (2022); Robert Evans, “The Consensus Rule: Judges, Jurors, and Admissibility Hearings,” 67 Villanova L. Rev. 883 (2022); Martin Weinel, “The Adversity of Adversarialism: How the Consensus Rule Reproduces the Expert Paradox,” 67 Villanova L. Rev. 893 (2022); Wendy Wagner, “The Consensus Rule: Lessons from the Regulatory World,” 67 Villanova L. Rev. 907 (2022); Edward K. Cheng, Elodie O. Currier & Payton B. Hampton, “Embracing Deference,” 67 Villanova L. Rev. 855 (2022).

[9] Embracing Deference at 876.

[10] Edward K. Cheng, Elodie O. Currier & Payton B. Hampton, “Embracing Deference,” 67 Villanova L. Rev. 855 (2022) [Embracing Deference].

[11] Rosen v. Ciba-Geigy Corp., 78 F.3d 316 (7th Cir. 1996).

[12] Embracing Deference at 859.

[13] “Two Schools of Thought” (May 25, 2013).

[14] Jones v. Chidester, 531 Pa. 31, 40, 610 A.2d 964 (1992).

[15] Id. at 40. See also Fallon v. Loree, 525 N.Y.S.2d 93, 93 (N.Y. App. Div. 1988) (“one of several acceptable techniques”); Dailey, “The Two Schools of Thought and Informed Consent Doctrine in Pennsylvania,” 98 Dickinson L. Rev. 713 (1994); Douglas Brown, “Panacea or Pandora’s Box: The Two Schools of Medical Thought Doctrine after Jones v. Chidester,” 44 J. Urban & Contemp. Law 223 (1993).

[16] Embracing Deference at 861.

[17] Embracing Deference at 866.

[18] Embracing Deference at 876.

Reference Manual – Desiderata for 4th Edition – Part VI – Rule 703

February 17th, 2023

One of the most remarkable, and objectionable, aspects of the third edition was its failure to engage with Federal Rule of Evidence 703, and the need for courts to assess the validity of individual studies relied upon. The statistics chapter has a brief, but important, discussion of Rule 703, as does the chapter on survey evidence. The epidemiology chapter mentions Rule 703 only in a footnote.[1]

Rule 703 appears to be the red-headed stepchild of the Federal Rules, and it is often ignored and omitted from so-called Daubert briefs.[2] Perhaps part of the problem is that Rule 703 (“Bases of an Expert”) is one of the most poorly drafted rules in the Federal Rules of Evidence:

“An expert may base an opinion on facts or data in the case that the expert has been made aware of or personally observed. If experts in the particular field would reasonably rely on those kinds of facts or data in forming an opinion on the subject, they need not be admissible for the opinion to be admitted. But if the facts or data would otherwise be inadmissible, the proponent of the opinion may disclose them to the jury only if their probative value in helping the jury evaluate the opinion substantially outweighs their prejudicial effect.”

Despite its tortuous wording, the rule is clear enough in authorizing expert witnesses to rely upon studies that are themselves inadmissible, and allowing such witnesses to disclose the studies that they have relied upon, when there has been the requisite showing of probative value that outweighs any prejudice.

The statistics chapter in the third edition, nonetheless, confusingly suggested that

“a particular study may use a method that is entirely appropriate but that is so poorly executed that it should be inadmissible under Federal Rules of Evidence 403 and 702. Or, the method may be inappropriate for the problem at hand and thus lack the ‘fit’ spoken of in Daubert. Or the study might rest on data of the type not reasonably relied on by statisticians or substantive experts and hence run afoul of Federal Rule of Evidence 703.”[3]

Particular studies, even when beautifully executed, are not themselves admissible. And particular studies are not subject to evaluation under Rule 702, apart from the gatekeeping of expert witness opinion testimony that is based upon those studies. To be sure, the reference to Rule 703 is important, and a welcome counter to the suggestion, elsewhere in the third edition, that courts should not look at individual studies. The independent review of individual studies is occasionally lost in the shuffle of litigation, and the statistics chapter is correct to note an evidentiary concern whether each individual study may or may not be reasonably relied upon by an expert witness. In any event, reasonably relied upon studies do not ipso facto become admissible.

The third edition’s chapter on Survey Research contains the most explicit direction on Rule 703, in terms of courts’ responsibilities.  In that chapter, the authors instruct that Rule 703:

“redirect[ed] attention to the ‘validity of the techniques employed’. The inquiry under Rule 703 focuses on whether facts or data are ‘of a type reasonably relied upon by experts in the particular field in forming opinions or inferences upon the subject’.”[4]

Although Rule 703 is clear enough on admissibility, the epidemiology chapter described epidemiologic studies broadly as admissible if sufficiently rigorous:

“An epidemiologic study that is sufficiently rigorous to justify a conclusion that it is scientifically valid should be admissible, as it tends to make an issue in dispute more or less likely.”[5]

The authors of the epidemiology chapter acknowledge, in a footnote, that “[h]earsay concerns may limit the independent admissibility of the study, but the study could be relied on by an expert in forming an opinion and may be admissible pursuant to Fed. R. Evid. 703 as part of the underlying facts or data relied on by the expert.”[6]

This footnote is curious, and incorrect. There is no question that hearsay “concerns” “may limit” admissibility of a study; hearsay is inadmissible unless a statutory exception applies.[7] Rule 703 is not one of the exceptions to the rule against hearsay in Article VIII of the Federal Rules of Evidence. An expert witness’s reliance upon a study does not make the study admissible. The authors cite two cases,[8] but neither case held that reasonable reliance by expert witnesses transmuted epidemiologic studies into admissible evidence. The text of Rule 703 itself, and the overwhelming weight of case law interpreting and applying the rule,[9] make clear that the rule does not render scientific studies admissible. The two cases cited by the epidemiology chapter, Kehm and Ellis, both involved “factual findings” in public investigative or evaluative reports, which were independently admissible under Federal Rule of Evidence 803(8)(C).[10] As such, the cases failed to support the chapter’s suggestion that Rule 703 is a rule of admissibility for epidemiologic studies. The third edition thus, in one sentence, confused Rule 703 with an exception to the rule against hearsay, the rule that would otherwise prevent statistically based epidemiologic studies from being received in evidence. The point was reasonably clear, however, that studies “may be offered” to explain an expert witness’s opinion. Under Rule 705, that offer may also be refused.

The Reference Manual was certainly not alone in advancing the notion that studies are themselves admissible. Other well-respected evidence scholars have misstated the law on this issue.[11] The fourth edition would do well to note that scientific studies, and especially epidemiologic studies, involve multiple levels of hearsay. A typical epidemiologic study may contain hearsay leaps from patient to clinician, to laboratory technicians, to specialists interpreting test results, back to the clinician for a diagnosis, to a nosologist for disease coding, to a national or hospital database, to a researcher querying the database, to a statistician analyzing the data, to a manuscript that details data, analyses, and results, to editors and peer reviewers, back to study authors, and on to publication. Those leaps do not mean that the final results are thus untrustworthy or not reasonably relied upon, but they do raise well-nigh insuperable barriers to admissibility. The inadmissibility of scientific studies is generally not problematic because Rule 703 permits testifying expert witnesses to formulate opinions based upon facts and data, which are not themselves admissible in evidence. The distinction between relied upon, and admissible, studies is codified in the Federal Rules of Evidence, and in virtually every state’s evidence law.

The fourth edition might well also note that under Rule 104(a), the Rules of Evidence themselves do not govern a trial court’s preliminary determination, under Rules 702 or 703, of the admissibility of an expert witness’s opinion, or the appropriateness of reliance upon a particular study. Although Rule 705 may allow disclosure of facts and data described in studies, it is not an invitation to permit testifying expert witnesses to become a conduit for off-hand comments and opinions in the introduction or discussion sections of relied upon articles.[12] The wholesale admission of such hearsay opinions undermines the court’s control over opinion evidence. Rule 703 authorizes reasonable reliance upon “facts and data,” not every opinion that creeps into the published literature.

Reference Manual’s Disregard of Study Validity in Favor of the “Whole Tsumish”

The third edition evidenced considerable ambivalence about whether trial judges should engage in resolving disputes over the validity of individual studies relied upon by expert witnesses. Since 2000, Rule 702 has clearly required such engagement, which made the Manual’s hesitancy, on the whole, unjustifiable. The ambivalence with respect to study validity, however, was on full display in the late Professor Margaret Berger’s chapter, “The Admissibility of Expert Testimony.”[13] Berger’s chapter criticized “atomization,” or looking at individual studies in isolation, a process she described pejoratively as “slicing-and-dicing.”[14]

Drawing on the publications of Daubert-critic Susan Haack, Berger appeared to reject the notion that courts should examine the reliability of each study independently.[15] Berger described the “proper” scientific method, as evidenced by works of the International Agency for Research on Cancer (IARC), the Institute of Medicine, the National Institutes of Health, the National Research Council, and the National Institute for Environmental Health Sciences, as being “to consider all the relevant available scientific evidence, taken as a whole, to determine which conclusion or hypothesis regarding a causal claim is best supported by the body of evidence.”[16]

Berger’s description of the review process, however, was profoundly misleading in its incompleteness. Of course, scientists undertaking a systematic review identify all the relevant studies, but some of the “relevant” studies may well be insufficiently reliable (because of internal or external validity issues) to answer the research question at hand. All the cited agencies, and other research organizations and researchers, exclude studies that are fundamentally flawed, whether as a result of bias, confounding, erroneous data analyses, or related problems. Berger cited no support for her remarkable suggestion that scientists do not make “reliability” judgments about available studies when assessing the “totality of the evidence.”[17]

Professor Berger, who had a distinguished career as a law professor and evidence scholar, died in November 2010, before the third edition was published. She was no friend of Daubert,[18] but her antipathy remarkably outlived her. Berger’s critical discussion of “atomization” cited the notorious decision in Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11, 26 (1st Cir. 2011), which was decided four months after her passing.[19]

Professor Berger’s contention about the need to avoid assessments of individual studies in favor of the whole “tsumish” must also be rejected because Federal Rule of Evidence 703 requires that each study considered by an expert witness “qualify” for reasonable reliance by virtue of the study’s containing facts or data that are “of a type reasonably relied upon by experts in the particular field in forming opinions or inferences upon the subject.” One of the deeply troubling aspects of the Milward decision is that it reversed the trial court’s sensible decision to exclude a toxicologist, Dr. Martyn Smith, who outran his headlights on issues having to do with a field in which he was clearly inexperienced – epidemiology.

Another curious omission in the third edition’s discussions of Milward is the dark ethical cloud of misconduct that hovers over the First Circuit’s reversal of the trial court’s exclusions of Martyn Smith and Carl Cranor. On appeal, the Council for Education and Research on Toxics (CERT) filed an amicus brief in support of reversing the exclusion of Smith and Cranor. The CERT amicus brief, however, never disclosed that CERT was founded by Smith and Cranor, and that CERT funded Smith’s research.[20]

Rule 702 requires courts to pay attention to, among other things, the sufficiency of the facts and data relied upon by expert witnesses. Rule 703’s requirement that individual studies must be reasonably relied upon is an important additional protreptic against the advice given by Professor Berger, in the third edition.


[1] The index notes the following page references for Rule 703: 214, 361, 363-364, and 610 n.184.

[2] See David E. Bernstein & Eric G. Lasker, “Defending Daubert: It’s Time to Amend Federal Rule of Evidence 702,” 57 William & Mary L. Rev. 1, 32 (2015) (“Rule 703 is frequently ignored in Daubert analyses”); Schachtman, “Rule 703 – The Problem Child of Article VII,” 17 Proof 3 (Spring 2009); Schachtman, “The Effective Presentation of Defense Expert Witnesses and Cross-examination of Plaintiffs’ Expert Witnesses,” at the ALI-ABA Course on Opinion and Expert Witness Testimony in State and Federal Courts (February 14-15, 2008). See also Julie E. Seaman, “Triangulating Testimonial Hearsay: The Constitutional Boundaries of Expert Opinion Testimony,” 96 Georgetown L.J. 827 (2008); “RULE OF EVIDENCE 703 — Problem Child of Article VII” (Sept. 19, 2011); “Giving Rule 703 the Cold Shoulder” (May 12, 2012); “New Reference Manual on Scientific Evidence Short Shrifts Rule 703” (Oct. 16, 2011).

[3] RMSE 3d at 214.

[4] RMSE 3d at 364 (internal citations omitted).

[5] RMSE 3d at 610 (internal citations omitted).

[6] RMSE 3d at 601 n.184.

[7] Rule 802 (“Hearsay Rule”): “Hearsay is not admissible except as provided by these rules or by other rules prescribed by the Supreme Court pursuant to statutory authority or by Act of Congress.”

[8] Kehm v. Procter & Gamble Co., 580 F. Supp. 890, 902 (N.D. Iowa 1982) (“These [epidemiologic] studies were highly probative on the issue of causation—they all concluded that an association between tampon use and menstrually related TSS [toxic shock syndrome] cases exists.”), aff’d, 724 F.2d 613 (8th Cir. 1984); Ellis v. International Playtex, Inc., 745 F.2d 292, 303 (4th Cir. 1984). The chapter also cited the en banc decision in Christophersen for the proposition that “[a]s a general rule, questions relating to the bases and sources of an expert’s opinion affect the weight to be assigned that opinion rather than its admissibility. . . . ” In the Christophersen case, the Fifth Circuit was clearly addressing the admissibility of the challenged expert witness’s opinions, not the admissibility of relied-upon studies. Christophersen v. Allied-Signal Corp., 939 F.2d 1106, 1111, 1113-14 (5th Cir. 1991) (en banc) (per curiam) (trial court may exclude opinion of expert witness whose opinion is based upon incomplete or inaccurate exposure data), cert. denied, 112 S. Ct. 1280 (1992).

[9] Interestingly, the authors of this chapter abandoned their suggestion, advanced in the second edition, that studies relied upon “might qualify for the learned treatise exception to the hearsay rule, Fed. R. Evid. 803(18), or possibly the catchall exceptions, Fed. R. Evid. 803(24) & 804(5).” RMSE 2d at 335 (2000). See also RMSE 3d at 214 (discussing statistical studies as generally “admissible,” but acknowledging that admissibility may be no more than permission to explain the basis for an expert’s opinion, which is hardly admissibility at all).

[10] See Ellis, 745 F.2d at 299-303; Kehm, 724 F.2d at 617-18. These holdings predated the Supreme Court’s 1993 decision in Daubert, and the issue whether they are subject to Rule 702 has not been addressed.  Federal agency factual findings have been known to be invalid, on occasion.

[11] David L. Faigman, et al., Modern Scientific Evidence: The Law and Science of Expert Testimony v. 1, § 23:1, at 206 (2009) (“Well conducted studies are uniformly admitted.”).

[12] Montori, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Br. Med. J. 1093, 1093 (2004) (advising readers on how to avoid being misled by published literature, and counseling readers to “Read only the Methods and Results sections; bypass the Discussion section.”) (emphasis added).

[13] RMSE 3d 11 (2011).

[14] Id. at 19.

[15] Id. at 20 & n. 51 (citing Susan Haack, “An Epistemologist in the Bramble-Bush: At the Supreme Court with Mr. Joiner,” 26 J. Health Pol. Pol’y & L. 217–37 (1999)).

[16] Id. at 19-20 & n.52.

[17] See Berger, “The Admissibility of Expert Testimony,” RMSE 3d 11 (2011). Professor Berger never mentions Rule 703 at all! Gone and forgotten.

[18] Professor Berger filed an amicus brief on behalf of plaintiffs, in Rider v. Sandoz Pharms. Corp., 295 F.3d 1194 (11th Cir. 2002).

[19] Id. at 20 n.51. (The editors note that the published chapter was Berger’s last revision, with “a few edits to respond to suggestions by reviewers.”) The addition of the controversial Milward decision cannot seriously be considered an “edit.”

[20] “From Here to CERT-ainty” (June 28, 2018); “The Council for Education and Research on Toxics” (July 9, 2013).

Reference Manual – Desiderata for 4th Edition – Part IV – Confidence Intervals

February 10th, 2023

Putting aside the idiosyncratic chapter by the late Professor Berger, most of the third edition of the Reference Manual presented guidance on many important issues. To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges. On several issues, especially those statistical in nature, the fourth edition could benefit from an editor to ensure that the individual chapters, written by different authors, actually agree on key concepts. One such example is the third edition’s treatment of confidence intervals.[1]

The “DNA Identification” chapter noted that the meaning of a confidence interval is subtle,[2] but I doubt that the authors, David Kaye and George Sensabaugh, actually found it subtle or difficult. In the third edition’s chapter on statistics, David Kaye and co-author, the late David A. Freedman, gave a reasonable definition of confidence intervals in their glossary:

confidence interval. An estimate, expressed as a range, for a parameter. For estimates such as averages or rates computed from large samples, a 95% confidence interval is the range from about two standard errors below to two standard errors above the estimate. Intervals obtained this way cover the true value about 95% of the time, and 95% is the confidence level or the confidence coefficient.”[3]

Intervals, not the interval, which is correct. This chapter made clear that it was the procedure of obtaining multiple samples with intervals that yielded the 95% coverage. In the substance of their chapter, Kaye and Freedman are explicit about how intervals are constructed, and that:

“the confidence level does not give the probability that the unknown parameter lies within the confidence interval.”[4]

Importantly, the authors of the statistics chapter named names; that is, they cited some cases that butchered the concept of the confidence interval.[5] The fourth edition will have a more difficult job because, despite the care taken in the statistics chapter, many more decisions have misstated or misrepresented the meaning of a confidence interval.[6] Citing more cases perhaps will disabuse federal judges of their reliance upon case law for the meaning of statistical concepts.
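The operational meaning that Kaye and Freedman give can be demonstrated with a short simulation (a sketch with invented numbers; the population mean, standard deviation, sample size, and trial count are all my assumptions, chosen only for illustration):

```python
import random
import statistics

# Illustrative simulation (toy numbers, not the Manual's): build a 95%
# confidence interval from each of many samples and count how often the
# intervals cover the true population mean. The 95% describes the
# procedure's long-run success rate, not any single realized interval.
random.seed(1)
TRUE_MEAN, SD, N, TRIALS = 10.0, 2.0, 50, 10_000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    covered += m - 1.96 * se <= TRUE_MEAN <= m + 1.96 * se

print(round(covered / TRIALS, 3))  # close to 0.95
```

The simulation makes Kaye and Freedman’s point concrete: it is the collection of intervals, generated by repeated sampling, that covers the true value about 95% of the time.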

The third edition’s chapter on multiple regression defined confidence interval in its glossary:

confidence interval. An interval that contains a true regression parameter with a given degree of confidence.”[7]

The chapter avoided saying anything obviously wrong only by giving a very circular definition. When the chapter substantively described a confidence interval, it ended up giving an erroneous one:

“In general, for any parameter estimate b, the expert can construct an interval around b such that there is a 95% probability that the interval covers the true parameter. This 95% confidence interval is given by: b ± 1.96 (SE of b).”[8]

The formula provided is correct, but the interpretation of a 95% probability that the interval covers the true parameter is unequivocally wrong.[9]
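The mechanics of the chapter’s formula can be sketched in a few lines (the data here are hypothetical, generated with a known slope; only the arithmetic of b ± 1.96 (SE of b) for a simple least-squares slope is illustrated, not the regression chapter’s own example):

```python
import random

# Sketch with invented data: compute a least-squares slope b and the
# interval b ± 1.96 * SE(b). The formula is standard; the error the text
# identifies lies only in reading the one resulting interval as having a
# 95% probability of covering the true slope.
random.seed(3)
x = [float(i) for i in range(30)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]  # true slope 0.5

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = (sum(e * e for e in resid) / (n - 2)) ** 0.5   # residual std. error
se_b = s / sxx ** 0.5                              # standard error of b

print(f"b = {b:.3f}, 95% CI = ({b - 1.96 * se_b:.3f}, {b + 1.96 * se_b:.3f})")
```

The printed interval either contains the true slope or it does not; the 95% attaches to the interval-generating procedure over repeated samples.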

The third edition’s chapter by Shari Seidman Diamond on survey research, on the other hand, gave an anodyne example and a definition:

“A survey expert could properly compute a confidence interval around the 20% estimate obtained from this sample. If the survey were repeated a large number of times, and a 95% confidence interval was computed each time, 95% of the confidence intervals would include the actual percentage of dentists in the entire population who would believe that Goldgate was manufactured by the makers of Colgate.

                 *  *  *  *

Traditionally, scientists adopt the 95% level of confidence, which means that if 100 samples of the same size were drawn, the confidence interval expected for at least 95 of the samples would be expected to include the true population value.”[10]

Similarly, the third edition’s chapter on epidemiology correctly defined the confidence interval operationally, as a process of generating iterated intervals, which collectively cover the true value 95% of the time:

“A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”[11]

Not content to leave it well said, the chapter’s authors returned to the confidence interval and provided another, more problematic definition, a couple of pages later in the text:

“A confidence interval is a range of possible values calculated from the results of a study. If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[12]

The first sentence refers to “a study”; that is, one study, one range of values. The second sentence then tells us that “the range” (singular, presumably referring back to the single “a study”) will capture 95% of the results from many resamplings from the same population. Now the definition is framed not with respect to the true population parameter, but with respect to the results of many other samples. The authors seem to have given the first sample’s confidence interval the property of including 95% of all future studies, and that is incorrect. A review of the case law shows that courts, remarkably, have gravitated to this second, incorrect definition.
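A quick simulation (again with invented numbers of my own choosing) shows why the second reading cannot be right: on average, a single study’s 95% confidence interval captures the point estimates of replication studies far less than 95% of the time:

```python
import random
import statistics

# Toy simulation: does one study's 95% CI capture the results of future
# studies drawn from the same population 95% of the time? No.
random.seed(2)
TRUE_MEAN, SD, N, TRIALS = 10.0, 2.0, 100, 20_000

def study():
    """Return the point estimate and standard error from one sample."""
    s = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    return statistics.fmean(s), statistics.stdev(s) / N ** 0.5

hits = 0
for _ in range(TRIALS):
    m, se = study()                      # a "first study" and its 95% CI
    lo, hi = m - 1.96 * se, m + 1.96 * se
    m_next, _ = study()                  # one replication study
    hits += lo <= m_next <= hi

print(round(hits / TRIALS, 3))  # well below 0.95 (about 0.83 on average)
```

The shortfall arises because both the first interval and the replication estimate are random; the 95% figure describes coverage of the fixed true parameter, not of future sample results.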

The glossary to the third edition’s epidemiology chapter clearly, however, runs into the ditch:

“confidence interval. A range of values calculated from the results of a study within which the true value is likely to fall; the width of the interval reflects random error. Thus, if a confidence level of .95 is selected for a study, 95% of similar studies would result in the true relative risk falling within the confidence interval.”[13]

Note that the sentence before the semicolon talked of “a study” with “a range of values,” and that there is a likelihood of that range including the “true value.” This definition thus used the singular to describe the study and to describe the range of values.  The definition seemed to be saying, clearly but wrongly, that a single interval from a single study has a likelihood of containing the true value. The second full sentence ascribed a probability, 95%, to the true relative risk’s falling within “the interval.” To point out the obvious, “the interval,” is singular, and refers back to “a study,” also singular. At best, this definition was confusing; at worst, it was wrong.

The Reference Manual has a problem beyond its own inconsistencies and the refractory resistance of the judiciary to statistical literacy. Any number of law professors, and even scientists, have held out incorrect definitions and interpretations of confidence intervals. It would be helpful for the fourth edition to caution its readers, both bench and bar, about these prevalent misunderstandings.

Here, for instance, is a murky definition given by a well-credentialed statistician, in a declaration filed in federal court:

“If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.”[14]

The expert witness correctly identified the repeated sampling, but ascribed a 95% probability to “the range,” leaving it unclear whether “the range” refers to the set of all the intervals or to the single “95% confidence interval” in the antecedent of the statement.

Much worse was a definition proffered in a recent law review article by well-known, respected authors:

“A 95% confidence interval, in contrast, is a one-sided or two-sided interval from a data sample with 95% probability of bounding a fixed, unknown parameter, for which no nondegenerate probability distribution is conceived, under specified assumptions about the data distribution.”[15]

The phrase “for which no nondegenerate probability distribution is conceived” is unclear about whether it refers to the confidence interval or to the unknown parameter. Presumably the phrase modifies the noun closest to it in the sentence, the “fixed, unknown parameter,” which suggests that these authors were simply trying to emphasize that they were giving a frequentist interpretation, and were not conceiving of the parameter as a random variable, as Bayesians would. The phrase “no nondegenerate” reads as a triple negative, since a degenerate distribution is one that has no variation. The phrase makes the definition obscure, and raises questions about what the phrase is meant to exclude.

The more concerning aspect of the quoted footnote is its obfuscation of the important distinction between the procedure of repeatedly calculating confidence intervals (which procedure has a 95% success rate in the long run) and the probability that any given instance of the procedure, in a single confidence interval, contains the parameter. The latter probability is either zero or one.
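The zero-or-one point can be made concrete in a couple of lines (the relative risk and interval below are invented for illustration):

```python
# Toy illustration with invented numbers: once an interval is realized,
# it either contains the parameter or it does not. The realized
# "probability" is 0 or 1, never 95%.
true_rr = 1.0                      # suppose the true relative risk is 1.0
ci = (1.1, 1.3)                    # a single reported 95% interval
contains = ci[0] <= true_rr <= ci[1]
print(contains)  # False: this particular interval simply missed
```

No amount of repetition changes the status of this one interval; the 95% belongs to the procedure that generated it.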

The definition’s reference to “a” confidence interval, based upon “a” data sample, leaves the reader with no way to understand the definition as referring to the repeated process of sampling and the set of resulting intervals. The upper and lower interval bounds are themselves random variables that need to be taken into account, but by referencing a single interval from a single data sample, the authors misrepresent the confidence interval and invite a Bayesian interpretation.[16]

Sadly, there is a long tradition of scientists and academics giving errant definitions and interpretations of the confidence interval.[17] The error is not harmless, because it invites the attribution of a high level of probability to the claim that the “true” population measure lies within the reported confidence interval. The error encourages readers to believe that the confidence interval is not conditioned upon the single sample result, and it misleads readers into believing that not only random error, but also systematic and data errors, are accounted for in the posterior probability.[18]


[1]Confidence in Intervals and Diffidence in the Courts” (Mar. 4, 2012).

[2] David H. Kaye & George Sensabaugh, “Reference Guide on DNA Identification Evidence” 129, 165 n.76.

[3] David H. Kaye & David A. Freedman, “Reference Guide on Statistics” 211, 284-5 (Glossary).

[4] Id. at 247.

[5] Id. at 247 nn. 91 & 92 (citing DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993); SmithKline Beecham Corp. v. Apotex Corp., 247 F. Supp. 2d 1011, 1037 (N.D. Ill. 2003), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005); In re Silicone Gel Breast Implants Prods. Liab. Litig., 318 F. Supp. 2d 879, 897 (C.D. Cal. 2004) (“a margin of error between 0.5 and 8.0 at the 95% confidence level . . . means that 95 times out of 100 a study of that type would yield a relative risk value somewhere between 0.5 and 8.0.”)).

[6] See, e.g., Turpin v. Merrell Dow Pharm., Inc., 959 F.2d 1349, 1353–54 & n.1 (6th Cir. 1992) (erroneously describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”); American Library Ass’n v. United States, 201 F.Supp. 2d 401, 439 & n.11 (E.D.Pa. 2002), rev’d on other grounds, 539 U.S. 194 (2003); Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc., 482 F.Supp. 2d 478, 495 (D.N.J. 2007) (“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”) (apparently relying upon a party’s expert witness’s report), aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd., 344 Fed.Appx. 595 (Fed. Cir. 2009); Eli Lilly & Co. v. Teva Pharms, USA, 2008 WL 2410420, *24 (S.D. Ind. 2008) (stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”); Benavidez v. City of Irving, 638 F.Supp. 2d 709, 720 (N.D. Tex. 2009) (interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”); Pritchard v. Dow Agro Sci., 705 F. Supp. 2d 471, 481, 488 (W.D. Pa. 2010) (excluding Dr. Bennet Omalu, who assigned a 90% probability that an 80% confidence interval excluded a relative risk of 1.0), aff’d, 430 F. App’x 102 (3d Cir.), cert. denied, 132 S. Ct. 508 (2011); Estate of George v. Vermont League of Cities and Towns, 993 A.2d 367, 378 n.12 (Vt. 2010) (erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”); Garcia v. Tyson Foods, 890 F. Supp. 2d 1273, 1285 (D. Kan. 2012) (quoting expert witness Robert G. Radwin, who testified that a 95% confidence interval in a study means “if I did this study over and over again, 95 out of a hundred times I would expect to get an average between that interval.”); In re Chantix (Varenicline) Prods. Liab. Litig., 889 F. Supp. 2d 1272, 1290 n.17 (N.D. Ala. 2012); In re Zoloft Products, 26 F. Supp. 3d 449, 454 (E.D. Pa. 2014) (“A 95% confidence interval means that there is a 95% chance that the ‘true’ ratio value falls within the confidence interval range.”), aff’d, 858 F.3d 787 (3d Cir. 2017); Duran v. U.S. Bank Nat’l Ass’n, 59 Cal. 4th 1, 36, 172 Cal. Rptr. 3d 371, 325 P.3d 916 (2014) (“Statisticians typically calculate margin of error using a 95 percent confidence interval, which is the interval of values above and below the estimate within which one can be 95 percent certain of capturing the ‘true’ result.”); In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832, 842 (2017) (correctly quoting an incorrect definition from the third edition at p. 580), rev’d on other grounds, 235 N.J. 229, 194 A.3d 503 (2018); In re Testosterone Replacement Therapy Prods. Liab., No. 14 C 1748, MDL No. 2545, 2017 WL 1833173, *4 (N.D. Ill. May 8, 2017) (“A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time.”); Maldonado v. Epsilon Plastics, Inc., 22 Cal. App. 5th 1308, 1330, 232 Cal. Rptr. 3d 461 (2018) (“The 95 percent ‘confidence interval’, as used by statisticians, is the ‘interval of values above and below the estimate within which one can be 95 percent certain of capturing the “true” result’.”); Echeverria v. Johnson & Johnson, 37 Cal. App. 5th 292, 304, 249 Cal. Rptr. 3d 642 (2019) (quoting uncritically and with approval one of plaintiff’s expert witnesses, Jack Siemiatycki, who gave the jury an example of a study with a relative risk of 1.2, with a “95 percent probability that the true estimate is between 1.1 and 1.3.” According to the court, Siemiatycki went on to explain that this was “a pretty tight interval, and we call that a confidence interval. We call it a 95 percent confidence interval when we calculate it in such a way that it covers 95 percent of the underlying relative risks that are compatible with this estimate from this study.”); In re Viagra (Sildenafil Citrate) & Cialis (Tadalafil) Prods. Liab. Litig., 424 F.Supp.3d 781, 787 (N.D. Cal. 2020) (“For example, a given study could calculate a relative risk of 1.4 (a 40 percent increased risk of adverse events), but show a 95 percent “confidence interval” of .8 to 1.9. That confidence interval means there is 95 percent chance that the true value—the actual relative risk—is between .8 and 1.9.”); Rhyne v. United States Steel Corp., 74 F. Supp. 3d 733, 744 (W.D.N.C. 2020) (relying upon, and quoting, one of the more problematic definitions given in the third edition at p. 580: “If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the population.”); Wilant v. BNSF Ry., C.A. No. N17C-10-365 CEB (Del. Super. Ct. May 13, 2020) (citing the third edition at p. 573: “a confidence interval provides ‘a range (interval) within which the risk likely would fall if the study were repeated numerous times’”; “[s]o a 95% confidence interval indicates that the range of results achieved in the study would be achieved 95% of the time when the study is replicated from the same population.”); Germaine v. Sec’y Health & Human Servs., No. 18-800V (U.S. Fed. Ct. Claims July 29, 2021) (giving an incorrect definition directly from the third edition, at p. 621: “[a] ‘confidence interval’ is ‘[a] range of values … within which the true value is likely to fall[.]’”).

[7] Daniel Rubinfeld, “Reference Guide on Multiple Regression” 303, 352.

[8] Id. at 342.

[9] See Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman, “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” 31 Eur. J. Epidemiol. 337, 343 (2016).

[10] Shari Seidman Diamond, “Reference Guide on Survey Research” 359, 381.

[11] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” 549, 573.

[12] Id. at 580.

[13] Id. at 621.

[14] In re Testosterone Replacement Therapy Prods. Liab. Litig., Declaration of Martin T. Wells, Ph.D., at 2-3 (N.D. Ill., Oct. 30, 2016). 

[15] Joseph Sanders, David Faigman, Peter Imrey, and A. Philip Dawid, “Differential Etiology: Inferring Specific Causation in the Law from Group Data in Science,” 63 Arizona L. Rev. 851, 898 n.173 (2021).

[16] The authors are well-credentialed lawyers and scientists. Peter Imrey was trained in, and has taught, mathematical statistics, biostatistics, and epidemiology. He is a professor of medicine in the Cleveland Clinic Lerner College of Medicine. A. Philip Dawid is a distinguished statistician, an Emeritus Professor of Statistics, Cambridge University, Darwin College, and a Fellow of the Royal Society. David Faigman is the Chancellor & Dean, and the John F. Digardi Distinguished Professor of Law, at the University of California Hastings College of the Law. Joseph Sanders is the A.A. White Professor at the University of Houston Law Center. I have previously pointed out this problem in these authors’ article. “Differential Etiologies – Part One – Ruling In” (June 19, 2022).

[17] See, e.g., Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004) (“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”); Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 60-61 n.17 (2007) (quoting Clapp and Ozonoff with obvious approval); Déirdre Dwyer, The Judicial Assessment of Expert Evidence 154-55 (Cambridge Univ. Press 2008) (“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”); Frank C. Woodside, III & Allison G. Davis, “The Bradford Hill Criteria: The Forgotten Predicate,” 35 Thomas Jefferson L. Rev. 103, 110 (2013) (“A confidence interval provides both the relative risk found in the study and a range (interval) within which the risk would likely fall if the study were repeated numerous times.”); Christopher B. Mueller, “Daubert Asks the Right Questions: Now Appellate Courts Should Help Find the Right Answers,” 33 Seton Hall L. Rev. 987, 997 (2003) (describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 Judicature 12, 19 (2003) (“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”) (incorrectly citing the first edition of Rothman & Greenland, Modern Epidemiology 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New Reference Manual on Scientific Evidence,” 74 N.C. L. Rev. 1183, 1212 n.172 (1996) (“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

[18] See Brock v. Merrell Dow Pharm., Inc., 874 F.2d 307, 311–12 (5th Cir. 1989) (incorrectly stating that the court need not resolve questions of bias and confounding because “the studies presented to us incorporate the possibility of these factors by the use of a confidence interval”). Bayesian credible intervals can be similarly misleading when the interval reflects only the sample results and sampling variance, and not the myriad other ways the estimate may be wrong.
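The misstatements collected above all treat the “95%” as a probability that the true value lies inside the one interval a study reports. The frequentist definition is about the procedure over repeated sampling, a distinction a short simulation makes concrete. The sketch below is illustrative only (the function name and parameters are mine, not drawn from any cited source): it draws many samples, computes a 95% interval from each, and counts how often the intervals capture the known true mean.

```python
import math
import random

def ci_coverage(trials=10_000, n=50, mu=0.0, sigma=1.0, seed=1):
    """Fraction of repeated samples whose 95% interval contains the true mean.

    Illustrates the frequentist meaning of a confidence interval: the 95%
    describes the long-run behavior of the interval-construction procedure
    across repeated studies, not a 95% probability that any one computed
    interval contains the true value.
    """
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # known-sigma z-interval
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(ci_coverage())  # close to 0.95 over many repeated samples
```

Each individual interval either does or does not contain the true mean; only the long-run capture rate is 95%, which is exactly the point the quoted definitions miss when they assign the 95% to a single study’s interval.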