TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Amicus Curious – Gelbach’s Foray into Lipitor Litigation

August 25th, 2022

Professor Schauer’s discussion of statistical significance, covered in my last post,[1] is curious for its disclaimer that “there is no claim here that measures of statistical significance map easily onto measures of the burden of proof.” Having made the disclaimer, Schauer proceeds to fall into the transposition fallacy, which contradicts his disclaimer and, generally speaking, is not a good thing for a law professor eager to advance the understanding of “The Proof” to do.

Perhaps more curious than Schauer’s error is his citation support for his disclaimer.[2] The cited paper by Jonah B. Gelbach is one of several of Gelbach’s papers that advance the claim that the p-value does indeed map onto posterior probability and the burden of proof. Gelbach’s claim has also been the centerpiece in his role as an advocate in support of plaintiffs in the Lipitor (atorvastatin) multi-district litigation (MDL) over claims that ingestion of atorvastatin causes diabetes mellitus.

Gelbach’s intervention as plaintiffs’ amicus is peculiar on many fronts. At the time of the Lipitor litigation, Sonal Singh was an epidemiologist and Assistant Professor of Medicine, at the Johns Hopkins University. The MDL trial court initially held that Singh’s proffered testimony was inadmissible because of his failure to consider daily dose.[3] In a second attempt, Singh offered an opinion for 10 mg daily dose of atorvastatin, based largely upon the results of a clinical trial known as ASCOT-LLA.[4]

The ASCOT-LLA trial randomized 19,342 participants with hypertension and at least three other cardiovascular risk factors to two different anti-hypertensive medications. A subgroup with total cholesterol levels less than or equal to 6.5 mmol./l. were randomized to either daily 10 mg. atorvastatin or placebo.  The investigators planned to follow up for five years, but they stopped after 3.3 years because of clear benefit on the primary composite end point of non-fatal myocardial infarction and fatal coronary heart disease. At the time of stopping, there were 100 events of the primary pre-specified outcome in the atorvastatin group, compared with 154 events in the placebo group (hazard ratio 0.64 [95% CI 0.50 – 0.83], p = 0.0005).

The atorvastatin component of ASCOT-LLA had, in addition to its primary pre-specified outcome, seven secondary end points, and seven tertiary end points.  The emergence of diabetes mellitus in this trial population, which clearly was at high risk of developing diabetes, was one of the tertiary end points. Primary, secondary, and tertiary end points were reported in ASCOT-LLA without adjustment for the obvious multiple comparisons. In the treatment group, 3.0% developed diabetes over the course of the trial, whereas 2.6% developed diabetes in the placebo group. The unadjusted hazard ratio was 1.15 (0.91 – 1.44), p = 0.2493.[5] Given the 15 trial end points, an adjusted p-value for this particular hazard ratio, for diabetes, might well exceed 0.5, and even approach 1.0.
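To make the multiplicity point concrete, here is a minimal sketch of two textbook adjustments applied to the reported p-value. The choice of the Bonferroni and Šidák corrections, and the function names, are assumptions for illustration; the trialists reported no adjustment, and nothing here says which method would be most appropriate. Šidák assumes independent tests, which the correlated ASCOT-LLA end points would not strictly satisfy.

```python
def bonferroni(p: float, m: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p * m)


def sidak(p: float, m: int) -> float:
    """Sidak-adjusted p-value, assuming m independent comparisons."""
    return 1.0 - (1.0 - p) ** m


P_DIABETES = 0.2493  # unadjusted p-value reported for the diabetes end point
N_ENDPOINTS = 15     # 1 primary + 7 secondary + 7 tertiary end points

print(f"Bonferroni: {bonferroni(P_DIABETES, N_ENDPOINTS):.3f}")
print(f"Sidak:      {sidak(P_DIABETES, N_ENDPOINTS):.3f}")
```

Under either assumption, the adjusted value lands well above 0.5.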

On this record, Dr. Singh honestly acknowledged that statistical significance was important, and that the diabetes finding in ASCOT-LLA might have been the result of low statistical power or of no association at all. Based upon the trial data alone, he testified that “one can neither confirm nor deny that atorvastatin 10 mg is associated with significantly increased risk of type 2 diabetes.”[6] The trial court excluded Dr. Singh’s 10mg/day causal opinion, but admitted his 80mg/day opinion. On appeal, the Fourth Circuit affirmed the MDL district court’s rulings.[7]

Jonah Gelbach is a professor of law at the University of California at Berkeley. He attended Yale Law School, and received his doctorate in economics from MIT.

Professor Gelbach entered the Lipitor fray to present a single issue: whether statistical significance at conventionally demanding levels such as 5 percent is an appropriate basis for excluding expert testimony based on statistical evidence from a single study that did not achieve statistical significance.

Professor Gelbach is no stranger to antic proposals.[8] As amicus curious in the Lipitor litigation, Gelbach asserts that plaintiffs’ expert witness, Dr. Singh, was wrong in his testimony about not being able to confirm the ASCOT-LLA association because he, Gelbach, could confirm the association.[9] Ultimately, the Fourth Circuit did not discuss Gelbach’s contentions, which is not surprising considering that the asserted arguments and alleged factual considerations were not only dehors the record, but in contradiction of the record.

Gelbach’s curious claim is that any time a risk ratio, for an exposure and an outcome of interest, is greater than 1.0, with a p-value < 0.5,[10] the evidence should be not only admissible, but sufficient to support a conclusion of causation. Gelbach states his claim in the context of discussing a single randomized controlled trial (ASCOT-LLA), but his broad pronouncements are carelessly framed such that others may take them to apply to a single observational study, with its greater threats to internal validity.

Contra Kumho Tire

To get to his conclusion, Gelbach attempts to remove the constraints of traditional standards of significance probability. Kumho Tire teaches that expert witnesses must “employ[] in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.”[11] For Gelbach, this “eminently reasonable admonition” does not impose any constraints on statistical inference in the courtroom. Statistical significance at traditional levels (p < 0.05) is for elitist scholarly work, not for the “practical” rent-seeking work of the tort bar. According to Gelbach, the inflation of the significance level ten-fold to p < 0.5 is merely a matter of “weight” and not admissibility of any challenged opinion testimony.

Likelihood Ratios and Posterior Probabilities

Gelbach maintains that any evidence that has a likelihood ratio greater than one (LR > 1) is relevant, and should be admissible under Federal Rule of Evidence 401.[12] This argument ignores the other operative Federal Rules of Evidence, namely 702 and 703, which impose additional criteria of admissibility for expert witness opinion testimony.

With respect to variance and random error, Gelbach tells us that any evidence that generates an LR > 1 should be admitted when “the statistical evidence is statistically significant below the 50 percent level, which will be true when the p-value is less than 0.5.”[13]

At times, Gelbach seems to be discussing the admissibility of the ASCOT-LLA study itself, and not the proffered opinion testimony of Dr. Singh. The study itself would not be admissible, although it is clearly the sort of hearsay an expert witness in the field may consider. If Dr. Singh were to have reframed and recalculated the statistical comparisons, then the Rule 703 requirement of “reasonable reliance” by scientists in the field of interest may not have been satisfied.

Gelbach also generates a posterior probability (0.77), which is based upon his calculations from data in the ASCOT-LLA trial, and not the posterior probability of Dr. Singh’s opinion. The posterior probability, as calculated, is problematic on many fronts.

Gelbach does not present his calculations – for the sake of brevity he says – but he tells us that the ASCOT-LLA data yield a likelihood ratio of roughly 1.9, and a p-value of 0.126.[14] What the clinical trialists reported was a hazard ratio of 1.15, which is a weak association on most researchers’ scales, with a two-sided p-value of 0.25, which is five times higher than the usual 5 percent. Gelbach does not explain how or why his calculated p-value for the likelihood ratio is roughly half the unadjusted, two-sided p-value for the tertiary outcome from ASCOT-LLA.
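One way to see where a p-value of roughly half the published figure could come from is to back out an approximate p-value from the reported hazard ratio and confidence interval, and then halve it. This is only a sketch under stated assumptions: it uses a normal approximation on the log hazard-ratio scale, and it supposes, without any confirmation from the brief, that the smaller figure is a one-sided p-value. Because the published inputs are rounded, the result only approximates the reported 0.2493.

```python
import math


def p_from_ratio_ci(ratio: float, lower: float, upper: float, z_crit: float = 1.96):
    """Approximate z, two-sided p, and one-sided p from a ratio and its 95% CI."""
    # Standard error of log(ratio), recovered from the width of the interval.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(ratio) / se
    # Two-sided tail area of the standard normal: P(|Z| >= |z|).
    p_two = math.erfc(abs(z) / math.sqrt(2))
    # One-sided p in the direction of the observed estimate is half of that.
    return z, p_two, p_two / 2


z, p_two, p_one = p_from_ratio_ci(1.15, 0.91, 1.44)
print(f"z = {z:.2f}, two-sided p = {p_two:.3f}, one-sided p = {p_one:.3f}")
```

Halving the published two-sided value of 0.2493 directly gives about 0.125, which is consistent with, but does not establish, a one-sided calculation.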

As noted, the reported diabetes hazard ratio of 1.15 was a tertiary outcome for the ASCOT trial, one of 15 calculated by the trialists, with p-values unadjusted for multiple comparisons.  The failure to adjust is perhaps excusable in that some (but certainly not all) of the outcome variables are overlapping or correlated. A sophisticated reader would not be misled; only when someone like Gelbach attempts to manufacture an inflated posterior probability without accounting for the gross underestimate in variance is there an insult to statistical science. Gelbach’s recalculated p-value for his LR, if adjusted for the multiplicity of comparisons in this trial, would likely exceed 0.5, rendering all his arguments nugatory.

Using the statistics as presented by the published ASCOT-LLA trial to generate a posterior probability also ignores the potential biases (systematic errors) in data collection, the unadjusted hazard ratios, the potential for departures from random sampling, errors in administering the participant recruiting and inclusion process, and other errors in measurements, data collection, data cleaning, and reporting.

Gelbach correctly notes that there is nothing methodologically inappropriate in advocating likelihood ratios, but he is less than forthcoming in explaining that such ratios translate into a posterior probability only if he posits a prior probability of 0.5.[15] His pretense to having simply stated “mathematical facts” unravels when we consider his extreme, unrealistic, and unscientific assumptions.

The Problematic Prior

Gelbach glibly assumes that the starting point, the prior probability, for his analysis of Dr. Singh’s opinion is 50%. This is an old and common mistake,[16] long since debunked.[17] Gelbach’s assumption is part of an old controversy, which surfaced in early cases concerning disputed paternity. The assumption, however, is wrong legally and philosophically.

The law simply does not hand out 0.5 prior probability to both parties at the beginning of a trial. As Professor Jaffee noted almost 35 years ago:

“In the world of Anglo-American jurisprudence, every defendant, civil and criminal, is presumed not liable. So, every claim (civil or criminal) starts at ground zero (with no legal probability) and depends entirely upon proofs actually adduced.”[18]

Gelbach assumes that assigning “equal prior probability” to two adverse parties is fair, because the fact-finder would not start hearing evidence with any notion of which party’s contentions are correct. The 0.5/0.5 starting point, however, is neither fair nor is it the law.[19] The even odds prior is also not good science.

The defense is entitled to a presumption that it is not liable, and the plaintiff must start at zero.  Bayesians understand that this is the death knell of their beautiful model.  If the prior probability is zero, then Bayes’ Theorem tells us mathematically that no evidence, no matter how large a likelihood ratio, can move the prior probability of zero towards one. Bayes’ theorem may be a correct statement about inverse probabilities, but still be an inadequate or inaccurate model for how factfinders do, or should, reason in determining the ultimate facts of a case.
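The arithmetic behind this point can be sketched with the odds form of Bayes’ theorem. The function name and the grid of priors are illustrative assumptions; the likelihood ratio of 1.9 is the figure attributed to Gelbach’s brief, and the sketch makes no attempt to reproduce his calculations, which he does not present.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability from posterior odds = prior odds * likelihood ratio."""
    if prior <= 0.0:
        return 0.0  # zero prior odds stay zero whatever the likelihood ratio
    if prior >= 1.0:
        return 1.0
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


LR = 1.9  # likelihood ratio reported in the amicus brief
for prior in (0.0, 0.05, 0.25, 0.5):
    print(f"prior = {prior:.2f} -> posterior = {posterior_probability(prior, LR):.3f}")

# A prior of zero is immovable, however large the likelihood ratio.
print(posterior_probability(0.0, 1e9))
```

With a prior of zero the posterior stays at zero for any likelihood ratio; with nonzero priors the posterior moves with the prior, which is why the choice of starting point carries so much of the weight.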

We can see how unrealistic and unfair Gelbach’s implied prior probability is if we visualize the proof process as a football field.  To win, plaintiffs do not need to score a touchdown; they need only cross the mid-field 50-yard line. Rather than making plaintiffs start at the zero-yard line, however, Gelbach would put them right on the 50-yard line. Since one toe over the mid-field line is victory, the plaintiff is spotted 99.99+% of its burden of having to present evidence to build up 50% probability. Plaintiffs are thus allowed to scoot from the zero-yard line right up to the brink of success, where even the slightest breeze might give them winning cases. Somehow, in the model, plaintiffs no longer have to present evidence to traverse the first half of the field.

The even odds starting point is completely unrealistic in terms of the events upon which the parties are wagering. The ASCOT-LLA study might have shown a protective association between atorvastatin and diabetes, or it might have shown no association at all, or it might have shown a larger hazard ratio than measured in this particular sample. Recall that the confidence interval for hazard ratios for diabetes ran from 0.91 to 1.44. In other words, parameters from 0.91 (protective association) to 1.0 (no association), to 1.44 (harmful association) were all reasonably compatible with the observed statistic, based upon this one study’s data. The potential outcomes are not binary, which makes the even odds starting point inappropriate.[20]


[1] “Schauer’s Long Footnote on Statistical Significance” (Aug. 21, 2022).

[2] Frederick Schauer, The Proof: Uses of Evidence in Law, Politics, and Everything Else 54-55 (2022) (citing Michelle M. Burtis, Jonah B. Gelbach, and Bruce H. Kobayashi, “Error Costs, Legal Standards of Proof, and Statistical Significance,” 25 Supreme Court Economic Rev. 1 (2017)).

[3] In re Lipitor Mktg., Sales Practices & Prods. Liab. Litig., MDL No. 2:14–mn–02502–RMG, 2015 WL 6941132, at *1  (D.S.C. Oct. 22, 2015).

[4] Peter S. Sever, et al., “Prevention of coronary and stroke events with atorvastatin in hypertensive patients who have average or lower-than-average cholesterol concentrations, in the Anglo-Scandinavian Cardiac Outcomes Trial Lipid Lowering Arm (ASCOT-LLA): a multicentre randomised controlled trial,” 361 Lancet 1149 (2003). [cited here as ASCOT-LLA]

[5] ASCOT-LLA at 1153 & Table 3.

[6] In re Lipitor Mktg., Sales Practices & Prods. Liab. Litig., 174 F.Supp. 3d 911, 921 (D.S.C. 2016) (quoting Dr. Singh’s testimony).

[7] In re Lipitor Mktg., Sales Practices & Prods. Liab. Litig., 892 F.3d 624, 638-39 (4th Cir. 2018) (affirming MDL trial court’s exclusion in part of Dr. Singh).

[8] See “Expert Witness Mining – Antic Proposals for Reform” (Nov. 4, 2014).

[9] Brief for Amicus Curiae Jonah B. Gelbach in Support of Plaintiffs-Appellants, In re Lipitor Mktg., Sales Practices & Prods. Liab. Litig., 2017 WL 1628475 (April 28, 2017). [Cited as Gelbach]

[10] Gelbach at *2.

[11] Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999).

[12] Gelbach at *5.

[13] Gelbach at *2, *6.

[14] Gelbach at *15.

[15] Gelbach at *19-20.

[16] See Richard A. Posner, “An Economic Approach to the Law of Evidence,” 51 Stanford L. Rev. 1477, 1514 (1999) (asserting that the “unbiased fact-finder” should start hearing a case with even odds; “[I]deally we want the trier of fact to work from prior odds of 1 to 1 that the plaintiff or prosecutor has a meritorious case. A substantial departure from this position, in either direction, marks the trier of fact as biased.”).

[17] See, e.g., Richard D. Friedman, “A Presumption of Innocence, Not of Even Odds,” 52 Stan. L. Rev. 874 (2000). [Friedman]

[18] Leonard R. Jaffee, “Prior Probability – A Black Hole in the Mathematician’s View of the Sufficiency and Weight of Evidence,” 9 Cardozo L. Rev. 967, 986 (1988).

[19] Id. at p.994 & n.35.

[20] Friedman at 877.

Schauer’s Long Footnote on Statistical Significance

August 21st, 2022

One of the reasons that, in 2016, the American Statistical Association (ASA) issued, for the first time in its history, a consensus statement on p-values, was the persistent and sometimes deliberate misstatements and misrepresentations about the meaning of the p-value. Indeed, of the six principles articulated by the ASA, several were little more than definitional, designed to clear away misunderstandings.  Notably, “Principle Two” addresses one persistent misunderstanding and states:

“P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”[1]

The ASA consensus statement followed on the heels of an important published article, written by seven important authors in the fields of statistics and epidemiology.[2] One statistician,[3] who frequently shows up as an expert witness for multi-district litigation plaintiffs, described the article’s authors as the “A-Team” of statistics. In any event, the seven prominent thought leaders identified common statistical misunderstandings, including the belief that:

“2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8% probability that chance alone produced the association. No!”[4]

This is all basic statistics.
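A short numerical sketch may help show why a significance threshold cannot be read as the probability that the null hypothesis is true. Every input below (the share of true nulls among tested hypotheses, the power, and the threshold) is a hypothetical assumption chosen for illustration, not a figure from the ASA statement or from the article quoted above.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(null is true | result is significant at alpha), by Bayes' theorem."""
    false_positives = prior_null * alpha         # true nulls that reach significance
    true_positives = (1.0 - prior_null) * power  # real effects that reach significance
    return false_positives / (false_positives + true_positives)


ALPHA = 0.05  # conventional threshold
POWER = 0.8   # assumed power against the alternative (hypothetical)
for prior_null in (0.5, 0.9):
    share = prob_null_given_significant(prior_null, ALPHA, POWER)
    print(f"P(null) = {prior_null:.1f}: P(null | significant) = {share:.3f}")
```

The answer moves with the assumed prior and power even though the threshold is fixed, so the significance level alone cannot supply it.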

Frederick Schauer is the David and Mary Harrison Distinguished Professor of Law at the University of Virginia. Schauer has contributed prolifically to legal scholarship, and his publications are often well written and thoughtful analyses. Schauer’s recent book, The Proof: Uses of Evidence in Law, Politics, and Everything Else, published by the Harvard University Press, is a contribution to the literature of “legal epistemology,” and the foundations of evidence that lie beneath many of our everyday and courtroom approaches to resolving disputes.[5] Schauer’s book might be a useful addition to an undergraduate’s reading list for a course in practical epistemology, or for a law school course on evidence. The language of The Proof is clear and lively, but at times wanders into objectionable and biased political correctness. For example, Schauer channels Naomi Oreskes and her critique of manufacturing industry in his own discussion of “manufactured evidence,”[6] but studiously avoids any number of examples of explicit manufacturing of fraudulent evidence in litigation by the lawsuit industry.[7] Perhaps the most serious omission in this book on evidence is its failure to discuss the relative quality and hierarchy of evidence in science, medicine, and in policy.  Readers will not find any mention of the methodology of systematic reviews or meta-analyses in Schauer’s work.

At the end of his chapter on burdens of proof, Schauer adds “A Long Footnote on Statistical Significance,” in which he expresses surprise that the subject of statistical significance is controversial. Schauer might well have brushed up on the statistical concepts he wanted to discuss.

Schauer’s treatment of statistical significance is both distinctly unbalanced and misstated. In an endnote,[8] Schauer cites some of the controversialists who have criticized significance tests, but none of the statisticians who have defended their use.[9]

As for conceptual accuracy, after giving a serviceable definition of the p-value, Schauer immediately goes astray:

“And this likelihood is conventionally described in terms of a p-value, where the p-value is the probability that positive results—rejection of the “null hypothesis” that there is no connection between the examined variables—were produced by chance.”[10]

And again, for emphasis, Schauer tells us:

“A p-value of greater than .05 – a greater than 5 percent probability that the same results would have been the result of chance – has been understood to mean that the results are not statistically significant.”[11]

And then once more for emphasis, in the context of an emotionally laden hypothetical about an experimental drug that “cures” a dread, incurable disease, p = 0.20, Schauer tells us that he suspects most people would want to take the medication:

“recognizing that an 80 percent likelihood that the rejection of ineffectiveness was still good enough, at least if there were no other alternatives.”

Schauer wants to connect his discussion of statistical significance to degrees or varying strengths of evidence, but his discursion into statistical significance largely conflates precision with strength. Evidence can be statistically robust but not be very strong. If we imagine a very large randomized clinical trial that found that a medication lowered systolic blood pressure by 1mm of mercury, p < 0.05, we would not consider that finding to constitute strong evidence for therapeutic benefit. If the observation of lowering blood pressure by 1mm came from an observational study, p < 0.05, the finding might not even qualify as evidence in the views of sophisticated cardiovascular physicians and researchers.
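The distinction between precision and strength can be illustrated with a simple two-sample z-test. The sample sizes and the standard deviation below are hypothetical assumptions; only the 1 mm Hg difference comes from the example in the text.

```python
import math


def two_sample_z_pvalue(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value for a difference in means, equal SDs and equal arm sizes."""
    se = sd * math.sqrt(2.0 / n_per_arm)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))


DIFF_MMHG = 1.0  # difference in systolic pressure, from the example in the text
SD_MMHG = 15.0   # assumed between-patient standard deviation (hypothetical)
for n in (100, 1_000, 10_000):
    p = two_sample_z_pvalue(DIFF_MMHG, SD_MMHG, n)
    print(f"n per arm = {n:>6}: p = {p:.6f}")
```

Under these assumptions, the same 1 mm Hg difference goes from unremarkable to highly significant as the trial grows, while its clinical importance does not change at all.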

Earlier in the chapter, Schauer points to instances in which substantial evidence for a conclusion is downplayed because it is not “conclusive,” or “definitive.” He is obviously keen to emphasize that evidence that is not “conclusive” may still be useful in some circumstances. In this context, Schauer yet again misstates the meaning of significance probability, when he tells us that:

“[j]ust as inconclusive or even weak evidence may still be evidence, and may still be useful evidence for some purposes, so too might conclusions – rejections of the null hypothesis – that are more than 5 percent likely to have been produced by chance still be valuable, depending on what follows from those conclusions.”[12]

And while Schauer is right that weak evidence may still be evidence, he seems loath to admit that weak evidence may be pretty crummy support for a conclusion. Take, for instance, a fair coin.  We have an expected value on ten flips of five heads and five tails.  We flip the coin ten times, but we observe six heads and four tails.  Do we now have “evidence” that the expected value and the expected outcome are wrong?  Not really. The probability of observing the expected outcome on the binomial model that most people would endorse for the thought experiment is 24.6%. The probability of not observing the expected value in ten flips is about three times greater. If we look at an epidemiologic study, with a sizable number of participants, the “expected value” of 1.0, embodied in the null hypothesis, is an outcome that we would rarely expect to see, even if the null hypothesis is correct.  Schauer seems to have missed this basic lesson of probability and statistics.
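The coin-flip figures can be checked directly from the binomial distribution. This is a minimal sketch; the helper name is an assumption for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


p_five = binomial_pmf(5, 10)  # exactly the expected five heads
p_six = binomial_pmf(6, 10)   # the six heads observed in the example

print(f"P(exactly 5 heads) = {p_five:.3f}")
print(f"P(not 5 heads)     = {1 - p_five:.3f}")
print(f"P(exactly 6 heads) = {p_six:.3f}")
```

Even with a perfectly fair coin, the single most likely outcome occurs in only about a quarter of ten-flip experiments, and six heads is nearly as likely as five.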

Perhaps even more disturbing is that Schauer fails to distinguish the other determinants of study validity and the warrants for inferring a conclusion at any level of certainty. There is a distinct danger that his comments about p-values will be taken to apply to various study designs, descriptive, observational, and experimental. And there is a further danger that incorrect definitions of the p-value and statistical significance probabilities will be used to misrepresent p-values as relating to posterior probabilities. Surely, a distinguished professor of law, at a top law school, in a book published by a prestigious  publisher (Belknap Press) can do better. The message for legal practitioners is clear. If you need to define or discuss statistical concepts in a brief, seek out a good textbook on statistics. Do not rely upon other lawyers, even distinguished law professors, or judges, for accurate working definitions.


[1] Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The Am. Statistician 129, 131 (2016).

[2] Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman, “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” 31 European J. Epidemiol. 337 (2016). [cited as “Seven Sachems”]

[3] Martin T. Wells.

[4] Seven Sachems at 340 (emphasis added).

[5] Frederick Schauer, The Proof: Uses of Evidence in Law, Politics, and Everything Else (2022). [Schauer] One nit: Schauer cites a paper by A. Philip Dawid, “Statistical Risk,” 194 Synthese 3445 (2017). The title of the paper is “On individual risk.”

[6] Naomi Oreskes & Erik M. Conway, Merchants of Doubt: How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Climate Change (2010).

[7] See, e.g., In re Silica Prods. Liab. Litig., 398 F.Supp. 2d 563 (S.D.Tex. 2005); Transcript of Daubert Hearing at 23 (Feb. 17, 2005) (“great red flags of fraud”).

[8] See Schauer endnote 44 to Chapter 3, “The Burden of Proof,” citing Valentin Amrhein, Sander Greenland, and Blake McShane, “Scientists Rise Up against Statistical Significance,” www.nature.com (March 20, 2019), which in turn commented upon Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett, “Abandon Statistical Significance,” 73 American Statistician 235 (2019).

[9] Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xuming He, Xiao-Li Meng, Nancy Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young, and Karen Kafadar, “The ASA President’s Task Force Statement on Statistical Significance and Replicability,” 15 Annals of Applied Statistics 1084 (2021); see also “A Proclamation from the Task Force on Statistical Significance” (June 21, 2021).

[10] Schauer at 55. To be sure, Schauer, in endnote 43 to Chapter 3, disclaims any identification of p-values or measures of statistical significance with posterior probabilities or probabilistic measures of the burden of proof. Nonetheless, in the text, he proceeds to do exactly what he disclaimed in the endnote.

[11] Schauer at 55.

[12] Schauer at 56.