For your delectation and delight, desultory dicta on the law of delicts.

Has the American Statistical Association Gone Post-Modern?

March 24th, 2019

Last week, the American Statistical Association (ASA) released a special issue of its journal, The American Statistician, with 43 articles addressing the issue of “statistical significance.” If you are on the ASA’s mailing list, you received an email announcing that

“the lead editorial calls for abandoning the use of ‘statistically significant’, and offers much (not just one thing) to replace it. Written by Ron Wasserstein, Allen Schirm, and Nicole Lazar, the co-editors of the special issue, ‘Moving to a World Beyond “p < 0.05”’ summarizes the content of the issue’s 43 articles.”

In 2016, the ASA issued its “consensus” statement on statistical significance, in which it articulated six principles for interpreting p-values, and for avoiding erroneous interpretations. Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The American Statistician 129 (2016) [ASA Statement]. In the final analysis, that ASA Statement really did not change very much, and could fairly be read only to state that statistical significance is not sufficient for causal inference.1 Aside from overzealous, over-claiming lawyers and their expert witnesses, few scientists or statisticians had ever maintained that statistical significance was sufficient to support causal inference. Still, many “health effect claims” involve alleged causation that is really a modification of the base rate of a disease or disorder that occurs without the allegedly harmful exposure, and that does not invariably occur even with the exposure. It is hard to imagine drawing an inference of such causation without ruling out random error, as well as bias and confounding.

According to the lead editorial for the special issue:

“The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different’, ‘p < 0.05’, and ‘nonsignificant’ survive, whether expressed in words, by asterisks in a table, or in some other way.”2

The ASA (through Wasserstein and colleagues) appears to be condemning the dichotomization of p-values, which lie on a continuum between zero and one. Presumably, saying that a p-value is less than 5% is tantamount to dichotomizing, but providing the actual value of the p-value would cause no offense, as long as it was not labeled “significant.”

So although the ASA appears to have gone “whole hog,” the Wasserstein editorial does not appear to condemn assessing random error, or evaluating the extent of random error as part of assessing a study’s support for an association. Reporting “p < 0.05,” as opposed to the actual p-value between zero and one, is largely an artifact of the statistical tables of the pre-computer era.

So what is the ASA affirmatively recommending? “Much, not just one thing?” Or too much of nothing, which we know makes a man feel ill at ease. Wasserstein’s editorial earnestly admits that there is no replacement for:

“the outsized role that statistical significance has come to play. The statistical community has not yet converged on a simple paradigm for the use of statistical inference in scientific research—and in fact it may never do so.”3

The 42 other articles in the special issue certainly do not converge on any unified, coherent response to the perceived crisis. Indeed, a cursory review of the abstracts alone suggests deep disagreements over an appropriate approach to statistical inference. The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass.4

2 Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 Am. Statistician S1, S2 (2019).

3 Id. at S2.

4 See, e.g., John P. A. Ioannidis, “Retiring statistical significance would give bias a free pass,” 567 Nature 461 (2019); Valen E. Johnson, “Raise the Bar Rather than Retire Significance,” 567 Nature 461 (2019).

Lipitor Diabetes MDL’s Inexact Analysis of Fisher’s Exact Test

March 23rd, 2019

Muriel Bristol was a biologist who studied algae at the Rothamsted Experimental Station in England, after World War I. In addition to her knowledge of plant biology, Bristol claimed the ability to tell whether the tea had been added to the milk, or the tea had been poured first with the milk added afterwards. Bristol, as a scientist and a proper English woman, preferred the former.

Ronald Fisher, who also worked at Rothamsted, expressed his skepticism over Dr. Bristol’s claim. Fisher set about to design a randomized experiment that would efficiently and effectively test her claim. Bristol was presented with eight cups of tea, four of which were prepared with milk added to tea, and four prepared with tea added to milk. Bristol, of course, was blinded to which was which, but was required to label each according to its manner of preparation. Fisher saw his randomized experiment as a 2 x 2 contingency table, from which he could calculate the probability of the observed outcome (along with any more extreme outcomes), using the assumption of fixed margins and the hypergeometric probability distribution. Fisher’s Exact Test was born at tea time.[1]

Fisher described the origins of his Exact Test in one of his early texts, but he neglected to report whether his experiment vindicated Bristol’s claim. According to David Salsburg, H. Fairfield Smith, one of Fisher’s colleagues, acknowledged that Bristol nailed Fisher’s Exact test, with all eight cups correctly identified. The test has gone on to become an important tool in the statistician’s armamentarium.
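The chance performance in Fisher’s design follows directly from the hypergeometric distribution. A minimal sketch in Python (an illustration of the arithmetic, not anything from Fisher’s text):

```python
from math import comb

# Eight cups, four prepared each way; the taster must identify the four
# milk-first cups. Under the null hypothesis of pure guessing, every way
# of choosing four cups out of eight is equally likely.
total_labelings = comb(8, 4)                       # 70 possible labelings
p_all_correct = comb(4, 4) * comb(4, 0) / total_labelings
print(p_all_correct)                               # 1/70, about 0.0143
```

Identifying all eight cups correctly thus has a chance probability of about 1.4%, which is why an eight-cup design, rather than a smaller one, yields a meaningful test.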

Fisher’s Exact, like any statistical test, has model assumptions and preconditions. For one thing, the test is designed for categorical data, with binary outcomes. The test allows us to evaluate whether two proportions plausibly differ by chance alone, by calculating the probability of the observed outcome, as well as of more extreme outcomes.

The calculation of an exact attained significance probability, using Fisher’s approach, provides a one-sided p-value, with no unique solution for calculating a two-sided attained significance probability. In discrimination cases, the one-sided p-value may well be more appropriate for the issue at hand, and so the test’s providing only a one-sided p-value is not a particular problem.[2] Fisher’s Exact Test has thus played an important role in showing the judiciary that small sample size need not be an insuperable barrier to meaningful statistical analysis.

The difficulty of using Fisher’s Exact for small sample sizes is that the hypergeometric distribution, upon which the test is based, is highly asymmetric. The observed one-sided p-value does not measure the probability of a result equally extreme in the opposite direction. There are at least three ways to calculate a two-sided p-value:

  • Double the one-sided p-value.
  • Add the point probabilities from the opposite tail that are more extreme than the observed point probability.
  • Use the mid-P value; that is, add all values more extreme (smaller) than the observed point probability from both sides of the distribution, PLUS ½ of the observed point probability.

Some software programs will proceed in one of these ways by default, but their doing so does not guarantee the most accurate measure of two-tailed significance probability.
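The three conventions can be made concrete with a short calculation. The sketch below (a generic illustration, not any expert witness’s actual data) enumerates the hypergeometric point probabilities for a 2 x 2 table with fixed margins and computes each version:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """P-value conventions for the 2x2 table [[a, b], [c, d]],
    conditioning on fixed margins (the hypergeometric distribution)."""
    n = a + b + c + d
    r1, c1 = a + b, a + c                      # first row and column totals
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)  # possible values of cell a
    denom = comb(n, c1)
    probs = {k: comb(r1, k) * comb(n - r1, c1 - k) / denom
             for k in range(lo, hi + 1)}       # point probability of each table
    p_obs = probs[a]
    # one-sided (upper-tail) exact p-value; a deficit would use the lower tail
    one_sided = sum(p for k, p in probs.items() if k >= a)
    # method 1: double the one-sided p-value
    doubled = min(1.0, 2 * one_sided)
    # method 2: sum all point probabilities no larger than the observed one
    point_method = sum(p for p in probs.values() if p <= p_obs + 1e-12)
    # method 3: mid-p -- strictly more extreme points from both tails,
    # plus half of the observed point probability (ties excluded)
    mid_p = sum(p for p in probs.values() if p < p_obs - 1e-12) + 0.5 * p_obs
    return one_sided, doubled, point_method, mid_p
```

For the tea-tasting table, `fisher_two_sided(4, 0, 0, 4)` gives a one-sided exact p-value of 1/70 and a doubled value of 2/70; for asymmetric tables the three two-sided conventions diverge.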

In the Lipitor diabetes MDL, Judge Gergel generally used sharp analyses to cut through the rancid fat of litigation claims, to get to the heart of the matter. By and large, he appears to have done a splendid job. In the course of gatekeeping under Federal Rule of Evidence 702, however, Judge Gergel may have misunderstood the nature of Fisher’s Exact Test.

Nicholas Jewell is a well-credentialed statistician at the University of California. In the courtroom, Jewell is a well-known expert witness for the litigation industry. He is no novice at generating unreliable opinion testimony. See In re Zoloft Prods. Liab. Litig., No. 12–md–2342, 2015 WL 7776911 (E.D. Pa. Dec. 2, 2015) (excluding Jewell’s opinions as scientifically unwarranted and methodologically flawed); In re Zoloft Prods. Liab. Litig., MDL No. 2342, 2016 WL 1320799 (E.D. Pa. April 5, 2016) (granting summary judgment after excluding Dr. Jewell). See “The Education of Judge Rufe – The Zoloft MDL” (April 9, 2016).

In the Lipitor cases, some of Jewell’s opinions seemed outlandish indeed, and Judge Gergel generally excluded them. See In re Lipitor Marketing, Sales Practices and Prods. Liab. Litig., 145 F.Supp. 3d 573 (D.S.C. 2015), reconsideration den’d, 2016 WL 827067 (D.S.C. Feb. 29, 2016). As Judge Gergel explained, Jewell calculated a relative risk for abnormal blood glucose in a Lipitor group to be 3.0 (95% C.I., 0.9 to 9.6), using STATA software. Also using STATA, Jewell obtained an attained significance probability of 0.0654, based upon Fisher’s Exact Test. Lipitor Jewell at *7.

Judge Gergel did not report whether Jewell’s reported p-value of 0.0654 was one- or two-sided, but he did state that the attained probability “indicates a lack of statistical significance.” Id. & n. 15. The rest of His Honor’s discussion of the challenged opinion, however, makes clear that the p-value of 0.0654 must have been a two-sided value. If it had been a one-sided p-value, then there would have been no way of invoking the mid-p to generate a two-sided p-value below 5%. The mid-p will always be larger than the one-tailed exact p-value generated by Fisher’s Exact Test.

The court noted that Dr. Jewell had testified that he believed that STATA generated this confidence interval by “flip[ping]” the Taylor series approximation. The STATA website notes that it calculates confidence intervals for odds ratios (which are different from the relative risk that Jewell testified he computed) by inverting the Fisher exact test.[3] Id. at *7 & n. 17. Of course, STATA’s description suggests that the confidence interval is based upon exact methods, not upon the Taylor series approximation that Jewell described.

STATA does not provide a mid p-value calculation, and so Jewell used an on-line calculator to obtain a mid p-value of 0.04, which he declared statistically significant. The court took Jewell to task for using the mid p-value as though it were a different analysis or test. Id. at *8. Because the mid-p value will always be larger than the one-sided exact p-value from Fisher’s Exact Test, the court’s explanation does not really make sense:

“Instead, Dr. Jewell turned to the mid-p test, which would ‘[a]lmost surely’ produce a lower p-value than the Fisher exact test.”

Id. at *8. The mid-p test, however, is not different from the Fisher’s exact; rather it is simply a way of dealing with the asymmetrical distribution that underlies the Fisher’s exact, to arrive at a two-tailed p-value that more accurately captures the rate of Type I error.

The MDL court acknowledged that the mid-p approach was not inherently unreliable, but questioned Jewell’s inconsistent, selective use of the approach for only one test.[4] Jewell certainly did not help the plaintiffs’ cause, or his own standing, by discarding the analyses that were not incorporated into his report, thus leaving the MDL court to guess at how much selection went on in his process of generating his opinions. Id. at *9 & n. 19.

None of Jewell’s other calculated p-values involved the mid-p approach, but the court’s criticism raises the question whether the other p-values came from a Fisher’s Exact Test with small sample size, or from some other highly asymmetric distribution. Id. at *8. Although Jewell had shown himself willing to engage in other dubious, result-oriented analyses, Jewell’s use of the mid-p for this one comparison may have been within acceptable bounds after all.

The court also noted that Jewell had obtained the “exact p-value and that this p-value was not significant.” Id. The court’s notation here, however, does not report the important detail whether that exact, unreported p-value was merely double the one-sided p-value given by Fisher’s Exact Test. As the STATA website, cited by the MDL court, explains:

“The test naturally gives a one-sided p-value, and there are at least four different ways to convert it to a two-sided p-value (Agresti 2002, 93). One way, not implemented in Stata, is to double the one-sided p-value; doubling is simple but can result in p-values larger than one.”

Wesley Eddings, “Fisher’s exact test two-sided idiosyncrasy” (Jan. 2009) (citing Alan Agresti, Categorical Data Analysis 93 (2d ed. 2002)).
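The doubling idiosyncrasy the STATA note describes is easy to reproduce. When the observed table sits near the center of the hypergeometric distribution, the one-sided p-value exceeds 0.5, and doubling it yields a “p-value” greater than one. A sketch with a hypothetical balanced table, for illustration only:

```python
from math import comb

# Balanced 2x2 table [[2, 2], [2, 2]]: both margins are 4, n = 8.
# Cell a = 2 sits at the mode of the hypergeometric distribution.
denom = comb(8, 4)
probs = {k: comb(4, k) * comb(4, 4 - k) / denom for k in range(5)}

one_sided = sum(p for k, p in probs.items() if k >= 2)  # P(A >= 2)
print(one_sided)        # 53/70, about 0.757
print(2 * one_sided)    # about 1.514 -- "larger than one," as Eddings warns
```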

On plaintiffs’ motion for reconsideration, the MDL court reaffirmed its findings with respect to Jewell’s use of the mid-p.  Lipitor Jewell Reconsidered at *3. In doing so, the court insisted that the one instance in which Jewell used the mid-p stood in stark contrast to all the other instances in which he had used Fisher’s Exact Test.  The court then cited to the record to identify 21 other instances in which Jewell used a p-value rather than a mid-p value.  The court, however, did not provide the crucial detail whether these 21 other instances actually involved small-sample applications of Fisher’s Exact Test.  As result-oriented as Jewell can be, it seems safe to assume that not all his statistical analyses involved Fisher’s Exact Test, with its attendant ambiguity for how to calculate a two-tailed p-value.

[1] Sir Ronald A. Fisher, The Design of Experiments at chapter 2 (1935); see also Stephen Senn, “Tea for three: Of infusions and inferences and milk in first,” Significance 30 (Dec. 2012); David Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century  (2002).

[2] See, e.g., Dendy v. Washington Hosp. Ctr., 431 F. Supp. 873 (D.D.C. 1977) (denying preliminary injunction), rev’d, 581 F.2d 99 (D.C. Cir. 1978) (reversing denial of relief, and remanding for reconsideration). See also National Academies of Science, Reference Manual on Scientific Evidence 255 n.108 (3d ed. 2011) (“Well-known small sample techniques [for testing significance and calculating p-values] include the sign test and Fisher’s exact test.”).

[3] See Wesley Eddings, “Fisher’s exact test two-sided idiosyncrasy” (Jan. 2009), available at <>, last visited April 19, 2016 (“Stata’s exact confidence interval for the odds ratio inverts Fisher’s exact test.”). This article by Eddings contains a nice discussion of why the Fisher’s Exact Test attained significance probability disagrees with the calculated confidence interval. Eddings points out the asymmetry of the hypergeometric distribution, which complicates arriving at an exact p-value for a two-sided test.

[4] See Barber v. United Airlines, Inc., 17 Fed. Appx. 433, 437 (7th Cir. 2001) (“Because in formulating his opinion Dr. Hynes cherry-picked the facts he considered to render an expert opinion, the district court correctly barred his testimony because such a selective use of facts fails to satisfy the scientific method and Daubert.”).

ASA Statement Goes to Court – Part 2

March 7th, 2019

It has been almost three years since the American Statistical Association (ASA) issued its statement on statistical significance. Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The American Statistician 129 (2016) [ASA Statement]. Before the ASA’s Statement, courts and lawyers from all sides routinely misunderstood, misstated, and misrepresented the meaning of statistical significance.1 These errors were pandemic despite the efforts of the Federal Judicial Center and the National Academies of Science to educate judges and lawyers, through their Reference Manuals on Scientific Evidence and seminars. The interesting question is whether the ASA’s Statement has improved, or will improve, the unfortunate situation.2

The ASA Statement on Testosterone

“Ye blind guides, who strain out a gnat and swallow a camel!”
Matthew 23:24

To capture the state of the art, or the state of correct and flawed interpretations of the ASA Statement, reviewing a recent but now resolved, large so-called mass tort may be illustrative. Pharmaceutical products liability cases almost always turn on evidence from pharmaco-epidemiologic studies that compare the rate of an outcome of interest among patients taking a particular medication with the rate among similar, untreated patients. These studies compare the observed with the expected rates, and invariably assess the differences both for the magnitude of the difference, as a “risk ratio” or a “risk difference,” and for the “significance probability” of observing a rate at least as large as that seen in the exposed group, given the assumptions that the medication did not change the rate and that the data followed a given probability distribution. In these alleged “health effects” cases, claims and counterclaims of misuse of significance probability have been pervasive. After the ASA Statement was released, some lawyers began to modify their arguments to suggest that their adversaries’ arguments offend the ASA’s pronouncements.
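The two comparative measures just mentioned are simple functions of the observed rates. A sketch with hypothetical counts (not drawn from any study in the litigation):

```python
# Hypothetical two-group safety data, for illustration only.
exposed_events, exposed_n = 30, 1000       # patients on the medication
unexposed_events, unexposed_n = 20, 1000   # similar, untreated patients

risk_exposed = exposed_events / exposed_n            # 0.03
risk_unexposed = unexposed_events / unexposed_n      # 0.02

risk_ratio = risk_exposed / risk_unexposed           # 1.5, a relative measure
risk_difference = risk_exposed - risk_unexposed      # about 0.01, an absolute measure
```

Whether either measure reflects a real effect, rather than chance, bias, or confounding, is of course the question the rest of the analysis must answer.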

One litigation that showcases the use and misuse of the ASA Statement arose from claims that AbbVie, Inc.’s transdermal testosterone medication (TRT) causes heart attacks, strokes, and venous thromboembolism. The FDA had reviewed the plaintiffs’ claims, made in a Public Citizen complaint, and resoundingly rejected the causal interpretation of two dubious observational studies, and an incomplete meta-analysis that used an off-beat composite end point.3 The Public Citizen petition probably did succeed in pushing the FDA to convene an Advisory Committee meeting, which again resulted in a rejection of the causal claims. The FDA did, however, modify the class labeling for TRT with respect to indication and a possible association with cardiovascular outcomes. And then the litigation came.

Notwithstanding the FDA’s determination that a causal association had not been shown, thousands of plaintiffs sued several companies, with most of the complaints falling on AbbVie, Inc., which had the largest presence in the market. The ASA Statement came up occasionally in pre-trial depositions, but became a major brouhaha when AbbVie moved to exclude plaintiffs’ causation expert witnesses.4

The Defense’s Anticipatory Parry of the ASA Statement

As AbbVie described the situation:

“Plaintiffs’ experts uniformly seek to abrogate the established methods and standards for determining … causal factors in favor of precisely the kind of subjective judgments that Daubert was designed to avoid. Tests for statistical significance are characterized as ‘misleading’ and rejected [by plaintiffs’ expert witnesses] in favor of non-statistical ‘estimates’, ‘clinical judgment’, and ‘gestalt’ views of the evidence.”5

AbbVie’s brief in support of excluding plaintiffs’ expert witnesses barely mentioned the ASA Statement, but in a footnote, the defense anticipated that the Plaintiffs’ opposition would rest on a rejection of the importance of statistical significance testing, and on the claim that this rejection was somehow supported by the ASA Statement:

“The statistical community is currently debating whether scientists who lack expertise in statistics misunderstand p-values and overvalue significance testing. [citing ASA Statement] The fact that there is a debate among professional statisticians on this narrow issue does not validate Dr. Gerstman’s [plaintiffs’ expert witness’s] rejection of the importance of statistical significance testing, or undermine Defendants’ reliance on accepted methods for determining association and causation.”6

In its brief in support of excluding causation opinions, the defense took pains to define statistical significance, and managed to do so, painfully, or at least in ways that the ASA conferees would have found objectionable:

“Any association found must be tested for its statistical significance. Statistical significance testing measures the likelihood that the observed association could be due to chance variation among samples. Scientists evaluate whether an observed effect is due to chance using p-values and confidence intervals. The prevailing scientific convention requires that there be 95% probability that the observed association is not due to chance (expressed as a p-value < 0.05) before reporting a result as ‘statistically significant.’ * * * This process guards against reporting false positive results by setting a ceiling for the probability that the observed positive association could be due to chance alone, assuming that no association was actually present.”7

AbbVie’s brief proceeded to characterize the confidence interval as a tool of significance testing, again in a way that misstates the mathematical meaning and importance of the interval:

“The determination of statistical significance can be described equivalently in terms of the confidence interval calculated in connection with the association. A confidence interval indicates the level of uncertainty that exists around the measured value of the association (i.e., the OR or RR). A confidence interval defines the range of possible values for the actual OR or RR that are compatible with the sample data, at a specified confidence level, typically 95% under the prevailing scientific convention. Reference Manual, at 580 (Ex. 14) (‘If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population.’). * * * If the confidence interval crosses 1.0, this means there may be no difference between the treatment group and the control group, therefore the result is not considered statistically significant.”8

Perhaps AbbVie’s counsel should be permitted a plea in mitigation by having cited to, and quoted from, the Reference Manual on Scientific Evidence’s chapter on epidemiology, which was also wide of the mark in its description of the confidence interval. Counsel would have been better served by the Manual’s more rigorous and accurate chapter on statistics. Even so, the above-quoted statements give an inappropriate interpretation of random error as a probability about the hypothesis being tested.9 Particularly dangerous, in terms of failing to advance AbbVie’s own objectives, was the characterization of the confidence interval as measuring the level of uncertainty, as though there were no other sources of uncertainty other than random error in the measurement of the risk ratio.
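What a confidence interval does and does not measure can be shown by simulation. The sketch below rests on purely illustrative assumptions (known true risks, binomial sampling, a Wald interval on the log risk-ratio scale); it shows that the “95%” attaches to the procedure across repeated samples, not to any single interval, and says nothing about bias or confounding:

```python
import math
import random

random.seed(1)
true_p1, true_p0, n = 0.10, 0.05, 1000   # assumed true risks, exposed vs. unexposed
true_rr = true_p1 / true_p0              # true risk ratio = 2.0

trials, covered = 2000, 0
for _ in range(trials):
    x1 = sum(random.random() < true_p1 for _ in range(n))  # exposed events
    x0 = sum(random.random() < true_p0 for _ in range(n))  # unexposed events
    rr = (x1 / n) / (x0 / n)
    # Wald 95% interval on the log risk-ratio scale
    se = math.sqrt(1 / x1 - 1 / n + 1 / x0 - 1 / n)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    covered += (lo <= true_rr <= hi)

print(covered / trials)   # close to 0.95 across repeated samples
```

No single interval from this simulation has a “95% probability” of containing the true risk ratio; the long-run frequency of coverage is what the confidence level describes.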

The Plaintiffs’ Attack on Significance Testing

The Plaintiffs, of course, filed an opposition brief that characterized the defense position as an attempt to:

“elevate statistical significance, as measured by confidence intervals and so-called p-values, to the status of an absolute requirement to the establishment of causation.”10

Tellingly, the plaintiffs’ brief fails to point to any modern-era example of a scientific determination of causation based upon epidemiologic evidence, in which the pertinent studies were not assessed for, and found to show, statistical significance.

After citing a few judicial opinions that underplayed the importance of statistical significance, the Plaintiffs’ opposition turned to the ASA Statement for what it perceived to be support for its loosey-goosey approach to causal inference.11 The Plaintiffs’ opposition brief quoted a series of propositions from the ASA Statement, without the ASA’s elaborations and elucidations, and without much in the way of explanation or commentary. At the very least, the Plaintiffs’ heavy reliance upon, despite their distortions of, the ASA Statement helped them to define key statistical concepts more carefully than had AbbVie in its opening brief.

The ASA Statement, however, was not immune from being misrepresented in the Plaintiffs’ opposition brief. Many of the quoted propositions were quite beside the points of the dispute over the validity and reliability of Plaintiffs’ expert witnesses’ conclusions of causation about testosterone and heart attacks, conclusions not reached or shared by the FDA, any consensus statement from medical organizations, or any serious published systematic review:

“P-values do not measure the probability that the studied hypothesis is true, … .”12

This proposition from the ASA Statement is true, but trivially true. (Of course, this ASA principle is relevant to the many judicial decisions that have managed to misstate what p-values measure.) The above-quoted proposition follows from the definition and meaning of the p-value; only someone who did not understand significance probability would confuse it with the probability of the truth of the studied hypothesis. P-values’ not measuring the probability of the null hypothesis, or any alternative hypothesis, is not a flaw in p-values, but arguably their strength.

“A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”13

Again, true, true, and immaterial. The existence of other importance metrics, such as the magnitude of an association or correlation, hardly detracts from the importance of assessing the random error in an observed statistic. The need to assess clinical or practical significance of an association or correlation also does not detract from the importance of the assessed random error in a measured statistic.

“By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”14

The Plaintiffs’ opposition attempted to spin this ASA statement as a criticism of p-values, but the attempt involves an ignoratio elenchi. Once again, the p-value assumes a probability model and a null hypothesis, and so it cannot provide a “measure” of the probability of the model or the hypothesis.

The Plaintiffs’ final harrumph on the ASA Statement was their claim that the ASA Statement’s conclusion was “especially significant” to the testosterone litigation:

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”15

The existence of other important criteria in the evaluation and synthesis of a complex body of studies does not erase or supersede the importance of assessing stochastic error in the epidemiologic studies. The Plaintiffs’ Opposition Brief asserted that the Defense’s attempt

“to substitute the single index, the p-value, for scientific reasoning in the reports of Plaintiffs’ experts should be rejected.”16

Some of the defense’s opening brief could indeed be read as reducing causal inference to the determination of statistical significance. A sympathetic reading of the entire AbbVie brief, however, shows that it had criticized the threats to validity in the observational epidemiologic studies, as well as some of the clinical trials, and other rampant flaws in the Plaintiffs’ expert witnesses’ reasoning. The Plaintiffs’ citations to the ASA Statement’s “negative” propositions about p-values (to emphasize what they are not) appeared to be the stuffing of a strawman, used to divert attention from other failings of their own claims and proffered analyses. In other words, the substance of the Rule 702 application had much more to do with data quality and study validity than statistical significance.

What did the trial court make of this back and forth about statistical significance and the ASA Statement? For the most part, the trial court denied both sides’ challenges to proffered expert witness testimony on causation and statistical issues. In sorting out the controversy over the ASA Statement, the trial court apparently misunderstood key statistical concepts and paid little attention to threats to validity other than random variability in study results.17 The trial court summarized the controversy as follows:

“In arguing that the scientific literature does not support a finding that TRT is associated with the alleged injuries, AbbVie emphasize [sic] the importance of considering the statistical significance of study results. Though experts for both AbbVie and plaintiffs agree that statistical significance is a widely accepted concept in the field of statistics and that there is a conventional method for determining the statistical significance of a study’s findings, the parties and their experts disagree about the conclusions one may permissibly draw from a study result that is deemed to possess or lack statistical significance according to conventional methods of making that determination.”18

Of course, there was never a controversy presented to the court about drawing a conclusion from “a study.” By the time the briefs were filed, both sides had multiple observational studies, clinical trials, and meta-analyses to synthesize into opinions for or against causal claims.

Ironically, AbbVie might claim to have prevailed in having the trial court adopt its misleading definitions of p-values and confidence intervals:

“Statisticians test for statistical significance to determine the likelihood that a study’s findings are due to chance. *** According to conventional statistical practice, such a result *** would be considered statistically significant if there is a 95% probability, also expressed as a ‘p-value’ of <0.05, that the observed association is not the product of chance. If, however, the p-value were greater than 0.05, the observed association would not be regarded as statistically significant, according to prevailing conventions, because there is a greater than 5% probability that the association observed was the result of chance.”19

The MDL court similarly appeared to accept AbbVie’s dubious description of the confidence interval:

“A confidence interval consists of a range of values. For a 95% confidence interval, one would expect future studies sampling the same population to produce values within the range 95% of the time. So if the confidence interval ranged from 1.2 to 3.0, the association would be considered statistically significant, because one would expect, with 95% confidence, that future studies would report a ratio above 1.0 – indeed, above 1.2.”20

The court’s opinion clearly evidences the danger in stating the importance of statistical significance without placing equal emphasis on the need to exclude bias and confounding. Having found an observational study and one meta-analysis of clinical trial safety outcomes that were statistically significant, the trial court held that any dispute over the probativeness of the studies was for the jury to assess.

Some but not all of AbbVie’s brief might have encouraged this lax attitude by failing to emphasize study validity at the same time as emphasizing the importance of statistical significance. In any event, the trial court continued with its précis of the plaintiffs’ argument that:

“a study reporting a confidence interval ranging from 0.9 to 3.5, for example, should certainly not be understood as evidence that there is no association and may actually be understood as evidence in favor of an association, when considered in light of other evidence. Thus, according to plaintiffs’ experts, even studies that do not show a statistically significant association between TRT and the alleged injuries may plausibly bolster their opinions that TRT is capable of causing such injuries.”21

Of course, a single study that reported a risk ratio greater than 1.0, with a confidence interval of 0.9 to 3.5, might reasonably be incorporated into a meta-analysis that in turn could support, or not support, a causal inference. In the TRT litigation, however, the well-conducted, most up-to-date meta-analyses did not report statistically significant elevated rates of cardiovascular events among users of TRT. The court’s insistence that a study with a confidence interval of 0.9 to 3.5 cannot be interpreted as evidence of no association is, of course, correct. Equally correct would be to say that the interval shows that the study failed to show an association. The trial court never grappled with the reality that the best-conducted meta-analyses failed to show statistically significant increases in the rates of cardiovascular events.
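To make the arithmetic concrete, here is a sketch of the standard Wald 95% confidence interval for a risk ratio, computed on the log scale. The 2x2 counts are hypothetical, chosen only to produce an interval straddling 1.0 like the 0.9 to 3.5 interval discussed above:

```python
import math

# Hypothetical counts, assumed purely for illustration
events_exp, n_exp = 18, 1000      # exposed group
events_unexp, n_unexp = 10, 1000  # unexposed group

rr = (events_exp / n_exp) / (events_unexp / n_unexp)

# Wald interval on the log scale (standard epidemiologic approximation)
se_log_rr = math.sqrt(1/events_exp - 1/n_exp + 1/events_unexp - 1/n_unexp)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

An interval like this, running from below 1.0 to nearly 4.0, is compatible both with no association and with a near-quadrupled risk, which is why such a study, standing alone, neither shows nor rules out an association, even though it can contribute to a meta-analysis.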

The American Statistical Association and its members would likely have been deeply disappointed by how both parties used the ASA Statement for their litigation objectives. AbbVie’s suggestion that the ASA Statement reflects a debate about “whether scientists who lack expertise in statistics misunderstand p-values and overvalue significance testing” appears to have no support in the Statement itself or in any other commentary to come out of the meeting leading up to the Statement. The Plaintiffs’ argument that p-values, properly understood, are unimportant and misleading similarly finds no support in the ASA Statement. Conveniently, the Plaintiffs’ brief ignored the Statement’s insistence upon transparency in the pre-specification of analyses and outcomes, and in the handling of multiple comparisons:

“P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. Cherrypicking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference, and ‘p-hacking’, leads to a spurious excess of statistically significant results in the published literature and should be vigorously avoided.”22

Most if not all of the plaintiffs’ expert witnesses’ reliance materials would have been eliminated under this principle set forth by the ASA Statement.

1 See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 191 (S.D.N.Y. 2005). See also “Confidence in Intervals and Diffidence in the Courts” (March 4, 2012); “Scientific illiteracy among the judiciary” (Feb. 29, 2012).

3 Letter of Janet Woodcock, Director of FDA’s Center for Drug Evaluation and Research, to Sidney Wolfe, Director of Public Citizen’s Health Research Group (July 16, 2014) (denying citizen petition for “black box” warning).

4 Defendants’ (AbbVie, Inc.’s) Motion to Exclude Plaintiffs Expert Testimony on the Issue of Causation, and for Summary Judgment, and Memorandum of Law in Support, Case No. 1:14-CV-01748, MDL 2545, Document #: 1753, 2017 WL 1104501 (N.D. Ill. Feb. 20, 2017) [AbbVie Brief].

5 AbbVie Brief at 3; see also id. at 7-8 (“Depending upon the expert, even the basic tests of statistical significance are simply ignored, dismissed as misleading… .”) AbbVie’s definitions of statistical significance occasionally wandered off track and into the transposition fallacy, but generally its point was understandable.

6 AbbVie Brief at 63 n.16 (emphasis in original).

7 AbbVie Brief at 13 (emphasis in original).

8 AbbVie Brief at 13-14 (emphasis in original).

9 The defense brief further emphasized statistical significance almost as though it were a sufficient basis for inferring causality from observational studies: “Regardless of this debate, courts have routinely found the traditional epidemiological method—including bedrock principles of significance testing—to be the most reliable and accepted way to establish general causation. See, e.g., In re Zoloft, 26 F. Supp. 3d 449, 455; see also Rosen v. Ciba-Geigy Corp., 78 F.3d 316, 319 (7th Cir. 1996) (‘The law lags science; it does not lead it.’).” AbbVie Brief at 63-64 & n.16. The defense’s language about “including bedrock principles of significance testing” absolves it of having totally ignored the other necessary considerations, but the defense might still have done well to point out, at the same time, the other considerations needed for causal inference.

10 Plaintiffs’ Steering Committee’s Memorandum of Law in Opposition to Motion of AbbVie Defendants to Exclude Plaintiffs’ Expert Testimony on the Issue of Causation, and for Summary Judgment at p.34, Case No. 1:14-CV-01748, MDL 2545, Document No. 1753 (N.D. Ill. Mar. 23, 2017) [Opp. Brief].

11 Id. at 35 (appending the ASA Statement and the commentary of more than two dozen interested commentators).

12 Id. at 38 (quoting from the ASA Statement at 131).

13 Id. at 38 (quoting from the ASA Statement at 132).

14 Id. at 38 (quoting from the ASA Statement at 132).

15 Id. at 38 (quoting from the ASA Statement at 132).

16 Id. at 38.

17  In re Testosterone Replacement Therapy Prods. Liab. Litig., MDL No. 2545, C.M.O. No. 46, 2017 WL 1833173 (N.D. Ill. May 8, 2017) [In re TRT]

18 In re TRT at *4.

19 In re TRT at *4.

20 Id.

21 Id. at *4.

22 ASA Statement at 131-32.

Daubert Retrospective – Statistical Significance

January 5th, 2019

The holiday break was an opportunity and an excuse to revisit the briefs filed in the Supreme Court by parties and amici, in the Daubert case. The 22 amicus briefs in particular provided a wonderful basis upon which to reflect how far we have come, and also how far we have to go, to achieve real evidence-based fact finding in technical and scientific litigation. Twenty-five years ago, Rules 702 and 703 vied for control over errant and improvident expert witness testimony. With Daubert decided, Rule 702 emerged as the winner. Sadly, most courts seem to ignore or forget about Rule 703, perhaps because of its awkward wording. Rule 702, however, received the judicial imprimatur to support the policing and gatekeeping of dysepistemic claims in the federal courts.

As noted last week,1 the petitioners (plaintiffs) in Daubert advanced several lines of fallacious and specious argument, some of which were lost in the shuffle and page limitations of the Supreme Court briefings. The plaintiffs’ transposition fallacy received barely a mention, although it did bring forth at least a footnote in an important and overlooked amicus brief filed by the American Medical Association (AMA), the American College of Physicians, and over a dozen other medical specialty organizations,2 all of which emphasized both the importance of statistical significance in interpreting epidemiologic studies and the fallacy of interpreting a 95% confidence interval as providing a measure of certainty about the estimated association as a parameter. The language of these associations’ amicus brief is noteworthy and still relevant to today’s controversies.

The AMA’s amicus brief, like the brief filed by the National Academies of Science and the American Association for the Advancement of Science, strongly endorsed a gatekeeping role for trial courts to exclude testimony not based upon rigorous scientific analysis:

“The touchstone of Rule 702 is scientific knowledge. Under this Rule, expert scientific testimony must adhere to the recognized standards of good scientific methodology including rigorous analysis, accurate and statistically significant measurement, and reproducibility.”3

Having incorporated the term “scientific knowledge,” Rule 702 could not permit anything less in expert witness testimony, lest it pollute federal courtrooms across the land.

Elsewhere, the AMA elaborated upon its reference to “statistically significant measurement”:

“Medical researchers acquire scientific knowledge through laboratory investigation, studies of animal models, human trials, and epidemiological studies. Such empirical investigations frequently demonstrate some correlation between the intervention studied and the hypothesized result. However, the demonstration of a correlation does not prove the hypothesized result and does not constitute scientific knowledge. In order to determine whether the observed correlation is indicative of a causal relationship, scientists necessarily rely on the concept of “statistical significance.” The requirement of statistical reliability, which tends to prove that the relationship is not merely the product of chance, is a fundamental and indispensable component of valid scientific methodology.”4

And then again, the AMA spelled out its position, in case the Court missed its other references to the importance of statistical significance:

“Medical studies, whether clinical trials or epidemiologic studies, frequently demonstrate some correlation between the action studied … . To determine whether the observed correlation is not due to chance, medical scientists rely on the concept of ‘statistical significance’. A ‘statistically significant’ correlation is generally considered to be one in which statistical analysis suggests that the observed relationship is not the result of chance. A statistically significant correlation does not ‘prove’ causation, but in the absence of such a correlation, scientific causation clearly is not proven.9”5

In its footnote 9, in the above quoted section of the brief, the AMA called out the plaintiffs’ transposition fallacy, without specifically citing to plaintiffs’ briefs:

“It is misleading to compare the 95% confidence level used in empirical research to the 51% level inherent in the preponderance of the evidence standard.”6

Actually, the plaintiffs’ ruse was much worse than misleading. The plaintiffs did not compare the two probabilities; they equated them. Some might call this ruse an outright fraud on the court. In any event, the AMA amicus brief remains an available, citable source for opposing this fraud and the casual dismissal of the importance of statistical significance.
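The transposition fallacy the AMA flagged can be made concrete with a small simulation. All of the numbers below, the 10% base rate of true effects and the effect size, are assumptions chosen purely for illustration: even when every test uses the conventional 5% significance level, the probability that a “significant” finding reflects a real effect depends on that base rate, and is nothing like 95%.

```python
import random

random.seed(7)
SIMS, N, Z_CRIT = 20_000, 50, 1.96

# Assumed base rate of real effects and effect size, purely for illustration
true_effect_rate, effect_size = 0.10, 0.5

sig_total = sig_but_null = 0
for _ in range(SIMS):
    real = random.random() < true_effect_rate
    # z-statistic of the mean of N observations drawn from Normal(mu, 1)
    mu = effect_size if real else 0.0
    z = random.gauss(mu * N ** 0.5, 1.0)
    if abs(z) > Z_CRIT:
        sig_total += 1
        if not real:
            sig_but_null += 1

ratio = sig_but_null / sig_total
print(f"share of 'significant' findings that are false alarms: {ratio:.2f}")
```

Under these assumptions roughly a third of the “significant” findings are false alarms, so p < 0.05 cannot be read as a 95% probability that the claimed effect is real.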

One other amicus brief touched on the plaintiffs’ statistical shenanigans. The Product Liability Advisory Council, National Association of Manufacturers, Business Roundtable, and Chemical Manufacturers Association jointly filed an amicus brief to challenge some of the excesses of the plaintiffs’ submissions.7 Plaintiffs’ expert witness, Shanna Swan, had calculated type II error rates and post-hoc power for some selected epidemiologic studies relied upon by the defense. Swan’s complaint had been that some studies had only a 20% probability (power) of detecting a statistically significant doubling of limb-reduction risk, with significance at p < 5%.8

The PLAC Brief pointed out that power calculations must assume an alternative hypothesis, and that the doubling-of-risk hypothesis had no basis in the evidentiary record. Although the PLAC complaint was correct, it missed the plaintiffs’ point that the defense had made exceeding a risk ratio of 2.0 an important benchmark for attributing specific causation. Swan’s calculation of post-hoc power would have yielded an even lower probability for detecting risk ratios of 1.2 or so. More to the point, PLAC noted that other studies had much greater power, and that collectively, all the available studies would have had much greater power for at least one study to achieve statistical significance without dodgy re-analyses.
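Swan’s actual computations are not reproduced in the briefs quoted here, but the kind of post-hoc power calculation at issue can be sketched with a normal approximation on the log risk-ratio scale. The standard errors below are hypothetical, picked so that a single small study has about 20% power to detect a doubled risk while pooled data does far better, which is PLAC’s point:

```python
import math
from statistics import NormalDist

def approx_power(rr_alt: float, se_log_rr: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test on log(RR) to detect rr_alt,
    ignoring the negligible rejection probability in the opposite tail."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    return nd.cdf(abs(math.log(rr_alt)) / se_log_rr - z_crit)

# Hypothetical standard errors, assumed only for illustration:
power_small = approx_power(2.0, se_log_rr=0.62)   # one small study
power_pooled = approx_power(2.0, se_log_rr=0.25)  # pooled evidence

print(f"single study: {power_small:.2f}, pooled: {power_pooled:.2f}")
```

The same alternative hypothesis (RR = 2.0) yields power near 20% for the noisy single study but roughly 80% for the pooled evidence, illustrating why post-hoc power complaints about individual studies say little about the body of evidence as a whole.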

1 The Advocates’ Errors in Daubert” (Dec. 28, 2018).

2 American Academy of Allergy and Immunology, American Academy of Dermatology, American Academy of Family Physicians, American Academy of Neurology, American Academy of Orthopaedic Surgeons, American Academy of Pain Medicine, American Association of Neurological Surgeons, American College of Obstetricians and Gynecologists, American College of Pain Medicine, American College of Physicians, American College of Radiology, American Society of Anesthesiologists, American Society of Plastic and Reconstructive Surgeons, American Urological Association, and College of American Pathologists.

3 Brief of the American Medical Association, et al., as Amici Curiae, in Support of Respondent, in Daubert v. Merrell Dow Pharmaceuticals, Inc., U.S. Supreme Court no. 92-102, 1993 WL 13006285, at *27 (U.S., Jan. 19, 1993)[AMA Brief].

4 AMA Brief at *4-*5 (emphasis added).

5 AMA Brief at *14-*15 (emphasis added).

6 AMA Brief at *15 & n.9.

7 Brief of the Product Liability Advisory Council, Inc., National Association of Manufacturers, Business Roundtable, and Chemical Manufacturers Association, as Amici Curiae, in Support of Respondent, in Daubert v. Merrell Dow Pharmaceuticals, Inc., U.S. Supreme Court no. 92-102, 1993 WL 13006288 (U.S., Jan. 19, 1993) [PLAC Brief].

8 PLAC Brief at *21.