Scientific illiteracy among the judiciary

Ken Feinberg, speaking at a symposium on mass torts, asks what legal challenges mass torts confront in the federal courts.  The answer seems obvious.

Pharmaceutical cases that warrant federal court multi-district litigation (MDL) treatment typically involve complex scientific and statistical issues.  The public deserves to have MDL cases assigned to judges who have special experience and competence to preside over cases in which these complex issues predominate.  There appears to be no procedural device to ensure that the judges selected in the MDL process have the necessary experience and competence, and there is a good deal of evidence to suggest that MDL judges are not up to the task at hand.

In the aftermath of the Supreme Court’s decision in Daubert, the Federal Judicial Center assumed responsibility for producing science and statistics tutorials to help judges grapple with technical issues in their cases.  The Center has produced videotaped lectures as well as the Reference Manual on Scientific Evidence, now in its third edition.  Despite the Center’s best efforts, many federal judges have shown themselves to be incorrigible.  It is time to revive the discussions and debates about implementing a “science court.”

The following three federal MDLs all involved pharmaceutical products, well-respected federal judges, and a fundamental error in statistical inference.

Avandia

Avandia is a prescription oral anti-diabetic medication licensed by GlaxoSmithKline (GSK).  Concerns over Avandia’s association with excess heart attack risk resulted in regulatory revisions of its availability, as well as thousands of lawsuits.  In a decision that affected virtually all of those several thousand claims, aggregated for pretrial handling in a federal MDL, a federal judge, in ruling on a Rule 702 motion, described a clinical trial with a risk ratio greater than 1.0, with a p-value of 0.08, as follows:

“The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance).FN72”

In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576, *12 (E.D. Pa. 2011)(Rufe, J.).  This is a remarkable error by a trial judge given the responsibility for pre-trial handling of so many cases.  There are many things you can argue about a p-value of 0.08, but Judge Rufe’s interpretation is not an argument; it is error.  That such an error, explicitly warned against in the Reference Manual on Scientific Evidence, could be made by an MDL judge, over 15 years since the first publication of the Manual, highlights the seriousness and the extent of the illiteracy problem.
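Why the 92% reading fails can be seen in a small simulation.  The sketch below is an illustration under stated assumptions, not a reanalysis of DREAM: it draws many hypothetical studies, some with no true effect and some with a true effect of an assumed size (`effect_z`, in standard-error units), and asks what share of the studies reporting a two-sided p-value near 0.08 actually had no effect.  The function name, the prior shares, and the effect size are all invented for the example.

```python
import random
from statistics import NormalDist


def share_of_true_nulls(prior_null, effect_z, n_studies=100_000,
                        window=(0.07, 0.09), seed=0):
    """Among simulated studies whose two-sided p-value falls in `window`,
    return the fraction in which the null hypothesis was in fact true.

    prior_null: assumed share of studies with no real effect (hypothetical)
    effect_z:   assumed true effect, in standard-error units, when the
                null is false (hypothetical)
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    null_hits = 0
    total_hits = 0
    for _ in range(n_studies):
        null_is_true = rng.random() < prior_null
        mean_z = 0.0 if null_is_true else effect_z
        z = rng.gauss(mean_z, 1.0)  # observed test statistic
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if window[0] <= p_value <= window[1]:
            total_hits += 1
            if null_is_true:
                null_hits += 1
    if total_hits == 0:
        return float("nan")
    return null_hits / total_hits


# Two assumed scenarios: half of all studied effects are real, or one in ten.
for prior in (0.5, 0.9):
    share = share_of_true_nulls(prior_null=prior, effect_z=1.0)
    print(f"prior share of nulls {prior:.0%}: "
          f"share of nulls among p near 0.08 = {share:.0%}")
```

In this toy setup the share of no-effect studies among those reporting a p-value near 0.08 moves with the assumed prior and effect size.  Nothing in the p-value itself fixes that share at 8%, or its complement at 92%.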

What possible basis could the Avandia MDL court have to support this clearly erroneous interpretation of crucial studies in the litigation?  Footnote 72 in Judge Rufe’s opinion references a report by plaintiffs’ expert witness, Allan D. Sniderman, M.D., “a cardiologist, medical researcher, and professor at McGill University.” Id. at *10.  The trial court goes on to note that:

“GSK does not challenge Dr. Sniderman’s qualifications as a cardiologist, but does challenge his ability to analyze and draw conclusions from epidemiological research, since he is not an epidemiologist. GSK’s briefs do not elaborate on this challenge, and in any event the Court finds it unconvincing given Dr. Sniderman’s credentials as a researcher and published author, as well as clinician, and his ability to analyze the epidemiological research, as demonstrated in his report.”

Id.

What more evidence could the Avandia MDL trial court possibly have needed to show that Sniderman was incompetent to give statistical and epidemiologic testimony?  Fundamentally at odds with the Manual on an uncontroversial point, Sniderman had given the court a baseless, incorrect interpretation of a p-value.  Everything else he might have to say on the subject was likely suspect.  If, as the court suggested, GSK did not elaborate upon its challenge with specific examples, then shame on GSK.  The trial court, however, could have readily determined that Sniderman was speaking nonsense by reading the chapter on statistics in the Reference Manual on Scientific Evidence.  For all my complaints about gaps in coverage in the Manual, the text on this issue is clear and concise.  It really is not too much to expect an MDL trial judge to be conversant with the basic concepts of scientific and statistical evidence set out in the Manual, which is prepared to help federal judges.

Phenylpropanolamine (PPA) Litigation

Litigation over phenylpropanolamine was aggregated, within the federal system, before Judge Barbara Rothstein.  Judge Rothstein is not only a respected federal trial judge; she was also the director of the Federal Judicial Center, which produces the Reference Manual on Scientific Evidence.  Her involvement in overseeing the preparation of the third edition of the Manual, however, did not keep Judge Rothstein from badly misunderstanding and misstating the meaning of a p-value in the PPA litigation.  See In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F.Supp. 2d 1230, 1236 n.1 (W.D. Wash. 2003)(“P-values measure the probability that the reported association was due to chance… .”).  Tellingly, Judge Rothstein denied, in large part, the defendants’ Rule 702 challenges.  Juries, however, overwhelmingly rejected the claims that PPA caused the plaintiffs’ strokes.

Ephedra Litigation

Judge Rakoff, of the Southern District of New York, notoriously committed the transposition fallacy in the Ephedra litigation:

“Generally accepted scientific convention treats a result as statistically significant if the P-value is not greater than .05. The expression ‘P=.05’ means that there is one chance in twenty that a result showing increased risk was caused by a sampling error—i.e., that the randomly selected sample accidentally turned out to be so unrepresentative that it falsely indicates an elevated risk.”

In re Ephedra Prods. Liab. Litig., 393 F.Supp. 2d 181, 191 (S.D.N.Y. 2005).

Judge Rakoff then fallaciously argued that the use of a critical value of less than 5% of significance probability increased the “more likely than not” burden of proof upon a civil litigant.  Id. at 188, 193.  See Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 65 (2009).

Judge Rakoff may well have had help in confusing the probability used to characterize the plaintiff’s burden of proof with the probability of attained significance.  At least one of the defense expert witnesses in the Ephedra cases gave an erroneous definition of “statistically significant association,” which may have invited the judicial error:

“A statistically significant association is an association between exposure and disease that meets rigorous mathematical criteria demonstrating that the finding is unlikely to be the result of chance.”

Report of John Concato, MD, MS, MPH, at 7, ¶29 (Sept. 13, 2004).  Dr. Concato’s error was picked up and repeated in the defense briefing of its motion to preclude:

“The likelihood that an observed association could occur by chance alone is evaluated using tests for statistical significance.”

Memorandum of Law in Support of Motion by Ephedra Defendants to Exclude Expert Opinions of Charles Buncher, [et alia] …That Ephedra Causes Hemorrhagic Stroke, Ischemic Stroke, Seizure, Myocardial Infarction, Sudden Cardiac Death, and Heat-Related Illnesses at 9 (Dec. 3, 2004).

Judge Rakoff’s insistence that requiring “statistical significance” at the customary 5% level would change the plaintiffs’ burden of proof, and require greater certitude for epidemiologists than for other expert witnesses who opine in less “rigorous” fields of learning, is wrong as a matter of fact.  His Honor’s comparison, moreover, ignores the Supreme Court’s observation that the point of Rule 702 is:

“to make certain that an expert, whether basing testimony upon professional studies or personal experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.”

Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999).

Judge Rakoff not only ignored the conditional nature of significance probability, but he overinterpreted the role of significance testing in arriving at a conclusion of causality.  Statistical significance may answer the question of the strength of the evidence for ruling out chance in producing the data observed, based upon an assumption of no risk, but it doesn’t alone answer the question whether the study result shows an increased risk.  Bias and confounding must be considered, along with other Bradford Hill factors.

Even if the p-value could be turned into a posterior probability of the null hypothesis, there would be many other probabilities that would necessarily diminish that probability.  Some of the other factors (which could be expressed as objective or subjective probabilities) include:

  • accuracy of the data reporting
  • data collection
  • data categorization
  • data cleaning
  • data handling
  • data analysis
  • internal validity of the study
  • external validity of the study
  • credibility of study participants
  • credibility of study researchers
  • credibility of the study authors
  • accuracy of the study authors’ expression of their research
  • accuracy of the editing process
  • accuracy of the testifying expert witness’s interpretation
  • credibility of the testifying expert witness
  • other available studies, and their respective data and analysis factors
  • all the other Bradford Hill factors

If these largely independent factors each had a probability or accuracy of 95%, the conjunction of their probabilities would likely be below the needed feather weight on top of 50%.  In sum, Judge Rakoff’s confusing significance probability and the posterior probability of the null hypothesis does not subvert the usual standards of proof in civil cases.  See also Sander Greenland, “Null Misinterpretation in Statistical Testing and Its Impact on Health Risk Assessment,” 53 Preventive Medicine 225 (2011).
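The arithmetic behind the conjunction can be checked directly.  The sketch below takes the text’s own assumptions at face value: seventeen listed factors, treated as fully independent, each with a probability of 95%.  The variable and function names are illustrative only, and treating each bullet as a single factor is a simplification, since the last two bullets each bundle several considerations.

```python
# The seventeen factors listed above, each counted once (a simplification).
FACTORS = [
    "accuracy of the data reporting",
    "data collection",
    "data categorization",
    "data cleaning",
    "data handling",
    "data analysis",
    "internal validity of the study",
    "external validity of the study",
    "credibility of study participants",
    "credibility of study researchers",
    "credibility of the study authors",
    "accuracy of the study authors' expression of their research",
    "accuracy of the editing process",
    "accuracy of the testifying expert witness's interpretation",
    "credibility of the testifying expert witness",
    "other available studies, and their data and analysis factors",
    "all the other Bradford Hill factors",
]

PER_FACTOR = 0.95  # probability assumed in the text for every factor


def joint_probability(per_factor, count):
    """Probability that `count` independent factors all hold."""
    return per_factor ** count


def factors_needed_to_drop_below(threshold, per_factor):
    """Smallest number of independent factors whose conjunction falls
    below `threshold`."""
    count = 0
    joint = 1.0
    while joint >= threshold:
        count += 1
        joint *= per_factor
    return count


joint = joint_probability(PER_FACTOR, len(FACTORS))
print(f"{len(FACTORS)} factors at {PER_FACTOR:.0%} each: joint = {joint:.3f}")
print("factors needed to fall below 50%:",
      factors_needed_to_drop_below(0.5, PER_FACTOR))
```

Under these assumptions the conjunction is about 0.418, and fourteen such factors already suffice to bring it below one half.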

Whence Comes This Error

As a matter of intellectual history, I wonder where this error entered into the judicial system.  As a general matter, there was not much judicial discussion of statistical evidence before the 1970s.  The earliest manifestation of the transpositional fallacy in connection with scientific and statistical evidence appears in an opinion of the United States Court of Appeals, for the District of Columbia Circuit.  Ethyl Corp. v. EPA, 541 F.2d 1, 28 n.58 (D.C. Cir.), cert. denied, 426 U.S. 941 (1976).  The Circuit’s language is worth looking at carefully:

“Petitioners demand sole reliance on scientific facts, on evidence that reputable scientific techniques certify as certain.

Typically, a scientist will not so certify evidence unless the probability of error, by standard statistical measurement, is less than 5%. That is, scientific fact is at least 95% certain.  Such certainty has never characterized the judicial or the administrative process. It may be that the ‘beyond a reasonable doubt’ standard of criminal law demands 95% certainty.  Cf. McGill v. United States, 121 U.S.App.D.C. 179, 185 n.6, 348 F.2d 791, 797 n.6 (1965). But the standard of ordinary civil litigation, a preponderance of the evidence, demands only 51% certainty. A jury may weigh conflicting evidence and certify as adjudicative (although not scientific) fact that which it believes is more likely than not. ***”

Id.  The 95% certainty appears to derive from 95% confidence intervals, although “confidence” is a technical term in statistics, and it most certainly does not mean the probability of the alternative hypothesis under consideration.  Similarly, the error that is less than 5% is not the probability of error of the belief in the hypothesis of no difference between observations and expectations, but rather the probability of observing the data, or data even more extreme, on the assumption that the observed would equal the expected.  The District of Columbia Circuit thus created a strawman:  scientific certainty is 95%, whereas civil and administrative law certainty is 51%.  This is rubbish, which confuses the frequentist probability from hypothesis testing with the subjective probability for belief in a fact.
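The definition of significance probability given above can be made concrete with a toy calculation.  The sketch below assumes a hypothetical experiment, not drawn from any case discussed here: 100 tosses of a coin presumed fair, 60 of which come up heads.  It computes the two-sided p-value exactly, by summing the null-hypothesis probabilities of every outcome at least as far from the expected count as the one observed.  The function names and the data are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(observed, n, p_null=0.5):
    """Probability, computed on the assumption that the null hypothesis is
    true, of a count at least as far from the expected count as `observed`."""
    expected = n * p_null
    distance = abs(observed - expected)
    return sum(
        binom_pmf(k, n, p_null)
        for k in range(n + 1)
        if abs(k - expected) >= distance - 1e-12  # tolerance for float noise
    )


# Hypothetical data: 60 heads in 100 tosses of a coin presumed fair.
p = two_sided_p_value(observed=60, n=100)
print(f"two-sided p-value = {p:.4f}")
```

The result, roughly 0.057, is a statement about the data given the hypothesis of a fair coin.  It says nothing by itself about the probability that the coin is fair given the data, which would also require a prior probability and a specified alternative.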

The transpositional fallacy has a good pedigree, but that does not make it correct.  Only a lawyer would suggest that a mistake once made was somehow binding upon future litigants.  The following collection of citations and references illustrates how widespread the fundamental misunderstanding of statistical inference is, in the courts, in the academy, and at the bar.  If courts cannot deliver fair, accurate adjudication of scientific facts, then it is time to reform the system.


Courts

U.S. Supreme Court

Vasquez v. Hillery, 474 U.S. 254, 259 n.3 (1986) (“the District Court . . . accepted . . . a probability of 2 in 1,000 that the phenomenon was attributable to chance”)

U.S. Court of Appeals

First Circuit

Fudge v. Providence Fire Dep’t, 766 F.2d 650, 658 (1st Cir. 1985) (“Widely accepted statistical techniques have been developed to determine the likelihood an observed disparity resulted from mere chance.”)

Second Circuit

Nat’l Abortion Fed. v. Ashcroft, 330 F. Supp. 2d 436 (S.D.N.Y. 2004), aff’d in part, 437 F.3d 278 (2d Cir. 2006), vacated, 224 Fed. App’x 88 (2d Cir. 2007) (reporting an expert witness’s interpretation of a p-value of 0.30 to mean that there was a 30% probability that the study results were due to chance alone)

Smith v. Xerox Corp., 196 F.3d 358, 366 (2d Cir. 1999) (“If an obtained result varies from the expected result by two standard deviations, there is only about a .05 probability that the variance is due to chance.”)

Waisome v. Port Auth., 948 F.2d 1370, 1376 (2d Cir. 1991) (“about one chance in 20 that the explanation for a deviation could be random”)

Ottaviani v. State Univ. of New York at New Paltz, 875 F.2d 365, 372 n.7 (2d Cir. 1989)

Murphy v. General Elec. Co., 245 F. Supp. 2d 459, 467 (N.D.N.Y. 2003) (“less than a 5% probability that age was related to termination by chance”)

Third Circuit

United States v. State of Delaware, 2004 WL 609331, *10 n.27 (D. Del. 2004) (“there is a 5% (or 1 in 20) chance that the relationship observed is purely random”)

Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 605 n.26 (D.N.J. 2002) (“only 5% probability that an observed association is due to chance”)

Fifth Circuit

EEOC v. Olson’s Dairy Queens, Inc., 989 F.2d 165, 167 (5th Cir. 1993) (“Dr. Straszheim concluded that the likelihood that [the] observed hiring patterns resulted from truly race-neutral hiring practices was less than one chance in ten thousand.”)

Capaci v. Katz & Besthoff, Inc., 711 F.2d 647, 652 (5th Cir. 1983) (“the highest probability of unbiased hiring was 5.367 × 10⁻²⁰”), cert. denied, 466 U.S. 927 (1984)

Rivera v. City of Wichita Falls, 665 F.2d 531, 545 n.22 (5th Cir. 1982) (“A variation of two standard deviations would indicate that the probability of the observed outcome occurring purely by chance would be approximately five out of 100; that is, it could be said with a 95% certainty that the outcome was not merely a fluke. Sullivan, Zimmer & Richards, supra n.9 at 74.”)

Vuyanich v. Republic Nat’l Bank, 505 F. Supp. 224, 272 (N.D. Tex. 1980) (“the chances are less than one in 20 that the true coefficient is actually zero”), judgment vacated, 723 F.2d 1195 (5th Cir. 1984).

Seventh Circuit

Adams v. Ameritech Services, Inc., 231 F.3d 414, 424, 427 (7th Cir. 2000) (“it is extremely unlikely (that is, there is less than a 5% probability) that the disparity is due to chance.”)

Sheehan v. Daily Racing Form, Inc., 104 F.3d 940, 941 (7th Cir. 1997) (“An affidavit by a statistician . . . states that the probability that the retentions . . . are uncorrelated with age is less than 5 percent.”)

Eighth Circuit

Craik v. Minnesota State Univ. Bd., 731 F.2d 465, 476 n.13 (8th Cir. 1984) (“Statistical significance is a measure of the probability that an observed disparity is not due to chance. Baldus & Cole, Statistical Proof of Discrimination § 9.02, at 290 (1980). A finding that a disparity is statistically significant at the 0.05 or 0.01 level means that there is a 5 per cent. or 1 per cent. probability, respectively, that the disparity is due to chance.”)

Ninth Circuit

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1241 n.9 (E.D. Wash. 2002) (describing “statistical tools to calculate the probability that the difference seen is caused by random variation”)

D.C. Circuit

National Lime Ass’n v. EPA, 627 F.2d 416, 453 (D.C. Cir. 1980)

Federal Circuit

Hodges v. Secretary Dep’t Health & Human Services, 9 F.3d 958, 967 (Fed. Cir. 1993) (Newman, J., dissenting) (“Scientists as well as judges must understand: ‘the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences’.”) (citing and quoting from the Report of the Carnegie Commission on Science, Technology, and Government, Science and Technology in Judicial Decision Making 28 (1993)).


Regulatory Guidance

OSHA’s Guidance for Compliance with Hazard Communication Act:

“Statistical significance is a mathematical determination of the confidence in the outcome of a test. The usual criterion for establishing statistical significance is the p-value (probability value). A statistically significant difference in results is generally indicated by p < 0.05, meaning there is less than a 5% probability that the toxic effects observed were due to chance and were not caused by the chemical. Another way of looking at it is that there is a 95% probability that the effect is real, i.e., the effect seen was the result of the chemical exposure.”

U.S. Dep’t of Labor, Guidance for Hazard Determination for Compliance with the OSHA Hazard Communication Standard (29 CFR § 1910.1200) Section V (July 6, 2007).


Academic Commentators

Lucinda M. Finley, “Guarding the Gate to the Courthouse:  How Trial Judges Are Using Their Evidentiary Screening Role to Remake Tort Causation Rules,” 49 DePaul L. Rev. 335, 348 n. 49 (1999):

“Courts also require that the risk ratio in a study be ‘statistically significant,’ which is a statistical measurement of the likelihood that any detected association has occurred by chance, or is due to the exposure. Tests of statistical significance are intended to guard against what are called ‘Type I’ errors, or falsely ascribing a relationship when there in fact is not one (a false positive).  See SANDERS, supra note 5, at 51. The discipline of epidemiology is inherently conservative in making causal ascriptions, and regards Type I errors as more serious than Type II errors, or falsely assuming no association when in fact there is one (false negative). Thus, epidemiology conventionally requires a 95% level of statistical significance, i.e. that in statistical terms it is 95% likely that the association is due to exposure, rather than to chance. See id. at 50-52; Thompson, supra note 3, at 256-58. Despite courts’ use of statistical significance as an evidentiary screening device, this measurement has nothing to do with causation. It is most reflective of a study’s sample size, the relative rarity of the disease being studied, and the variance in study populations. Thompson, supra note 3, at 256.”

 

Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 42 n. 30 (2007):

 “‘By rejecting a hypothesis only when the test is statistically significant, we have placed an upper bound, .05, on the chance of rejecting a true hypothesis’. Fienberg et al., p. 22. Another way of explaining this is that it describes the probability that the procedure produced the observed effect by chance.”

Professor Fienberg stated the matter correctly, but Beecher-Monas goes on to restate the matter in her own words, erroneously.  Later, she repeats her incorrect interpretation:

“Statistical significance is a statement about the frequency with which a particular finding is likely to arise by chance.19”

Id. at 61 (citing a paper by Sander Greenland, who correctly stated the definition).

Mark G. Haug, “Minimizing Uncertainty in Scientific Evidence,” in Cynthia H. Cwik & Helen E. Witt, eds., Scientific Evidence Review:  Current Issues at the Crossroads of Science, Technology, and the Law – Monograph No. 7, at 87 (2006)

Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (Oxford 1993) (One can think of α, β (the chances of type I and type II errors, respectively) and 1-β as measures of the “risk of error” or “standards of proof.”)  See also id. at 44, 47, 55, 72-76.

Arnold Barnett, “An Underestimated Threat to Multiple Regression Analyses Used in Job Discrimination Cases,” 5 Indus. Rel. L.J. 156, 168 (1982) (“The most common rule is that evidence is compelling if and only if the probability the pattern obtained would have arisen by chance alone does not exceed five percent.”)

David W. Barnes, Statistics as Proof: Fundamentals of Quantitative Evidence 162 (1983)(“Briefly, however, the findings of statistical significance at the P < .05, P < .04, and P < .02 levels indicate that the court can be 95%, 96%, and 98% certain, respectively, that the null hypotheses involved in the specific tests carried out … should be rejected.”)

Wayne Roth-Nelson & Kathey Verdeal, “Risk Evidence in Toxic Torts,” 2 Envt’l Lawyer 405, 415-16 (1996) (confusing burden of proof with standard for hypothesis testing; and apparently endorsing the erroneous views given by Judge Newman, dissenting in Hodges). Caveat: Roth-Nelson is now a “forensic” toxicologist, who testifies in civil and criminal trials.

Steven R. Weller, “Book Review: Regulating Toxic Substances: A Philosophy of Science and Law,” 6 Harv. J. L. & Tech. 435, 436, 437-38 (1993) (“only when the statistical evidence gathered from studies shows that it is more than ninety-five percent likely that a test substance causes cancer will the substance be characterized scientifically as carcinogenic … to determine legal causality, the plaintiff need only establish that the probability with which it is true that the substance in question causes cancer is at least fifty percent, rather than the ninety-five percent to prove scientific causality”).

The Carnegie Commission on Science, Technology, and Government, Report on Science and Technology in Judicial Decision Making 28 (1993) (“The reality is that courts often decide cases not on the scientific merits, but on concepts such as burden of proof that operate differently in the legal and scientific realms. Scientists may misperceive these decisions as based on a misunderstanding of the science, when in actuality the decision may simply result from applying a different norm, one that, for the judiciary, is appropriate.  Much, for instance, has been written about ‘junk science’ in the courtroom. But judicial decisions that appear to be based on ‘bad’ science may actually reflect the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences.”).


Plaintiffs’ Counsel

Steven Rotman, “Don’t Know Much About Epidemiology?” Trial (Sept. 2007) (Author’s question answered in the affirmative:  “P values.  These measure the probability that a reported association between a drug and condition was due to chance.  A P-value of 0.05, which is generally considered the standard for statistical significance, means there is a 5 percent probability that the association was due to chance.”)

Defense Counsel

Bruce R. Parker & Anthony F. Vittoria, “Debunking Junk Science: Techniques for Effective Use of Biostatistics,” 65 Defense Csl. J. 35, 44 (2002) (“a P value of .01 means the researcher can be 99 percent sure that the result was not due to chance”).