Bad and Good Statistical Advice from the New England Journal of Medicine

Many people consider The New England Journal of Medicine (NEJM) a prestigious journal.  It is certainly widely read.  Judging from its “impact factor,” we know the journal is frequently cited.  So when the NEJM weighs in on an issue at the intersection of law and science, I pay attention.

Unfortunately, this week’s issue contains an editorial “Perspective” piece that is filled with incoherent, inconsistent, and incorrect assertions, both on the law and the science.  Mark A. Pfeffer and Marianne Bowler, “Access to Safety Data – Stockholders versus Prescribers,” 364 New Engl. J. Med. ___ (2011).

Dr. Mark Pfeffer and the Hon. Marianne Bowler used the recent United States Supreme Court decision in Matrixx Initiatives, Inc. v. Siracusano, __ U.S. __, 131 S. Ct. 1309 (2011), to advance views not supported by the law or the science.  Remarkably, Dr. Pfeffer is the Victor J. Dzau Professor of Medicine at the Harvard Medical School.  He is a physician, and he holds a Ph.D. in physiology and biophysics.  Ms. Bowler is both a lawyer and a federal judge.  Between the two of them, they should have provided better, more accurate, and more consistent advice.

1. The Authors Erroneously Characterize Statistical Significance in Inappropriate Bayesian Terms

The article begins with a relatively straightforward characterization of various legal burdens of proof.  The authors then try to collapse one of those burdens of proof, “beyond a reasonable doubt,” which has no accepted quantitative meaning, to a significance probability that is used to reject a pre-specified null hypothesis in scientific studies:

“To reject the null hypothesis (that a result occurred by chance) and deem an intervention effective in a clinical trial, the level of proof analogous to law’s ‘beyond a reasonable doubt’ standard would require an extremely stringent alpha level to permit researchers to claim a statistically significant effect, with the offsetting risk that a truly effective intervention would sometimes be deemed ineffective.  Instead, most randomized clinical trials are designed to achieve a lower level of evidence that in legal jargon might be called ‘clear and convincing’, making conclusions drawn from it highly probable or reasonably certain.”

Now this is both scientific and legal nonsense.  It is distressing that a federal judge characterizes the burden of proof that she must apply, or direct juries to apply, as “legal jargon.”  More important, these authors, scientist and judge, give questionable quantitative meanings to burdens of proof, and they misstate the meaning of statistical significance.  When judges or juries must determine guilt “beyond a reasonable doubt,” they are assessing the prosecution’s claim that the defendant is guilty, given the evidence at trial.  This posterior probability can be represented as:

Probability (Guilt | Evidence Adduced)

This is what is known as a posterior probability, and it is fundamentally different from a significance probability.
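A minimal numerical sketch (all figures invented for illustration) shows why a conditional probability cannot simply be transposed:

p_guilt = 0.01                       # hypothetical prior probability of guilt
p_evidence_given_guilt = 0.90        # P(Evidence | Guilt), assumed
p_evidence_given_innocent = 0.05     # P(Evidence | Innocence), assumed

# Total probability of the evidence (law of total probability)
p_evidence = (p_evidence_given_guilt * p_guilt
              + p_evidence_given_innocent * (1 - p_guilt))

# Bayes' theorem: P(Guilt | Evidence)
p_guilt_given_evidence = p_evidence_given_guilt * p_guilt / p_evidence

print(p_evidence_given_guilt)               # 0.90
print(round(p_guilt_given_evidence, 2))     # 0.15

Here the probability of the evidence given guilt is 0.90, yet the probability of guilt given the evidence is only about 0.15; the two conditionals are not interchangeable.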

The significance probability is the transposed conditional of the posterior probability used to assess guilt in a criminal trial, or contentions in a civil trial.  As law professor David Kaye and his statistician coauthor, the late David Freedman, described the p-value and significance probability:

“The p-value is the probability of getting data as extreme as, or more extreme than, the actual data, given that the null hypothesis is true:

p = Probability (extreme data | null hypothesis in model)

* * *

Conversely, large p-values indicate that the data are compatible with the null hypothesis: the observed difference is easy to explain by chance. In this context, small p-values argue for the plaintiffs, while large p-values argue for the defense.131

Since p is calculated by assuming that the null hypothesis is correct (no real difference in pass rates), the p-value cannot give the chance that this hypothesis is true. The p-value merely gives the chance of getting evidence against the null hypothesis as strong or stronger than the evidence at hand—assuming the null hypothesis to be correct. No matter how many samples are obtained, the null hypothesis is either always right or always wrong. Chance affects the data, not the hypothesis. With the frequency interpretation of chance, there is no meaningful way to assign a numerical probability to the null hypothesis.132”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” Federal Judicial Center, Reference Manual on Scientific Evidence 122 (2d ed. 2000).  Kaye and Freedman explained over a decade ago, for the benefit of federal judges:

“As noted above, it is easy to mistake the p-value for the probability that there is no difference. Likewise, if results are significant at the .05 level, it is tempting to conclude that the null hypothesis has only a 5% chance of being correct.142

This temptation should be resisted. From the frequentist perspective, statistical hypotheses are either true or false; probabilities govern the samples, not the models and hypotheses. The significance level tells us what is likely to happen when the null hypothesis is correct; it cannot tell us the probability that the hypothesis is true. Significance comes no closer to expressing the probability that the null hypothesis is true than does the underlying p-value.143”

Id. at 124-25.
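Kaye and Freedman’s definition can be made concrete with a small computation.  As a sketch, assume a null hypothesis of a fair coin and invented data of 60 heads in 100 tosses:

from math import comb

n, k = 100, 60                # invented data: 60 heads in 100 coin tosses
# One-sided p-value: P(X >= 60 | fair coin), i.e., the chance of data
# as extreme as, or more extreme than, what was observed, given the null.
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(round(p_value, 3))      # about 0.028

The result, roughly 0.028, is the chance of data at least this extreme if the null hypothesis is true; it is not the probability that the null hypothesis is true.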

As we can see, our scientist from the Harvard Medical School and our federal judge have committed the transpositional fallacy by likening “beyond a reasonable doubt” to the alpha used to test for a statistically significant outcome in a clinical trial.  They are not the same; nor are they analogous.

This fallacy has been described repeatedly.  Not only has the Reference Manual on Scientific Evidence (which is written specifically for federal judges) described the fallacy in detail, but legal and scientific writers have urged care to avoid this basic mistake in probabilistic reasoning.  Here is a recent admonition from one of the leading writers on the use (and misuse) of statistics in legal proceedings:

“Some commentators, however, would go much further; they argue that [5%] is an arbitrary statistical convention and since preponderance of the evidence means 51% probability, lawyers should not use 5% as the level of statistical significance but 49% – thus rejecting the null hypothesis when there is up to a 49% chance that it is true. In their view, to use a 5% standard of significance would impermissibly raise the preponderance of evidence standard in civil trials. Of course the 5% figure is arbitrary (although widely accepted in statistics) but the argument is fallacious. It assumes that 5% (or 49% for that matter) is the probability that the null hypothesis is true. The 5% level of significance is not that, but the probability of the sample evidence if the null hypothesis were true. This is a very different matter. As I pointed out in Chapter 1, the probability of the sample given the null hypothesis is not generally the same as the probability of the null hypothesis given the sample. To relate the level of significance to the probability of the null hypothesis would require an application of Bayes’s theorem and the assumption of a prior probability distribution. However, the courts have usually accepted the statistical standard, although with some justifiable reservations when the P-value is only slightly above the 5% cutoff.”

Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 54 (N.Y. 2009) (emphasis added).
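Finkelstein’s last point is worth making concrete.  Relating a significance level to the probability of the null hypothesis requires Bayes’s theorem and a prior probability; as a sketch with assumed inputs (a 90% prior that the null is true, and 80% power):

alpha = 0.05        # significance level: P(significant result | null true)
power = 0.80        # P(significant result | null false), assumed
prior_null = 0.90   # assumed prior probability that the null is true

# Bayes' theorem: probability the null is true despite a "significant" result
p_significant = alpha * prior_null + power * (1 - prior_null)
p_null_given_significant = alpha * prior_null / p_significant
print(round(p_null_given_significant, 2))   # 0.36

Under these assumed inputs, a “significant” result still leaves roughly a 36% chance that the null hypothesis is true, far from the 5% that the transposition fallacy would suggest.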

2. The Authors, Having Mischaracterized Burden-of-Proof and Significance Probabilities, Incorrectly Assess the Meaning of the Supreme Court’s Decision in Matrixx Initiatives

I have written a good bit about the Court’s decision in Matrixx Initiatives, most recently with David Venderbush, for the Washington Legal Foundation.  See Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” W.L.F. Legal Backgrounder (June 17, 2011).

I was thus startled to see the claim of a federal judge that the Supreme Court, in Matrixx, had “applied the ‘fair preponderance of the evidence’ standard of proof used for civil matters.”  Matrixx was a case about the sufficiency of the pleadings, and thus there really could have been no such application of a burden of proof to an evidentiary display.  The very claim is incoherent, and at odds with the Supreme Court’s holding.

The NEJM authors went on to detail how the defendant in Matrixx had persuaded the trial court that the evidence against its product, Zicam, did not reach statistical significance, and that the evidence therefore should not be considered “material.”  As I have pointed out before, Matrixx focused on adverse event reports, as raw numbers of reported events, which were not, and could not be, analyzed for statistical significance.  The very essence of Matrixx’s argument was nonsense, which perhaps explains the company’s nine-to-nothing loss in the Supreme Court.  The authors of the opinion piece in the NEJM, however, missed the point that it is not the evidence of adverse event reports, with or without a statistical analysis, that is material.  What was at issue was whether the company’s failure to disclose this information, along with a good deal more, was misleading in the face of the company’s very aggressive, optimistic sales and profit projections for the future.

The NEJM authors proceed to tell us, correctly, that adverse events do not prove causality, but then they tell us, incorrectly, that the Matrixx case shows that “such a high level of proof did not have to be achieved.”  While the authors are correct about the insufficiency of adverse event reports for causal assessments, they miss the legal significance of there being no burden of proof at play in Matrixx; it was a case on the pleadings.  The issue was the sufficiency of those pleadings, and what the Supreme Court made clear was that, in the context of a product subject to FDA regulation, causation was never the test for materiality, because the FDA could withdraw the product on a showing far short of scientific causation of harm.  So the plaintiffs could allege less than causation and still have pleaded a sufficient case of securities fraud.  The Supreme Court did not, and could not, address the issue that the NEJM authors discuss.  The authors’ assessment that the Matrixx case freed legal causation of any requirement of statistical significance is a tortured reading of obiter dictum, not the holding of the case.  This editorializing is troubling.

The NEJM authors similarly hold forth on what clinicians consider material, and they announce that “[c]linicians are well aware that to be considered material, information regarding drug safety does not have to reach the same level of certainty that we demand for demonstrating efficacy.”  This is true, but clinicians are ethically bound to err on the side of safety: Primum non nocere.  See, e.g., Tamraz v. Lincoln Elec. Co., 620 F.3d 665, 673 (6th Cir. 2010) (noting that treating physicians have more training in diagnosis than in etiologic assessments), cert. denied, ___ U.S. ___ (2011).  Again, the authors’ statements have nothing to do with the Matrixx case, or with the standards for legal or scientific causation.

3. The Authors, Inconsistently with Their Characterization of Various Probabilities, Proceed Correctly To Describe Statistical Significance Testing for Adverse Outcomes in Trials

Having incorrectly described “beyond a reasonable doubt” as akin to p < 0.05, the NEJM authors then correctly point out that standard statistical testing cannot be used for “evaluating unplanned and uncommon adverse events.”  The authors also note that the flood of data in the assessment of causation of adverse events is filled with “biologic noise.”
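The multiplicity problem behind that “noise” is easy to illustrate.  A minimal sketch, assuming twenty independent post hoc comparisons (the number is invented for illustration):

alpha = 0.05   # nominal significance level for each comparison
k = 20         # invented number of adverse-event categories examined post hoc

# Chance of at least one nominally "significant" finding by chance alone,
# assuming the null hypothesis is true for all k independent comparisons
p_any_false_positive = 1 - (1 - alpha) ** k
print(round(p_any_false_positive, 2))   # 0.64

With twenty independent looks at null data, the chance of at least one nominally “significant” signal is close to two in three.  Physicians and regulators may take the noise signals and claim that they hear a concert.  This is exactly why we should not confuse precautionary judgments with scientific assessments of causation.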