Multiplicity versus Duplicity – The Harkonen Conviction

United States of America v. W. Scott Harkonen, MD — Part II

The Alleged Fraud – “False as a matter of statistics”

The essence of the government’s case was that drawing an inference of causation from a statistically nonsignificant, post-hoc analysis was “false as a matter of statistics.” ER2498.  Dr. Harkonen’s trial counsel did not present testimony from any statistician at trial.  In their closing argument, his counsel explained that they had obtained sufficient concessions at trial to make their point.

In post-trial motions, new counsel for Dr. Harkonen submitted affidavits from Dr. Steven Goodman and Dr. Donald Rubin, two very capable and highly accomplished statisticians, who explained the diversity of views in their field about the role of p-values in interpreting study data and drawing causal inferences.  At trial, however, the government’s witnesses, Drs. Crager and Fleming, testified that p-values less than 0.05 were “magic numbers.”  United States v. Harkonen, 2010 WL 2985257, at *5 (N.D. Calif. 2010) (Judge Patel’s opinion denying defendant’s post-trial motions to dismiss the indictment, for acquittal, or for a new trial).  Sometimes judges are looking for bright lines in the wrong places.

The Multiplicity Problem

The government argued that the proper interpretation of a given p-value requires information about the nature and context of the statistical test that gave rise to the p-value.  If many independent tests are run on the same set of data, some low p-values are expected to occur by chance alone.  Multiple testing can thus inflate the rate of false-positive findings, or Type I errors.  The generation of these potentially false positive results is sometimes called the “multiplicity problem”; in the face of multiple testing, a stated p-value can greatly understate the probability of a false-positive finding.
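The arithmetic behind the multiplicity problem is straightforward.  A minimal sketch (the test counts below are illustrative, not drawn from the InterMune trial): if each of k independent tests of a true null hypothesis is run at the nominal 0.05 level, the chance that at least one test produces p < 0.05 by chance alone is 1 − (1 − 0.05)^k, which grows quickly with k.

```python
# Family-wise false-positive probability under multiple independent tests,
# each run at a nominal per-test significance level of 0.05.
# The chosen test counts (1, 5, 10, 20) are hypothetical illustrations.

alpha = 0.05  # nominal per-test significance level

for k in (1, 5, 10, 20):
    # Probability that at least one of k independent true-null tests
    # yields p < alpha purely by chance: 1 - (1 - alpha)^k
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> P(at least one p < {alpha}) = {fwer:.2f}")
# -> 1 test: 0.05; 5 tests: 0.23; 10 tests: 0.40; 20 tests: 0.64
```

With twenty looks at the data, a nominally “significant” result is more likely than not to appear even when nothing is going on, which is why the prespecification of end points matters.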

In the context of a randomized clinical trial, it is thus important to know what the prespecified primary and secondary end points were.  David Moher, Kenneth F. Schulz, and Douglas G. Altman, “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” 357 Lancet 1191 (2001). Post hoc data dredging can lead to the “Texas Sharpshooter Fallacy,” which results when an investigator draws a target around a hit, after the fact, and declares a bulls-eye.

Dr. Fleming thus had a limited point; namely, that the verb “demonstrate,” rather than “show” or “suggest,” was too strong if based solely upon InterMune’s clinical trial, given that the low p-value came in the context of a non-prespecified subgroup analysis. (The supposedly offensive press release issued by Dr. Harkonen did indicate that the data confirmed the results of a previously reported phase II trial.) If the government had engaged in counter-speech to say that Dr. Harkonen’s use of “demonstrate” fell below an idealized “best statistical practice,” many statisticians might well agree with the government.

Even this limited point would evaporate if Dr. Harkonen had stated that the phase III subgroup analysis, along with the earlier published clinical trial and clinical experience, “demonstrated” a survival benefit.  Had Dr. Harkonen issued this more scientifically felicitous statement, the government could not have made a claim of falsity in his use of the verb “to demonstrate” with a single p-value from a post hoc subgroup analysis.  Such a statement would have taken Dr. Harkonen’s analytic inference out of the purely statistical realm.  Indeed, Dr. Harkonen’s press release did reference an earlier phase II trial, and notified readers that more detailed analyses would be presented at upcoming medical conferences.  Although Dr. Harkonen did use “demonstrate” to characterize the results of the phase III trial standing alone, the press release as a whole made clear that the data were preliminary.  It is difficult to imagine any reasonable physician prescribing Actimmune on the basis of the press release.

The prosecution and conviction of Dr. Harkonen thus raises the issue of whether the alleged improper characterization of a study’s statistical result can be criminalized by the State.  Clearly, the federal prosecutors were motivated by their perception that the alleged fraud was connected to an attempt to promote an off-label use of Actimmune.  The demand for such linguistic precision, however, is widely flouted in the worlds of law and science.  Lawyers use the word “proofs,” which often admit of inferences for either side, to describe real, demonstrative, and testimonial evidence.  A mathematician might be moved to prosecute all lawyers for fraudulent speech.  From the mathematicians’ perspective, the lawyers have made a claim of certainty in using “proof” that is totally out of place.  Even in the world of science, the verb “to demonstrate” is used in a way that does not imply the sort of certitude that the purists might wish to retain for the strongest of empirical inferences from clinical trials. See, e.g., William B. Wong, Vincent W. Lin, Denise Boudreau, and Emily Beth Devine, “Statins in the prevention of dementia and Alzheimer’s disease: A meta-analysis of observational studies and an assessment of confounding,” 21 Pharmacoepidemiology & Drug Safety in-press, at Abstract (2012) (“Studies demonstrate the potential for statins to prevent dementia and Alzheimer’s disease (AD), but the evidence is inconclusive.”) (emphasis added).

The Duplicity Problem – The Matrixx Motion

After the conviction, Dr. Harkonen’s counsel moved for a new trial on grounds of newly discovered evidence. Dr. Harkonen’s counsel hoisted the prosecutors with their own petard, by quoting the government’s amicus brief to the United States Supreme Court in Matrixx Initiatives Inc. v. Siracusano, 131 S. Ct. 1309 (2011).  In Matrixx, the securities fraud plaintiffs contended that they need not plead “statistically significant” evidence of adverse drug effects.  The Solicitor General’s office, along with counsel for the Food and Drug Division of the Department of Health & Human Services, in their zeal to assist plaintiffs in their claims against an over-the-counter pharmaceutical manufacturer, disclaimed the necessity, or even the importance, of statistical significance:

“[w]hile statistical significance provides some indication about the validity of a correlation between a product and a harm, a determination that certain data are not statistically significant … does not refute an inference of causation.”

Brief for the United States as Amicus Curiae Supporting Respondents, in Matrixx Initiatives, Inc. v. Siracusano, 2010 WL 4624148, at *14 (Nov. 12, 2010).

The government’s amicus brief introduces its discussion of this topic with a heading, entitled “Statistical significance is a limited and non-exclusive tool for inferring causation.” Id. at *13.  In a footnote, the government elaborated that its position applied to both safety and efficacy outcomes:

“[t]he same principle applies to studies suggesting that a particular drug is efficacious. A study in which the cure rate for cancer patients who took a drug was twice the cure rate for those who took a placebo could generate meaningful interest even if the results were not statistically significant.”

Id. at *15 n.2.

The government might have suggested that Dr. Harkonen was parsing the amicus brief incorrectly.  After all, generating “meaningful interest” is not the same as generating a scientific conclusion, or as “demonstrating.” As I will show in a future post, the government, in its amicus brief, consistently misstated the meaning of statistical significance, and of significance probability.  The government’s inability to communicate these concepts correctly raises serious due process issues with a prosecution against someone for having used the wrong verb to describe a statistical inference.

SCOTUS

The government’s amicus brief was clearly influential before the Supreme Court. The Court cited to, and adopted in dictum, the claim that the absence of statistical significance did not mean that medical expert witnesses could not have a reliable basis for inferring causation between a drug and an adverse event.  Matrixx Initiatives, Inc. v. Siracusano, — U.S. –, 131 S.Ct. 1309, 1319-20 (2011) (“medical professionals and researchers do not limit the data they consider to the results of randomized clinical trials or to statistically significant evidence”).

In any event, the prosecutor, in Dr. Harkonen’s trial, argued in summation that InterMune’s clinical trial had “failed,” and that no conclusions could be drawn from the trial.  If this argument was not flatly contradicted by the government’s Matrixx amicus brief, it was certainly undermined by that brief’s rhetorical force.

The district court denied Dr. Harkonen’s motion for a new trial, and explained that the government’s Matrixx amicus brief contained “argument” rather than “newly discovered evidence.” United States v. Harkonen, No. C 08-00164 MHP, Memorandum and Order re Defendant Harkonen’s Motions for a New Trial at 14 (N.D. Calif. April 18, 2011). This rationale seems particularly inappropriate because the interpretation of a statistical test and the drawing of an inference are both “arguments,” and it is a fact that the government contended that p < 0.05 was not necessary to draw causal inferences. The district court also offered that Matrixx was distinguishable on the ground that the securities fraud in Matrixx involved a safety outcome rather than an efficacy conclusion. This distinction truly lacks a difference:  the standards for determining causation do not differ between establishing harm and establishing efficacy.  Of course, the FDA does employ a lesser, precautionary standard for regulating against harm, but this difference does not mean that the causal connections between drugs and harms are assessed under different standards.

On December 6th, the appeals in United States v. Harkonen were argued and submitted for decision.  Win or lose, Dr. Harkonen is likely to make important law in how scientists and lawyers speak about statistical inferences.