Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist at the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens, or even hundreds, of multiple comparisons. See In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (Rufe, J.). The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were but one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the statistical acumen needed to look beyond the reported p-value or confidence interval to the actual methods used in a study[1].

A complete and fair evaluation of the evidence in situations such as the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval. When the sample estimate arises from a study with multiple testing, presenting the estimate with its confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing. The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored this context in the studies she relied upon. What was noteworthy was that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship with conclusions.
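To see how quickly that inflation accumulates, consider the simple case of k independent comparisons, each tested at the conventional 0.05 level, when every null hypothesis is in fact true. The following is a minimal sketch of the arithmetic; the independence assumption and the particular values of k are mine, chosen only for illustration:

```python
# Hypothetical illustration: the chance of at least one false-positive
# ("nominally significant") result among k independent tests, when every
# null hypothesis is true and each test is run at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    family_wise_error = 1 - (1 - alpha) ** k
    print(f"{k:>3} comparisons -> P(at least one false positive) = {family_wise_error:.2f}")
# Roughly: 0.05, 0.23, 0.40, 0.64, and 0.92, respectively.
```

With twenty comparisons, the chance of at least one spurious “significant” finding is closer to two in three than to one in twenty, which is why the nominal p-value, standing alone, misstates the actual false-positive risk.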

*   *   *   *   *   *   *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but multiple testing would only compound the problem. Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2].
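For readers who wish to see what a quantitative adjustment of the sort Rothman criticizes looks like, here is a minimal sketch of the classic Bonferroni procedure; the p-values are invented for illustration, and the simple multiply-by-k rule is only the crudest member of the family of corrections:

```python
# Hypothetical Bonferroni illustration: each reported p-value is multiplied
# by the number of comparisons (and capped at 1.0) before being compared to
# alpha; equivalently, each individual test could be run at alpha / k.
alpha = 0.05
p_values = [0.004, 0.03, 0.046, 0.20, 0.81]   # invented values, for illustration only
k = len(p_values)

for p in p_values:
    adjusted = min(p * k, 1.0)
    print(f"p = {p:.3f}  Bonferroni-adjusted = {adjusted:.3f}  "
          f"nominally significant: {p < alpha}  after adjustment: {adjusted < alpha}")
```

On these invented numbers, three of the five comparisons are nominally “significant,” but only one survives the adjustment, which illustrates why the choice between adjusting and merely acknowledging multiple testing can matter so much to interpretation.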

In one case, a pharmaceutical company described a p-value of 0.058 as statistically significant in a “Dear Doctor” letter, no doubt to avoid a claim of under-warning physicians. Vanderwerf v. SmithKline Beecham Corp., 529 F.Supp. 2d 1294, 1301 & n.9 (D. Kan. 2008), appeal dism’d, 603 F.3d 842 (10th Cir. 2010). The trial court[3], quoting the FDA clinical review, reported that a finding of “significance” at the 0.05 level “must be discounted for the large number of comparisons made.” Id. at 1303, 1308.

Previous cases have also acknowledged the multiple testing problem. In litigation of claims that cell phone use caused brain tumors, plaintiffs’ expert witness relied upon subgroup analyses, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts undue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”

Boughton v. Cotter Corp., 65 F.3d 823, 835 n. 20 (10th Cir. 1995). See also Novo Nordisk A/S v. Caraco Pharm. Labs., 775 F.Supp. 2d 985, 1019-20 & n.21 (2011) (describing the Bonferroni correction, and noting that expert witness biostatistician Marcello Pagano had criticized the use of post-hoc, “cherry-picked” data that were not part of the prespecified protocol analysis, and the failure to use a “correction factor,” and that another biostatistician expert witness, Howard Tzvi Thaler, had described a “strict set of well-accepted guidelines for correcting or adjusting analysis obtained from the ‘post hoc’ analysis”).
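The arithmetic behind the Boughton footnote is easy to reproduce. A short sketch, assuming, as the court did, sixteen independent categories each tested at the 0.05 level:

```python
# Restating the Tenth Circuit's point in Boughton numerically: with 16
# independent cancer categories each tested at alpha = 0.05, roughly one
# nominally "significant" reading is expected by chance alone.
alpha = 0.05
k = 16
expected_chance_findings = k * alpha            # about 0.8 of a finding, on average
p_at_least_one = 1 - (1 - alpha) ** k           # about 0.56
print(f"Expected chance findings among {k} categories: {expected_chance_findings:.1f}")
print(f"Probability of at least one chance finding:    {p_at_least_one:.2f}")
```

On that footing, drawing at least one circle around a purely random cluster is a better-than-even proposition.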

The notorious Wells[4] case was cited by the Supreme Court in Matrixx Initiatives[5] for the proposition that statistical significance was unnecessary. Ironically, at least one of the studies relied upon by the plaintiffs’ expert witnesses in Wells had some outcomes with p-values below five percent. The problem, addressed by defense expert witnesses and ignored by the plaintiffs’ witnesses and Judge Shoob, was that there were over 20 reported outcomes, and probably many more outcomes analyzed but not reported. Accordingly, some qualitative or quantitative adjustment was required in Wells. See Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation 93 (1997)[6].

Reference Manual on Scientific Evidence

David Kaye’s and the late David Freedman’s chapter on statistics in the third, and most recent, edition of the Reference Manual offers some helpful insights into the problem of multiple testing:

4. How many tests have been done?

Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)^10 = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance. Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases. However, no general solution is available…. In these situations, courts should not be overly impressed with claims that estimates are significant….

Reference Manual on Scientific Evidence at 256-57 (3d ed. 2011).
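The Manual’s coin illustration can be verified by a quick simulation. A minimal sketch, in which the 5,000-toss figure is my own stand-in for “a few thousand” tosses:

```python
import random

# The Reference Manual's coin example by simulation: a run of ten heads is
# very unlikely in the first ten tosses of a fair coin, but quite likely
# somewhere within a few thousand tosses.
def has_run_of_heads(n_tosses, run_length=10):
    streak = 0
    for _ in range(n_tosses):
        if random.random() < 0.5:      # heads
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0
    return False

random.seed(12345)
trials = 2000
first_ten = sum(has_run_of_heads(10) for _ in range(trials)) / trials
within_5000 = sum(has_run_of_heads(5000) for _ in range(trials)) / trials
print(f"Share of trials with 10 heads in the first 10 tosses: {first_ten:.4f} (exact: 1/1024)")
print(f"Share of trials with a run of 10 heads within 5,000 tosses: {within_5000:.2f}")
```

The same search-then-test dynamic underlies the Manual’s warning about diligent searches of large datasets: find the pattern first, test it afterward, and statistical significance is bound to follow.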

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis. When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires the presentation of information about the number of tests conducted and the distorting effect of multiple testing on the pre-specified Type I error rate. Even a nominally statistically significant finding must be understood in the full context of the study.

Some texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves.  Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.”

*     *     *     *     *

Despite the guidance provided by the Reference Manual, some courts have remained resistant to the need to consider multiple comparison issues. Statistical issues arise frequently in securities fraud cases against pharmaceutical companies, cases that involve the need to evaluate and interpret clinical trial data for the benefit of shareholders. In a typical case, joint venturers Aeterna Zentaris Inc. and Keryx Biopharmaceuticals, Inc., were both targeted by investors for alleged Rule 10b-5 violations involving statements of clinical trial results, made in SEC filings, press releases, investor presentations, and investor conference calls from 2009 to 2012. Abely v. Aeterna Zentaris Inc., No. 12 Civ. 4711(PKC), 2013 WL 2399869 (S.D.N.Y. May 29, 2013); In re Keryx Biopharms, Inc., Sec. Litig., 1307(KBF), 2014 WL 585658 (S.D.N.Y. Feb. 14, 2014).

The clinical trial at issue tested perifosine, with and without other therapies, in multiple arms that examined efficacy against seven different types of cancer. After a preliminary phase II trial yielded promising results for metastatic colon cancer, the colon cancer arm proceeded. According to plaintiffs, the defendants repeatedly claimed that perifosine had demonstrated “statistically significant positive results.” In re Keryx at *2, 3.

The plaintiffs alleged that defendants’ statements omitted material facts, including the full extent of multiple testing in the design and conduct of the phase II trial, and the absence of the adjustments supposedly “required” by regulatory guidance and generally accepted statistical principles. The plaintiffs asserted that the multiple comparisons involved in testing perifosine in so many different kinds of cancer patients, at various doses, with and against so many different types of other cancer therapies, compounded by multiple interim analyses, inflated the risk of Type I errors such that some statistical adjustment should have been applied before claiming that a statistically significant survival benefit had been found in one arm, with colorectal cancer patients. In re Keryx at *2-3, *10.

The trial court dismissed these allegations, given that the trial protocol had been published, although that publication came over two years after the initial press release that started the class period, a press release that did not disclose the full extent of multiple testing or the lack of statistical correction. In re Keryx at *4, *11. The trial court emphatically rejected the plaintiffs’ efforts to dictate methodology and interpretative strategy. The trial court was loath to allow securities fraud claims over allegations of improper statistical methodology, which:

“would be equivalent to a determination that if a researcher leaves any of its methodology out of its public statements — how it did what it did or was planning to do — it could amount to an actionable false statement or omission. This is not what the law anticipates or requires.”

In re Keryx at *10[7]. According to the trial court, providing p-values for comparisons between therapies, without disclosing the extent of unplanned interim analyses or the number of multiple comparisons, is “not falsity; it is less disclosure than plaintiffs would have liked.” Id. at *11.

“It would indeed be unjust—and could lead to unfortunate consequences beyond a single lawsuit—if the securities laws become a tool to second guess how clinical trials are designed and managed. The law prevents such a result; the Court applies that law here, and thus dismisses these actions.”

Id. at *1.

The court’s characterization of the fraud claims as a challenge to trial methodology rather than data interpretation and communication decidedly evaded the thrust of the plaintiffs’ fraud complaint. Data interpretation will often be part of the methodology outlined in a protocol. The Keryx case also confused criticism of the design and execution of a clinical trial with criticism of the communication of the trial results.


[1] Predictably, some plaintiffs’ counsel accused the MDL trial judge of acting as a statistician and second-guessing the statistical inferences drawn by the party expert witness. See, e.g., Max Kennerly, “Daubert Doesn’t Ask Judges To Become Experts On Statistics” (July 22, 2014). Federal Rule of Evidence 702 requires trial judges to evaluate the methodology used to determine whether it is valid. Kennerly would limit the trial judge to a simple determination of whether the expert witness used statistics, and whether statistics generally are appropriately used. In his words, “[t]o go with the baseball metaphors so often (and wrongly) used in the law, when it comes to Daubert, the judge isn’t an umpire calling balls and strikes, they’re [sic] more like a league official checking to make sure the players are using regulation equipment. Mere disagreements about the science itself, and about the expert’s conclusions, are to be made by the jury in the courtroom.” This position is rejected by the explicit wording of the statute, as well as the Supreme Court opinions leading up to the revision in the statute. To extend Kennerly’s overextended metaphor even further, the trial court must not only make sure that the players are using regulation equipment, but also that the pitchers, the expert witnesses, aren’t throwing spitballs or balking in their pitching of opinions. Judge Rufe, in the Zoloft MDL, did no more than was asked of her by Rule 702 and the Reference Manual.

[2] Perhaps the prosecutor, jury, and trial and appellate judges in United States v. Harkonen would be willing to brand Dr. Bérard a fraudfeasor. U.S. v. Harkonen, 2009 WL 1578712, 2010 WL 2985257 (N.D. Cal.), aff’d, 2013 WL 782354 (9th Cir. Mar. 4, 2013), cert. denied, ___ U.S. ___ (2013).

[3] The trial court also acknowledged the Reference Manual on Scientific Evidence 127-28 (2d ed. 2000). Unfortunately, the court erred in interpreting the meaning of a 95 percent confidence interval as showing “the true relative risk value will be between the high and low ends of the confidence interval 95 percent of the time.” Vanderwerf v. SmithKline Beecham Corp., 529 F.Supp. 2d at 1302 n.10.

[4] Wells v. Ortho Pharm. Corp., 615 F. Supp. 262 (N.D. Ga. 1985), aff ’d, and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986).

[5] Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309 (2011).

[6] Zeisel and Kaye contrast the lack of appreciation for statistical methodology in Wells with the handling of the multiple comparison issue in an English case, Reay v. British Nuclear Fuels (Q.B. Oct. 8, 1993). In Reay, the claimants were children who had developed leukemia and whose fathers had worked in nuclear power plants. Their expert witnesses relied upon a study that reported 50 or so hypotheses. Zeisel and Kaye quote the trial judge as acknowledging that the number of hypotheses considered inflates the nominal value of the p-value and reduces confidence in the study’s result. Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation 93 (1997) (discussing Reay case as published in The Independent, Nov. 22, 1993).

[7] Of course, this is exactly what happened to Dr. Scott Harkonen, who was indicted and convicted under the Wire Fraud Act, despite having issued a press release that announced an investor conference call within a couple of weeks, at which investors and others could inquire fully about the clinical trial results.