TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses

October 14th, 2014

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist at the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens or hundreds of multiple comparisons. See In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (Rufe, J.). The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were but one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen to look beyond the reported p-value or confidence interval to the actual methods used in a study[1].

A complete and fair evaluation of the evidence in a situation such as the Zoloft birth defects epidemiology requires more than the presentation of the size of the random error, or the width of the 95 percent confidence interval.  When the sample estimate arises from a study with multiple testing, presenting the estimate with a confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing.  The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored this context in the studies upon which she relied. What was noteworthy was that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship to the proffered conclusions.

*   *   *   *   *   *   *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem. Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2].
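
To make the inflation of the false-positive rate concrete, here is a minimal simulation sketch in Python (my own illustration; it is not drawn from any study or analysis in the Zoloft record). Twenty independent comparisons are run on data in which no real effect exists, and the chance of at least one nominally “significant” result is tallied with and without a Bonferroni-adjusted threshold.

```python
# Illustrative simulation only: twenty null comparisons per "study".
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_sims, n_tests, alpha = 5_000, 20, 0.05

any_nominal = 0      # studies with at least one p < 0.05 (no adjustment)
any_adjusted = 0     # studies with at least one p < 0.05 / 20 (Bonferroni)

for _ in range(n_sims):
    x = rng.normal(size=(n_tests, 50))               # "exposed" groups, no true effect
    y = rng.normal(size=(n_tests, 50))               # "unexposed" groups
    p = stats.ttest_ind(x, y, axis=1).pvalue
    any_nominal += (p < alpha).any()
    any_adjusted += (p < alpha / n_tests).any()

print(f"P(>=1 'significant' finding | no real effects): {any_nominal / n_sims:.2f}")   # ~0.64
print(f"Same, with Bonferroni-adjusted threshold:       {any_adjusted / n_sims:.2f}")  # ~0.05
```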

In one case, a pharmaceutical company described a p-value of 0.058 as statistically significant in a “Dear Doctor” letter, no doubt to avoid a claim of under-warning physicians. Vanderwerf v. SmithKline Beecham Corp., 529 F.Supp. 2d 1294, 1301 & n.9 (D. Kan. 2008), appeal dism’d, 603 F.3d 842 (10th Cir. 2010). The trial court[3], quoting the FDA clinical review, reported that a finding of “significance” at the 0.05 level “must be discounted for the large number of comparisons made.” Id. at 1303, 1308.

Previous cases have also acknowledged the multiple testing problem. In litigation of claims for compensation for brain tumors allegedly caused by cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts undue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”

Boughton v. Cotter Corp., 65 F.3d 823, 835 n. 20 (10th Cir. 1995). See also Novo Nordisk A/S v. Caraco Pharm. Labs., 775 F.Supp. 2d 985, 1019-20 & n.21 (2011) (describing the Bonferroni correction, and noting that expert witness biostatistician Marcello Pagano had criticized the use of post-hoc, “cherry-picked” data that were not part of the prespecified protocol analysis, and the failure to use a “correction factor,” and that another biostatistician expert witness, Howard Tzvi Thaler, had described a “strict set of well-accepted guidelines for correcting or adjusting analysis obtained from the ‘post hoc’ analysis”).
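
The arithmetic behind the Boughton footnote is easy to verify: with sixteen independent outcomes, each tested at the conventional 0.05 level, the chance of at least one nominally “significant” finding under the null hypothesis exceeds one half. A short check:

```python
# Chance of at least one false positive among k independent tests, each at alpha = 0.05.
alpha = 0.05
for k in (1, 16, 20):
    print(k, round(1 - (1 - alpha) ** k, 2))   # 1 -> 0.05; 16 -> 0.56; 20 -> 0.64
```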

The notorious Wells[4] case was cited by the Supreme Court in Matrixx Initiatives[5] for the proposition that statistical significance was unnecessary. Ironically, at least one of the studies relied upon by the plaintiffs’ expert witnesses in Wells had some outcomes with p-values below five percent. The problem, addressed by defense expert witnesses and ignored by the plaintiffs’ witnesses and Judge Shoob, was that there were over 20 reported outcomes, and probably many more outcomes analyzed but not reported. Accordingly, some qualitative or quantitative adjustment was required in Wells. See Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation 93 (1997)[6].

Reference Manual on Scientific Evidence

David Kaye’s and the late David Freedman’s chapter on statistics in the third, and most recent, edition of the Reference Manual offers some helpful insights into the problem of multiple testing:

4. How many tests have been done?

Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)¹⁰ = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance. Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases. However, no general solution is available. … In these situations, courts should not be overly impressed with claims that estimates are significant. …”

Reference Manual on Scientific Evidence at 256-57 (3d ed. 2011).
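
The Manual’s coin-tossing illustration can also be checked by simulation. The sketch below is my own, not the Manual’s; it estimates how often a fair coin tossed 3,000 times produces at least one run of ten consecutive heads.

```python
# Simulation of the Reference Manual's coin example (illustrative only).
import numpy as np

rng = np.random.default_rng(seed=2)
n_sims, n_tosses, run_length = 2_000, 3_000, 10

def has_run_of_heads(tosses, k):
    """True if the 0/1 sequence contains a run of k consecutive 1s (heads)."""
    count = 0
    for t in tosses:
        count = count + 1 if t else 0
        if count >= k:
            return True
    return False

print("P(10 heads in the first 10 tosses):", 0.5 ** 10)       # 1/1024, about 0.001
hits = sum(has_run_of_heads(rng.integers(0, 2, n_tosses), run_length)
           for _ in range(n_sims))
print("P(a run of 10 heads somewhere in 3,000 tosses):", hits / n_sims)   # roughly 0.75-0.80
```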

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis.  When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires the presentation of the information about the number of tests and the distorting effect of multiple testing on preserving a pre-specified Type I error rate.  Even a nominally statistically significant finding must be understood in the full context of the study.

Some texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves.  Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.”

*     *     *     *     *

Despite the guidance provided by the Reference Manual, some courts have remained resistant to the need to consider multiple comparison issues. Statistical issues arise frequently in securities fraud cases against pharmaceutical companies, cases that involve the need to evaluate and interpret clinical trial data for the benefit of shareholders. In a typical case, joint venturers Aeterna Zentaris Inc. and Keryx Biopharmaceuticals, Inc., were both targeted by investors for alleged Rule 10b-5 violations involving statements of clinical trial results, made in SEC filings, press releases, investor presentations, and investor conference calls from 2009 to 2012. Abely v. Aeterna Zentaris Inc., No. 12 Civ. 4711(PKC), 2013 WL 2399869 (S.D.N.Y. May 29, 2013); In re Keryx Biopharms, Inc., Sec. Litig., 1307(KBF), 2014 WL 585658 (S.D.N.Y. Feb. 14, 2014).

The clinical trial at issue tested perifosine in conjunction with, and without, other therapies, in multiple arms, which examined efficacy for seven different types of cancer. After a preliminary phase II trial yielded promising results for metastatic colon cancer, the colon cancer arm proceeded. According to plaintiffs, the defendants repeatedly claimed that perifosine had demonstrated “statistically significant positive results.” In re Keryx at *2, 3.

The plaintiffs alleged that defendants’ statements omitted material facts, including the full extent of multiple testing in the design and conduct of the phase II trial, without adjustments supposedly “required” by regulatory guidance and generally accepted statistical principles. The plaintiffs asserted that the multiple comparisons involved in testing perifosine in so many different kinds of cancer patients, at various doses, with and against so many different types of other cancer therapies, compounded by multiple interim analyses, inflated the risk of Type I errors such that some statistical adjustment should have been applied before claiming that a statistically significant survival benefit had been found in one arm, with colorectal cancer patients. In re Keryx at *2-3, *10.

The trial court dismissed these allegations, largely because the trial protocol had eventually been published, although that publication came more than two years after the initial press release that started the class period, the same press release that failed to disclose the full extent of the multiple testing and the lack of statistical correction. In re Keryx at *4, *11. The trial court emphatically rejected the plaintiffs’ efforts to dictate methodology and interpretative strategy. The trial court was loath to allow securities fraud claims over allegations of improper statistical methodology, which:

“would be equivalent to a determination that if a researcher leaves any of its methodology out of its public statements — how it did what it did or was planning to do — it could amount to an actionable false statement or omission. This is not what the law anticipates or requires.”

In re Keryx at *10[7]. According to the trial court, providing p-values for comparisons between therapies, without disclosing the extent of unplanned interim analyses or the number of multiple comparisons is “not falsity; it is less disclosure than plaintiffs would have liked.” Id. at *11.

“It would indeed be unjust—and could lead to unfortunate consequences beyond a single lawsuit—if the securities laws become a tool to second guess how clinical trials are designed and managed. The law prevents such a result; the Court applies that law here, and thus dismisses these actions.”

Id. at *1.

The court’s characterization of the fraud claims as a challenge to trial methodology rather than data interpretation and communication decidedly evaded the thrust of the plaintiffs’ fraud complaint. Data interpretation will often be part of the methodology outlined in a protocol. The Keryx case also confused criticism of the design and execution of a clinical trial with criticism of the communication of the trial results.


[1] Predictably, some plaintiffs’ counsel accused the MDL trial judge of acting as a statistician and second-guessing the statistical inferences drawn by the party expert witness. See, e.g., Max Kennerly, “Daubert Doesn’t Ask Judges To Become Experts On Statistics” (July 22, 2014). Federal Rule of Evidence 702 requires trial judges to evaluate the methodology used to determine whether it is valid. Kennerly would limit the trial judge to a simple determination of whether the expert witness used statistics, and whether statistics generally are appropriately used. In his words, “[t]o go with the baseball metaphors so often (and wrongly) used in the law, when it comes to Daubert, the judge isn’t an umpire calling balls and strikes, they’re [sic] more like a league official checking to make sure the players are using regulation equipment. Mere disagreements about the science itself, and about the expert’s conclusions, are to be made by the jury in the courtroom.” This position is rejected by the explicit wording of the statute, as well as by the Supreme Court opinions leading up to the revision of the statute. To extend Kennerly’s overextended metaphor even further, the trial court must not only make sure that the players are using regulation equipment, but also that pitchers, expert witnesses, aren’t throwing spitballs or balking in their pitching of opinions. Judge Rufe, in the Zoloft MDL, did no more than was asked of her by Rule 702 and the Reference Manual.

[2] Perhaps the prosecutor, jury, and trial and appellate judges in United States v. Harkonen would be willing to brand Dr. Bérard a fraudfeasor. U.S. v. Harkonen, 2009 WL 1578712, 2010 WL 2985257 (N.D. Cal.), aff’d, 2013 WL 782354 (9th Cir. Mar. 4, 2013), cert. denied, ___ U.S. ___ (2013).

[3] The trial court also acknowledged the Reference Manual on Scientific Evidence 127-28 (2d ed. 2000). Unfortunately, the court erred in interpreting the meaning of a 95 percent confidence interval as showing “the true relative risk value will be between the high and low ends of the confidence interval 95 percent of the time.” Vanderwerf v. SmithKline Beecham Corp., 529 F.Supp. 2d at 1302 n.10.

[4] Wells v. Ortho Pharm. Corp., 615 F. Supp. 262 (N.D. Ga. 1985), aff’d, and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986).

[5] Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309 (2011).

[6] Zeisel and Kaye contrast the lack of appreciation for statistical methodology in Wells with the handling of the multiple comparison issue in an English case, Reay v. British Nuclear Fuels (Q.B. Oct. 8, 1993). In Reay, children who developed leukemia, and whose fathers had worked in nuclear power plants, sued. Their expert witnesses relied upon a study that reported 50 or so hypotheses. Zeisel and Kaye quote the trial judge as acknowledging that the number of hypotheses considered inflates the nominal value of the p-value and reduces confidence in the study’s result. Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation 93 (1997) (discussing the Reay case as published in The Independent, Nov. 22, 1993).

[7] Of course, this is exactly what happened to Dr. Scott Harkonen, who was indicted and convicted under the Wire Fraud Act, despite issuing a press release that included a notice of an investor conference call within a couple of weeks, when investors and others could inquire fully about the clinical trial results.

Subgroups — Subpar Statistical Practice versus Fraud

July 24th, 2014

Several people have asked me why I do not enable comments on this blog.  Although some bloggers (e.g., Deborah Mayo’s Error Statistics site) have had great success in generating interesting and important discussions, I have seen too much spam on other websites, and I want to avoid having to police the untoward posts.  Still, I welcome comments and I try to respond to helpful criticism.  If and when I am wrong, I will gladly eat my words, which usually have been quite digestible.

Probably none of the posts here have generated more comments and criticisms than those written about the prosecution of Dr. Harkonen.  In general, critics have argued that defending Harkonen and his press release was tantamount to condoning bad statistical practice.  I have tried to show that Dr. Harkonen’s press release was much more revealing than it was portrayed in abbreviated accounts of his case, and the evidentiary support for his claim of efficacy in a subgroup was deeper and broader than acknowledged. The criticism and condemnation of Dr. Harkonen’s press release in the face of prevalent statistical practice, among leading journals and practitioners, is nothing short of hypocrisy and bad faith. If Dr. Harkonen deserves prison time for a press release, which promised a full analysis and discussion in upcoming conference calls and presentations at scientific meetings, then we can only imagine what criminal sanction awaits the scientists and journal editors who publish purportedly definitive accounts of clinical trials and epidemiologic studies, with subgroup analyses not prespecified and not labeled as post-hoc.

The prevalence of the practice does not transform Dr. Harkonen’s press release into “best practice,” but some allowance must be made for offering a causal opinion in the informal context of a press release rather than in a manuscript for submission to a journal.  And those critics, with prosecutorial temperaments, must recognize that, when the study was presented at conferences, and when the manuscript was written up and submitted to the New England Journal of Medicine, the authors did reveal the ad hoc nature of the subgroup analysis.

The Harkonen case will remain important for several reasons. There is an important distinction in the Harkonen case, ignored and violated by the government’s position, between opinion and fact.  If Harkonen is guilty of Wire Fraud, then so are virtually every cleric, minister, priest, rabbi, imam, mullah, and other religious person who makes supernatural claims and predictions.  Add in all politicians, homeopaths, vaccine deniers, and others who reject evidence for superstition, who are much more culpable than a scientist who accurately reports the actual data and p-value.

Then there is the disconnect between what expert witnesses are permitted to say and what resulted in Dr. Harkonen’s conviction. If any good could come from the government’s win, it would be the insistence upon “best practice” for gatekeeping of expert witness opinion testimony.

For better or worse, scientists often describe post-hoc subgroup findings as “demonstrated” effects. Although some scientists would disagree with this reporting, the practice is prevalent.  Some scientists would go further and contest the claim that pre-specified hypotheses are inherently more reliable than post-hoc hypotheses. See Timothy Lash & Jan Vandenbroucke, “Should Preregistration of Epidemiologic Study Protocols Become Compulsory?,” 23 Epidemiology 184 (2012).

One survey compared grant applications with later published papers and found that subgroup analyses were pre-specified in only a minority of cases; in a substantial majority (77%) of the subgroup analyses in the published papers, the analyses were not characterized as either pre-specified or post hoc. Chantal W. B. Boonacker, Arno W. Hoes, Karen van Liere-Visser, Anne G. M. Schilder, and Maroeska M. Rovers, “A Comparison of Subgroup Analyses in Grant Applications and Publications,” 174 Am. J. Epidem. 291, 291 (2011).  Indeed, this survey’s comparison between grant applications and published papers revealed that most of the published subgroup analyses were post hoc, and that the authors of the published papers rarely reported justifications for their post-hoc subgroup analyses. Id.

Again, for better or worse, the practice of presenting unplanned subgroup analyses, is common in the biomedical literature. Several years ago, the New England Journal of Medicine reported a survey of publication practice in its own pages, with findings similar to those of Boonacker and colleagues. Rui Wang, Stephen W. Lagakos, James H. Ware, David J. Hunter, and Jeffrey M. Drazen, “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Eng. J. Med. 2189 (2007).  In general, Wang, et al.,  were unable to determine the total number of subgroup analyses performed; and in the majority (68%) of trials discussed, Wang could not determine whether the subgroup analyses were prespecified. Id. at 2912. Although Wang proposed guidelines for identifying subgroup analyses as prespecified or post-hoc, she emphasized that the proposals were not “rules” that could be rigidly prescribed. Id. at 2194.

The Wang study is hardly unique; the Journal of the American Medical Association reported a similar set of results. An-Wen Chan, Asbjørn Hrobjartsson, Mette T. Haahr, Peter C. Gøtzsche, and Douglas G. Altman, “Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles,” 291 J. Am. Med. Ass’n 2457 (2004).  Chan and colleagues set out to document and analyze “outcome reporting bias” in studies; that is, the extent to which publications fail to report accurately the pre-specified outcomes in published studies of randomized clinical trials.  The authors compared and analyzed protocols and published reports of randomized clinical trials conducted in Denmark in 1994 and 1995. Their findings document a large discrepancy between the idealized notion of pre-specification of study design, outcomes, and analyses, and the actual practice revealed by later publication.

Chan identified 102 clinical trials, with 3,736 outcomes, and found that 50% of efficacy outcomes, and 65% of harm outcomes, were incompletely reported. Statistically significant outcomes were significantly more likely to be fully reported than statistically insignificant ones (pooled odds ratio for efficacy outcomes = 2.4; 95% confidence interval, 1.4 to 4.0; pooled odds ratio for harm outcomes = 4.7; 95% confidence interval, 1.8 to 12.0). Their comparison of protocols with later published articles revealed that a majority of trials (62%) had at least one primary outcome that was changed, introduced, or omitted in the published version. The authors concluded that published accounts of clinical trials were frequently incomplete, biased, and inconsistent with protocols.

This week, an international group of scientists published their analysis of agreement vel non between protocols and corresponding later publications of randomized clinical trials. Matthias Briel and the DISCO study group, “Subgroup analyses in randomised controlled trials: cohort study on trial protocols and journal publications,” 349 Brit. Med. J. g4539 (published 16 July 2014). Predictably, the authors found a good deal of sloppy practice, or worse.  Of the 515 journal articles identified, about half (246, or 47.8%) reported one or more subgroup analyses. Of the articles that reported subgroup analyses, 81 (32.9%) stated that the subgroup analyses were prespecified, but in 28 of these articles (34.6%), the corresponding protocols did not identify the subgroup analysis.

In 86 of the publications surveyed, the authors found that the articles claimed a subgroup “effect,” but only 36 of the corresponding protocols reported a planned subgroup analysis.  Briel and the DISCO study group concluded that protocols of randomized clinical trials insufficiently describe subgroup analyses. In over one-third of publications, the articles reported subgroup analyses not pre-specified in earlier protocols. The DISCO study group called for access to protocols and statistical analysis plans for all randomized clinical trials.

In view of these empirical data, the government’s claims against Dr. Harkonen stand out, at best, as vindictive, selective prosecution.

Stanford Conference on Mathematics in Court

June 26th, 2014

Last month, The Stanford Center for Legal Informatics hosted a conference, “Trial With and Without Mathematics: Legal, Philosophical, and Computational Perspectives.” The conference explored what role, if any, mathematics plays in the law, and in the training and education of lawyers.

The program was organized by Marcello Di Bello (Stanford Univ., Department of Philosophy), and Bart Verheij (Stanford Univ., CodeX Center for Legal Informatics, and Univ. of Groningen, Institute of Artificial Intelligence). Di Bello teaches an undergraduate course, Probability and the Law, at Stanford.

The program featured presentations by:

Sandy L. Zabell (Northwestern Univ.) on “A Tribe of Skeptics: Probability and the 19th Century Law of Evidence,” (Slides; Video), with commentary by Andrea Roth (Univ. California, Berkeley School of Law);

Susan Haack (Univ. of Miami School of Law), on “Legal Probabilism: An Epistemological Dissent,” (Slides; Video), with commentary by Charles H. Brenner (Univ. California, Berkeley School of Law) (Slides);

William C. Thompson (Univ. California, Irvine Dep’t Criminology, Law & Society), on “How Should Forensic Scientists Explain Their Evidence to Juries: Match Probabilities, Likelihood Ratios, or ‘Verbal Equivalents’?” (Slides; Video), with commentary by Paul Brest (Stanford Law School);

Henry Prakken (Univ. Groningen), on “Models of Legal Proof and Their Cognitive Plausibility,” (Slides; Video), with commentary by Sarah B. Lawsky (Univ. California, Irvine, School of Law) (Slides);

Vern Walker (Hofstra Univ. School of Law), on “Computational Representation of Legal Reasoning at the Law-Fact Interface,” (Slides; Video), with commentary by Bart Verheij (Slides); and

Ronald J. Allen (Northwestern Univ. School of Law) presented on “What Are We Doing? Reconsidering Juridical Proof Rules,” (Slides; Video), with commentary by Marcello Di Bello.

An interesting collection of presentations and commentary, which I have not yet reviewed carefully.  Professor Haack’s presentation seems to cover much the same ground covered at a conference on Standards of Proof and Scientific Evidence, held at the University of Girona, in Spain.  Her previous lecture can be viewed on-line, and a manuscript of Haack’s paper is available, as well.  Susan Haack, “Legal Probabilism: An Epistemological Dissent” (2011) (cited here as “Haack”).  See “Haack Attack on Legal Probabilism” (2012).

Professor Haack’s papers and presentations on law, legal evidence, and probability are slated for republication in book form, this August. Susan Haack, Evidence Matters: Science, Proof, and Truth in the Law (Cambridge 2014). The contents look familiar:

1. Epistemology and the law of evidence: problems and projects

2. Epistemology legalized: or, truth, justice, and the American way

3. Legal probabilism: an epistemological dissent

4. Irreconcilable differences? The troubled marriage of science and law

5. Trial and error: two confusions in Daubert

6. Federal philosophy of science: a deconstruction – and a reconstruction

7. Peer review and publication: lessons for lawyers

8. What’s wrong with litigation-driven science?

9. Proving causation: the weight of combined evidence

10. Correlation and causation: the ‘Bradford Hill Criteria’ in epidemiological, legal, and epistemological perspective

11. Risky business: statistical proof of specific causation

12. Nothing fancy: some simple truths about truth in the law

Goodman v Viljoen – Meeting the Bayesian Challenge Head On

June 11th, 2014

Putting Science On Its Posterior

Plaintiffs’ and Defendants’ counsel both want the scientific and legal standard to be framed as a very high posterior probability of the truth of a claim. Plaintiffs want the scientific posterior probability to be high because they want to push the legal system in the direction of allowing weak or specious claims that are not supported by sufficient scientific evidence to support a causal conclusion.  By asserting that the scientific posterior probability for a causal claim is high, and that the legal and scientific standards are different, they seek to empower courts and juries to support judgments of causality that are deemed inconclusive, speculative, or worse, by scientists themselves.

Defendants want the scientific posterior probability to be high, and claim that the legal standard should be at least as high as the scientific standard.

Both Plaintiffs and Defendants thus find common cause in committing the transposition fallacy by transmuting the coefficient of confidence, typically 95%, into a minimally necessary posterior probability for scientific causal judgments.  “One wanders to the left, another to the right; both are equally in error, but are seduced by different delusions.”[1]

In the Goodman v. Viljoen[2] case, both sides, plaintiffs and defendants, embraced the claim that science requires a high posterior probability, and that the p-value provided evidence of the posterior probability of the causal claim at issue.  The error came mostly from the parties’ clinical expert witnesses and from the lawyers themselves; the parties’ statistical expert witnesses appeared to try to avoid the transposition fallacy. Clearly, no text would support the conflation of confidence with certainty. No scientific text, treatise, or authority was cited for the notion that scientific “proof” required 95% certainty. This notion was simply an opinion of testifying witnesses.

The principal evidence that antenatal corticosteroid (ACS) therapy can prevent cerebral palsy (CP) came from a Cochrane review and meta-analysis[3] of clinical trials.  The review examined a wide range of outcomes, only one of which was CP.  The trials were apparently not designed to assess CP risk, and they varied significantly in case definition, diagnostic criteria, and length of follow up for case ascertainment. Of the five included studies, four ascertained CP at follow up from two to six years, and the length of follow up was unknown in the fifth study.

Data were sparse in the Cochrane review, as expected for a relatively rare outcome.  The five studies encompassed 904 children, with 490 in the treatment group, and 414 in the control group. There was a total of 48 CP cases, with 20 in the treatment, and 28 in the control, groups. Blinding was apparently not maintained over the extended reporting period.

Professor Andrew Willan, plaintiffs’ testifying expert witness on statistics, sponsored a Bayesian statistical analysis, with which he concluded that there was between a 91 and 97% probability that there was an increased risk of CP from not providing ACS in pre-term labor (or, a decreased risk of CP from administering ACS).[4] Willan’s posterior probabilities were for any increased risk, based upon the Cochrane data.  Willan’s calculations were not provided in his testimony, and no information about his prior probability was given. The data came from clinical trials, but the nature of the observations and the analyses made these trials little more than observational studies conducted within the context of clinical trials designed to look at other outcomes. The Bayesian analysis did not account for the uncertainty in the case definitions, variations in internal validity and follow up, and biases in the clinical trials. Willan’s posterior probabilities thus described a maximal probability for general causation, which surely needed to be discounted for validity and bias issues.
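
For readers curious how a posterior probability in the low-to-mid 90s could arise from the Cochrane summary (relative risk 0.60, with a 95% confidence interval of 0.34 to 1.03), here is a minimal sketch, under assumptions of my own choosing (an essentially flat prior and a normal approximation on the log relative risk), that lands in the same neighborhood. It is not Willan’s model, which was not disclosed in the testimony.

```python
# Rough illustration, not Willan's analysis: flat prior plus a normal
# approximation on the log relative risk from the Cochrane summary.
import math
from scipy.stats import norm

rr, lo, hi = 0.60, 0.34, 1.03                       # Cochrane estimate for cerebral palsy
log_rr = math.log(rr)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)     # back out the standard error from the CI

# With a flat prior, the posterior for log(RR) is approximately Normal(log_rr, se**2).
# The posterior probability of *any* protective effect (RR < 1) is then:
p_any_benefit = norm.cdf(-log_rr / se)
print(f"P(RR < 1 | data, flat prior) = {p_any_benefit:.2f}")    # about 0.96
# Note: this is the probability of any reduction in risk, of whatever size,
# not the probability of a reduction large enough to matter for causation.
```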

There was a further issue of external validity. The Goodman twins developed CP from having sustained periventricular leukomalacia (PVL), which is one among several mechanistic pathways by which CP can develop in pre-term infants.  The Cochrane data did not address PVL, and the included trials were silent as to whether any of the CP cases involved PVL mechanisms.  There was no basis for assuming that ACS reduced risk of CP from all mechanisms equally, or even at all.[5] The Willan posterior probabilities did not address the external validity issues as they pertained to the Goodman case itself.

Although Dr. Viljoen abandoned the challenge to the Bayesian analysis at trial, his statistical expert witness, Dr. Robert Platt, went further to opine that he agreed with Willan’s calculations.  To agree with those calculations, and the posterior probabilities that came out of them, Platt had to have agreed with the analyses themselves. This agreement seems ill-considered given that elsewhere in his testimony, Platt appears to advance important criticisms of the Cochrane data in the form of validity and bias issues.

Certainly, Platt’s concession about the correctness of Willan’s calculations greatly undermined Dr. Viljoen’s position with the trial and appellate courts. Dr. Viljoen maintained those criticisms throughout the trial, and on appeal.  See, e.g., Defendant (Appellant) Factum, 2012 CCLTFactum 20936, at ¶14(a) (“(a) antenatal corticosteroids have never been shown to reduce the incidence or effect of PVL”); id. at ¶14(d) (“at best, even taking the Bayesian approach at face value, the use of antenatal corticosteroids showed only a 40% reduction in the incidence of cerebral palsy, but not PVL”).

How might things have gone better for Dr. Viljoen? For one thing, Platt’s concession about the correctness of Willan’s calculations had to be explained and qualified as conceding only the posterior probability on the doubtful and unproven assumptions made by Willan. Willan’s posterior, as big as it was, represented only an idealized maximal posterior probability, which in reality had to be deeply discounted by important uncertainties, biases, and validity concerns.  The inconclusiveness of the data was “provable” on either a frequentist or a Bayesian analysis.


[1] Horace, in Wood, Dictionary of Quotations 182 (1893).

[2] Goodman v. Viljoen, 2011 ONSC 821 (CanLII), aff’d, 2012 ONCA 896 (CanLII), leave appeal den’d, Supreme Court of Canada No. 35230 (July 11, 2013).

[3] Devender Roberts & Stuart R. Dalziel, “Antenatal corticosteroids for accelerating fetal lung maturation for women at risk of preterm birth,” Cochrane Database of Systematic Reviews, Issue 3, Art. No. CD004454, at 8 (2006).

[4] Notes of Testimony of Andrew Willan at 34 (April 9, 2010) (concluding that ACS reduces risk of CP, with a probability of 91 to 97 percent, depending upon whether random effects or fixed effect models are used).

[5] See, e.g., Olivier Baud, Laurence Foix-L’Hélias, et al., “Antenatal Glucocorticoid Treatment and Cystic Periventricular Leukomalacia in Very Premature Infants,” 341 New Engl. J. Med. 1190, 1194 (1999) (“Our results suggest that exposure to betamethasone but not dexamethasone is associated with a decreased risk of cystic periventricular leukomalacia.”).

Goodman v Viljoen – Subterfuge to Circumvent Relative Risks Less Than 2

June 6th, 2014

Back in March, I wrote about a “Black Swan” case, in which litigants advanced a Bayesian analysis to support their claims. Goodman v. Viljoen, 2011 ONSC 821 (CanLII), aff’d, 2012 ONCA 896 (CanLII), leave appeal den’d, Supreme Court of Canada No. 35230 (July 11, 2013).

Goodman was a complex medical malpractice case in which Mrs. Goodman alleged that her obstetrician, Dr. Johan Viljoen, deviated from the standard of care by failing to prescribe antenatal corticosteroids (ACS) sufficiently in advance of delivery to reduce the risks attendant upon the early delivery of her twin boys. Both boys developed cerebral palsy (CP). The parties and their experts agreed that the administration of ACS reduced the risks of respiratory distress and other complications of pre-term birth, but they disputed the efficacy of ACS to avoid or diminish the risk of CP.

According to the plaintiffs, ACS would have, more probably than not, prevented the twins from developing cerebral palsy, or would have diminished the severity of their condition.  Dr. Viljoen disputed both general and specific causation. Evidence of general causation came from both randomized clinical trials (RCTs) and observational studies.

Limitations Issue

There were many peculiar aspects to the Goodman case, not the least of which was that the twins sued Dr. Viljoen over a decade after they were born.  Dr. Viljoen had moved his practice in the intervening years, and he was unable to produce crucial records that supported his account of how his staff responded to Mrs. Goodman’s telephone call about signs and symptoms of labor. The prejudice to Dr. Viljoen illustrates the harshness of broad tolling statutes, the unfairness of which could be reduced by requiring infant plaintiffs to give notice of their intent to sue, even if they wait until the age of majority before filing their complaints.

State of the Art Issue

Dr. Viljoen suffered perhaps a more serious prejudice in the form of hindsight bias that resulted from the evaluation of his professional conduct by evidence that was unavailable when the twins were born in 1995. The following roughly contemporaneous statement from the New England Journal of Medicine is typical of serious thinking at the time of the alleged malpractice:

“Antenatal glucocorticoid therapy decreases the incidence of several complications among very premature infants. However, its effect on the occurrence of cystic periventricular leukomalacia, a major cause of cerebral palsy, remains unknown.”

Olivier Baud, Laurence Foix-L’Hélias, et al., “Antenatal Glucocorticoid Treatment and Cystic Periventricular Leukomalacia in Very Premature Infants,” 341 New Engl. J. Med. 1190, 1190 (1999) (emphasis added). The findings of this observational study illustrate some of the difficulties with the claim that Dr. Viljoen failed to prevent an avoidable consequence of pre-term delivery:

“Our results suggest that exposure to betamethasone but not dexamethasone is associated with a decreased risk of cystic periventricular leukomalacia.”

Id. at 1194. Results varied among corticosteroids, doses, and timing regimens.  There hardly seemed enough data in 1995 to dictate a standard of care.

Meta-Analysis Issues

Over ten years after the Goodman twins were born, the Cochrane collaboration published a meta-analysis that was primarily concerned with the efficacy of ACS for lung maturation. Devender Roberts & Stuart R. Dalziel, “Antenatal corticosteroids for accelerating fetal lung maturation for women at risk of preterm birth,” Cochrane Database of Systematic Reviews, Issue 3, Art. No. CD004454 (2006). The included trials mostly post-dated the birth of the twins, and the alleged malpractice. The relevance of the trials to the causation of CP in infants who experienced periventricular leukomalacia (PVL) was hotly disputed, but for now, I will gloss over the external validity problem of the Cochrane meta-analysis.

The Cochrane Collaboration usually limits its meta-analyses to the highest quality evidence, or RCTs, but in this instance, the RCTs did not include CP among their primary pre-specified outcomes. Furthermore, the trials were generally designed to ascertain short-term benefits from ACS, and the data in the trials were uncertain with respect to longer-term outcomes, which may have been ascertained differentially. The trials were also generally small and were plagued by sparse data.  None of the individual trials was itself statistically significant at the 5 percent level.  The meta-analysis did not show a statistically significant decrease in CP from ACS treatment.  The authors reported:

“a trend towards fewer children having cerebral palsy (RR 0.60, 95% CI 0.34 to 1.03, five studies, 904 children, age at follow up two to six years in four studies, and unknown in one study).”

 Id. at 8 (emphasis added).

The Cochrane authors were appropriately cautious in interpreting the sparse data:

“Results suggest that antenatal corticosteroids result in less neurodevelopmental delay and possibly less cerebral palsy in childhood.”

Id. at 13-14 (emphasis added).

The quality of the trials included in the Cochrane meta-analysis varied, as did the trial methodologies.  Despite the strong clinical heterogeneity, the Cochrane authors performed their meta-analysis with a fixed-effect model. The confidence interval, which included 1.0, reflected a p-value of 0.065, but that p-value would have certainly increased if a more appropriate random-effects model had been used.
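
The 0.065 figure can be checked approximately from the published summary alone. The following back-of-the-envelope sketch uses a normal approximation on the log relative risk, so it will not exactly match the output of the Cochrane software:

```python
# Approximate two-sided p-value recovered from a reported risk ratio and 95% CI.
import math
from scipy.stats import norm

def p_from_ci(rr, lo, hi):
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # standard error on the log scale
    z = math.log(rr) / se
    return 2 * norm.sf(abs(z))

print(round(p_from_ci(0.60, 0.34, 1.03), 3))          # about 0.07, close to the reported 0.065
```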

Furthermore, the RCTs were often no better than observational studies on the CP outcome. The RCTs here perhaps should not have been relied upon to the apparent exclusion of observational epidemiology.

Relative Risk Less Than Two

There is much to be said about the handling of statistical significance, the Bayesian analysis, the arguments about causal inference, but for now, let us look at one of the clearest errors in the case:  the inference of specific causation from a relative risk less than two.  To be sure, the Cochrane meta-analysis reported a non-statistically significant 40% decrease, but if we were to look at this outcome in terms of the increase in risk of CP from the physician’s failure to administer ACS timely, then the risk ratio would be 1.67, or a 67% increase.  On either interpretation, fewer than half the cases of CP can be attributed to the failure to administer ACS fully and timely in the case.
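
The arithmetic connecting a relative risk to the “more likely than not” standard is straightforward. If the population-average risk ratio is applied to the individual claimant (which is precisely the application Willan urged), the probability that a given case is attributable to the exposure is (RR - 1)/RR, a quantity that reaches 50 percent only when the relative risk reaches 2:

```python
# Attributable fraction among the exposed: AF = (RR - 1) / RR.
def attributable_fraction(rr):
    return (rr - 1) / rr

# The Cochrane estimate (RR 0.60 for treated versus untreated), read the other
# way around, gives a risk ratio for untreated versus treated of 1 / 0.60.
rr_untreated = 1 / 0.60
print(round(rr_untreated, 2))                          # 1.67, the 67% increase noted above
print(round(attributable_fraction(rr_untreated), 2))   # 0.40, well short of the 0.5 needed
print(round(attributable_fraction(2.0), 2))            # 0.50, the doubling-of-risk threshold
```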

The parties tried their case before Justice Walters, in St. Catherines, Ontario. Goodman v. Viljoen, 2011 ONSC 821 (CanLII).  Justice Walters recognized that specific causation was essential and at the heart of the parties’ disagreement:

“[47] In order to succeed, the plaintiffs must establish that the failure to receive a full course of ACS materially affected the twins’ outcome. That is, they must establish that “but for” the failure to receive a full course of ACS, the twins would not have suffered from the conditions they now do, or that the severity of these afflictions would have been materially reduced.

[48] Not surprisingly, this was the most contentious issue at trial and the court heard a good deal of evidence with respect to the issue of causation.”

One of the defendant’s expert witnesses, Robert Platt, a professor of statistics at McGill University School of Medicine, testified, according to Justice Walters:

“[144] Dr. Platt also stated that the absolute risk in and of itself does not tell us anything about what might have happened in a specific case absent clinical and mechanistic explanations for that specific case.”

The plaintiffs’ expert witnesses apparently conceded the point.  Professor Andrew Willan, a statistician, testifying for the plaintiffs, attempted to brush Platt’s point aside by suggesting it would render clinical research useless, but that was hardly the point.  Platt embraced clinical research for what it could show about the “averages” in a sample of the population, even if we cannot discern causal efficacy retrospectively in a specific patient:

“[133] Dr. Willan also responded to Dr. Platt’s criticism that it was impossible to determine the distribution of the effect across the population. Professor Willan felt this issue was a red herring, and if it were valid, it would render most clinical research useless. There is really no way of knowing who will benefit from a treatment and who will not. Unless there are reasons to believe otherwise, it is best to apply the population average effect to each person.”

Although Willan labeled Platt’s point as cold-blooded and fishy, he ultimately concurred that the population average effect should be applied to each person in the absence of evidence of risk being sequestered in a subgroup.

A closer look at Willan’s testimony at trial is instructive. Willan acknowledged, on direct examination, that the plaintiffs were at increased risk, even if their mother had received a full course of ACS.  All he would commit to, on behalf of the plaintiffs, was that their risk would have been less had the ACS been given earlier:

“All we can say is that there’s a high probability that that risk would be reduced and that this is probably the best estimate of the excess risk for not being treated and I would say that puts that in the 70 percent range of excess risk and I would say the probability that the risk would have been reduced is into the 90 percentage points.”

Notes of Testimony of Andrew Willan at 62 (April 6, 2010).  The 90 percentage points reference here was Willan’s posterior probability that the claimed effect was real.

On cross-examination, the defense pressed the point:

Q. What you did not do in this, in this report, is provide any quantification for the reduction in the risk, true?

A. That’s correct.

Notes of Testimony of Andrew Willan at 35 (April 9, 2010)

Q. And you stated that there is no evidence that the benefits of steroids is restricted to any particular subgroup of patients?

A. I wasn’t given any. I haven’t seen any evidence of that.

Id. at 43.

Q. And what you’re suggesting with that statement, is that the statistics should be generally, should be considered by the court to be generally applicable, true?

A. That’s correct.

Id. at 44.

Q. But given your report, you can’t offer assistance on the clinical application to the statistics, true?

A. That’s true.

Id. at 46.

With these concessions in hand, defense counsel elicited the ultimate concession relevant to the “but for” standard of causation:

Q. And to do that by looking at an increase in risk, the risk ratio from the data must achieve 2 in order for there to be a 50 percent change in the underlying data, true?

A. Yeah, to double the risk, the risk ratio would have to be 2, to double the risk.

Id. at 63.

* * *

Q. So, none of this data achieves the threshold of a 50 percent change in the underlying data, whether you look at it as an increase in risk or …

A. Sure.

Q …. a decrease in risk …

A. Yeah.

Id. at 66.

Leaping Inferences

The legal standard for causation in Canada is the same counterfactual requirement that applies in most jurisdictions in the United States.  Goodman v. Viljoen, 2011 ONSC 821 (CanLII), at ¶14, 47. The trial court well understood that the plaintiffs’ evidence left them short of showing that their CP would not have occurred but for the delay in administering ACS. Remarkably, the court permitted the plaintiffs to use non-existent evidence to bridge the gap.

According to Dr. Max Perlman, plaintiffs’ expert witness on neonatology and pediatrics, CP is not a dichotomous condition, but a spectrum that is manifested on a continuum of signs and symptoms.  The RCTs relied upon had criteria for ascertaining CP and including it as an outcome.  The result of these criteria was that CP was analyzed as a binary outcome.  Dr. Perlman, however, held forth that “common sense and clinical experience” told him that CP is not a condition that is either present or not, but rather one that presents on a continuum. Id. at [74].

Without any evidence, Perlman testified that when CP is not avoided by ACS, “it is likely that it is less severe for those who do go on to develop it.” Id. [75].  Indeed, Perlman made the absence of evidence a claimed virtue; with all his experience and common sense, he “could not think of a single treatment which affects a basic biological process that has a yes or no effect; they are all on a continuum.” Id. From here, Perlman soared to his pre-specified conclusion that “that it is more likely than not that the twins would have seen a material advantage had they received the optimal course of steroids.” Id. at [76].

Perlman’s testimony is remarkable for inventing a non-existent feature of biological evidence:  everything is a continuum. Justice Walters could not resist this seductive testimony:

“[195] The statistical information is but one piece of the puzzle; one way of assessing the impact of ACS on CP. Notably, the 40% reduction in CP attributable to ACS represents an all or nothing proposal. In other words, 93.5% of the time, CP is reduced in its entirety by 40%. It was the evidence of Dr. Perlman, which I accept, that CP is not a black and white condition, and, like all biological processes, it can be scaled on a continuum of severity. It therefore follows that in those cases where CP is not reduced in its entirety, it is likely to be less severe for those who go on to develop it. Such cases are not reflected in the Cochrane figure.

[196] Since the figure of 40% represents an all or nothing proposal, it does not accurately reflect the total impact of ACS on CP. Based on this evidence, it is a logical  conclusion that if one were able to measure the total effect of ACS on CP, the statistical measure of that effect would be inflated beyond 40%.

[197] Unfortunately, this common sense conclusion has never and can never be tested by science. As Dr. Perlman testified, such a study would be impossible to conduct because it would require pre-identification of those persons who go on to develop CP.  Furthermore, because the short term benefits of ACS are now widely accepted, it would be unethical to withhold steroids to conduct further studies on long term outcomes.”

Doubly unfortunate, because Perlman’s argument was premised on a counterfactual assumption.  Many biological phenomena are dichotomous.  Pregnancy, for instance, does not admit of degrees.  Disease states are frequently dichotomous, and no evidence was presented that CP was not dichotomous. Threshold effects abound in living organisms. Perlman’s argument further falls apart when we consider that the non-experimental arm of the RCTs would also have had additional “less-severe” CP cases, with no evidence that they occurred disproportionately in the control arms of these RCTs. Furthermore, high-quality observational studies might have greater validity than post-hoc RCTs in this area, and there have been, and likely will continue to be, such studies to attempt better understanding of the efficacy of ACS, as well as differing effects among the various corticosteroids, doses, and patterns of administration.

On appeal, Justice Walters’ verdict for the plaintiffs was affirmed, but over a careful, thoughtful dissent. Goodman v. Viljoen, 2012 ONCA 896 (CanLII) (Doherty, J., dissenting). Justice Doherty caught the ultimate futility of Dr. Perlman’s opinion based upon non-existent evidence: even if there were additional sub-CP cases in the treatment arms of the RCTs, and if they occurred disproportionately more often in the treatment than in the placebo arms, we are still left guessing about the quantitative adjustment to make to the 40% decrease, doubtful as it was, which came from the Cochrane review.

Biostatistics and FDA Regulation: The Convergence of Science and Law

May 29th, 2014

On May 20, 2014, the Food and Drug Law Institute (FDLI), the Drug Information Association (DIA), and the Harvard Law School’s Petrie-Flom Center for Health Law Policy, Biotechnology, and Bioethics, in collaboration with the Harvard School of Public Health Department of Biostatistics and Harvard Catalyst | The Harvard Clinical and Translational Science Center, presented a symposium on “Biostatistics and FDA Regulation: The Convergence of Science and Law.”

The symposium might just as well have been described as the collision of science and law.

The Symposium agenda addressed several cutting-edge issues concerning statistical evidence in the law, whether criminal, civil, or regulatory. Names of presenters are hyperlinked to presentation slides, where available.

I. Coleen Klasmeier, of Sidley Austin LLP, introduced and moderated the first section, “Introduction to Statistics and Regulatory Law,” which focused on current biostatistical issues in regulation of drugs, devices, and foods by the Food and Drug Administration (FDA). Qi Jiang, Executive Director of Amgen, Robert T. O’Neill, retired from the FDA, and now Statistical Advisor in CDER, and Jerald S. Schindler, of Merck Research Laboratories, presented.

II. Qi Jiang moderated and introduced the second section on safety issues, and the difficulties presented by meta-analysis and other statistical assessments of safety outcomes in clinical trials and in marketing of drugs and devices. Lee-Jen Wei, of the Harvard School of Public Health, Geoffrey M. Levitt, an Associate General Counsel of Pfizer, Inc., and Janet Wittes, of the Statistics Collaborative, presented.

III. Aaron Katz, of Ropes & Gray LLP, introduced the third section, on “Statistical Disputes in Life Sciences Litigation,” which addressed recent developments in expert witness gatekeeping, the Avandia litigation, and the role of statistics in two recent cases, Matrixx, Inc. v. Siracusano, and United States v. Harkonen. Anand Agneshwar, of Arnold & Porter LLP, Lee-Jen Wei, Christina L. Diaz, Assistant General Counsel of GlaxoSmithKline, and Nathan A. Schachtman presented.

IV. Christopher Robertson, a law professor now visiting at Harvard Law School, moderated a talk by Robert O’Neill on “Emerging Issues,” at the FDA.

V. Dr. Wittes moderated a roundtable discussion on “Can We Handle the Truth,” which explored developments in First Amendment and media issues involved in regulation and litigation. Anand Agneshwar, and Freddy A. Jimenez, Assistant General Counsel, Johnson & Johnson, presented.

On The Quaint Notion That Gatekeeping Rules Do Not Apply to Judges

April 27th, 2014

In In re Zurn Pex Plumbing Prods. Liab. Litig., 644 F.3d 604 (8th Cir. 2011), the United States Court of Appeals for the Eighth Circuit rejected the defendant’s argument that a “full and conclusive” Rule 702 gatekeeping procedure was required before a trial court could certify a class action under the Federal Rules. The Circuit remarked that “[t]he main purpose of Daubert exclusion is to protect juries from being swayed by dubious scientific testimony,” an interest “not implicated at the class certification stage where the judge is the decision maker.”  Id. at 613.

Surely, one important purpose of Rule 702 is to protect juries against dubious scientific testimony, but judges are not universally less susceptible to dubious testimony.  There are many examples of judges being misled by fallacious scientific evidence, especially when tendentiously presented by advocates in court.  No jury need be present for dubious science testimony + “zealous” advocacy to combine to create major errors and injustice.  See, e.g., Wells v. Ortho Pharmaceutical Corp., 615 F. Supp. 262 (N.D. Ga. 1985) (rendering verdict for plaintiffs after bench trial), aff’d and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986); Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation § 6.5 n.3, at 271 (1997) (characterizing Wells as “notorious,” and noting that the case became a “lightning rod for the legal system’s ability to handle expert evidence.”).  Clearly Rule 702 does not exist only to protect juries.

Nemo iudex in causa sua! (No one should be a judge in his own cause.) Perhaps others should judge the competence of judges’ efforts at evaluating scientific evidence. At the very least, within the institutional framework of our rules of civil procedure and evidence, Rule 702 creates a requirement of structured inquiry into expert opinion testimony before the court. That gatekeeping inquiry, and its requirement of a finding, subject to later appellate review and to public and professional scrutiny, are crucial to the rendering of intellectual due process in cases that involve scientific and technical issues. The Eighth Circuit was unduly narrow in its statement of the policy bases for Rule 702, and their applicability to class certification.

The case of Obrey v. Johnson, 400 F.3d 691 (9th Cir. 2005), provides another cautionary tale about the inadequacies of judges in the evaluation of scientific and statistical evidence. The plaintiff, Mr. Obrey, sued the Navy on a claim of race discrimination in promoting managers at the Pearl Harbor Naval Shipyard. The district court refused plaintiff’s motion to admit the testimony of a statistician, Mr. James Dannemiller, President of SMS Research & Marketing Services, Inc. The district court also excluded much of plaintiff’s anecdotal evidence, and entered summary judgment. Id. at 691-93.

On appeal, Obrey claimed that Dannemiller’s report showed “a correlation between race and promotion.” Id. at 693. This vague claim seemed good enough for the Ninth Circuit, which reversed the district court’s grant of summary judgment and remanded for trial.

The Ninth Circuit’s opinion does not tell us what sort of correlation was supposedly shown by Mr. Dannemiller. Was it Pearson’s r?  Or Jaspen’s multi-serial coefficient? Spearman’s ρ?  Perhaps Kendall’s τ? Maybe the appellate court was using correlation loosely, and Mr. Dannemiller had conducted some other sort of statistical analysis. The district court’s opinion is not published and is not available on Westlaw.  It is all a mystery. More process is due the litigants and the public.

Even more distressing than the uncertainty as to the nature of the correlation is that the Ninth Circuit does not tell us what the correlation “effect size” was, or whether the correlation was statistically significant.  If the Circuit did not follow strict hypothesis testing, perhaps it might have told us the extent of random error in the so-called correlation.  The Circuit did not provide any information about the extent or the precision of the claim of a “correlation”; nor did the Circuit assess the potential for bias or confounding in Mr. Dannemiller’s analysis.
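
For readers who wish to see how much can turn on the choice of a correlation measure, here is a minimal sketch in Python, with invented promotion data (nothing in it reconstructs the Obrey record, which remains unpublished). It computes Pearson’s r, Spearman’s ρ, and Kendall’s τ, with their p-values, and an approximate 95 percent confidence interval for Pearson’s r by the Fisher z-transformation, the sort of disclosure one might have hoped to see from the Ninth Circuit.

    import numpy as np
    from scipy import stats

    # Hypothetical data, invented solely for illustration: 1 = minority, 0 = non-minority;
    # 1 = promoted, 0 = not promoted. Nothing here reconstructs the Obrey record.
    race     = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])
    promoted = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0])

    r, p_r   = stats.pearsonr(race, promoted)    # Pearson's r (here, the phi coefficient)
    rho, p_s = stats.spearmanr(race, promoted)   # Spearman's rho
    tau, p_t = stats.kendalltau(race, promoted)  # Kendall's tau

    # Approximate 95% confidence interval for Pearson's r, via the Fisher z-transformation.
    n = len(race)
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

    print(f"Pearson r    = {r:.2f} (p = {p_r:.2f}), 95% CI {lo:.2f} to {hi:.2f}")
    print(f"Spearman rho = {rho:.2f} (p = {p_s:.2f})")
    print(f"Kendall tau  = {tau:.2f} (p = {p_t:.2f})")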

Indeed, the Ninth Circuit seemed to suggest that Mr. Dannemiller never even showed a correlation; rather the court described Mr. Dannemiller as having opined that there was “no statistical evidence in these data that the selection process for GS-13 through GS-15 positions between 1999 and 2002 was unbiased with respect to race.” Id. at 694. Reading between the lines, it seems that the statistical evidence was simply inconclusive, and Mr. Dannemiller surreptitiously shifted the burden of proof and offered an opinion that the Navy had not ruled out bias. The burden, of course, was on Mr. Obrey to establish a prima facie case, but the appellate court glossed over this fatal gap in plaintiff’s evidence.

On appeal, the Navy pressed its objections to the relevance and reliability of Mr. Dannemiller’s opinions. Brief of the Navy, 2004 WL 1080083, at *1 (April 7, 2004). There seemed to be no dispute that Mr. Dannemiller’s “study” was based entirely upon “statistical disparities,” which failed to take into account education, experience, and training. Mr. Dannemiller appeared to have simplistically compared the racial makeup of the promoted workers, ignoring the Navy’s showing of the relevance of education, experience, and training. Id. at *13, 18.

The Ninth Circuit not only ignored the facts of the case, it ignored its own precedents. See Obrey v. Johnson, 400 F.3d at 696 (citing and quoting from Coleman v. Quaker Oats Co., 232 F.3d 1271, 1283 (9th Cir. 2000) (“Because [the statistics] failed to account for many factors pertinent to [the plaintiff], we conclude that the statistics are not enough to take this case to trial.”)). The court, in Obrey, made no effort to distinguish its treatment of the parties in Coleman, or to justify its decision as to why the unspecified, unquantified, mysterious statistical analysis of Mr. Dannemiller sufficed under Rule 702. The Circuit cryptically announced that “Obrey’s evidence was not rendered irrelevant under Rule 402 simply because it failed to account for the relative qualifications of the applicant pool.” Obrey, 400 F.3d at 695. Citing pre-Daubert decisions for the most part (such as Bazemore), the Ninth Circuit persuaded itself that Rule 702 requires nothing more than simple relevance. Had the Circuit taken even a cursory look at Bazemore, it would have seen that the case involved a much more elaborate multiple regression than whatever statistical analysis Mr. Dannemiller propounded. And the Ninth Circuit would have seen that even the Bazemore decision acknowledged that there may be

“some regressions so incomplete as to be inadmissible as irrelevant… .”

478 U.S. 385, 400 n.10 (1986). It is difficult to imagine a discrimination claim analysis more incomplete than one that did not address education, training, and experience.

Sadly, neither the Navy’s brief nor Mr. Obrey’s brief, 2004 WL 545873 (Feb. 4, 2004), provided any discussion of the nature, quality, findings, or limits of Mr. Dannemiller’s statistical analysis. The Navy’s brief referred to Mr. Dannemiller as a “purported” expert. His resume, available online, shows that Mr. Dannemiller studied history as an undergraduate and has a master’s degree in sociology. He is the president of SMS Research, a consulting company.

The taxpayers deserved better advocacy from the Department of Justice, and greater attention to statistical methodology from its appellate judges. See ATA Airlines, Inc. v. Federal Exp. Corp., 665 F.3d 882, 888-96 (7th Cir. 2011) (Posner, J.) (calling for lawyers and judges to do better in understanding and explaining, in plain English, the statistical analyses that are essential to their cases). Judges at every level need to pay greater attention to the precepts of Rule 702, even when there is no jury around to be snookered.

A Black Swan Case – Bayesian Analysis on Medical Causation

March 15th, 2014

Last month, I posted about an article that Professor Greenland wrote several years ago about his experience as a plaintiffs’ expert witness in a fenfluramine case. See “The Infrequency of Bayesian Analyses in Non-Forensic Court Decisions” (Feb. 16, 2014). Greenland chided a defense expert for having declared that Bayesian analyses are rarely or never used in analyzing clinical trials or in assessments of pharmaco-epidemiologic data. Greenland’s accusation of ludicrousness appeared mostly to blow back on him, but his strident advocacy of Bayesian analyses did raise the question whether such analyses have ever moved beyond random-match probability analyses in forensic evidence (DNA, fingerprint, paternity, etc.) or in screening and profiling cases. I searched Google Scholar and Westlaw for counter-examples and found none, but I did solicit references to “Black Swan” cases. Shortly after that post, I came across a website dedicated to collecting legal citations of cases in which Bayesian analyses were important, but the website appeared to confirm my initial research.

Some months ago, Professor Brian Baigrie, of the Jackman Humanities Institute, at the University of Toronto, invited me to attend a meeting of an Institute working group on The Reliability of Evidence in Science and the Law.  The Institute fosters interdisciplinary scholarship, and this particular working group has a mission statement close to my interests:

The object of this series of workshops is to formulate a clear set of markers governing the reliability of evidence in the life sciences. The notion of evidence is a staple in epistemology and the philosophy of science; the notion of this group will be the way the notion of ‘evidence’ is understood in scientific contexts, especially in the life sciences, and in judicial form as something that ensures the objectivity of scientific results and the institutions that produce these results.

The Reliability of Evidence in Science and the Law. The faculty on the working group represent the disciplines of medicine (Andrew Baines), philosophy (James R. Brown, Brian Baigrie), and law (Helena Likwornik, Hamish Stewart), with graduate students in environmental science (Amy Lemay), history & philosophy of science and technology (Karolyn Koestler, Gwyndaf Garbutt), and computer science (Maya Kovats).

Coincidentally, in preparation for the meeting, Professor Baigrie sent me links to a Canadian case, Goodman v. Viljoen, which turned out to be a black swan case! The trial court’s decision in this medical malpractice case focused mostly on a disputed claim of medical causation, on which the plaintiffs’ expert witnesses sponsored a Bayesian analysis of the available epidemiologic evidence; the defense experts maintained that causation was not shown, and countered that the proffered Bayesian analysis was unreliable. The trial court resolved the causation dispute in favor of the plaintiffs and their witnesses’ Bayesian approach. Goodman v. Viljoen, 2011 ONSC 821 (CanLII), aff’d, 2012 ONCA 896 (CanLII). The Court of Appeal’s affirmance was issued over a lengthy, thoughtful dissent. The Supreme Court of Canada denied leave to appeal.

Goodman was a medical malpractice case. Mrs. Goodman alleged that her obstetrician deviated from the standard of care by failing to prescribe corticosteroids sufficiently early in advance of delivery to avoid or diminish the risk of cerebral palsy in her twins. Damages were stipulated, and the breach of duty turned on a claim that Mrs. Goodman, in distress, called her obstetrician. Given the decade that passed between the event and the lawsuit, the obstetrician was unable to document a response. Duty and breach were disputed, but were not the focus of the trial.

The medical causation claim, in Goodman, turned upon a claim that the phone call to the obstetrician should have led to an earlier admission to the hospital, and the administration of antenatal corticosteroids. According to the plaintiffs, the corticosteroids would have, more probably than not, prevented the twins from developing cerebral palsy, or would have diminished the severity of their condition. The plaintiffs’ expert witnesses relied upon studies that suggested a 40% reduction in risk, and a probabilistic argument that they could infer from this risk ratio that the plaintiffs’ condition would have been avoided. The case thus raises the issue whether evidence of risk can substitute for evidence of causation. The Canadian court held that risk sufficed, and it went further, contrary to the majority of courts in the United States, to hold that a 40% reduction in risk sufficed to satisfy the more-likely-than-not standard. See, e.g., Samaan v. St. Joseph Hosp., 670 F.3d 21 (1st Cir. 2012) (excluding expert witness testimony based upon risk ratios too small to support an opinion that failure to administer intravenous tissue plasminogen activator (t-PA) to a patient caused serious stroke sequelae); see also “Federal Rule of Evidence 702 Requires Perscrutations — Samaan v. St. Joseph Hospital (2012)” (Feb. 4, 2012).
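
The arithmetic underlying the dispute is worth making explicit. If antenatal corticosteroids reduce the risk of cerebral palsy by roughly 40 percent, then, on the usual counterfactual reasoning, and setting aside bias and confounding, the probability that timely treatment would have averted any given untreated case is also roughly 40 percent, which falls short of the more-likely-than-not threshold. A minimal sketch, in Python, assuming the 40 percent figure is a relative risk reduction:

    # Hypothetical illustration of the risk-as-causation arithmetic in Goodman.
    risk_reduction = 0.40            # claimed relative risk reduction from antenatal steroids
    rr_treated = 1 - risk_reduction  # relative risk, treated versus untreated = 0.6

    # Probability that a given untreated case would have been avoided with treatment
    # (the attributable fraction among the untreated), ignoring bias and confounding.
    rr_untreated = 1 / rr_treated                  # about 1.67
    p_avoided = (rr_untreated - 1) / rr_untreated  # about 0.40

    print(f"P(injury avoided with timely treatment) = {p_avoided:.0%}")
    # A relative risk above 2.0 (a risk reduction above 50%) would be needed before
    # the avoided-injury probability exceeds the more-likely-than-not threshold.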

The Goodman courts, including the dissenting justice on the Ontario Court of Appeals, wrestled with a range of issues that warrant further consideration.  Here are some that come to mind from my preliminary read of the opinions:

1. Does evidence of risk suffice to show causation in a particular case?

2. If evidence of risk can show causation in a particular case, are there requirements that the magnitude of risk be quantified and of a sufficient magnitude to support the inference of causation in a particular case?

3. The judges and lawyers spoke of scientific “proof.”  When, if ever, is it appropriate to speak of scientific proof of a medical causal association?

4. Did the judges incorrectly dichotomize legal and scientific standards of causation?

5. Did the judges, by rejecting the need for “conclusive proof,” fail to articulate a meaningful standard for scientific evidence in any context, including judicial contexts?

6. What exactly does “the balance of probabilities” mean, especially in the face of non-quantitative evidence?

7. What is the relationship between “but for” and “substantial factor” standards of causation?

8. Can judges ever manage to define “statistical significance” correctly?

9. What is the role of “common sense” in drawing inferences by judges and expert witnesses in biological causal reasoning?  Is it really a matter of common sense that if a drug did not fully avert the onset of a disease, it would surely have led to a less severe case of the disease?

10. What is the difference between “effect size” and the measure of random or sampling error?

11. Is scientific certainty really a matter of being 95% certain, or is this just another manifestation of the transposition fallacy?

12. Are Bayesian analyses acceptable in judicial settings, and if so, what information about prior probabilities must be documented before posterior probabilities can be given by expert witnesses and accepted by courts? (A small numerical illustration of the prior’s influence appears after this list.)

13. Are secular or ecological trends sufficiently reliable data for expert witnesses to rely upon in court proceedings?

14. Is the ability to identify biological plausibility sufficient to excuse the lack of statistical significance and other factors that are typically needed to support the causality of a putative association?

15. What are the indicia of reliability of meta-analyses used in judicial proceedings?

16. Should courts give full citations to scientific articles that are heavily relied upon as part of the requirement that they publicly explain and justify their decisions?
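
On question 12, the practical concern is that a posterior probability of causation is only as good as the prior probability fed into it. The toy calculation below, with entirely invented numbers, shows how the same evidence, summarized as a likelihood ratio, yields very different posterior probabilities depending upon the prior an expert witness chooses; nothing in it is drawn from the Goodman record.

    # Toy Bayes calculation: posterior odds = prior odds x likelihood ratio.
    # The likelihood ratio and the priors are invented for illustration only.
    likelihood_ratio = 3.0  # how strongly the hypothetical evidence favors causation

    for prior in (0.10, 0.30, 0.50):
        prior_odds = prior / (1 - prior)
        posterior_odds = prior_odds * likelihood_ratio
        posterior = posterior_odds / (1 + posterior_odds)
        print(f"prior = {prior:.0%} -> posterior = {posterior:.0%}")

    # prior = 10% -> posterior = 25%
    # prior = 30% -> posterior = 56%
    # prior = 50% -> posterior = 75%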

These are some of the questions that came to mind on my first read of the Goodman case. The trial judge attempted to explain her decision in a fairly lengthy opinion. Unfortunately, the two judges of the Ontario Court of Appeal who voted to affirm did not write at length. Justice Doherty wrote a thoughtful dissent, but the Supreme Court of Canada denied leave to appeal. Many of the issues are not fully understandable from the opinions, but I hope to be able to read the underlying testimony before commenting.

Thanks to Professor Baigrie for the reference to this case.

“Judges and other lawyers must learn how to deal with scientific evidence and inference.”

March 1st, 2014

Late last year, a panel of the 7th Circuit reversed an Administrative Law Judge (ALJ) who had upheld a citation and fine against Caterpillar Logistics, Inc. (Cat). The panel, in a wonderfully succinct but meaty decision by Judge Easterbrook, wrote of the importance of judges’ and lawyers’ learning to deal with scientific and statistical evidence. Caterpillar Logistics, Inc. v. Perez, 737 F.3d 1117 (7th Cir. 2013).

Pseudonymous MK, a worker in Cat’s packing department, developed epicondylitis (tennis elbow). Id. at 1118. OSHA regulations require employers to record injuries when “the work environment either caused or contributed to the resulting condition.” 29 C.F.R. § 1904.5(a). MK’s work required her to remove items from containers and place items in shipping cartons. The work was repetitive, but MK acknowledged that the work involved little or no impact or force. Apparently, Cat gave some rather careful consideration to whether MK’s epicondylitis was work related; it assembled a panel of three specialists in musculoskeletal disorders and two generalists to consider the matter. The panel, relying upon NIOSH and AMA guidelines, rejected MK’s claim of work relatedness. Both the NIOSH and the AMA guidelines conclude that repetitive motion in the absence of weight or impact does not cause epicondylitis. Id.

MK called an expert witness, Dr. Robert Harrison, a clinical professor of medicine at the University of California, San Francisco. Id. at 1118-19. Harrison unequivocally attributed MK’s condition to her work at Cat, but he failed to explain why no one else in Cat’s packing department ever developed the condition. Id. at 1119.

Harrison acknowledged that epidemiologic evidence could confirm his opinion, but he denied that such evidence could disconfirm it. The ALJ echoed Dr. Harrison in holding epidemiologic evidence to be irrelevant:

“none of these [other] people are [sic] MK. Similar to the concept of the ‘eggshell skull’ plaintiff in civil litigation, you take your workers as they are.”

Id. at 1119-20 (citing the ALJ’s opinion, 2012 OSAHRC LEXIS 118, at *32).

Judge Easterbrook found this attempt to disqualify any opposing evidence to lie beyond the pale:

“Judges and other lawyers must learn how to deal with scientific evidence and inference.”

Id. (citing Jackson v. Pollion, 733 F.3d 786 (7th Cir. 2013)).

Judge Easterbrook called out the ALJ for misunderstanding the nature of epidemiology and the role of statistics, in the examination of causation of health outcomes that have a baseline incidence or prevalence in the population:

“The way to test whether Harrison is correct is to look at data from thousands of workers in hundreds of workplaces—or at least to look at data about hundreds of worker-years in Caterpillar’s own workplace. Any given worker may have idiosyncratic susceptibility, though there’s no evidence that MK does. But the antecedent question is whether Harrison’s framework is sound, and short of new discoveries about human physiology only statistical analysis will reveal the answer. Any large sample of workers will contain people with idiosyncratic susceptibilities; the Law of Large Numbers ensures that their experience is accounted for. If studies of large numbers of workers show that the incidence of epicondylitis on jobs that entail repetitive motion but not force is no higher than for people who do not work in jobs requiring repetitive motion, then Harrison’s view has been refuted.”

Id. at 1120.

Judge Easterbrook acknowledged that Cat’s workplace evidence may have been a sample too small from which to draw a valid statistical inference, given the low base rate of epicondylitis in the general population. Dr. Harrison’s and the ALJ’s stubborn refusal to consider any disconfirming evidence, however, obviated any need to consider sample size and statistical power.
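
Judge Easterbrook’s point about sample size can be made concrete with the standard two-proportion sample-size formula. The sketch below, in Python, uses invented rates (a 1 percent annual baseline incidence of epicondylitis and a hypothesized doubling to 2 percent among repetitive-motion workers); under those assumptions, detecting the difference with conventional error rates would require on the order of a couple of thousand worker-years in each group, which helps explain why Cat’s own workforce, standing alone, might be too small a sample.

    from math import sqrt
    from scipy.stats import norm

    # Illustrative rates only: a 1% annual incidence of epicondylitis at baseline, and a
    # hypothesized doubling to 2% if repetitive-motion work were causal.
    p1, p2 = 0.01, 0.02
    alpha, power = 0.05, 0.80  # two-sided alpha, desired power

    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2

    # Standard two-proportion sample-size formula (per group, in worker-years).
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar)) +
          z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2

    print(f"Roughly {n:.0f} worker-years per group would be needed")  # about 2,300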

Finally,  Judge Easterbrook chastised the ALJ for dismissing Cat’s experience as irrelevant because many other employers will not have sufficient workforces or record keeping to offer similar evidence.  In Judge Easterbrook’s words:

“This is irrational. If the camera in a police car captures the events of a high-speed chase, the judiciary would not ignore that video just because other police cars lack cameras; likewise, if the police record an interrogation, courts will consider that information rather than wait for the day when all interrogations are recorded.”

Id. This decision illustrates why some commentators at places such as the Center for Progressive Reform get their knickers in a knot over the prospect of applying the strictures of Rule 702 to agency fact finding; they know it will make a difference.

As for the “idiosyncratic gambit,” this argument is made all too frequently in tort cases, with a similar lack of predicate. Plaintiffs claim that there may be a genetic or epigenetic susceptibility in a very small subset of the population, and that epidemiologic studies may miss this small, sequestered risk. Right, and the light in the refrigerator may stay on when you close the door. Prove it!

The Infrequency of Bayesian Analyses in Non-Forensic Court Decisions

February 16th, 2014

Sander Greenland is a well-known statistician, and no stranger to the courtroom. I first encountered him as a plaintiffs’ expert witness in the silicone gel breast implant litigation, where he testified for plaintiffs in front of a panel of court-appointed expert witnesses (Drs. Diamond, Hulka, Kerkvliet, and Tugwell). Professor Greenland has testified for plaintiffs in vaccine, Neurontin, fenfluramine, anti-depressant, and other pharmaceutical cases. Although usually on the losing side, Greenland has written engaging post-mortems of several litigations, to attempt to vindicate the positions he took, or to deconstruct positions taken by adversary expert witnesses.

In one attempt to “correct the record,” Greenland criticized a defense expert witness for stating that Bayesian methods are rarely used in medicine or in the regulation of medicines. Sander Greenland, “The Need for Critical Appraisal of Expert Witnesses in Epidemiology and Statistics,” 39 Wake Forest Law Rev. 291, 306 (2004). According to Greenland, his involvement as a plaintiff’s expert witness in a fenfluramine case allowed him to observe a senior professor at Yale University, who served as Wyeth’s statistics expert, make a “ludicrous claim,” id. (emphasis added), that

“the Bayesian method is essentially never used in the medical literature or in the regulatory environments (such as the FDA) for interpreting study results. . . .”

Id. (quoting from Supplemental Affidavit of Prof. Robert Makuch, App. Ex. 114, ¶5, in Smith v. Wyeth-Ayerst Labs., 278 F.Supp. 2d 684 (W.D.N.C. 2003)). Greenland criticizes Professor Makuch’s affidavit as “provid[ing] another disturbing case study of misleading expert testimony regarding current standards and practice.” 39 Wake Forest Law Rev. at 306.

“Ludicrous,” “disturbing,” “misleading,” and “demonstrably quite false”?  Really?

Greenland notes, as a matter of background, that many leading statisticians recommend and adopt Bayesian statistics. Id. (citing works by Donald Berry, George Box, Bradley Carlin, Andrew Gelman, James Berger, and others). Remarkably, however, Greenland failed to cite a single new or supplemental drug application, or even one FDA summary of safety or efficacy, or FDA post-market safety or efficacy review. At the time Greenland was preparing his indictment, there really was little or no evidence of FDA’s embrace of Bayesian methodologies. Six years later, in 2010, the agency did promulgate a guidance that set recommended practices for Bayesian analyses in medical device trials. FDA Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials (February 5, 2010); 75 Fed. Reg. 6209 (February 8, 2010); see also Laura A. Thompson, “Bayesian Methods for Making Inferences about Rare Diseases in Pediatric Populations” (2010); Greg Campbell, “Bayesian Statistics at the FDA: The Trailblazing Experience with Medical Devices” (Presentation given by the Director, Division of Biostatistics, Center for Devices and Radiological Health, at Rutgers Biostatistics Day, April 3, 2009). Even today, Bayesian analysis remains uncommon at the U.S. FDA.

Having ignored the regulatory arena, Greenland purported to do a study of the biomedical journals, “to check the expert’s claim in detail.” 39 Wake Forest Law Rev. at 306. Greenland searched on the word “Bayesian” in the Journal of Clinical Oncology for issues published from 1994-2003, and “found over fifty publications that contain the word in that journal alone.” Greenland does not tell us why he selected this one journal, which was not in the subject matter area of the litigation in which he was serving as a partisan expert witness. For most of the time surveyed, the Journal of Clinical Oncology published 24 issues a year, and occasional supplements. Most volumes contained over 4,000 pages per year. Finding 50 uses of the word “Bayesian” in over 40,000 pages hardly constitutes resounding evidence to support his charges of “ludicrous,” “misleading,” “disturbing,” and “quite false.” Greenland further tells us that looking at these 50 or so articles “revealed several,” which “had used Bayesian methods to explore statistically nonsignificant results.” 39 Wake Forest Law Rev. at 306-07 & n.61 (citing only one paper, Lisa Licitra et al., Primary Chemotherapy in Resectable Oral Cavity Squamous Cell Cancer: A Randomized Controlled Trial, 21 J. Clin. Oncol. 327 (2003)). So in over 40,000 pages, Greenland found “several” Bayesian analyses, apparently post hoc looks to explore results that did not achieve pre-specified levels of statistical significance. Given the historical evolution of Bayesian analyses at FDA, and Greenland’s own evidence, the posterior odds that Greenland was correct in his charges seem disturbingly low.

Greenland tells us that the number of Bayesian analyses could be increased by looking at additional journals, and at the Bayesian textbooks he cites. No doubt this is true, as is his statement that respected statisticians, in prestigious journals, have called for Bayesian analyses to replace frequentist methods. Of course, by increasing the scope of his survey, Greenland would also dramatically increase the denominator of total journal papers employing statistical methods. Odds are that the frequency would remain very low. Greenland’s empirical evidence hardly contradicts his bête noire’s purely descriptive statement about the infrequent use of Bayesian analysis in biomedical journals and in regulatory applications.

Before lodging charges of ludicrousness, Greenland might have presented a more balanced view drawn from more carefully conducted surveys of the biomedical literature in the relevant time period. See, e.g., J. Martin Bland & Douglas G. Altman, “Bayesians and frequentists,” 317 Brit. Med. J. 1151, 1151 (1998) (“almost all the statistical analyses which appear in the British Medical Journal are frequentist”); David S. Moore, “Bayes for Beginners? Some Reasons to Hesitate,” 51 The Am. Statistician 254, 254 (1997) (“Bayesian methods are relatively rarely used in practice”); J.D. Emerson & Graham Colditz, “Use of statistical analysis in the New England Journal of Medicine,” in John Bailar & Frederick Mosteller, eds., Medical Uses of Statistics 45 (1992) (surveying 115 original research studies for statistical methods used; no instances of Bayesian approaches counted); Douglas Altman, “Statistics in Medical Journals: Developments in the 1980s,” 10 Statistics in Medicine 1897 (1991); B.S. Everitt, “Statistics in Psychiatry,” 2 Statistical Science 107 (1987) (finding only one use of Bayesian methods in 441 papers with statistical methodology).

Perhaps the balance between frequentist and Bayesian analysis is shifting today, but when Professor Makuch made his affidavit in 2002 or so, he was clearly correct, factually and statistically.

In the legal arena, Bayesian analyses are frequently used in evaluating forensic claims about DNA, paternity, lead isotopes, and other issues of identification. Remarkably, Bayesian analyses play virtually no role in litigation of health-effects claims, whether based upon medicines or upon occupational or environmental exposures. In searching Google Scholar and Westlaw, I found no such cases outside of forensics. Citations to black-swan cases are welcomed.