TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Failed Gatekeeping in Ambrosini v. Labarraque (1996)

December 28th, 2017

The Ambrosini case straddled the Supreme Court’s 1993 Daubert decision. The case began before the Supreme Court clarified the federal standard for expert witness gatekeeping, and ended in the Court of Appeals for the District of Columbia, after the high court adopted the curious notion that scientific claims should be based upon reliable evidence and valid inferences. That notion has only slowly and inconsistently trickled down to the lower courts.

Given that Ambrosini was litigated in the District of Columbia, where the docket is dominated by regulatory controversies, frequently involving dubious scientific claims, no one should be surprised that the D.C. Court of Appeals did not see that the Supreme Court had read “an exacting standard” into Federal Rule of Evidence 702. And so, we see, in Ambrosini, this Court of Appeals citing and purportedly applying its own pre-Daubert decision in Ferebee v. Chevron Chem. Co., 552 F. Supp. 1297 (D.D.C. 1982), aff’d, 736 F.2d 1529 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).1 In 2000, Federal Rule of Evidence 702 was revised in a way that extinguishes the precedential value of Ambrosini and the broad dicta of Ferebee, but some courts and commentators have failed to stay abreast of the law.

Escolastica Ambrosini was using a synthetic progestin birth control, Depo-Provera, as well as an anti-nausea medication, Bendectin, when she became pregnant. The child that resulted from this pregnancy, Teresa Ambrosini, was born with malformations of her face, eyes, and ears, cleft lip and palate, and vertebral malformations. About three percent of all live births in the United States have a major malformation. Perhaps because the Divine Being has sovereign immunity, Escolastica sued the manufacturers of Bendectin and Depo-Provera, as well as the prescribing physician.

The causal claims were controversial when made, and they still are. The progestin at issue, medroxyprogesterone acetate (MPA), was embryotoxic in the cynomolgus monkey2, but not in the baboon3. The evidence in humans was equivocal at best, and involved mostly genital malformations4; the epidemiologic evidence for the MPA causal claim to this day remains unconvincing5.

At the close of discovery in Ambrosini, Upjohn (the manufacturer of the progestin) moved for summary judgment, with a supporting affidavit of a physician and geneticist, Dr. Joe Leigh Simpson. In his affidavit, Simpson discussed three epidemiologic studies, as well as other published papers, in support of his opinion that the progestin at issue did not cause the types of birth defects manifested by Teresa Ambrosini.

Ambrosini had disclosed two expert witnesses, Dr. Allen S. Goldman and Dr. Brian Strom. Neither Goldman nor Strom bothered to identify the papers, studies, data, or methodology used in arriving at an opinion on causation. Not surprisingly, the district judge was unimpressed with their opposition, and granted summary judgment for the defendant. Ambrosini v. Labarraque, 966 F.2d 1462, 1466 (D.C. Cir. 1992).

The plaintiffs appealed on the remarkable ground that Goldman’s and Strom’s crypto-evidence satisfied Federal Rule of Evidence 703. Even more remarkably, the Circuit, in a strikingly unscholarly opinion by Judge Mikva, opined that disclosure of relied-upon studies was not required for expert witnesses under Rules 703 and 705. Judge Mikva seemed to forget that the opinions being challenged were not given in testimony, but in (late-filed) affidavits that had to satisfy the requirements of Federal Rule of Civil Procedure 26. Id. at 1468-69. At trial, an expert witness may express an opinion without identifying its bases, but of course the adverse party may compel disclosure of those bases. In discovery, the proffered expert witness must supply all opinions and the evidence relied upon in reaching those opinions. In any event, the Circuit remanded the case for a hearing and further proceedings, at which the two challenged expert witnesses, Goldman and Strom, would have to identify the bases of their opinions. Id. at 1471.

Not long after the case landed back in the district court, the Supreme Court decided Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). With an order to produce entered, plaintiffs’ counsel could no longer hide Goldman and Strom’s evidentiary bases, and their scientific inferences came under judicial scrutiny.

Upjohn moved again to exclude Goldman and Strom’s opinions. The district court upheld Upjohn’s challenges, and granted summary judgment in favor of Upjohn for the second time. The Ambrosinis appealed again, but the second case in the D.C. Circuit resulted in a split decision, with the majority holding that the exclusion of Goldman and Strom’s opinions under Rule 702 was erroneous. Ambrosini v. Labarraque, 101 F.3d 129 (D.C. Cir. 1996).

Although issued two decades ago, the majority’s opinion remains noteworthy as an example of judicial resistance to the existence and meaning of the Supreme Court’s Daubert opinion. The majority opinion uncritically cited the notorious Ferebee6 and other pre-Daubert decisions. The court embraced the Daubert dictum about gatekeeping being limited to methodologic considerations, and then proceeded to interpret methodology as superficially as necessary to sustain admissibility. If an expert witness claimed to have looked at epidemiologic studies, and epidemiology was an accepted methodology, then the opinion of the expert witness satisfied the legal requirements of Daubert, or so it would seem from the opinion of the U.S. Court of Appeals for the District of Columbia.

Despite the majority’s hand waving, a careful reader will discern that there must have been substantial gaps and omissions in the explanations and evidence cited by plaintiffs’ expert witnesses. Seeing anything clearly in the Circuit’s opinion is made difficult, however, by careless and imprecise language, such as its descriptions of studies as showing, or not showing “causation,” when it could have meant only that such studies showed associations, with more or less random and systematic error.

Dr. Strom’s report addressed only general causation, and even so, he apparently did not address general causation of the specific malformations manifested by the plaintiffs’ child. Strom claimed to have relied upon the “totality of the data,” but his methodologic approach seems to have required him to dismiss studies that failed to show an association.

“Dr. Strom first set forth the reasoning he employed that led him to disagree with those studies finding no causal relationship [sic] between progestins and birth defects like Teresa’s. He explained that an epidemiologist evaluates studies based on their ‘statistical power’. Statistical power, he continued, represents the ability of a study, based on its sample size, to detect a causal relationship. Conventionally, in order to be considered meaningful, negative studies, that is, those which allege the absence of a causal relationship, must have at least an 80 to 90 percent chance of detecting a causal link if such a link exists; otherwise, the studies cannot be considered conclusive. Based on sample sizes too small to be reliable, the negative studies at issue, Dr. Strom explained, lacked sufficient statistical power to be considered conclusive.”

Id. at 136-37.

Putting aside the problem of suggesting that an observational study detects a “causal relationship,” as opposed to an association in need of further causal evaluation, the Court’s précis of Strom’s testimony on power is troublesome, and typical of how other courts have misunderstood and misapplied the concept of statistical power. Statistical power is the probability of observing an association of at least a specified size, at a specified level of statistical significance, given that the alternative hypothesis is true. The calculation of statistical power turns, indeed, on sample size, the significance probability preselected as the threshold for “statistical significance,” an assumed probability distribution of the sample, and, critically, an alternative hypothesis. Without a specified alternative hypothesis, the notion of statistical power is meaningless, regardless of what probability (80%, 90%, or some other percentage) is sought for detecting the alternative hypothesis. Furthermore, the notion that the defense must adduce studies with “sufficient statistical power to be considered conclusive” creates an unscientific standard that can never be met, while subverting the law’s requirement that the claimant establish causation.
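The dependence of power on a specified alternative can be made concrete with a short calculation. The sketch below uses invented numbers (not the Ambrosini record) to approximate the power of a two-sided, two-sample comparison of proportions at the 5% level; the same study, with the same sample size, has very different power depending on which alternative hypothesis is specified.

```python
from math import sqrt, erf

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_proportions(p0, p1, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided z-test (alpha = 0.05) to detect
    a shift from baseline rate p0 to a specified alternative rate p1."""
    se = sqrt(p0 * (1 - p0) / n_per_group + p1 * (1 - p1) / n_per_group)
    return phi(abs(p1 - p0) / se - z_crit)

# Same sample size, same significance level; only the alternative changes.
# Baseline malformation rate of 3% is a hypothetical illustration.
weak = power_two_proportions(0.03, 0.045, 500)   # modest hypothesized excess
strong = power_two_proportions(0.03, 0.09, 500)  # large hypothesized excess
print(f"power against the modest alternative: {weak:.2f}")
print(f"power against the large alternative:  {strong:.2f}")
```

A study thus never simply “has” 80 or 90 percent power; it has that power only against a stated alternative, which is why the court’s unanchored talk of power was empty.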

The suggestion that the studies that failed to find an association cannot be considered conclusive because they “lacked sufficient statistical power” is troublesome because it distorts and misapplies the very notion of statistical power. No attempt was made to describe the confidence intervals surrounding the point estimates of the null studies; nor was there any discussion whether the studies could be aggregated to increase their power to rule out meaningful associations.
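What the missing discussion might have looked like can be sketched with invented numbers. The snippet below computes a Woolf (logit) 95% confidence interval around each hypothetical null study’s odds ratio, and then pools the studies by inverse-variance weighting on the log-odds-ratio scale; the counts and the fixed-effect method are illustrative assumptions, not anything taken from the case record.

```python
# Invented 2x2 counts for illustration only -- not the Ambrosini evidence.
from math import log, exp, sqrt

def or_ci(a, b, c, d, z=1.96):
    """Woolf (logit) 95% CI for a 2x2 table:
    a, b = exposed cases/controls; c, d = unexposed cases/controls."""
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

def pooled_or(tables, z=1.96):
    """Fixed-effect, inverse-variance pooling on the log-OR scale."""
    num = den = 0.0
    for a, b, c, d in tables:
        w = 1.0 / (1/a + 1/b + 1/c + 1/d)      # 1 / var(log OR)
        num += w * log((a * d) / (b * c))
        den += w
    lor, se = num / den, sqrt(1.0 / den)
    return exp(lor), exp(lor - z * se), exp(lor + z * se)

# Two small "null" studies, each alone with a wide interval crossing 1.0:
studies = [(8, 92, 10, 110), (6, 94, 9, 121)]
for s in studies:
    print("single study OR (95%% CI): %.2f (%.2f-%.2f)" % or_ci(*s))
print("pooled OR (95%% CI):       %.2f (%.2f-%.2f)" % pooled_or(studies))
```

The pooled interval is narrower than either study’s alone; whether it is narrow enough to rule out meaningful associations is exactly the question the opinion never reached.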

The Circuit court’s scientific jurisprudence was thus seriously flawed. Without a discussion of the end points observed, the relevant point estimates of risk ratios, and the confidence intervals, the reader cannot assess the strength of the claims made by Goldman and Strom, or by defense expert Simpson, in their reports. Without identifying the study endpoints, the reader cannot evaluate whether the plaintiffs’ expert witnesses relied upon relevant outcomes in formulating their opinions. The court viewed the subject matter from 30,000 feet, passing over at 600 mph, without engagement or care. A strong dissent, however, suggested serious mischaracterizations of the plaintiffs’ evidence by the majority.

The only specific causation testimony to support plaintiff’s claims came from Goldman, in what appears to have been a “differential etiology.” Goldman purported to rule out a genetic cause, even though he had not conducted a critical family history or ordered a state-of-the-art chromosomal study. Id. at 140. Of course, nothing in a differential etiology approach would allow a physician to rule out “unknown” causes, which, for birth defects, make up the most prevalent and likely causes to explain any particular case. The majority acknowledged that these were shortcomings, but rhetorically characterized them as substantive, not methodologic, and therefore as issues for cross-examination, not for consideration in judicial gatekeeping. All this is magical thinking, but it continues to infect judicial approaches to specific causation. See, e.g., Green Mountain Chrysler Plymouth Dodge Jeep v. Crombie, 508 F. Supp. 2d 295, 311 (D.Vt. 2007) (citing Ambrosini for the proposition that “the possibility of uneliminated causes goes to weight rather than admissibility, provided that the expert has considered and reasonably ruled out the most obvious”). In Ambrosini, however, Dr. Goldman had not ruled out much of anything.

Circuit Judge Karen LeCraft Henderson dissented in a short, but pointed opinion that carefully marshaled the record evidence. Drs. Goldman and Strom had relied upon a study by Greenberg and Matsunaga, whose data failed to show a statistically significant association between MPA and cleft lip and palate, when the crucial issue of timing of exposure was taken into consideration. Ambrosini, 101 F.3d at 142.

Beyond the specific claims and evidence, Judge Henderson anticipated the subsequent Supreme Court decisions in Joiner, Kumho Tire, and Weisgram, and the year 2000 revision of Rule 702, in noting that the majority’s acceptance of glib claims to have used a “traditional methodology” would render Daubert nugatory. Id. at 143-45 (characterizing Strom and Goldman’s methodologies as “wispish”). Even more importantly, Judge Henderson refused to indulge the assumption that somehow the length of Goldman’s C.V. substituted for evidence that his methods satisfied the legal (or scientific) standard of reliability. Id.

The good news is that little or nothing in Ambrosini survives the 2000 amendment to Rule 702. The bad news is that not all federal judges seem to have noticed, and that some commentators continue to cite the case approvingly.

Probably no commentator has embraced Ambrosini as promiscuously, or as warmly, as Carl Cranor, a philosopher, and occasional expert witness for the lawsuit industry, in several publications and presentations.8 Cranor has been particularly enthusiastic about Ambrosini’s approval of expert witness testimony that failed to address “the relative risk between exposed and unexposed populations of cleft lip and palate, or any other of the birth defects from which [the child] suffers,” as well as differential etiologies that exclude nothing.9 Somehow Cranor, like the majority in Ambrosini, believes that testimony that fails to identify the magnitude of the point estimate of relative risk can “assist the trier of fact to understand the evidence or to determine a fact in issue.”10 Of course, without that magnitude, the trier of fact could not evaluate the strength of the alleged association; nor could the trier assess the probability of individual causation for the plaintiff. Cranor also has written approvingly of lumping unrelated end points, which defeats the assessment of biological plausibility and coherence by the trier of fact. When the defense expert witness in Ambrosini adverted to the point estimates for relevant end points, the majority, with Cranor’s approval, rejected the null findings as “too small to be significant.”11 If the null studies were, in fact, too small to be useful tests of the plaintiffs’ claims, intellectual and scientific honesty required an acknowledgment that the evidentiary display was not one from which a reasonable scientist would draw a causal conclusion.


1 Ambrosini v. Labarraque, 101 F.3d 129, 138-39 (D.C. Cir. 1996) (citing and applying Ferebee), cert. dismissed sub nom. Upjohn Co. v. Ambrosini, 117 S.Ct. 1572 (1997). See also David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27, 31 (2013).

2 S. Prahalada, E. Carroad, M. Cukierski, and A.G. Hendrickx, “Embryotoxicity of a single dose of medroxyprogesterone acetate (MPA) and maternal serum MPA concentrations in cynomolgus monkey (Macaca fascicularis),” 32 Teratology 421 (1985).

3 S. Prahalada, E. Carroad, and A.G. Hendrickx, “Embryotoxicity and maternal serum concentrations of medroxyprogesterone acetate (MPA) in baboons (Papio cynocephalus),” 32 Contraception 497 (1985).

4 See, e.g., Z. Katz, M. Lancet, J. Skornik, J. Chemke, B.M. Mogilner, and M. Klinberg, “Teratogenicity of progestogens given during the first trimester of pregnancy,” 65 Obstet Gynecol. 775 (1985); J.L. Yovich, S.R. Turner, and R. Draper, “Medroxyprogesterone acetate therapy in early pregnancy has no apparent fetal effects,” 38 Teratology 135 (1988).

5 G. Saccone, C. Schoen, J.M. Franasiak, R.T. Scott, and V. Berghella, “Supplementation with progestogens in the first trimester of pregnancy to prevent miscarriage in women with unexplained recurrent miscarriage: a systematic review and meta-analysis of randomized, controlled trials,” 107 Fertil. Steril. 430 (2017).

6 Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1535 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).

7 Dr. Strom was also quoted as having provided a misleading definition of statistical significance: “whether there is a statistically significant finding at greater than 95 percent chance that it’s not due to random error.” Ambrosini, 101 F.3d at 136. Given the majority’s inadequate description of the record, the description of witness testimony may not be accurate, and error cannot properly be allocated.

8 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 320, 327-28 (2006); see also Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 238 (2d ed. 2016).

9 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 320 (2006).

10 Id.

11 Id. ; see also Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 238 (2d ed. 2016).

Gatekeeping of Expert Witnesses Needs a Bair Hug

December 20th, 2017

For every Rule 702 (“Daubert”) success story, there are multiple gatekeeping failures. See David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27 (2013).1 Exemplars of inadequate expert witness gatekeeping in state or federal court abound, and overwhelm the bar. The only solace one might find is that the abuse-of-discretion appellate standard of review keeps the bad decisions from precedentially outlawing the good ones.

Judge Joan Ericksen recently provided another Berenstain Bears’ example of how not to keep the expert witness gate, in litigation claiming that the Bair Hugger forced air warming devices (“Bair Huggers”) cause infections. In re Bair Hugger Forced Air Warming, MDL No. 15-2666, 2017 WL 6397721 (D. Minn. Dec. 13, 2017). Although Her Honor properly cited and quoted Rule 702 (2000), a new standard is announced in a bold heading:

“Under Federal Rule of Evidence 702, the Court need only exclude expert testimony that is so fundamentally unsupported that it can offer no assistance to the jury.”

Id. at *1. This new standard thus permits largely unsupported opinion that can offer bad assistance to the jury. As Judge Ericksen demonstrates, this new standard, which has no warrant in the statutory text of Rule 702 or its advisory committee notes, allows expert witnesses to rely upon studies that have serious internal and external validity flaws.

Jonathan Samet, a specialist in pulmonary medicine, not infectious disease or statistics, is one of the plaintiffs’ principal expert witnesses. Samet relies in large measure upon an observational study2, which purports to find an increased odds ratio for use of the Bair Hugger among infection cases in one particular hospital. The defense epidemiologist, Jonathan B. Borak, criticized the McGovern observational study on several grounds, including that the study was highly confounded by the presence of other known infection risks. Id. at *6. Judge Ericksen characterized Borak’s opinion as an assertion that the McGovern study was an “insufficient basis” for the plaintiffs’ claims. A fair reading of even Judge Ericksen’s précis of Borak’s proffered testimony requires the conclusion that Borak’s opinion was that the McGovern study was invalid because of data collection errors and confounding. Id.

Judge Ericksen’s judicial assessment, taken from the disagreement between Samet and Borak, is that there are issues with the McGovern study, which go to “weight of the evidence.” This finding obscures, however, that there were strong challenges to the internal and external validity of the study. Drawing causal inferences from an invalid observational study is a methodological issue, not a weight-of-the-evidence problem for the jury to resolve. This MDL opinion never addresses the Rule 703 issue, whether an epidemiologic expert would reasonably rely upon such a confounded study.
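Confounding of the sort Borak described is a validity problem, not merely a matter of weight. A toy calculation (wholly invented counts, not the McGovern data) shows how a null association can masquerade as an elevated crude odds ratio when another infection risk factor is unevenly distributed between the exposed and unexposed groups; the Mantel-Haenszel stratified summary recovers the null.

```python
# Invented counts illustrating confounding -- not the McGovern study data.
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases/non-cases; c, d = unexposed."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across confounder strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical: high-risk patients are mostly exposed; low-risk mostly not.
high_risk = (30, 70, 6, 14)     # stratum OR = 1.0
low_risk = (5, 95, 20, 380)     # stratum OR = 1.0
crude = odds_ratio(30 + 5, 70 + 95, 6 + 20, 14 + 380)
adjusted = mantel_haenszel_or([high_risk, low_risk])
print(f"crude OR: {crude:.2f}")          # spuriously elevated
print(f"MH-adjusted OR: {adjusted:.2f}")  # the null association
```

Whether a causal inference may reasonably be drawn from a study that never controlled for such a factor is a methodological question, which is why handing it to the jury as “weight” misses the point.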

The defense proffered the opinion of Theodore R. Holford, who criticized Dr. Samet for drawing causal inferences from the McGovern observational study. Holford, a professor of biostatistics at Yale University’s School of Public Health, analyzed the raw data behind the McGovern study. Id. at *8. The plaintiffs challenged Holford’s opinions on the ground that he relied on data in “non-final” form, from a temporally expanded dataset. Even more intriguingly, given that the plaintiffs did not present a statistician expert witness, plaintiffs argued that Holford’s opinions should be excluded because

(1) he insufficiently justified his use of a statistical test, and

(2) he “emphasizes statistical significance more than he would in his professional work.”

Id.

The MDL court dismissed the plaintiffs’ challenge on the mistaken conclusion that the alleged contradictions between Holford’s practice and his testimony “impugn his credibility at most.” If there were truly such a deviation from the statistical standard of care, the issue is methodological, not a credibility issue of whether Holford was telling the truth. And as for the alleged over-emphasis on statistical significance, the MDL court again falls back to the glib conclusions that the allegation goes to the weight, not the admissibility, of expert witness opinion testimony, and that plaintiffs can elicit testimony from Dr. Samet as to how and why Professor Holford over-emphasized statistical significance. Id. Inquiring minds, at the bar, and in the academy, are left with no information about what the real issues are in the case.

Generally, both sides’ challenges to expert witnesses were denied.3 The real losers, however, were the scientific and medical communities, bench, bar, and general public. The MDL court glibly and incorrectly treated methodological issues as “credibility” issues, confused sufficiency with validity, and banished methodological failures to consideration by the trier of fact for “weight.” Confounding was mistreated as simply a debating point between the parties’ expert witnesses. The reader of Judge Ericksen’s opinion never learns what statistical test was used by Professor Holford, what justification was needed but allegedly absent for the test, why the justification was contested, and what other test was alleged by plaintiffs to have been a “better” statistical test. As for the emphasis given statistical significance, the reader is left in the dark about exactly what that emphasis was, and how it led to Holford’s conclusions and opinions, and what the proper emphasis should have been.

Eventually appellate review of the Bair Hugger MDL decision must turn on whether the district court abused its discretion. Although appellate courts give trial judges discretion to resolve Rule 702 issues, the appellate courts cannot reach reasoned decisions when the inferior courts fail to give even a cursory description of what the issues were, and how and why they were resolved as they were.


2 P. D. McGovern, M. Albrecht, K. G. Belani, C. Nachtsheim, P. F. Partington, I. Carluke, and M. R. Reed, “Forced-Air Warming and Ultra-Clean Ventilation Do Not Mix: An Investigation of Theatre Ventilation, Patient Warming and Joint Replacement Infection in Orthopaedics,” 93 J. Bone & Joint Surg. Br. 1537 (2011). The article as published contains no disclosures of potential or actual conflicts of interest. A persistent rumor has it that the investigators were funded by a commercial rival to the manufacturer of the Bair Hugger at issue in Judge Ericksen’s MDL. See generally Melissa D. Kellam, Loraine S. Dieckmann, and Paul N. Austin, “Forced-Air Warming Devices and the Risk of Surgical Site Infections,” 98 Ass’n periOperative Registered Nurses (AORN) J. 354 (2013).

3 A challenge to plaintiffs’ expert witness Yadin David was sustained to the extent he sought to offer opinions about the defendant’s state of mind. Id. at *5.

Multiplicity in the Third Circuit

September 21st, 2017

In Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283 (W. D. Pa.), plaintiffs claimed that their employer’s reduction in force unlawfully targeted workers over 50 years of age. Plaintiffs lacked any evidence of employer animus against old folks, and thus attempted to make out a statistical disparate impact claim. The plaintiffs placed their chief reliance upon an expert witness, Michael A. Campion, to analyze a dataset of workers agreed to have been the subject of the R.I.F. For the last 30 years, Campion has been on the faculty at Purdue University. His academic training and graduate degrees are in industrial and organizational psychology. Campion has served as an editor of Personnel Psychology, and is a past president of the Society for Industrial and Organizational Psychology. Campion’s academic website page notes that he manages a small consulting firm, Campion Consulting Services1.

The defense sought to characterize Campion as not qualified to offer his statistical analysis2. Campion did, however, have some statistical training as part of his master’s level training in psychology, and his professional publications did occasionally involve statistical analyses. To be sure, Campion’s statistical acumen paled in comparison with that of the defense expert witness, James Rosenberger, a fellow and a former vice president of the American Statistical Association, as well as a full professor of statistics at Pennsylvania State University. The threshold for qualification, however, is low, and the defense’s attack on Campion’s qualifications failed to attract the court’s serious attention.

On the merits, the defense subjected Campion to a strong challenge on whether he had misused data. The defense’s expert witness, Prof. Rosenberger, filed a report that questioned Campion’s data handling and statistical analyses. The defense claimed that Campion had engaged in questionable data manipulation by including, in his RIF analysis, workers who had been terminated when their plant was transferred to another company, as well as workers who retired voluntarily.

Using simple z-score tests, Campion compared the ages of terminated and non-terminated employees in four subgroups, ages 40+, 45+, 50+, and 55+. He did not conduct an analysis of the 60+ subgroup on the claim that this group had too few members for the test to have sufficient power.3 Campion found a small z-score for the 40+ versus <40 age groups comparison (z = 1.51), which is not close to statistical significance at the 5% level. On the defense’s legal theory, this was the crucial comparison to be made under the Age Discrimination in Employment Act (ADEA). The plaintiffs, however, maintained that they could make out a case of disparate impact by showing age discrimination at age subgroups that started above the minimum specified by the ADEA. Although age is a continuous variable, Campion decided to compute z-scores for subgroups based upon five-year increments. For the 45+, 50+, and 55+ age subgroups, he found z-scores that ranged from 2.15 to 2.46, and he concluded that there was evidence of disparate impact in the higher age subgroups.4 Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283, 2015 WL 4232600, at *11 (W.D. Pa. July 13, 2015) (McVerry, S.J.).
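The z-score comparisons described can be illustrated with a standard two-proportion test. The counts below are invented for illustration only (they are not the Karlo record data); they happen to yield a z-score within the range the court reported for the older subgroups.

```python
# Hypothetical RIF counts -- invented, not the Karlo record data.
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """z-score for a difference in two proportions, pooled under the null."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled rate under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# Hypothetical: 24 of 120 workers in a 50+ subgroup terminated,
# versus 45 of 380 younger workers.
z = two_proportion_z(24, 120, 45, 380)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.4f}")
```

Each five-year subgroup generates another such test on overlapping data, which is what set up the multiple-comparisons fight that followed.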

The defense, and apparently the defense expert witnesses, branded Campion’s analysis as “data snooping,” which required correction for multiple comparisons. In the defense’s view, the multiple age subgroups required a Bonferroni correction that would have diminished the critical p-value for “significance” by a factor of four. The trial court agreed with the defense contention about data snooping and multiple comparisons, and excluded Campion’s opinion of disparate impact, which had been based upon finding statistically significant disparities in the 45+, 50+, and 55+ age subgroups. 2015 WL 4232600, at *13. The trial court noted that Campion, in finding significant disparities in terminations in the subgroups, but not in the 40+ versus <40 analysis:

“[did] not apply any of the generally accepted statistical procedures (i.e., the Bonferroni procedure) to correct his results for the likelihood of a false indication of significance. This sort of subgrouping ‘analysis’ is data-snooping, plain and simple.”

Id. After excluding Campion’s opinions under Rule 702, as well as other evidence in support of plaintiffs’ disparate impact claim, the trial court granted summary judgment on the discrimination claims. Karlo v. Pittsburgh Glass Works, LLC, No. 2:10–cv–1283, 2015 WL 5156913 (W. D. Pa. Sept. 2, 2015).

On plaintiffs’ appeal, the Third Circuit took the wind out of the attack on Campion by holding that the ADEA prohibits disparate impacts based upon age, which need not be assessed solely by comparing workers at least 40 years old with those under 40. Karlo v. Pittsburgh Glass Works, LLC, 849 F.3d 61, 66-68 (3d Cir. 2017). This holding took the legal significance out of the statistical insignificance of Campion’s comparison of 40+ versus <40 age-group termination rates. Campion’s subgroup analyses were back in play, but the Third Circuit still faced the question whether Campion’s conclusions, based upon unadjusted z-scores and p-values, offended Rule 702.

The Third Circuit noted that the district court had identified three grounds for excluding Campion’s statistical analyses:

(1) Dr. Campion used facts or data that were not reliable;

(2) he failed to use a statistical adjustment called the Bonferroni procedure; and

(3) his testimony lacks ‘‘fit’’ to the case because subgroup claims are not cognizable.

849 F.3d at 81. The first issue was raised by the defense’s claims of Campion’s sloppy data handling, and inclusion of voluntarily retired workers and workers who were terminated when their plant was turned over to another company. The Circuit did not address these data handling issues, which it left for the trial court on remand. Id. at 82. The third ground went out of the case with the appellate court’s resolution of the scope of the ADEA. The Circuit did, however, engage on the issue whether adjustment for multiple comparisons was required by Rule 702.

On the “data-snooping” issue, the Circuit concluded that the trial court had applied “an incorrectly rigorous standard for reliability.” Id. The Circuit acknowledged that

“[i]n theory, a researcher who searches for statistical significance in multiple attempts raises the probability of discovering it purely by chance, committing Type I error (i.e., finding a false positive).”

849 F.3d at 82. The defense expert witness contended that applying the Bonferroni adjustment, which would have reduced the critical significance probability level from 5% to 1%, would have rendered Campion’s analyses not statistically significant, and thus not probative of disparate impact. Given that plaintiffs’ cases were entirely statistical, the adjustment would have been fatal to their cases. Id. at 82.
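The arithmetic behind the defense’s contention is easy to reproduce. The sketch below converts the z-scores reported in the opinion into two-sided p-values and compares them against the unadjusted 5% criterion and a strict Bonferroni threshold for four subgroup tests (0.05 / 4 = 0.0125); on those numbers, none of the subgroup results would survive the adjustment. (The opinion itself describes a reduction from 5% to 1%; the divisor is an assumption here.)

```python
from math import sqrt, erf

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# z-scores as reported: 1.51 for the 40+ comparison; 2.15 to 2.46 for the
# older subgroups (the opinion gives only the endpoints of that range).
reported = [("40+ comparison", 1.51),
            ("subgroup, low end", 2.15),
            ("subgroup, high end", 2.46)]
alpha, k = 0.05, 4                       # four subgroup tests assumed
for label, z in reported:
    p = two_sided_p(z)
    raw = "significant" if p < alpha else "not significant"
    adj = "significant" if p < alpha / k else "not significant"
    print(f"{label}: z = {z:.2f}, p = {p:.4f}; "
          f"unadjusted: {raw}; Bonferroni: {adj}")
```

Even the strongest reported z-score (2.46) corresponds to a p-value just above 0.0125, which is why the adjustment, if required, would have been fatal to the statistical case.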

At the trial level and on appeal, plaintiffs and Campion had objected to the data-snooping charge on the grounds that

(1) he had analyzed only four subgroups;

(2) virtually all of the subgroup comparisons were statistically significant;

(3) his methodology was “hypothesis driven” and involved logical increments in age to explore whether the strength of the evidence of age disparity in terminations continued in each, increasingly older subgroup;

(4) his method was analogous to replications with different samples; and

(5) his result was confirmed by a single, supplemental analysis.

Id. at 83. According to the plaintiffs, Campion’s approach was based upon the reality that age is a continuous, not a dichotomous variable, and he was exploring a single hypothesis. A.240-241; Brief of Appellants at 26. Campion’s explanations do mitigate somewhat the charge of “data snooping,” but they do not explain why Campion did not use a statistical analysis that treated age as a continuous variable, at the outset of his analysis. The single, supplemental analysis was never described or reported by the trial or appellate courts.

The Third Circuit concluded that the district court had applied a ‘‘merits standard of correctness,’’ which is higher than what Rule 702 requires. Specifically, the district court, having identified a potential methodological flaw, did not further evaluate whether Campion’s opinion relied upon good grounds. 849 F.3d at 83. The Circuit vacated the judgment below, and remanded the case to the district court for the opportunity to apply the correct standard.

The trial court’s acceptance that an adjustment was appropriate or required hardly seems a “merits standard.” The use of a proper adjustment for multiple comparisons is very much a methodological concern. If Campion could reach his conclusion only by way of an inappropriate methodology, then his conclusion surely would fail the requirements of Rule 702. The trial court did, however, appear to accept, without explicit evidence, that the failure to apply the Bonferroni correction made it impossible for Campion to present a sound scientific argument for his conclusion that there had been disparate impact. The trial court’s opinion also suggests that the Bonferroni correction itself, as opposed to some more appropriate correction, was required.

Unfortunately, the reported opinions do not provide the reader with a clear account of what the analyses would have shown on the correct data set, without improper inclusions and exclusions, and with appropriate statistical adjustments. Presumably, the parties are left to make their cases on remand.

Based upon citations to sources that described the Bonferroni adjustment as “good statistical practice,” but one that is ‘‘not widely or consistently adopted’’ in the behavioral and social sciences, the Third Circuit observed that in some cases, failure to adjust for multiple comparisons may “simply diminish the weight of an expert’s finding.”5 The observation is problematic given that Kumho Tire suggests that an expert witness must use “in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.” Kumho Tire Co. v. Carmichael, 526 U.S. 137, 150 (1999). One implication is that courts are prisoners to prevalent scientific malpractice and abuse of statistical methodology. Another implication is that courts need to look more closely at the assumptions and predicates for various statistical tests and adjustments, such as the Bonferroni correction.

These worrisome implications are exacerbated by the appellate court’s insistence that the question whether a study’s result was properly calculated or interpreted “goes to the weight of the evidence, not to its admissibility.”6 Combined with citations to pre-Daubert statistics cases7, judicial comments such as these can appear to reflect a general disregard for the statutory requirements of Rules 702 and 703. Claims of statistical significance, in studies with multiple exposures and multiple outcomes, frequently rest upon analyses that make no adjustment for multiple comparisons, without notation, explanation, or justification. The consequence is that study results are often over-interpreted and over-sold. Methodological errors related to multiple testing or over-claiming statistical significance are commonplace in tort litigation over “health-effects” studies of birth defects, cancer, and other chronic diseases that require epidemiologic evidence8.

In Karlo, the claimed methodological error is beset by its own methodological problems. As the court noted, adjustments for multiple comparisons are not free from methodological controversy9. One noteworthy textbook10 labels the Bonferroni correction an “awful response” to the problem of multiple comparisons. Aside from this strident criticism, there are alternative approaches to statistical adjustment for multiple comparisons. In the context of the Karlo case, the Bonferroni correction might well be awful because Campion’s four subgroups are hardly independent tests. Because each subgroup is nested within the next higher age subgroup, the subgroup test results will be strongly correlated in a way that defeats the mathematical assumptions of the Bonferroni correction. On remand, the trial court in Karlo must still make its Rule 702 gatekeeping decision on the methodological appropriateness of Campion’s treatment of multiple subgroups, and of the multiple analyses run on different models.
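The competing concerns can be made concrete with a minimal arithmetic sketch. The four tests at a nominal 0.05 level track the Karlo subgroup analyses; the “perfect correlation” limit is an illustrative assumption, not a finding about Campion’s actual data:

```python
# A minimal sketch of the Bonferroni arithmetic discussed above.
# Four subgroup tests at a nominal 0.05 level mirror the Karlo facts;
# the perfect-correlation limit is an illustrative assumption.

alpha = 0.05   # nominal significance level for a single test
m = 4          # number of age-subgroup tests

# If the four tests were truly independent, the chance of at least one
# false-positive "significant" result would be well above 5%:
familywise_if_independent = 1 - (1 - alpha) ** m
print(round(familywise_if_independent, 3))   # 0.185

# Bonferroni guards against that inflation by testing each comparison
# at alpha / m instead of alpha:
bonferroni_threshold = alpha / m
print(bonferroni_threshold)                  # 0.0125

# But nested subgroups (each contained within the next) are strongly
# correlated.  In the limiting case of perfect correlation the four
# tests behave like a single test, the familywise error rate is already
# about alpha, and dividing by m over-corrects:
familywise_if_perfectly_correlated = alpha   # 0.05
```

The sketch shows why the correction is “awful” here: the divisor m assumes independent tests, and the nested subgroups violate that assumption.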


1 Although Campion describes his consulting business as small, he seems to turn up in quite a few employment discrimination cases. See, e.g., Chen-Oster v. Goldman, Sachs & Co., 10 Civ. 6950 (AT) (JCF) (S.D.N.Y. 2015); Brand v. Comcast Corp., Case No. 11 C 8471 (N.D. Ill. July 5, 2014); Powell v. Dallas Morning News L.P., 776 F. Supp. 2d 240, 247 (N.D. Tex. 2011) (excluding Campion’s opinions), aff’d, 486 F. App’x 469 (5th Cir. 2012).

2 See Defendant’s Motion to Bar Dr. Michael Campion’s Statistical Analysis, 2013 WL 11260556.

3 There was no mention of an effect size for the younger subgroups, or of a power calculation for the 60+ subgroup’s probability of showing a z-score greater than two. Similarly, there was no discussion or argument about why this subgroup could not have been evaluated with Fisher’s exact test. In deciding the appeal, the Third Circuit observed that “Dr. Rosenberger test[ed] a subgroup of sixty-and-older employees, which Dr. Campion did not include in his analysis because ‘[t]here are only 14 terminations, which means the statistical power to detect a significant effect is very low’. A.244–45.” Karlo v. Pittsburgh Glass Works, LLC, 849 F.3d 61, 82 n.15 (3d Cir. 2017).

4 In the trial court’s words, the z-score converts the difference in termination rates into standard deviations. Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283, 2015 WL 4232600, at *11 n.13 (W.D. Pa. July 13, 2015). According to the trial court, Campion gave a rather dubious explanation of the meaning of the z-score: “[w]hen the number of standard deviations is less than –2 (actually –1.96), there is a 95% probability that the difference in termination rates of the subgroups is not due to chance alone.” Id. (internal citation omitted).
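A minimal sketch of the statistic the footnote describes may help; the termination counts below are hypothetical, not the Karlo data:

```python
# Hedged sketch: the pooled two-sample z-statistic for a difference in
# termination rates.  The counts are hypothetical, for illustration only.
from math import sqrt

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z-score for the difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: 30 of 100 older workers terminated vs. 45 of 300 younger.
z = two_proportion_z(30, 100, 45, 300)
print(round(z, 2))   # 3.33

# Correct reading: |z| > 1.96 corresponds to a two-sided p-value below
# 0.05 *under the null hypothesis* of equal rates -- not a "95%
# probability" that the observed difference is not due to chance.
```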

5 See 849 F.3d 61, 83 (3d Cir. 2017) (citing and quoting from Paetzold & Willborn § 6:7, at 308 n.2) (describing the Bonferroni adjustment as ‘‘good statistical practice,’’ but ‘‘not widely or consistently adopted’’ in the behavioral and social sciences); see also E.E.O.C. v. Autozone, Inc., No. 00-2923, 2006 WL 2524093, at *4 (W.D. Tenn. Aug. 29, 2006) (‘‘[T]he Court does not have a sufficient basis to find that … the non-utilization [of the Bonferroni adjustment] makes [the expert’s] results unreliable.’’). And of course, the Third Circuit invoked the Daubert chestnut: ‘‘Vigorous cross-examination, presentation of contrary evidence, and careful instruction on the burden of proof are the traditional and appropriate means of attacking shaky but admissible evidence.’’ Daubert, 509 U.S. 579, 596 (1993).

6 See 849 F.3d at 83 (citing Leonard v. Stemtech Internat’l Inc., 834 F.3d 376, 391 (3d Cir. 2016)).

7 See 849 F.3d 61, 83 (3d Cir. 2017), citing Bazemore v. Friday, 478 U.S. 385, 400 (1986) (‘‘Normally, failure to include variables will affect the analysis’ probativeness, not its admissibility.’’).

8 See Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and Litigation 93 & n.3 (1997) (criticizing the “notorious” case of Wells v. Ortho Pharmaceutical Corp., 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986), for its erroneous endorsement of conclusions based upon “statistically significant” studies that explored dozens of congenital malformation outcomes, without statistical adjustment). The authors do, however, give an encouraging example of an English trial judge who took seriously the multiplicity of hypotheses tested in the study relied upon by plaintiffs. Reay v. British Nuclear Fuels (Q.B. Oct. 8, 1993) (published in The Independent, Nov. 22, 1993). Id. (“the fact that a number of hypotheses were considered in the study requires an increase in the P-value of the findings with consequent reduction in the confidence that can be placed in the study result … .”), quoted in Zeisel & Kaye at 93. Zeisel and Kaye emphasize that courts should not be overly impressed with claims of statistically significant findings, and should pay close attention to how expert witnesses developed their statistical models. Id. at 94.

9 See David B. Cohen, Michael G. Aamodt, and Eric M. Dunleavy, Technical Advisory Committee Report on Best Practices in Adverse Impact Analyses (Center for Corporate Equality 2010).

10 Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 273 (3d ed. 2008); see also Kenneth J. Rothman, “No Adjustments Are Needed for Multiple Comparisons,” 1 Epidemiology 43, 43 (1990).

Another Haack Article on Daubert

October 14th, 2016

In yet another law review article on Daubert, Susan Haack has managed mostly to repeat her past mistakes, while adding a few new ones to her exegesis of the law of expert witnesses. See Susan Haack, “Mind the Analytical Gap! Tracing a Fault Line in Daubert,” 654 Wayne L. Rev. 653 (2016) [cited as Gap].  Like some other commentators on the law of evidence, Haack purports to discuss this area of law without ever citing or quoting the current version of the relevant statute, Federal Rule of Evidence 702. She pores over Daubert and Joiner, as she has done before, with mostly the same errors of interpretation. In discussing Joiner, Haack misses the importance of the Supreme Court’s reversal of the 11th Circuit’s asymmetric standard of review for Rule 702 trial court decisions. Gap at 677. And Haack’s analysis of this area of law omits any mention of Rule 703, and its role in Rule 702 determinations. Although you can safely skip yet another Haack article, you should expect to see this one, along with her others, cited in briefs, right up there with David Michaels’ Manufacturing Doubt.

A Matter of Degree

“It may be said that the difference is only one of degree. Most differences are, when nicely analyzed.”[1]

Quoting Holmes, Haack appears to complain that the courts’ admissibility decisions on expert witnesses’ opinions are dichotomous and categorical, whereas the component parts of the decisions, involving relevance and reliability, are qualitative and gradational. True, true, and immaterial.

How do you boil a live frog so it does not jump out of the water?  You slowly turn up the heat on the frog by degrees.  The frog is lulled into complacency, but at the end of the process, the frog is quite, categorically, and sincerely dead. By a matter of degrees, you can boil a frog alive in water, with a categorically ascertainable outcome.

Humans use categorical assignments in all walks of life.  We rely upon our conceptual abilities to differentiate sinners and saints, criminals and paragons, scholars and skells. And we do this even though IQ, and virtues, come in degrees. In legal contexts, the finder of fact (whether judge or jury) must resolve disputed facts and render a verdict, which will usually be dichotomous, not gradational.

Haack finds “the elision of admissibility into sufficiency disturbing,” Gap at 654, but that is life, reason, and the law. She suggests that the difference in the nature of relevancy and reliability on the one hand, and admissibility on the other, creates a conceptual “mismatch.” Gap at 669. The suggestion is rubbish, a Briticism that Haack is fond of using herself.  Clinical pathologists may diagnose cancer by counting the number of mitotic spindles in cells removed from an organ on biopsy.  The number may be characterized as a percentage of cells in mitosis, a gradational measure that can run from zero to 100 percent, but the conclusion that comes out of the pathologist’s review is a categorical diagnosis.  The pathologist must decide whether the biopsy result is benign or malignant. And so it is with many human activities and ways of understanding the world.

The Problems with Daubert (in Haack’s View)

Atomism versus Holism

Haack repeats a litany of complaints about Daubert, but she generally misses the boat.  Daubert was decisional law, in 1993, which interpreted a statute, Federal Rule of Evidence 702.  The current version of Rule 702, which was not available to, or binding on, the Court in Daubert, focuses on both validity and sufficiency concerns:

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if:

(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;

(b) the testimony is based on sufficient facts or data;

(c) the testimony is the product of reliable principles and methods; and

(d) the expert has reliably applied the principles and methods to the facts of the case.

Subsection (b) renders most of Haack’s article a legal ignoratio elenchi.

Relative Risks Greater Than Two

Modern chronic disease epidemiology has fostered an awareness that there is a legitimate category of disease causation that involves identifying causes that are neither necessary nor sufficient to produce their effects. Today it is a commonplace that an established cause of lung cancer is cigarette smoking, and yet, not all smokers develop lung cancer, and not all lung cancer patients were smokers.  Epidemiology can identify lung cancer causes such as smoking because it looks at stochastic processes that are modified from base rates, or population rates. This model of causation is not expected to produce uniform and consistent categorical outcomes in all exposed individuals, such as lung cancer in all smokers.

A necessary implication of categorizing an exposure or lifestyle variable as a “cause,” in this way is that the evidence that helps establish causation cannot answer whether a given individual case of the outcome of interest was caused by the exposure of interest, even when that exposure is a known cause.  We can certainly say that the exposure in the person was a risk for developing the disease later, but we often have no way to make the individual attribution.  In some cases, more the exception than the rule, there may be an identified mechanism that allows the detection of a “fingerprint” of causation. For the most part, however, risk and cause are two completely different things.

The magnitude of risk, expressed as a risk ratio, can be used to calculate a population attributable risk, which can in turn, with some caveats, be interpreted as approximating a probability of causation.  When the attributable risk is 95%, as it would be for people with heavy smoking habits and lung cancer, treating the existence of the prior risk as evidence of specific causation seems perfectly reasonable.  Treating a 25% attributable risk as evidence to support a conclusion of specific causation, without more, is simply wrong.  A simple probabilistic urn model would tell us that we would most likely be incorrect if we attributed a random case to the risk based upon such a low attributable risk.  Although we can fuss over whether the urn model is correct, the typical case in litigation allows no other model to be asserted, and it would be the plaintiffs’ burden of proof to establish the alternative model in any event.
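The arithmetic behind this point can be sketched briefly. The attributable fraction among the exposed is (RR − 1)/RR, which is why a relative risk of two marks the break-even point for a better-than-even probability of causation; the risk ratios below are illustrative, not drawn from any cited study:

```python
# Hedged sketch of the relative-risk / attributable-fraction arithmetic.
# The RR values are illustrative assumptions, not data from any study.

def attributable_fraction(rr: float) -> float:
    """Attributable fraction among the exposed: (RR - 1) / RR."""
    return (rr - 1.0) / rr

# RR = 20 (roughly the range reported for smoking and lung cancer)
# gives AF = 0.95: attributing a random exposed case to the exposure
# would be right about 95% of the time under a simple urn model.
print(attributable_fraction(20.0))   # 0.95

# RR = 2 is the break-even point: AF = 0.5, a coin flip.
print(attributable_fraction(2.0))    # 0.5

# A 25% attributable fraction corresponds to an RR of only about 1.33;
# a random exposed case is three times more likely NOT to have been
# caused by the exposure.
print(round(attributable_fraction(4.0 / 3.0), 2))   # 0.25
```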

As she has done many times before, Haack criticizes Judge Kozinski’s opinion in Daubert,[2] on remand, where he entered judgment for the defendant because further proceedings were futile given the small relative risks claimed by plaintiffs’ expert witnesses.  Those relative risks, advanced by Shanna Swan and Alan Done, lacked reliability; they were the product of a for-litigation juking of the stats that were the original target of the defendant and the medical community in the Supreme Court briefing.  Judge Kozinski simplified the case, using a common legal stratagem of assuming arguendo that general causation was established.  With this assumption favorable to plaintiffs made, but never proven or accepted, Judge Kozinski could then shine his analytical light on the fatal weakness of the specific causation opinions.  When all the hand waving was put to rest, all that propped up the plaintiffs’ specific causation claim was the existence of a claimed relative risk, which was less than two. Haack is unhappy with the analytical clarity achieved by Kozinski, and implicitly urges a conflation of general and specific causation so that “all the evidence” can be counted.  The evidence of general causation, however, does not advance plaintiffs’ specific causation case when the nature of causation is the (assumed) existence of a non-necessary and non-sufficient risk. Haack quotes Dean McCormick as having observed that “[a] brick is not a wall,” and accuses Judge Kozinski of an atomistic fallacy of ruling out a wall simply because the party had only bricks.  Gap at 673, quoting from Charles McCormick, Handbook of the Law of Evidence at 317 (1954).

There is a fallacy opposite to the atomistic fallacy, however, namely the holistic “too much of nothing fallacy” so nicely put by Poincaré:

“Science is built up with facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house.”[3]

Poincaré’s metaphor is more powerful than Haack’s call for holistic evidence because it acknowledges that interlocking pieces of evidence may cohere as a building, or they may be no more than a pile of rubble.  Poorly constructed walls may soon revert to the pile of stones from which they came.

Haack proceeds to criticize Judge Kozinski for his “extraordinary argument” that

“(a) equates degrees of proof with statistical probabilities;

(b) assesses each expert’s testimony individually; and

(c) raises the standard of admissibility under the relevance prong to the standard of proof.”

Gap at 672.

Haack misses the point that a low relative risk, with no other valid evidence of specific causation, translates into a low probability of specific causation, even if general causation were apodictically certain. Aggregating the testimony, say between  animal toxicologists and epidemiologists, simply does not advance the epistemic ball on specific causation because all the evidence collectively does not help identify the cause of Jason Daubert’s birth defects on the very model of causation that plaintiffs’ expert witnesses advanced.

All this would be bad enough, but Haack then goes on to commit a serious category mistake in confusing the probabilistic inference (for specific causation) of an urn model with the prosecutor’s fallacy of interpreting a random match probability as the evidence of innocence. (Or the complement of the random match probability as the evidence of guilt.) Judge Kozinski was not working with random match probabilities, and he did not commit the prosecutor’s fallacy.

Take Some Sertraline and Call Me in the Morning

As depressing as Haack’s article is, she manages to make matters even gloomier by attempting a discussion of Judge Rufe’s recent decision in the sertraline birth defects litigation. Haack’s discussion of this decision illustrates and typifies her analyses of other cases, including various decisions on causation opinion testimony on phenylpropanolamine, silicone, bendectin, t-PA, and other occupational, environmental, and therapeutic exposures. Maybe 100 mg sertraline is in order.

Haack criticizes what she perceives to be the conflation of admissibility and sufficiency issues in how the sertraline MDL court addressed the defendants’ motion to exclude the proffered testimony of Dr. Anick Bérard. Gap at 683. The conflation is imaginary, however, and the direct result of Haack’s refusal to look at the specific, multiple methodological flaws in the approach Bérard took to reach a causal conclusion. These flaws are not gradational, and they are detailed in the MDL court’s opinion[4] excluding Bérard. Haack, however, fails to look at the details. Instead, Haack focuses on what she suggests is the sertraline MDL court’s conclusion that epidemiology was necessary:

“Judge Rufe argues that reliable testimony about human causation should generally be supported by epidemiological studies, and that ‘when epidemiological studies are equivocal or inconsistent with a causation opinion, experts asserting causation opinions must thoroughly analyze the strengths and weaknesses of the epidemiological research and explain why [it] does not contradict or undermine their opinion’. * * *

Judge Rufe acknowledges the difference between admissibility and sufficiency but, when it comes to the part of their testimony he [sic] deems inadmissible, his [sic] argument seems to be that, in light of the defendant’s epidemiological evidence, the plaintiffs’ expert testimony is insufficient.”

Gap at 682.

This précis is a remarkable distortion of the material facts of the case. There was no distinct body of plaintiffs’ epidemiologic evidence set against defendants’ epidemiologic evidence.  Rather there was epidemiologic evidence, and Bérard ignored, misreported, or misrepresented a good deal of the total evidentiary display. Bérard embraced studies when she could use their risk ratios to support her opinions, but criticized or ignored the same studies when their risk ratios pointed in the direction of no association or even of a protective association. To add to this methodological duplicity, Anick Bérard published many statements, in peer-reviewed journals, that sertraline was not shown to cause birth defects, but then changed her opinion solely for litigation. The court’s observation that there was a need for consistent epidemiologic evidence flowed not only from the conception of causation (non-necessary and non-sufficient), but also from Bérard’s and her fellow plaintiffs’ expert witnesses’ concessions that epidemiology was needed.  Haack’s glib approach to criticizing judicial opinions fails to do justice to the difficulties of the task; nor does she advance any meaningful criteria to separate successful from unsuccessful efforts.

In attempting to make her case for the gradational nature of relevance and reliability, Haack acknowledges that the details of the evidence relied upon can render the evidence, and presumably the conclusion based thereon, more or less reliable.  Thus, we are told that epidemiologic studies based upon self-reported diagnoses are highly unreliable because such diagnoses are often wrong. Gap at 667-68. Similarly, we are told that, in considering a claim that a plaintiff suffered an adverse effect from a medication, epidemiologic evidence showing a risk ratio of three would not be reliable if it had inadequate or inappropriate controls,[5] was not double blinded, and lacked randomization. Gap at 668-69. Even if the boundaries between reliable and unreliable are not always as clear as we might like, Haack fails to show that the gatekeeping process lacks a suitable epistemic, scientific foundation.

Curiously, Haack calls out Carl Cranor, plaintiffs’ expert witness in the Milward case, for advancing a confusing, vacuous “weight of the evidence” rationale for the methodology employed by the other plaintiffs’ causation expert witnesses in Milward.[6] Haack argues that Cranor’s invocation of “inference to the best explanation” and “weight of the evidence” fails to answer the important questions at issue in the case, namely how to weight the inference to causation as strong, weak, or absent. Gap at 688 & n. 223, 224. And yet, when Haack discusses court decisions that detailed voluminous records of evidence about how causal inferences should be made and supported, she flies over the details to give us confused, empty conclusions that the trial courts conflated admissibility with sufficiency.


[1] Rideout v. Knox, 19 N.E. 390, 392 (Mass. 1892).

[2] Daubert v. Merrell Dow Pharm., Inc., 43 F.3d 1311, 1320 (9th Cir. 1995).

[3] Jules Henri Poincaré, La Science et l’Hypothèse (1905) (chapter 9, Les Hypothèses en Physique) (“[O]n fait la science avec des faits comme une maison avec des pierres; mais une accumulation de faits n’est pas plus une science qu’un tas de pierres n’est une maison.”).

[4] In re Zoloft Prods. Liab. Litig., 26 F. Supp. 3d 466 (E.D. Pa. 2014).

[5] Actually Haack’s suggestion is that a study with a relative risk of three would not be very reliable if it had no controls, but that suggestion is incoherent.  A risk ratio could not have been calculated at all if there had been no controls.

[6] Milward v. Acuity Specialty Prods., 639 F.3d 11, 17-18 (1st Cir. 2011), cert. denied, 132 S.Ct. 1002 (2012).