Multiplicity in the Third Circuit

In Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283 (W. D. Pa.), plaintiffs claimed that their employer’s reduction in force unlawfully targeted workers over 50 years of age. Plaintiffs lacked any evidence of employer animus against old folks, and thus attempted to make out a statistical disparate impact claim. The plaintiffs placed their chief reliance upon an expert witness, Michael A. Campion, to analyze a dataset of workers agreed to have been the subject of the R.I.F. For the last 30 years, Campion has been on the faculty in Purdue University. His academic training and graduate degrees are in industrial and organizational psychology. Campion has served an editor of Personnel Psychology, and as a past president of the Society for Industrial and Organizational Psychology. Campion’s academic website page notes that he manages a small consulting firm, Campion Consulting Services1.

The defense sought to characterize Campion as not qualified to offer his statistical analysis2. Campion did, however, have some statistical training as part of his master’s level training in psychology, and his professional publications did occasionally involve statistical analyses. To be sure, Campion’s statistical acumen paled in comparison to the defense expert witness, James Rosenberger, a fellow and a former vice president of the American Statistical Association, as well as a full professor of statistics in Pennsylvania State University. The threshold for qualification, however, is low, and the defense’s attack on Campion’s qualifications failed to attract the court’s serious attention.

On the merits, the defense subjected Campion to a strong challenge on whether he had misused data. The defense’s expert witness, Prof. Rosenberger, filed a report that questioned Campion’s data handling and statistical analyses. The defense claimed that Campion had engaged in questionable data manipulation by including, in his RIF analysis, workers who had been terminated when their plant was transferred to another company, as well as workers who retired voluntarily.

Using simple z-score tests, Campion compared the ages of terminated and non-terminated employees in four subgroups, ages 40+, 45+, 50+, and 55+. He did not conduct an analysis of the 60+ subgroup on the claim that this group had too few members for the test to have sufficient power3Campion found a small z-score for the 40+ versus <40 age groups comparison (z =1.51), which is not close to statistical significance at the 5% level. On the defense’s legal theory, this was the crucial comparison to be made under the Age Discrimination in Employment Act (ADEA). The plaintiffs, however, maintained that they could make out a case of disparate impact by showing age discrimination at age subgroups that started above the minimum specified by the ADEA. Although age is a continuous variable, Campion decided to conduct z-scores on subgroups that were based upon five-year increments. For the 45+, 50+, and 55+ age subgroups, he found z-scores that ranged from 2.15 to 2.46, and he concluded that there was evidence of disparate impact in the higher age subgroups4. Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283, 2015 WL 4232600, at *11 (W.D. Pa. July 13, 2015) (McVerry, S.J.)

The defense, and apparently the defense expert witnesses, branded Campion’s analysis as “data snooping,” which required correction for multiple comparisons. In the defense’s view, the multiple age subgroups required a Bonferroni correction that would have diminished the critical p-value for “significance” by a factor of four. The trial court agreed with the defense contention about data snooping and multiple comparisons, and excluded Campion’s opinion of disparate impact, which had been based upon finding statistically significant disparities in the 45+, 50+, and 55+ age subgroups. 2015 WL 4232600, at *13. The trial court noted that Campion, in finding significant disparities in terminations in the subgroups, but not in the 40+ versus <40 analysis:

[did] not apply any of the generally accepted statistical procedures (i.e., the Bonferroni procedure) to correct his results for the likelihood of a false indication of significance. This sort of subgrouping ‘analysis’ is data-snooping, plain and simple.”

Id. After excluding Campion’s opinions under Rule 702, as well as other evidence in support of plaintiffs’ disparate impact claim, the trial court granted summary judgment on the discrimination claims. Karlo v. Pittsburgh Glass Works, LLC, No. 2:10–cv–1283, 2015 WL 5156913 (W. D. Pa. Sept. 2, 2015).

On plaintiffs’ appeal, the Third Circuit took the wind out of the attack on Campion by holding that the ADEA prohibits disparate impacts based upon age, which need not necessarily be on workers’ being over 40 years old, as opposed to being at least 40 years old. Karlo v. Pittsburgh Glass Works, LLC, 849 F.3d 61, 66-68 (3d Cir. 2017). This holding took the legal significance out of the statistical insignificance of Campion’s comparison 40+ versus <40 age-group termination rates. Campion’s subgroup analyses were back in play, but the Third Circuit still faced the question whether Campion’s conclusions, based upon unadjusted z-scores and p-values, offended Rule 702.

The Third Circuit noted that the district court had identified three grounds for excluding Campion’s statistical analyses:

(1) Dr. Campion used facts or data that were not reliable;

(2) he failed to use a statistical adjustment called the Bonferroni procedure; and

(3) his testimony lacks ‘‘fit’’ to the case because subgroup claims are not cognizable.

849 F.3d at 81. The first issue was raised by the defense’s claims of Campion’s sloppy data handling, and inclusion of voluntarily retired workers and workers who were terminated when their plant was turned over to another company. The Circuit did not address these data handling issues, which it left for the trial court on remand. Id. at 82. The third ground went out of the case with the appellate court’s resolution of the scope of the ADEA. The Circuit did, however, engage on the issue whether adjustment for multiple comparisons was required by Rule 702.

On the “data-snooping” issue, the Circuit concluded that the trial court had applied “an incorrectly rigorous standard for reliability.” Id. The Circuit acknowledged that

[i]n theory, a researcher who searches for statistical significance in multiple attempts raises the probability of discovering it purely by chance, committing Type I error (i.e., finding a false positive).”

849 F.3d at 82. The defense expert witness contended that applying the Bonferroni adjustment, which would have reduced the critical significance probability level from 5% to 1%, would have rendered Campion’s analyses not statistically significant, and thus not probative of disparate impact. Given that plaintiffs’ cases were entirely statistical, the adjustment would have been fatal to their cases. Id. at 82.

At the trial level and on appeal, plaintiffs and Campion had objected to the data-snooping charge on ground that

(1) he had engaged in only four subgroups;

(2) virtually all subgroups were statistically significant;

(3) his methodology was “hypothesis driven” and involved logical increments in age to explore whether the strength of the evidence of age disparity in terminations continued in each, increasingly older subgroup;

(4) his method was analogous to replications with different samples; and

(5) his result was confirmed by a single, supplemental analysis.

Id. at 83. According to the plaintiffs, Campion’s approach was based upon the reality that age is a continuous, not a dichotomous variable, and he was exploring a single hypothesis. A.240-241; Brief of Appellants at 26. Campion’s explanations do mitigate somewhat the charge of “data snooping,” but they do not explain why Campion did not use a statistical analysis that treated age as a continuous variable, at the outset of his analysis. The single, supplemental analysis was never described or reported by the trial or appellate courts.

The Third Circuit concluded that the district court had applied a ‘‘merits standard of correctness,’’ which is higher than what Rule 702 requires. Specifically, the district court, having identified a potential methodological flaw, did not further evaluate whether Campion’s opinion relied upon good grounds. 849 F.3d at 83. The Circuit vacated the judgment below, and remanded the case to the district court for the opportunity to apply the correct standard.

The trial court’s acceptance that an adjustment was appropriate or required hardly seems a “merits standard.” The use of a proper adjustment for multiple comparisons is very much a methodological concern. If Campion could reach his conclusion only by way of an inappropriate methodology, then his conclusion surely would fail the requirements of Rule 702. The trial court did, however, appear to accept, without explicit evidence, that the failure to apply the Bonferroni correction made it impossible for Campion to present sound scientific argument for his conclusion that there had been disparate impact. The trial court’s opinion also suggests that the Bonferroni correction itself, as opposed to some more appropriate correction, was required.

Unfortunately, the reported opinions do not provide the reader with a clear account of what the analyses would have shown on the correct data set, without improper inclusions and exclusions, and with appropriate statistical adjustments. Presumably, the parties are left to make their cases on remand.

Based upon citations to sources that described the Bonferroni adjustment as “good statistical practice,” but one that is ‘‘not widely or consistently adopted’’ in the behavioral and social sciences, the Third Circuit observed that in some cases, failure to adjust for multiple comparisons may “simply diminish the weight of an expert’s finding.”5 The observation is problematic given that Kumho Tire suggests that an expert witness must use “in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.” Kumho Tire Co. v. Carmichael, 526 U.S. 137, 150, (1999). One implication is that courts are prisoners to prevalent scientific malpractice and abuse of statistical methodology. Another implication is that courts need to look more closely at the assumptions and predicates for various statistical tests and adjustments, such as the Bonferroni correction.

These worrisome implications are exacerbated by the appellate court’s insistence that the question whether a study’s result was properly calculated or interpreted “goes to the weight of the evidence, not to its admissibility.”6 Combined with citations to pre-Daubert statistics cases7, judicial comments such as these can appear to be a general disregard for the statutory requirements of Rules 702 and 703. Claims of statistical significance, in studies with multiple exposure and multiple outcomes, are frequently not adjusted for multiple comparisons, without notation, explanation, or justification. The consequence is that study results are often over-interpreted and over-sold. Methodological errors related to multiple testing or over-claiming statistical significance are commonplace in tort litigation over “health-effects” studies of birth defects, cancer, and other chronic diseases that require epidemiologic evidence8.

In Karlo, the claimed methodological error is beset by its own methodological problems. As the court noted, adjustments for multiple comparisons are not free from methodological controversy9. One noteworthy textbook10 labels the Bonferroni correction as an “awful response” to the problem of multiple comparisons. Aside from this strident criticism, there are alternative approaches to statistical adjustment for multiple comparisons. In the context of the Karlo case, the Bonferroni might well be awful because Campion’s four subgroups are hardly independent tests. Because each subgroup is nested within the next higher age subgroup, the subgroup test results will be strongly correlated in a way that defeats the mathematical assumptions of the Bonferroni correction. On remand, the trial court in Karlo must still make his Rule 702 gatekeeping decision on the methodological appropriateness of whether Campion’s properly considered the role of multiple subgroups, and multiple anaslyses run on different models.


1 Although Campion describes his consulting business as small, he seems to turn up in quite a few employment discrimination cases. See, e.g., Chen-Oster v. Goldman, Sachs & Co., 10 Civ. 6950 (AT) (JCF) (S.D.N.Y. 2015); Brand v. Comcast Corp., Case No. 11 C 8471 (N.D. Ill. July 5, 2014); Powell v. Dallas Morning News L.P., 776 F. Supp. 2d 240, 247 (N.D. Tex. 2011) (excluding Campion’s opinions), aff’d, 486 F. App’x 469 (5th Cir. 2012).

2 See Defendant’s Motion to Bar Dr. Michael Campion’s Statistical Analysis, 2013 WL 11260556.

3 There was no mention of an effect size for the lower aged subgroups, and a power calculation for the 60+ subgroup’s probability of showing a z-score greater than two. Similarly, there was no discussion or argument about why this subgroup could not have been evaluated with Fisher’s exact test. In deciding the appeal, the Third Circuit observed that “Dr. Rosenberger test[ed] a subgroup of sixty-and-older employees, which Dr. Campion did not include in his analysis because ‘[t]here are only 14 terminations, which means the statistical power to detect a significant effect is very low’. A.244–45.” Karlo v. Pittsburgh Glass Works, LLC, 849 F.3d 61, 82 n.15 (3d Cir. 2017).

4 In the trial court’s words, the z-score converts the difference in termination rates into standard deviations. Karlo v. Pittsburgh Glass Works, LLC, C.A. No. 2:10-cv-01283, 2015 WL 4232600, at *11 n.13 (W.D. Pa. July 13, 2015). According to the trial court, Campion gave a rather dubious explanation of the meaning of the z-score: “[w]hen the number of standard deviations is less than –2 (actually–1.96), there is a 95% probability that the difference in termination rates of the subgroups is not due to chance alone” Id. (internal citation omitted).

5 See 849 F.3d 61, 83 (3d Cir. 2017) (citing and quoting from Paetzold & Willborn § 6:7, at 308 n.2) (describing the Bonferroni adjustment as ‘‘good statistical practice,’’ but ‘‘not widely or consistently adopted’’ in the behavioral and social sciences); see also E.E.O.C. v. Autozone, Inc., No. 00-2923, 2006 WL 2524093, at *4 (W.D. Tenn. Aug. 29, 2006) (‘‘[T]he Court does not have a sufficient basis to find that … the non-utilization [of the Bonferroni adjustment] makes [the expert’s] results unreliable.’’). And of course, the Third Circuit invoked the Daubert chestnut: ‘‘Vigorous cross-examination, presentation of contrary evidence, and careful instruction on the burden of proof are the traditional and appropriate means of attacking shaky but

admissible evidence.’’ Daubert, 509 U.S. 579, 596 (1993).

6 See 849 F.3d at 83 (citing Leonard v. Stemtech Internat’l Inc., 834 F.3d 376, 391 (3d Cir. 2016).

7 See 849 F.3d 61, 83 (3d Cir. 2017), citing Bazemore v. Friday, 478 U.S. 385, 400 (1986) (‘‘Normally, failure to include variables will affect the analysis’ probativeness, not its admissibility.’’).

8 See Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and Litigation 93 & n.3 (1997) (criticizing the “notorious” case of Wells v. Ortho Pharmaceutical Corp., 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986), for its erroneous endorsement of conclusions based upon “statistically significant” studies that explored dozens of congenital malformation outcomes, without statistical adjustment). The authors do, however, give an encouraging example of a English trial judge who took multiplicity seriously. Reay v. British Nuclear Fuels (Q.B. Oct. 8,1993) (published in The Independent, Nov. 22,1993). In Reay, the trial court took seriously the multiplicity of hypotheses tested in the study relied upon by plaintiffs. Id. (“the fact that a number of hypotheses were considered in the study requires an increase in the P-value of the findings with consequent reduction in the confidence that can be placed in the study result … .”), quoted in Zeisel & Kaye at 93. Zeisel and Kaye emphasize that courts should not be overly impressed with claims of statistically significant findings, and should pay close attention to how expert witnesses developed their statistical models. Id. at 94.

9 See David B. Cohen, Michael G. Aamodt, and Eric M. Dunleavy, Technical Advisory Committee Report on Best Practices in Adverse Impact Analyses (Center for Corporate Equality 2010).

10 Kenneth J. Rothman, Sander Greenland, and Timoth L. Lash, Modern Epidemiology 273 (3d ed. 2008); see also Kenneth J. Rothman, “No Adjustments Are Needed for Multiple Comparisons,” 1 Epidemiology 43, 43 (1990)