Gatekeeping in many courtrooms has been reduced to requiring expert witnesses to swear an oath and testify that they have followed a scientific method. The Federal Rules of Evidence and most state evidence codes require more. The law, in most jurisdictions, requires that judges actively engage with, and inspect, the bases for expert witnesses’ opinions and claims, to determine whether expert witnesses who want to be heard in a courtroom have actually, faithfully followed a scientific methodology. In other words, the law requires judges to assess the scientific reasonableness of reliance upon the actual data cited, and to evaluate whether the inferences drawn from the data, to reach a stated conclusion, are valid.
We are getting close to a quarter of a century since the United States Supreme Court outlined the requirements of gatekeeping, in Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993). Since the Daubert decision, the Supreme Court’s decisional law, and changes in the evidence rules themselves, have clarified the nature and extent of the inquiry judges must conduct into the reasonable reliance upon facts and data, and into the inferential steps leading to a conclusion. And yet, many judges resist, and offer up excuses and dodges for shirking their gatekeeping obligations. See generally David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27 (2013).
There is a courtroom in New Jersey in which gatekeeping is taken seriously from beginning to end. There is at least one trial judge who encourages, and even demands, that expert witnesses appear, explain their methodologies, and actually show their methodological compliance. Judge Johnson first distinguished himself in In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J. Super. Law Div. Atlantic Cty. Feb. 20, 2015).[1] More recently, in two ovarian cancer cases, Judge Johnson dusted two expert witnesses who thought they could claim their turn in the witness chair by virtue of their credentials and some rather glib hand waving. Judge Johnson conducted the New Jersey analogue of a Federal Rule of Evidence 104(a) Daubert hearing, as required by the New Jersey Supreme Court’s decision in Kemp v. The State of New Jersey, 174 N.J. 412 (2002). The result was disastrous for the two expert witnesses who opined that women’s use of talcum powder causes ovarian cancer. Carl v. Johnson & Johnson, No. ATL-L-6546-14, 2016 WL 4580145 (N.J. Super. Ct. Law Div., Atl. Cty., Sept. 2, 2016) [cited as Carl].
Judge Johnson obviously had a good epidemiology teacher in Professor Stephen Goodman, who testified in the Accutane case. Against the standard of rigor exemplified in that teaching, it is easy to see how the plaintiffs’ talc expert witnesses, Drs. Daniel Cramer and Graham Colditz, fell “significantly” short. After presiding over seven days of hearings, and reviewing extensive party submissions, including the actual studies relied upon by the expert witnesses and the parties, Judge Johnson made no secret of his disappointment with the lack of rigor in the analyses proffered by Cramer and Colditz:
“Throughout these proceedings the court was disappointed in the scope of Plaintiffs’ presentation; it almost appeared as if counsel wished the court to wear blinders. Plaintiffs’ two principal witnesses on causation, Dr. Daniel Cramer and Dr. Graham Colditz, were generally dismissive of anything but epidemiological studies, and within that discipline of scientific investigation they confined their analyses to evidence derived only from small retrospective case-control studies. Both witnesses looked askance upon the three large cohort studies presented by Defendants. As confirmed by studies listed at Appendices A and B, the participants in the three large cohort studies totaled 191,090 while those case-control studies advanced by Plaintiffs’ witnesses, and which were the ones utilized in the two meta-analyses performed by Langseth and Terry, total 18,384 participants. As these proceedings drew to a close, two words reverberated in the court’s thinking: ‘narrow and shallow.’ It was almost as if counsel and the expert witnesses were saying, Look at this, and forget everything else science has to teach us.”
Carl at *12.
Judge Johnson did what for so many judges is unthinkable; he looked behind the curtain put up by the highly credentialed, Oz-like expert witnesses in his courtroom. What he found was unexplained, unjustified selectivity in their reliance upon some but not all of the available data, and glib conclusions that glossed over significant limits in the resolving power of the available epidemiologic studies. Judge Johnson was particularly unsparing of Graham Colditz, a capable scientist, who deviated from the standards he had set for himself in the work he published in the scientific community:
“Dr. Graham Colditz is a brilliant scientist and a dazzling witness. His vocal inflection, cadence, and adroit use of histrionics are extremely effective. Dr. Colditz’s reputation for his breadth of knowledge about cancer and the esteem in which he is held by his peers is well deserved. Yet, at times, it seemed that issues raised in these proceedings, and the questions posed to him, were a bit mundane for a scientist of his caliber.”
Carl at *15. Dr. Colditz and the plaintiffs’ cause were not helped by his own previous publications of studies and reviews, which failed to support any “substantial association between perineal talc use and ovarian cancer risk overall,” and failed to conclude that talc was even a “risk factor” for ovarian cancer. Carl at *18.
Relative Risk Size
Many courts have fumbled their handling of the issue of whether the applicable relative risk must exceed two before fact finders may infer specific causation between a claimed exposure and a specific disease. There certainly can be causal associations built upon relative risks greater than 1.0 but no greater than 2.0. Eliminating validity concerns may be more difficult with such smaller relative risks, but there is nothing theoretically insuperable about a causal association based upon them. Judge Johnson apparently saw the diversity of opinions on this relative risk issue, many of which are stridently maintained, and thoroughly fallacious.
Judge Johnson ultimately did not base his decision, with respect to general or specific causation, on the magnitude of the relative risk, or on the corresponding Bradford Hill factor of “strength of association.” Dr. Cramer appropriately acknowledged that his meta-analysis result, an odds ratio of 1.29, was “weak,” Carl at *19, and Judge Johnson was critical of Dr. Colditz for failing to address the weakness of the association, and for his constant refrain that the association was “significant,” which describes the precision, not the size, of the estimated association. Carl at *17.
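To illustrate the distinction, here is a minimal sketch in Python, with hypothetical two-by-two counts chosen only so that the odds ratio comes out near the 1.29 figure discussed in Carl; it shows how a weak association can nonetheless be “statistically significant” once enough subjects are pooled:

```python
import math

# Hypothetical 2x2 counts, chosen only so the odds ratio is ~1.29;
# these are not the litigation's actual data.
a, b = 1200, 5000   # exposed cases, exposed controls
c, d = 1000, 5375   # unexposed cases, unexposed controls

or_hat = (a * d) / (b * c)                    # odds ratio = 1.29
se = math.sqrt(1/a + 1/b + 1/c + 1/d)         # standard error of ln(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se)   # 95% confidence bounds
hi = math.exp(math.log(or_hat) + 1.96 * se)
z = math.log(or_hat) / se                     # Wald z-statistic

print(f"OR = {or_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], z = {z:.1f}")
# OR = 1.29, 95% CI [1.18, 1.41], z = 5.4: "significant" because the
# interval excludes 1.0, yet the association itself remains weak.
```

The arithmetic underscores why Judge Johnson’s complaint was well taken: statistical significance speaks to sampling error, not to the strength of the association.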
Aware of the difficulty that New Jersey appellate courts have had with the issues surrounding relative risks greater than two, Judge Johnson was realistic to steer clear of any specific judicial reliance on the small size of the relative risk. His Honor’s prudence is unfortunate, however, because ultimately small relative risks, even assuming that general causation is established, do nothing to support specific causation. Indeed, a relative risk of 1.29 (and odds ratios generally overstate the size of the underlying relative risk) would, on a stochastic model, support the conclusion that specific causation was less than 50% probable (the arithmetic is sketched after the list below). Critics have pointed out that risk may not be stochastically distributed, which is a great point, except that
(1) plaintiffs often have no idea how the risk, if real, is distributed in the observed sample, and
(2) the upshot of the point is that even for relative risks greater than 2.0, there is no warrant for inferring specific causation in a given case.
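For readers who want the arithmetic behind that point, here is a minimal sketch, assuming the simple stochastic model under which the probability of specific causation equals the attributable fraction among the exposed, (RR − 1)/RR; the figures are illustrative only:

```python
# Probability of causation under the simple stochastic model:
# PC = (RR - 1) / RR, the attributable fraction among the exposed.
# Illustrative arithmetic only, not the court's calculation.
for rr in (1.29, 2.0, 3.0):
    pc = (rr - 1) / rr
    print(f"RR = {rr:.2f} -> probability of causation = {pc:.0%}")
# RR = 1.29 -> 22%; RR = 2.00 -> 50%; RR = 3.00 -> 67%.
# Only when RR exceeds 2.0 does the probability pass 50%.
```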
Judge Johnson did wade into the relative risk waters by noting that when relative risks are “significantly” less than two, establishing biological plausibility becomes essential. Carl at *11. This pronouncement is muddled on at least two fronts. First, the relative risk scale is a continuum, and there is no standard for deciding when a relative risk greater than 1.0 is “significantly” less than 2.0. Presumably, Judge Johnson thought that 1.29 was in the “significantly less than 2.0” range, but he did not say so; nor did he cite a source that supported this assessment. Perhaps he was suggesting that the upper bound of some meta-analysis’s confidence interval was less than two. Second, and more troubling, the claim that biological plausibility becomes “essential” in the face of small relative risks is also unsupported. Judge Johnson does not cite any support for this claim, and I am not aware of any. Elsewhere in his opinion, Judge Johnson noted that
“When a scientific rationale doesn’t exist to explain logically the biological mechanism by which an agent causes a disease, courts may consider epidemiologic studies as an alternate [sic] means of proving general causation.”
Carl at *8. So it seems that biological plausibility is not essential after all.
This glitch in the Carl opinion is likely of no lasting consequence, however, because epidemiologists are rarely at a loss to posit some biologically plausible mechanism. As the Dictionary of Epidemiology explains the matter:
“The causal consideration that an observed, potentially causal association between an exposure and a health outcome may plausibly be attributed to causation on the basis of existing biomedical and epidemiological knowledge. On a schematic continuum including possible, plausible, compatible, and coherent, the term plausible is not a demanding or stringent requirement, given the many biological mechanisms that often can be hypothesized to underlie clinical and epidemiological observations; hence, in assessing causality, it may be logically more appropriate to require coherence (biological as well as clinical and epidemiological). Plausibility should hence be used cautiously, since it could impede development or acceptance of new knowledge that does not fit existing biological evidence, pathophysiological reasoning, or other evidence.”
Miquel Porta, et al., eds., “Biological plausibility,” in A Dictionary of Epidemiology at 24 (6th ed. 2014). Most capable epidemiologists have thought up half a dozen biologically plausible mechanisms each morning before their first cup of coffee. But the most compelling reason that this judicial hiccup is inconsequential is that the plaintiffs’ expert witnesses’ postulated mechanism, inflammation, was demonstrably absent in the tissue of the specific plaintiffs. Carl at *13. The glib invocation of “inflammation” would seem bound to fail even the most liberal test of plausibility, given that talc has anti-cancer properties resulting from its ability to inhibit new blood vessel formation (a necessity for solid tumor growth), and given the completely unexplained selectivity of the postulated effect for ovarian tissue, which leaves vaginal, endometrial, and fallopian tissues unaffected. Carl at *13-14. On at least two occasions, the United States Food and Drug Administration rejected “Citizen Petitions” for ovarian cancer warnings on talc products, advanced by the dubious Samuel S. Epstein for the Cancer Prevention Coalition, in large measure because of Epstein’s undue selectivity in citing epidemiologic studies and because a “cogent biological mechanism by which talc might lead to ovarian cancer is lacking… .” Carl at *15, citing Stephen M. Musser, FDA Director, Letter Denying Citizens’ Petition (April 1, 2014).
Large Studies
Judge Johnson quoted the Reference Manual on Scientific Evidence (3d ed. 2011) for his suggestion that establishing causation requires large studies. The quoted language, however, does not really support that suggestion:
“Common sense leads one to believe that a large enough sample of individuals must be studied if the study is to identify a relationship between exposure to an agent and disease that truly exists. Common sense also suggests that by enlarging the sample size (the size of the study group), researchers can form a more accurate conclusion and reduce the chance of random error in their results… With large numbers, the outcome of the test is less likely to be influenced by random error, and the researcher would have greater confidence in the inferences drawn from the data.”
Reference Manual at page 576. What the Reference Manual calls for is simply studies with “large enough” samples. How large is large enough depends upon the magnitude of the association to be detected, the length of follow-up, and the base rate or incidence of the outcome of interest. As far as “common sense” goes, the Reference Manual is correct only insofar as larger is better with respect to sampling error. Increasing sample size does nothing to address the internal or external validity of studies, and may invite erroneous interpretations by allowing results to achieve statistical significance at predetermined levels even when the observed associations result from bias or confounding, and not from any underlying relationship between exposure and disease outcome.
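A short sketch, with avowedly hypothetical numbers, makes the point concrete: a bigger study shrinks the confidence interval around whatever the study is measuring, bias included:

```python
import math

# Suppose the true relative risk is 1.0, but uncontrolled confounding
# inflates the observed RR to 1.3 no matter how many subjects enroll.
observed_rr = 1.3
for n in (50, 500, 5000):            # cases in each comparison group
    se = math.sqrt(1/n + 1/n)        # crude SE of ln(RR) for a rare outcome
    lo = math.exp(math.log(observed_rr) - 1.96 * se)
    hi = math.exp(math.log(observed_rr) + 1.96 * se)
    print(f"{n:>5} cases per group: RR 1.3, 95% CI [{lo:.2f}, {hi:.2f}]")
# 50 cases:   CI [0.88, 1.92]; 500: [1.15, 1.47]; 5000: [1.25, 1.35].
# The interval narrows as the study grows, so the biased 1.3 eventually
# becomes "statistically significant"; more data cannot cure bias.
```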
There is a more disturbing implication in Judge Johnson’s criticism of Graham Colditz for relying upon the smaller number of subjects in the case-control studies than are found in the available cohort studies. Ovarian cancer is a relatively rare cancer (compared with breast and colon cancer), and case-control studies are more efficient than cohort studies at assessing increased risk for a rare outcome. The cases in a case-control study represent an implied source population many times larger than the study’s actual roster of subjects. Had Judge Johnson compared the width of the confidence intervals for the “small” case-control studies with the interval widths for the cohort studies, he would have seen that “smaller” case-control studies (fewer cases, as well as fewer total subjects) can generate more statistical precision than larger cohort studies (with many more cohort and control subjects). A more useful comparison would have been between the number of actual ovarian cancer cases in the meta-analyzed case-control studies and the number of actual ovarian cancer cases in the cohort studies. On this comparison, the cohort studies might not fare so well.
For a rare outcome, the sheer size of a cohort is thus fairly meaningless in terms of the statistical precision generated. Smaller case-control studies will likely have much more power, and that should be reflected in the confidence intervals of the respective studies.
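A back-of-the-envelope sketch, again with hypothetical two-by-two counts, shows why: the variance of the estimate is dominated by its sparsest cells, the cases, so a case-control study of some twelve hundred subjects can match the precision of a cohort nearly a hundred times its size:

```python
import math

# Hypothetical "small" case-control study: 600 cases, 600 controls.
a, b = 250, 220     # exposed cases, exposed controls
c, d = 350, 380     # unexposed cases, unexposed controls
se_cc = math.sqrt(1/a + 1/b + 1/c + 1/d)

# Hypothetical "large" cohort: 100,000 subjects, only 300 incident cases.
a, b = 130, 39870   # exposed: cases, non-cases
c, d = 170, 59830   # unexposed: cases, non-cases
se_cohort = math.sqrt(1/a + 1/b + 1/c + 1/d)

print(f"case-control SE(ln OR) = {se_cc:.3f}")      # ~0.118
print(f"cohort       SE(ln OR) = {se_cohort:.3f}")  # ~0.117
# 1,200 subjects yield the same precision as 100,000, because the
# handful of cases, not the mass of healthy subjects, drives the variance.
```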
The issue, as I understand the talc litigation, is not the size of the case-control versus the cohort studies, but rather their analytical resolving power. Case-control studies of this sort of exposure and outcome will be plagued by recall and other biases, as well as by the difficulty of selecting the right control group. And the odds ratio will tend to exaggerate the underlying relative risk, in both directions away from 1.0. Cohort studies, with good, pre-morbid exposure assessments, would thus be much more rigorous and accurate in estimating the true rate ratios. In the final analysis, Judge Johnson was correct to be critical of Graham Colditz for dismissing the cohort studies, but his rationale for this criticism was, in a few places, confused and confusing. There was nothing subtle about the analytical gaps, ipse dixits, and cherry picking exhibited by these plaintiffs’ expert witnesses.
[1] See “Johnson of Accutane – Keeping the Gate in the Garden State” (Mar. 28, 2015).