TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

The “Rothman” Amicus Brief in Daubert v. Merrell Dow Pharmaceuticals

November 17th, 2018

“Then time will tell just who fell
And who’s been left behind”

                  Dylan, “Most Likely You Go Your Way” (1966)

 

When the Daubert case headed to the Supreme Court, it had 22 amicus briefs in tow. Today that number is routine for an appeal to the high court, but in 1992, it signaled intense interest in the case among both the scientific and legal communities. To the litigation industry, the prospect of judicial gatekeeping of expert witness testimony was anathema. To the manufacturing industry, the prospect was a precious defense against specious claiming.

With the benefit of 25 years of hindsight, a look at some of those amicus briefs reveals a good deal about the scientific and legal acumen of the “friends of the court.” Not all amicus briefs in the case were equal; not all have held up well with the passage of time. The amicus brief of the American Association for the Advancement of Science and the National Academy of Sciences was a good example of advocacy for the full implementation of gatekeeping on scientific principles of valid inference.1 Other amici urged an anything-goes approach to judicial oversight of expert witnesses.

One amicus brief often praised by plaintiffs’ counsel was submitted by Professor Kenneth Rothman and colleagues.2 This amicus brief is still cited by parties who find support in it for their excuses for not having consistent, valid, strong, and statistically significant evidence to support their claims of causation. To be sure, Rothman did target statistical significance as a strict criterion of causal inference, but there is little support in the brief for the loosey-goosey style of causal claiming that is so prevalent among lawyers for the litigation industry. Unlike the brief filed by the AAAS and the National Academy of Sciences, Rothman’s brief abstained from addressing the social policies implied by judicial gatekeeping or its rejection. Instead, Rothman’s brief set out to make three narrow points:

(1) courts should not rely upon strict statistical significance testing for admissibility determinations;

(2) peer review is not an appropriate touchstone for the validity of an expert witness’s opinion; and

(3) unpublished, non-peer-reviewed “reanalysis” of studies is a routine part of the scientific process, and regularly practiced by epidemiologists and other scientists.

Rothman was encouraged to target these three issues by the lower courts’ opinions in the Daubert case, in which the courts made blanket statements about the absence of statistical significance, the want of peer review, and the illegitimacy of “re-analyses” of published studies.

Professor Rothman has made many admirable contributions to epidemiologic practice, but the amicus brief submitted by him and his colleagues falls into the trap of making the sort of blanket general statements that they condemned in the lower courts’ opinions. Of the brief’s three points, the first, about statistical significance, is the most important for epidemiologic and legal practice. Despite reports of an odd journal here or there “abolishing” p-values, most medical journals continue to require the presentation of either p-values or confidence intervals. In the majority of medical journals, 95% confidence intervals that exclude a null hypothesis risk ratio of 1.0, or risk difference of 0, are labeled “statistically significant,” sometimes improvidently in the presence of multiple comparisons and the absence of pre-specified outcomes.

For over three decades, Rothman has criticized the prevailing practice on statistical significance. Professor Rothman is also well known for his advocacy of the superiority of confidence intervals over p-values in conveying important information about what range of values is reasonably compatible with the observed data.3 His criticisms of p-values and his advocacy for estimation with intervals have pushed biomedical publishing to embrace confidence intervals as more informative than p-values alone. Still, his views on statistical significance have never gained complete acceptance at most clinical journals. Biomedical scientists continue to interpret 95% confidence intervals, at least in part, by whether they show “significance” in excluding the null hypothesis value of no risk difference or of a risk ratio equal to 1.0.
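The arithmetic behind that labeling convention is simple enough to show. What follows is a minimal sketch, with hypothetical counts, of the standard log-normal approximation for a risk ratio’s 95% confidence interval; it illustrates the journal convention described above, not anything in the Rothman brief.

```python
import math

# Hypothetical 2x2 counts: 30/1000 events among exposed, 15/1000 among unexposed.
a, n1 = 30, 1000
c, n0 = 15, 1000

rr = (a / n1) / (c / n0)                 # risk ratio = 2.0
se = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)  # standard error of ln(RR)

lo = math.exp(math.log(rr) - 1.96 * se)  # lower 95% limit
hi = math.exp(math.log(rr) + 1.96 * se)  # upper 95% limit
print(f"RR = {rr:.2f}, 95% CI: {lo:.2f} to {hi:.2f}")  # ~1.08 to 3.69

# The interval excludes 1.0 exactly when the two-sided p-value for the null
# hypothesis RR = 1 falls below 0.05; "significant by exclusion of the null"
# and "p < 0.05" are the same arithmetic read two ways.
```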

The first point in Rothman’s amicus brief is styled:

“THE LOWER COURTS’ FOCUS ON SIGNIFICANCE TESTING IS BASED ON THE INACCURATE ASSUMPTION THAT ‘STATISTICAL SIGNIFICANCE’ IS REQUIRED IN ORDER TO DRAW INFERENCES FROM EPIDEMIOLOGICAL INFORMATION”

The challenge by Rothman and colleagues to the “assumption” that statistical significance is necessary is what, of course, has endeared this brief to the litigation industry. A close read of the brief, however, shows that Rothman’s critique of the assumption is equivocal. Rothman et amici characterized the lower courts as having given:

“blind deference to inappropriate and arcane publication standards and ‘significance testing’.”4

The brief is silent about what might be knowing deference, or appropriate publication standards. To be sure, judges have often poorly expressed their reasoning for deciding scientific evidentiary issues, and perhaps poor communication or laziness by judges was responsible for Rothman’s interest in joining the Daubert fray. Putting aside the unclear, rhetorical, and somewhat hyperbolic use of “arcane” in the quote above, the suggestion of inappropriate blind deference is itself expressed in equivocal terms in the brief. At times the authors rail at the use of statistical significance as the “sole” criterion, and at times, they seem to criticize its use at all.

At least twice in their brief, Rothman and friends declare that the lower court:

“misconstrues the validity and appropriateness of significance testing as a decision making tool, apparently deeming it the sole test of epidemiological hypotheses.”5

* * * * * *

“this Court should reject significance testing as the sole acceptable criterion of scientific validity in epidemiology.”6

Characterizing “statistical significance” as not the sole test or criterion of scientific inference is hardly controversial, and it implies that statistical significance is one test, criterion, or factor among others. This position is consistent with the current ASA Statement on Significance Testing.7 There is, of course, much more to evaluate in a study or a body of studies than simply whether they individually or collectively help us to exclude chance as an explanation for their findings.

Statistical Significance Is Not Necessary At All

Elsewhere, Rothman and friends take their challenge to statistical significance testing beyond merely suggesting that such testing is only one test or criterion among others. Indeed, their brief in other places states their opinion that significance testing is not necessary at all:

“Testing for significance, however, is often mistaken for a sine qua non of scientific inference.”8

And at other times, Rothman and friends go further yet and claim not only that significance is not necessary, but that it is not even appropriate or useful:

“Significance testing, however, is neither necessary nor appropriate as a requirement for drawing inferences from epidemiologic data.”9

Rothman contrasts statistical significance testing with “scientific inference,” which is not a mechanical, mathematical procedure, but rather a “thoughtful evaluation[] of possible explanations for what is being observed.”10 Significance testing, in contrast, is “merely a statistical tool,” used inappropriately “in the process of developing inferences.”11 Rothman suggests that the term “statistical significance” could be eliminated from scientific discussions without loss of meaning, as though this linguistic legerdemain showed that the phrase is unimportant in science and in law.12 Rothman’s suggestion, however, ignores that causal assessments have always required an evaluation of the play of chance, especially for putative causes that are neither necessary nor sufficient, and that modify underlying stochastic processes by increasing or decreasing the probability of a specified outcome. Asserting that statistical significance is misleading because it never describes the size of an association, as the Rothman brief does, is like telling us that color terms convey nothing about the mass of a body.

The Rothman brief does make the salutary point that labeling a study outcome as not “statistically significant” carries the danger that the study’s data will be treated as having no value, or that the study will be taken to have rejected the hypothesized association. In 1992, such interpretations may have been more common, but today, in the face of the proliferation of meta-analyses, the risk of such interpretations of single study outcomes is remote.

Questionable History of Statistics

Rothman suggests that the development of statistical hypothesis testing occurred in the context of agricultural and quality-control experiments, which required yes-no answers for future action.13 This suggestion clearly points at Sir Ronald Fisher and Jerzy Neyman, and their foundational work on frequentist statistical theory and practice. The amici correctly identified the experimental milieu in which Fisher worked, but their description of Fisher’s work is neither accurate nor fair. Fisher spent a lifetime thinking and writing about statistical tests, in much more nuanced ways than the amici implied. Although Fisher worked on agricultural experiments, his writings acknowledged that when statistical tests and analyses were applied to observational studies, much more searching analyses of bias and confounding were required. Fisher’s and Berkson’s reactions to the observational studies of Hill and Doll on smoking and lung cancer are telling in this regard. These statisticians criticized the early smoking studies, not for lack of statistical significance, but for failing to address confounding by a potential common genetic propensity to smoke and to develop lung cancer.

Questionable History of Drug Development

Twice in Rothman’s amicus brief, the authors suggest that “undue reliance” on statistical significance has resulted in overlooking “effective new treatments” because observed benefits were considered “not significant,” despite an “indication” of efficacy.14 The brief never explains what separates due from undue reliance on statistical significance, although the criticism of “undue reliance” implies that there are modes or instances of “due reliance.” Nor does the brief inform readers exactly what “effective new treatments” were overlooked because their outcomes were considered “not significant.” The omission is regrettable because it leaves the reader with only abstract recommendations, when Rothman almost certainly could have marshalled concrete examples. Recently, Rothman tweeted just such an example:15

“30% ↓ in cancer risk from Vit D/Ca supplements ignored by authors & editorial. Why? P = 0.06. http://bit.ly/2oanl6w http://bit.ly/2p0CRj7. The 95% confidence interval for the risk ratio was 0.42–1.02.”

Of course, this was a large, carefully reported randomized clinical trial, with a narrow confidence interval that just missed “statistical significance.” It is not an example that would have given succor to Bendectin plaintiffs, who were attempting to prove an association by identifying flaws in noisy observational studies that generally failed to show an association.
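How close the trial came can be seen from the interval itself. The sketch below uses the standard back-calculation of an approximate two-sided p-value from a reported 95% confidence interval for a ratio measure, assuming approximate normality on the log scale; the numbers are those quoted in the tweet, and the arithmetic is an illustration, not a re-analysis of the trial.

```python
from math import erf, exp, log, sqrt

lo, hi = 0.42, 1.02                       # reported 95% CI for the risk ratio
se = (log(hi) - log(lo)) / (2 * 1.96)     # SE of ln(RR) implied by the interval
est = exp((log(lo) + log(hi)) / 2)        # point estimate: midpoint on log scale
z = abs(log(est)) / se                    # test statistic for the null RR = 1
p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal p-value

print(f"implied RR ~ {est:.2f}, z = {z:.2f}, p ~ {p:.2f}")  # p ~ 0.06
```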

Readers of the 1992 amicus brief can only guess at what might be “indications of efficacy”; no explanation or examples are provided.16 The reality of FDA approvals of new drugs is that a pre-specified 5% level of statistical significance is virtually always enforced.17 If a drug sponsor has an “indication of efficacy,” it is, of course, free to follow up with an additional, larger, better-designed clinical trial. Rothman’s recent tweet about the vitamin D clinical trial does provide some context and meaning for what the amici may have meant, over 25 years ago, by an indication of efficacy. The tweet also illustrates Rothman’s acknowledgment of the need to address random variability in a data set, whether by p-value or confidence interval, or both. Clearly, Rothman was criticizing the authors of the vitamin D trial for stopping short of claiming that they had shown (or “demonstrated”) a reduction in cancer risk. There is, however, a rich literature on vitamin D and cancer outcomes, and such a claim could be made, perhaps, in the context of a meta-analysis or meta-regression of multiple clinical trials, with a synthesis of other experimental and observational data.18

Questionable History of Statistical Analyses in Epidemiology

Rothman’s amicus brief deserves credit for introducing a misinterpretation of Sir Austin Bradford Hill’s famous paper on inferring causal associations, which has become catechism in the briefs of plaintiffs in pharmaceutical and other products liability cases:

“No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 299 (1965) (quoted at Rothman Brief at *6).

As exegesis of Hill’s views, this quote is misleading. The language quoted above was used by Hill in the context of his nine causal viewpoints or criteria. The Rothman brief ignores Hill’s admonition to his readers, that before reaching the nine criteria, there is a serious, demanding predicate that must be shown:

“Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

Id. at 295 (emphasis added). Rothman and co-authors did not have to invoke the prestige and authority of Sir Austin, but once they did, they were obligated to quote him fully and with accurate context. Elsewhere, in his famous textbook, Hill expressed his view that common sense was insufficient to interpret data, and that the statistical method was necessary to interpret data in medical studies.19

Rothman complains that statistical significance focuses the reader on conjecture about the role of chance in the observed data, rather than on the information conveyed by the data themselves.20 The “incompleteness” of statistical analysis for arriving at causal conclusions, however, is not an argument against its necessity.

The Rothman brief does make the helpful point that statistical significance cannot be sufficient to support a conclusion of causation, because many statistically significant associations or correlations will be non-causal. The amici give a trivial example of wearing dresses and breast cancer, but the point is well-taken. Associations, even when statistically significant, are not necessarily causal. Who ever suggested otherwise, other than expert witnesses for the litigation industry?

Unnecessary Fears

The motivation for Rothman’s challenge to the assumption that statistical significance is necessary is revealed at the end of the argument on Point I. The authors plainly express their concern that false negatives will shut down important research:

“To give weight to the failure of epidemiological studies to meet strict ‘statistical significant’ standards — to use such studies to close the door on further inquiry — is not good science.”21

The relevance of this concern to the judicial proceedings is a mystery. The judicial decisions in the case were not referenda on funding initiatives. Scientists were as free in 1993, after Daubert was decided, as they were in 1992, when Rothman wrote, to pursue the hypothesis that Bendectin caused birth defects. The decision had the potential to shut down tort claims, but it left scientists to their tasks.

Reanalyses Are Appropriate Scientific Tools to Assess and Evaluate Data, and to Forge Causal Opinions

The Rothman brief took issue with the lower courts’ dismissal of plaintiffs’ expert witnesses’ re-analyses of data in published studies. The authors argued that reanalyses were part of the scientific method, and not “an arcane or specialized enterprise,” deserving of heightened or skeptical scrutiny.22

Remarkably, the Rothman brief, if accepted by the Supreme Court on the re-analysis point, would have led to the sort of unthinking blanket acceptance of a methodology that the brief’s authors condemned in the context of blanket acceptance of significance testing. The brief covertly urges “blind deference” to its authors in its blanket approval of re-analyses.

Although amici face tight page limits, the brief’s authors made clear that they were offering no substantive opinions on the data involved in the published epidemiologic studies on Bendectin, or on the plaintiffs’ expert witnesses’ re-analyses. With the benefit of hindsight, we can see that the sweeping language used by the Ninth Circuit on re-analyses might have been taken to foreclose important and valid meta-analyses or similar approaches. The Rothman brief is not terribly explicit about which re-analysis techniques were part of the scientific method, but meta-analyses surely had been on the authors’ minds:

“by focusing on inappropriate criteria applied to determine what conclusions, if any, can be reached from any one study, the trial court forecloses testimony about inferences that can be drawn from the combination of results reported by many such studies, even when those studies, standing alone, might not justify such inferences.”23

The plaintiffs’ statistical expert witness in Daubert had proffered a re-analysis of at least one study by substituting a different control sample, as well as a questionable meta-analysis. By failing to engage on the propriety of the specific analyses at issue in Daubert, the Rothman brief failed to offer meaningful guidance to the appellate court.

Reanalyses Are Not Invalid Just Because They Have Not Been Published

Rothman was certainly correct that the value of peer review was overstated by the defense in Bendectin litigation.24 The quality of pre-publication peer review is spotty, at best. Predatory journals deploy a pay-to-play scheme, which makes a mockery of scientific publishing. Even at respectable journals, peer review cannot effectively guard against fraud, or ensure that statistical analyses have been appropriately done.25 At best, peer review is a weak proxy for study validity, and an unreliable one at that.

The Rothman brief may have moderated the Supreme Court’s reaction to the defense’s argument that peer review is a requirement for studies, or “re-analyses,” relied upon by expert witnesses. The Court in Daubert opined, in dicta, that peer review is a non-dispositive consideration:

“The fact of publication (or lack thereof) in a peer reviewed journal … will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”26

To the extent that Rothman and colleagues might have been disappointed in this outcome, they missed some important context of the Bendectin cases. Most of the cases had been resolved by a consolidated trial on causation, but many opt-out cases had to be tried in state and federal courts around the country.27 The expert witnesses challenged in Daubert (Drs. Swan and Done) participated in many of these opt-out cases, and in each case, they opined that Bendectin was a public health hazard. The failure of these witnesses to publish their analyses and re-analyses spoke volumes about their bona fides. Courts (and juries, where the testimony proffered by Swan and Done was admitted) could certainly draw negative inferences from the plaintiffs’ expert witnesses’ failure to publish their opinions and re-analyses.

The Fate of the “Rothman Approach” in the Courts

The so-called “Rothman approach” was urged by Bendectin plaintiffs in opposing summary judgment in a case pending in federal court, in New Jersey, before the Supreme Court decided Daubert. Plaintiffs resisted exclusion of their expert witnesses, who had relied upon inconsistent and statistically non-significant studies on the supposed teratogenicity of Bendectin. The trial court excluded the plaintiffs’ witnesses, and granted summary judgment.28

On appeal, the Third Circuit reversed and remanded the DeLucas’s case for a hearing under Rule 702:

“by directing such an overall evaluation, however, we do not mean to reject at this point Merrell Dow’s contention that a showing of a .05 level of statistical significance should be a threshold requirement for any statistical analysis concluding that Bendectin is a teratogen regardless of the presence of other indicia of reliability. That contention will need to be addressed on remand. The root issue it poses is what risk of what type of error the judicial system is willing to tolerate. This is not an easy issue to resolve and one possible resolution is a conclusion that the system should not tolerate any expert opinion rooted in statistical analysis where the results of the underlying studies are not significant at a .05 level.”29

After remand, the district court excluded the DeLuca plaintiffs’ expert witnesses, and granted summary judgment, based upon the dubious methods employed by plaintiffs’ expert witnesses in cherry picking data, recalculating risk ratios in published studies, and ignoring bias and confounding in studies. The Third Circuit affirmed the judgment for Merrell Dow.30

In the end, the decisions in the DeLuca case never endorsed the Rothman approach, although Professor Rothman can take credit perhaps for forcing the trial court, on remand, to come to grips with the informational content of the study data, and the many threats to validity, which severely undermined the relied-upon studies and the plaintiffs’ expert witnesses’ opinions.

More recently, in litigation over alleged causation of birth defects in offspring of mothers who used Zoloft during pregnancy, plaintiffs’ counsel attempted to resurrect, through their expert witnesses, the Rothman approach. The multidistrict court saw through counsel’s assertions that the Rothman approach had been adopted in DeLuca, or that it had become generally accepted.31 After protracted litigation in the Zoloft cases, the district court excluded plaintiffs’ expert witnesses and entered summary judgment for the defense. The Third Circuit found that the district court’s handling of the statistical significance issues was fully consistent with the Circuit’s previous pronouncements on the issue of statistical significance.32


1 The brief, filed in Daubert v. Merrell Dow Pharmaceuticals, Inc., U.S. Supreme Court No. 92-102 (Jan. 19, 1993), was submitted by Richard A. Meserve and Lars Noah, of Covington & Burling, and by Bert Black; it is reprinted at 12 Biotechnology Law Report 198 (No. 2, March-April 1993). See “Daubert’s Silver Anniversary – Retrospective View of Its Friends and Enemies” (Oct. 21, 2018).

2 Brief Amici Curiae of Professors Kenneth Rothman, Noel Weiss, James Robins, Raymond Neutra and Steven Stellman, in Support of Petitioners, 1992 WL 12006438, Daubert v. Merrell Dow Pharmaceuticals, Inc., U.S. S. Ct. No. 92-102 (Dec. 2, 1992). [Rothman Brief].

3 Id. at *7.

4 Rothman Brief at *2.

5 Id. at *2-*3 (emphasis added).

6 Id. at *7 (emphasis added).

7 See Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The American Statistician 129 (2016).

8 Rothman Brief at *3.

9 Id. at *2.

10 Id. at *3 – *4.

11 Id. at *3.

12 Id. at *3.

13 Id. at *4-*5.

14 Id. at *5, *6.

15 Tweet at <https://twitter.com/ken_rothman/status/855784253984051201> (April 21, 2017). The tweet pointed to: Joan Lappe, Patrice Watson, Dianne Travers-Gustafson, Robert Recker, Cedric Garland, Edward Gorham, Keith Baggerly, and Sharon L. McDonnell, “Effect of Vitamin D and Calcium Supplementation on Cancer Incidence in Older Women: A Randomized Clinical Trial,” 317 J. Am. Med. Ass’n 1234 (2017).

16 In the case of United States v. Harkonen, Professors Ken Rothman and Tim Lash, and I made common cause in support of Dr. Harkonen’s petition to the United States Supreme Court. The circumstances of Dr. Harkonen’s indictment and conviction provide a concrete example of what Dr. Rothman probably was referring to as “indication of efficacy.” I supported Dr. Harkonen’s appeal because I agreed that there had been a suggestion of efficacy, even if Harkonen had overstated what his clinical trial, standing alone, had shown. (There had been a previous clinical trial, which demonstrated a robust survival benefit.) From my perspective, the facts of the case supported Dr. Harkonen’s exercise of speech in a press release, but it would hardly have justified FDA approval for the indication that Dr. Harkonen was discussing. If Harkonen had indeed committed “wire fraud,” as claimed by the federal prosecutors, then I had (and still have) a rather long list of expert witnesses who stand in need of criminal penalties and rehabilitation for their overreaching opinions in court cases.

17 Robert Temple, “How FDA Currently Makes Decisions on Clinical Studies,” 2 Clinical Trials 276, 281 (2005); Lee Kennedy-Shaffer, “When the Alpha is the Omega: P-Values, ‘Substantial Evidence’, and the 0.05 Standard at FDA,” 72 Food & Drug L.J. 595 (2017); see also “The 5% Solution at the FDA” (Feb. 24, 2018).

18 See, e.g., Stefan Pilz, Katharina Kienreich, Andreas Tomaschitz, Eberhard Ritz, Elisabeth Lerchbaum, Barbara Obermayer-Pietsch, Veronika Matzi, Joerg Lindenmann, Winfried Marz, Sara Gandini, and Jacqueline M. Dekker, “Vitamin D and cancer mortality: systematic review of prospective epidemiological studies,” 13 Anti-Cancer Agents in Medicinal Chem. 107 (2013).

19 Austin Bradford Hill, Principles of Medical Statistics at 2, 10 (4th ed. 1948) (“The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.”) (emphasis added).

20 Rothman Brief at *6-*7.

21 Id. at *9.

22 Id.

23 Id. at *10.

24 Rothman Brief at *12.

25 See William Childs, “Peering Behind The Peer Review Curtain,” Law360 (Aug. 17, 2018).

26 Daubert v. Merrell Dow Pharms., 509 U.S. 579, 594 (1993).

27 See “Diclegis and Vacuous Philosophy of Science” (June 24, 2015).

28 DeLuca v. Merrell Dow Pharms., Inc., 131 F.R.D. 71 (D.N.J. 1990).

29 DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 955 (3d Cir. 1990).

30 DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993).

31 In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2015 WL 314149 (E.D. Pa. Jan. 23, 2015) (Rufe, J.) (denying PSC’s motion for reconsideration), aff’d, 858 F.3d 787 (3d Cir. 2017) (affirming exclusion of plaintiffs’ expert witnesses’ dubious opinions, which involved multiple methodological flaws and failures to follow any methodology faithfully). See generally “Zoloft MDL Relieves Matrixx Depression” (Jan. 30, 2015); “WOE — Zoloft Escapes a MDL While Third Circuit Creates a Conceptual Muddle” (July 31, 2015).

32 See Pritchard v. Dow Agro Sciences, 430 F. App’x 102, 104 (3d Cir. 2011) (excluding Concussion hero, Dr. Bennet Omalu).

The American Statistical Association Statement on Significance Testing Goes to Court – Part I

November 13th, 2018

It has been two and one-half years since the American Statistical Association (ASA) issued its statement on statistical significance. Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 The American Statistician 129 (2016) [ASA Statement]. When the ASA Statement was published, I commended it as a needed counterweight to the exaggerated criticisms of significance testing.1 Lawyers and expert witnesses for the litigation industry had routinely pooh-poohed the absence of statistical significance, but over-endorsed its presence in poorly designed and biased studies. Courts and lawyers from all sides routinely misunderstood, misstated, and misrepresented the meaning of statistical significance.2

The ASA Statement had the potential to help resolve judicial confusion. It is written in non-technical language, easily understood by non-statisticians. Still, the Statement has to be read with care. The principle of charity led me to believe that lawyers and judges would read the Statement carefully, and that it would improve judicial gatekeeping of expert witnesses’ opinion testimony involving statistical evidence. I am less sanguine now about the prospect of progress.

No sooner had the ASA issued its Statement than the spinning started. One scientist, an editor of PLoS Biology, blogged that “the ASA notes, the importance of the p-value has been greatly overstated and the scientific community has become over-reliant on this one – flawed – measure.”3 Lawyers for the litigation industry were even less restrained in promoting wild misrepresentations about the Statement, with claims that the ASA had condemned the use of p-values, significance testing, and significance probabilities as “flawed.”4 And yet, nowhere in the ASA’s Statement does the group suggest that the p-value is a “flawed” measure.

Criminal Use of the ASA Statement

Where are we now, two-plus years out from the ASA Statement? Not surprisingly, the Statement has made its way into the legal arena. In the last two years, the Statement has been used in any number of depositions, relied upon in briefs, and cited in at least a couple of judicial decisions. The empirical evidence of how the ASA Statement has been used, or might be used in the future, is still sparse. Just last month, the ASA Statement was cited by the Washington State Supreme Court, in a ruling that held the state’s death penalty unconstitutional. State of Washington v. Gregory, No. 88086-7 (Wash. S.Ct. Oct. 11, 2018) (en banc). Mr. Gregory faced the death penalty after being duly convicted of rape, robbery, and murder. The prosecution was supported by DNA matches, fingerprint identification, and other evidence. Mr. Gregory challenged the constitutionality of his punishment, not on per se grounds of unconstitutionality, but on the basis of race disparities in the imposition of the death penalty. On this claim, the Washington Supreme Court commented on the empirical evidence marshalled on Mr. Gregory’s behalf:

“The most important consideration is whether the evidence shows that race has a meaningful impact on imposition of the death penalty. We make this determination by way of legal analysis, not pure science. At the very most, there is an 11 percent chance that the observed association between race and the death penalty in Beckett’s regression analysis is attributed to random chance rather than true association. Commissioner’s Report at 56-68 (the p-values range from 0.048-0.111, which measures the probability that the observed association is the result of random chance rather than a true association).[8] Just as we declined to require ‘precise uniformity’ under our proportionality review, we decline to require indisputably true social science to prove that our death penalty is impermissibly imposed based on race.”

Id. (internal citations omitted).

Whatever you think of the death penalty, or how it is imposed in the United States, you will have to agree that the Court’s discussion of statistics is itself criminal. In the quotation above, the Court badly misinterpreted the p-values generated in the various regression analyses offered to support claims of race disparity; a p-value does not measure the probability that an observed association is the result of random chance. The Court’s equating statistically significant evidence of race disparity in these regression analyses with “indisputably true social science” also reflects a rhetorical strategy that imputes ridiculously high certainty (“indisputably true”) to social science conclusions, the better to dismiss any requirement of statistical significance before accepting a causal claim of race disparity on empirical evidence.5
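The nature of the Court’s error can be made concrete. A p-value conditions on the null hypothesis (chance alone) and asks how probable data at least as extreme would be; it does not report the probability that chance produced a given finding. The following minimal simulation sketch, with hypothetical effect sizes and an assumed 50-50 mix of real and null effects, illustrates why no such probability can be read off a p-value alone:

```python
import math
import random
import statistics

random.seed(1)

def two_sample_p(effect, n=50):
    """Approximate two-sided p-value for a two-group comparison (z-test)."""
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [random.gauss(effect, 1.0) for _ in range(n)]
    se = math.sqrt(statistics.variance(x) / n + statistics.variance(y) / n)
    z = abs(statistics.mean(y) - statistics.mean(x)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Hypothetical world: half the studied associations are real (effect 0.5),
# half are null (effect 0.0).
sims = [(effect, two_sample_p(effect))
        for _ in range(4000) for effect in (0.0, 0.5)]
sig = [(e, p) for e, p in sims if p < 0.05]
null_share = sum(1 for e, _ in sig if e == 0.0) / len(sig)
print(f"share of 'significant' results that arose under the null: {null_share:.2f}")

# That share depends on the assumed mix of real and null effects, on power,
# and on alpha -- not on any single study's p-value. A p of 0.11 therefore is
# not "an 11 percent chance" that chance explains the observed association.
```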

Gregory’s counsel had briefed the Washington Court on statistical significance, and raised the ASA Statement as an excuse and justification for not presenting statistically significant empirical evidence of race disparity.6 Footnote 8, in the above quote from the Gregory decision, shows that the Court was aware of the ASA Statement, which makes the Court’s errors even more unpardonable:

[8] “The most common p-value used for statistical significance is 0.05, but this is not a bright line rule. The American Statistical Association (ASA) explains that the ‘mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making’.”7

Conveniently, Gregory’s counsel did not cite other parts of the ASA Statement, which would have called for a more searching review of the statistical regression analyses:

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”8

The Supreme Court of Washington first erred in its assessment of what scientific evidence requires in terms of a burden of proof. It then accepted spurious arguments to excuse the absence of statistical significance in the evidence before it, on the basis of a distorted representation of the ASA Statement. Finally, the Court erred in claiming support from social science evidence, by ignoring other methodological problems in Gregory’s empirical claims. Ironically, the Court had made significance testing the be-all and end-all of its analysis, and once it dispatched statistical significance as a consideration, the Court jumped to the conclusion it wanted to reach. Clearly, the intended message of the ASA Statement had been subverted by counsel and the Court.

2 See, e.g., In re Ephedra Prods. Liab. Litig., 393 F.Supp. 2d 181, 191 (S.D.N.Y. 2005). See also “Confidence in Intervals and Diffidence in the Courts” (March 4, 2012); “Scientific illiteracy among the judiciary” (Feb. 29, 2012).

5 Moultrie v. Martin, 690 F.2d 1078, 1082 (4th Cir. 1982) (internal citations omitted) (“When a litigant seeks to prove his point exclusively through the use of statistics, he is borrowing the principles of another discipline, mathematics, and applying these principles to the law. In borrowing from another discipline, a litigant cannot be selective in which principles are applied. He must employ a standard mathematical analysis. Any other requirement defies logic to the point of being unjust. Statisticians do not simply look at two statistics, such as the actual and expected percentage of blacks on a grand jury, and make a subjective conclusion that the statistics are significantly different. Rather, statisticians compare figures through an objective process known as hypothesis testing.”).

6 Supplemental Brief of Allen Eugene Gregory, at 15, filed in State of Washington v. Gregory, No. 88086-7, (Wash. S.Ct., Jan. 22, 2018).

7 State of Washington v. Gregory, No. 88086-7, (Wash. S.Ct., Oct. 11, 2018) (en banc) (internal citations omitted).

8 ASA Statement at 132.

The Hazard of Composite End Points – More Lumpenepidemiology in the Courts

October 20th, 2018

One of the challenges of epidemiologic research is selecting the right outcome of interest to study. What seems like a simple and obvious choice can often be the most complicated aspect of the design of clinical trials or studies.1 Lurking in this choice of end point is a particular threat to validity in the use of composite end points, when the real outcome of interest is one constituent among multiple end points aggregated into the composite. There may, for instance, be strong evidence in favor of one of the constituents of the composite, but using the composite end point results to support a causal claim for a different constituent begs the question that needs to be answered, whether in science or in law.

The dangers of extrapolating from one disease outcome to another are well recognized in the medical literature. Remarkably, however, the problem received no meaningful discussion in the Reference Manual on Scientific Evidence (3d ed. 2011). The handbook, designed to help judges decide threshold issues of admissibility of expert witness opinion testimony, discusses extrapolation from sample to population, from in vitro to in vivo, from one species to another, from high to low dose, and from long to short duration of exposure. The Manual, however, has no discussion of “lumping,” or of the appropriate (and inappropriate) use of composite or combined end points.

Composite End Points

Composite end points are typically defined, perhaps circularly, as a single group of health outcomes, which group is made up of constituent or single end points. Curtis Meinert defined a composite outcome as “an event that is considered to have occurred if any of several different events or outcomes is observed.”2 Similarly, Montori defined composite end points as “outcomes that capture the number of patients experiencing one or more of several adverse events.”3 Composite end points are also sometimes referred to as combined or aggregate end points.

Many composite end points are clearly defined for a clinical trial, and the component end points are specified. In some instances, the composite nature of an outcome may be subtle or be glossed over by the study’s authors. In the realm of cardiovascular studies, for example, investigators may look at stroke as a single endpoint, without acknowledging that there are important clinical and pathophysiological differences between ischemic strokes and hemorrhagic strokes (intracerebral or subarachnoid). The Fletchers’ textbook4 on clinical epidemiology gives the example:

“In a study of cardiovascular disease, for example, the primary outcomes might be the occurrence of either fatal coronary heart disease or non-fatal myocardial infarction. Composite outcomes are often used when the individual elements share a common cause and treatment. Because they comprise more outcome events than the component outcomes alone, they are more likely to show a statistical effect.”

Utility of Composite End Points

The quest for statistical “power” is often cited as a basis for using composite end points. Reduction in the number of “events,” such as myocardial infarctions (MI), through improvements in medical care has led to decreased event rates in studies and clinical trials. These low event rates have created power problems for clinical trialists, who have responded by turning to composite end points to capture more events. Composite end points permit smaller sample sizes and shorter follow-up times without sacrificing power, that is, the ability to detect a statistically significant increase of a prespecified size at a given Type I error rate. Increasing study power, while reducing sample size or observation time, is perhaps the most frequently cited rationale for using composite end points.
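The power rationale is, at bottom, arithmetic. The sketch below uses the standard two-proportion sample-size approximation, with hypothetical event rates, to show how pooling end points into a composite shrinks the required trial size:

```python
import math

def n_per_arm(p1, p2, z_a=1.96, z_b=0.84):
    """Approximate sample size per arm: alpha = 0.05 two-sided, power = 0.80."""
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * pooled_var / (p1 - p2) ** 2)

# Hypothetical control rates, each with a 25% relative reduction on treatment:
single = n_per_arm(0.04, 0.03)     # MI alone: 4% -> 3%
composite = n_per_arm(0.12, 0.09)  # MI + stroke + CV death: 12% -> 9%

print(single, composite)  # 5292 vs 1634 per arm with these hypothetical rates
```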

Competing Risks

Another reason sometimes offered in support of composite end points is that composites provide a strategy to avoid the problem of competing risks.5 Death (from any cause) is sometimes added to a distinct clinical morbidity because patients who are taken out of the trial by death are “unavailable” to experience the morbidity outcome.

Multiple Testing

By aggregating several individual end points into a single pre-specified outcome, trialists can avoid corrections for multiple testing. Trials that seek data on multiple outcomes, or on multiple subgroups, inevitably raise concerns about the appropriate choice of alpha, the level of the statistical test used to determine whether to reject the null hypothesis. According to some authors, “[c]omposite endpoints alleviate multiplicity concerns”:

“If designated a priori as the primary outcome, the composite obviates the multiple comparisons associated with testing of the separate components. Moreover, composite outcomes usually lead to high event rates thereby increasing power or reducing sample size requirements. Not surprisingly, investigators frequently use composite endpoints.”6

Other authors have similarly acknowledged that the need to avoid false positive results from multiple testing is an important rationale for composite end points:

“Because the likelihood of observing a statistically significant result by chance alone increases with the number of tests, it is important to restrict the number of tests undertaken and limit the type 1 error to preserve the overall error rate for the trial.”7
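The quoted concern is likewise simple arithmetic. Here is a minimal sketch of the familywise error rate for k independent tests at the conventional 5% level; it is an illustration of the general point, not a calculation from the quoted authors:

```python
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k   # chance of at least one false positive
    print(f"{k:>2} tests: P(at least one p < 0.05 by chance) = {fwer:.2f}")

# A single pre-specified composite involves one test of one outcome, which is
# why it sidesteps corrections such as Bonferroni (testing each at alpha / k).
```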

Indecision about an Appropriate Single Outcome

The International Conference on Harmonization suggests that the inability to select a single outcome variable may lead to the adoption of a composite outcome:

“If a single primary variable cannot be selected …, another useful strategy is to integrate or combine the multiple measurements into a single or composite variable.”8

The “indecision” rationale has also been criticized as “generally not a good reason to use a composite end point.”9

Validity of Composite End Points

The validity of composite end points depends upon methodological assumptions, which must be made at the time of study design and protocol creation. After the data are collected and analyzed, the assumptions may or may not be supported. Among the assumptions supporting the validity of a composite are:10

  • similarity in patient importance for the included component end points,

  • similarity of the size of the association across the components, and

  • similarity in the number of events across the components.

The use of composite end points can sometimes be appropriate in the “first look” at a class of diseases or disorders, with the understanding that further research will sort out and refine the associated end point. Research into the causes of human birth defects, for instance, often starts out with a look at “all major malformations,” before focusing on specific organ and tissue systems. To some extent, the legal system, in its gatekeeping function, has recognized the dangers and invalidity of lumping in the epidemiology of birth defects.11 The Frischhertz decision, for instance, clearly acknowledged that, given the clear evidence that different birth defects arise at different times, based upon interference with different embryological processes, “lumping” of end points was methodologically inappropriate. 2012 U.S. Dist. LEXIS 181507, at *8 (citing Chambers v. Exxon Corp., 81 F. Supp. 2d 661 (M.D. La. 2000), aff’d, 247 F.3d 240 (5th Cir. 2001) (unpublished)).

The Chambers decision involved a challenge to the causation opinion of frequent litigation industry witness Peter Infante,12 who attempted to defend his opinion about benzene and chronic myelogenous leukemia (CML) with epidemiology of benzene and acute myelogenous leukemia. Plaintiffs’ witnesses and counsel sought to evade the burden of producing evidence of a CML association by pointing to a study that reported “excess leukemias,” without specifying the relevant type. Chambers, 81 F. Supp. 2d at 664. The trial court, however, perspicaciously recognized the claimants’ failure to identify relevant evidence of the specific association needed to support the causal claim.

The Frischhertz and Chamber cases are hardly unique. Several state and federal courts have concurred in the context of cancer causation claims.13 In the context of birth defects litigation, the Public Affairs Committee of the Teratology Society has weighed in with strong guidance that counsels against extrapolation between different birth defects in litigation:

“Determination of a causal relationship between a chemical and an outcome is specific to the outcome at issue. If an expert witness believes that a chemical causes malformation A, this belief is not evidence that the chemical causes malformation B, unless malformation B can be shown to result from malformation A. In the same sense, causation of one kind of reproductive adverse effect, such as infertility or miscarriage, is not proof of causation of a different kind of adverse effect, such as malformation.”14

The threat to validity in attributing a suggested risk for a composite end point to all included component end points is not, unfortunately, recognized by all courts. The trial court, in Ruff v. Ensign-Bickford Industries, Inc.,15 permitted plaintiffs’ expert witness to reanalyze a study by grouping together two previously distinct cancer outcomes to generate a statistically significant result. The result in Ruff is disappointing, but not uncommon. The result is also surprising, considering the guidance provided by the American Law Institute’s Restatement:

“Even when satisfactory evidence of general causation exists, such evidence generally supports proof of causation only for a specific disease. The vast majority of toxic agents cause a single disease or a series of biologically-related diseases. (Of course, many different toxic agents may be combined in a single product, such as cigarettes.) When biological-mechanism evidence is available, it may permit an inference that a toxic agent caused a related disease. Otherwise, proof that an agent causes one disease is generally not probative of its capacity to cause other unrelated diseases. Thus, while there is substantial scientific evidence that asbestos causes lung cancer and mesothelioma, whether asbestos causes other cancers would require independent proof. Courts refusing to permit use of scientific studies that support general causation for diseases other than the one from which the plaintiff suffers unless there is evidence showing a common biological mechanism include Christophersen v. Allied-Signal Corp., 939 F.2d 1106, 1115-1116 (5th Cir. 1991) (applying Texas law) (epidemiologic connection between heavy-metal agents and lung cancer cannot be used as evidence that same agents caused colon cancer); Cavallo v. Star Enters., 892 F. Supp. 756 (E.D. Va. 1995), aff’d in part and rev’d in part, 100 F.3d 1150 (4th Cir. 1996); Boyles v. Am. Cyanamid Co., 796 F. Supp. 704 (E.D.N.Y. 1992). In Austin v. Kerr-McGee Ref. Corp., 25 S.W.3d 280, 290 (Tex. Ct. App. 2000), the plaintiff sought to rely on studies showing that benzene caused one type of leukemia to prove that benzene caused a different type of leukemia in her decedent. Quite sensibly, the court insisted that before plaintiff could do so, she would have to submit evidence that both types of leukemia had a common biological mechanism of development.”

Restatement (Third) of Torts § 28 cmt. c, at 406 (2010). Notwithstanding some of the Restatement’s excesses on other issues, its guidance on composites seems sane and consonant with the scientific literature.

Role of Mechanism in Justifying Composite End Points

A composite end point may make sense when the individual end points are biologically related, and the investigators can reasonably expect that the individual end points would be affected in the same direction, and approximately to the same extent:16

“Confidence in a composite end point rests partly on a belief that similar reductions in relative risk apply to all the components. Investigators should therefore construct composite endpoints in which the biology would lead us to expect similar effects across components.”

The important point, missed by some investigators and many courts, is that the assumption of similar “effects” must be tested by examining the individual component end points, especially the end point that corresponds to the harm claimed by plaintiffs in a given case.
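A minimal numerical sketch, with hypothetical counts and equal arm sizes, shows how a composite result can be driven entirely by one component, and so say nothing about the others:

```python
# Hypothetical trial: 10,000 patients per arm; component event counts below.
events = {
    # component: (treated events, control events)
    "myocardial infarction": (50, 50),  # component RR = 1.0 (no association)
    "stroke":                (40, 20),  # component RR = 2.0
}

n = 10_000  # per arm, hypothetical
for name, (t, c) in events.items():
    print(f"{name}: RR = {(t / n) / (c / n):.2f}")

t_all = sum(t for t, _ in events.values())
c_all = sum(c for _, c in events.values())
print(f"composite: RR = {(t_all / n) / (c_all / n):.2f}")

# The composite RR comes out near 1.29 even though the MI-specific RR is
# exactly 1.0; citing the composite as evidence of MI risk would be the
# bait-and-switch inference criticized in this post.
```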

Methodological Issues

The acceptability of composite end points is often a delicate balance between the statistical power and efficiency gained and the reliability concerns raised by using the composite. As with any statistical or interpretative tool, the key questions turn on how the tool is used, and for what purpose. The reliability issues raised by the use of composites are likely to be highly contextual.

For instance, there is an important asymmetry between justifying the use of a composite for measuring efficacy and the use of the same composite for safety outcomes. A biological improvement in type 2 diabetes might be expected to lead to a reduction in all the macrovascular complications of that disease, but a medication for type 2 diabetes might have a very specific toxicity or drug interaction, which affects only one constituent end point among all macrovascular complications, such as myocardial infarction. The asymmetry between efficacy and safety outcomes is specifically addressed by cardiovascular epidemiologists in an important methodological paper:17

“Varying definitions of composite end points, such as MACE, can lead to substantially different results and conclusions. Therefore, the term MACE, in particular, should not be used, and when composite study end points are desired, researchers should focus separately on safety and effectiveness outcomes, and construct separate composite end points to match these different clinical goals.”

There are many clear, published statements that caution consumers of medical studies against being misled by claims based upon composite end points. Several years ago, for example, the British Medical Journal published a paper with six methodological suggestions for consumers of studies, one of which deals explicitly with composite end points:18

“Guide to avoid being misled by biased presentation and interpretation of data

1. Read only the Methods and Results sections; bypass the Discussion section

2. Read the abstract reported in evidence based secondary publications

3. Beware faulty comparators

4. Beware composite endpoints

5. Beware small treatment effects

6. Beware subgroup analyses”

The paper elaborates on the problems that arise from the use of composite end points:19

“Problems in the interpretation of these trials arise when composite end points include component outcomes to which patients attribute very different importance… .”

“Problems may also arise when the most important end point occurs infrequently or when the apparent effect on component end points differs.”

“When the more important outcomes occur infrequently, clinicians should focus on individual outcomes rather than on composite end points. Under these circumstances, inferences about the end points (which because they occur infrequently will have very wide confidence intervals) will be weak.”

Authors generally acknowledge that “[w]hen large variations exist between components the composite end point should be abandoned.”20

Methodological Issues Concerning Causal Inferences from Composite End Points to Individual End Points

Several authors have criticized pharmaceutical companies for using composite end points to “game” their trials. Composites allow smaller sample sizes, but they lend themselves to broader claims for outcomes included within the composite. The same criticism applies to attempts to infer risk of an individual end point from a showing of harm on the composite end point.

“If a trial report specifies a composite endpoint, the components of the composite should be in the well-known pathophysiology of the disease. The researchers should interpret the composite endpoint in aggregate rather than as showing efficacy of the individual components. However, the components should be specified as secondary outcomes and reported beside the results of the primary analysis.”21

Virtually the entire field of epidemiology and clinical trial research has urged caution in inferring risk for a component end point from suggested risk in a composite end point:

“In summary, evaluating trials that use composite outcome requires scrutiny in regard to the underlying reasons for combining endpoints and its implications and has impact on medical decision-making (see below in Sect. 47.8). Composite endpoints are credible only when the components are of similar importance and the relative effects of the intervention are similar across components (Guyatt et al. 2008a).”22

Not only do important methodologists urge caution in the interpretation of composite end points,23 they emphasize a basic point of scientific (and legal) relevancy:

“[A] positive result for a composite outcome applies only to the cluster of events included in the composite and not to the individual components.”24

Even regular testifying expert witnesses for the litigation industry insist upon the “principle of full disclosure”:

“The analysis of the effect of therapy on the combined end point should be accompanied by a tabulation of the effect of the therapy for each of the component end points.”25

Gatekeepers in our judicial system need to be more vigilant against bait-and-switch inferences based upon composite end points. The quest for statistical power hardly justifies larding up an end point with irrelevant data points.


1 See, e.g., Milton Packer, “Unbelievable! Electrophysiologists Embrace ‘Alternative Facts’,” MedPage (May 16, 2018) (describing clinical trialists’ abandoning pre-specified intention-to-treat analysis).

2 Curtis Meinert, Clinical Trials Dictionary (Johns Hopkins Center for Clinical Trials 1996).

3 Victor M. Montori, et al., “Validity of composite end points in clinical trials,” 330 Brit. Med. J. 594, 596 (2005).

4 R. Fletcher & S. Fletcher, Clinical Epidemiology: The Essentials at 109 (4th ed. 2005).

5 Neaton, et al., “Key issues in end point selection for heart failure trials: composite end points,” 11 J. Cardiac Failure 567, 569a (2005).

6 Schulz & Grimes, “Multiplicity in randomized trials I: endpoints and treatments,” 365 Lancet 1591, 1593a (2005).

7 Freemantle & Calvert, “Composite and surrogate outcomes in randomized controlled trials,” 334 Brit. Med. J. 756, 756a – b (2007).

8 International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, “ICH harmonized tripartite guideline: statistical principles for clinical trials,” 18 Stat. Med. 1905 (1999).

9 Neaton, et al., “Key issues in end point selection for heart failure trials: composite end points,” 11 J. Cardiac Failure 567, 569b (2005).

10 Montori, et al., “Validity of composite end points in clinical trials,” 330 Brit. Med. J. 594, 596, Summary Point No. 2 (2005).

11 See “Lumpenepidemiology” (Dec. 24, 2012), discussing Frischhertz v. SmithKline Beecham Corp., 2012 U.S. Dist. LEXIS 181507 (E.D. La. 2012). Frischhertz was decided in the same month that a New York City trial judge ruled Dr. Shira Kramer out of bounds in the commission of similarly invalid lumping, in Reeps v. BMW of North America, LLC, 2012 NY Slip Op 33030(U), N.Y.S.Ct., Index No. 100725/08 (New York Cty. Dec. 21, 2012) (York, J.), 2012 WL 6729899, aff’d on rearg., 2013 WL 2362566, aff’d, 115 A.D.3d 432, 981 N.Y.S.2d 514 (2013), aff’d sub nom. Sean R. v. BMW of North America, LLC, ___ N.E.3d ___, 2016 WL 527107 (2016). See also “New York Breathes Life Into Frye Standard – Reeps v. BMW” (Mar. 5, 2013).

12 “Infante-lizing the IARC” (May 13, 2018).

13 Knight v. Kirby Inland Marine, 363 F.Supp. 2d 859, 864 (N.D. Miss. 2005), aff’d, 482 F.3d 347 (5th Cir. 2007) (excluding opinion of B.S. Levy on Hodgkin’s disease based upon studies of other lymphomas and myelomas); Allen v. Pennsylvania Eng’g Corp., 102 F.3d 194, 198 (5th Cir. 1996) (noting that evidence suggesting a causal connection between ethylene oxide and human lymphatic cancers is not probative of a connection with brain cancer); Current v. Atochem North America, Inc., 2001 WL 36101283, at *3 (W.D. Tex. Nov. 30, 2001) (excluding expert witness opinion of Michael Gochfeld, who asserted that arsenic causes rectal cancer on the basis of studies that show association with lung and bladder cancer; Hill’s consistency factor in causal inference does not apply to cancers generally); Exxon Corp. v. Makofski, 116 S.W.3d 176, 184-85 (Tex. App. Houston 2003) (“While lumping distinct diseases together as ‘leukemia’ may yield a statistical increase as to the whole category, it does so only by ignoring proof that some types of disease have a much greater association with benzene than others.”).

14 The Public Affairs Committee of the Teratology Society, “Teratology Society Public Affairs Committee Position Paper: Causation in Teratology-Related Litigation,” 73 Birth Defects Research (Part A) 421, 423 (2005).

15 168 F. Supp. 2d 1271, 1284–87 (D. Utah 2001).

16 Montori, et al., “Validity of composite end points in clinical trials,” 330 Brit. Med. J. 594, 595b (2005).

17 Kevin Kip, et al., “The problem with composite end points in cardiovascular studies,” 51 J. Am. Coll. Cardiol. 701, 701 (2008) (Abstract – Conclusions) (emphasis in original).

18 Montori, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004) (emphasis added).

19 Id. at 1094b, 1095a.

20 Montori, et al., “Validity of composite end points in clinical trials,” 330 Brit. Med. J. 594, 596 (2005).

21 Schulz & Grimes, “Multiplicity in randomized trials I: endpoints and treatments,” 365 Lancet 1591, 1595a (2005) (emphasis added). These authors acknowledge that composite end points often lack clinical relevancy, and that the gain in statistical efficiency comes at the high cost of interpretational difficulties. Id. at 1593.

22 Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 1840 (2d ed. 2014) (47.5.8 Use of Composite Endpoints).

23 See, e.g., Stuart J. Pocock, John J.V. McMurray, and Tim J. Collier, “Statistical Controversies in Reporting of Clinical Trials: Part 2 of a 4-Part Series on Statistics for Clinical Trials,” 66 J. Am. Coll. Cardiol. 2648, 2650-51 (2015) (“Interpret composite endpoints carefully.”) (“COMPOSITE ENDPOINTS. These are commonly used in CV RCTs to combine evidence across 2 or more outcomes into a single primary endpoint. But, there is a danger of oversimplifying the evidence by putting too much emphasis on the composite, without adequate inspection of the contribution from each separate component.”); Eric Lim, Adam Brown, Adel Helmy, Shafi Mussa, and Douglas G. Altman, “Composite Outcomes in Cardiovascular Research: A Survey of Randomized Trials,” 149 Ann. Intern. Med. 612, 612, 615-16 (2008) (“Individual outcomes do not contribute equally to composite measures, so the overall estimate of effect for a composite measure cannot be assumed to apply equally to each of its individual outcomes.”) (“Therefore, readers are cautioned against assuming that the overall estimate of effect for the composite outcome can be interpreted to be the same for each individual outcome.”); Freemantle, et al., “Composite outcomes in randomized trials: Greater precision but with greater uncertainty?” 289 J. Am. Med. Ass’n 2554, 2559a (2003) (“To avoid the burying of important components of composite primary outcomes for which on their own no effect is discerned, . . . the components of a composite outcome should always be declared as secondary outcomes, and the results described alongside the result for the composite outcome.”).

24 Freemantle & Calvert, “Composite and surrogate outcomes in randomized controlled trials,” 334 Brit. Med. J. 756, 757a (2007).

25 Lem Moyé, “Statistical Methods for Cardiovascular Researchers,” 118 Circulation Research 439, 451 (2016).

Carl Cranor’s Conflicted Jeremiad Against Daubert

September 23rd, 2018


It seems that authors who have the most intense and refractory conflicts of interest (COI) often fail to see their own conflicts, and are the most vociferous critics of others for failing to identify COIs. Consider the spectacle of having anti-tobacco activists and tobacco plaintiffs’ expert witnesses assert that the American Law Institute had an ethical problem because the Institute’s members included some tobacco defense lawyers.1 Somehow these authors overlooked their own positional and financial conflicts, as well as the obvious fact that the Institute’s members included some tobacco plaintiffs’ lawyers as well. Still, the complaint was instructive because it typifies the asymmetrical application of ethical standards, as well as ethical blind spots.2

Recently, Raymond Richard Neutra, Carl F. Cranor, and David Gee published a paper on the litigation use of Sir Austin Bradford Hill’s considerations for evaluating whether an association is causal.3 See Raymond Richard Neutra, Carl F. Cranor, and David Gee, “The Use and Misuse of Bradford Hill in U.S. Tort Law,” 58 Jurimetrics 127 (2018) [cited here as Cranor]. Their paper provides a startling example of hypocritical and asymmetrical assertions of conflicts of interest.

Neutra is a self-styled public health advocate4 and the Chief of the Division of Environmental and Occupational Disease Control (DEODC) of the California Department of Health Services (CDHS). David Gee, not to be confused with the English artist or the Australian coin forger, is with the European Environment Agency, in Copenhagen, Denmark. He is perhaps best known for his precautionary principle advocacy and his work with trade unions.5

Carl Cranor is with the Center for Progressive Reform, and he teaches philosophy at one of the University of California campuses. Although he is neither a lawyer nor a scientist, he participates with some frequency as a consultant, and as an expert witness, in lawsuits on behalf of claimants. Perhaps Cranor’s most notorious appearance as an expert witness resulted in the decision of Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11 (1st Cir. 2011), cert. denied sub nom. U.S. Steel Corp. v. Milward, 132 S. Ct. 1002 (2012). Probably less well known is that Cranor was one of the founders of the Council for Education and Research on Toxics (CERT), which recently was the complaining party in a California case in which CERT sought money damages for Starbucks’ failure to label each cup of coffee sold as known to the State of California to cause cancer.6 Having a so-called not-for-profit corporation can be pretty handy, especially when it holds itself out as a scientific organization and files amicus briefs in support of reversing Daubert exclusions of the corporation’s founding members, as CERT did on behalf of its founding member in the Milward case.7 The conflict of interest in such an amicus brief, however, is no longer potential or subtle; it violates the duty of candor to the court.

In this recent article on Hill’s considerations for judging causality, Cranor followed CERT’s lead from Milward. Cranor failed to disclose that he has been a party expert witness for plaintiffs, in cases in which he was advocating many of the same positions put forward in the Jurimetrics article, including the Milward case, in which he was excluded from testifying by the trial court. Cranor’s lack of candor with the readers of the Jurimetrics article is all the more remarkable in that Cranor and his co-authors give conflicts of interest outsize importance in substantive interpretations of scholarship:

the desired reliability for evidence evaluation requires that biases that derive from the financial interests and ideological commitments of the investigators and editors that control the gateways to publication be considered in a way that Hill did not address.”

Cranor at 137 & n.59. Well, we could add that Cranor’s financial interests and ideological commitments might well be considered in evaluating the reliability of the opinions and positions advanced in this most recent work by Cranor and colleagues. If you believe that COIs disqualify a speaker from addressing important issues, then you have all the reason you need to avoid reading Cranor’s recent article.

Dubious Scholarship

The more serious problem with Cranor’s article is not his ethically strained pronouncements about financial interests, but the dubious scholarship he and his colleagues advance to thwart judicial gatekeeping of even more dubious expert witness opinion testimony. To begin with, the authors disparage the training and abilities of federal judges to assess the epistemic warrant and reliability of proffered causation opinions:

With their enhanced duties to review scientific and technical testimony federal judges, typically not well prepared by legal education for these tasks, have struggled to assess the scientific support for—and the reliability and relevance of—expert testimony.”

Cranor at 147. Their assessment is fair, but it hides the authors’ cynical agenda: to remove gatekeeping and leave the assessment to lay juries, who are even less well prepared for the task, and whose verdicts come with no institutional accountability, review, or public evaluation.

Similarly, the authors note the temporal context and limitations of Bradford Hill’s 1965 paper, advice now more than 50 years old, offered in a discipline that has since changed dramatically with advances in biological, epidemiologic, and genetic science.8 Even at the time of its publication in 1965, Bradford Hill’s paper, which grew out of an informal lecture, was not designed or intended to be a definitive treatment of causal inference. Cranor and his colleagues make no effort to review Bradford Hill’s many other publications, both before and after his 1965 dinner speech, for evidence of his views on the factors for causal inference, including the role of statistical testing and inference.

Nonetheless, Bradford Hill’s 1965 paper has become a landmark, even if dated, because of its author’s iconic status in the world of public health, earned for his showing that tobacco smoking causes lung cancer,9 and for advancing the role of double-blind randomized clinical trials.10 Cranor and his colleagues made no serious effort to engage with the large body of Bradford Hill’s writings, including his immensely important textbook, The Principles of Medical Statistics, which started as a series of articles in The Lancet, and went through 12 editions in print.11 Hill’s reputation will no doubt survive Cranor’s bowdlerized version of Sir Austin’s views.

Epidemiology is Dispensable When It Fails to Support Causal Claims

The egregious aspect of Cranor’s article is its bill of particulars against the federal judiciary for allegedly errant gatekeeping, which for these authors really translates into any gatekeeping at all. Cranor at 144-45. Indeed, the authors provide not a single example of a “proper” exclusion of an expert witness who was contending for some doubtful causal claim. Perhaps they have never seen a proper exclusion, but doesn’t that speak volumes about their agenda and their biases?

High on the authors’ list of claimed gatekeeping errors is the requirement that a causal claim be supported with epidemiologic evidence. Although some causal claims may be supported by strong mechanistic evidence of a biological process, such claims are uncommon in United States tort litigation.

In support of the claim that epidemiology is dispensable, Cranor suggests that:

Some courts have recognized this, and distinguished scientific committees often do not require epidemiological studies to infer harm to humans. For example, the International Agency for Research on Cancer (IRAC) [sic], the National Toxicology Program, and California’s Proposition 65 Scientific Advisory Panel, among others, do not require epidemiological data to support findings that a substance is a probable or—in some cases—a known human carcinogen, but it is welcomed if available.”

Cranor at 149. California’s Proposition 65!??? Even IARC is hard to take seriously these days, given its capture by consultants for the litigation industry, but if we were to accept IARC as an honest broker of causal inferences, what substance “known” to IARC to cause cancer in humans (Group 1) was branded a “known carcinogen” without the support of epidemiologic studies? Inquiring minds might want to know, but they will not learn the answer from Cranor and his co-authors.

When it comes to adverting to legal decisions that supposedly support the authors’ claim that epidemiology is unnecessary, their scholarship is equally wanting. The paper cites the notorious Wells case, which was so roundly condemned in scientific circles that it probably helped ensure that a decision such as Daubert would ultimately be handed down by the Supreme Court. The authors seem unable to read, understand, and interpret even the most straightforward legal decisions. Here is how they cite Wells as support for their views:

Wells v. Ortho Pharm. Corp., 788 F.2d 741, 745 (11th Cir. 1986) (reviewing a district court’s decision deciding not to require the use of epidemiological evidence and instead allowing expert testimony).”

Cranor at 149-50 n.122. The trial judge in Wells never made any such decision; indeed, the case was tried to the bench, before the Supreme Court decided Daubert. There was no gatekeeping involved at all. More important, and contrary to Cranor’s explanatory parenthetical, both sides presented epidemiologic evidence in support of their positions.12

Cranor and his co-authors similarly misread and misrepresent the trial court’s decision in the litigation over maternal sertraline use and infant birth defects. Twice they cite the Multi-District Litigation trial court’s decision that excluded plaintiffs’ expert witnesses:

In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., 26 F. Supp. 3d 449, 455 (E.D. Pa. 2014) (expert may not rely on nonstatistically significant studies to which to apply the [Bradford Hill] factors).”

Cranor at 144 n.85; 158 n.179. The MDL judge, Judge Rufe, decidedly never held that an expert witness may not rely upon a statistically non-significant study in a “Bradford Hill” analysis, and the Third Circuit, which affirmed the exclusion of the plaintiffs’ expert witnesses’ testimony, was equally careful to avoid any such pronouncement.13

Who Needs Statistical Significance

Part of Cranor’s post-science agenda is to intimidate judges into believing that statistical significance is an unnecessary and wrong-headed criterion for judging the validity of relied-upon research. In their article, Cranor and friends suggest that Hill agreed with their radical approach, but nothing could be further from the truth. Although these authors parse almost every word of Hill’s 1965 article, they conveniently omit Hill’s views about the necessary predicate for applying his nine considerations for causal inference:

Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965). Cranor’s radicalism leaves no room for assessing whether a putative association is “beyond what we would care to attribute to the play of chance,” and his poor scholarship ignores Hill’s insistence that this statistical analysis be carried out.14
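What Hill presupposed is easy to illustrate. The sketch below, in Python with invented cohort counts, runs one conventional screen for the play of chance, a Pearson chi-square test on a two-by-two table; the counts and the choice of test are mine, for illustration, not Hill’s.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]] (exposed and
    unexposed, by case and non-case), with the 1-df p-value computed from
    the normal tail, since a chi-square(1) variate is a squared z-score."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Invented cohort: 20 cases among 1,000 exposed; 10 among 1,000 unexposed.
chi2, p = chi_square_2x2(20, 980, 10, 990)
print(f"risk ratio = {(20 / 1000) / (10 / 1000):.1f}")   # 2.0
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")           # about 3.38, 0.066
```

A doubled risk, and yet a p-value of roughly 0.07: on these numbers the association is not “perfectly clear-cut and beyond what we would care to attribute to the play of chance,” and Hill’s nine considerations would not yet come into play.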

Hill’s work certainly acknowledged the limitations of statistical method, which could not compensate for poorly designed research:

It is a serious mistake to rely upon the statistical method to eliminate disturbing factors at the completion of the work. No statistical method can compensate for a badly planned experiment.”

Austin Bradford Hill, Principles of Medical Statistics at 4 (4th ed. 1948). Hill was equally clear, however, that the limits on statistical methods did not imply that statistical methods are not needed to interpret a properly planned experiment or study. In the summary section of his textbook’s first chapter, Hill removed any doubt about his view of the importance, and the necessity, of statistical methods:

The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.”

Id. at 10 (emphasis added).

In his efforts to eliminate judicial gatekeeping of expert witness testimony, Cranor has struggled to understand statistical inference and testing.15 In an early writing, a 1993 book, Cranor suggested that we can think of Type I and Type II error rates as “standards of proof,” which begs the question whether those error rates measure significance probabilities or posterior probabilities.16 Indeed, Cranor went further in confusing significance and posterior probabilities when he described the customary level of alpha (5%) as the “95% rule,” and claimed that regulatory agencies require something akin to proof “beyond a reasonable doubt” when they require two “statistically significant” studies.17

Cranor has persisted in this fallacious analysis in his later writings. In a 2006 book, he erroneously equated the 95% coefficient of statistical confidence with 95% certainty of knowledge.18 Later in the same text, Cranor again asserted that agency regulations issue only upon proof “beyond a reasonable doubt.”19 Given that Cranor has consistently confused significance and posterior probabilities, he really should not be giving advice to anyone about statistical or scientific inference. Cranor’s persistent misunderstandings of basic statistical concepts do, however, explain his motivation for advocating the elimination of statistical significance testing, even if those misunderstandings make his enterprise intellectually unacceptable.
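The confusion can be dispelled with a few lines of arithmetic. The sketch below, in Python with purely illustrative inputs, applies Bayes’ rule to show why rejecting a null hypothesis at alpha = 0.05 is not the same as being 95% certain that the effect is real; the posterior probability depends upon the prior plausibility of the effect and the study’s power, neither of which the “95% rule” gloss acknowledges.

```python
# Minimal sketch, illustrative inputs only: the probability that a rejected
# null is truly false is not 95% simply because alpha = 0.05.

def posterior_given_rejection(prior, power, alpha=0.05):
    """P(effect is real | null rejected), by Bayes' rule."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    post = posterior_given_rejection(prior, power=0.80)
    print(f"prior = {prior}: P(real effect | rejection) = {post:.2f}")
# prior = 0.5 -> 0.94; prior = 0.1 -> 0.64; prior = 0.01 -> 0.14
```

Only when a real effect is as likely as not at the outset does a “statistically significant” result approach the 95% certainty Cranor imagines; for a long-shot hypothesis, the posterior probability can fall well under one half.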

Cranor and company fall into a similar muddle when they offer advice on post-hoc power calculations, advice that ignores the standard statistical learning on interpreting completed studies.20 Another measure of the authors’ failed scholarship is their omission of any discussion of recent efforts by many in the scientific community to lower the threshold for statistical significance, based upon the belief that the customary 5% significance level is an order of magnitude too high.21
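The standard learning the authors ignore is that “observed” or post-hoc power, computed at the effect size seen in the completed study, is a deterministic function of the observed p-value and so adds no information to it, a point long associated with Hoenig and Heisey in the statistical literature. A minimal sketch, assuming a simple two-sided z-test, makes the relationship explicit:

```python
from statistics import NormalDist

N = NormalDist()

def observed_power(p_two_sided, alpha=0.05):
    """Post-hoc power at the observed effect size for a two-sided z-test.
    It depends only on the observed p-value and alpha (upper-tail
    approximation; the other tail contributes negligibly)."""
    z_obs = N.inv_cdf(1 - p_two_sided / 2)    # |z| implied by the p-value
    z_crit = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z_crit - z_obs)

for p in (0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
# p = 0.05 -> 0.50; p = 0.20 -> 0.25; p = 0.50 -> 0.10
```

A study that just misses significance always has “observed power” near 50%, whatever its design; computing the number tells a court nothing that the p-value has not already said.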

 

Relative Risks Greater Than Two

There are other tendentious arguments and treatments in Cranor’s brief against gatekeeping, but I will stop with one last example. The inference of specific causation from study risk ratios has provoked a torrent of verbiage from Sander Greenland (who is cited copiously by Cranor). Cranor, however, does not even scratch the surface of the issue and fails to cite the work of epidemiologists, such as Duncan C. Thomas, who have defended the use of probabilities of (specific) causation. More important, however, Cranor fails to speak out against the abuse of using any relative risk greater than 1.0 to support an inference of specific causation, when the nature of the causal relationship is neither necessary nor sufficient. In this context, Kenneth Rothman has reminded us that someone can be exposed to, or have, a risk, and then develop the related outcome, without there being any specific causation:

An elementary but essential principle to keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between the exposure and the disease. For this reason, we cannot consider the incidence proportion or the incidence rate among exposed people to measure a causal effect.”

Kenneth J. Rothman, Epidemiology: An Introduction at 57 (2d ed. 2012).
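For readers who want the arithmetic behind the relative-risk debate, here is a minimal sketch of the usual “probability of causation” calculation that underlies the risk-ratio-greater-than-two heuristic. The formula holds only under strong assumptions (an unbiased, unconfounded risk ratio, and an exposure that adds new cases rather than hastening cases that would have occurred anyway), assumptions that Greenland and others have criticized.

```python
# Attributable fraction among the exposed, read as a "probability of
# causation" only under the strong assumptions stated above.

def probability_of_causation(rr):
    """PC = (RR - 1) / RR for RR > 1; zero otherwise."""
    return (rr - 1) / rr if rr > 1 else 0.0

for rr in (1.2, 1.5, 2.0, 3.0):
    print(f"RR = {rr:.1f} -> PC = {probability_of_causation(rr):.2f}")
# RR = 1.2 -> 0.17; RR = 1.5 -> 0.33; RR = 2.0 -> 0.50; RR = 3.0 -> 0.67
```

On this logic, a risk ratio must exceed 2.0 before a given exposed case is more likely than not attributable to the exposure; Rothman’s reminder explains why a risk ratio barely above 1.0 cannot do that work.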

The danger in Cranor’s Jurimetrics article is that some readers will not recognize the extreme partisanship in its ipse dixit and erroneous pronouncements. Caveat lector.


1 Elizabeth Laposata, Richard Barnes & Stanton Glantz, “Tobacco Industry Influence on the American Law Institute’s Restatements of Torts and Implications for Its Conflict of Interest Policies,” 98 Iowa L. Rev. 1 (2012).

2 The American Law Institute responded briefly. See Roberta Cooper Ramo & Lance Liebman, “The ALI’s Response to the Center for Tobacco Control Research & Education,” 98 Iowa L. Rev. Bull. 1 (2013), and the original authors’ self-serving last word. Elizabeth Laposata, Richard Barnes & Stanton Glantz, “The ALI Needs to Implement Modern Conflict of Interest Policies,” 98 Iowa L. Rev. Bull. 17 (2013).

3 Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295 (1965).

4 Raymond Richard Neutra, “Epidemiology Differs from Public Health Practice,” 7 Epidemiology 559 (1996).

7 “From Here to CERT-ainty” (June 28, 2018).

8 Kristen Fedak, Autumn Bernal, Zachary Capshaw, and Sherilyn A. Gross, “Applying the Bradford Hill Criteria in the 21st Century: How Data Integration Has Changed Causal Inference in Molecular Epidemiology,” 12 Emerging Themes in Epidemiol. 14 (2015); John P. A. Ioannidis, “Exposure-Wide Epidemiology: Revisiting Bradford Hill,” 35 Stat. Med. 1749 (2016).

9 Richard Doll & Austin Bradford Hill, “Smoking and Carcinoma of the Lung,” 2(4682) Brit. Med. J. 739 (1950).

10 Geoffrey Marshall (chairman), “Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation,” 2 Brit. Med. J. 769, 769–71 (1948).

11 Vern Farewell & Anthony Johnson, “The origins of Austin Bradford Hill’s classic textbook of medical statistics,” 105 J. Royal Soc’y Med. 483 (2012). See also Hilary E. Tillett, “Bradford Hill’s Principles of Medical Statistics,” 108 Epidemiol. Infect. 559 (1992).

13 In re Zoloft Prod. Liab. Litig., No. 16-2247, __ F.3d __, 2017 WL 2385279, 2017 U.S. App. LEXIS 9832 (3d Cir. June 2, 2017) (affirming exclusion of biostatistician Nicholas Jewell’s dodgy opinions, which involved multiple methodological flaws and failures to follow any methodology faithfully).

14 See “Bradford Hill on Statistical Methods” (Sept. 24, 2013).

16 Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (1993) (arguing incorrectly that one can think of α and β (the chances of Type I and Type II errors, respectively) and 1 − β as measures of the “risk of error” or “standards of proof”); see also id. at 44, 47, 55, 72-76. At least one astute reviewer called Cranor on his statistical solecisms. Michael D. Green, “Science Is to Law as the Burden of Proof is to Significance Testing: Book Review of Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law,” 37 Jurimetrics J. 205 (1997) (taking Cranor to task for confusing significance and posterior (burden of proof) probabilities).

17 Id. (squaring 0.05 to arrive at “the chances of two such rare events occurring” as 0.0025, which impermissibly assumes independence between the two studies).

18 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 100 (2006) (incorrectly asserting that “[t]he practice of setting α = .05 I call the ‘95% rule,’ for researchers want to be 95% certain that when knowledge is gained [a study shows new results] and the null hypothesis is rejected, it is correctly rejected.”).

19 Id. at 266.

21 See, e.g., John P. A. Ioannidis, “The Proposal to Lower P Value Thresholds to .005,” 319 J. Am. Med. Ass’n 1429 (2018); Daniel J. Benjamin, James O. Berger, Valen E. Johnson, et al., “Redefine statistical significance,” 2 Nature Human Behaviour 6 (2018).