Schachtman Law » statistical evidence

TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

The Hazard of Composite End Points – More Lumpenepidemiology in the Courts

October 20th, 2018

One of the challenges of epidemiologic research is selecting the right outcome of interest to study. What seems like a simple and obvious choice can often be the most complicated aspect of the design of clinical trials or studies.¹ Lurking in this choice of end point is a particular threat to validity in the use of composite end points, when the real outcome of interest is one constituent among multiple end points aggregated into the composite. There may, for instance, be strong evidence in favor of one of the constituents of the composite, but using the composite end point results to support a causal claim for a different constituent begs the question that needs to be answered, whether in science or in law.

The dangers of extrapolating from one disease outcome to another is well-recognized in the medical literature. Remarkably, however, the problem received no meaningful discussion in the Reference Manual on Scientific Evidence (3d ed. 2011). The handbook designed to help judges decide threshold issues of admissibility of expert witness opinion testimony discusses the extrapolation from sample to population, from in vitro to in vivo, from one species to another, from high to low dose, and from long to short duration of exposure. The Manual, however, has no discussion of “lumping,” or on the appropriate (and inappropriate) use of composite or combined end points.

Composite End Points

Composite end points are typically defined, perhaps circularly, as a single group of health outcomes, which group is made up of constituent or single end points. Curtis Meinert defined a composite outcome as “an event that is considered to have occurred if any of several different events or outcomes is observed.”² Similarly, Montori defined composite end points as “outcomes that capture the number of patients experiencing one or more of several adverse events.”³ Composite end points are also sometimes referred to as combined or aggregate end points.

Many composite end points are clearly defined for a clinical trial, and the component end points are specified. In some instances, the composite nature of an outcome may be subtle or be glossed over by the study’s authors. In the realm of cardiovascular studies, for example, investigators may look at stroke as a single endpoint, without acknowledging that there are important clinical and pathophysiological differences between ischemic strokes and hemorrhagic strokes (intracerebral or subarachnoid). The Fletchers’ textbook⁴ on clinical epidemiology gives the example:

“In a study of cardiovascular disease, for example, the primary outcomes might be the occurrence of either fatal coronary heart disease or non-fatal myocardial infarction. Composite outcomes are often used when the individual elements share a common cause and treatment. Because they comprise more outcome events than the component outcomes alone, they are more likely to show a statistical effect.”

Utility of Composite End Points

The quest for statistical “power” is often cited as a basis for using composite end points. Reduction in the number of “events,” such as myocardial infarction (MI), through improvements in medical care has led to decreased rates of MI in studies and clinical trials. These low event rates have caused power issues for clinical trialists, who have responded by turning to composite end points to capture more events. Composite end points permit smaller sample sizes and shorter follow-up times, without sacrificing power, the ability to detect a statistically significant increased rate of a prespecified size and Type I error. Increasing study power, while reducing sample size or observation time, is perhaps the most frequently cited rationale for using composite end points.

Competing Risks

Another reason sometimes offered in support of using composite end points is composites provide a strategy to avoid the problem of competing risks.⁵ Death (any cause) is sometimes added to a distinct clinical morbidity because patients who are taken out of the trial by death are “unavailable” to experience the morbidity outcome.

Multiple Testing

By aggregating several individual end points into a single pre-specified outcome, trialists can avoid corrections for multiple testing. Trials that seek data on multiple outcomes, or on multiple subgroups, inevitably raise concerns about the appropriate choice of the measure for the statistical test (alpha) to determine whether to reject the null hypothesis. According to some authors, “[c]omposite endpoints alleviate multiplicity concerns”:

“If designated a priori as the primary outcome, the composite obviates the multiple comparisons associated with testing of the separate components. Moreover, composite outcomes usually lead to high event rates thereby increasing power or reducing sample size requirements. Not surprisingly, investigators frequently use composite endpoints.”⁶

Other authors have similarly acknowledged that the need to avoid false positive results from multiple testing is an important rationale for composite end points:

“Because the likelihood of observing a statistically significant result by chance alone increases with the number of tests, it is important to restrict the number of tests undertaken and limit the type 1 error to preserve the overall error rate for the trial.”⁷

Indecision about an Appropriate Single Outcome

The International Conference on Harmonization suggests that the inability to select a single outcome variable may lead to the adoption of a composite outcome:

“If a single primary variable cannot be selected …, another useful strategy is to integrate or combine the multiple measurements into a single or composite variable.”⁸

The “indecision” rationale has also been criticized as “generally not a good reason to use a composite end point.”⁹

Validity of Composite End Points

The validity of composite end points depends upon methodological assumptions, which will have to be made at the time of the study design and protocol creation. After the data are collected and analyzed, the assumptions may or may not be supported. Among the supporting assumptions about the validity of using composites are:¹⁰

similarity in patient importance for included component end points,
similarity of association size of the components, and
number of events across the components.

The use of composite end points can sometimes be appropriate in the “first look” at a class of diseases or disorders, with the understanding that further research will sort out and refine the associated end point. Research into the causes of human birth defects, for instance, often starts out with a look at “all major malformations,” before focusing in on specific organ and tissue systems. To some extent, the legal system, in its gatekeeping function, has recognized the dangers and invalidity of lumping in the epidemiology of birth defects.¹¹ The Frischhertz decision, for instance, clearly acknowledged that given the clear evidence that different birth defects arise at different times, based upon interference with different embryological processes, “lumping” of end points was methodologically inappropriate. 2012 U.S. Dist. LEXIS 181507, at *8 (citing Chamber v. Exxon Corp., 81 F. Supp. 2d 661 (M.D. La. 2000), aff’d, 247 F.3d 240 (5th Cir. 2001) (unpublished)).

The Chamber decision involved a challenge to the causation opinion of frequent litigation industry witness, Peter Infante,¹² who attempted to defend his opinion about benzene and chronic myelogenous leukemia, based upon epidemiology of benzene and acute myelogenous leukemia. Plaintiffs’ witnesses and counsel sought to evade the burden of producing evidence of an AML association by pointing to a study that reported “excess leukemias,” without specifying the relevant type. Chamber, 81 F. Supp. 2d at 664. The trial court, however, perspicaciously recognized the claimants’ failure to identify relevant evidence of the specific association needed to support the causal claim.

The Frischhertz and Chamber cases are hardly unique. Several state and federal courts have concurred in the context of cancer causation claims.¹³ In the context of birth defects litigation, the Public Affairs Committee of the Teratology Society has weighed in with strong guidance that counsels against extrapolation between different birth defects in litigation:

“Determination of a causal relationship between a chemical and an outcome is specific to the outcome at issue. If an expert witness believes that a chemical causes malformation A, this belief is not evidence that the chemical causes malformation B, unless malformation B can be shown to result from malformation A. In the same sense, causation of one kind of reproductive adverse effect, such as infertility or miscarriage, is not proof of causation of a different kind of adverse effect, such as malformation.”¹⁴

The threat to validity in attributing a suggested risk for a composite end point to all included component end points is not, unfortunately, recognized by all courts. The trial court, in Ruff v. Ensign-Bickford Industries, Inc.,¹⁵ permitted plaintiffs’ expert witness to reanalyze a study by grouping together two previously distinct cancer outcomes to generate a statistically significant result. The result in Ruff is disappointing, but not uncommon. The result is also surprising, considering the guidance provided by the American Law Institute’s Restatement:

“Even when satisfactory evidence of general causation exists, such evidence generally supports proof of causation only for a specific disease. The vast majority of toxic agents cause a single disease or a series of biologically-related diseases. (Of course, many different toxic agents may be combined in a single product, such as cigarettes.) When biological-mechanism evidence is available, it may permit an inference that a toxic agent caused a related disease. Otherwise, proof that an agent causes one disease is generally not probative of its capacity to cause other unrelated diseases. Thus, while there is substantial scientific evidence that asbestos causes lung cancer and mesothelioma, whether asbestos causes other cancers would require independent proof. Courts refusing to permit use of scientific studies that support general causation for diseases other than the one from which the plaintiff suffers unless there is evidence showing a common biological mechanism include Christophersen v. Allied-Signal Corp., 939 F.2d 1106, 1115-1116 (5th Cir. 1991) (applying Texas law) (epidemiologic connection between heavy-metal agents and lung cancer cannot be used as evidence that same agents caused colon cancer); Cavallo v. Star Enters., 892 F. Supp. 756 (E.D. Va. 1995), aff’d in part and rev’d in part, 100 F.3d 1150 (4th Cir. 1996); Boyles v. Am. Cyanamid Co., 796 F. Supp. 704 (E.D.N.Y. 1992). In Austin v. Kerr-McGee Ref. Corp., 25 S.W.3d 280, 290 (Tex. Ct. App. 2000), the plaintiff sought to rely on studies showing that benzene caused one type of leukemia to prove that benzene caused a different type of leukemia in her decedent. Quite sensibly, the court insisted that before plaintiff could do so, she would have to submit evidence that both types of leukemia had a common biological mechanism of development.”

Restatement (Third) of Torts § 28 cmt. c, at 406 (2010). Notwithstanding some of the Restatement’s excesses on other issues, the guidance on composites, seems sane and consonant with the scientific literature.

Role of Mechanism in Justifying Composite End Points

A composite end point may make sense when the individual end points are biologically related, and the investigators can reasonably expect that the individual end points would be affected in the same direction, and approximately to the same extent:¹⁶

“Confidence in a composite end point rests partly on a belief that similar reductions in relative risk apply to all the components. Investigators should therefore construct composite endpoints in which the biology would lead us to expect similar effects across components.”

The important point, missed by some investigators and many courts, is that the assumption of similar “effects” must be tested by examining the individual component end points, and especially the end point that is the harm claimed by plaintiffs in a given case.

Methodological Issues

The acceptability of composite end points is often a delicate balance between the statistical power and efficiency gained and the reliability concerns raised by using the composite. As with any statistical or interpretative tool, the key questions turn on how the tool is used, and for what purpose. The reliability issues raised by the use of composites are likely to be highly contextual.

For instance, there is an important asymmetry between justifying the use of a composite for measuring efficacy and the use of the same composite for safety outcomes. A biological improvement in type 2 diabetes might be expected to lead to a reduction in all the macrovascular complications of that disease, but a medication for type 2 diabetes might have a very specific toxicity or drug interaction, which affects only one constituent end point among all macrovascular complications, such as myocardial infarction. The asymmetry between efficacy and safety outcomes is specifically addressed by cardiovascular epidemiologists in an important methodological paper:¹⁷

“Varying definitions of composite end points, such as MACE, can lead to substantially different results and conclusions. There, the term MACE, in particular, should not be used, and when composite study end points are desired, researchers should focus separately on safety and effectiveness outcomes, and construct separate composite end points to match these different clinical goals.”

There are many clear, published statements that caution consumers of medical studies against being misled by claims based upon composite end points. Several years ago, for example, the British Medical Journal published a paper with six methodological suggestions for consumers of studies, one of which deals explicitly with composite end points:¹⁸

“Guide to avoid being misled by biased presentation and interpretation of data

1. Read only the Methods and Results sections; bypass the Discuss section

2. Read the abstract reported in evidence based secondary publications

3. Beware faulty comparators

4. Beware composite endpoints

5. Beware small treatment effects

6. Beware subgroup analyses”

The paper elaborates on the problems that arise from the use of composite end points:¹⁹

“Problems in the interpretation of these trials arise when composite end points include component outcomes to which patients attribute very different importance… .”

“Problems may also arise when the most important end point occurs infrequently or when the apparent effect on component end points differs.”

“When the more important outcomes occur infrequently, clinicians should focus on individual outcomes rather than on composite end points. Under these circumstances, inferences about the end points (which because they occur infrequently will have very wide confidence intervals) will be weak.”

Authors generally acknowledge that “[w]hen large variations exist between components the composite end point should be abandoned.”²⁰

Methodological Issues Concerning Causal Inferences from Composite End Points to Individual End Points

Several authors have criticized pharmaceutical companies for using composite end points to “game” their trials. Composites allow smaller sample size, but they lend themselves to broader claims for outcomes included within the composite. The same criticism applies to attempts to infer that there is risk of an individual endpoint based upon a showing of harm in the composite endpoint.

“If a trial report specifies a composite endpoint, the components of the composite should be in the well-known pathophysiology of the disease. The researchers should interpret the composite endpoint in aggregate rather than as showing efficacy of the individual components. However, the components should be specified as secondary outcomes and reported beside the results of the primary analysis.”²¹

Virtually the entire field of epidemiology and clinical trial study has urged caution in inferring risk for a component end point from suggested risk in a composite end point:

“In summary, evaluating trials that use composite outcome requires scrutiny in regard to the underlying reasons for combining endpoints and its implications and has impact on medical decision-making (see below in Sect. 47.8). Composite endpoints are credible only when the components are of similar importance and the relative effects of the intervention are similar across components (Guyatt et al. 2008a).”²²

Not only do important methodologists urge caution in the interpretation of composite end points,²³ they emphasize a basic point of scientific (and legal) relevancy:

“[A] positive result for a composite outcome applies only to the cluster of events included in the composite and not to the individual components.”²⁴

Even regular testifying expert witnesses for the litigation industry insist upon the “principle of full disclosure”:

“The analysis of the effect of therapy on the combined end point should be accompanied by a tabulation of the effect of the therapy for each of the component end points.”²⁵

Gatekeepers in our judicial system need to be more vigilant against bait-and-switch inferences based upon composite end points. The quest for statistical power hardly justifies larding up an end point with irrelevant data points.

1 See, e.g., Milton Packer, “Unbelievable! Electrophysiologists Embrace ‘Alternative Facts’,” MedPage (May 16, 2018) (describing clinical trialists’ abandoning pre-specified intention-to-treat analysis).

2 Curtis Meinert, Clinical Trials Dictionary (Johns Hopkins Center for Clinical Trials 1996).

3 Victor M. Montori, et al., “Validity of composite end points in clinical trials.” 300 Brit. Med. J. 594, 596 (2005).

4 R. Fletcher & S. Fletcher, Clinical Epidemiology: The Essentials at 109 (4^th ed. 2005).

5 Neaton, et al., “Key issues in end point selection for heart failure trials: composite end points,” 11 J. Cardiac Failure 567, 569a (2005).

6 Schulz & Grimes, “Multiplicity in randomized trials I: endpoints and treatments,” 365 Lancet 1591, 1593a (2005).

7 Freemantle & Calvert, “Composite and surrogate outcomes in randomized controlled trials,” 334 Brit. Med. J. 756, 756a – b (2007).

8 International Conference on Harmonisation of Technical Requrements for Registration of Pharmaceuticals for Human Use; “ICH harmonized tripartite guideline: statistical principles for clinical trials,” 18 Stat. Med. 1905 (1999).

9 Neaton, et al., “Key issues in end point selection for heart failure trials: composite end points,” 11 J. Cardiac Failure 567, 569b (2005).

10 Montori, et al., “Validity of composite end points in clinical trials.” 300 Brit. Med. J. 594, 596, Summary Point No. 2 (2005).

11 See “Lumpenepidemiology” (Dec. 24, 2012), discussing Frischhertz v. SmithKline Beecham Corp., 2012 U.S. Dist. LEXIS 181507 (E.D. La. 2012).Frischhertz was decided in the same month that a New York City trial judge ruled Dr. Shira Kramer out of bounds in the commission of similarly invalid lumping, in Reeps v. BMW of North America, LLC, 2012 NY Slip Op 33030(U), N.Y.S.Ct., Index No. 100725/08 (New York Cty. Dec. 21, 2012) (York, J.), 2012 WL 6729899, aff’d on rearg., 2013 WL 2362566, aff’d, 115 A.D.3d 432, 981 N.Y.S.2d 514 (2013), aff’d sub nom. Sean R. v. BMW of North America, LLC, ___ N.E.3d ___, 2016 WL 527107 (2016). See also “New York Breathes Life Into Frye Standard – Reeps v. BMW” (Mar. 5, 2013).

12 “Infante-lizing the IARC” (May 13, 2018).

13 Knight v. Kirby Inland Marine, 363 F.Supp. 2d 859, 864 (N.D. Miss. 2005), aff’d, 482 F.3d 347 (5th Cir. 2007) (excluding opinion of B.S. Levy on Hodgkin’s disease based upon studies of other lymphomas and myelomas); Allen v. Pennsylvania Eng’g Corp., 102 F.3d 194, 198 (5th Cir. 1996) (noting that evidence suggesting a causal connection between ethylene oxide and human lymphatic cancers is not probative of a connection with brain cancer);Current v. Atochem North America, Inc., 2001 WL 36101283, at *3 (W.D. Tex. Nov. 30, 2001) (excluding expert witness opinion of Michael Gochfeld, who asserted that arsenic causes rectal cancer on the basis of studies that show association with lung and bladder cancer; Hill’s consistency factor in causal inference does not apply to cancers generally); Exxon Corp. v. Makofski, 116 S.W.3d 176, 184-85 (Tex. App. Houston 2003) (“While lumping distinct diseases together as ‘leukemia’ may yield a statistical increase as to the whole category, it does so only by ignoring proof that some types of disease have a much greater association with benzene than others.”).

14The Public Affairs Committee of the Teratology Society, “Teratology Society Public Affairs Committee Position Paper Causation in Teratology-Related Litigation,” 73 Birth Defects Research (Part A) 421, 423 (2005).

15 168 F. Supp. 2d 1271, 1284–87 (D. Utah 2001).

16 Montori, et al., “Validity of composite end points in clinical trials.” 300 Brit. Med. J. 594, 595b (2005).

17 Kevin Kip, et al., “The problem with composite end points in cardiovascular studies,” 51 J. Am. Coll. Cardiol. 701, 701 (2008) (Abstract – Conclusions) (emphasis in original).

18 Montori, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004) (emphasis added).

19 Id. at 1094b, 1095a.

20 Montori, et al., “Validity of composite end points in clinical trials.” 300 Brit. Med. J. 594, 596 (2005).

21 Schulz & Grimes, “Multiplicity in randomized trials I: endpoints and treatments,” 365 Lancet 1591, 1595a (2005) (emphasis added). These authors acknowledge that composite end points often lack clinical relevancy, and that the gain in statistical efficiency comes at the high cost of interpretational difficulties. Id. at 1593.

22 Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 1840 (2d ed. 2014) (47.5.8 Use of Composite Endpoints).

23 See, e.g., Stuart J. Pocock, John J.V. McMurray, and Tim J. Collier, “Statistical Controversies in Reporting of Clinical Trials: Part 2 of a 4-Part Series on Statistics for Clinical Trials,” 66 J. Am. Coll. Cardiol. 2648, 2650-51 (2015) (“Interpret composite endpoints carefully.”)(“COMPOSITE ENDPOINTS. These are commonly used in CV RCTs to combine evidence across 2 or more outcomes into a single primary endpoint. But, there is a danger of oversimplifying the evidence by putting too much emphasis on the composite, without adequate inspection of the contribution from each separate component.”); Eric Lim, Adam Brown, Adel Helmy, Shafi Mussa, and Douglas G. Altman, “Composite Outcomes in Cardiovascular Research: A Survey of Randomized Trials,” 149 Ann. Intern. Med. 612, 612, 615-16 (2008) (“Individual outcomes do not contribute equally to composite measures, so the overall estimate of effect for a composite measure cannot be assumed to apply equally to each of its individual outcomes.”) (“Therefore, readers are cautioned against assuming that the overall estimate of effect for the composite outcome can be interpreted to be the same for each individual outcome.”); Freemantle, et al., “Composite outcomes in randomized trials: Greater precision but with greater uncertainty.” 289 J. Am. Med. Ass’n 2554, 2559a (2003) (“To avoid the burying of important components of composite primary outcomes for which on their own no effect is concerned, . . . the components of a composite outcome should always be declared as secondary outcomes, and the results described alongside the result for the composite outcome.”).

24 Freemantle & Calvert, “Composite and surrogate outcomes in randomized controlled trials.” 334 Brit. Med. J. 757a (2007).

25 Lem Moyé, “Statistical Methods for Cardiovascular Researchers,” 118 Circulation Research 439, 451 (2016).

Posted in Causation, Expert Witnesses, Rule 702, statistical evidence, Statistical Power | Comments Off on The Hazard of Composite End Points – More Lumpenepidemiology in the Courts

Carl Cranor’s Conflicted Jeremiad Against Daubert

September 23rd, 2018

Carl Cranor’s Conflicted Jeremiad Against Daubert

It seems that authors who have the most intense and refractory conflicts of interest (COI) often fail to see their own conflicts and are the most vociferous critics of others for failing to identify COIs. Consider the spectacle of having anti-tobacco activists and tobacco plaintiffs’ expert witnesses assert that the American Law Institute had an ethical problem because Institute members included some tobacco defense lawyers.¹ Somehow these authors overlooked their own positional and financial conflicts, as well as the obvious fact that the Institute’s members included some tobacco plaintiffs’ lawyers as well. Still, the complaint was instructive because it typifies the abuse of ethical asymmetrical standards, as well as ethical blindspots.²

Recently, Raymond Richard Neutra, Carl F. Cranor, and David Gee published a paper on the litigation use of Sir Austin Bradford Hill’s considerations for evaluating whether an association is causal or not.³ See Raymond Richard Neutra, Carl F. Cranor, and David Gee, “The Use and Misuse of Bradford Hill in U.S. Tort Law,” 58 Jurimetrics 127 (2018) [cited here as Cranor]. Their paper provides a startling example of hypocritical and asymmetrical assertions of conflicts of interests.

Neutra is a self-styled public health advocate⁴ and the Chief of the Division of Environmental and Occupational Disease Control (DEODC) of the California Department of Health Services (CDHS). David Gee, not to be confused with the English artist or the Australian coin forger, is with the European Environment Agency, in Copenhagen, Denmark. He is perhaps best known for his precautionary principle advocacy and his work with trade unions.⁵

Carl Cranor is with the Center for Progressive Reform, and he teaches philosophy at one of the University of California campuses. Although he is neither a lawyer nor a scientist, he participates with some frequency as a consultant, and as an expert witness, in lawsuits, on behalf of claimants. Perhaps Cranor’s most notorious appearance as an expert witness resulted in the decision of Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11 (1st Cir. 2011), cert. denied sub nom., U.S. Steel Corp. v. Milward, 132 S. Ct. 1002 (2012). Probably less generally known is that Cranor was one of the founders of an organization, the Council for Education and Research on Toxics (CERT), which recently was the complaining party in a California case in which CERT sought money damages for Starbucks’ failure to label each cup of coffee sold as known to the State of California as causing cancer.⁶ Having a so-called not-for-profit corporation can also be pretty handy, especially when it holds itself out as a scientific organization and files amicus briefs in support of reversing Daubert exclusions of the founding members of the corporation, as CERT did on behalf of its founding member in the Milward case.⁷ The conflict of interest, in such an amicus brief, however, is no longer potential or subtle, and violates the duty of candor to the court.

In this recent article on Hill’s considerations for judging causality, Cranor followed CERT’s lead from Milward. Cranor failed to disclose that he has been a party expert witness for plaintiffs, in cases in which he was advocating many of the same positions put forward in the Jurimetrics article, including the Milward case, in which he was excluded from testifying by the trial court. Cranor’s lack of candor with the readers of the Jurimetrics article is all the more remarkable in that Cranor and his co-authors give conflicts of interest outsize importance in substantive interpretations of scholarship:

“the desired reliability for evidence evaluation requires that biases that derive from the financial interests and ideological commitments of the investigators and editors that control the gateways to publication be considered in a way that Hill did not address.”

Cranor at 137 & n.59. Well, we could add that Cranor’s financial interests and ideological commitments might well be considered in evaluating the reliability of the opinions and positions advanced in this most recent work by Cranor and colleagues. If you believe that COIs disqualify a speaker from addressing important issues, then you have all the reason you need to avoid reading Cranor’s recent article.

Dubious Scholarship

The more serious problem with Cranor’s article is not his ethically strained pronouncements about financial interests, but the dubious scholarship he and his colleagues advance to thwart judicial gatekeeping of even more dubious expert witness opinion testimony. To begin with, the authors disparage the training and abilities of federal judges to assess the epistemic warrant and reliability of proffered causation opinions:

“With their enhanced duties to review scientific and technical testimony federal judges, typically not well prepared by legal education for these tasks, have struggled to assess the scientific support for—and the reliability and relevance of—expert testimony.”

Cranor at 147. Their assessment is fair but hides the authors’ cynical agenda to remove gatekeeping and leave the assessment to lay juries, who are less well prepared for the task, and whose function ensures no institutional accountability, review, or public evaluation.

Similarly, the authors note the temporal context and limitations of Bradford Hill’s 1965 paper, which date and limit the advice provided over 50 years ago in a discipline that has changed dramatically with the advancement of biological, epidemiologic, and genetic science.⁸ Even at the time of its original publication in 1965, Bradford Hill’s paper, which was based upon an informal lecture, was not designed or intended to be a definitive treatment of causal inference. Cranor and his colleagues make no effort to review Bradford Hill’s many other publications, both before and after his 1965 dinner speech, for evidence of his views on the factors for causal inference, including the role of statistical testing and inference.

Nonetheless, Bradford Hill’s 1965 paper has become a landmark, even if dated, because of its author’s iconic status in the world of public health, earned for his showing that tobacco smoking causes lung cancer,⁹ and for advancing the role of double-blind randomized clinical trials.¹⁰ Cranor and his colleagues made no serious effort to engage with the large body of Bradford Hill’s writings, including his immensely important textbook, The Principles of Medical Statistics, which started as a series of articles in The Lancet, and went through 12 editions in print.¹¹Hill’s reputation will no doubt survive Cranor’s bowdlerized version of Sir Austin’s views.

Epidemiology is Dispensable When It Fails to Support Causal Claims

The egregious aspect of Cranor’s article is its bill of particulars against the federal judiciary for allegedly errant gatekeeping, which for these authors translates really into any gatekeeping at all. Cranor at 144-45. Indeed, the authors provide not a single example of what was a “proper” exclusion of an expert witness, who was contending for some doubtful causal claim. Perhaps they have never seen a proper exclusion, but doesn’t that speak volumes about their agenda and their biases?

High on the authors’ list of claimed gatekeeping errors is the requirement that a causal claim be supported with epidemiologic evidence. Although some causal claims may be supported by strong evidence of a biological process with mechanistic evidence, such claims are not common in United States tort litigation.

In support of the claim that epidemiology is dispensable, Cranor suggests that:

“Some courts have recognized this, and distinguished scientific committees often do not require epidemiological studies to infer harm to humans. For example, the International Agency for Research on Cancer (IRAC) [sic], the National Toxicology Program, and California’s Proposition 65 Scientific Advisory Panel, among others, do not require epidemiological data to support findings that a substance is a probable or—in some cases—a known human carcinogen, but it is welcomed if available.”

Cranor at 149. California’s Proposition 65!??? Even IARC is hard to take seriously these days with its capture by consultants for the litigation industry, but if we were to accept IARC as an honest broker of causal inferences, what substance “known” to IARC to cause cancer in humans (Category I) was branded as a “known carcinogen” without the support of epidemiologic studies? Inquiring minds might want to know, but they will not learn the answer from Cranor and his co-authors.

When it comes to adverting to legal decisions that supposedly support the authors’ claim that epidemiology is unnecessary, their scholarship is equally wanting. The paper cites the notorious Wells case, which was so roundly condemned in scientific circles, that it probably helped ensure that a decision such as Daubert would ultimately be handed down by the Supreme Court. The authors seemingly cannot read, understand, and interpret even the most straightforward legal decisions. Here is how they cite Wells as support for their views:

“Wells v. Ortho Pharm. Corp., 788 F.2d 741, 745 (11th Cir. 1986) (reviewing a district court’s decision deciding not to require the use of epidemiological evidence and instead allowing expert testimony).”

Cranor at 149-50 n.122. The trial judge in Wells never made such a decision; indeed, the case was tried by the bench, before the Supreme Court decided Daubert. There was no gatekeeping involved at all. More important, however, and contrary to Cranor’s explanatory parenthetical, both sides presented epidemiologic evidence in support of their positions.¹²

Cranor and his co-authors similarly misread and misrepresent the trial court’s decision in the litigation over maternal sertraline use and infant birth defects. Twice they cite the Multi-District Litigation trial court’s decision that excluded plaintiffs’ expert witnesses:

“In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., 26 F. Supp. 3d 449, 455 (E.D. Pa. 2014) (expert may not rely on nonstatistically significant studies to which to apply the [Bradford Hill] factors).”

Cranor at 144 n.85; 158 n.179. The MDL judge, Judge Rufe, decidedly never held that an expert witness may not rely upon a statistically non-significant study in a “Bradford Hill” analysis, and the Third Circuit, which affirmed the exclusions of the plaintiffs’ expert witnesses’ testimony, was equally clear in avoiding the making of such a pronouncement.¹³

Who Needs Statistical Significance

Part of Cranor’s post-science agenda is to intimidate judges into believing that statistical significance is unnecessary and a wrong-headed criterion for judging the validity of relied upon research. In their article, Cranor and friends suggest that Hill agreed with their radical approach, but nothing could be further from the truth. Although these authors parse almost every word of Hill’s 1965 article, they conveniently omit Hill’s views about the necessary predicates for applying his nine considerations for causal inference:

“Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965). Cranor’s radicalism leaves no room for assessing whether a putative association is “beyond what we would care to attribute to the play of chance,” and his poor scholarship ignores Hill’s insistence that this statistical analysis be carried out.¹⁴

Hill’s work certainly acknowledged the limitations of statistical method, which could not compensate for poorly designed research:

“It is a serious mistake to rely upon the statistical method to eliminate disturbing factors at the completion of the work. No statistical method can compensate for a badly planned experiment.”

Austin Bradford Hill, Principles of Medical Statistics at 4 (4th ed. 1948). Hill was equally clear, however, that the limits on statistical methods did not imply that statistical methods are not needed to interpret a properly planned experiment or study. In the summary section of his textbook’s first chapter, Hill removed any doubt about his view of the importance, and the necessity, of statistical methods:

“The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.”

Id. at 10 (emphasis added).

In his efforts to eliminate judicial gatekeeping of expert witness testimony, Cranor has struggled with understanding of statistical inference and testing.¹⁵ In an early writing, a 1993 book, Cranor suggests that we “can think of type I and II error rates as “standards of proof,” which begs the question whether they are appropriately used to assess significance or posterior probabilities.¹⁶ Indeed, Cranor goes further, in confusing significance and posterior probabilities, when he described the usual level of alpha (5%) as the “95%” rule, and claimed that regulatory agencies require something akin to proof “beyond a reasonable doubt,” when they require two “statistically significant” studies.¹⁷

Cranor has persisted in this fallacious analysis in his writings. In a 2006 book, he erroneously equated the 95% coefficient of statistical confidence with 95% certainty of knowledge.¹⁸ Later in this same text, Cranor again asserted his nonsense that agency regulations are written when supported by “beyond a reasonable doubt.”¹⁹ Given that Cranor has consistently confused significance and posterior probability, he really should not be giving advice to anyone about statistical or scientific inference. Cranor’s persistent misunderstandings of basic statistical concepts do, however, explain his motivation for advocating the elimination of statistical significance testing, even if these misunderstandings make his enterprise intellectually unacceptable.

Cranor and company fall into a similar muddle when they offer advice on post-hoc power calculations, which advice ignores standard statistical learning for interpreting completed studies.²⁰ Another measure of the authors’ failed scholarship is their omission of any discussion of recent efforts by many in the scientific community to lower the threshold for statistical significance, based upon the belief that the customary 5% p-value is an order of magnitude too high.²¹

Relative Risks Greater Than Two

There are other tendentious arguments and treatments in Cranor’s brief against gatekeeping, but I will stop with one last example. The inference of specific causation from study risk ratios has provoked a torrent of verbiage from Sander Greenland (who is cited copiously by Cranor). Cranor, however, does not even scratch the surface of the issue and fails to cite the work of epidemiologists, such as Duncan C. Thomas, who have defended the use of probabilities of (specific) causation. More important, however, Cranor fails to speak out against the abuse of using any relative risk greater than 1.0 to support an inference of specific causation, when the nature of the causal relationship is neither necessary nor sufficient. In this context, Kenneth Rothman has reminded us that someone can be exposed to, or have, a risk, and then develop the related outcome, without there being any specific causation:

“An elementary but essential principle to keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between the exposure and the disease. For this reason, we cannot consider the incidence proportion or the incidence rate among exposed people to measure a causal effect.”

Kenneth J. Rothman, Epidemiology: An Introduction at 57 (2d ed. 2012).

The danger in Cranor’s article in Jurimetrics is that some readers will not realize the extreme partisanship in its ipse dixit, and erroneous, pronouncements. Caveat lector!

1 Elizabeth Laposata, Richard Barnes & Stanton Glantz, “Tobacco Industry Influence on the American Law Institute’s Restatements of Torts and Implications for Its Conflict of Interest Policies,” 98 Iowa L. Rev. 1 (2012).

2 The American Law Institute responded briefly. See Roberta Cooper Ramo & Lance Liebman, “The ALI’s Response to the Center for Tobacco Control Research & Education,” 98 Iowa L. Rev. Bull. 1 (2013), and the original authors’ self-serving last word. Elizabeth Laposata, Richard Barnes & Stanton Glantz, “The ALI Needs to Implement Modern Conflict of Interest Policies,” 98 Iowa L. Rev. Bull. 17 (2013).

3 Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295 (1965).

4 Raymond Richard Neutra, “Epidemiology Differs from Public Health Practice,” 7 Epidemiology 559 (1996).

5 See “Late Lessons from Early Warnings: The Precautionary Principle 1896-2000, a Conversation with David Gee” (2009).

6 Michael Waters, “The Secretive Non-Profit Gaming California’s Health Laws: The Council for Education and Research on Toxics has won million-dollar settlements using a controversial public health law,” The Outline (June 18, 2018).

7 “From Here to CERT-ainty” (June 28, 2018).

8 Kristen Fedak, Autumn Bernal, Zachary Capshaw, and Sherilyn A Gross, “Applying the Bradford Hill Criteria in the 21^st Century: How Data Integration Has Changed Causal Inference in Molecular Epidemiology,” Emerging Themes in Epidemiol. 12:14 (2015); John P. A. Ioannides, “Exposure Wide Epidemiology, Revisiting Bradford Hill,” 35 Stats. Med. 1749 (2016).

9 Richard Doll & Austin Bradford Hill, “Smoking and Carcinoma of the Lung,” 2(4682) Brit. Med. J. (1950).

10 Geoffrey Marshall (chairman), “Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation,” 2 Brit. Med. J. 769, 769–71 (1948).

11 Vern Farewell & Anthony Johnson, “The origins of Austin Bradford Hill’s classic textbook of medical statistics,” 105 J. Royal Soc’y Med. 483 (2012). See also Hilary E. Tillett, “Bradford Hill’s Principles of Medical Statistics,” 108 Epidemiol. Infect. 559 (1992).

12 See “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 1”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 2”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 3”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 4”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 5”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 6.”

13 In re Zoloft Prod. Liab. Litig., No. 16-2247 , __ F.3d __, 2017 WL 2385279, 2017 U.S. App. LEXIS 9832 (3d Cir. June 2, 2017) (affirming exclusion of biostatistician Nichols Jewell’s dodgy opinions, which involved multiple methodological flaws and failures to follow any methodology faithfully).

14 See “Bradford Hill on Statistical Methods” (Sept. 24, 2013).

15 “Rhetorical Strategy in Characterizing Scientific Burdens of Proof” (Nov. 15, 2014).

16 Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (1993) (arguing incorrectly that one can think of α, β (the chances of type I and type II errors, respectively and 1- β as measures of the “risk of error” or “standards of proof.”); see also id. at 44, 47, 55, 72-76. At least one astute reviewer called Cranor on his statistical solecisms. Michael D. Green, “Science Is to Law as the Burden of Proof is to Significance Testing: Book Review of Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law,” 37 Jurimetrics J. 205 (1997) (taking Cranor to task for confusing significance and posterior (burden of proof) probabilities).

17 Id. (squaring 0.05 to arrive at “the chances of two such rare events occurring” as 0.0025, which impermissibly assumes independence between the two studies).

18 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 100 (2006) (incorrectly asserting that “[t]he practice of setting α =.05 I call the “95% rule,” for researchers want to be 95% certain that when knowledge is gained [a study shows new results] and the null hypothesis is rejected, it is correctly rejected.”).

19 Id. at 266.

20 See “Failed Gatekeeping in Ambrosini v. Labarraque (1996)” (Dec. 28, 2017).

21 See, e.g., John P. A. Ioannidis, “The Proposal to Lower P Value Thresholds to .005,” 319 J. Am. Med. Ass’n 1429 (2018); Daniel J. Benjamin, James O. Berger, Valen E. Johnson, et al., “Redefine statistical significance,” 2 Nature Human Behavior 6 (2018).

Posted in Causation, Conflicts of Interest, Public Nuisances, Rule 702, Scientific Evidence, statistical evidence | Comments Off on Carl Cranor’s Conflicted Jeremiad Against Daubert

The Appeal of the Learned Treatise

August 16th, 2018

In many states, the so-called “learned treatise” doctrine creates a pseudo-exception to the rule against hearsay. The contents of such a treatise can be read to the jury, not for its truth, but for the jury to consider against the credibility of an expert witness who denies the truth of the treatise. Supposedly, some lawyers can understand the distinction between the treatise’s content’s being admitted for its truth as opposed to the credibility of an expert witness who denies its truth. Under the Federal Rules of Evidence, and in some states, the language of the treatise may be considered for its truth as well, but the physical treatise may not be entered into evidence. There are several serious problems with both the state and the federal versions of the doctrine.¹

Legal on-line media recently reported about an appeal in the Pennsylvania Superior Court, which heard arguments in a case that apparently turned on allegations of trial court error in refusing to allow learned treatise cross-examination of a plaintiff’s expert witness in Pledger v. Janssen Pharms., Inc., Phila. Cty. Ct. C.P., April Term 2012, No. 1997. See Matt Fair, “J&J Urges Pa. Appeals Court To Undo $2.5M Risperdal Verdict,” Law360 (Aug. 8, 2018) (reporting on defendants’ appeal in Pledger, Pa. Super. Ct. nos. 2088 EDA 2016 and 2187 EDA 2016).

In Pledger, plaintiff claimed that he developed gynecomastia after taking the defendants’ antipsychotic medication Risperdal. Defendants warned about gynecomastia, but the plaintiff claimed that the defendants had not accurately quantified the rate of gynecomastia in its package insert.

From Mr. Fair’s reporting, readers can discern only one ground for appeal, namely whether the “trial judge improperly barred it from using a scientific article to challenge an expert’s opinion that the antipsychotic drug Risperdal caused an adolescent boy to grow breasts.” Without having heard the full oral argument, or having read the briefs, the reader cannot tell whether there were other grounds. According to Mr. Fair, defense counsel contended that the trial court’s refusal to allow the learned treatise “had allowed the [plaintiff’s] expert’s opinion to go uncountered during cross-examination.” The argument, according to Mr. Fair, continued:

“Instead of being able to confront the medical causation expert with an article that absolutely contradicted and undermined his opinion, the court instead admonished counsel in front of the jury and said, ‘In Pennsylvania, we don’t try cases by books, we try them by live witnesses’.”

The cross-examination at issue, on the other hand, related to whether gynecomastia could occur naturally in pre-pubertal boys. Plaintiffs’ expert witness, Dr. Mark Solomon, a plastic surgeon, opined that gynecomastia did not occur naturally, and the defense counsel attempted to confront him with a “learned treatise,” an article from the Journal of Endocrinology, which apparently stated to the contrary. Solomon, following the usual expert witness playbook, testified that he had not read the article (and why would a surgeon have read this endocrinology journal?) Defense counsel pressed, and according to Mr. Fair, the trial judge disallowed further inquiry on cross-examination. On appeal, the defendants argued that the trial judge violated the learned treatise rule that allows “scholarly articles to be used as evidence.” The plaintiffs contended, in defense of their judgment below, that the “learned treatise rule” does not allow “scholarly articles to simply be read verbatim into the record,” and that the defense had the chance to raise the article in the direct examination of its own expert witnesses.

The Law360 reporting is curious on several fronts. The assigned error would have only been in support of a challenge to the denial of a new trial, and in a Risperdal case, the defense would likely have made a motion for judgment notwithstanding the verdict, as well as for new trial. Although the appellate briefs are not posted online, the defense’s post-trial motions in Pledger v. Janssen Pharms., Inc., Phila. Cty. Ct. C.P., April Term 2012, No. 1997, are available. See Defendants’ Motions for Post-Trial Relief Pursuant to Pa.R.C.P. 227.1 (Mar. 6, 2015).

At least at the post-trial motion stage, the defendants clearly made both motions for judgment and for a new trial, as expected.

As for the preservation of the “learned treatise” issue, the entire assignment of error is described in a single paragraph (out of 116 paragraphs) in the post-trial motion, as follows:

27. Moreover, appearing to rely on Aldridge v. Edmunds, 750 A.2d 292 (Pa. 2000), the Court prevented Janssen from cross-examining Dr. Solomon with scientific authority that would undermine his position. See, e.g., Tr. 60:9-63:2 (p.m.). Aldridge, however, addresses the use of learned treatises in the direct examination, and it cites with approval the case of Cummings v. Borough of Nazareth, 242 A.2d 460, 466 (Pa. 1968) (plurality op.), which stated that “[i]t is entirely proper in examination and cross-examination for counsel to call the witness’s attention to published works on the matter which is the subject of the witness’s testimony.” Janssen should not have been so limited in its cross examination of Dr. Solomon.

In Cummings, the issue revolved around using manuals that contained industry standards for swimming pool construction, not the appropriateness of a learned scientific treatise. Cummings v. Nazareth Borough, 430 Pa. 255, 266-67 (Pa. 1968). The defense motion did not contend that the defense counsel had laid the appropriate foundation for the learned treatise to be used. In any event, the trial judge wrote an opinion on the post-trial motions, in which he did not appear to address the learned treatise issue at all. Pledger v Janssen Pharms, Inc., Phila. Ct. C.P., Op. sur post-trial motions (Aug. 10., 2017) (Djerassi, J.).

The Pennsylvania Supreme Court has addressed the learned treatise exception to the rule against hearsay on several occasions. Perhaps the leading case described the law as:

“well-settled that an expert witness may be cross-examined on the contents of a publication upon which he or she has relied in forming an opinion, and also with respect to any other publication which the expert acknowledges to be a standard work in the field. * * * In such cases, the publication or literature is not admitted for the truth of the matter asserted, but only to challenge the credibility of the witness’ opinion and the weight to be accorded thereto. * * * Learned writings which are offered to prove the truth of the matters therein are hearsay and may not properly be admitted into evidence for consideration by the jury.”

Majdic v. Cincinnati Mach. Co., 537 A. 2d 334, 621-22 (Pa. 1988) (internal citations omitted).

The Law360 report is difficult to assess. Perhaps the reporting by Mr. Fair was non-eponymously unfair? There is no discussion of how the defense had laid its foundation. Perhaps the defense had promised “to connect up” by establishing the foundation of the treatise through a defense expert witness. If there had been a foundation established, or promised to be established, the post-trial motion would have, in the normal course of events, cited the transcript for the proffer of a foundation. And why did Mr. Fair report on the oral argument as though the learned treatise issue was the only issue before the court? Inquiring minds want to know.

Judge Djerassi’s opinion on post-trial motions was perhaps more notable for embracing some testimony on statistical significance from Dr. David Kessler, former Commissioner of the FDA, and now a frequent testifier for the lawsuit industry on regulatory matters. Judge Djerassi, in his opinion, stated:

“This statistically significant measure is shown in Table 21 and was within a chi-square rate of .02, meaning within a 98% chance of certainty. In Dr. Kessler’s opinion this is a statistically significant finding. (N.T. 1/29/15, afternoon, pp. p. 27, ln. 2 10-11, p. 28, lns. 7-12).”

Post-trial opinion at p.11.² Surely, the defense’s expert witnesses explained that the chi-square test did not yield a measure of certainty that the measured statistic was the correct value.

The trial court’s whopper was enough of a teaser to force me to track down Kessler’s testimony, which was posted to the internet by the plaintiffs’ law firm. Judge Djerassi’s erroneous interpretation of the p-value can indeed be traced to Kessler’s improvident testimony:

Q. And since 2003, what have you been doing at University of California San Francisco, sir?

A. Among other things, I am currently a professor of pediatrics, professor of epidemiology, professor of biostatistics.

Pledger Transcript, Thurs., Jan. 28, 2015, Vol. 3, Morning Session at 111:3-7.

A. What statistical significance means is it’s mathematical and scientific calculations, but when we say something is statistically significant, it’s unlikely to happen by chance. So that association is very likely to be real. If you redid this, general statistically significant says if I redid this and redid the analysis a hundred times, I would get the same result 95 of those times.

Pledger Transcript, Fri., Jan. 29, 2015, Vol. 4, Morning Session at 80:18 – 81:2.

Q. So, sir, if we see on a study — and by the way, do the investigators of a study decided in their own criteria what is statistically significant? Do they assign what’s called a P value?

A. Exactly. So you can set it at 95, you can set it at 98, you can set it at 90. Generally, 95 significance level, for those of you who are mathematicians or scientifically inclined, it’s a P less than .05.

Q. As a general rule?

A. Yes.

Q. So if I see a number that is .0158, next to a dataset, that would mean that it occurs by chance less than two in 100. Correct?

A. Yes, that’s what the P value is saying.

Pledger Transcript, Fri., Jan. 29, 2015, Vol. 4, Morning Session at 81:5-20

Q. … If someone — if something has a p-value of less than .02, the converse of it is that your 98 — .98, that would be 98 percent certain that the result is not by chance?

A. Yes. That’s a fair way of saying it.

Q. And if you have a p-value of .10, that means the converse of it is 90 percent, or 90 percent that it’s not by chance, correct?

A. Yes.

Pledger Transcript, Fri., Jan. 29, 2015, Vol. 4, Afternoon Session at 7:14-22.

Q. Okay. And the last thing I’d like to ask about — sorry to keep going back and forth — is so if the jury saw a .0158, that’s of course less than .02, which means that it is 90 — almost 99 percent not by chance.

A. Yes. It’s statistically significant, as I would call it.

Pledger Transcript, Fri., Jan. 29, 2015, Vol. 4, Afternoon Session at 8:7-13.

1 See “Further Unraveling of the Learned Treatise Exception” (Sept. 29, 2010); “Unlearning The Learned Treatise Exception” (Aug. 21, 2010); “The New Wigmore on Learned Treatises” (Sept. 12, 2011); “Trust Me” Rules of Evidence” (Oct. 18, 2012).

2 See also Djerassi opinion at p.13 n. 13 (“P<0.02 is the chi—square rate reflecting a data outcome within a 98% chance of certainty.”).

Posted in Causation, Expert Witnesses, statistical evidence | Comments Off on The Appeal of the Learned Treatise

N.J. Supreme Court Uproots Weeds in Garden State’s Law of Expert Witnesses

August 8th, 2018

The United States Supreme Court’s decision in Daubert is now over 25 years old. The idea of judicial gatekeeping of expert witness opinion testimony is even older in New Jersey state courts. The New Jersey Supreme Court articulated a reliability standard before the Daubert case was even argued in Washington, D.C. See Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991). Articulating a standard, however, is something very different from following a standard, and in many New Jersey trial courts, until very recently, the standard was pretty much anything goes.

One counter-example to the general rule of dog-eat-dog in New Jersey was Judge Nelson Johnson’s careful review and analysis of the proffered causation opinions in cases in which plaintiffs claimed that their use of the anti-acne medication isotretinoin (Accutane) caused Crohn’s disease. Judge Johnson, who sits in the Law Division of the New Jersey Superior Court for Atlantic County held a lengthy hearing, and reviewed the expert witnesses’ reliance materials.¹ Judge Johnson found that the plaintiffs’ expert witnesses had employed undue selectivity in choosing what to rely upon. Perhaps even more concerning, Judge Johnson found that these witnesses had refused to rely upon reasonably well-conducted epidemiologic studies, while embracing unpublished, incomplete, and poorly conducted studies and anecdotal evidence. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div., Atlantic Cty. Feb. 20, 2015). In response, Judge Johnson politely but firmly closed the gate to conclusion-driven duplicitous expert witness causation opinions in over 2,000 personal injury cases. “Johnson of Accutane – Keeping the Gate in the Garden State” (Mar. 28, 2015).

Aside from resolving over 2,000 pending cases, Judge Johnson’s judgment was of intense interest to all who are involved in pharmaceutical and other products liability litigation. Judge Johnson had conducted a pretrial hearing, sometimes called a Kemp hearing in New Jersey, after the New Jersey Supreme Court’s opinion in Kemp v. The State of New Jersey, 174 N.J. 412 (2002). At the hearing and in his opinion that excluded plaintiffs’ expert witnesses’ causation opinions, Judge Johnson demonstrated a remarkable aptitude for analyzing data and inferences in the gatekeeping process.

When the courtroom din quieted, the trial court ruled that the proffered testimony of Dr., Arthur Kornbluth and Dr. David Madigan did not meet the liberal New Jersey test for admissibility. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div. Atlantic Cty. Feb. 20, 2015). And in closing the gate, Judge Johnson protected the judicial process from several bogus and misleading “lines of evidence,” which have become standard ploys to mislead juries in courthouses where the gatekeepers are asleep. Recognizing that not all evidence is on the same analytical plane, Judge Johnson gave case reports short shrift.

“[u]nsystematic clinical observations or case reports and adverse event reports are at the bottom of the evidence hierarchy.”

Id. at *16. Adverse event reports, largely driven by the very litigation in his courtroom, received little credit and were labeled as “not evidentiary in a court of law.” Id. at 14 (quoting FDA’s description of FAERS).

Judge Johnson recognized that there was a wide range of identified “risk factors” for irritable bowel syndrome, such as prior appendectomy, breast-feeding as an infant, stress, Vitamin D deficiency, tobacco or alcohol use, refined sugars, dietary animal fat, fast food. In re Accutane, 2015 WL 753674, at *9. The court also noted that there were four medications generally acknowledged to be potential risk factors for inflammatory bowel disease: aspirin, nonsteroidal anti-inflammatory medications (NSAIDs), oral contraceptives, and antibiotics. Understandably, Judge Johnson was concerned that the plaintiffs’ expert witnesses preferred studies unadjusted for potential confounding co-variables and studies that had involved “cherry picking the subjects.” Id. at *18.

Judge Johnson had found that both sides in the isotretinoin cases conceded the relative unimportance of animal studies, but the plaintiffs’ expert witnesses nonetheless invoked the animal studies in the face of the artificial absence of epidemiologic studies that had been created by their cherry-picking strategies. Id.

Plaintiffs’ expert witnesses had reprised a common claimants’ strategy; namely, they claimed that all the epidemiology studies lacked statistical power. Their arguments often ignored that statistical power calculations depend upon statistical significance, a concept to which many plaintiffs’ counsel have virulent antibodies, as well as an arbitrarily selected alternative hypothesis of association size. Furthermore, the plaintiffs’ arguments ignored the actual point estimates, most of which were favorable to the defense, and the observed confidence intervals, most of which were reasonably narrow.

The defense responded to the bogus statistical arguments by presenting an extremely capable clinical and statistical expert witness, Dr. Stephen Goodman, to present a meta-analysis of the available epidemiologic evidence.

Meta-analysis has become an important facet of pharmaceutical and other products liability litigation[1]. Fortunately for Judge Johnson, he had before him an extremely capable expert witness, Dr. Stephen Goodman, to explain meta-analysis generally, and two meta-analyses he had performed on isotretinoin and irritable bowel outcomes.

Dr. Goodman explained that the plaintiffs’ witnesses’ failure to perform a meta-analysis was telling when meta-analysis can obviate the plaintiffs’ hyperbolic statistical complaints:

“the strength of the meta-analysis is that no one feature, no one study, is determinant. You don’t throw out evidence except when you absolutely have to.”

In re Accutane, 2015 WL 753674, at *8.

Judge Johnson’s judicial handiwork received non-deferential appellate review from a three-judge panel of the Appellate Division, which reversed the exclusion of Kornbluth and Madigan. In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832 (App. Div. 2017). The New Jersey Supreme Court granted the isotretinoin defendants’ petition for appellate review, and the issues were joined over the appropriate standard of appellate review for expert witness opinion exclusions, and the appropriateness of Judge Johnson’s exclusions of Kornbluth and Madigan. A bevy of amici curiae joined in the fray.²

Last week, the New Jersey Supreme Court issued a unanimous opinion, which reversed the Appellate Division’s holding that Judge Johnson had “mistakenly exercised” discretion. Applying its own precedents from Rubanick, Landrigan, and Kemp, and the established abuse-of-discretion standard, the Court concluded that the trial court’s ruling to exclude Kornbluth and Madigan was “unassailable.” In re Accutane Litig., ___ N.J. ___, 2018 WL 3636867 (2018), Slip op. at 79.³

The high court graciously acknowledged that defendants and amici had “good reason” to seek clarification of New Jersey law. Slip op. at 67. In abandoning abuse-of-discretion as its standard of review, the Appellate Division had relied upon a criminal case that involved the application of the Frye standard, which is applied as a matter of law. Id. at 70-71. The high court also appeared to welcome the opportunity to grant review and reverse the intermediate court reinforce “the rigor expected of the trial court” in its gatekeeping role. Id. at 67. The Supreme Court, however, did not articulate a new standard; rather it demonstrated at length that Judge Johnson had appropriately applied the legal standards that had been previously announced in New Jersey Supreme Court cases.⁴

In attempting to defend the Appellate Division’s decision, plaintiffs sought to characterize New Jersey law as somehow different from, and more “liberal” than, the United States Supreme Court’s decision in Daubert. The New Jersey Supreme Court acknowledged that it had never formally adopted the dicta from Daubert about factors that could be considered in gatekeeping, slip op. at 10, but the Court went on to note what disinterested observers had long understood, that the so-called Daubert factors simply flowed from a requirement of sound methodology, and that there was “little distinction” and “not much light” between the Landrigan and Rubanick principles and the Daubert case or its progeny. Id at 10, 80.

Curiously, the New Jersey Supreme Court announced that the Daubert factors should be incorporated into the New Jersey Rules 702 and 703 and their case law, but it stopped short of declaring New Jersey a “Daubert” jurisdiction. Slip op. at 82. In part, the Court’s hesitance followed from New Jersey’s bifurcation of expert witness standards for civil and criminal cases, with the Frye standard still controlling in the criminal docket. At another level, it makes no sense to describe any jurisdiction as a “Daubert” state because the relevant aspects of the Daubert decision were dicta, and the Daubert decision and its progeny were superseded by the revision of the controlling statute in 2000.⁵

There were other remarkable aspects of the Supreme Court’s Accutane decision. For instance, the Court put its weight behind the common-sense and accurate interpretation of Sir Austin Bradford Hill’s famous articulation of factors for causal judgment, which requires that sampling error, bias, and confounding be eliminated before assessing whether the observed association is strong, consistent, plausible, and the like. Slip op. at 20 (citing the Reference Manual at 597-99), 78.

The Supreme Court relied extensively on the National Academies’ Reference Manual on Scientific Evidence.⁶ That reliance is certainly preferable to judicial speculations and fabulations of scientific method. The reliance is also positive, considering that the Court did not look only at the problematic epidemiology chapter, but adverted also to the chapters on statistical evidence and on clinical medicine.

The Supreme Court recognized that the Appellate Division had essentially sanctioned an anything goes abandonment of gatekeeping, an approach that has been all-too-common in some of New Jersey’s lower courts. Contrary to the previously prevailing New Jersey zeitgeist, the Court instructed that gatekeeping must be “rigorous” to “prevent[] the jury’s exposure to unsound science through the compelling voice of an expert.” Slip op. at 68-9.

Not all evidence is equal. “[C]ase reports are at the bottom of the evidence hierarchy.” Slip op. at 73. Extrapolation from non-human animal studies is fraught with external validity problems, and such studies “far less probative in the face of a substantial body of epidemiologic evidence.” Id. at 74 (internal quotations omitted).

Perhaps most chilling for the lawsuit industry will be the Supreme Court’s strident denunciation of expert witnesses’ selectivity in choosing lesser evidence in the face of a large body of epidemiologic evidence, id. at 77, and their unprincipled cherry picking among the extant epidemiologic publications. Like the trial court, the Supreme Court found that the plaintiffs’ expert witnesses’ inconsistent use of methodological criteria and their selective reliance upon studies (disregarding eight of the nine epidemiologic studies) that favored their task masters was the antithesis of sound methodology. Id. at 73, citing with approval, In re Lipitor, ___ F.3d ___ (4th Cir. 2018) (slip op. at 16) (“Result-driven analysis, or cherry-picking, undermines principles of the scientific method and is a quintessential example of applying methodologies (valid or otherwise) in an unreliable fashion.”).

An essential feature of the Supreme Court’s decision is that it was not willing to engage in the common reductionism that has “all epidemiologic studies are flawed,” and which thus privileges cherry picking. Not all disagreements between expert witnesses can be framed as differences in interpretation. In re Accutane will likely stand as a bulwark against flawed expert witness opinion testimony in the Garden State for a long time.

1 Judge Nelson Johnson is also the author of Boardwalk Empire: The Birth, High Times, and Corruption of Atlantic City (2010), a spell-binding historical novel about political and personal corruption.

2 In support of the defendants’ positions, amicus briefs were filed by the New Jersey Business & Industry Association, Commerce and Industry Association of New Jersey, and New Jersey Chamber of Commerce; by law professors Kenneth S. Broun, Daniel J. Capra, Joanne A. Epps, David L. Faigman, Laird Kirkpatrick, Michael M. Martin, Liesa Richter, and Stephen A. Saltzburg; by medical associations the American Medical Association, Medical Society of New Jersey, American Academy of Dermatology, Society for Investigative Dermatology, American Acne and Rosacea Society, and Dermatological Society of New Jersey, by the Defense Research Institute; by the Pharmaceutical Research and Manufacturers of America; and by New Jersey Civil Justice Institute. In support of the plaintiffs’ position and the intermediate appellate court’s determination, amicus briefs were filed by political action committee the New Jersey Association for Justice; by the Ironbound Community Corporation; and by plaintiffs’ lawyer Allan Kanner.

3 Nothing in the intervening scientific record called question upon Judge Johnson’s trial court judgment. See, e.g., I.A. Vallerand, R.T. Lewinson, M.S. Farris, C.D. Sibley, M.L. Ramien, A.G.M. Bulloch, and S.B. Patten, “Efﬁcacy and adverse events of oral isotretinoin for acne: a systematic review,” 178 Brit. J. Dermatol. 76 (2018).

4 Slip op. at 9, 14-15, citing Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991) (“We initially took that step to allow the parties in toxic tort civil matters to present novel scientific evidence of causation if, after the trial court engages in rigorous gatekeeping when reviewing for reliability, the proponent persuades the court of the soundness of the expert’s reasoning.”).

5 The Court did acknowledge that Federal Rule of Evidence 702 had been amended in 2000, to reflect the Supreme Court’s decision in Daubert, Joiner, and Kumho Tire, but the Court did not deal with the inconsistencies between the present rule and the 1993 Daubert case. Slip op. at 64, citing Calhoun v. Yamaha Motor Corp., U.S.A., 350 F.3d 316, 320-21, 320 n.8 (3d Cir. 2003).

6 See Accutane slip op. at 12-18, 24, 73-74, 77-78. With respect to meta-analysis, the Reference Manual’s epidemiology chapter is still stuck in the 1980s and the prevalent resistance to poorly conducted, often meaningless meta-analyses. See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011) (The Reference Manual fails to come to grips with the prevalence and importance of meta-analysis in litigation, and fails to provide meaningful guidance to trial judges).

Posted in Evidence, Expert Witnesses, Frye, Meta-analysis, Reference Manual on Scientific Evidence, Rule 702, Scientific Evidence, statistical evidence, Statistical Power | Comments Off on N.J. Supreme Court Uproots Weeds in Garden State’s Law of Expert Witnesses

P-Values: Pernicious or Perspicacious?

May 12th, 2018

Professor Kingsley R. Browne, of the Wayne State University Law School, recently published a paper that criticized the use of p-values and significance testing in discrimination litigation. Kingsley R. Browne, “Pernicious P-Values: Statistical Proof of Not Very Much,” 42 Univ. Dayton L. Rev. 113 (2017) (cited below as Browne). Browne amply documents the obvious and undeniable, that judges, lawyers, and even some ill-trained expert witnesses, are congenitally unable to describe and interpret p-values properly. Most of Browne’s examples are from the world of anti-discrimination law, but he also cites a few from health effects litigation as well. Browne also cites from many of the criticisms of p-values in the psychology and other social science literature.

Browne’s efforts to correct judicial innumeracy are welcomed, but they take a peculiar turn in this law review article. From the well-known state of affairs of widespread judicial refusal or inability to discuss statistical concepts accurately, Browne argues for what seem to be two incongruous, inconsistent responses. Rejecting the glib suggestion of former Judge Posner that evidence law is not “fussy” about evidence, Browne argues that federal evidence law requires courts to be “fussy” about evidence, and that Rule 702 requires courts to exclude expert witnesses, whose opinions fail to “employ[] in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.” Browne at 143 (quoting from Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999). Browne tells us, with apparently appropriate intellectual rigor, that “[i]f a disparity that does not provide a p-value of less than 0.05 would not be accepted as meaningful in the expert’s discipline, it is not clear that the expert should be allowed to testify – on the basis of his expertise in that discipline – that the disparity is, in fact, meaningful.” Id.

In a volte face, Browne then argues that p-values do “not tell us much,” basically because they are dependent upon sample size. Browne suggests that the quantitative disparity between expected value and observed proportion or average can be assessed without the use of p-values, and that measuring a p-value “adds virtually nothing and just muddies the water.” Id. at 152. The prevalent confusion among judges and lawyers seems sufficient in Browne’s view to justify his proposal, as well as his further suggestion that Rule 403 should be invoked to exclude p-values:

“The ease with which reported p-values cause a trier of fact to slip into the transposition fallacy and the difficulty of avoiding that lapse of logic, coupled with the relatively sparse information actually provided by the p-value, make p-values prime candidates for exclusion under Federal Rule of Evidence 403. *** If judges, not to mention the statistical experts they rely on, cannot use the information without falling into fallacious reasoning, the likelihood that the jury will misunderstand the evidence is very high. Since the p-value actually provides little useful relevant information, the high risk of misleading the jury greatly exceeds its scant probative value, so it simply should not be presented to the jury.”

Id. at 152-53.

And yet, elsewhere in the same article, Browne ridicules one court and several expert witnesses who have argued in favor of conclusions that were based upon p-values up to 50%.¹ The concept of p-values cannot be so flexible as to straddle the extremes of having no probative value, and yet capable of rendering an expert witness’s opinions ludicrous. P-values quantify an estimate of random error, even if that error rate varies with sample size. To be sure, the measure of random error depends upon the specified model and assumption of a null hypothesis, but the crucial point is that the estimate (whether mean, proportion, risk ratio, risk difference, etc.) is rather meaningless without some further estimate of random variability of the estimate. Of course, random error is not the only type of error, but the existence of other potential systematic errors is hardly a reason to ignore random error.

In the science of health effects, many applications of p-values have given way to the use of confidence intervals, which arguably provide more direct assessments of both sample estimates, along with ranges of potential outcomes that are reasonably compatible with the sample estimates. Remarkably, Browne never substantively discusses confidence intervals in his article.

Under the heading of other problems with p-values and significance testing, Browne advances four additional putative problems with p-values. First, Browne asserts with little to no support that “[t]he null hypothesis is unlikely a priori.” Id. at 155. He fails to tell us why the null hypothesis of no disparity is not a reasonable starting place in the absence of objective evidence of a prior estimate. Furthermore, a null hypothesis of no difference will have legal significance in claims of health effects, or of unlawful discrimination.

Second, Browne argues that significance testing will lead to “[c]onflation of statistical and practical (or legal) significance” in the minds of judges and jurors. Id. at 156-58. This charge is difficult to sustain. The actors in legal cases can probably best appreciate practical significance and its separation from statistical significance, most readily. If a large class action showed that the expected value of a minority’s proportion was 15%, and the observed proportion was 14.8%, p < 0.05, most innumerate judges and jurors would sense that this disparity was unimportant and that no employer would fine tune its discriminatory activities so closely as to achieve such a meaningless difference.

Third, Browne reminds us that the validity and the interpretation of a p-value turns on the assumption that the statistical model is perfectly specified. Id. at 158-59. His reminder is correct, but again, this aspect of p-values (or confidence intervals) is relatively easy to explain, as well as to defend or challenge. To be sure, there may be legitimate disputes about whether an appropriate model was used (say binomial versus hypergeometric), but such disputes are hardly the most arcane issues that judges and jurors will face.

Fourth, Browne claims that “the alternative hypothesis is seldom properly specified.” Id. at 159-62. Unless analysts are focused on measuring pre-test power or type II error, however, they need not advance an alternative hypothesis. Furthermore, it is hardly a flaw with significance testing that it does not account for systematic bias or confounding.

Browne does not offer an affirmative response such as urging courts to adopt a Bayesian program. A Bayesian response to prevalent blunders in interpreting statistical significance would introduce perhaps even more arcane and hard-to-discern blunders in court proceedings. Browne also leaves courts without a meaningful approach to evaluate random error other than to engage in crude comparisons between two means or proportions. The recommendations in this law review article appear to be a giant step, backwards, into an epistemic void.

1See Browne at 146, citing In re Photochromic Lens Antitrust Litig., 2014 WL 1338605 (M.D. Fla. April 3, 2014) (reversing magistrate judge’s exclusion of an expert witness who had advanced claims based upon p-value of 0.50); id. at 147 n. 116, citing In re High-Tech Employee Antitrust Litig., 2014 WL 1351040 (N.D. Cal. 2014).

Posted in Evidence, Rule 702, statistical evidence | Comments Off on P-Values: Pernicious or Perspicacious?

Statistical Deontology

March 2nd, 2018

In courtrooms across America, there has been a lot of buzzing and palavering about the American Statistical Association’s Statement on Statistical Significance Testing,¹ but very little discussion of the Society’s Ethical Guidelines, which were updated and promulgated in the same year, 2016. Statisticians and statistics, like lawyers and the law, receive their fair share of calumny over their professional activities, but the statistician’s principal North American professional organization is trying to do something about members’ transgressions.

The American Statistical Society (ASA) has promulgated ethical guidelines for statisticians, as has the Royal Statistical Society,² even if these organizations lack the means and procedures to enforce their codes. The ASA’s guidelines ³ are rich with implications for statistical analyses put forward in all contexts, including in litigation and regulatory rule making. As such, the guidelines are well worth studying by lawyers.

The ASA Guidelines were prepared by the Committee on Professional Ethics, and approved by the ASA’s Board in April 2016. There are lots of “thou shall” and “thou shall nots,” but I will focus on the issues that are more likely to arise in litigation. What is remarkable about the Guidelines is that if followed, they probably are more likely to eliminate unsound statistical practices in the courtroom than the ASA State on P-values.

Defining Good Statistical Practice

“Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” Guidelines at 1. The Guidelines thus incorporate something akin to the Kumho Tire standard that an expert witness ‘‘employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.’’ Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999).

A statistician engaged in expert witness testimony should provide “only expert testimony, written work, and oral presentations that he/she would be willing to have peer reviewed.” Guidelines at 2. “The ethical statistician uses methodology and data that are relevant and appropriate, without favoritism or prejudice, and in a manner intended to produce valid, interpretable, and reproducible results.” Id. Similarly, the statistician, if ethical, will identify and mitigate biases, and use analyses “appropriate and valid for the specific question to be addressed, so that results extend beyond the sample to a population relevant to the objectives with minimal error under reasonable assumptions.” Id. If the Guidelines were followed, a lot of spurious analyses would drop off the litigation landscape, regardless whether they used p-values or confidence intervals, or a Bayesian approach.

Integrity of Data and Methods

The ASA’s Guidelines also have a good deal to say about data integrity and statistical methods. In particular, the Guidelines call for candor about limitations in the statistical methods or the integrity of the underlying data:

“The ethical statistician is candid about any known or suspected limitations, defects, or biases in the data that may impact the integrity or reliability of the statistical analysis. Objective and valid interpretation of the results requires that the underlying analysis recognizes and acknowledges the degree of reliability and integrity of the data.”

Guidelines at 3.

The statistical analyst openly acknowledges the limits of statistical inference, the potential sources of error, as well as the statistical and substantive assumptions made in the execution and interpretation of any analysis,” including data editing and imputation. Id. The Guidelines urge analysts to address potential confounding not assessed by the study design. Id. at 3, 10. How often do we see these acknowledgments in litigation-driven analyses, or in peer-reviewed papers, for that matter?

Affirmative Actions Prescribed

In the aid of promoting data and methodological integrity, the Guidelines also urge analysts to share data when appropriate without revealing the identities of study participants. Statistical analysts should publicly correct any disseminated data and analyses in their own work, as well as working to “expose incompetent or corrupt statistical practice.” Of course, the Lawsuit Industry will call this ethical duty “attacking the messenger,” but maybe that’s a rhetorical strategy based upon an assessment of risks versus benefits to the Lawsuit Industry.

Multiplicity

The ASA Guidelines address the impropriety of substantive statistical errors, such as:

“[r]unning multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.”

Guidelines at 9.

There are some Lawsuit Industrialists who have taken comfort in the pronouncements of Kenneth Rothman on corrections for multiple comparisons. Rothman’s views on multiple comparisons are, however, much broader and more nuanced than the Industry’s sound bites.⁴ Given that Rothman opposes anything like strict statistical significance testing, it follows that he is relatively unmoved for the need for adjustments to alpha or the coefficient of confidence. Rothman, however, has never deprecated the need to consider the multiplicity of testing, and the need for researchers to be forthright in disclosing the the scope of comparisons originally planned and actually done.

1 Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 Am. Statistician, (2016). See “The American Statistical Association’s Statement on and of Significance” (March 17, 2016); “The ASA’s Statement on Statistical Significance – Buzzing from the Huckabees” (March 19, 2016).

2 Royal Statistical Society – Code of Conduct (2014); Steven Piantadosi, Clinical Trials: A Methodologic Perspective 609 (2d ed. 2005).

3 Shelley Hurwitz & John S. Gardenier, “Ethical Guidelines for Statistical Practice: The First 60 Years and Beyond,” 66 Am. Statistician 99 (2012) (describing the history and evolution of the Guidelines).

4 Kenneth J. Rothman, “Six Persistent Research Misconceptions,” 29 J. Gen. Intern. Med. 1060, 1063 (2014).

Posted in Data Sharing, Expert Witness Discovery, Scientific Misconduct, statistical evidence, Underlying Data | Comments Off on Statistical Deontology

The 5% Solution at the FDA

February 24th, 2018

The statistics wars rage on¹, with Bayesians attempting to take advantage of the so-called replication crisis to argue it is all the fault of frequentist significance testing. In 2016, there was an attempted coup at the American Statistical Association, but the Bayesians did not get what they wanted, with little more than a consensus that p-values and confidence intervals should be properly interpreted. Patient advocacy groups have lobbied for the availability of unapproved and incompletely tested medications, and rent-seeking litigation has argued and lobbied for the elimination of statistical tests and methods in the assessment of causal claims. The battle continues.

Against this backdrop, a young Harvard graduate student has published a a paper with a brief history of significance testing, and the role that significance testing has taken on at the United States Food and Drug Administration (FDA). Lee Kennedy-Shaffer, “When the Alpha is the Omega: P-Values, ‘Substantial Evidence’, and the 0.05 Standard at FDA,” 72 Food & Drug L.J. 595 (2017) [cited below as K-S]. The paper presents a short but entertaining history of the evolution of the p-value from its early invocation in 1710, by John Arbuthnott, a Scottish physician and mathematician, who calculated the probability that male births would exceed female births 82 consecutive years if their true proportions were equal. K-S at 603. Kennedy-Shaffer notes the role of the two great French mathematicians, Pierre-Simon Laplace and Siméon-Denis Poisson, who used p-values (or their complements) to evaluate empirical propositions. As Kennedy-Shaffer notes, Poisson observed that the equivalent of what would be a modern p-value about 0.005, was sufficient in his view, back in 1830, to believe that the French Revolution of 1830 had caused the pattern of jury verdicts to be changed. K-S at 604.

Kennedy-Shaffer traces the p-value, or its equivalent, through its treatment by the great early 20th century statisticians, Karl Pearson and Ronald A. Fisher, through its modification by Jerzy Neyman and Egon Pearson, into the bowels of the FDA in Rockville, Maryland. It is a history well worth recounting, if for no other reason, to remind us that the p-value or its equivalent has been remarkably durable and reasonably effective in protecting the public against false claims of safety and efficacy. Kennedy-Shaffer provides several good examples in which the FDA’s use of significance testing was outcome dispositive of approval or non-approval of medications and devices.

There is enough substance and history here that everyone will have something to pick at this paper. Let me volunteer the first shot. Kennedy-Shaffer describes the co-evolution of the controlled clinical trial and statistical tests, and points to the landmark study by the Medical Research Council on streptomycin for tuberculosis. Geoffrey Marshall (chairman), “Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation,” 2 Brit. Med. J. 769, 769–71 (1948). This clinical trial was historically important, not only for its results and for Sir Austin Bradford Hill’s role in its design, but for the care with which it described randomization, double blinding, and multiple study sites. Kennedy-Shaffer suggests that “[w]hile results were presented in detail, few formal statistical tests were incorporated into this analysis.” K-S at 597-98. And yet, a few pages later, he tells us that “both chi-squared tests and t-tests were used to evaluate the responses to the drug and compare the control and treated groups,” and that “[t]he difference in mortality between the two groups is statistically significant.” K-S at 611. Although it is true that the authors did not report their calculated p-values for any test, the difference in mortality between the streptomycin and control groups was very large, and the standards for describing the results of such a clinical trial were in their infancy in 1948.

Kennedy-Shaffer’s characterization of Sir Austin Bradford Hill’s use of statistical tests and methods takes on out-size importance because of the mischaracterizations, and even misrepresentations, made by some representatives of the Lawsuit Industry, who contend that Sir Austin dismissed statistical methods as unnecessary. In the United States, some judges have been seriously misled by those misrepresentations, which have their way into published judicial decisions.

The operative document, of course, is the publication of Sir Austin’s famous after-dinner speech, in 1965, on the occasion of his election to the Presidency of the Royal Society of Medicine. Although the speech is casual and free of scholarly footnotes, Sir Austin’s message was precise, balanced, and nuanced. The speech is a classic in the history of medicine, which remains important even if rather dated in terms of its primary message about how science and medicine move from beliefs about associations to knowledge of causal associations. As everyone knows, Sir Austin articulated nine factors or viewpoints through which to assess any putative causal association, but he emphasized that before these nine factors are assessed, our starting point itself has prerequisites:

“Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965) [cited below as Hill]. The starting point, therefore, before the Bradford Hill nine factors come into play, is a “clear-cut” association, which is “beyond what we would care to attribute to the play of chance.”

In other words, consideration of random error is necessary.

Now for the nuance and the balance. Sir Austin acknowledged that there were some situations in which we simply do not need to calculate standard errors because the disparity between treatment and control groups is so large and meaningful. He goes on to wonder out loud:

“whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse the glitter of the t table diverts attention from the inadequacies of the fare.”

Hill at 299. Now this is all true, but hardly the repudiation of statistical testing claimed by those who want to suppress the consideration of random error from science and judicial gatekeeping. There are very few litigation cases in which the difference between exposed and unexposed is “grotesquely obvious,” such that we can leave statistical methods at the door. Importantly, the very large differences between the streptomycin and placebo control groups in the Medical Council’s 1948 clinical trial were not so “grotesquely obvious” that statistical methods were obviated. To be fair, the differences were sufficiently great that statistical discussion could be kept to a minimum. Sir Austin gave extensive tables in the 1948 paper to let the reader appreciate the actual data themselves.

In his after-dinner speech, Hill also gives examples of studies that are so biased and confounded that no statistical method will likely ever save them. Certainly, the technology of regression and propensity-score analyses have progressed tremendously since Hill’s 1965 speech, but his point still remains. This point hardly excuses the lack of statistical apparatus in highly confounding or biased observations.

In addressing the nine factors he identified, which presumed a “clear-cut” association, with random error ruled out, Sir Austin did opine that for the factors raised questions and that:

“No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.”

Hill at 299. Again, the date and the context are important. Hill is addressing consideration of the nine factors, not the required predicate association beyond the play of chance or random error. The date is important as well, because it would be foolish to suggest that statistical methods have not grown in the last half century to address some of the nine factors. The existence and the nature of dose-response are the subject of extensive statistical methods, and meta-analysis and meta-regression are used to assess and measure consistency between studies.

Kennedy-Shaffer might well have pointed out the great influence Sir Austin’s textbook on medical statistics had had on medical research and practice. This textbook, which went through numerous editions, makes clear the importance of statistical testing and methods:

“Are simple methods of the interpretation of figures only a synonym for common sense or do they involve an art or knowledge which can be imparted? Familiarity with medical statistics leads inevitably to the conclusion that common sense is not enough. Mistakes which when pointed out look extremely foolish are quite frequently made by intelligent persons, and the same mistakes, or types of mistakes, crop up again and again. There is often lacking what has been called a ‘statistical tact, which is rather more than simple good sense’. That fact the majority of persons must acquire (with a minority it is undoubtedly innate) by a study of the basic principles of statistical method.”

Austin Bradford Hill, Principles of Medical Statistics at 2 (4th ed. 1948) (emphasis in original). And later in his text, Sir Austin notes that:

“The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.”

Id. at 10 (emphasis added).

Sir Austin’s work taken as a whole demonstrates the acceptance of the necessity of statistical methods in medicine, and causal inference. Kennedy-Shaffer’s paper covers much ground, but it short changes this important line of influence, which lies directly in the historical path between Sir Ronald Fisher and the medical regulatory community.

Kennedy-Shaffer gives a nod to Bayesian methods, and even suggests that Bayesian results are “more intuitive,” but he does not explain the supposed intuitiveness of how a parameter has a probability distribution. This might make sense at the level of quantum physics, but does not seem to describe the reality of a biomedical phenomenon such as relative risk. Kennedy-Shaffer notes the FDA’s expression of willingness to entertain Bayesian analyses of clinical trials, and the rare instances in which such analyses have actually been deployed. K-S at 629 (“e.g., Pravigard Pac for prevention of myocardial infarction”). He concedes, however, that Bayesian designs are still the exception to the rule, as well as the cautions of Robert Temple, a former FDA Director of Medical Policy, in 2005, that Bayesian proposals for drug clinical trials were at that time “very rare.²” K-S at 630.

1 Deborah Mayo, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018).

2 Robert Temple, “How FDA Currently Makes Decisions on Clinical Studies,” 2 Clinical Trials 276, 281 (2005).

Posted in statistical evidence | Comments Off on The 5% Solution at the FDA

Scientific Evidence in Canadian Courts

February 20th, 2018

A couple of years ago, Deborah Mayo called my attention to the Canadian version of the Reference Manual on Scientific Evidence.¹ In the course of discussion of mistaken definitions and uses of p-values, confidence intervals, and significance testing, Sander Greenland pointed to some dubious pronouncements in the Science Manual for Canadian Judges [Manual].

Unlike the United States federal court Reference Manual, which is published through a joint effort of the National Academies of Science, Engineering, and Medicine, the Canadian version, is the product of the Canadian National Judicial Institute (NJI, or the Institut National de la Magistrature, if you live in Quebec), which claims to be an independent, not-for-profit group, that is committed to educating Canadian judges. In addition to the Manual, the Institute publishes Model Jury Instructions and a guide, Problem Solving in Canada’s Courtrooms: A Guide to Therapeutic Justice (2d ed.), as well as conducting educational courses.

The NJI’s website describes the Instute’s Manual as follows:

“Without the proper tools, the justice system can be vulnerable to unreliable expert scientific evidence.

* * *

The goal of the Science Manual is to provide judges with tools to better understand expert evidence and to assess the validity of purportedly scientific evidence presented to them. …”

The Chief Justice of Canada, Hon. Beverley M. McLachlin, contributed an introduction to the Manual, which was notable for its frank admission that:

“[w]ithout the proper tools, the justice system is vulnerable to unreliable expert scientific evidence.

****

Within the increasingly science-rich culture of the courtroom, the judiciary needs to discern ‘good’ science from ‘bad’ science, in order to assess expert evidence effectively and establish a proper threshold for admissibility. Judicial education in science, the scientific method, and technology is essential to ensure that judges are capable of dealing with scientific evidence, and to counterbalance the discomfort of jurists confronted with this specific subject matter.”

Manual at 14. These are laudable goals, indeed, but did the National Judicial Institute live up to its stated goals, or did it leave Canadian judges vulnerable to the Institute’s own “bad science”?

In his comments on Deborah Mayo’s blog, Greenland noted some rather cavalier statements in Chapter two that suggest that the conventional alpha of 5% corresponds to a “scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it.” And he, pointed elsewhere where the chapter seems to suggest that the coefficient of confidence that corresponds to an alpha of 5% “constitutes a rather high standard of proof,” thus confusing and conflating probability of random error with posterior probabilities. Greenland is absolutely correct that the Manual does a rather miserable job of educating Canadian judges if our standard for its work product is accuracy and truth.

Some of the most egregious errors are within what is perhaps the most important chapter of the Manual, Chapter 2, “Science and the Scientific Method.” The chapter has two authors, a scientist, Scott Findlay, and a lawyer, Nathalie Chalifour. Findlay is an Associate Professor, in the Department of Biology, of the University of Ottawa. Nathalie Chalifour is an Associate Professor on the Faculty of Law, also in the University of Ottawa. Together, they produced some dubious pronouncements, such as:

Weight of the Evidence (WOE)

“First, the concept of weight of evidence in science is similar in many respects to its legal counterpart. In both settings, the outcome of a weight-of-evidence assessment by the trier of fact is a binary decision.”

Manual at 40. Findlay and Chalifour cite no support for their characterization of WOE in science. Most attempts to invoke WOE are woefully vague and amorphous, with no meaningful guidance or content.² Sixty-five pages later, if any one is noticing, the authors let us in a dirty, little secret:

“at present, there exists no established prescriptive methodology for weight of evidence assessment in science.”

Manual at 105. The authors omit, however, that there are prescriptive methods for inferring causation in science; you just will not see them in discussions of weight of the evidence. The authors then compound the semantic and conceptual problems by stating that “in a civil proceeding, if the evidence adduced by the plaintiff is weightier than that brought forth by the defendant, a judge is obliged to find in favour of the plaintiff.” Manual at 41. This is a remarkable suggestion, which implies that if the plaintiff adduces the crummiest crumb of evidence, a mere peppercorn on the scales of justice, but the defendant has none to offer, that the plaintiff must win. The plaintiff wins notwithstanding that no reasonable person could believe that the plaintiff’s claims are more likely than not true. Even if there were the law of Canada, it is certainly not how scientists think about establishing the truth of empirical propositions.

Confusion of Hypothesis Testing with “Beyond a Reasonable Doubt”

The authors’ next assault comes in conflating significance probability with the probability connected with the burden of proof, a posterior probability. Legal proceedings have a defined burden of proof, with criminal cases requiring the state to prove guilt “beyond a reasonable doubt.” Findlay and Chalifour’s discussion then runs off the rails by likening hypothesis testing, with an alpha of 5% or its complement, 95%, as a coefficient of confidence, to a “very high” burden of proof:

“In statistical hypothesis-testing – one of the tools commonly employed by scientists – the predisposition is that there is a particular hypothesis (the null hypothesis) that is assumed to be true unless sufficient evidence is adduced to overturn it. But in statistical hypothesis-testing, the standard of proof has traditionally been set very high such that, in general, scientists will only (provisionally) reject the null hypothesis if they are at least 95% sure it is false. Third, in both scientific and legal proceedings, the setting of the predisposition and the associated standard of proof are purely normative decisions, based ultimately on the perceived consequences of an error in inference.”

Manual at 41. This is, as Greenland and many others have pointed out, a totally bogus conception of hypothesis testing, and an utterly false description of the probabilities involved.

Later in the chapter, Findlay and Chalifour flirt with the truth, but then lapse into an unrecognizable parody of it:

“Inferential statistics adopt the frequentist view of probability whereby a proposition is either true or false, and the task at hand is to estimate the probability of getting results as discrepant or more discrepant than those observed, given the null hypothesis. Thus, in statistical hypothesis testing, the usual inferred conclusion is either that the null is true (or rather, that we have insufficient evidence to reject it) or it is false (in which case we reject it). 16 The decision to reject or not is based on the value of p if the estimated value of p is below some threshold value a, we reject the null; otherwise we accept it.”

Manual at 74. OK; so far so good, but here comes the train wreck:

“By convention (and by convention only), scientists tend to set α = 0.05; this corresponds to the collective – and, one assumes, consensual – scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it. It is partly because of this that scientists have the reputation of being a notoriously conservative lot, given that a 95% threshold constitutes a rather high standard of proof.”

Manual at 75. Uggh; so we are back to significance probability’s being a posterior probability. As if to atone for their sins, in the very next paragraph, the authors then remind the judicial readers that:

“As noted above, p is the probability of obtaining results at least as discrepant as those observed if the null is true. This is not the same as the probability of the null hypothesis being true, given the results.”

Manual at 75. True, true, and completely at odds with what the authors have stated previously. And to add to the reader’s now fully justified conclusion, the authors describe the standard for rejecting the null hypothesis as “very high indeed.” Manual at 102, 109. Any reader who is following the discussion might wonder how and why there is such a problem of replication and reproducibility in contemporary science.

Conflating Bayesianism with Frequentist Modes of Inference

We have seen how Findlay and Chalifour conflate significance and posterior probabilities, some of the time. In a section of their chapter that deals explicitly with probability, the authors tell us that before any study is conducted the prior probability of the truth of the tested hypothesis is 50%, sans evidence. This an astonishing creation of certainty out nothingness, and perhaps it explains the authors’ implied claim that the crummiest morsel of evidence on one side is sufficient to compel a verdict, if the other side has no morsels at all. Here is how the authors put their claim to the Canadian judges:

“Before each study is conducted (that is, a priori), the hypothesis is as likely to be true as it is to be false. Once the results are in, we can ask: How likely is it now that the hypothesis is true? In the first study, the low a priori inferential strength of the study design means that this probability will not be much different from the a priori value of 0.5 because any result will be rather equivocal owing to limitations in the experimental design.”

Manual at 64. This implied Bayesian slant, with 50% priors, in the world of science would lead anyone to believe “as many as six impossible things before breakfast,” and many more throughout the day.

Lest you think that the Manual is all rubbish, there are occasional gems of advice to the Canadian judges. The authors admonish the judges to

“be wary of individual ‘statistically significant’ results that are mined from comparatively large numbers of trials or experiments, as the results may be ‘cherry picked’ from a larger set of experiments or studies that yielded mostly negative results. The court might ask the expert how many other trials or experiments testing the same hypothesis he or she is aware of, and to describe the outcome of those studies.”

Manual at 87. Good advice, but at odds with the authors’ characterization of statistical significance as establishing the rejection of the null hypothesis well-nigh beyond a reasonable doubt.

When Greenland first called attention to this Manual, I reached to some people who had been involved in its peer review. One reviewer told me that it was a “living document,” and would likely be revised after he had the chance to call the NJI’s attention to the errors. But two years later, the errors remain, and so we have to infer that the authors meant to say all the contradictory and false statements that are still present in the downloadable version of the Manual.

1 See “‘Statistical Significance’ According to the U.S. Dept. of Health and Human Services (ii),” Error Statistics Philosophy (July 17, 2015). At the time, I wrote briefly about the Canadian Manual. See “Canadian Judges’ Reference Manual on Scientific Evidence” (July 24, 2015).

2 See “WOE-fully Inadequate Methodology – An Ipse Dixit By Another Name” (May 1, 2012); “Weight of the Evidence in Science and in Law” (July 29, 2017); see also David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27 (2013).

Posted in Reference Manual on Scientific Evidence, Scientific Evidence, statistical evidence | Comments Off on Scientific Evidence in Canadian Courts

Ninth Circuit Quashes Harkonen’s Last Chance

January 8th, 2018

With the benefit of hindsight, even the biggest whopper can be characterized as a strategic choice for trial counsel. As are result of this sort of thinking, the convicted have a very difficult time in pressing claims of ineffective assistance of counsel. After the fact, a reviewing or an appellate court can always imagine a strategic reason for trial counsel’s decisions, even if they contributed to the client’s conviction.

In the Harkonen case, a pharmaceutical executive was indicted and tried for wire fraud and misbranding. His crime was to send out a fax with a preliminary assessment of a recently unblinded clinical trial. In his fax, Dr Harkonen described the trial’s results as “demonstrating” a survival benefit in study participants with mild and moderate disease. Survival (or mortality) was not a primary outcome of the trial, but it was a secondary outcome, and arguably the most important one of all. The subgroup of “mild and moderate” was not pre-specified, but it was highly plausible.

Clearly, Harkonen’s post hoc analysis would not be sufficient normally to persuade the FDA to approve a medication, but Harkonen did not assert or predict that the company would obtain FDA approval. He simply claimed that the trial “demonstrated” a benefit. A charitable interpretation of his statement, which was several pages long, would include the prior successful clinical trial, as important context for Harkonen’s statement.

The United States government, however, was not interested in the principle of charity, the context, or even its own pronouncements on the issue of statistical significance. Instead, the United States Attorney pushed for draconian sentences under the Wire Fraud Act, and the misbranding sections of the Food, Drug, and Cosmetics Act. A jury acquitted on the misbranding charge, but convicted on wire fraud. The government’s request for an extreme prison term and fines was rebuffed by the trial court, which imposed a term of six months of house arrest, and a small fine.¹ The conviction, however, effectively keeps Dr Harkonen from working again in the pharmaceutical industry.

In post-verdict challenges to the conviction, Harkonen’s lawyers were able to marshal support from several well-renown statisticians and epidemiologists, but the trial court was reluctant to consider these post-verdict opinions when the defense called no expert witness at trial. The trial situation, however, was complicated and confused by the government’s pre-trial position that it would not call expert witnesses on the statistical and clinical trial interpretative issues. Contrary to these representations, the government called Dr Thomas Fleming, as statistician, who testified at some length, and without objection, to strict criteria for assessing statistical significance and causation in clinical trials.

Having read Fleming’s testimony, I can say that the government got away with introducing a great deal of expert witness opinion testimony, without effective contradiction or impeachment. With the benefit of hindsight, the defense decision not to call an expert witness looks like a serious deviation from the standard of care. Fleming’s “facts” about how the FDA would evaluate the success or failure of the clinical trial were not relevant to whether Harkonen’s claim of a demonstrated benefit were true or false. More importantly, Harkonen’s claim involved an inference, which is not a fact, but an opinion. Fleming’s contrary opinion really did not turn Harkonen’s claim into a falsehood. A contrary rule would have many expert witnesses in civil and in criminal litigation behind bars on similar charges of wire or mail fraud.

After Harkonen exhausted his direct appeals,² he petitioned for a writ of coram nobis. The trial court denied the petition,³ and in a non-precedential opinion [sic], the Ninth Circuit affirmed the denial of coram nobis.⁴ United States v. Harkonen, slip op., No. 15-16844 (9th Cir., Dec. 4, 2017) [cited below as Harkonen].

The Circuit rejected Harkonen’s contention that the Supreme Court had announced a new rule with respect to statistical significance, in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27 (2011), which change in law required that his conviction be vacated. Harkonen’s lawyer, like much of the plaintiffs’ tort bar, oversold the Supreme Court’s comments about statistical significance, which were at best dicta, and not very well considered or supported dicta, at that. Still, there was an obvious tension, and duplicity, between positions that the government, through the Solicitor General’s office, had taken in Siracusano, and positions the government took in the Harkonen case.⁵ Given the government’s opportunistic double-faced arguments about statistical significance, the Ninth Circuit held that Harkonen’s proffered evidence was “compelling, especially in light of Matrixx,” but the panel concluded that his conviction was not the result of a “manifest injustice” that requires the issuance of the writ of coram nobis. Harkonen at 2 (emphasis added). Apparently, Harkonen had suffered an injustice of a less obvious and blatant variety, which did not rise to the level of manifest injustice.

The Ninth Circuit gave similarly short shrift to Harkonen’s challenge to the competency of his counsel. His trial lawyers had averred that they thought that they were doing well enough not to risk putting on an expert witness, especially given that the defense’s view of the evidence came out in the testimony of the government’s witnesses. The Circuit thus acquiesced in the view that both sides had chosen to forgo expert witness testimony, and overlooked the defense’s competency issue for not having objected to Fleming’s opinion trial testimony. Harkonen at 2-4. Remarkably, the appellate court did not look at how Fleming was allowed to testify on statistical issues, without being challenged on cross-examination.

1 See “The (Clinical) Trial by Franz Kafka”; “Further Musings on U.S. v. Harkonen”;

“Harkonen’s Appeal Updated.” See also Patti Zettler, “U.S. v. Harkonen: Should Scientists Worry About Being Prosecuted for How They Interpret Their Research Results?” Stanford Center for Law & the Biosciences (Oct. 7, 2013); William M. Briggs, “Use The Wrong P-value, Go To Jail: Not A Joke: Updated With Amicus Brief,” Statistician to the Stars (Oct. 1, 2013).

2 United States v. Harkonen, 510 F. App’x 633, 638 (9th Cir. 2013), cert. denied, 134 S. Ct. 824 (2013).

3 “District Court Denies Writ of Coram Nobis to Dr Harkonen.”

4 Dave Simpson, “9th Circuit Refuses To Rethink Ex-InterMune CEO’s Conviction,” Law360 (Dec. 5, 2017).

5 “Multiplicity versus Duplicity – The Harkonen Conviction”; “The Matrixx Motion in U.S. v. Harkonen.”

Posted in Causation, Scientific Evidence, statistical evidence | Comments Off on Ninth Circuit Quashes Harkonen’s Last Chance

Failed Gatekeeping in Ambrosini v. Labarraque (1996)

December 28th, 2017

The Ambrosini case straddled the Supreme Court’s 1993 Daubert decision. The case began before the Supreme Court clarified the federal standard for expert witness gatekeeping, and ended in the Court of Appeals for the District of Columbia, after the high court adopted the curious notion that scientific claims should be based upon reliable evidence and valid inferences. That notion has only slowly and inconsistently trickled down to the lower courts.

Given that Ambrosini was litigated in the District of Columbia, where the docket is dominated by regulatory controversies, frequently involving dubious scientific claims, no one should be surprised that the D.C. Court of Appeals did not see that the Supreme Court had read “an exacting standard” into Federal Rule of Evidence 702. And so, we see, in Ambrosini, this Court of Appeals citing and purportedly applying its own pre-Daubert decision in Ferebee v. Chevron Chem. Co., 552 F. Supp. 1297 (D.D.C. 1982), aff’d, 736 F.2d 1529 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).¹ In 2000, the Federal Rule of Evidence 702 was revised in a way that extinguishes the precedential value of Ambrosini and the broad dicta of Ferebee, but some courts and commentators have failed to stay abreast of the law.

Escolastica Ambrosini was using a synthetic progestin birth control, Depo-Provera, as well as an anti-nausea medication, Bendectin, when she became pregnant. The child that resulted from this pregnancy, Teresa Ambrosini, was born with malformations of her face, eyes, and ears, cleft lip and palate, and vetebral malformations. About three percent of all live births in the United States have a major malformation. Perhaps because the Divine Being has sovereign immunity, Escolastica sued the manufacturers of Bendectin and Depo-Provera, as well as the prescribing physician.

The causal claims were controversial when made, and they still are. The progestin at issue, medroxyprogesterone acetate (MPA), was embryotoxic in the cynomolgus monkey², but not in the baboon³. The evidence in humans was equivocal at best, and involved mostly genital malformations⁴; the epidemiologic evidence for the MPA causal claim to this day remains unconvincing⁵.

At the close of discovery in Ambrosini, Upjohn (the manufacturer of the progestin) moved for summary judgment, with a supporting affidavit of a physician and geneticist, Dr. Joe Leigh Simpson. In his affidavit, Simpson discussed three epidemiologic studies, as well as other published papers, in support of his opinion that the progestin at issue did not cause the types of birth defects manifested by Teresa Ambrosini.

Ambrosini had disclosed two expert witnesses, Dr. Allen S. Goldman and Dr. Brian Strom. Neither Goldman nor Strom bothered to identify the papers, studies, data, or methodology used in arriving at an opinion on causation. Not surprisingly, the district judge was unimpressed with their opposition, and granted summary judgment for the defendant. Ambrosini v. Labarraque, 966 F.2d 1462, 1466 (D.C. Cir. 1992).

The plaintiffs appealed on the remarkable ground that Goldman’s and Strom’s crypto-evidence satisfied Federal Rule of Evidence 703. Even more remarkably, the Circuit, in a strikingly unscholarly opinion by Judge Mikva, opined that disclosure of relied-upon studies was not required for expert witnesses under Rules 703 and 705. Judge Mikva seemed to forget that the opinions being challenged were not given in testimony, but in (late-filed) affidavits that had to satisfy the requirement of Federal Rule of Civil Procedure 26. Id. at 1468-69. At trial, an expert witness may express an opinion without identifying its bases, but of course the adverse party may compel disclosure of those bases. In discovery, the proffered expert witness must supply all opinions and evidence relied upon in reach the opinions. In any event, the Circuit remanded the case for a hearing and further proceedings, at which the two challenged expert witnesses, Goldman and Strom, would have to identify the bases of their opinions. Id. at 1471.

Not long after the case landed back in the district court, the Supreme Court decided Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). With an order to produce entered, plaintiffs’ counsel could no longer hide Goldman and Strom’s evidentiary bases, and their scientific inferences came under judicial scrutiny.

Upjohn moved again to exclude Goldman and Strom’s opinions. The district court upheld Upjohn’s challenges, and granted summary judgment in favor of Upjohn for the second time. The Ambrosinis appealed again, but the second case in the D.C. Circuit resulted in a split decision, with the majority holding that the exclusion of Goldman and Strom’s opinions under Rule 702 was erroneous. Ambrosini v. Labarraque, 101 F.3d 129 (D.C. Cir. 1996).

Although issued two decades ago, the majority’s opinion remains noteworthy as an example of judicial resistance to the existence and meaning of the Supreme Court’s Daubert opinion. The majority opinion uncritically cited the notorious Ferebee⁶ and other pre-Daubert decisions. The court embraced the Daubert dictum about gatekeeping being limited to methodologic consideration, and then proceeded to interpret methodology as superficially as necessary to sustain admissibility. If an expert witness claimed to have looked at epidemiologic studies, and epidemiology was an accepted methodology, then the opinion of the expert witness must satisfy the legal requirements of Daubert, or so it would seem from the opinion of the U.S. Court of Appeals for the District of Columbia.

Despite the majority’s hand waving, a careful reader will discern that there must have been substantial gaps and omissions in the explanations and evidence cited by plaintiffs’ expert witnesses. Seeing anything clearly in the Circuit’s opinion is made difficult, however, by careless and imprecise language, such as its descriptions of studies as showing, or not showing “causation,” when it could have meant only that such studies showed associations, with more or less random and systematic error.

Dr. Strom’s report addressed only general causation, and even so, he apparently did not address general causation of the specific malformations manifested by the plaintiffs’ child. Strom claimed to have relied upon the “totality of the data,” but his methodologic approach seems to have required him to dismiss studies that failed to show an association.

“Dr. Strom first set forth the reasoning he employed that led him to disagree with those studies finding no causal relationship [sic] between progestins and birth defects like Teresa’s. He explained that an epidemiologist evaluates studies based on their ‘statistical power’. Statistical power, he continued, represents the ability of a study, based on its sample size, to detect a causal relationship. Conventionally, in order to be considered meaningful, negative studies, that is, those which allege the absence of a causal relationship, must have at least an 80 to 90 percent chance of detecting a causal link if such a link exists; otherwise, the studies cannot be considered conclusive. Based on sample sizes too small to be reliable, the negative studies at issue, Dr. Strom explained, lacked sufficient statistical power to be considered conclusive.”

Id. at 136⁷.

Putting aside the problem of suggesting that an observational study detects a “causal relationship,” as opposed to an association in need of further causal evaluation, the Court’s précis of Strom’s testimony on power is troublesome, and typical of how other courts have misunderstood and misapplied the concept of statistical power. Statistical power is a probability of observing an association of a specified size at a specified level of statistical significance. The calculation of statistical power turns indeed on sample size, the level of significance probability preselected for “statistical significance, an assumed probability distribution of the sample, and, critically, an alternative hypothesis. Without a specified alternative hypothesis, the notion of statistical power is meaningless, regardless of what probability (80% or 90% or some other percentage) is sought for finding the alternative hypothesis. Furthermore, the notion that the defense must adduce studies with “sufficient statistical power to be considered conclusive” creates an unscientific standard that can never be met, while subverting the law’s requirement that the claimant establish causation.

The suggestion that the studies that failed to find an association cannot be considered conclusive because they “lacked sufficient statistical power” is troublesome because it distorts and misapplies the very notion of statistical power. No attempt was made to describe the confidence intervals surrounding the point estimates of the null studies; nor was there any discussion whether the studies could be aggregated to increase their power to rule out meaningful associations.

The Circuit court’s scientific jurisprudence was thus seriously flawed. Without a discussion of the end points observed, the relevant point estimates of risk ratios, and the confidence intervals, the reader cannot assess the strength of the claims made by Goldman and Strom, or by defense expert Simpson, in their reports. Without identifying the study endpoints, the reader cannot evaluate whether the plaintiffs’ expert witnesses relied upon relevant outcomes in formulating their opinions. The court viewed the subject matter from 30,000 feet, passing over at 600 mph, without engagement or care. A strong dissent, however, suggested serious mischaracterizations of the plaintiffs’ evidence by the majority.

The only specific causation testimony to support plaintiff’s claims came from Goldman, in what appears to have been a “differential etiology.” Goldman purported to rule out a genetic cause, even though he had not conducted a critical family history or ordered a state-of-the-art chromosomal study. Id. at 140. Of course, nothing in a differential etiology approach would allow a physician to rule out “unknown” causes, which, for birth defects, make up the most prevalent and likely causes to explain any particular case. The majority acknowledged that these were short comings, but rhetorically characterized them as substantive, not methodologic, and therefore as issues for cross-examination, not for consideration by a judicial gatekeeping. All this is magical thinking, but it continues to infect judicial approaches to specific causation. See, e.g., Green Mountain Chrysler Plymouth Dodge Jeep v. Crombie, 508 F. Supp. 2d 295, 311 (D.Vt. 2007) (citing Ambrosini for the proposition that “the possibility of uneliminated causes goes to weight rather than admissibility, provided that the expert has considered and reasonably ruled out the most obvious”). In Ambrosini, however, Dr. Goldman had not ruled out much of anything.

Circuit Judge Karen LeCraft Henderson dissented in a short, but pointed opinion that carefully marshaled the record evidence. Drs. Goldman and Strom had relied upon a study by Greenberg and Matsunaga, whose data failed to show a statistically significant association between MPA and cleft lip and palate, when the crucial issue of timing of exposure was taken into consideration. Ambrosini, 101 F.3d at 142.

Beyond the specific claims and evidence, Judge Henderson anticipated the subsequent Supreme Court decisions in Joiner, Kumho Tire, and Weisgram, and the year 2000 revision of Rule 702, in noting that the majority’s acceptance of glib claims to have used a “traditional methodology” would render Daubert nugatory. Id. at 143-45 (characterizing Strom and Goldman’s methodologies as “wispish”). Even more importantly, Judge Henderson refused to indulge the assumption that somehow the length of Goldman’s C.V. substituted for evidence that his methods satisfied the legal (or scientific) standard of reliability. Id.

The good news is that little or nothing in Ambrosini survives the 2000 amendment to Rule 702. The bad news is that not all federal judges seem to have noticed, and that some commentators continue to cite the case, as lovely.

Probably no commentator has promiscuously embraced Ambrosini as warmly as Carl Cranor, a philosopher, and occasional expert witness for the lawsuit industry, in several publications and presentations.⁸ Cranor has been particularly enthusiastic about Ambrosini’s approval of expert witness’s testimony that failed to address “the relative risk between exposed and unexposed populations of cleft lip and palate, or any other of the birth defects from which [the child] suffers,” as well as differential etiologies that exclude nothing.⁹ Somehow Cranor, as did the majority in Ambrosini, believes that testimony that fails to identify the magnitude of the point estimate of relative risk can “assist the trier of fact to understand the evidence or to determine a fact in issue.”¹⁰ Of course, without that magnitude given, the trier of fact could not evaluate the strength of the alleged association; nor could the trier assess the probability of individual causation to the plaintiff. Cranor also has written approvingly of lumping unrelated end points, which defeats the assessment of biological plausibility and coherence by the trier of fact. When the defense expert witness in Ambrosini adverted to the point estimates for relevant end points, the majority, with Cranor’s approval, rejected the null findings as “too small to be significant.”¹¹ If the null studies were, in fact, too small to be useful tests of the plaintiffs’ claims, intellectual and scientific honesty required an acknowledgement that the evidentiary display was not one from which a reasonable scientist would draw a causal conclusion.

1Ambrosini v. Labarraque, 101 F.3d 129, 138-39 (D.C. Cir. 1996) (citing and applying Ferebee), cert. dismissed sub nom. Upjohn Co. v. Ambrosini, 117 S.Ct. 1572 (1997) See also David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89Notre Dame L. Rev. 27, 31 (2013).

2 S. Prahalada, E. Carroad, M. Cukierski, and A.G. Hendrickx, “Embryotoxicity of a single dose of medroxyprogesterone acetate (MPA) and maternal serum MPA concentrations in cynomolgus monkey (Macaca fascicularis),” 32 Teratology 421 (1985).

3 S. Prahalada, E. Carroad, and A.G. Hendrick, “Embryotoxicity and maternal serum concentrations of medroxyprogesterone acetate (MPA) in baboons (Papio cynocephalus),” 32 Contraception 497 (1985).

4 See, e.g., Z. Katz, M. Lancet, J. Skornik, J. Chemke, B.M. Mogilner, and M. Klinberg, “Teratogenicity of progestogens given during the first trimester of pregnancy,” 65 Obstet Gynecol. 775 (1985); J.L. Yovich, S.R. Turner, and R. Draper, “Medroxyprogesterone acetate therapy in early pregnancy has no apparent fetal effects,” 38 Teratology 135 (1988).

5 G. Saccone, C. Schoen, J.M. Franasiak, R.T. Scott, and V. Berghella, “Supplementation with progestogens in the first trimester of pregnancy to prevent miscarriage in women with unexplained recurrent miscarriage: a systematic review and meta-analysis of randomized, controlled trials,” 107 Fertil. Steril. 430 (2017).

6 Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1535 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).

7 Dr. Strom was also quoted as having provided a misleading definition of statistical significance: “whether there is a statistically significant finding at greater than 95 percent chance that it’s not due to random error.” Ambrosini at 101 F.3d at 136. Given the majority’s inadequate description of the record, the description of witness testimony may not be accurate, and error cannot properly be allocated.

8 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 320, 327-28 (2006); see also Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 238 (2d ed. 2016).

9 Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 320 (2006).

10 Id.

11 Id. ; see also Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 238 (2d ed. 2016).

Posted in Causation, Expert Witness Discovery, Rule 702, Rule 703, Scientific Evidence, statistical evidence, Statistical Power | Comments Off on Failed Gatekeeping in Ambrosini v. Labarraque (1996)