TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

N.J. Supreme Court Uproots Weeds in Garden State’s Law of Expert Witnesses

August 8th, 2018

The United States Supreme Court’s decision in Daubert is now over 25 years old. The idea of judicial gatekeeping of expert witness opinion testimony is even older in New Jersey state courts. The New Jersey Supreme Court articulated a reliability standard before the Daubert case was even argued in Washington, D.C. See Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991). Articulating a standard, however, is something very different from following a standard, and in many New Jersey trial courts, until very recently, the standard was pretty much anything goes.

One counter-example to the general rule of dog-eat-dog in New Jersey was Judge Nelson Johnson’s careful review and analysis of the proffered causation opinions in cases in which plaintiffs claimed that their use of the anti-acne medication isotretinoin (Accutane) caused Crohn’s disease. Judge Johnson, who sits in the Law Division of the New Jersey Superior Court for Atlantic County, held a lengthy hearing and reviewed the expert witnesses’ reliance materials.1 Judge Johnson found that the plaintiffs’ expert witnesses had employed undue selectivity in choosing what to rely upon. Perhaps even more concerning, Judge Johnson found that these witnesses had refused to rely upon reasonably well-conducted epidemiologic studies, while embracing unpublished, incomplete, and poorly conducted studies and anecdotal evidence. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div., Atlantic Cty. Feb. 20, 2015). In response, Judge Johnson politely but firmly closed the gate to conclusion-driven, duplicitous expert witness causation opinions in over 2,000 personal injury cases. See “Johnson of Accutane – Keeping the Gate in the Garden State” (Mar. 28, 2015).

Aside from resolving over 2,000 pending cases, Judge Johnson’s judgment was of intense interest to all who are involved in pharmaceutical and other products liability litigation. Judge Johnson had conducted a pretrial hearing, sometimes called a Kemp hearing in New Jersey, after the New Jersey Supreme Court’s opinion in Kemp v. The State of New Jersey, 174 N.J. 412 (2002). At the hearing and in his opinion that excluded plaintiffs’ expert witnesses’ causation opinions, Judge Johnson demonstrated a remarkable aptitude for analyzing data and inferences in the gatekeeping process.

When the courtroom din quieted, the trial court ruled that the proffered testimony of Dr. Arthur Kornbluth and Dr. David Madigan did not meet the liberal New Jersey test for admissibility. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div., Atlantic Cty. Feb. 20, 2015). And in closing the gate, Judge Johnson protected the judicial process from several bogus and misleading “lines of evidence,” which have become standard ploys to mislead juries in courthouses where the gatekeepers are asleep. Recognizing that not all evidence is on the same analytical plane, Judge Johnson gave case reports short shrift.

[u]nsystematic clinical observations or case reports and adverse event reports are at the bottom of the evidence hierarchy.”

Id. at *16. Adverse event reports, largely driven by the very litigation in his courtroom, received little credit and were labeled as “not evidentiary in a court of law.” Id. at *14 (quoting the FDA’s description of FAERS).

Judge Johnson recognized that there was a wide range of identified “risk factors” for inflammatory bowel disease, such as prior appendectomy, breast-feeding as an infant, stress, Vitamin D deficiency, tobacco or alcohol use, refined sugars, dietary animal fat, and fast food. In re Accutane, 2015 WL 753674, at *9. The court also noted that there were four medications generally acknowledged to be potential risk factors for inflammatory bowel disease: aspirin, nonsteroidal anti-inflammatory medications (NSAIDs), oral contraceptives, and antibiotics. Understandably, Judge Johnson was concerned that the plaintiffs’ expert witnesses preferred studies unadjusted for potential confounding co-variables and studies that had involved “cherry picking the subjects.” Id. at *18.

Judge Johnson had found that both sides in the isotretinoin cases conceded the relative unimportance of animal studies, but the plaintiffs’ expert witnesses nonetheless invoked the animal studies in the face of the artificial absence of epidemiologic studies that had been created by their cherry-picking strategies. Id.

Plaintiffs’ expert witnesses had reprised a common claimants’ strategy; namely, they claimed that all the epidemiology studies lacked statistical power. Their arguments often ignored that statistical power calculations depend upon the chosen level of statistical significance, a concept to which many plaintiffs’ counsel have virulent antibodies, as well as upon an arbitrarily selected alternative hypothesis about the size of the association. Furthermore, the plaintiffs’ arguments ignored the actual point estimates, most of which were favorable to the defense, and the observed confidence intervals, most of which were reasonably narrow.
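
The point about power calculations can be made concrete. Below is a minimal sketch, with wholly hypothetical numbers (nothing from the Accutane record), showing how the calculated power of the very same study swings with the significance level and with the alternative hypothesis that the analyst chooses to posit:

```python
# A minimal sketch, with hypothetical numbers, of how statistical power
# depends upon the chosen significance level and the posited alternative.
import math
from scipy.stats import norm

def power_two_proportions(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of proportions."""
    se = math.sqrt(p0 * (1 - p0) / n_per_group + p1 * (1 - p1) / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)
    z_effect = abs(p1 - p0) / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# The same study is "underpowered" or "well powered" depending upon the
# alternative hypothesis selected; baseline risk 1%, 500 per group:
for p_alt in (0.02, 0.03, 0.05):
    print(p_alt, round(power_two_proportions(0.01, p_alt, 500), 2))
```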

The defense responded to these bogus statistical arguments by presenting an extremely capable clinical and statistical expert witness, Dr. Stephen Goodman. Meta-analysis has become an important facet of pharmaceutical and other products liability litigation. Fortunately for Judge Johnson, Dr. Goodman was able to explain meta-analysis generally, and to present two meta-analyses that he had performed on isotretinoin and inflammatory bowel outcomes.

Dr. Goodman explained that the plaintiffs’ witnesses’ failure to perform a meta-analysis was telling, given that meta-analysis can obviate the plaintiffs’ hyperbolic statistical complaints:

the strength of the meta-analysis is that no one feature, no one study, is determinant. You don’t throw out evidence except when you absolutely have to.”

In re Accutane, 2015 WL 753674, at *8.
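
For readers unfamiliar with the technique, the following Python sketch illustrates the basic inverse-variance (fixed-effect) pooling that underlies a meta-analysis; the study numbers are invented for illustration, and this is emphatically not Dr. Goodman’s analysis:

```python
# A minimal fixed-effect meta-analysis sketch, pooling hypothetical
# relative risks by inverse-variance weights; no single study determines
# the result, which echoes the testimony quoted above.
import math

# (relative risk, lower 95% bound, upper 95% bound) for invented studies
studies = [(1.2, 0.8, 1.8), (0.9, 0.6, 1.35), (1.1, 0.85, 1.42)]

weights, log_rrs = [], []
for rr, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE recovered from the CI
    weights.append(1 / se ** 2)                      # inverse-variance weight
    log_rrs.append(math.log(rr))

pooled = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print("pooled RR %.2f, 95%% CI (%.2f, %.2f)" % (
    math.exp(pooled),
    math.exp(pooled - 1.96 * pooled_se),
    math.exp(pooled + 1.96 * pooled_se)))
```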

Judge Johnson’s judicial handiwork received non-deferential appellate review from a three-judge panel of the Appellate Division, which reversed the exclusion of Kornbluth and Madigan. In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832 (App. Div. 2017). The New Jersey Supreme Court granted the isotretinoin defendants’ petition for appellate review, and the issues were joined over the appropriate standard of appellate review for expert witness opinion exclusions, and the appropriateness of Judge Johnson’s exclusions of Kornbluth and Madigan. A bevy of amici curiae joined in the fray.2

Last week, the New Jersey Supreme Court issued a unanimous opinion, which reversed the Appellate Division’s holding that Judge Johnson had “mistakenly exercised” discretion. Applying its own precedents from Rubanick, Landrigan, and Kemp, and the established abuse-of-discretion standard, the Court concluded that the trial court’s ruling to exclude Kornbluth and Madigan was “unassailable.” In re Accutane Litig., ___ N.J. ___, 2018 WL 3636867 (2018), Slip op. at 79.3

The high court graciously acknowledged that defendants and amici had “good reason” to seek clarification of New Jersey law. Slip op. at 67. In abandoning abuse-of-discretion as its standard of review, the Appellate Division had relied upon a criminal case that involved the application of the Frye standard, which is applied as a matter of law. Id. at 70-71. The high court also appeared to welcome the opportunity to grant review, reverse the intermediate court, and reinforce “the rigor expected of the trial court” in its gatekeeping role. Id. at 67. The Supreme Court, however, did not articulate a new standard; rather it demonstrated at length that Judge Johnson had appropriately applied the legal standards previously announced in New Jersey Supreme Court cases.4

In attempting to defend the Appellate Division’s decision, plaintiffs sought to characterize New Jersey law as somehow different from, and more “liberal” than, the United States Supreme Court’s decision in Daubert. The New Jersey Supreme Court acknowledged that it had never formally adopted the dicta from Daubert about factors that could be considered in gatekeeping, slip op. at 10, but the Court went on to note what disinterested observers had long understood: that the so-called Daubert factors simply flowed from a requirement of sound methodology, and that there was “little distinction” and “not much light” between the Landrigan and Rubanick principles and the Daubert case or its progeny. Id. at 10, 80.

Curiously, the New Jersey Supreme Court announced that the Daubert factors should be incorporated into the New Jersey Rules 702 and 703 and their case law, but it stopped short of declaring New Jersey a “Daubert” jurisdiction. Slip op. at 82. In part, the Court’s hesitance followed from New Jersey’s bifurcation of expert witness standards for civil and criminal cases, with the Frye standard still controlling in the criminal docket. At another level, it makes no sense to describe any jurisdiction as a “Daubert” state because the relevant aspects of the Daubert decision were dicta, and the Daubert decision and its progeny were superseded by the revision of the controlling statute in 2000.5

There were other remarkable aspects of the Supreme Court’s Accutane decision. For instance, the Court put its weight behind the common-sense and accurate interpretation of Sir Austin Bradford Hill’s famous articulation of factors for causal judgment, which requires that sampling error, bias, and confounding be eliminated before assessing whether the observed association is strong, consistent, plausible, and the like. Slip op. at 20 (citing the Reference Manual at 597-99), 78.

The Supreme Court relied extensively on the National Academies’ Reference Manual on Scientific Evidence.6 That reliance is certainly preferable to judicial speculations and fabulations of scientific method. The reliance is also positive, considering that the Court did not look only at the problematic epidemiology chapter, but adverted also to the chapters on statistical evidence and on clinical medicine.

The Supreme Court recognized that the Appellate Division had essentially sanctioned an “anything goes” abandonment of gatekeeping, an approach that has been all-too-common in some of New Jersey’s lower courts. Contrary to the previously prevailing New Jersey zeitgeist, the Court instructed that gatekeeping must be “rigorous” to “prevent[] the jury’s exposure to unsound science through the compelling voice of an expert.” Slip op. at 68-69.

Not all evidence is equal. “[C]ase reports are at the bottom of the evidence hierarchy.” Slip op. at 73. Extrapolation from non-human animal studies is fraught with external validity problems, and such studies are “far less probative in the face of a substantial body of epidemiologic evidence.” Id. at 74 (internal quotations omitted).

Perhaps most chilling for the lawsuit industry will be the Supreme Court’s strident denunciation of expert witnesses’ selectivity in choosing lesser evidence in the face of a large body of epidemiologic evidence, id. at 77, and their unprincipled cherry picking among the extant epidemiologic publications. Like the trial court, the Supreme Court found that the plaintiffs’ expert witnesses’ inconsistent use of methodological criteria and their selective reliance upon studies (disregarding eight of the nine epidemiologic studies) that favored their taskmasters was the antithesis of sound methodology. Id. at 73 (citing with approval In re Lipitor, ___ F.3d ___ (4th Cir. 2018) (slip op. at 16) (“Result-driven analysis, or cherry-picking, undermines principles of the scientific method and is a quintessential example of applying methodologies (valid or otherwise) in an unreliable fashion.”)).

An essential feature of the Supreme Court’s decision is that it was not willing to engage in the common reductionism that holds that “all epidemiologic studies are flawed,” and which thus privileges cherry picking. Not all disagreements between expert witnesses can be framed as differences in interpretation. In re Accutane will likely stand as a bulwark against flawed expert witness opinion testimony in the Garden State for a long time.


1 Judge Nelson Johnson is also the author of Boardwalk Empire: The Birth, High Times, and Corruption of Atlantic City (2010), a spell-binding history of political and personal corruption.

2 In support of the defendants’ positions, amicus briefs were filed by the New Jersey Business & Industry Association, Commerce and Industry Association of New Jersey, and New Jersey Chamber of Commerce; by law professors Kenneth S. Broun, Daniel J. Capra, Joanne A. Epps, David L. Faigman, Laird Kirkpatrick, Michael M. Martin, Liesa Richter, and Stephen A. Saltzburg; by medical associations the American Medical Association, Medical Society of New Jersey, American Academy of Dermatology, Society for Investigative Dermatology, American Acne and Rosacea Society, and Dermatological Society of New Jersey; by the Defense Research Institute; by the Pharmaceutical Research and Manufacturers of America; and by the New Jersey Civil Justice Institute. In support of the plaintiffs’ position and the intermediate appellate court’s determination, amicus briefs were filed by political action committee the New Jersey Association for Justice; by the Ironbound Community Corporation; and by plaintiffs’ lawyer Allan Kanner.

3 Nothing in the intervening scientific record has called Judge Johnson’s trial court judgment into question. See, e.g., I.A. Vallerand, R.T. Lewinson, M.S. Farris, C.D. Sibley, M.L. Ramien, A.G.M. Bulloch, and S.B. Patten, “Efficacy and adverse events of oral isotretinoin for acne: a systematic review,” 178 Brit. J. Dermatol. 76 (2018).

4 Slip op. at 9, 14-15, citing Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991) (“We initially took that step to allow the parties in toxic tort civil matters to present novel scientific evidence of causation if, after the trial court engages in rigorous gatekeeping when reviewing for reliability, the proponent persuades the court of the soundness of the expert’s reasoning.”).

5 The Court did acknowledge that Federal Rule of Evidence 702 had been amended in 2000, to reflect the Supreme Court’s decisions in Daubert, Joiner, and Kumho Tire, but the Court did not deal with the inconsistencies between the present rule and the 1993 Daubert case. Slip op. at 64, citing Calhoun v. Yamaha Motor Corp., U.S.A., 350 F.3d 316, 320-21, 320 n.8 (3d Cir. 2003).

6 See Accutane slip op. at 12-18, 24, 73-74, 77-78. With respect to meta-analysis, the Reference Manual’s epidemiology chapter is still stuck in the 1980s, with that era’s prevalent resistance to poorly conducted, often meaningless meta-analyses. See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011) (the Reference Manual fails to come to grips with the prevalence and importance of meta-analysis in litigation, and fails to provide meaningful guidance to trial judges).

P-Values: Pernicious or Perspicacious?

May 12th, 2018

Professor Kingsley R. Browne, of the Wayne State University Law School, recently published a paper that criticized the use of p-values and significance testing in discrimination litigation. Kingsley R. Browne, “Pernicious P-Values: Statistical Proof of Not Very Much,” 42 Univ. Dayton L. Rev. 113 (2017) (cited below as Browne). Browne amply documents the obvious and undeniable: that judges, lawyers, and even some ill-trained expert witnesses are congenitally unable to describe and interpret p-values properly. Most of Browne’s examples are from the world of anti-discrimination law, but he cites a few from health effects litigation as well. Browne also cites many of the criticisms of p-values in the psychology and other social science literature.

Browne’s efforts to correct judicial innumeracy are welcome, but they take a peculiar turn in this law review article. From the well-known state of affairs of widespread judicial refusal or inability to discuss statistical concepts accurately, Browne argues for what seem to be two incongruous, inconsistent responses. Rejecting the glib suggestion of former Judge Posner that evidence law is not “fussy” about evidence, Browne argues that federal evidence law requires courts to be “fussy” about evidence, and that Rule 702 requires courts to exclude expert witnesses whose opinions fail to “employ[] in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.” Browne at 143 (quoting Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999)). Browne tells us, with apparently appropriate intellectual rigor, that “[i]f a disparity that does not provide a p-value of less than 0.05 would not be accepted as meaningful in the expert’s discipline, it is not clear that the expert should be allowed to testify – on the basis of his expertise in that discipline – that the disparity is, in fact, meaningful.” Id.

In a volte-face, Browne then argues that p-values do “not tell us much,” basically because they are dependent upon sample size. Browne suggests that the quantitative disparity between expected value and observed proportion or average can be assessed without the use of p-values, and that calculating a p-value “adds virtually nothing and just muddies the water.” Id. at 152. The prevalent confusion among judges and lawyers seems sufficient in Browne’s view to justify his proposal, as well as his further suggestion that Rule 403 should be invoked to exclude p-values:

The ease with which reported p-values cause a trier of fact to slip into the transposition fallacy and the difficulty of avoiding that lapse of logic, coupled with the relatively sparse information actually provided by the p-value, make p-values prime candidates for exclusion under Federal Rule of Evidence 403. *** If judges, not to mention the statistical experts they rely on, cannot use the information without falling into fallacious reasoning, the likelihood that the jury will misunderstand the evidence is very high. Since the p-value actually provides little useful relevant information, the high risk of misleading the jury greatly exceeds its scant probative value, so it simply should not be presented to the jury.”

Id. at 152-53.

And yet, elsewhere in the same article, Browne ridicules one court and several expert witnesses who have argued in favor of conclusions that were based upon p-values up to 50%.1 The concept of p-values cannot be so flexible as to straddle the extremes of having no probative value and yet being capable of rendering an expert witness’s opinions ludicrous. P-values quantify an estimate of random error, even if that error rate varies with sample size. To be sure, the measure of random error depends upon the specified model and the assumption of a null hypothesis, but the crucial point is that the estimate (whether mean, proportion, risk ratio, risk difference, etc.) is rather meaningless without some further estimate of the random variability of that estimate. Of course, random error is not the only type of error, but the existence of other potential systematic errors is hardly a reason to ignore random error.
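
Browne’s sample-size point is real, but it hardly shows that p-values convey nothing. A minimal sketch, with hypothetical numbers, shows that the identical disparity in proportions yields very different p-values as the sample grows; that is precisely the information about random error that the raw disparity alone cannot supply:

```python
# A sketch of sample-size dependence: the same 4-point disparity (54% vs.
# 50%) tested at increasing sample sizes; hypothetical numbers throughout.
import math
from scipy.stats import norm

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-sample test of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 2 * norm.sf(abs((p1 - p2) / se))

for n in (100, 1000, 10000):
    print(n, round(two_prop_p(int(0.54 * n), n, int(0.50 * n), n), 4))
# n=100   -> p ≈ 0.57 (not "significant")
# n=10000 -> p < 0.0001 ("significant"), for the very same disparity
```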

In the science of health effects, many applications of p-values have given way to the use of confidence intervals, which arguably provide more direct assessments of sample estimates, along with ranges of potential outcomes that are reasonably compatible with those estimates. Remarkably, Browne never substantively discusses confidence intervals in his article.
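
A minimal sketch, with invented counts, shows what that confidence-interval presentation looks like for a relative risk: a point estimate accompanied by the range of values reasonably compatible with the data:

```python
# A sketch of the confidence-interval presentation for a relative risk,
# using invented cohort counts; the log-scale standard error is standard.
import math

def rr_ci(a, n1, b, n2, z=1.96):
    """Relative risk and 95% CI for exposed (a/n1) vs. unexposed (b/n2)."""
    rr = (a / n1) / (b / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)  # SE of log(RR)
    return (rr,
            math.exp(math.log(rr) - z * se_log),
            math.exp(math.log(rr) + z * se_log))

print("RR %.2f, 95%% CI (%.2f, %.2f)" % rr_ci(30, 1000, 25, 1000))
```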

Under the heading of other problems with p-values and significance testing, Browne advances four additional putative problems with p-values. First, Browne asserts with little to no support that “[t]he null hypothesis is unlikely a priori.” Id. at 155. He fails to tell us why the null hypothesis of no disparity is not a reasonable starting place in the absence of objective evidence of a prior estimate. Furthermore, a null hypothesis of no difference will have legal significance in claims of health effects, or of unlawful discrimination.

Second, Browne argues that significance testing will lead to “[c]onflation of statistical and practical (or legal) significance” in the minds of judges and jurors. Id. at 156-58. This charge is difficult to sustain. Practical significance, and its separation from statistical significance, is probably what the actors in legal cases appreciate most readily. If a large class action showed that the expected value of a minority’s proportion was 15%, and the observed proportion was 14.8%, p < 0.05, even innumerate judges and jurors would sense that this disparity was unimportant, and that no employer would fine-tune its discriminatory activities so closely as to achieve such a meaningless difference.
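
A back-of-the-envelope calculation shows why the hypothetical works: the gap between 14.8% observed and 15% expected becomes “statistically significant” only at an enormous sample size, which is exactly where practical and statistical significance come apart. (A one-sample test of a proportion is assumed; the numbers are the hypothetical’s own.)

```python
# How large must n be for 14.8% observed vs. 15.0% expected to reach
# p < 0.05? Solve |p_hat - p0| = z * sqrt(p0 * (1 - p0) / n) for n.
p0, p_hat, z = 0.15, 0.148, 1.96
n = (z / (p_hat - p0)) ** 2 * p0 * (1 - p0)
print(round(n))   # ≈ 122,000 observations before the trivial gap is "significant"
```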

Third, Browne reminds us that the validity and the interpretation of a p-value turn on the assumption that the statistical model is perfectly specified. Id. at 158-59. His reminder is correct, but again, this aspect of p-values (or confidence intervals) is relatively easy to explain, as well as to defend or challenge. To be sure, there may be legitimate disputes about whether an appropriate model was used (say, binomial versus hypergeometric), but such disputes are hardly the most arcane issues that judges and jurors will face.
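
The binomial-versus-hypergeometric dispute can itself be made concrete. In the sketch below, with invented hiring numbers, the two models return different lower-tail probabilities for the same observed shortfall, because the hypergeometric model samples without replacement from a finite pool:

```python
# Binomial (with replacement) vs. hypergeometric (without replacement)
# models of minority hiring; hypothetical numbers for illustration only.
from scipy.stats import binom, hypergeom

pool, minority, hires, observed = 1000, 300, 100, 20  # 20 minority hires seen

p_binom = binom.cdf(observed, hires, minority / pool)     # lower-tail p
p_hyper = hypergeom.cdf(observed, pool, minority, hires)  # lower-tail p
print(p_binom, p_hyper)   # the hypergeometric tail is somewhat smaller here
```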

Fourth, Browne claims that “the alternative hypothesis is seldom properly specified.” Id. at 159-62. Unless analysts are focused on measuring pre-test power or type II error, however, they need not advance an alternative hypothesis. Furthermore, it is hardly a flaw with significance testing that it does not account for systematic bias or confounding.

Browne does not offer an affirmative response, such as urging courts to adopt a Bayesian program. A Bayesian response to prevalent blunders in interpreting statistical significance would introduce perhaps even more arcane and hard-to-discern blunders into court proceedings. Browne also leaves courts without a meaningful approach to evaluating random error, other than to engage in crude comparisons between two means or proportions. The recommendations in this law review article appear to be a giant step backwards, into an epistemic void.


1 See Browne at 146, citing In re Photochromic Lens Antitrust Litig., 2014 WL 1338605 (M.D. Fla. April 3, 2014) (reversing magistrate judge’s exclusion of an expert witness who had advanced claims based upon a p-value of 0.50); id. at 147 n.116, citing In re High-Tech Employee Antitrust Litig., 2014 WL 1351040 (N.D. Cal. 2014).

Statistical Deontology

March 2nd, 2018

In courtrooms across America, there has been a lot of buzzing and palavering about the American Statistical Association’s Statement on Statistical Significance Testing,1 but very little discussion of the Association’s Ethical Guidelines, which were updated and promulgated in the same year, 2016. Statisticians and statistics, like lawyers and the law, receive their fair share of calumny over their professional activities, but the statisticians’ principal North American professional organization is trying to do something about members’ transgressions.

The American Statistical Association (ASA) has promulgated ethical guidelines for statisticians, as has the Royal Statistical Society,2 even if these organizations lack the means and procedures to enforce their codes. The ASA’s guidelines3 are rich with implications for statistical analyses put forward in all contexts, including litigation and regulatory rule making. As such, the guidelines are well worth studying by lawyers.

The ASA Guidelines were prepared by the Committee on Professional Ethics, and approved by the ASA’s Board in April 2016. There are lots of “thou shalts” and “thou shalt nots,” but I will focus on the issues that are more likely to arise in litigation. What is remarkable about the Guidelines is that, if followed, they probably would do more to eliminate unsound statistical practices in the courtroom than the ASA Statement on P-Values.

Defining Good Statistical Practice

Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” Guidelines at 1. The Guidelines thus incorporate something akin to the Kumho Tire standard that an expert witness ‘‘employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.’’ Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999).

A statistician engaged in expert witness testimony should provide “only expert testimony, written work, and oral presentations that he/she would be willing to have peer reviewed.” Guidelines at 2. “The ethical statistician uses methodology and data that are relevant and appropriate, without favoritism or prejudice, and in a manner intended to produce valid, interpretable, and reproducible results.” Id. Similarly, the ethical statistician will identify and mitigate biases, and use analyses “appropriate and valid for the specific question to be addressed, so that results extend beyond the sample to a population relevant to the objectives with minimal error under reasonable assumptions.” Id. If the Guidelines were followed, a lot of spurious analyses would drop off the litigation landscape, regardless of whether they used p-values or confidence intervals, or a Bayesian approach.

Integrity of Data and Methods

The ASA’s Guidelines also have a good deal to say about data integrity and statistical methods. In particular, the Guidelines call for candor about limitations in the statistical methods or the integrity of the underlying data:

The ethical statistician is candid about any known or suspected limitations, defects, or biases in the data that may impact the integrity or reliability of the statistical analysis. Objective and valid interpretation of the results requires that the underlying analysis recognizes and acknowledges the degree of reliability and integrity of the data.”

Guidelines at 3.

The statistical analyst openly acknowledges the limits of statistical inference, the potential sources of error, as well as the statistical and substantive assumptions made in the execution and interpretation of any analysis,” including data editing and imputation. Id. The Guidelines urge analysts to address potential confounding not assessed by the study design. Id. at 3, 10. How often do we see these acknowledgments in litigation-driven analyses, or in peer-reviewed papers, for that matter?

Affirmative Actions Prescribed

In aid of promoting data and methodological integrity, the Guidelines also urge analysts to share data when appropriate, without revealing the identities of study participants. Statistical analysts should publicly correct any disseminated data and analyses in their own work, as well as work to “expose incompetent or corrupt statistical practice.” Of course, the Lawsuit Industry will call this ethical duty “attacking the messenger,” but maybe that is a rhetorical strategy based upon an assessment of risks versus benefits to the Lawsuit Industry.

Multiplicity

The ASA Guidelines also address substantive statistical errors, such as:

[r]unning multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.”

Guidelines at 9.
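
The arithmetic behind that warning is simple enough to exhibit. A minimal sketch of the family-wise error rate for independent tests, with a Bonferroni-style adjusted alpha shown for comparison:

```python
# Family-wise error rate: the chance of at least one false-positive
# "significant" result among k independent tests, each run at alpha.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k                    # P(at least one false positive)
    print(k, round(fwer, 2), round(alpha / k, 4))  # and Bonferroni-adjusted alpha
# Twenty independent tests at 0.05 carry a ~64% chance of at least one
# spurious "significant" finding; Bonferroni would test each at 0.0025.
```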

There are some Lawsuit Industrialists who have taken comfort in the pronouncements of Kenneth Rothman on corrections for multiple comparisons. Rothman’s views on multiple comparisons are, however, much broader and more nuanced than the Industry’s sound bites.4 Given that Rothman opposes anything like strict statistical significance testing, it follows that he is relatively unmoved by the need for adjustments to alpha or to the coefficient of confidence. Rothman, however, has never deprecated the need to consider the multiplicity of testing, and the need for researchers to be forthright in disclosing the scope of comparisons originally planned and actually done.


1 Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 Am. Statistician 129 (2016).

2 Royal Statistical Society – Code of Conduct (2014); Steven Piantadosi, Clinical Trials: A Methodologic Perspective 609 (2d ed. 2005).

3 Shelley Hurwitz & John S. Gardenier, “Ethical Guidelines for Statistical Practice: The First 60 Years and Beyond,” 66 Am. Statistician 99 (2012) (describing the history and evolution of the Guidelines).

4 Kenneth J. Rothman, “Six Persistent Research Misconceptions,” 29 J. Gen. Intern. Med. 1060, 1063 (2014).

The 5% Solution at the FDA

February 24th, 2018

The statistics wars rage on1, with Bayesians attempting to take advantage of the so-called replication crisis to argue that it is all the fault of frequentist significance testing. In 2016, there was an attempted coup at the American Statistical Association, but the Bayesians did not get what they wanted; the ASA’s statement amounted to little more than a consensus that p-values and confidence intervals should be properly interpreted. Patient advocacy groups have lobbied for the availability of unapproved and incompletely tested medications, and rent-seeking litigants have argued and lobbied for the elimination of statistical tests and methods in the assessment of causal claims. The battle continues.

Against this backdrop, a young Harvard graduate student has published a paper with a brief history of significance testing, and the role that significance testing has taken on at the United States Food and Drug Administration (FDA). Lee Kennedy-Shaffer, “When the Alpha is the Omega: P-Values, ‘Substantial Evidence’, and the 0.05 Standard at FDA,” 72 Food & Drug L.J. 595 (2017) [cited below as K-S]. The paper presents a short but entertaining history of the evolution of the p-value from its early invocation in 1710, by John Arbuthnott, a Scottish physician and mathematician, who calculated the probability that male births would exceed female births for 82 consecutive years if their true proportions were equal. K-S at 603. Kennedy-Shaffer notes the role of the two great French mathematicians, Pierre-Simon Laplace and Siméon-Denis Poisson, who used p-values (or their complements) to evaluate empirical propositions. As Kennedy-Shaffer notes, Poisson observed that the equivalent of what would be a modern p-value of about 0.005 was sufficient in his view, back in 1830, to believe that the French Revolution of 1830 had caused the pattern of jury verdicts to change. K-S at 604.
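
Arbuthnott’s calculation is easy to reproduce. On the null hypothesis that male births and female births are equally likely to predominate in any given year, the chance that male births exceed female births in all 82 observed years is (1/2)^82:

```python
# Arbuthnott's 1710 argument, restated: if either sex were equally likely
# to predominate in a given year, 82 consecutive "male" years would occur
# with probability (1/2)^82 under the null hypothesis.
p = 0.5 ** 82
print(p)   # ≈ 2.07e-25, an early ancestor of the modern p-value
```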

Kennedy-Shaffer traces the p-value, or its equivalent, through its treatment by the great early 20th century statisticians, Karl Pearson and Ronald A. Fisher, through its modification by Jerzy Neyman and Egon Pearson, and into the bowels of the FDA in Rockville, Maryland. It is a history well worth recounting, if for no other reason than to remind us that the p-value or its equivalent has been remarkably durable and reasonably effective in protecting the public against false claims of safety and efficacy. Kennedy-Shaffer provides several good examples in which the FDA’s use of significance testing was outcome-dispositive of approval or non-approval of medications and devices.

There is enough substance and history here that everyone will have something to pick at this paper. Let me volunteer the first shot. Kennedy-Shaffer describes the co-evolution of the controlled clinical trial and statistical tests, and points to the landmark study by the Medical Research Council on streptomycin for tuberculosis. Geoffrey Marshall (chairman), “Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation,” 2 Brit. Med. J. 769, 769–71 (1948). This clinical trial was historically important, not only for its results and for Sir Austin Bradford Hill’s role in its design, but for the care with which it described randomization, double blinding, and multiple study sites. Kennedy-Shaffer suggests that “[w]hile results were presented in detail, few formal statistical tests were incorporated into this analysis.” K-S at 597-98. And yet, a few pages later, he tells us that “both chi-squared tests and t-tests were used to evaluate the responses to the drug and compare the control and treated groups,” and that “[t]he difference in mortality between the two groups is statistically significant.” K-S at 611. Although it is true that the authors did not report their calculated p-values for any test, the difference in mortality between the streptomycin and control groups was very large, and the standards for describing the results of such a clinical trial were in their infancy in 1948.
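
As a sketch of the sort of test at issue, here is a chi-squared computation applied to the six-month mortality figures commonly cited in secondary accounts of the MRC trial (4 of 55 streptomycin patients died, versus 14 of 52 controls); the counts are offered for illustration, not as a substitute for the original tables:

```python
# A chi-squared test on a 2x2 mortality table, using the figures commonly
# cited for the 1948 MRC streptomycin trial (reader should verify counts).
from scipy.stats import chi2_contingency

table = [[4, 51],    # streptomycin: died, survived
         [14, 38]]   # bed-rest control: died, survived
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p, 3))   # ≈ 6.04, p ≈ 0.014 (continuity-corrected)
```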

Kennedy-Shaffer’s characterization of Sir Austin Bradford Hill’s use of statistical tests and methods takes on outsize importance because of the mischaracterizations, and even misrepresentations, made by some representatives of the Lawsuit Industry, who contend that Sir Austin dismissed statistical methods as unnecessary. In the United States, some judges have been seriously misled by those misrepresentations, which have made their way into published judicial decisions.

The operative document, of course, is the publication of Sir Austin’s famous after-dinner speech, in 1965, on the occasion of his election to the Presidency of the Royal Society of Medicine’s Section of Occupational Medicine. Although the speech is casual and free of scholarly footnotes, Sir Austin’s message was precise, balanced, and nuanced. The speech is a classic in the history of medicine, which remains important even if rather dated in terms of its primary message about how science and medicine move from beliefs about associations to knowledge of causal associations. As everyone knows, Sir Austin articulated nine factors or viewpoints through which to assess any putative causal association, but he emphasized that before these nine factors are assessed, our starting point itself has prerequisites:

Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965) [cited below as Hill]. The starting point, therefore, before the Bradford Hill nine factors come into play, is a “clear-cut” association, which is “beyond what we would care to attribute to the play of chance.”

In other words, consideration of random error is necessary.

Now for the nuance and the balance. Sir Austin acknowledged that there were some situations in which we simply do not need to calculate standard errors because the disparity between treatment and control groups is so large and meaningful. He goes on to wonder out loud:

whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse the glitter of the t table diverts attention from the inadequacies of the fare.”

Hill at 299. Now this is all true, but hardly the repudiation of statistical testing claimed by those who want to suppress the consideration of random error from science and judicial gatekeeping. There are very few litigation cases in which the difference between exposed and unexposed is “grotesquely obvious,” such that we can leave statistical methods at the door. Importantly, the very large differences between the streptomycin and control groups in the Medical Research Council’s 1948 clinical trial were not so “grotesquely obvious” that statistical methods were obviated. To be fair, the differences were sufficiently great that statistical discussion could be kept to a minimum. Sir Austin gave extensive tables in the 1948 paper to let the reader appreciate the actual data themselves.

In his after-dinner speech, Hill also gives examples of studies that are so biased and confounded that no statistical method will likely ever save them. Certainly, the technology of regression and propensity-score analyses has progressed tremendously since Hill’s 1965 speech, but his point still remains. That point, however, hardly excuses the lack of statistical apparatus in highly confounded or biased observations.

In addressing the nine factors he identified, which presumed a “clear-cut” association, with random error ruled out, Sir Austin did opine that the factors raised questions, and that:

No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.”

Hill at 299. Again, the date and the context are important. Hill is addressing consideration of the nine factors, not the required predicate association beyond the play of chance or random error. The date is important as well, because it would be foolish to suggest that statistical methods have not grown in the last half century to address some of the nine factors. The existence and the nature of dose-response are the subject of extensive statistical methods, and meta-analysis and meta-regression are used to assess and measure consistency between studies.

Kennedy-Shaffer might well have pointed out the great influence Sir Austin’s textbook on medical statistics had had on medical research and practice. This textbook, which went through numerous editions, makes clear the importance of statistical testing and methods:

Are simple methods of the interpretation of figures only a synonym for common sense or do they involve an art or knowledge which can be imparted? Familiarity with medical statistics leads inevitably to the conclusion that common sense is not enough. Mistakes which when pointed out look extremely foolish are quite frequently made by intelligent persons, and the same mistakes, or types of mistakes, crop up again and again. There is often lacking what has been called a ‘statistical tact, which is rather more than simple good sense’. That tact the majority of persons must acquire (with a minority it is undoubtedly innate) by a study of the basic principles of statistical method.”

Austin Bradford Hill, Principles of Medical Statistics at 2 (4th ed. 1948) (emphasis in original). And later in his text, Sir Austin notes that:

The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.”

Id. at 10 (emphasis added).

Sir Austin’s work, taken as a whole, demonstrates his acceptance of the necessity of statistical methods in medicine and in causal inference. Kennedy-Shaffer’s paper covers much ground, but it shortchanges this important line of influence, which lies directly in the historical path between Sir Ronald Fisher and the medical regulatory community.

Kennedy-Shaffer gives a nod to Bayesian methods, and even suggests that Bayesian results are “more intuitive,” but he does not explain the supposed intuitiveness of treating a parameter as having a probability distribution. This might make sense at the level of quantum physics, but it does not seem to describe the reality of a biomedical phenomenon such as a relative risk. Kennedy-Shaffer notes the FDA’s expression of willingness to entertain Bayesian analyses of clinical trials, and the rare instances in which such analyses have actually been deployed. K-S at 629 (“e.g., Pravigard Pac for prevention of myocardial infarction”). He concedes, however, that Bayesian designs are still the exception to the rule, and he notes the caution of Robert Temple, a former FDA Director of Medical Policy, who observed in 2005 that Bayesian proposals for drug clinical trials were at that time “very rare.”2 K-S at 630.


2 Robert Temple, “How FDA Currently Makes Decisions on Clinical Studies,” 2 Clinical Trials 276, 281 (2005).