The 5% Solution at the FDA

The statistics wars rage on,1 with Bayesians attempting to take advantage of the so-called replication crisis to argue that it is all the fault of frequentist significance testing. In 2016, there was an attempted coup at the American Statistical Association, but the Bayesians did not get what they wanted; the result was little more than a consensus that p-values and confidence intervals should be properly interpreted. Patient advocacy groups have lobbied for the availability of unapproved and incompletely tested medications, and rent-seeking litigants have argued and lobbied for the elimination of statistical tests and methods in the assessment of causal claims. The battle continues.

Against this backdrop, a young Harvard graduate student has published a paper with a brief history of significance testing and of the role that significance testing has taken on at the United States Food and Drug Administration (FDA). Lee Kennedy-Shaffer, “When the Alpha is the Omega: P-Values, ‘Substantial Evidence’, and the 0.05 Standard at FDA,” 72 Food & Drug L.J. 595 (2017) [cited below as K-S]. The paper presents a short but entertaining history of the evolution of the p-value from its early invocation in 1710 by John Arbuthnott, a Scottish physician and mathematician, who calculated the probability that male births would exceed female births for 82 consecutive years if their true proportions were equal. K-S at 603. Kennedy-Shaffer also notes the role of the two great French mathematicians, Pierre-Simon Laplace and Siméon-Denis Poisson, who used p-values (or their complements) to evaluate empirical propositions. Poisson, for instance, took the equivalent of what would now be a p-value of about 0.005 as sufficient, in his view, to conclude that the French Revolution of 1830 had changed the pattern of jury verdicts. K-S at 604.
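
Arbuthnott’s reasoning is easy to reconstruct in modern terms. The short sketch below is my own illustration, not anything in Kennedy-Shaffer’s paper: it treats each year as a fair coin flip under the null hypothesis of equal proportions, so the chance of a male excess in all 82 recorded years is (1/2)^82, a vanishingly small number.

    from fractions import Fraction

    # Under the null hypothesis that male and female births are equally
    # likely to predominate in any given year, the probability of a male
    # excess in every one of 82 consecutive years is (1/2)^82, in effect
    # a one-sided sign test.
    p_value = Fraction(1, 2) ** 82
    print(f"(1/2)^82 = {float(p_value):.2e}")  # about 2.07e-25

A result that extreme would today be taken as overwhelming evidence against the hypothesis of equal proportions, which is essentially the use to which Arbuthnott put it.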

Kennedy-Shaffer traces the p-value, or its equivalent, through its treatment by the great early 20th-century statisticians, Karl Pearson and Ronald A. Fisher, through its modification by Jerzy Neyman and Egon Pearson, and into the bowels of the FDA in Rockville, Maryland. It is a history well worth recounting, if for no other reason than to remind us that the p-value, or its equivalent, has been remarkably durable and reasonably effective in protecting the public against false claims of safety and efficacy. Kennedy-Shaffer provides several good examples in which the FDA’s use of significance testing was dispositive of the approval or non-approval of medications and devices.

There is enough substance and history here that everyone will find something in this paper to pick at. Let me volunteer the first shot. Kennedy-Shaffer describes the co-evolution of the controlled clinical trial and statistical tests, and points to the landmark study by the Medical Research Council of streptomycin for tuberculosis. Geoffrey Marshall (chairman), “Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation,” 2 Brit. Med. J. 769, 769–71 (1948). This clinical trial was historically important, not only for its results and for Sir Austin Bradford Hill’s role in its design, but for the care with which it described randomization, double blinding, and the use of multiple study sites. Kennedy-Shaffer suggests that “[w]hile results were presented in detail, few formal statistical tests were incorporated into this analysis.” K-S at 597-98. And yet, a few pages later, he tells us that “both chi-squared tests and t-tests were used to evaluate the responses to the drug and compare the control and treated groups,” and that “[t]he difference in mortality between the two groups is statistically significant.” K-S at 611. It is true that the authors did not report calculated p-values for any test, but the difference in mortality between the streptomycin and control groups was very large, and the standards for describing the results of such a clinical trial were in their infancy in 1948.
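
For readers who want to see what a statistically significant mortality difference looks like in practice, here is a minimal sketch of a Pearson chi-squared test on a 2×2 table of deaths and survivors by treatment arm, written from first principles in Python. The counts are purely illustrative and are not offered as the trial’s published figures; the point is only to show how a large difference in death rates yields a small p-value.

    import math

    def chi_squared_2x2(table):
        """Pearson chi-squared statistic and p-value (1 degree of freedom)
        for a 2x2 table of counts [[a, b], [c, d]]."""
        (a, b), (c, d) = table
        n = a + b + c + d
        row_totals = (a + b, c + d)
        col_totals = (a + c, b + d)
        statistic = 0.0
        for i, row in enumerate(table):
            for j, observed in enumerate(row):
                expected = row_totals[i] * col_totals[j] / n
                statistic += (observed - expected) ** 2 / expected
        # With 1 degree of freedom, the chi-squared survival function
        # reduces to erfc(sqrt(x / 2)).
        return statistic, math.erfc(math.sqrt(statistic / 2))

    # Purely illustrative counts: [deaths, survivors] in each arm.
    treated, control = [4, 51], [14, 38]
    stat, p = chi_squared_2x2([treated, control])
    print(f"chi-squared = {stat:.2f}, p = {p:.4f}")  # p well below 0.05

With counts of that general shape, the p-value falls well below the conventional 0.05 threshold, which is all that a claim of statistical significance asserts.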

Kennedy-Shaffer’s characterization of Sir Austin Bradford Hill’s use of statistical tests and methods takes on outsized importance because of the mischaracterizations, and even misrepresentations, made by some representatives of the Lawsuit Industry, who contend that Sir Austin dismissed statistical methods as unnecessary. In the United States, some judges have been seriously misled by those misrepresentations, which have found their way into published judicial decisions.

The operative document, of course, is the publication of Sir Austin’s famous after-dinner speech, given in 1965 on the occasion of his election to the Presidency of the Royal Society of Medicine. Although the speech is casual and free of scholarly footnotes, Sir Austin’s message was precise, balanced, and nuanced. The speech is a classic in the history of medicine, one that remains important even if rather dated in its primary message about how science and medicine move from beliefs about associations to knowledge of causal associations. As everyone knows, Sir Austin articulated nine factors or viewpoints through which to assess any putative causal association, but he emphasized that before these nine factors are assessed, our starting point itself has prerequisites:

Disregarding then any such problem in semantics we have this situation. Our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance. What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965) [cited below as Hill]. The starting point, therefore, before the Bradford Hill nine factors come into play, is a “clear-cut” association, which is “beyond what we would care to attribute to the play of chance.”

In other words, consideration of random error is necessary.

Now for the nuance and the balance. Sir Austin acknowledged that there were some situations in which we simply do not need to calculate standard errors because the disparity between treatment and control groups is so large and meaningful. He went on to wonder out loud:

whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse the glitter of the t table diverts attention from the inadequacies of the fare.

Hill at 299. Now this is all true, but it is hardly the repudiation of statistical testing claimed by those who want to banish the consideration of random error from science and from judicial gatekeeping. There are very few litigation cases in which the difference between the exposed and the unexposed is “grotesquely obvious,” such that we can leave statistical methods at the door. Importantly, the very large differences between the streptomycin and control groups in the Medical Research Council’s 1948 clinical trial were not so “grotesquely obvious” that statistical methods were obviated. To be fair, the differences were sufficiently great that the statistical discussion could be kept to a minimum, and Sir Austin gave extensive tables in the 1948 paper to let readers appreciate the actual data for themselves.

In his after-dinner speech, Hill also gives examples of studies so biased and confounded that no statistical method will likely ever save them. Certainly, the techniques of regression and propensity-score analysis have progressed tremendously since Hill’s 1965 speech, but his point still stands. And that point hardly excuses dispensing with the statistical apparatus in highly confounded or biased observational studies.

In addressing the nine factors he identified, which presumed a “clear-cut” association with random error ruled out, Sir Austin did opine that the factors raised questions and that:

No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.

Hill at 299. Again, the date and the context are important. Hill is addressing consideration of the nine factors, not the required predicate association beyond the play of chance or random error. The date is important as well, because it would be foolish to suggest that statistical methods have not grown in the last half century to address some of the nine factors. The existence and the nature of dose-response are the subject of extensive statistical methods, and meta-analysis and meta-regression are used to assess and measure consistency between studies.

Kennedy-Shaffer might well have pointed out the great influence Sir Austin’s textbook on medical statistics had had on medical research and practice. This textbook, which went through numerous editions, makes clear the importance of statistical testing and methods:

Are simple methods of the interpretation of figures only a synonym for common sense or do they involve an art or knowledge which can be imparted? Familiarity with medical statistics leads inevitably to the conclusion that common sense is not enough. Mistakes which when pointed out look extremely foolish are quite frequently made by intelligent persons, and the same mistakes, or types of mistakes, crop up again and again. There is often lacking what has been called a ‘statistical tact, which is rather more than simple good sense’. That tact the majority of persons must acquire (with a minority it is undoubtedly innate) by a study of the basic principles of statistical method.

Austin Bradford Hill, Principles of Medical Statistics at 2 (4th ed. 1948) (emphasis in original). And later in his text, Sir Austin notes that:

The statistical method is required in the interpretation of figures which are at the mercy of numerous influences, and its object is to determine whether individual influences can be isolated and their effects measured.

Id. at 10 (emphasis added).

Sir Austin’s work, taken as a whole, demonstrates his acceptance of the necessity of statistical methods in medicine and in causal inference. Kennedy-Shaffer’s paper covers much ground, but it shortchanges this important line of influence, which lies directly in the historical path between Sir Ronald Fisher and the medical regulatory community.

Kennedy-Shaffer gives a nod to Bayesian methods, and even suggests that Bayesian results are “more intuitive,” but he does not explain what is supposedly intuitive about treating a parameter as having a probability distribution. That move might make sense at the level of quantum physics, but it does not seem to describe the reality of a biomedical phenomenon such as a relative risk. Kennedy-Shaffer notes the FDA’s expressed willingness to entertain Bayesian analyses of clinical trials, and the rare instances in which such analyses have actually been deployed. K-S at 629 (“e.g., Pravigard Pac for prevention of myocardial infarction”). He concedes, however, that Bayesian designs remain the exception to the rule, and he notes the caution of Robert Temple, a former FDA Director of Medical Policy, who observed in 2005 that Bayesian proposals for drug clinical trials were at that time “very rare.”2 K-S at 630.


2 Robert Temple, “How FDA Currently Makes Decisions on Clinical Studies,” 2 Clinical Trials 276, 281 (2005).