Matrixx Galvanized – More Errors, More Comedy About Statistics

Matrixx Initiatives is a rich case – rich in irony, comedy, tragedy, and error.  It is well worth further exploration, especially in terms of how this 9-0 decision was reached, what it means, and how it should be applied.

It pains me that the Respondents (plaintiffs) generally did a better job in explaining significance testing than did the Petitioner (defendant).

At least some of the Respondents’ definitional efforts are unexceptionable.  For instance:

“Researchers use the term ‘statistical significance’ to characterize a result from a test that satisfies a particular kind of test designed to show that the result is unlikely to have occurred by random chance.  See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 83, 122 (Fed. Judicial Ctr., 2d ed. 2000) (“Reference Manual”).”

Brief for Respondents, at 38 – 39 (Nov 5, 2010).

“The purpose of significance testing in this context is to assess whether two events (here, taking Zicam and developing anosmia) occur together often enough to make it sufficiently implausible that no actual underlying relationship exists between them.”

Id. at 39.   These definitions seem acceptable as far as they go, as long as we realize that the relationship that remains, when chance is excluded, may not be causal, and indeed, it may well be a false-positive relationship that results from bias or confounding.

Rather than giving one good, clear definition, the Respondents felt obligated to repeat and restate their definitions, and thus wandered into error:

“To test for significance, the researcher typically develops a ‘null hypothesis’ – e.g., that there is no relationship between using intranasal Zicam and the onset of burning pain and subsequent anosmia. The researcher then selects a threshold (the ‘significance level’) that reflects an acceptably low probability of rejecting a true null hypothesis – e.g., of concluding that a relationship between Zicam and anosmia exists based on observations that in fact reflect random chance.”

Id. at 39.  Perhaps the Respondents were using the “cooking frogs” approach.  As the practical wisdom has it, dropping a frog into boiling water risks having the frog jump out, but if you put a frog into a pot of warm water, and gradually bring the pot to a boil, you will have a cooked frog.  Here the Respondents repeat and morph their definition of statistical significance until they have brought it around to their rhetorical goal of confusing statistical significance with causation.  Note that now the definition is muddled, and the Respondents are edging closer towards claiming that statistical significance signals the existence of a “relationship” between Zicam and anosmia, when in fact, the statistical significance simply means that chance is not a likely explanation for the observations.  Whether a “relationship” exists requires further analysis, and usually a good deal more evidence.

“The researcher then calculates a value (referred to as p) that reflects the probability that the observed data could have occurred even if the null hypothesis were in fact true.”

Id. at 39-40 (emphasis in original). Well, this is almost true.  It’s not “even if,” but simply “if”; that is, the p-value is calculated on the assumption that the null hypothesis is correct.  The “if” is not an incidental qualifier; it is essential to the definition of statistical significance.  “Even” adds nothing but a slightly misleading rhetorical flourish.  And the p-value is not simply the probability of the data observed; it is the probability of observing the data obtained, or data more extreme, assuming the null hypothesis is true.
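For readers who prefer to see the definition in action, here is a minimal sketch, in Python, of a two-sided p-value computed from hypothetical counts; the numbers are invented solely to make the arithmetic concrete and have nothing to do with the Zicam reports.

```python
# A two-sided p-value, computed from first principles: the probability, on the
# assumption that the null hypothesis is true, of obtaining data at least as
# extreme as the data actually observed.  The counts here are hypothetical.
from math import comb

n = 20          # hypothetical number of trials
observed = 15   # hypothetical number of "events" observed
p_null = 0.5    # event probability if the null hypothesis is true

def binom_pmf(k: int) -> float:
    """Probability of exactly k events in n trials, assuming the null."""
    return comb(n, k) * p_null**k * (1 - p_null)**(n - k)

# Sum the probabilities of every outcome at least as improbable as the one
# observed -- "the data obtained, or data more extreme."
p_obs = binom_pmf(observed)
p_value = sum(binom_pmf(k) for k in range(n + 1) if binom_pmf(k) <= p_obs)

print(f"two-sided p-value = {p_value:.3f}")   # about 0.041 for these numbers
```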

The Respondents’/plaintiffs’ efforts at serious explication ultimately succumb to their hyperbolic rhetoric.  They explained that statistical significance may not be “practical significance,” which is true enough.  There are, of course, instances in which a statistically significant difference is not particularly interesting.  A large clinical trial, testing two cancer medications head to head, may show that one extends life expectancy by a week or two but has a worse side-effect profile.  The statistically significant “better” drug may be refused a license by regulatory agencies, or be rejected by knowledgeable oncologists and sensible patients, who are more concerned about quality-of-life issues.

The Respondents are also correct that invoking statistical significance does not provide the simple, bright-line test the Petitioner desired.  Someone would still have to specify the level of alpha, the acceptable level of Type I error, and this in turn would require specifying either a one-sided or a two-sided test.  To be sure, the two-sided test, with an alpha of 5%, is generally accepted in the world of biostatistics and biomedical research.  Regulatory agencies, including the FDA, however, lower that standard to implement their precautionary principles and goals.  Furthermore, the evaluation of statistical significance requires additional analysis to determine whether the observed deviation from the expected value is due to bias or confounding, or whether the statistical test has been unduly diluted by multiple comparisons, subgroup analyses, or data-mining techniques.
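A short sketch, again with hypothetical settings, of two of these choices: the critical value turns on whether the test is one-sided or two-sided, and running many comparisons at a nominal five-percent level inflates the chance of at least one false-positive “significant” finding.

```python
# Two of the choices hidden behind any bright-line "statistical significance"
# rule, illustrated with hypothetical settings.
from statistics import NormalDist

alpha = 0.05
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
z_one_sided = NormalDist().inv_cdf(1 - alpha)       # about 1.645
print(f"critical z: two-sided {z_two_sided:.2f}, one-sided {z_one_sided:.2f}")

# If m independent comparisons are run and every null hypothesis is true,
# the chance of at least one false-positive "significant" result is:
for m in (1, 5, 20, 100):
    print(f"{m:>3} comparisons: P(at least one false positive) = "
          f"{1 - (1 - alpha) ** m:.3f}")
# 1 -> 0.050, 5 -> 0.226, 20 -> 0.642, 100 -> 0.994
```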

Of course, statistical significance today usually occurs in conjunction with an assessment of “effect size,” usually through an analysis of a confidence interval around a point estimate of a risk ratio.  The Respondents’ complaint that the p-value does not convey the magnitude of the association is a bit off the mark, but not completely illegitimate.  For instance, if there were a statistically significant finding of anosmia from Zicam use, in the form of an elevated risk that was itself small, the FDA might well decide that the risk was manageable with a warning to users to discontinue the medication if they experienced a burning sensation upon use.
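By way of illustration, here is a minimal sketch of the point-estimate-and-confidence-interval presentation, using entirely hypothetical 2×2 counts rather than any actual Zicam adverse-event data.

```python
# The point-estimate-plus-confidence-interval presentation described above,
# with hypothetical counts: a statistically significant but small absolute
# elevation in risk.
from math import exp, log, sqrt

a, n1 = 30, 10000   # hypothetical: events among exposed subjects
c, n0 = 10, 10000   # hypothetical: events among unexposed subjects

rr = (a / n1) / (c / n0)                 # risk ratio point estimate = 3.0

# Approximate 95% confidence interval on the log scale (Katz method)
se_log_rr = sqrt(1/a - 1/n1 + 1/c - 1/n0)
lower = exp(log(rr) - 1.96 * se_log_rr)
upper = exp(log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.1f}, 95% CI ({lower:.1f}, {upper:.1f})")
# Roughly RR = 3.0, CI (1.5, 6.1): the interval excludes 1.0, so the result is
# statistically significant, yet the absolute risk (0.3% vs. 0.1%) is small.
```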

The Respondents, along with their two would-be “statistical expert” amici, misrepresent the substance of many of the objections to statistical significance in the medical literature.  A telling example is the Respondents’ citation to an article by Professor David Savitz:

David A. Savitz, “Is Statistical Significance Testing Useful in Interpreting Data?” 7 Reproductive Toxicology 95, 96 (1993) (“[S]tatistical significance testing is not useful in the analysis or interpretation of scientific research.”).

Id. at 52, n. 40.

More complete quotations from Professor Savitz’ article, however, reveal a more nuanced, and rather different, message:

“Although P values and statistical significance testing have become entrenched in the practice of biomedical research, their usefulness and drawbacks should be reconsidered, particularly in observational epidemiology. The central role for the null hypothesis, assuming an infinite number of replications, and the dichotomization of results as positive or negative are argued to be detrimental to the proper design and evaluation of research. As an alternative, confidence intervals for estimated parameters convey some information about random variation without several of these limitations. Elimination of statistical significance testing as a decision rule would encourage those who present and evaluate research to more comprehensively consider the methodologic features that may yield inaccurate results and shift the focus from the potential influence of random error to a broader consideration of possible reasons for erroneous results.”

Savitz, 7 Reproductive Toxicology at 95.  The Respondents’ case would hardly have been helped by replacing a call for statistical significance with a call for confidence intervals, coupled with careful scrutiny of the results for possible sources of error.

“Regardless of what is taught in statistics courses or advocated by editorials, including the recent one in this journal, statistical tests are still routinely invoked as the primary criterion for assessing whether the hypothesized phenomenon has occurred.”

7 Reproductive Toxicology at 96 (internal citation omitted).

“No matter how carefully worded, ‘statistically significant’ misleadingly conveys notions of causality and importance.”

Id. at 99.  This last quotation really unravels the Respondents’ fatuous use of citations.  Of course, the Savitz article is generally quite inconsistent with the message that the Respondents wished to convey to the Supreme Court, but intellectual honesty required a fuller acknowledgement of Prof. Savitz’ thinking about the matter.

Finally, there are some limited cases in which the failure to obtain a conventionally statistically significant result is not fatal to an assessment of causality.  Such cases usually involve instances in which it is extremely difficult to find observational or experimental data to analyze for statistical significance, but other lines of evidence support the conclusion in a way that scientists accept.  These cases are much rarer than the Respondents imagine; although they may well exist, they do not detract much from Sir Ronald Fisher’s original conception of statistical significance:

“In the investigation of living beings by biological methods statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of the many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.”

Ronald A. Fisher, “The Statistical Method in Psychical Research,” 39 Proceedings of the Society for Psychical Research 189, 191 (1929). Note that Fisher was talking about experiments, not observational studies, and that he was hardly advocating a mechanical, thoughtless criterion of significance.

The Supreme Court’s decision in Castaneda illustrates how misleading statistical significance can be.  In a five-to-four decision, the Court held that a prima facie case of ethnic discrimination could be made out on the basis of statistical significance alone.  In dictum, the Court suggested that statistical evidence alone sufficed when the observed outcome was more than two or three standard deviations from the expected outcome.  Castaneda v. Partida, 430 U.S. 482, 496 n. 17 (1977).  The facts of Castaneda illustrate a compelling case in which the statistical significance observed was likely the result of confounding: reduced civic participation by poor, itinerant minorities in a Texas county in which the ethnic minority controlled political power and made up a majority of the petit jury that convicted Mr. Partida.
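For the curious, the “standard deviations” language in footnote 17 rests on a simple binomial model; the sketch below uses approximately the figures recited in the footnote, and the arithmetic, not the precision of the numbers, is the point.

```python
# The binomial arithmetic behind Castaneda's "two or three standard
# deviations" footnote.  The figures below are approximately those recited in
# footnote 17; the method, not the exact numbers, is what matters here.
from math import sqrt

p = 0.791        # Mexican-American share of the county population (null model)
n = 870          # grand jurors summoned over the period at issue
observed = 339   # Mexican-American grand jurors actually summoned

expected = n * p                      # about 688
sd = sqrt(n * p * (1 - p))            # binomial standard deviation, about 12
z = (expected - observed) / sd        # about 29 standard deviations

print(f"expected = {expected:.0f}, sd = {sd:.1f}, shortfall = {z:.0f} sd")
# A shortfall of roughly 29 standard deviations -- far past "two or three" --
# and yet, as noted above, the disparity may reflect confounding rather than
# discrimination.
```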