1. Incubi Curiae
As I noted in the Matrixx Unloaded, Justice Sotomayor’s scholarship, in discussing case law under Federal Rule of Evidence 702, was seriously off base. Of course, Matrixx Initiatives was only a pleading case, and so there was no real reason to consider rules of admissibility or sufficiency, such as Rule 702.
Fortunately, Justice Sotomayor avoided further embarrassment by not discussing the fine details of significance or hypothesis testing. Not so the two so-called “statistics experts” who submitted an amicus brief.
Consider the following statement by McCloskey and Ziliak, about adverse event reports (AER) and statistical significance.
“Suppose that a p-value for a particular test comes in at 9 percent. Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).”
Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, at 18 (Nov. 18, 2010), 2010 WL 4657930 (U.S.) (emphasis added).
Of course, the whole enterprise of using statistical significance to evaluate AERs is suspect because there is no rate, either expected or observed. A rate could be estimated from the number of AERs reported per total number of persons using the medication in some unit of time. Pharmacoepidemiologists sometimes do engage in such speculative blue-sky enterprises to determine whether a “signal” may have been generated by the AERs. Even if a denominator were implied, and significance testing used, it would be incorrect to treat the association as causal. Our statistics experts here have committed several serious mistakes; they have
- treated the AERs as a rate, when they are simply a count;
- treated the AERs as an observed rate that can be evaluated against a null hypothesis of no increase in rate, when there is no expected rate for the event in question; and
- treated the pseudo-statistical analysis as if it provided a basis for causal assessment, when at best it would be a very weak observational study that raised an hypothesis for study.
Now that would be, and should be, enough error for any two “statistics experts” in a given day, and we might have hoped that these putative experts would have thought through their ideas before imposing themselves upon a very busy Court. But there is another mistake, which is even more stunning for having come from self-styled “statistics experts.” Their derivation of a probability (or an odds statement) that the null hypothesis of no increased rate of AERs is false is statistically incorrect. A p-value is based upon the assumption that the null hypothesis is true, and it measures the probability of having obtained data at least as far from the expected value as the data seen in the study. The p-value is thus a conditional probability statement: the probability of the data given the hypothesis. As every first-year statistics student learns, you cannot reverse the order of the conditional probability statement without committing a transpositional fallacy. In other words, you cannot obtain a statement of the probability of the hypothesis given the data from the probability of the data given the hypothesis. Bayesians, of course, point to this limitation as a “failing” of frequentist statistics, but the limitation cannot be overcome by semantic fiat.
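For readers who want to see the transpositional fallacy in numbers, here is a minimal sketch in Python. The function name, the prior probabilities, and the probabilities of the result under the alternative hypothesis are all hypothetical values chosen for illustration; none of them comes from the amicus brief or the record. The sketch applies Bayes’ theorem to the event “a result at least this extreme” and shows that the same p-value of 0.09 is compatible with very different probabilities that the null hypothesis is true.

```python
def posterior_null(p_extreme_given_null, p_extreme_given_alt, prior_null):
    """Bayes' theorem: P(null | result at least this extreme)."""
    numerator = p_extreme_given_null * prior_null
    denominator = numerator + p_extreme_given_alt * (1.0 - prior_null)
    return numerator / denominator


P_VALUE = 0.09  # P(result at least this extreme | null), as in the brief's example

# The brief's arithmetic: 91 percent divided by 9 percent.
naive_odds = (1.0 - P_VALUE) / P_VALUE

# Hypothetical scenarios (assumed for illustration only):
# (label, prior probability of the null, P(result this extreme | alternative))
scenarios = [
    ("skeptical prior, weak study", 0.90, 0.20),
    ("even prior, moderate study", 0.50, 0.50),
    ("credulous prior, strong study", 0.10, 0.90),
]

print(f"brief's 'odds' from the p-value alone: {naive_odds:.1f} to 1")
for label, prior_null, p_alt in scenarios:
    post = posterior_null(P_VALUE, p_alt, prior_null)
    print(f"{label}: P(null | data) = {post:.3f}")
```

Under these assumed inputs, the probability that the null hypothesis is true runs from about 1 percent to about 80 percent. The p-value alone cannot pick among them, which is why the brief’s “10-to-1 odds” does not follow.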
No Confidence in Defendant’s Confidence Intervals
Lest anyone think I am picking on the “statistics experts,” consider the brief filed by Matrixx Initiatives. In addition to the whole crazy business of relying upon statistical significance in the absence of a study that used a statistical test, there are the two following howlers. You would probably think that the company putting forward a “no statistical significance” defense would want to state statistical concepts clearly, but take a look at the Petitioner’s brief:
“Various analytical methods can be used to determine whether data reflect a statistically significant result. One such method, calculating confidence intervals, is especially useful for epidemiological analysis of drug safety, because it allows the researcher to estimate the relative risk associated with taking a drug by comparing the incidence rate of an adverse event among a sample of persons who took a drug with the background incidence rate among those who did not. Dividing the former figure by the latter produces a relative risk figure (e.g., a relative risk of 2.0 indicates a 50% greater risk among the exposed population). The researcher then calculates the confidence interval surrounding the observed risk, based on the preset confidence level, to reflect the degree of certainty that the “true” risk falls within the calculated interval. If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate. Green et al., Reference Guide on Epidemiology, at 360-61. For further discussion, see id. at 348-61.”
Matrixx Initiatives Brief at p. 36 n. 18 (emphasis added). Both passages in bold are wrong. The Federal Judicial Center’s Reference Manual does not support the bold statements. A relative risk of 2.0 represents a 100% increase in risk, not 50%, although Matrixx Initiatives may have been thinking of a very different risk metric – the attributable risk, which would be 50% when the relative risk is 2.0.
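The arithmetic can be checked with a short sketch in Python. The function names are my own, and “attributable fraction among the exposed” is the standard quantity (RR – 1)/RR, which I take to be what is meant above by attributable risk.

```python
def percent_increase(relative_risk):
    """Percent increase in risk among the exposed, relative to background."""
    return (relative_risk - 1.0) * 100.0


def attributable_fraction_exposed(relative_risk):
    """Percent of the exposed group's risk attributable to exposure: (RR - 1) / RR."""
    return (relative_risk - 1.0) / relative_risk * 100.0


for rr in (1.5, 2.0):
    print(
        f"RR = {rr}: {percent_increase(rr):.0f}% greater risk, "
        f"{attributable_fraction_exposed(rr):.0f}% attributable fraction"
    )
```

A relative risk of 2.0 yields a 100% increase and a 50% attributable fraction; a 50% increase in risk corresponds to a relative risk of 1.5.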
The second bold statement is much worse because there is no possible word choice that might make the brief’s statement a correct account of a confidence interval (CI). The CI does not permit us to make a direct probability statement about the truth of any point within the interval. Although the interval does provide some insight into the true value of the parameter, the meaning of the confidence interval must be understood operationally. For a 95% interval, if 100 samples were taken and a 100(1 – α) percent CI constructed from each, we would expect about 95 of the intervals to cover, or include, the true value of the parameter. (And α, here 0.05, is our measure of Type I error, or probability of false positives.)
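The operational meaning can be illustrated with a small simulation, sketched here in Python. Everything in it is an assumption made for illustration: the function name, a normally distributed measurement with a known standard deviation, the sample size, and the number of repetitions. It repeatedly draws a sample, builds a 95% interval around the sample mean, and counts how often the interval covers the true mean.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=10.0, sigma=2.0, n=30, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated 100(1 - alpha)% intervals that cover the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 when alpha = 0.05
    half_width = z * sigma / n ** 0.5            # known-sigma (z) interval
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = fmean(sample)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / reps


print(f"share of intervals covering the true mean: {coverage():.3f}")
```

The share should land close to 0.95. The 95% describes the long-run performance of the procedure; it says nothing about how probability is spread across the values inside any one interval.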
To realize how wrong the Petitioner’s brief is, consider the following example. The observed relative risk is 10, but it is not statistically significant on a two-tailed test of significance, with α set at 0.05. Suppose further that the two-sided 95% confidence interval around the observed relative risk is (0.9 to 18). Matrixx Initiatives asserts:
“If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate.”
The Petitioner would thus have the Court believe that with the example of a relative risk of 10, with the CI noted above, the result should be interpreted to mean that it is equally probable that the true value is 1.0 or less. This is statistical silliness.
I have collected some statements about the CI, from well-known statisticians, below, as an aid to avoid such distortions of statistical concepts, as we see in the Matrixx.
“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval …”
Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)
“Nevertheless, the difference between population means is much more likely to be near to the middle of the confidence interval than towards the extremes. Although the confidence interval is wide, the best estimate of the population difference is 6.0 mm Hg, the difference between the sample means.
* * *
“The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word “limits” suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the “limits” will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.”
Martin Gardner & Douglas Altman, “Confidence intervals rather than P values: estimation rather than hypothesis testing,” 292 Brit. Med. J. 746, 748 (1986)
“The main purpose of confidence intervals is to indicate the (im)precision of the sample study estimates as population values. Consider the following points for example: a difference of 20% between the percentages improving in two groups of 80 patients having treatments A and B was reported, with a 95% confidence interval of 6% to 34%. Firstly, a possible difference in treatment effectiveness of less than 6% or of more than 34% is not excluded by such values being outside the confidence interval – they are simply less likely than those inside the confidence interval. Secondly, the middle half of the confidence interval (13% to 27%) is more likely to contain the population value than the extreme two quarters (6% to 13% and 27% to 34%) – in fact the middle half forms a 67% confidence interval. Thirdly, regardless of the width of the confidence interval, the sample estimate is the best indicator of the population value – in this case a 20% difference in treatment response.”
Martin Gardner & Douglas Altman, “Estimating with confidence,” 296 Brit. Med. J. 1210 (1988)
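Gardner and Altman’s numbers can be reproduced with a short sketch in Python. It assumes a normal approximation for the difference in percentages; the function names are mine, and the standard error is backed out of the reported 95% interval rather than taken from their paper. The sketch computes the confidence level of the middle half of the interval, and the two-sided p-value for several hypothesized differences lying inside the interval.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal-approximation interval."""
    z = Z.inv_cdf(0.5 + level / 2.0)
    return (upper - lower) / (2.0 * z)


def confidence_level(lower, upper, se):
    """Confidence level of a symmetric interval (lower, upper) given the SE."""
    z = (upper - lower) / (2.0 * se)
    return 2.0 * Z.cdf(z) - 1.0


def two_sided_p(estimate, hypothesized, se):
    """Two-sided p-value for a hypothesized value, given the estimate and SE."""
    z = abs(estimate - hypothesized) / se
    return 2.0 * (1.0 - Z.cdf(z))


estimate, lower, upper = 20.0, 6.0, 34.0  # figures quoted by Gardner & Altman
se = se_from_ci(lower, upper)

print(f"middle half (13% to 27%) level: {confidence_level(13.0, 27.0, se):.2f}")
for value in (6.0, 13.0, 20.0, 27.0, 34.0):
    print(f"hypothesized difference {value:.0f}%: p = {two_sided_p(estimate, value, se):.3f}")
```

The middle half comes out at about 67%, as the authors state. The p-values rise from 0.05 at the limits to 1.0 at the point estimate, which is the sense in which values within an interval are not equally compatible with the data.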
“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible.”
“A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”
Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008)
“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data. But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”
Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)