TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

De-Zincing the Matrixx

April 12th, 2011

Although the plaintiffs, in Matrixx Initiatives, Inc. v. Siracusano, generally were more accurate in defining statistical significance than the defendant, or than the so-called “statistical expert” amici (Ziliak and McCloskey), the plaintiffs’ brief went off the rails when it turned to discussing the requirements for proving causation.  Of course, the admissibility and sufficiency of evidence to show causation were not at issue in the case, but the plaintiffs got pulled down the rabbit hole dug by the defendant, in its bid to establish a legal bright-line rule about pleading.

Differential Diagnosis

In an effort to persuade the Court that statistical significance is not required, the plaintiffs/respondents threw science and legal principles to the wind.  They contended that statistical significance is not at all necessary to causal determinations because

“[c]ourts have recognized that a physician’s differential diagnosis (which identifies a likely cause of certain symptoms after ruling out other possibilities) can be reliable evidence of causation.”

Respondents’ Brief at 49.   Perhaps this is simply the Respondents’ naiveté, but it seems to suggest scienter to deceive. Differential diagnosis is not about etiology; it is about diagnosis, which rarely incorporates an assessment of etiology.  Even if the differentials were etiologies and not diagnoses, the putative causes in the differential must already be shown, independently, to be capable of causing the outcome in question. See, e.g., Tamraz v. Lincoln Electric Co., 620 F.3d 665 (6th Cir. 2010).  A physician cannot rule in an etiology in a specific person simply by positing it among the differentials, without independent, reliable evidence that the ruled in “specific cause” can cause the outcome in question, under the circumstances of the plaintiff’s exposure.  Furthermore, differential diagnosis or etiology is nothing more than a process of elimination to select a specific cause; it has nothing to do with statistical significance because it has nothing to do with general causation.

This error in the Respondents’ brief about differential diagnosis unfortunately finds its way into Justice Sotomayor’s opinion.

Daubert Denial and the Recrudescence of Ferebee

In their zeal, the Respondents go further than advancing a confusion between general and specific causation, and an erroneous view of what must be shown before a putative cause can be inserted in a set of differential (specific) causes.  They cite one of the most discredited cases in 20th century American law of expert witnesses:

Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C. Cir. 1984) (“products liability law does not preclude recovery until a ‘statistically significant’ number of people have been injured”).

Respondents’ Brief at 50.  This is not a personal, subjective opinion about this 1984 pre-Daubert decision.  Ferebee was wrongly decided when announced, and it was soon abandoned by the very court that issued the opinion.  It has been a derelict on the sea of evidence law for over a quarter of a century.  Citing to Ferebee, without acknowledging its clearly overruled status, raises an interesting issue about candor to the Court, and the responsibilities of counsel in trash picking in the dustbin of expert witness law.

Along with its apparent rejection of statistical significance, Ferebee is known for articulating an “anything goes” philosophy toward the admissibility and sufficiency of expert witnesses:

“Judges, both trial and appellate, have no special competence to resolve the complex and refractory causal issues raised by the attempt to link low-level exposure to toxic chemicals with human disease.  On questions such as these, which stand at the frontier of current medical and epidemiological inquiry, if experts are willing to testify that such a link exists, it is for the jury to decide to credit such testimony.”

Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1534 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).  Within a few years, the nihilism of Ferebee was severely limited by the court that decided the case:

“The question whether Bendectin causes limb reduction defects is scientific in nature, and it is to the scientific community that the law must look for the answer.  For this reason, expert witnesses are indispensable in a case such as this.  But that is not to say that the court’s hands are inexorably tied, or that it must accept uncritically any sort of opinion espoused by an expert merely because his credentials render him qualified to testify… . Whether an expert’s opinion has an adequate basis and whether without it an evidentiary burden has been met, are matters of law for the court to decide.”

Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 829 (D.C. Cir. 1988).

Of course, several important decisions intervened between Ferebee and Richardson.  In 1986, the Fifth Circuit expressed a clear message to trial judges that it would no longer continue to tolerate the anything-goes approach to expert witness opinions:

“We adhere to the deferential standard for review of decisions regarding the admission of testimony by experts.  Nevertheless, we … caution that the standard leaves appellate judges with a considerable task.  We will turn to that task with a sharp eye, particularly in those instances, hopefully few, where the record makes it evident that the decision to receive expert testimony was simply tossed off to the jury under a ‘let it all in’ philosophy.  Our message to our able trial colleagues:  it is time to take hold of expert testimony in federal trials.”

In re Air Crash Disaster, 795 F.2d 1230, 1234 (5th Cir. 1986) (emphasis added).

In the same intervening period between Ferebee and Richardson, Judge Jack Weinstein, a respected evidence scholar and well-known liberal judge, announced:

“The expert is assumed, if he meets the test of Rule 702, to have the skill to properly evaluate the hearsay, giving it probative force appropriate to the circumstances.  Nevertheless, the court may not abdicate its independent responsibilities to decide if the bases meet minimum standards of reliability as a condition of admissibility.  See Fed. Rule Ev. 104(a).  If the underlying data are so lacking in probative force and reliability that no reasonable expert could base an opinion on them, an opinion which rests entirely upon them must be excluded.”

In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1245 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

The notion that technical decisions had to be evidence based, not opinion based, emerged elsewhere as well. For example, in the context of applying statistics, the federal courts pronounced that the ipse dixit of parties and witnesses did not count for much:

“When a litigant seeks to prove his point exclusively through the use of statistics, he is borrowing the principles of another discipline, mathematics, and applying these principles to the law. In borrowing from another discipline, a litigant cannot be selective in which principles are applied. He must employ a standard mathematical analysis. Any other requirement defies logic to the point of being unjust. Statisticians do not simply look at two statistics, such as the actual and expected percentage of blacks on a grand jury, and make a subjective conclusion that the statistics are significantly different. Rather, statisticians compare figures through an objective process known as hypothesis testing.”

Moultrie v. Martin, 690 F.2d 1078, 1082 (4th Cir. 1982) (citations omitted).

Of course, after the District of Columbia Circuit decided Ferebee, the Supreme Court, in 1993, decided Daubert, followed by decisions in Joiner, Kumho Tire, and Weisgram.  In 2000, a new Rule of Evidence 702 took effect, which incorporated the learning and experience in judicial gatekeeping from a wide range of cases and principles.

Do the Respondents have a defense to having cited an overruled, superseded, discredited precedent in the highest federal Court?  Perhaps they would argue that they are in pari delicto with courts (Daubert-Deniers), which remarkably have ignored the status of Ferebee, and cited it.  See, e.g., Betz v. Pneumo Abex LLC, 998 A.2d 962, 981 (Pa. Super. 2010); McCarrell v. Hoffman-La Roche, Inc., 2009 WL 614484, *23 (N.J.Super.A.D. 2009).  See also Rubanick v. Witco Chemical Corp., 125 N.J. 421, 438-39 (1991)(quoting Ferebee before it was overruled by the Supreme Court, but after it was disregarded by the D.C. Circuit in Richardson).

Matrixx Galvanized – More Errors, More Comedy About Statistics

April 9th, 2011

Matrixx Initiatives is a rich case – rich in irony, comedy, tragedy, and error.  It is well worth further exploration, especially in terms of how this 9-0 decision was reached, what it means, and how it should be applied.

It pains me that the Respondents (plaintiffs) generally did a better job in explaining significance testing than did the Petitioner (defendant).

At least some of the Respondents’ definitional efforts are unexceptional.  For instance:

“Researchers use the term ‘statistical significance’ to characterize a result from a test that satisfies a particular kind of test designed to show that the result is unlikely to have occurred by random chance.  See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 83, 122 (Fed. Judicial Ctr., 2d ed. 2000) (“Reference Manual”).”

Brief for Respondents, at 38-39 (Nov. 5, 2010).

“The purpose of significance testing in this context is to assess whether two events (here, taking Zicam and developing anosmia) occur together often enough to make it sufficiently implausible that no actual underlying relationship exists between them.”

Id. at 39.   These definitions seem acceptable as far as they go, as long as we realize that the relationship that remains, when chance is excluded, may not be causal, and indeed, it may well be a false-positive relationship that results from bias or confounding.

Rather than giving one good, clear definition, the Respondents felt obligated to repeat and restate their definitions, and thus wandered into error:

“To test for significance, the researcher typically develops a ‘null hypothesis’ – e.g., that there is no relationship between using intranasal Zicam and the onset of burning pain and subsequent anosmia. The researcher then selects a threshold (the ‘significance level’) that reflects an acceptably low probability of rejecting a true null hypothesis – e.g., of concluding that a relationship between Zicam and anosmia exists based on observations that in fact reflect random chance.”

Id. at 39.  Perhaps the Respondents were using the “cooking frogs” approach.  As the practical wisdom has it, dropping a frog into boiling water risks having the frog jump out, but if you put a frog into a pot of warm water, and gradually bring the pot to a boil, you will have a cooked frog.  Here the Respondents repeat and morph their definition of statistical significance until they have brought it around to their rhetorical goal of confusing statistical significance with causation.  Note that now the definition is muddled, and the Respondents are edging closer towards claiming that statistical significance signals the existence of a “relationship” between Zicam and anosmia, when in fact, the statistical significance simply means that chance is not a likely explanation for the observations.  Whether a “relationship” exists requires further analysis, and usually a good deal more evidence.

“The researcher then calculates a value (referred to as p) that reflects the probability that the observed data could have occurred even if the null hypothesis were in fact true.”

Id. at 39-40 (emphasis in original). Well, this is almost true.  It’s not “even if,” but simply “if”; that is, the p-value is based upon the assumption that the null hypothesis is correct.  The “if” is not an incidental qualifier; it is essential to the definition of statistical significance.  “Even” here adds nothing but a slightly misleading rhetorical flourish.  And the p-value is not the probability that the observed data are correct; it is the probability of observing the data obtained, or data more extreme, assuming the null hypothesis is true.
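To make the conditional nature of the p-value concrete, here is a minimal sketch, in Python, of an exact one-sided p-value for a count of adverse events measured against an assumed background rate.  The function name and the numbers are mine, invented purely for illustration; nothing in the parties’ briefs presents this calculation.

    from math import comb

    def one_sided_p_value(n_users, background_rate, observed_cases):
        """P(X >= observed_cases | H0), with X ~ Binomial(n_users, background_rate).

        The null hypothesis supplies the probability model; the p-value is the
        chance, computed under that model, of data as extreme as, or more
        extreme than, what was observed.  It is not the probability that the
        null hypothesis is true.
        """
        prob_below = sum(
            comb(n_users, k)
            * background_rate ** k
            * (1 - background_rate) ** (n_users - k)
            for k in range(observed_cases)
        )
        return 1.0 - prob_below

    # Invented illustration: 10,000 users, an assumed background rate of
    # 1 in 10,000, and 4 observed cases.
    print(one_sided_p_value(10_000, 0.0001, 4))  # roughly 0.019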

The Respondents’/plaintiffs’ efforts at serious explication ultimately succumb to their hyperbolic rhetoric.  They explained that statistical significance may not be “practical significance,” which is true enough.  There are, of course, instances in which a statistically significant difference is not particularly interesting.  A large clinical trial, testing two cancer medications head to head, may show that one extends life expectancy by a week or two, but has a worse side-effect profile.  The statistically significant “better” drug may be refused a license by regulatory agencies, or be rejected by knowledgeable oncologists and sensible patients, who are more concerned about quality-of-life issues.

The Respondents are also correct that invoking statistical significance does not provide the simple, bright-line test the Petitioner desired.  Someone would still have to specify the level of alpha, the acceptable level of Type I error, and this would further require a specification of either a one-sided or two-sided test.  To be sure, the two-sided test, with an alpha of 5%, is generally accepted in the world of biostatistics and biomedical research.  Regulatory agencies, including the FDA, however, lower the standard to implement their precautionary principles and goals.  Furthermore, evaluation of statistical significance requires additional analysis to determine whether the observed deviation from expected is due to bias or confounding, or whether the statistical test has been unduly diluted by multiple comparisons, subgroup analyses, or data-mining techniques.
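The dilution worked by multiple comparisons can be shown with a back-of-the-envelope calculation.  The sketch below, which assumes independent tests (an idealization of my own, not anything in the briefs), shows how quickly the chance of at least one false positive grows when many comparisons are each run at a nominal alpha of five percent.

    def familywise_false_positive_rate(alpha, n_tests):
        """Chance of at least one false positive across n_tests independent
        comparisons, each run at significance level alpha, when every null
        hypothesis is in fact true."""
        return 1 - (1 - alpha) ** n_tests

    for n in (1, 5, 10, 20, 50):
        print(n, round(familywise_false_positive_rate(0.05, n), 3))
    # 1 -> 0.05, 5 -> 0.226, 10 -> 0.401, 20 -> 0.642, 50 -> 0.923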

Of course, statistical significance today usually occurs in conjunction with an assessment of “effect size,” usually through an analysis of a confidence interval around a point estimate of a risk ratio.  The Respondents’ complaint that the p-value does not convey the magnitude of the association is a bit off the mark, but not completely illegitimate.  For instance, if there were a statistically significant finding of anosmia from Zicam use, in the form of an elevated risk that was itself small, the FDA might well decide that the risk was manageable with a warning to users to discontinue the medication if they experienced a burning sensation upon use.
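For readers who want to see how a risk ratio and its confidence interval are typically computed, here is a short sketch using the standard large-sample, log-scale method.  The counts are invented; they are not data from the Zicam litigation.

    from math import exp, log, sqrt

    def relative_risk_ci(exposed_cases, exposed_total,
                         unexposed_cases, unexposed_total, z=1.96):
        """Point estimate of the risk ratio with an approximate 95% confidence
        interval, computed on the log scale."""
        rr = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)
        se_log_rr = sqrt(
            1 / exposed_cases - 1 / exposed_total
            + 1 / unexposed_cases - 1 / unexposed_total
        )
        lower = exp(log(rr) - z * se_log_rr)
        upper = exp(log(rr) + z * se_log_rr)
        return rr, lower, upper

    # Invented counts: 30 cases among 1,000 exposed, 15 among 1,000 unexposed.
    print(relative_risk_ci(30, 1000, 15, 1000))
    # roughly (2.0, 1.08, 3.69): a doubling of risk, i.e., a 100% increase,
    # with a lower confidence bound that stays above 1.0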

The Respondents, along with their two would-be “statistical expert” amici, misrepresent the substance of many of the objections to statistical significance in the medical literature.  A telling example is the Respondents’ citation to an article by Professor David Savitz:

David A. Savitz, “Is Statistical Significance Testing Useful in Interpreting Data?” 7 Reproductive Toxicology 95, 96 (1993) (“[S]tatistical significance testing is not useful in the analysis or interpretation of scientific research.”).

Id. at 52, n. 40.

More complete quotations from Professor Savitz’ article, however, reveal a more nuanced, and rather different, message:

“Although P values and statistical significance testing have become entrenched in the practice of biomedical research, their usefulness and drawbacks should be reconsidered, particularly in observational epidemiology. The central role for the null hypothesis, assuming an infinite number of replications, and the dichotomization of results as positive or negative are argued to be detrimental to the proper design and evaluation of research. As an alternative, confidence intervals for estimated parameters convey some information about random variation without several of these limitations. Elimination of statistical significance testing as a decision rule would encourage those who present and evaluate research to more comprehensively consider the methodologic features that may yield inaccurate results and shift the focus from the potential influence of random error to a broader consideration of possible reasons for erroneous results.”

Savitz, 7 Reproductive Toxicology at 95.  Respondents’ case would hardly have been helped by replacing a call for statistical significance with a call for confidence intervals, along with careful scrutiny of possible reasons for erroneous results.

“Regardless of what is taught in statistics courses or advocated by editorials, including the recent one in this journal, statistical tests are still routinely invoked as the primary criterion for assessing whether the hypothesized phenomenon has occurred.”

7 Reproductive Toxicology at 96 (internal citation omitted).

“No matter how carefully worded, “statistically significant” misleadingly conveys notions of causality and importance.”

Id. at 99.  This last quotation really unravels the Respondents’ fatuous use of citations.  Of course, the Savitz article is generally quite inconsistent with the message that the Respondents wished to convey to the Supreme Court, but intellectual honesty required a fuller acknowledgement of Prof. Savitz’ thinking about the matter.

Finally, there are some limited cases in which the failure to obtain a conventionally statistically significant result is not fatal to an assessment of causality.  Such cases usually involve instances in which it is extremely difficult to find observational or experimental data to analyze for statistical significance, but other lines of evidence support the conclusion in a way that scientists accept.  Although these cases are much rarer than the Respondents imagine, they may well exist, but they do not detract much from Sir Ronald Fisher’s original conception of statistical significance:

“In the investigation of living beings by biological methods statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of the many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.”

Ronald A. Fisher, “The Statistical Method in Psychical Research,” 39 Proceedings of the Society for Psychical Research 189, 191 (1929). Note that Fisher was talking about experiments, not observational studies, and that he hardly was advocating a mechanical, thoughtless criterion of significance.

The Supreme Court’s decision in Castaneda illustrates how misleading statistical significance can be.  In a five-to-four decision, the Court held that a prima facie case of ethnic discrimination could be made out on the basis of statistical significance alone.  In dictum, the Court suggested that statistical evidence alone sufficed when the observed outcome was more than two or three standard deviations from the expected outcome.  Castaneda v. Partida, 430 U.S. 482, 496 n. 17 (1977).  The facts of Castaneda illustrate a compelling case in which the statistical significance observed was likely the result of confounding effects of reduced civic participation by poor, itinerant minorities, in a Texas county in which the ethnic minority controlled political power, and made up a majority of the petit jury that convicted Mr. Partida.
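The “two or three standard deviations” benchmark from Castaneda’s footnote 17 rests on a simple binomial model.  The sketch below shows the arithmetic with round numbers of my own choosing; they are not the figures from the case.

    from math import sqrt

    def standard_deviations_from_expected(n_selected, population_proportion,
                                          observed_count):
        """Binomial model used in jury-composition cases: how many standard
        deviations separate the observed count of minority jurors from the
        count expected if selection were random draws from the eligible
        population."""
        expected = n_selected * population_proportion
        sd = sqrt(n_selected * population_proportion * (1 - population_proportion))
        return (observed_count - expected) / sd

    # Hypothetical: 500 grand jurors drawn from a population that is 60%
    # minority, with only 200 minority jurors actually selected.
    print(standard_deviations_from_expected(500, 0.60, 200))  # about -9.1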

The Matrixx – A Comedy of Errors

April 6th, 2011

1. Incubi Curiae

As I noted in the Matrixx Unloaded, Justice Sotomayor’s scholarship, in discussing case law under Federal Rule of Evidence 702, was seriously off base.  Of course, Matrixx Initiatives was only a pleading case, and so there was no real reason to consider rules of admissibility or sufficiency, such as Rule 702.

Fortunately, Justice Sotomayor avoided further embarrassment by not discussing the fine details of significance or hypothesis testing.  Not so the two so-called “statistics experts” who submitted an amicus brief.

Consider the following statement by McCloskey and Ziliak, about adverse event reports (AER) and statistical significance.

“Suppose that a p-value for a particular test comes in at 9 percent.  Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).”

Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, at 18 (Nov. 18, 2010), 2010 WL 4657930 (U.S.) (emphasis added).

Of course, the whole enterprise of using statistical significance to evaluate AERs is suspect because there is no rate, either expected or observed.  A rate could be estimated from the number of AERs reported per the total number of persons using the medication in some unit of time.  Pharmacoepidemiologists sometimes do engage in such speculative blue-sky enterprises to determine whether a “signal” may have been generated by the AERs.  Even if a denominator were implied, and significance testing used, it would be incorrect to treat the association as causal.  Our statistics experts here have committed several serious mistakes; they have

  • treated the AERs as a rate, when they are simply a count;
  • treated the AERs as an observed rate that can be evaluated against a null hypothesis of no increase in rate, when there is no expected rate for the event in question; and
  • treated the pseudo-statistical analysis as if it provided a basis for causal assessment, when at best it would be a very weak observational study that raised an hypothesis for study.

Now that would be, and should be, enough error for any two “statistics experts” in a given day, and we might have hoped that these putative experts would have thought through their ideas before imposing themselves upon a very busy Court.  But there is another mistake, which is even more stunning for having come from self-styled “statistics experts.”  Their derivation of a probability (or an odds statement) that the null hypothesis of no increased rate of AERs is false is statistically incorrect.  A p-value is based upon the assumption that the null hypothesis is true, and it measures the probability of having obtained data that diverge from the expected value as much as, or more than, the data seen in the study.  The p-value is thus a conditional probability statement of the probability of the data given the hypothesis.  As every first-year statistics student learns, you cannot reverse the order of the conditional probability statement without committing a transpositional fallacy.  In other words, you cannot obtain a statement of the probability of the hypothesis given the data from the probability of the data given the hypothesis.  Bayesians, of course, point to this limitation as a “failing” of frequentist statistics, but the limitation cannot be overcome by semantic fiat.
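The gap between the probability of the data given the hypothesis and the probability of the hypothesis given the data can be made vivid with Bayes’ theorem.  In the sketch below, the prior and the likelihood under the alternative are invented; the only point is that a probability of 0.09 for the data under the null does not, by itself, yield 10-to-1 odds against the null.

    def posterior_probability_of_null(prior_null, p_data_given_null,
                                      p_data_given_alt):
        """Bayes' theorem: P(H0 | data).  Turning P(data | H0) into
        P(H0 | data) requires a prior for H0 and a likelihood for the data
        under the alternative, neither of which a p-value supplies."""
        joint_null = prior_null * p_data_given_null
        joint_alt = (1 - prior_null) * p_data_given_alt
        return joint_null / (joint_null + joint_alt)

    # Treating 0.09, loosely, as the probability of the data under the null,
    # the posterior probability of the null depends entirely on the other,
    # invented inputs.
    print(posterior_probability_of_null(0.5, 0.09, 0.30))  # about 0.23
    print(posterior_probability_of_null(0.9, 0.09, 0.30))  # about 0.73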

No Confidence in Defendant’s Confidence Intervals

Lest anyone think I am picking on the “statistics experts,” consider the brief filed by Matrixx Initiatives.  In addition to the whole crazy business of relying upon statistical significance in the absence of a study that used a statistical test, there are the two following howlers.  You would probably think that the company putting forward a “no statistical significance” defense would want to state statistical concepts clearly, but take a look at the Petitioner’s brief:

“Various analytical methods can be used to determine whether data reflect a statistically significant result. One such method, calculating confidence intervals, is especially useful for epidemiological analysis of drug safety, because it allows the researcher to estimate the relative risk associated with taking a drug by comparing the incidence rate of an adverse event among a sample of persons who took a drug with the background incidence rate among those who did not. Dividing the former figure by the latter produces a relative risk figure (e.g., a relative risk of 2.0 indicates a 50% greater risk among the exposed population). The researcher then calculates the confidence interval surrounding the observed risk, based on the preset confidence level, to reflect the degree of certainty that the “true” risk falls within the calculated interval. If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate. Green et al., Reference Guide on Epidemiology, at 360-61. For further discussion, see id. at 348-61.”

Matrixx Initiatives Brief at p. 36 n. 18 (emphasis added).  Both emphasized passages are wrong: the parenthetical equating a relative risk of 2.0 with a 50% greater risk, and the claim that a confidence interval dipping below 1.0 means it is “equally probable” that the actual rate matches the background rate.  The Federal Judicial Center’s Reference Manual does not support either statement.  A relative risk of 2.0 represents a 100% increase in risk, not 50%, although Matrixx Initiatives may have been thinking of a very different risk metric, the attributable risk, which would be 50% when the relative risk is 2.0.

The second emphasized statement is much worse, because no possible word choice could make it a correct understanding of a confidence interval (CI).  The CI does not permit us to make a direct probability statement about the truth of any point within the interval.  Although the interval does provide some insight into the true value of the parameter, the meaning of the confidence interval must be understood operationally.  For a 95% interval, if 100 samples were taken and 100(1 – α) percent CIs constructed, we would expect 95 of the intervals to cover, or include, the true value of the parameter.  (And α is our measure of Type I error, or the probability of false positives.)
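The operational reading can be checked by simulation.  Here is a minimal sketch, with a made-up true proportion and sample size, that draws repeated samples, builds a 95% (Wald) interval from each, and counts how often the intervals cover the true value.

    import random
    from math import sqrt

    def coverage_of_95_percent_intervals(true_proportion=0.3, sample_size=200,
                                         n_repetitions=10_000, seed=1):
        """Fraction of repeated-sample 95% intervals that cover the true
        parameter value; a statement about long-run coverage, not a
        probability statement about any single interval."""
        rng = random.Random(seed)
        covered = 0
        for _ in range(n_repetitions):
            successes = sum(rng.random() < true_proportion
                            for _ in range(sample_size))
            p_hat = successes / sample_size
            half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / sample_size)
            if p_hat - half_width <= true_proportion <= p_hat + half_width:
                covered += 1
        return covered / n_repetitions

    print(coverage_of_95_percent_intervals())  # close to 0.95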

To realize how wrong the Petitioner’s brief is, consider the following example.  The observed relative risk is 10, but it is not statistically significant on a two-tailed test of significance, with α set at 0.05.  Suppose further that the two-sided 95% confidence interval around the observed relative risk is (0.9 to 18).  Matrixx Initiatives asserts:

“If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate.”

The Petitioner would thus have the Court believe that, in the example of a relative risk of 10 with the CI noted above, the result should be interpreted to mean that it is equally probable that the true value is 1.0 or less.  This is statistical silliness.
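One way to see the silliness is to compute, for various candidate values of the relative risk, the p-value each would receive given a reported estimate and interval.  The sketch below uses an invented, internally consistent example (a relative risk of 4.0 with a 95% interval of 0.96 to 16.7), because the figures in the paragraph above are offered only for rhetorical illustration.  Values near the point estimate are far more compatible with the data than values at the ends of the interval; they are anything but “equally probable.”

    from math import erf, log, sqrt

    def normal_cdf(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    def p_value_for_candidate_rr(point_estimate, ci_lower, ci_upper, candidate_rr):
        """Two-sided p-value for the hypothesis that the true relative risk
        equals candidate_rr, recovering the standard error from a reported
        95% interval assumed to be log(estimate) +/- 1.96 SE."""
        se = (log(ci_upper) - log(ci_lower)) / (2 * 1.96)
        z = (log(point_estimate) - log(candidate_rr)) / se
        return 2 * (1 - normal_cdf(abs(z)))

    # Invented example: observed RR = 4.0, 95% CI (0.96, 16.7).
    for candidate in (1.0, 2.0, 4.0, 10.0, 16.7):
        print(candidate,
              round(p_value_for_candidate_rr(4.0, 0.96, 16.7, candidate), 3))
    # RR = 1.0 and RR = 16.7 sit at the edges (p near 0.05); RR = 4.0 is
    # fully compatible (p = 1.0); the candidates are not equally supported.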

I have collected below some statements about the CI from well-known statisticians, as an aid to avoiding such distortions of statistical concepts as we see in the Matrixx.


“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval …”

Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)

“Nevertheless, the difference between population means is much more likely to be near to the middle of the confidence interval than towards the extremes. Although the confidence interval is wide, the best estimate of the population difference is 6.0 mm Hg, the difference between the sample means.

* * *

“The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word “limits” suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the “limits” will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.”

Martin Gardner & Douglas Altman, “Confidence intervals rather than P values: estimation rather than hypothesis testing,” 292 Brit. Med. J. 746, 748 (1986)

“The main purpose of confidence intervals is to indicate the (im)precision of the sample study estimates as population values. Consider the following points for example: a difference of 20% between the percentages improving in two groups of 80 patients having treatments A and B was reported, with a 95% confidence interval of 6% to 34%. Firstly, a possible difference in treatment effectiveness of less than 6% or of more than 34% is not excluded by such values being outside the confidence interval – they are simply less likely than those inside the confidence interval. Secondly, the middle half of the confidence interval (13% to 27%) is more likely to contain the population value than the extreme two quarters (6% to 13% and 27% to 34%) – in fact the middle half forms a 67% confidence interval. Thirdly, regardless of the width of the confidence interval, the sample estimate is the best indicator of the population value – in this case a 20% difference in treatment response.”

Martin Gardner & Douglas Altman, “Estimating with confidence,” 296 Brit. Med. J. 1210 (1988)

“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible.”

“A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”

Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008)

“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”

Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)

The Matrixx Oversold

April 4th, 2011

“Now their view is the rule of law: Statistical significance is neither necessary nor sufficient for proving a commercial or scientific result.”

Statistics Experts

The perverse rhetorical distortions of the Matrixx case have begun.  The quote above, from the website of one of the amicus brief authors, will probably not be the last distortion or perversion of scientific method or of the holding of Matrixx Initiatives, Inc. v. Siracusano, 2011 WL 977060 (U.S. March 22, 2011).  Still, the distortion of the holding raises some interesting questions about who these would-be friends of the Court are, and why they would misrepresent the case in a way that any first-year law student would see was incorrect.  What is the agenda of these authors?

I had never heard of Deirdre N. McCloskey or Stephen T. Ziliak before the Matrixx case.  After the decision was delivered on March 22, 2011, I started to look at the amicus briefs.  McCloskey and Ziliak filed one such brief, on behalf of the respondents.  Their brief was styled “Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents.”  The more I considered this amicus brief, the more troubling I found it, both procedurally and substantively.

1. No statistical organization (such as the American Statistical Association) joined this amicus brief, and none of the many statistician-lawyers who frequently contribute amicus briefs on quantitative issues was associated with their effort.  This was the first peculiarity of the McCloskey-Ziliak brief, which attracted my attention only after the Supreme Court issued its opinion in the Matrixx case.

2. The second remarkable fact about these amici is that they are not statisticians or statistics professors, despite titling their brief as that of “statistics experts.”  According to his website, Stephen T. Ziliak is a Professor of Economics, in the department of economics, at Roosevelt University (Chicago).  His doctorate was in economics.  Deirdre N. McCloskey is a professor of economics, history, English, and communication at the University of Illinois (Chicago).  Of course, this is not to say that these professors do not have expertise in statistics.  Both authors have written on the history of statistics, but the title of their brief seems a bit misleading.  Why would they not say that they were economists?  I, for one, found this ruse peculiarly misleading for a brief filed in our highest Court.

3. The third curious fact is the incestuous nature of the brief’s authorship.  McCloskey was Ziliak’s doctoral supervisor.  Again, there is nothing wrong with a mentor and his or her student joining together in a project such as this, but the work suggests an intellectual inbreeding that was, well, peculiar in that no one else with putative substantive expertise was involved in the amicus brief.

4.  Some of the McCloskey-Ziliak brief is unexceptional exposition about the meaning of Type I and Type II errors, and hypothesis testing.  The Supreme Court really did not need this information, which could readily be found in the Federal Judicial Center’s Reference Manual on Scientific Evidence.  Some of the brief, however, is peculiarly tendentious nonsense, which I will explore in follow-up posts.

5. The Supreme Court, in its opinion, did not dignify this amicus brief with a citation, but the amici nonetheless appear to have a delusionally inflated view of their influence.  Now there is nothing at all peculiar about such delusions in academia.  A short trip to Ziliak’s and McCloskey’s websites revealed many references to their efforts on the brief, including their (inflated) assessment of their influence.  McCloskey’s website goes further, with what appears to be a press release, in which she claims, without citation or support, that “their book and some of their articles did affect the case.”

6. The press release ends with the harrumphing noted above:

“Now their [McCloskey and Ziliak’s] view is the rule of law: ‘Statistical significance is neither necessary nor sufficient for proving a commercial or scientific result.’”

This statement, of course, is not the rule of law; nor is it the holding of the case.  The statement is so clearly wrong that the reader has to wonder about the authors’ academic pretenses, qualifications, and claimed disinterest in the proceedings.  Rhetorical excess is no stranger in the halls of academia, but our learned professors appear to have jumped the rhetorical shark.

This amicus brief certainly got my attention, and it raises serious questions about who files amicus briefs, and whether they distort the appellate process.  In a follow-up post, I will look at some of the substantive opinions put forward by McCloskey and Ziliak.  Like the curious distortions of their credentials, the misleading assessment of their own influence, and the erroneous conclusion about the Matrixx holding, the substantive claims and statements by these authors, in their amicus brief, are dubious.  Their claims are worth exploring as a road map to how other irresponsible advocates may use and misuse the Matrixx.