TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Misplaced Reliance On Peer Review to Separate Valid Science From Nonsense

August 14th, 2011

A recent editorial in the Annals of Occupational Hygiene is a poignant reminder of how oversold peer review is in the context of judicial gatekeeping of expert witnesses.  Editor Trevor Ogden offers some cautionary suggestions:

“1. Papers that have been published after proper peer review are more likely to be generally right than ones that have not.

2. However, a single study is very unlikely to take everything into account, and peer review is a very fallible process, and it is very unwise to rely on just one paper.

3. The question should be asked, has any published correspondence dealt with these papers, and what do other papers that cite them say about them?

4. Correspondence will legitimately give a point of view and not consider alternative explanations in the way a paper should, so peer review does not necessarily validate the views expressed.”

Trevor Ogden, “Lawyers Beware! The Scientific Process, Peer Review, and the Use of Papers in Evidence,” 55 Ann. Occup. Hyg. 689, 691 (2011).

Ogden’s conclusions, however, are misleading.  For instance, he suggests that peer-reviewed papers are better than non-peer-reviewed papers, but by how much?  What is the empirical evidence for Ogden’s assertion?  In his editorial, Ogden gives an anecdote of a scientific report submitted to a political body, and comments that this report would not have survived peer review.  But an anecdote is not a datum.  Worse still, a paper rejected by the peer reviewers at Ogden’s journal will eventually show up in another publication.  Courts make little distinction among journals for purposes of rating the value of peer review.

Of course it is unwise, and perhaps scientifically unsound, as Ogden points out, to rely upon just one paper, but the legal process permits it.  Worse yet, litigants, whether plaintiff or defendant, are often allowed to pick out isolated findings from a variety of studies, and throw them together as if that were science. “[O]n fait la science avec des faits comme une maison avec des pierres; mais une accumulation de faits n’est pas plus une science qu’un tas de pierres n’est une maison.” (“Science is built with facts as a house is built with stones; but an accumulation of facts is no more a science than a heap of stones is a house.”) Henri Poincaré, La Science et l’Hypothèse (1905) (chapter 9, Les Hypothèses en Physique).

As for letters to the editor, sure, courts and litigants should pay attention to them, but as Ogden notes, these writings are themselves not peer reviewed, or not peer reviewed with very much analytical rigor.  The editing of letters raises additional concerns of imperious editors who silence some points of view to the benefit of others. Most journals have space only for a few letters, and unpopular but salient points of view can go unreported. Furthermore, many scientists will not write letters to the editors, even when the published article is terribly wrong in its methods, data analyses, conclusions, or discussion, because in most journals the authors typically have the last word in the form of a reply, which often is self-serving and misleading, with immunity from further criticism.

Ogden details the limitations of peer review, but he misses the significance of how these limitations play out in the legal arena.

Limitations and Failures of Peer Review

For instance, Ogden acknowledges that peer review fails to remove important errors from published articles. Here he does provide empirical evidence.  S. Schroter, N. Black, S. Evans, et al., “What errors do peer reviewers detect, and does training improve their ability to detect them?” 101 J. Royal Soc’y Med. 507 (2008) (describing an experiment in which manuscripts were seeded with known statistical errors (9 major and 5 minor) and sent to 600 reviewers; each reviewer missed, on average, more than 6 of the 9 major errors).  Ogden tells us that the empirical evidence suggests that “peer review is a coarse and fallible filter.”

This is hardly a ringing endorsement.

Surveys of the medical literature have found the prevalence of statistical errors ranges from 30% to 90% of papers.  See, e.g., Douglas Altman, “Statistics in medical journals: developments in the 1980s,” 10 Stat. Med. 1897 (1991); Stuart J. Pocock, M.D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials. A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987); S.M. Gore, I.G. Jones, E.C. Rytter, “Misuse of statistical methods: critical assessment of articles in the BMJ from January to March 1976,” 1 Brit. Med. J. 85 (1977).

Without citing any empirical evidence, Ogden notes that peer review is not well designed to detect fraud, especially when the data are presented to look plausible.  Despite the lack of empirical evidence, the continuing saga of fraudulent publications coming to light supports Ogden’s evaluation. Peer reviewers rarely have access to underlying data.  In the silicone gel breast implant litigation, for instance, plaintiffs relied upon a collection of studies that looked very plausible from their peer-reviewed publications.  Only after the defense discovered misrepresentations and spoliation of data did the patent unreliability and invalidity of the studies become clear to reviewing courts.  The rate of retractions of published scientific articles appears to have increased, although the secular trend may have resulted from increased surveillance and scrutiny of the published literature for fraud.  Daniel S. Levine, “Fraud and Errors Fuel Research Journal Retractions,” (August 10, 2011); Murat Cokol, Fatih Ozbay, and Raul Rodriguez-Esteban, “Retraction rates are on the rise,” 9 European Molecular Biol. Reports 2 (2008);  Orac, “Scientific fraud and journal article retractions” (Aug. 12, 2011).

The fact is that peer review is not very good in detecting fraud or error in scientific work.  Ultimately, the scientific community must judge the value of the work, but in some niche areas, only “the acolytes” are paying attention.  These acolytes cite to one another, applaud each other’s work, and often serve as peer reviewers of the work in the field because editors see them as the most knowledgeable investigators in the narrow field. This phenomenon seems especially prevalent in occupational and environmental medicine.  See Cordelia Fine, “Biased But Brilliant,” New York Times (July 30, 2011) (describing confirmation bias and irrational loyalty of scientists to their hobby-horse hypotheses).

Peer review and correspondence to the editors are not the end of the story.  Discussion and debate may continue in the scientific community, but the pace of this debate may be glacial.  In areas of research where litigation or public policy does not fuel further research to address aberrant findings or to reconcile discordant results, science may take decades to ferret out the error. Litigation cannot proceed at this deliberative speed.  Furthermore, post-publication review is hardly a cure-all for the defects of peer review; post-publication commentary can be, and often is, spotty and inconsistent.  David Schriger and Douglas Altman, “Inadequate post-publication review of medical research:  A sign of an unhealthy research environment in clinical medicine,” 341 Brit. Med. J. 356 (2010) (identifying reasons for the absence of post-publication peer review).

The Evolution of Peer Review as a Criterion for Judicial Gatekeeping of Expert Witness Opinion

The story of how peer review came to be held in such high esteem in legal circles is sad, but deserves to be told.  In the Bendectin litigation, the medication sponsor, Merrell-Richardson, was confronted with the testimony of an epidemiologist, Shanna Swan, who propounded her own, unpublished re-analyses of published epidemiologic studies that had failed to find an association between Bendectin use and birth defects.  Merrell challenged Swan’s unpublished, non-peer-reviewed re-analyses as not “generally accepted” under the Frye test.  The lack of peer review seemed like good evidence of the novelty of Swan’s re-analyses, as well as their lack of general acceptance.

In the briefings, the Supreme Court received radically different views of peer review in the Daubert case.  One group of amici modestly explained that “peer review referees and editors limit their assessment of submitted articles to such matters as style, plausibility, and defensibility; they do not duplicate experiments from scratch or plow through reams of computer-generated data in order to guarantee accuracy or veracity or certainty.” Brief for Amici Curiae Daryl E. Chubin, et al. at 10, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993).  See also Daryl E. Chubin & Edward J. Hackett, Peerless Science: Peer Review and U.S. Science Policy (1990).

Other amici, such as the New England Journal of Medicine, Journal of the American Medical Association, and Annals of Internal Medicine proposed that peer-reviewed publication should be the principal criterion for admitting scientific opinion testimony.  Brief for Amici Curiae New England Journal of Medicine, Journal of the American Medical Association, and Annals of Internal Medicine in Support of Respondent, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993). But see Arnold S. Relman & Marcia Angell, “How Good Is Peer Review?” 321 New Eng. J. Med. 827, 828 (1989) (“peer review is not and cannot be an objective scientific process, nor can it be relied on to guarantee the validity or honesty of scientific research”).

Justice Blackmun, speaking for the majority in Daubert, steered a moderate course:

“Another pertinent consideration is whether the theory or technique has been subjected to peer review and publication. Publication (which is but one element of peer review) is not a sine qua non of admissibility; it does not necessarily correlate with reliability, see S. Jasanoff, The Fifth Branch: Science Advisors as Policymakers 61-76 (1990), and in some instances well-grounded but innovative theories will not have been published, see Horrobin, “The Philosophical Basis of Peer Review and the Suppression of Innovation,” 263 JAMA 1438 (1990). Some propositions, moreover, are too particular, too new, or of too limited interest to be published. But submission to the scrutiny of the scientific community is a component of “good science,” in part because it increases the likelihood that substantive flaws in methodology will be detected. See J. Ziman, Reliable Knowledge: An Exploration of the Grounds for Belief in Science 130-133 (1978); Relman & Angell, “How Good Is Peer Review?” 321 New Eng. J. Med. 827 (1989). The fact of publication (or lack thereof) in a peer reviewed journal thus will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”

Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593-94, 590 n.9 (1993).

This lukewarm endorsement from Justice Blackmun, in Daubert, sent a mixed message to lower federal courts, which tended to make peer review into somewhat of a mechanical test in their gatekeeping decisions.  Many federal judges (and state court judges in states that followed the Daubert precedent) were too busy, too indolent, or too lacking in analytical acumen, to look past the fact of publication and peer review.  These judges avoided the labor of independent thought by taking the fact of peer-reviewed publication as dispositive of the validity of the science in the paper.  Some commentators encouraged this low level of scrutiny and mechanical test, by suggesting that peer review could be taken as an indication of good science.  See, e.g., Margaret A. Berger, “The Supreme Court’s Trilogy on the Admissibility of Expert Testimony,” in Federal Judicial Center, Reference Manual on Scientific Evidence 9, 17 (2d ed. 2000) (describing Daubert as endorsing peer review as one of the “indicators of good science”) (hereafter cited as Reference Manual).  Elevating peer review to an indicator of good science, however, obscures its lack of epistemic warrant, misrepresents how the scientific community actually regards it, and enables judges to fall back into their pre-Daubert mindset of finding quick, easy, and invalid proxies for scientific reliability.

In a similar vein, other commentators spoke in superlatives about peer review, and thus managed to mislead judges and decision makers into regarding anything published as valid scientific data, data interpretation, and data analysis. For instance, Professor David Goodstein, writing in the Reference Manual, advises the federal judiciary that peer review is the test that separates valid science from rubbish:

“In the competition among ideas, the institution of peer review plays a central role. Scientific articles submitted for publication and proposals for funding are often sent to anonymous experts in the field, in other words, peers of the author, for review. Peer review works superbly to separate valid science from nonsense, or, in Kuhnian terms, to ensure that the current paradigm has been respected.11 It works less well as a means of choosing between competing valid ideas, in part because the peer doing the reviewing is often a competitor for the same resources (pages in prestigious journals, funds from government agencies) being sought by the authors. It works very poorly in catching cheating or fraud, because all scientists are socialized to believe that even their bitterest competitor is rigorously honest in the reporting of scientific results, making it easy to fool a referee with purposeful dishonesty if one wants to. Despite all of this, peer review is one of the sacred pillars of the scientific edifice.”

David Goodstein, “How Science Works,” Reference Manual 67, at 74-75, 82 (emphasis added).

Criticisms of Reliance Upon Peer Review as a Proxy for Reliability and Validity

Other commentators have put forward a more balanced and realistic, if not jaundiced, view of peer review. Professor Susan Haack, a philosopher of science at the University of Miami, who writes frequently about epistemic claims of expert witnesses and judicial approaches to gatekeeping, described the disconnect in meaning of peer review to scientists and to lawyers:

“For example, though peer-reviewed publication is now standard practice at scientific and medical journals, I doubt that many working scientists imagine that the fact that a work has been accepted for publication after peer review is any guarantee that it is good stuff, or that its not having been published necessarily undermines its value.92 The legal system, however, has come to invest considerable epistemic confidence in peer-reviewed publication93 — perhaps for no better reason than that the law reviews are not peer-reviewed!”

Susan Haack, “Irreconcilable Differences?  The Troubled Marriage of Science and Law,” 72 Law & Contemporary Problems 1, 19 (2009).  Haack’s assessment of the motivation of actors in the legal system is, for a philosopher, curiously ad hominem, and her shameless dig at law reviews is ironic, considering that she publishes extensively in them.  Still, her assessment that peer review is not any guarantee of an article’s being “good stuff” is one of her more coherent contributions to this discussion.

The absence of peer review hardly supports the inference that a study or an evaluation of studies is not reliable, unless of course we also know that the authors have failed after repeated attempts to find a publisher.  In today’s world of vanity presses, a researcher would be hard pressed to be unable to find a journal in which to publish a paper.  As Drummond Rennie, a former editor of the Journal of the American Medical Association (the same journal, acting as an amicus curiae to the Supreme Court, which oversold peer review), has remarked:

“There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print.”

Drummond Rennie, “Guarding the Guardians: A Conference on Editorial Peer Review,” 256 J. Am. Med. Ass’n 2391 (1986); D. Rennie, A. Flanagin, R. Smith, and J. Smith, “Fifth International Congress on Peer Review and Biomedical Publication: Call for Research,” 289 J. Am. Med. Ass’n 1438 (2003).

Other editors at leading medical journals seem to agree with Rennie.  Richard Horton, an editor of The Lancet, rejects the Goodstein view (from the Reference Manual) of peer review as the “sacred pillar of the scientific edifice”:

“The mistake, of course, is to have thought that peer review was any more than a crude means of discovering the acceptability — not the validity — of a new finding. Editors and scientists alike insist on the pivotal importance of peer review. We portray peer review to the public as a quasi-sacred process that helps to make science our most objective truth teller. But we know that the system of peer review is biased, unjust, unaccountable, incomplete, easily fixed, often insulting, usually ignorant, occasionally foolish, and frequently wrong.”

Richard Horton, “Genetically modified food: consternation, confusion, and crack-up,” 172 Med. J. Australia 148 (2000).

In the prestigious 2010 Sense About Science lecture, Fiona Godlee, the editor of the British Medical Journal, characterized peer review as deficient in at least seven ways:

  • Slow
  • Expensive
  • Biased
  • Unaccountable
  • Stifles innovation
  • Bad at detecting error
  • Hopeless at detecting fraud

Godlee, “It’s time to stand up for science once more” (June 21, 2010).

Important research often goes unpublished, and never sees the light of day.  Anti-industry zealots are fond of pointing fingers at the pharmaceutical industry, although many firms, such as GlaxoSmithKline, have adopted a practice of posting study results on a website.  The anti-industry zealots overlook how many apparently neutral investigators suppress research results that do not fit in with their pet theories.  One of my favorite examples is the failure of the late Dr. Irving Selikoff to publish his study of Johns-Manville factory workers:  William J. Nicholson, Ph.D. and Irving J. Selikoff, M.D., “Mortality experience of asbestos factory workers; effect of differing intensities of asbestos exposure,” Unpublished Manuscript.  This study investigated cancer and other mortality at a factory in New Jersey, where crocidolite was used in the manufacture of insulation products.  Selikoff and Nicholson apparently had no desire to publish a paper that would undermine their unfounded claim that crocidolite asbestos was not used by American workers.  But this desire does not necessarily mean that Nicholson and Selikoff’s unpublished paper was of any lesser quality than their study of North American insulators, the results of which they published, and republished, with abandon.

Examples of Failed Peer Review from the Litigation Front

Phenylpropanolamine and Stroke

Then there are many examples from the litigation arena of studies that passed peer review at the most demanding journals, but which did not hold up under the more intense scrutiny of review by experts in the cauldron of litigation.

In In re Phenylpropanolamine Products Liability Litigation, Judge Rothstein conducted hearings and entertained extensive briefings on the reliability of plaintiffs’ expert witnesses’ opinions, which were based largely upon one epidemiologic study, known as the “Yale Hemorrhagic Stroke Project” (HSP).  The project was undertaken by the manufacturers, which created a Scientific Advisory Group to oversee the study protocol.  The study was submitted as a report to the FDA, which reviewed the study and convened an advisory committee to review the study further.  “The prestigious NEJM published the HSP results, further substantiating that the research bears the indicia of good science.” In re Phenylpropanolamine Prod. Liab. Litig., 289 F. Supp. 2d 1230, 1239 (W.D. Wash. 2003) (citing Daubert II for the proposition that peer review shows the research meets the minimal criteria for good science).  There were thus many layers of peer review for the HSP study.

The HSP study was subjected to much greater analysis in litigation.  Peer review, even in the New England Journal of Medicine, did not and could not carry this weight. The defendants fought to obtain the underlying data of the HSP, and that underlying data unraveled the HSP paper.  Despite the plaintiffs’ initial enthusiasm for a litigation built on the back of a peer-reviewed paper in one of the leading clinical journals of internal medicine, the litigation resulted in a string of notable defense verdicts.  After one of the early defense verdicts, plaintiffs challenged the defendant’s reliance upon underlying data that went behind the peer-reviewed publication.  The trial court rejected the request for a new trial, and spoke to the value of challenging the superficial imprimatur of peer review of the key study relied upon by plaintiffs in the PPA litigation:

“I mean, you could almost say that there was some unethical activity with that Yale Study.  It’s real close.  I mean, I — I am very, very concerned at the integrity of those researchers.”

“Yale gets — Yale gets a big black eye on this.”

O’Neill v. Novartis AG, California Superior Court, Los Angeles Cty., Transcript of Oral Argument on Post-Trial Motions, at 46-47 (March 18, 2004) (Hon. Anthony J. Mohr).

Viagra and Ophthalmic Events

The litigation over ophthalmic adverse events after the use of Viagra provides another example of challenging peer review.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  In this litigation, the court, after viewing litigation discovery materials, recognized that the authors of a key paper failed to use the methodologies that were described in their published paper.  The court gave the sober assessment that “[p]eer review and publication mean little if a study is not based on accurate underlying data.” Id.

MMR Vaccine and Autism

Plaintiffs’ expert witness in the MMR vaccine/autism litigation, Andrew Wakefield, published a paper in The Lancet, in which he purported to find an association between the measles-mumps-rubella vaccine and autism.  A.J. Wakefield, et al., “Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children,” 351 Lancet 637 (1998).  This published paper, in a well-regarded journal, opened a decade-long controversy, with litigation, over the safety of the MMR vaccine.  The study was plagued, however, not only by the failure to disclose payments from plaintiffs’ attorneys and by ethical lapses in failing to obtain ethics board approvals, but by substantially misleading reports of data and data analyses.  In 2010, Wakefield was sanctioned by the UK General Medical Council’s Fitness to Practise Panel, and the Lancet, over a decade after initial publication, “fully retract[ed] this paper from the published record.”  Editors of the Lancet, “Retraction—Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children,” 375 Lancet 445 (2010).

Accutane and Suicide

In the New Jersey litigation over claimed health effects of Accutane, one of the plaintiffs’ expert witnesses was the author of a key paper that “linked” Accutane to depression.  Palazzolo v. Hoffman La Roche, Inc., 2010 WL 363834 (N.J. App. Div.).  Discovery revealed that the author, James Bremner, did not follow the methodology described in the paper.  Furthermore, Bremner could not document the data used in the paper’s analysis, and conceded that the statistical analyses were incorrect.  The New Jersey Appellate Division held that expert opinion relying upon Bremner’s study was properly excluded because the study was not soundly and reliably generated.  Id. at *5.

Silicone and Connective Tissue Disease

It is heartening that the scientific and medical communities decisively renounced the pathological science that underlay the silicone gel breast implant litigation.  The fact remains, however, that plaintiffs relied upon a large body of published papers, each more invalid than the next, to support their claims.  For many years, judges around the country blinked and let expert witnesses offer their causation opinions, in large part based upon papers by Smalley, Shanklin, Lappe, Kossovsky, Gershwin, Garrido, and others.  Peer review did little to stop the enthusiasm of editors for this “sexy” topic until a panel of court-appointed expert witnesses and the Institute of Medicine put an end to the judicial gullibility.

Concluding Comments

One district court distinguished between pre-publication peer review and the important peer review that takes place after publication, as other researchers quietly go about replicating or reproducing a study’s findings, or attempting to build on them.  “[J]ust because an article is published in a prestigious journal, or any journal at all, does not mean per se that it is scientifically valid.”  Pick v. Amer. Med. Sys., 958 F. Supp. 1151, 1178 n.19 (E.D. La. 1997), aff’d, 198 F.3d 241 (5th Cir. 1999).  With hindsight, we can say that Merrell Richardson’s strategy of emphasizing peer review has had some unfortunate, unintended consequences.  The Supreme Court made peer review a factor for assessing reliable science, and lower courts have elevated it into a criterion of validity.  The upshot is that many courts will not go beyond the statements in a peer-reviewed paper to determine whether they are based upon sufficient facts and data, or whether they follow from sound inferences drawn from those facts and data.  These courts violate the letter and spirit of Rule 702 of the Federal Rules of Evidence.

Bad and Good Statistical Advice from the New England Journal of Medicine

July 2nd, 2011

Many people consider The New England Journal of Medicine (NEJM) a prestigious journal.  It is certainly widely read.  Judging from its “impact factor,” we know the journal is frequently cited.  So when the NEJM weighs in on an issue that involves the intersection of law and science, I pay attention.

Unfortunately, this week’s issue contains an editorial “Perspective” piece that is filled with incoherent, inconsistent, and incorrect assertions, both on the law and the science.  Mark A. Pfeffer and Marianne Bowler, “Access to Safety Data – Stockholders versus Prescribers,” 364 New Engl. J. Med. ___ (2011).

Dr. Mark Pfeffer and the Hon. Marianne Bowler used the recent United States Supreme Court decision in Matrixx Initiatives, Inc. v. Siracusano, __ U.S. __, 131 S. Ct. 1309 (2011), to advance views not supported by the law or the science.  Remarkably, Dr. Pfeffer is the Victor J. Dzau Professor of Medicine at the Harvard Medical School.  He is a physician who also holds a Ph.D. in physiology and biophysics.  Ms. Bowler is both a lawyer and a federal judge.  Between the two, they should have provided better, more accurate, and more consistent advice.

1. The Authors Erroneously Characterize Statistical Significance in Inappropriate Bayesian Terms

The article begins with a relatively straightforward characterization of various legal burdens of proof.  The authors then try to collapse one of those burdens of proof, “beyond a reasonable doubt,” which has no accepted quantitative meaning, into the significance probability used to reject a pre-specified null hypothesis in scientific studies:

“To reject the null hypothesis (that a result occurred by chance) and deem an intervention effective in a clinical trial, the level of proof analogous to law’s ‘beyond a reasonable doubt’ standard would require an extremely stringent alpha level to permit researchers to claim a statistically significant effect, with the offsetting risk that a truly effective intervention would sometimes be deemed ineffective.  Instead, most randomized clinical trials are designed to achieve a lower level of evidence that in legal jargon might be called ‘clear and convincing’, making conclusions drawn from it highly probable or reasonably certain.”

Now this is both scientific and legal nonsense.  It is distressing that a federal judge characterizes the burden of proof that she must apply, or direct juries to apply, as “legal jargon.”  More important, these authors, scientist and judge, give questionable quantitative meanings to burdens of proof, and they misstate the meaning of statistical significance.  When judges or juries must determine guilt “beyond a reasonable doubt,” they are assessing the prosecution’s claim that the defendant is guilty, given the evidence at trial.  This posterior probability can be represented as:

Probability (Guilt | Evidence Adduced)

This is what is known as a posterior probability, and it is fundamentally different from a significance probability.

The significance probability is the transposed conditional of the posterior probability used to assess guilt in a criminal trial, or contentions in a civil trial.  As law professor David Kaye and his statistician coauthor, the late David Freedman, described the p-value and significance probability:

“The p-value is the probability of getting data as extreme as, or more extreme than, the actual data, given that the null hypothesis is true:

p = Probability (extreme data | null hypothesis in model)

* * *

Conversely, large p-values indicate that the data are compatible with the null hypothesis: the observed difference is easy to explain by chance. In this context, small p-values argue for the plaintiffs, while large p-values argue for the defense.131 Since p is calculated by assuming that the null hypothesis is correct (no real difference in pass rates), the p-value cannot give the chance that this hypothesis is true. The p-value merely gives the chance of getting evidence against the null hypothesis as strong or stronger than the evidence at hand—assuming the null hypothesis to be correct. No matter how many samples are obtained, the null hypothesis is either always right or always wrong. Chance affects the data, not the hypothesis. With the frequency interpretation of chance, there is no meaningful way to assign a numerical probability to the null hypothesis.132”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, Reference Manual on Scientific Evidence 122 (2d ed. 2000).  Kaye and Freedman explained over a decade ago, for the benefit of federal judges:

“As noted above, it is easy to mistake the p-value for the probability that there is no difference. Likewise, if results are significant at the .05 level, it is tempting to conclude that the null hypothesis has only a 5% chance of being correct.142

This temptation should be resisted. From the frequentist perspective, statistical hypotheses are either true or false; probabilities govern the samples, not the models and hypotheses. The significance level tells us what is likely to happen when the null hypothesis is correct; it cannot tell us the probability that the hypothesis is true. Significance comes no closer to expressing the probability that the null hypothesis is true than does the underlying p-value.143”

Id. at 124-25.

As we can see, our scientist from the Harvard Medical School and our federal judge have committed the transpositional fallacy by likening “beyond a reasonable doubt” to the alpha used to test for a statistically significant outcome in a clinical trial.  They are not the same; nor are they analogous.

This fallacy has been repeatedly described.  Not only has the Reference Manual on Scientific Evidence (which is written specifically for federal judges) described the fallacy in detail, but legal and scientific writers have urged care to avoid this basic mistake in probabilistic reasoning.  Here is a recent admonition from one of the leading writers on the use (and misuse) of statistics in legal proceedings:

“Some commentators, however, would go much further; they argue that [5%] is an arbitrary statistical convention and since preponderance of the evidence means 51% probability, lawyers should not use 5% as the level of statistical significance but 49% – thus rejecting the null hypothesis when there is up to a 49% chance that it is true. In their view, to use a 5% standard of significance would impermissibly raise the preponderance of evidence standard in civil trials. Of course the 5% figure is arbitrary (although widely accepted in statistics) but the argument is fallacious. It assumes that 5% (or 49% for that matter) is the probability that the null hypothesis is true. The 5% level of significance is not that, but the probability of the sample evidence if the null hypothesis were true. This is a very different matter. As I pointed out in Chapter 1, the probability of the sample given the null hypothesis is not generally the same as the probability of the null hypothesis given the sample. To relate the level of significance to the probability of the null hypothesis would require an application of Bayes’s theorem and the assumption of a prior probability distribution. However, the courts have usually accepted the statistical standard, although with some justifiable reservations when the P-value is only slightly above the 5% cutoff.”

Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 54 (N.Y. 2009) (emphasis added).
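
To see why the transposition matters, consider a minimal numerical sketch.  The numbers below are invented solely for illustration (an assumed prior probability that the null hypothesis is true, an assumed power, and a p-value treated loosely as the likelihood of the evidence under the null); the point is Finkelstein’s: moving from P(evidence | null) to P(null | evidence) requires Bayes’s theorem and a prior, and the two probabilities can be far apart.

```python
# Hypothetical illustration of the transposition fallacy.  All numbers are
# invented; the p-value is treated loosely as the likelihood of the observed
# evidence under the null hypothesis, a common simplification in such examples.

prior_null = 0.90   # assumed prior probability that the null hypothesis is true
p_value = 0.04      # P(evidence this extreme | null true); "significant" at 0.05
power = 0.80        # assumed P(evidence this extreme | alternative true)

# Bayes's theorem: P(null | evidence)
posterior_null = (p_value * prior_null) / (
    p_value * prior_null + power * (1 - prior_null)
)

print(f"P(evidence | null) = {p_value:.2f}")
print(f"P(null | evidence) = {posterior_null:.2f}")   # roughly 0.31
# A "statistically significant" p-value of 0.04 coexists, under these
# assumptions, with a posterior probability of about 31% that the null
# hypothesis is true.  The two conditional probabilities answer different
# questions, which is why equating alpha with a burden of proof misfires.
```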

2.  The Authors, Having Mischaracterized Burden-of-Proof and Significance Probabilities, Incorrectly Assess the Meaning of the Supreme Court’s Decision in Matrixx Initiatives.

I have written a good bit about the Court’s decision in Matrixx Initiatives, most recently with David Venderbush, for the Washington Legal Foundation.  See Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” W.L.F. Legal Backgrounder (June 17, 2011).

I was thus startled to see the claim of a federal judge that the Supreme Court, in Matrixx, had “applied the ‘fair preponderance of the evidence’ standard of proof used for civil matters.”  Matrixx was a case about the sufficiency of the pleadings, and thus there really could have been no such application of a burden of proof to an evidentiary display.  The very claim is incoherent, and at odds with the Supreme Court’s holding.

The NEJM authors went on to detail how the defendant in Matrixx had persuaded the trial court that the evidence against its product, Zicam, did not reach statistical significance, and therefore the evidence should not be considered “material.”  As I have pointed out before, Matrixx focused on adverse event reports, as raw numbers of reported events, which did not, and could not, be analyzed for statistical significance.  The very essence of Matrixx’s argument was nonsense, which perhaps explains the company’s nine-nothing loss in the Supreme Court.  The authors of the opinion piece in the NEJM, however, missed that it is not the evidence of adverse event reports, with or without a statistical analysis, that is material.  What was at issue was whether the company’s failure to disclose this information, along with a good deal more, rendered misleading the company’s very aggressive, optimistic projections of future sales and profits.

The NEJM authors proceed to tell us, correctly, that adverse events do not prove causality, but then they tell us, incorrectly, that the Matrixx case shows that “such a high level of proof did not have to be achieved.”  While the authors are correct about the insufficiency of adverse event reports for causal assessments, they miss the legal significance of there being no burden of proof at play in Matrixx; it was a case on the pleadings.  The issue was the sufficiency of those pleadings, and what the Supreme Court made clear was that in the context of a product subject to FDA regulation, causation was never the test for materiality because the FDA could withdraw the product on a showing far less than scientific causation of harm.  So the plaintiffs could allege less than causation, and still have pleaded a sufficient case of securities fraud.  The Supreme Court did not, and could not, address the issue that the NEJM authors discuss.  The authors’ assessment that the Matrixx case freed legal causation of any requirement of statistical significance is a tortured reading of obiter dictum, not the holding of the case.  This editorializing is troubling.

The NEJM authors similarly hold forth on what clinicians consider material, and they announce that “[c]linicians are well aware that to be considered material, information regarding drug safety does not have to reach the same level of certainty that we demand for demonstrating efficacy.”  This is true, but clinicians are ethically bound to err on the side of safety:  Primum non nocere (first, do no harm). See, e.g., Tamraz v. Lincoln Elec. Co., 620 F.3d 665, 673 (6th Cir. 2010) (noting that treating physicians have more training in diagnosis than in etiologic assessments), cert. denied, ___ U.S.____ (2011).  Again, the authors’ statements have nothing to do with the Matrixx case, or with the standards for legal or scientific causation.

3.  The Authors, Inconsistently with Their Characterization of Various Probabilities, Proceed Correctly To Describe Statistical Significance Testing for Adverse Outcomes in Trials.

Having incorrectly described “beyond a reasonable doubt” as akin to p < 0.05, the NEJM authors then correctly point out that standard statistical testing cannot be used for “evaluating unplanned and uncommon adverse events.”  The authors also note that the flood of data in the assessment of causation of adverse events is filled with “biologic noise.”  Physicians and regulators may take the noise signals and claim that they hear a concert.  This is exactly why we should not confuse precautionary judgments with scientific assessments of causation.

Ninth Circuit Affirms Rule 702 Exclusion of Dr David Egilman in Diacetyl Case

June 20th, 2011

On June 17, 2011, the United States Court of Appeals for the Ninth Circuit affirmed a district judge’s decision to exclude Dr David S. Egilman from testifying in a consumer-exposure diacetyl case.  Newkirk v. Conagra Foods Inc. (9th Cir. 2011).

Plaintiff claimed to have developed bronchiolitis obliterans from having popped and eaten a Homeric quantity of microwavable popcorn.  The case was thus a key test of “consumer” diacetyl exposure.  Another case, also involving Egilman, just finished a Daubert hearing in Colorado last week.

To get the full “flavor” of this diacetyl case, you may have to read the district court’s opinion, which excluded Egilman and other witnesses, and entered summary judgment for the defense. Newkirk v. Conagra Foods, Inc., No. CV-08-273, 2010 WL 2680184 (E.D. Wash. July 2, 2010).

Plaintiff appealed, and so did Egilman.  (See attached Egilman Motion Appeal Diacetyl Exclusion 2011 and Egilman Declaration Newkirk Diacetyl Appeal 2011.)  In what some may consider scurrilous pleading, Egilman attacked the district judge for having excluded him from testifying.  If Egilman’s challenge to the trial judge was not bizarre enough, Egilman also claimed a right to intervene in the appeal by advancing the claim that the Rule 702 exclusion hurt his livelihood.  The following language is from paragraph 11 of Dr. Egilman’s declaration in support of his motion:

“The Daubert ruling eliminates my ability to testify in this case and in others. I will lose the opportunity to bill for services in this case and in others (although I generally donate most fees related to courtroom testimony to charitable organizations, the lack of opportunity to do so is an injury to me). Based on my experience, it is virtually certain that some lawyers will choose not to attempt to retain me as a result of this ruling. Some lawyers will be dissuaded from retaining my services because the ruling is replete with unsubstantiated pejorative attacks on my qualifications as a scientist and expert. The judge’s rejection of my opinion is primarily an ad hominem attack and not based on an actual analysis of what I said – in an effort to deflect the ad hominem nature of the attack the judge creates ‘straw man’ arguments and then knocks the straw men down, without ever addressing the substance of my positions.”

Egilman Declaration at Paragraph 11.

Egilman tempers his opinion about the prejudice he will suffer in front of judges in future cases.  Only judges who have not seen him before would likely be persuaded by Judge Peterson’s decision in Newkirk.  Those judges who have heard him testify before would, no doubt, see him for the brilliant crusading avenger that he is:

“This will generally not occur in cases heard before Judges where I have already appeared as a witness. For example a New York state trial judge has praised plaintiffs’ molecular-biology and public-health expert Dr. David Egilman as follows: ‘Dr. Egilman is a brilliant fellow and I always enjoy seeing him and I enjoy listening to his testimony . . . . He is brilliant, he really is.’ [Lopez v. Ford Motor Co., et al. (120954/2000; In re New York City Asbestos Litigation, Index No. 40000/88).]”

Egilman Declaration at p. 9 n. 2.

It does not appear as though Egilman’s attempt to intervene helped plaintiff before the Ninth Circuit, which may not have thought that he was as brilliant as the unidentified trial judge in Lopez.

The Newkirk case is interesting for several reasons.

First, the Circuit correctly saw that general causation must be shown before the plaintiff can invoke a differential etiology analysis.

Second, the Circuit saw that it is not sufficient that the substance in question can cause the outcome claimed; the substance must do so at the levels of exposure that were experienced by the plaintiff.  In Newkirk, even by consuming massive quantities of microwave popcorn, plaintiff had not shown exposure levels to diacetyl equivalent to the exposures among factory workers at risk for bronchiolitis obliterans.  The affirmance of the district court is a strong statement that exposure matters in the context of the current understanding of diacetyl causation.

Third, the Circuit was not intimidated or persuaded by the tactics of Dr David Egilman, expert witness.

Fourth, having dealt with the issues deftly, the Ninth Circuit issued a judgment from which there will be no appeal.

WLF Legal Backgrounder on Matrixx Initiatives

June 20th, 2011

In Matrixx Initiatives, Inc. v. Siracusano, ___ U.S. ___, ___ , 2011 WL 977060 (Mar. 22, 2011), the Supreme Court addressed a securities fraud case against an over-the-counter pharmaceutical company for speaking to the market about its rosy financial projections, but failing to provide information received about the hazards of the product.

Much or most of the holding of the case is an unexceptional application of settled principles of securities fraud litigation in the context of claims against a pharmaceutical company with products liability cases pending.  The defendant company, however, attempted to import Rule 702 principles of scientific evidence into a motion to dismiss on the pleadings, with much confusion resulting among the litigants, the amici, and the Court.  The Supreme Court ruled unanimously to affirm the reinstatement of the complaint against the defendant.

I have written about this case previously: “The Matrixx – A Comedy of Errors,” and “Matrixx Unloaded,” and “The Matrixx Oversold,” and “De-Zincing the Matrixx.”

Now, with the collaboration of David Venderbush from Alston & Bird LLP, we have collected our thoughts to share in the form of a Washington Legal Foundation Legal Backgrounder, which is available for download at the WLF’s website.  Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” 26 (14) Legal Backgrounder (June 17, 2011).

National Academies Press Publications Are Now Free

June 3rd, 2011

Publications of the National Research Council, as well as those of its constitutive organizations, the National Academy of Science, the Institute of Medicine, and the National Academy of Engineering, are often important resources for lawyers who litigate scientific and technical issues.  Right or wrong, these publications become forces in their own right in the courtroom, where they command serious attention from trial and appellate judges.

According to the National Academies Press’s website, all electronic versions of its books, in portable document format (pdf), will be available at its website, for free:

“As of June 2, 2011, all PDF versions of books published by the National Academies Press (NAP) will be downloadable to anyone free of charge.

That’s more than 4,000 books plus future reports produced by NAP – publisher for the National Academy of Sciences, National Academy of Engineering, Institute of Medicine, and National Research Council.”

Important works on forensic evidence, asbestos, dioxin, beryllium, research ethics, and data sharing published by the NAP, for the IOM or NRC, are now available for free.  The NAP previously charged upwards of $40 or $50 for some of these books.

This summer, the NRC’s Committee on Science, Technology and Law will release the Third Edition of the Reference Manual on Scientific Evidence, previously prepared by the Federal Judicial Center.  See http://sites.nationalacademies.org/PGA/stl/development_manual/index.htm

Statistical Power in the Academy

June 1st, 2011

Previously I have written about the concept of statistical power and how it is used and abused in the courts.  See here and here.

Statistical power was discussed in both the statistics and the epidemiology chapters of the Second Edition of The Reference Manual on Scientific Evidence. In my earlier posts, I pointed out that the chapter on epidemiology provided some misleading, outdated guidance on the use of power.  See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000) (recommending use of power curves to assess whether failure to achieve statistical significance is exonerative of the exposure in question).  This chapter suggests that “[t]he concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.” Id.; see also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000).

The fact of the matter is that power curves are rarely, if ever, used in contemporary epidemiology, and post-hoc power calculations have been discouraged and severely criticized for a long time. After the data are collected, the appropriate method to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of risk.  Confidence intervals allow a concerned reader to evaluate what can reasonably be ruled out (on the basis of random variation only) by the data in a given study. Post-hoc power calculations fail to provide meaningful information because they ignore the results actually obtained and require a specified alternative hypothesis.
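
For readers who want to see what the post-data alternative looks like, here is a minimal sketch of the standard large-sample (Woolf) confidence interval for an odds ratio, computed from an invented 2×2 table; the counts are hypothetical, chosen only to show how the interval conveys what the data can and cannot rule out.

```python
import math

# Hypothetical 2x2 table (invented counts, for illustration only):
#                exposed   unexposed
#   cases            20          10
#   controls         80          90
a, b, c, d = 20, 10, 80, 90

odds_ratio = (a * d) / (b * c)                    # 2.25

# Woolf (large-sample) 95% confidence interval on the log-odds scale
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
z = 1.96
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
# Roughly: OR = 2.25, 95% CI (0.99, 5.09).  The interval shows directly which
# risk estimates the data leave open and which they reasonably rule out, using
# the results actually obtained; no alternative hypothesis need be assumed.
```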

Twenty-five years ago, the use of post-hoc power was thoughtfully put in the dust bin of statistical techniques in the leading clinical medical journal:

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

***

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

An accompanying editorial by Ken Rothman reinforced the guidance given by Simon:

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986).

These two papers must be added to the 20 consensus statements, textbooks, and articles I previously cited.  See Schachtman, Power in the Courts, Part Two (2011).

The danger of the Reference Manual’s misleading advice is illustrated in a recent law review article by Professor Gold, of the Rutgers Law School, who asks “[w]hat if, as is frequently the case, such study is possible but of limited statistical power?”  Steve C. Gold, “The ‘Reshapement’ of the False Negative Asymmetry in Toxic Tort Causation,” 37 William Mitchell L. Rev. 101, 117 (2011) (available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1797826).

Never mind for the moment that Professor Gold offers no empirical evidence to support his assertion that studies of limited statistical power are “frequently” used in litigation.  Gold critically points to Dunn v. Sandoz Pharmaceuticals Corp., 275 F. Supp. 2d 672, 677–81, 684 (M.D.N.C. 2003), a Parlodel case in which the plaintiff relied upon a single case-control study that found an elevated odds ratio (8.4), which was not statistically significant.  Gold at 117.  Gold complains that “a study’s limited statistical power, rather than the absence of a genuine association, may lead to statistically insignificant results that courts treat as disproof of causation, particularly in situations without the large study samples that result from mass exposures.” Id.  Gold goes on to applaud two cases for emphasizing consideration of post-hoc power.  Id. at 117 & nn. 80-81 (citing Smith v. Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 692-93 (W.D.N.C. 2003) (“[T]he concept of power is key because it’s helpful in evaluating whether the study’s outcome . . . is exonerative or inconclusive.”), and Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767, 774 (N.D. Ohio 2010) (prohibiting expert witness from opining that epidemiologic studies are evidence of no association unless the witness “has performed a methodologically reliable analysis of the studies’ statistical power to support that conclusion”)).

What of Professor Gold’s suggestion that power should be considered in evaluating studies that do not have statistically significant outcomes of interest?  See id. at 117. Not only is Gold’s endorsement at odds with sound scientific and statistical advice, but his approach reveals a potential hypocrisy when considered in the light of his criticisms of significance testing.  Post-hoc power tests ignore the results obtained, including the variance of the actual study results, and they are calculated based upon a predetermined arbitrary measure of Type I error (alpha) that is the focus of so much of Gold’s discomfort with statistical evidence.  Of course, power calculations also are made on the basis of arbitrarily selected alternative hypotheses, but this level of arbitrariness seems not to disturb Gold so much.
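
The arbitrariness is easy to demonstrate with a sketch.  Using an invented standard error and a simple normal-approximation power formula (not any particular study’s data), the “post-hoc power” of the very same study swings from low to high depending solely on which alternative hypothesis the analyst chooses to posit:

```python
from math import log
from scipy.stats import norm

# Invented example: a study whose estimated relative risk has a standard error
# of 0.30 on the log scale.  The data never change, but the reported "power"
# depends entirely on the alternative hypothesis chosen after the fact.
se_log_rr = 0.30
z_crit = norm.ppf(0.975)          # two-sided test at alpha = 0.05

def approx_power(assumed_rr):
    """Normal-approximation power to detect assumed_rr at alpha = 0.05."""
    z_alt = abs(log(assumed_rr)) / se_log_rr
    return 1 - norm.cdf(z_crit - z_alt)

for rr in (1.5, 2.0, 3.0):
    print(f"assumed RR = {rr:.1f}  ->  power ~ {approx_power(rr):.2f}")
# The same study is "underpowered" (about 0.27) against RR = 1.5 and
# "well powered" (about 0.96) against RR = 3.0; the choice of alternative,
# not the data, drives the answer, which is why the confidence interval
# is the better post-data summary.
```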

Where does the Third Edition of the Reference Manual on Scientific Evidence come out on this issue?  The Third Edition is not yet published, but Professor David Kaye has posted his chapter on statistics on the internet.  David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” chapter 5.  http://www.personal.psu.edu/dhk3/pubs/11-FJC-Ch5-Stat.pdf (David Freedman died in 2008, after the chapter was submitted to the National Academy of Sciences for review; only Professor Kaye responded to the Academy’s reviews).

The chapter essentially continues the Second Edition’s advice:

“When a study with low power fails to show a significant effect, the results may therefore be more fairly described as inconclusive than negative. The proof is weak because power is low. On the other hand, when studies have a good chance of detecting a meaningful association, failure to obtain significance can be persuasive evidence that there is nothing much to be found.”

Chapter 5, at 44-46 (citations and footnotes omitted).

The chapter’s advice is not, of course, limited to epidemiologic studies, where a risk ratio or a risk difference is typically reported with an appropriate confidence interval.  As a generalization about all statistical tests, some of which do not report a measure of “effect size” or the variability of the sample statistic, the chapter’s advice is fine.  But, as we can see from Professor Gold’s discussion and case review, the advice runs into trouble when measured against the methodological standards for evaluating an epidemiologic study’s results when confidence intervals are available.  Gold’s assessment of the cases is considerably skewed by his failure to recognize the inappropriateness of post-hoc power assessments of epidemiologic studies.

Sub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test

May 17th, 2011

Both aggregation and disaggregation of outcomes pose difficult problems for statistical analysis, and for epidemiology.  If outcomes are bundled into a single composite outcome, there has to be some basis for the bundling to make sense.  Even so, a composite outcome, such as all cardiovascular disease events, could easily hide an association in a component outcome.  For instance, studies of a drug under scrutiny may show no increased risk for all cardiovascular events, but closer inspection may show an increased risk for heart attacks while also showing a decreased risk for strokes.
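
A toy calculation with invented event counts shows how easily the masking can occur; the heart-attack excess and the stroke deficit nearly cancel in the composite:

```python
# Invented numbers, for illustration only: 1,000 patients per arm.
n_drug = n_placebo = 1000

mi_drug, mi_placebo = 30, 20            # heart attacks
stroke_drug, stroke_placebo = 12, 20    # strokes

def risk_ratio(events_drug, events_placebo):
    return (events_drug / n_drug) / (events_placebo / n_placebo)

print("RR, heart attack:", risk_ratio(mi_drug, mi_placebo))          # 1.50
print("RR, stroke:      ", risk_ratio(stroke_drug, stroke_placebo))  # 0.60
print("RR, composite:   ", risk_ratio(mi_drug + stroke_drug,
                                      mi_placebo + stroke_placebo))  # 1.05
# The composite "all cardiovascular events" ratio of about 1.05 looks null,
# even though the component outcomes move in opposite directions.
```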

The opposite problem arises when studies report multiple subgroups.  The opportunity for post hoc data mining runs rampant, and the existence of multiple subgroups means that the usual level of statistical significance becomes ineffective for ruling out chance as an explanation for an increased or decreased risk in a subgroup.  This problem is well known and extensively explored in the epidemiology literature, but it receives no attention in the Federal Judicial Center’s current Reference Manual on Scientific Evidence.  I hope that the authors of the Third Edition, which is due out in a few months, give some attention to the problem of subgroup analysis in epidemiology.  This seems to be an area where judges need a good deal of assistance, and where the Reference Manual lets them down.

Litigation tends to be a fertile field for data dredging, or the Texas Sharpshooter’s approach to epidemiology. (The Texas Sharpshooter shoots first and draws the target later.) When studies look at many outcomes, or many subgroups, chance alone will lead to results that have p-values less than the usual level for statistical significance (p < 0.05).  Accepting a result as “significant” when there is a multiplicity of testing or comparisons resulting from subgroup analyses is a form of “data torturing.” Mills, “Data Torturing,” 329 New Engl. J. Med. 1196, 1196 (1993) (“If you torture the data long enough, they will confess.”).

The multiple testing or comparison issue arises in both cohort and case-control studies.  Cohort studies can look at cancer morbidity or mortality in 20 different organs, with multiple histological subtypes for each cancer.  There are hundreds of diseases, by World Health Organization disease codes, that can be possible outcomes in a cohort study.  The odds are very good that several disease outcomes will be significantly elevated or decreased by chance alone.  Similarly, in a case-control study, participants with the outcome of interest can be questioned about hundreds of lifestyle and exposure variables.  Again, the finding of a statistically significant “risk factor” is not very compelling under these circumstances.
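
The arithmetic behind this point is simple.  Assuming independent tests at the conventional alpha of 0.05, the chance of at least one false-positive “significant” subgroup grows quickly with the number of comparisons (this is the source of the 40% figure that Stewart and Parmar give in the passage quoted below), and a Bonferroni-style adjustment shows how small the per-comparison threshold must become to hold the overall Type I error rate at 5%.  The sketch below uses generic numbers, not data from any particular study.

```python
# Family-wise false-positive risk for k independent subgroup tests at the
# conventional alpha = 0.05, and the Bonferroni-adjusted per-test threshold
# that would keep the overall Type I error rate at 5%.  Generic numbers only.
alpha = 0.05

for k in (1, 5, 10, 20, 100):
    prob_at_least_one = 1 - (1 - alpha) ** k
    bonferroni_alpha = alpha / k
    print(f"{k:3d} tests: P(at least one false positive) = {prob_at_least_one:.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")

# With 10 subgroup analyses the chance of at least one spurious "significant"
# result is about 40%; with 100 it is a near certainty (about 99%).
```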

The problem of subgroup analyses is exacerbated by defense counsel’s emphasis on statistical significance as a “bright-line” test.  When subgroup analyses yield a statistically significant result, at the usual p < 0.05, which they will often do by chance alone, plaintiffs’ counsel have a “gotcha” moment.  Having built up the importance of statistical significance, defense counsel are hard pressed to dismiss the “significant” finding, even though the study design makes it highly questionable, if not downright meaningless.

Although the Reference Manual ignores this recurrent problem, several authors have issued stern warnings about it.  For instance, Lisa Bero, who writes frequently on issues of science and the law, admonishes:

“Specifying subgroup analysis after data collection for the review has already begun can be a ‘fishing expedition’ or “data dredging” for statistically significant results and is not appropriate.”

L. Bero, “Evaluating Systematic Reviews and Meta-Analyses,” J. L. & Policy 569, 576 (2006).

Egger and Davey Smith, two well-respected authors who write about methodological issues in epidemiology, warn:

“Similarly, unplanned data-driven subgroup analyses are likely to produce spurious results.”

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

Stewart and Parmar explain the genesis of the problem and the result of diluting the protection that statistical significance usually provides against Type I errors:

“In general, the results of these subgroup analyses can be very misleading owing to the very high probability that any observed difference is due solely to chance. For example, if 10 subgroup analyses are carried out, there is a 40% chance of finding at least one significant false-positive effect (5% significance level).  Further, when the results of subgroup analyses are reported, often only those that have yielded a significant result are presented, without noting that many other analyses have been performed.”

Stewart and Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996)

“Such data dredging must be avoided and subgroup analyses should be limited to those that are specified a priori in the trial protocol.”

Id. at 272.

“Readers and reviewers should be aware that subgroup analyses, exploratory or otherwise, are likely to be particularly unreliable in situations where no overall effect of treatment has been observed.  In this case, if one subgroup exhibits a particularly positive effect of treatment, then another subgroup has to have a counteracting negative effect.”

* * *

“Consequently, perhaps the most sensible advice to readers and reviewers is to be very skeptical about the results of subgroup analyses.”

Id.  See also Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000) (“Analysis of subgroup results in a clinical trial is surprisingly unreliable, even in a large trial.  This is the result of a combination of reduced statistical power, increased variance and the play of chance.  Reliance on such analyses is more likely to be erroneous, and hence harmful, than application of the overall proportional (or relative) result in the whole trial to the estimate of absolute risk in that subgroup.  Plausible explanations can usually be found for effects that are, in reality, simply due to the play of chance.  When clinicians believe such subgroup analyses, there is a real danger of harm to the individual patient.”)

These warnings and admonitions are important caveats to statistical significance.  In emphasizing statistical significance when evaluating statistical evidence, defense lawyers are sometimes unwittingly hoist with their own petard, in the form of studies with results that meet the usual p-value threshold of less than 5%.  Courts see these defense lawyers as engaged in special pleading when counsel argue that study multiplicity requires changing the p-value threshold to preserve the desired rate of Type I error, but that is exactly what must be done.
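What “changing the p-value threshold” looks like in practice can be sketched with the familiar Bonferroni and Šidák corrections (a simplified illustration assuming ten subgroup analyses; other adjustment methods exist):

```python
# Sketch of two common ways to keep the family-wise Type I error rate
# near 5% when k subgroup analyses are performed.  The value of k is
# an assumption for illustration only.
alpha_family = 0.05
k = 10  # assumed number of subgroup analyses

bonferroni_alpha = alpha_family / k                  # 0.005
sidak_alpha = 1 - (1 - alpha_family) ** (1 / k)      # about 0.0051

print(f"Bonferroni per-test threshold: {bonferroni_alpha:.4f}")
print(f"Sidak per-test threshold:      {sidak_alpha:.4f}")

# A subgroup result with p = 0.03 clears the customary 0.05 line, but
# not the adjusted thresholds, which is the point counsel must be
# prepared to make once the 0.05 line has been oversold.
```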

A few years ago, the New England Journal of Medicine published an article that detailed the problem and promulgated guidelines for avoiding the worst abuses.  R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).  Wang and colleagues provide some important insights for how subgroup analyses can lead to increased rates of Type I errors, and they provide guidelines for authors on appropriate descriptions of subgroup analyses:

“However, subgroup analyses also introduce analytic challenges and can lead to overstated and misleading results.”

Id. at 2189a.

“When multiple subgroup analyses are performed, the probability of a false positive finding can be substantial.”

Id. at 2190a.

“There are several methods for addressing multiplicity that are based on the use of more stringent criteria for statistical significance than the customary P < 0.05.”

Id. at 2190b.

“A pre-specified subgroup analysis is one that is planned and documented before any examination of the data, preferably in the study protocol.”

Id. at 2190b.

“Post hoc analyses refer to those in which the hypotheses being tested are not specified before any examination of the data. Such analyses are of particular concern because it is often unclear how many were undertaken and whether some were motivated by inspection of the data. However, both pre-specified and post hoc subgroup analyses are subject to inflated false positive rates arising from multiple testing. Investigators should avoid the tendency to pre-specify many subgroup analyses in the mistaken belief that these analyses are free of the multiplicity problem.”

Id. at 2190b.

“When properly planned, reported, and interpreted, subgroup analyses can provide valuable information.”

Id. at 2193b.

Although Wang and colleagues take their primary aim at the abuse of subgroup analyses in randomized clinical trials, they make clear that the abuse is equally present in observational studies:

“In other settings, including observational studies, we encourage complete and thorough reporting of the subgroup analyses in the spirit of the guidelines listed.”

Id. at 2193b.

Wang and colleagues provide some very specific guidelines for reporting subgroup analyses.  These guidelines should help courts make sober assessments of results from subgroup analyses.

Recently, another guideline initiative in the field of observational epidemiology, STROBE, provided similar guidance to authors and journals for reporting subgroup analyses:

“[M]any debate the use and value of analyses restricted to subgroups of the study population. Subgroup analyses are nevertheless often done. Readers need to know which subgroup analyses were planned in advance, and which arose while analyzing the data. Also, it is important to explain what methods were used to examine whether effects or associations differed across groups … .”

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805, 817 (2007).

“There is debate about the dangers associated with subgroup analyses, and multiplicity of analyses in general.  In our opinion, there is too great a tendency to look for evidence of subgroup-specific associations, or effect-measure modification, when overall results appear to suggest little or no effect. On the other hand, there is value in exploring whether an overall association appears consistent across several, preferably pre-specified subgroups, especially when a study is large enough to have sufficient data in each subgroup. A second area of debate is about interesting subgroups that arose during the data analysis. They might be important findings, but might also arise by chance. Some argue that it is neither possible nor necessary to inform the reader about all subgroup analyses done as future analyses of other data will tell to what extent the early exciting findings stand the test of time. We advise authors to report which analyses were planned, and which were not … . This will allow readers to judge the implications of multiplicity, taking into account the study’s position on the continuum from discovery to verification or refutation.”

Id. at 826-27.

Bibliography

E. Akl, M. Briel, J.J. You, et al., “LOST to follow-up Information in Trials (LOST-IT): a protocol on the potential impact,” 10 Trials 40 (2009).

Susan Assmann, Stuart Pocock, Laura Enos & Linda Kasten, “Subgroup analysis and other (mis)uses of baseline data in clinical trials,” 355 Lancet 1064 (2000).

M. Bhandari, P.J. Devereaux, P. Li, et al., “Misuse of baseline comparison tests and subgroup analyses in surgical trials,” 447 Clin. Orthoped. Relat. Res. 247 (2006).

S. T. Brookes, E. Whitely, M. Egger, et al., “Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test,” 57 J. Clin. Epid. 229 (2004).

A-W Chan, A. Hrobjartsson, K.J. Jorgensen, et al., “Discrepancies in sample size calculations and data analyses reported in randomised trials: comparison of publications with protocols,” 337 Brit. Med. J. a2299 (2008).

L. Cui, H.M. Hung, S.J. Wang, et al., “Issues related to subgroup analysis in clinical trials,” 12 J. Biopharm. Stat. 347 (2002).

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

J. Fletcher, “Subgroup analyses: how to avoid being misled,” 335 Brit. Med. J. 96 (2007).

Nick Freemantle, “Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?” 322 Brit. Med. J. 989 (2001).

G. Guyatt, P.C. Wyer, J. Ioannidis, “When to Believe a Subgroup Analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 571-83 (2008).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Inconsistent trial assessments by the National Institute for Health and Clinical Excellence and IQWiG: standards for the performance and interpretation of subgroup analyses are needed,” 63 J. Clin. Epidem. 1298 (2010).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Standards for subgroup analyses are needed? We couldn’t agree more,”  64 J. Clin. Epidem. 451 (2011).

R. Hatala, S. Keitz, P. Wyer, et al., “Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results,” 172 Can. Med. Ass’n J. 661 (2005).

A.V. Hernandez, E.W. Steyerberg, G.S. Taylor, et al., “Subgroup analysis and covariate adjustment in randomized clinical trials of traumatic brain injury: a systematic review,” 57 Neurosurgery 1244 (2005).

A.V. Hernandez, E. Boersma, G.D. Murray, et al., “Subgroup analyses in therapeutic cardiovascular clinical trials: are most of them misleading?” 151 Am. Heart J. 257 (2006).

K. Hirji & M. Fagerland, “Outcome based subgroup analysis: a neglected concern,” 10 Trials 33 (2009).

Stephen W. Lagakos, “The Challenge of Subgroup Analyses — Reporting without Distorting,” 354 New Engl. J. Med. 1667 (2006).

C.M. Martin, G. Guyatt, V. M. Montori, “The sirens are singing: the perils of trusting trials stopped early and subgroup analyses,” 33 Crit. Care Med. 1870 (2005).

D. Moher, K. Schulz, D. Altman, et al., “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” 357 Lancet 1191 (2001).

V.M. Montori, R. Jaeschke, H.J. Schunemann, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004).

A.D. Oxman & G.H. Guyatt, “A consumer’s guide to subgroup analyses,” 116 Ann. Intern. Med. 78 (1992).

A. Oxman, G. Guyatt, L. Green, et al., “When to believe a subgroup analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 553-65 (2008).

S. Pocock, M. D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials:  A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987).

S. Pocock, S. Assmann, L. Enos, et al., “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems,” 21 Statistics in Medicine 2917 (2002).

Peter Rothwell, “Subgroup analysis in randomised controlled trials:  importance, indications, and interpretation,” 365 Lancet 176 (2005).

Kenneth Schulz & David Grimes, “Multiplicity in randomised trials II: subgroup and interim analyses,” 365 Lancet 1657 (2005).

Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000).

Reuel Stallones, “The Use and Abuse of Subgroup Analysis in Epidemiological Research,” 16 Prev. Med. 183 (1987).

Stewart & Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).

Xin Sun, Matthias Briel, Jason Busse, Elie A. Akl, John J .You, Filip Mejza, Malgorzata Bala, Natalia Diaz-Granados, Dirk Bassler, Dominik Mertz, Sadeesh K Srinathan, Per Olav Vandvik, German Malaga, Mohamed Alshurafa, Philipp Dahm, Pablo Alonso-Coello, Diane M Heels-Ansdell, Neera Bhatnagar, Bradley C. Johnston, Li Wang, Stephen D. Walter, Douglas G. Altman, and Gordon Guyatt, “Subgroup Analysis of Trials Is Rarely Easy (SATIRE): a study protocol for a systematic review to characterize the analysis, reporting, and claim of subgroup effects in randomized trials,” 10 Trials 1010 (2009).

A. Trevor & G. Sheldon, “Criteria for the Implementation of Research Evidence in Policy and Practice,” in A. Haines, ed., Getting Research Findings Into Practice 11 (2d ed. 2008).

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805–835 (2007).

Erik von Elm & Matthias Egger, “The scandal of poor epidemiological research: Reporting guidelines are needed for observational epidemiology,” 329 Brit. Med. J. 868 (2004).

R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).

S. Yusuf, J. Wittes, J. Probstfield, et al., “Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials,” 266 J. Am. Med. Ass’n 93 (1991).

De-Zincing the Matrixx

April 12th, 2011

Although the plaintiffs in Matrixx Initiatives, Inc. v. Siracusano generally were more accurate in defining statistical significance than the defendant, or than the so-called “statistical expert” amici (Ziliak and McCloskey), the plaintiffs’ brief went off the rails when it turned to discussing the requirements for proving causation.  Of course, the admissibility and sufficiency of evidence to show causation were not at issue in the case, but the plaintiffs got pulled down the rabbit hole dug by the defendant, in its bid to establish a legal bright-line rule about pleading.

Differential Diagnosis

In an effort to persuade the Court that statistical significance is not required, the plaintiffs/respondents threw science and legal principles to the wind.  They contended that statistical significance is not at all necessary to causal determinations because

“[c]ourts have recognized that a physician’s differential diagnosis (which identifies a likely cause of certain symptoms after ruling out other possibilities) can be reliable evidence of causation.”

Respondents’ Brief at 49.   Perhaps this is simply the Respondents’ naiveté, but it seems to suggest scienter to deceive. Differential diagnosis is not about etiology; it is about diagnosis, which rarely incorporates an assessment of etiology.  Even if the differentials were etiologies and not diagnoses, the putative causes in the differential must already be shown, independently, to be capable of causing the outcome in question. See, e.g., Tamraz v. Lincoln Electric Co., 620 F.3d 665 (6th Cir. 2010).  A physician cannot rule in an etiology in a specific person simply by positing it among the differentials, without independent, reliable evidence that the ruled in “specific cause” can cause the outcome in question, under the circumstances of the plaintiff’s exposure.  Furthermore, differential diagnosis or etiology is nothing more than a process of elimination to select a specific cause; it has nothing to do with statistical significance because it has nothing to do with general causation.

This error in the Respondents’ brief about differential diagnosis unfortunately finds its way into Justice Sotomayor’s opinion.

Daubert Denial and the Recrudescence of Ferebee

In their zeal, the Respondents go further than advancing a confusion between general and specific causation, and an erroneous view of what must be shown before a putative cause can be inserted in a set of differential (specific) causes.  They cite one of the most discredited cases in 20th century American law of expert witnesses:

Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C. Cir. 1984) (“products liability law does not preclude recovery until a ‘statistically significant’ number of people have been injured”).

Respondents’ Brief at 50.  This is not a personal, subjective opinion about this 1984 pre-Daubert decision.  Ferebee was wrongly decided when announced, and it was soon abandoned by the very court that issued the opinion.  It has been a derelict on the sea of evidence law for over a quarter of a century.  Citing to Ferebee, without acknowledging its clearly overruled status, raises an interesting issue about candor to the Court, and the responsibilities of counsel in trash picking in the dustbin of expert witness law.

Along with its apparent rejection of statistical significance, Ferebee is known for articulating an “anything goes” philosophy toward the admissibility and sufficiency of expert witness testimony:

“Judges, both trial and appellate, have no special competence to resolve the complex and refractory causal issues raised by the attempt to link low-level exposure to toxic chemicals with human disease.  On questions such as these, which stand at the frontier of current medical and epidemiological inquiry, if experts are willing to testify that such a link exists, it is for the jury to decide whether to credit such testimony.”

Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1534 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).  Within a few years, the nihilism of Ferebee was severely limited by the court that decided the case:

“The question whether Bendectin causes limb reduction defects is scientific in nature, and it is to the scientific community that the law must look for the answer.  For this reason, expert witnesses are indispensable in a case such as this.  But that is not to say that the court’s hands are inexorably tied, or that it must accept uncritically any sort of opinion espoused by an expert merely because his credentials render him qualified to testify… . Whether an expert’s opinion has an adequate basis and whether without it an evidentiary burden has been met, are matters of law for the court to decide.”

Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 829 (D.C. Cir. 1988).

Of course, several important decisions intervened between Ferebee and Richardson.  In 1986, the Fifth Circuit expressed a clear message to trial judges that it would no longer continue to tolerate the anything-goes approach to expert witness opinions:

“We adhere to the deferential standard for review of decisions regarding the admission of testimony by experts.  Nevertheless, we … caution that the standard leaves appellate judges with a considerable task.  We will turn to that task with a sharp eye, particularly in those instances, hopefully few, where the record makes it evident that the decision to receive expert testimony was simply tossed off to the jury under a ‘let it all in’ philosophy.  Our message to our able trial colleagues:  it is time to take hold of expert testimony in federal trials.”

In re Air Crash Disaster, 795 F.2d 1230, 1234 (5th Cir. 1986) (emphasis added).

In the same intervening period between Ferebee and Richardson, Judge Jack Weinstein, a respected evidence scholar and well-known liberal judge, announced:

“The expert is assumed, if he meets the test of Rule 702, to have the skill to properly evaluate the hearsay, giving it probative force appropriate to the circumstances.  Nevertheless, the court may not abdicate its independent responsibilities to decide if the bases meet minimum standards of reliability as a condition of admissibility.  See Fed. Rule Ev. 104(a).  If the underlying data are so lacking in probative force and reliability that no reasonable expert could base an opinion on them, an opinion which rests entirely upon them must be excluded.”

In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1245 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

The notion that technical decisions had to be evidence based, not opinion based, emerged elsewhere as well. For example, in the context of applying statistics, the federal courts pronounced that the ipse dixit of parties and witnesses did not count for much:

“When a litigant seeks to prove his point exclusively through the use of statistics, he is borrowing the principles of another discipline, mathematics, and applying these principles to the law. In borrowing from another discipline, a litigant cannot be selective in which principles are applied. He must employ a standard mathematical analysis. Any other requirement defies logic to the point of being unjust. Statisticians do not simply look at two statistics, such as the actual and expected percentage of blacks on a grand jury, and make a subjective conclusion that the statistics are significantly different. Rather, statisticians compare figures through an objective process known as hypothesis testing.”

Moultrie v. Martin, 690 F.2d 1078, 1082 (4th Cir. 1982) (citations omitted).

Of course, several years after the District of Columbia Circuit decided Ferebee, the Supreme Court decided Daubert in 1993, followed by decisions in Joiner, Kumho Tire, and Weisgram.  In 2000, Congress approved a new Rule of Evidence 702, which incorporated the learning and experience in judicial gatekeeping from a wide range of cases and principles.

Do the Respondents have a defense to having cited an overruled, superseded, discredited precedent in the highest federal Court?  Perhaps they would argue that they are in pari delicto with courts (Daubert-Deniers), which remarkably have ignored the status of Ferebee, and cited it.  See, e.g., Betz v. Pneumo Abex LLC, 998 A.2d 962, 981 (Pa. Super. 2010); McCarrell v. Hoffman-La Roche, Inc., 2009 WL 614484, *23 (N.J.Super.A.D. 2009).  See also Rubanick v. Witco Chemical Corp., 125 N.J. 421, 438-39 (1991)(quoting Ferebee before it was overruled by the Supreme Court, but after it was disregarded by the D.C. Circuit in Richardson).

Matrixx Galvanized – More Errors, More Comedy About Statistics

April 9th, 2011

Matrixx Initiatives is a rich case – rich in irony, comedy, tragedy, and error.  It is well worth further exploration, especially in terms of how this 9-0 decision was reached, what it means, and how it should be applied.

It pains me that the Respondents (plaintiffs) generally did a better job in explaining significance testing than did the Petitioner (defendant).

At least some of the Respondents’ definitional efforts are unexceptional.  For instance:

“Researchers use the term ‘statistical significance’ to characterize a result from a test that satisfies a particular kind of test designed to show that the result is unlikely to have occurred by random chance.  See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 83, 122 (Fed. Judicial Ctr., 2d ed. 2000) (“Reference Manual”).”

Brief for Respondents at 38–39 (Nov. 5, 2010).

“The purpose of significance testing in this context is to assess whether two events (here, taking Zicam and developing anosmia) occur together often enough to make it sufficiently implausible that no actual underlying relationship exists between them.”

Id. at 39.   These definitions seem acceptable as far as they go, as long as we realize that the relationship that remains, when chance is excluded, may not be causal, and indeed, it may well be a false-positive relationship that results from bias or confounding.

Rather than giving one good, clear definition, the Respondents felt obligated to repeat and restate their definitions, and thus wandered into error:

“To test for significance, the researcher typically develops a ‘null hypothesis’ – e.g., that there is no relationship between using intranasal Zicam and the onset of burning pain and subsequent anosmia. The researcher then selects a threshold (the ‘significance level’) that reflects an acceptably low probability of rejecting a true null hypothesis – e.g., of concluding that a relationship between Zicam and anosmia exists based on observations that in fact reflect random chance.”

Id. at 39.  Perhaps the Respondents were using the “cooking frogs” approach.  As the practical wisdom has it, dropping a frog into boiling water risks having the frog jump out, but if you put a frog into a pot of warm water, and gradually bring the pot to a boil, you will have a cooked frog.  Here the Respondents repeat and morph their definition of statistical significance until they have brought it around to their rhetorical goal of confusing statistical significance with causation.  Note that now the definition is muddled, and the Respondents are edging closer towards claiming that statistical significance signals the existence of a “relationship” between Zicam and anosmia, when in fact, the statistical significance simply means that chance is not a likely explanation for the observations.  Whether a “relationship” exists requires further analysis, and usually a good deal more evidence.

“The researcher then calculates a value (referred to as p) that reflects the probability that the observed data could have occurred even if the null hypothesis were in fact true.”

Id. at 39-40 (emphasis in original). Well, this is almost true.  It’s not “even if,” but simply “if”; that is, the p-value is based upon the assumption that the null hypothesis is correct.  The “if” is not an incidental qualifier; it is essential to the definition of statistical significance.  “Even” adds nothing but a slightly misleading rhetorical flourish.  And the p-value is not the probability that the observed data are correct; it is the probability of observing the data obtained, or data more extreme, assuming the null hypothesis is true.
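For readers who want to see the definition in action, here is a minimal sketch of a p-value calculation for a simple binomial example (the numbers are illustrative and have nothing to do with the Matrixx record):

```python
from math import comb

# Minimal sketch of what a p-value is: the probability, computed on the
# assumption that the null hypothesis is true, of data at least as
# extreme as the data observed.  Illustrative numbers only.
n, observed = 20, 15          # say, 15 "successes" in 20 trials
p_null = 0.5                  # null hypothesis: success probability of 0.5

def binom_pmf(k):
    return comb(n, k) * p_null ** k * (1 - p_null) ** (n - k)

# One-sided p-value: P(X >= 15 | H0), the null distribution's tail at
# and beyond the observed count.
p_one_sided = sum(binom_pmf(k) for k in range(observed, n + 1))
# Two-sided p-value for this symmetric null: double the one-sided tail.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
# Nothing in this calculation is "the probability that the null
# hypothesis is true"; the calculation presupposes the null hypothesis.
```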

The Respondents’/plaintiffs’ efforts at serious explication ultimately succumb to their hyperbolic rhetoric.  They explained that statistical significance may not be “practical significance,” which is true enough.  There are, of course, instances in which a statistically significant difference is not particularly interesting.  A large clinical trial, testing two cancer medications head to head, may show that one extends life expectancy by a week or two, but has a worse side-effect profile.  The statistically significant “better” drug may be refused a license by regulatory agencies, or be rejected by knowledgeable oncologists and sensible patients, who are more concerned about quality-of-life issues.

The Respondents are also correct that invoking statistical significance does not provide the simple, bright-line test the Petitioner desired.  Someone would still have to specify the level of alpha, the acceptable level of Type I error, and this would further require a specification of either a one-sided or two-sided test.  To be sure, the two-sided test, with an alpha of 5%, is generally accepted in the world of biostatistics and biomedical research.  Regulatory agencies, including the FDA, however, lower the standard to implement their precautionary principles and goals.  Furthermore, evaluation of statistical significance requires additional analysis to determine whether the observed deviation from the expected value is due to bias or confounding, or whether the statistical test has been unduly diluted by multiple comparisons, subgroup analyses, or data-mining techniques.

Of course, statistical significance today usually occurs in conjunction with an assessment of “effect size,” usually through an analysis of a confidence interval around a point estimate of a risk ratio.  The Respondents’ complaint that the p-value does not convey the magnitude of the association is a bit off the mark, but not completely illegitimate.  For instance, if there were a statistically significant finding of anosmia from Zicam use, in the form of an elevated risk that was itself small, the FDA might well decide that the risk was manageable with a warning to users to discontinue the medication if they experienced a burning sensation upon use.

The Respondents, along with their two would-be “statistical expert” amici, misrepresent the substance of many of the objections to statistical significance in the medical literature.  A telling example is the Respondents’ citation to an article by Professor David Savitz:

David A. Savitz, “Is Statistical Significance Testing Useful in Interpreting Data?” 7 Reproductive Toxicology 95, 96 (1993) (“[S]tatistical significance testing is not useful in the analysis or interpretation of scientific research.”).

Id. at 52, n. 40.

More complete quotations from Professor Savitz’ article, however, reveal a more nuanced, and rather different, message:

“Although P values and statistical significance testing have become entrenched in the practice of biomedical research, their usefulness and drawbacks should be reconsidered, particularly in observational epidemiology. The central role for the null hypothesis, assuming an infinite number of replications, and the dichotomization of results as positive or negative are argued to be detrimental to the proper design and evaluation of research. As an alternative, confidence intervals for estimated parameters convey some information about random variation without several of these limitations. Elimination of statistical significance testing as a decision rule would encourage those who present and evaluate research to more comprehensively consider the methodologic features that may yield inaccurate results and shift the focus from the potential influence of random error to a broader consideration of possible reasons for erroneous results.”

Savitz, 7 Reproductive Toxicology at 95.  Respondents’ case would hardly have been helped by replacing the call for statistical significance with a call for using confidence intervals, along with careful scrutiny of studies for possible sources of erroneous results.

“Regardless of what is taught in statistics courses or advocated by editorials, including the recent one in this journal, statistical tests are still routinely invoked as the primary criterion for assessing whether the hypothesized phenomenon has occurred.”

7 Reproductive Toxicology at 96 (internal citation omitted).

“No matter how carefully worded, “statistically significant” misleadingly conveys notions of causality and importance.”

Id. at 99.  This last quotation really unravels the Respondents’ fatuous use of citations.  Of course, the Savitz article is generally quite inconsistent with the message that the Respondents wished to convey to the Supreme Court, but intellectual honesty required a fuller acknowledgement of Prof. Savitz’ thinking about the matter.

Finally, there are some limited cases in which the failure to obtain a conventionally statistically significant result is not fatal to an assessment of causality.  Such cases usually involve instances in which it is extremely difficult to find observational or experimental data to analyze for statistical significance, but other lines of evidence support the conclusion in a way that scientists accept.  These cases are much rarer than the Respondents imagine, but they may well exist; even so, they do not detract much from Sir Ronald Fisher’s original conception of statistical significance:

“In the investigation of living beings by biological methods statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of the many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.”

Ronald A. Fisher, “The Statistical Method in Psychical Research,” 39 Proceedings of the Society for Psychical Research 189, 191 (1929). Note that Fisher was talking about experiments, not observational studies, and that he hardly was advocating a mechanical, thoughtless criterion of significance.

The Supreme Court’s decision in Castaneda illustrates how misleading statistical significance can be.  In a five-to-four decision, the Court held that a prima facie case of ethnic discrimination could be made out on the basis of statistical significance alone.  In dictum, the Court suggested that statistical evidence alone sufficed when the observed outcome was more than two or three standard deviations from the expected outcome.  Castaneda v. Partida, 430 U.S. 482, 496 n. 17 (1977).  The facts of Castaneda present a compelling case in which the statistical significance observed was likely the result of the confounding effects of reduced civic participation by poor, itinerant minorities, in a Texas county in which the ethnic minority controlled political power and made up a majority of the petit jury that convicted Mr. Partida.
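The “two or three standard deviations” dictum is, at bottom, a z-score for a binomial count.  A minimal sketch with made-up numbers (not the Castaneda record) shows the calculation, and also why a large z-score, by itself, says nothing about the confounding explanations described above:

```python
from math import sqrt

# Minimal sketch of the "two or three standard deviations" dictum as a
# binomial z-score.  The numbers are made up for illustration; they are
# not the Castaneda record.
n = 500            # persons selected for jury service over the period
p_expected = 0.60  # minority share of the presumptively eligible population
observed = 240     # minority persons actually selected

expected = n * p_expected
sd = sqrt(n * p_expected * (1 - p_expected))
z = (observed - expected) / sd
print(f"expected {expected:.0f}, observed {observed}, z = {z:.1f} standard deviations")

# A z-score of about -5.5 lies far beyond "two or three standard
# deviations," yet it says nothing about why the disparity exists:
# selection practices, confounding by eligibility or participation, or
# discrimination.
```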

The Matrixx – A Comedy of Errors

April 6th, 2011

1. Incubi Curiae

As I noted in the Matrixx Unloaded, Justice Sotomayor’s scholarship, in discussing case law under Federal Rule of Evidence 702, was seriously off base.  Of course, Matrixx Initiatives was only a pleading case, and so there was no real reason to consider rules of admissibility or sufficiency, such as Rule 702.

Fortunately, Justice Sotomayor avoided further embarrassment by not discussing the fine details of significance or hypothesis testing.  Not so the two so-called “statistics experts” who submitted an amicus brief.

Consider the following statement by McCloskey and Ziliak about adverse event reports (AER) and statistical significance.

“Suppose that a p-value for a particular test comes in at 9 percent.  Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).”

Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, at 18 (Nov. 18, 2010), 2010 WL 4657930 (U.S.) (emphasis added).

Of course, the whole enterprise of using statistical significance to evaluate AER is suspect because there is no rate, either expected or observed.  A rate could be estimated from the number of AER reported per total number of persons using the medication in some unit of time.  Pharmacoepidemiologists sometimes do engage in such speculative blue-sky enterprises to determine whether a “signal” may have been generated by the AER.  Even if a denominator were implied, and significance testing used, it would be incorrect to treat the association as causal.  Our statistics experts here have committed several serious mistakes; they have

  • treated the AERs as a rate, when they are simply a count;
  • treated the AERs as an observed rate that can be evaluated against a null hypothesis of no increase in rate, when there is no expected rate for the event in question; and
  • treated the pseudo-statistical analysis as if it provided a basis for causal assessment, when at best it would be a very weak observational study that raised an hypothesis for study.

Now that would be, and should be, enough error for any two “statistics experts” in a given day, and we might have hoped that these putative experts would have thought through their ideas before imposing themselves upon a very busy Court.  But there is another mistake, which is even more stunning for having come from self-styled “statistics experts.”  Their derivation of a probability (or an odds statement) that the null hypothesis of no increased rate of AER is false is statistically incorrect.  A p-value is based upon the assumption that the null hypothesis is true, and it measures the probability of having obtained data as extreme as, or more extreme than, the data seen in the study, relative to the expected value.  The p-value is thus a conditional probability statement of the probability of the data given the hypothesis.  As every first-year statistics student learns, you cannot reverse the order of the conditional probability statement without committing a transpositional fallacy.  In other words, you cannot obtain a statement of the probability of the hypothesis given the data from the probability of the data given the hypothesis.  Bayesians, of course, point to this limitation as a “failing” of frequentist statistics, but the limitation cannot be overcome by semantic fiat.
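A short sketch shows why the transposition fails.  Treating a “significant” result as the reported event, the posterior probability that a real effect exists depends on quantities the p-value does not supply: an assumed prior probability of a real effect and an assumed power, both chosen here purely for illustration:

```python
# Sketch of why P(data | H0) cannot simply be flipped into P(H0 | data).
# The prior probability of a real effect, the test's power, and the
# Type I error rate are all assumed, illustrative inputs.
def posterior_prob_effect(prior_effect, power, alpha):
    """P(real effect | 'significant' result) via Bayes' theorem."""
    numerator = power * prior_effect
    denominator = numerator + alpha * (1 - prior_effect)
    return numerator / denominator

# The same "significant" result yields very different posteriors,
# depending on the prior.  Here alpha is set at 0.09 to echo the amici's
# example, itself a simplification of a reported p-value of 0.09.
for prior in (0.5, 0.1, 0.01):
    post = posterior_prob_effect(prior_effect=prior, power=0.8, alpha=0.09)
    print(f"prior {prior:>4}: posterior probability of a real effect = {post:.2f}")

# The amici's "10-to-1 odds that the adverse effect is real" would
# follow only from added assumptions about priors and power, not from
# the p-value alone.
```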

No Confidence in Defendant’s Confidence Intervals

Lest anyone think I am picking on the “statistics experts,” consider the brief filed by Matrixx Initiatives.  In addition to the whole crazy business of relying upon statistical significance in the absence of a study that used a statistical test, there are the following two howlers.  You would probably think that a company putting forward a “no statistical significance” defense would want to state statistical concepts clearly, but take a look at the Petitioner’s brief:

“Various analytical methods can be used to determine whether data reflect a statistically significant result. One such method, calculating confidence intervals, is especially useful for epidemiological analysis of drug safety, because it allows the researcher to estimate the relative risk associated with taking a drug by comparing the incidence rate of an adverse event among a sample of persons who took a drug with the background incidence rate among those who did not. Dividing the former figure by the latter produces a relative risk figure (e.g., a relative risk of 2.0 indicates a 50% greater risk among the exposed population). The researcher then calculates the confidence interval surrounding the observed risk, based on the preset confidence level, to reflect the degree of certainty that the “true” risk falls within the calculated interval. If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate. Green et al., Reference Guide on Epidemiology, at 360-61. For further discussion, see id. at 348-61.”

Matrixx Initiatives Brief at p. 36 n. 18 (emphasis added). Both emphasized passages are wrong.  The Federal Judicial Center’s Reference Manual does not support them. A relative risk of 2.0 represents a 100% increase in risk, not 50%, although Matrixx Initiatives may have been thinking of a very different risk metric – the attributable risk, which would be 50% when the relative risk is 2.0.
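A quick worked check of the first point, using the standard formulas for the excess relative risk and the attributable fraction among the exposed (the quantity sometimes loosely called the attributable risk):

```latex
% Worked check, using the relative risk of 2.0 from the quoted passage
\[
RR - 1 = 2.0 - 1 = 1.0 \quad (\text{a } 100\% \text{ increase in risk}),
\qquad
\frac{RR - 1}{RR} = \frac{2.0 - 1}{2.0} = 0.5 \quad (\text{an attributable fraction of } 50\%).
\]
```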

The second emphasized statement is much worse, because there is no possible word choice that might make the brief a correct understanding of a confidence interval (CI). The CI does not permit us to make a direct probability statement about the truth of any point within the interval. Although the interval does provide some insight into the true value of the parameter, the meaning of the confidence interval must be understood operationally.  If 100 samples were taken, and a 100 × (1 − α) percent confidence interval were constructed from each (with α = 0.05, a 95% interval), we would expect about 95 of the intervals to cover, or include, the true value of the parameter.  (And α is our measure of Type I error, or the probability of false positives.)
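The operational meaning is easy to demonstrate by simulation.  Here is a minimal sketch with made-up parameters (a true mean of 10, a known standard deviation of 3, and repeated samples of 50); nothing in it comes from the Matrixx record:

```python
import random

# Minimal sketch of the "coverage" meaning of a 95% confidence interval:
# repeat the sampling many times and count how often the interval covers
# the true parameter.  All numbers are illustrative.
random.seed(0)
true_mean, sd, n, z = 10.0, 3.0, 50, 1.96

trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    se = sd / n ** 0.5                    # known-sigma case, for simplicity
    if m - z * se <= true_mean <= m + z * se:
        covered += 1

print(f"coverage over {trials} repetitions: {covered / trials:.3f}")
# The printed coverage is close to 0.95: about 95 of every 100 such
# intervals cover the true value.  No single interval carries a 95%
# probability of containing it.
```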

To realize how wrong the Petitioner’s brief is, consider the following example.  The observed relative risk is 10, but it is not statistically significant on a two-tailed test of significance, with α set at 0.05.  Suppose further that the two-sided 95% confidence interval around the observed relative risk is (0.9 to 18).  Matrixx Initiatives asserts:

“If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate.”

The Petitioner would thus have the Court believe that, in the example of a relative risk of 10, with the CI noted above, the result should be interpreted to mean that it is equally probable that the true value is 1.0 or less.  This is statistical silliness.
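To put a rough number on how lopsided the example is, here is a minimal sketch under the usual normal approximation on the log-relative-risk scale, using only the hypothetical numbers above (an illustration, not a calculation from any actual study):

```python
from math import erf, log, sqrt

# Illustrative sketch only: a hypothetical relative risk of 10 with a
# 95% CI whose lower bound is 0.9.  Under the usual normal approximation
# on the log scale, the standard error can be backed out from the
# distance between the point estimate and the lower bound.
rr_hat, lower_bound, z95 = 10.0, 0.9, 1.96
se_log = (log(rr_hat) - log(lower_bound)) / z95

def two_sided_p(null_rr):
    """Approximate p-value for testing a given null relative risk."""
    z = abs(log(rr_hat) - log(null_rr)) / se_log
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

for null_rr in (1.0, 0.9, 10.0):
    print(f"null RR = {null_rr:>4}: two-sided p of about {two_sided_p(null_rr):.2f}")

# The data are barely compatible with a relative risk of 1.0 (p of
# roughly 0.06) and maximally compatible with the point estimate of 10
# (p = 1.0).  "Equally probable" is not a defensible reading.
```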

I have collected below some statements about the CI from well-known statisticians, as an aid to avoiding such distortions of statistical concepts as we see in Matrixx.


“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval …”

Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)

“Nevertheless, the difference between population means is much more likely to be near to the middle of the confidence interval than towards the extremes. Although the confidence interval is wide, the best estimate of the population difference is 6.0 mm Hg, the difference between the sample means.

* * *

“The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word “limits” suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the “limits” will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.”

Martin Gardner & Douglas Altman, “Confidence intervals rather than P values: estimation rather than hypothesis testing,” 292 Brit. Med. J. 746, 748 (1986)

“The main purpose of confidence intervals is to indicate the (im)precision of the sample study estimates as population values. Consider the following points for example: a difference of 20% between the percentages improving in two groups of 80 patients having treatments A and B was reported, with a 95% confidence interval of 6% to 34%. Firstly, a possible difference in treatment effectiveness of less than 6% or of more than 34% is not excluded by such values being outside the confidence interval – they are simply less likely than those inside the confidence interval. Secondly, the middle half of the confidence interval (13% to 27%) is more likely to contain the population value than the extreme two quarters (6% to 13% and 27% to 34%) – in fact the middle half forms a 67% confidence interval. Thirdly, regardless of the width of the confidence interval, the sample estimate is the best indicator of the population value – in this case a 20% difference in treatment response.”

Martin Gardner & Douglas Altman, “Estimating with confidence,” 296 Brit. Med. J. 1210 (1988)
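Gardner and Altman’s “67%” figure for the middle half of a 95% interval follows from the normal approximation; here is a quick check, assuming the usual normal model (my arithmetic, not theirs):

```latex
% The middle half of a 95% CI spans about 0.98 standard errors (half of
% 1.96) on either side of the point estimate.
\[
\Pr\!\left(|Z| \le \tfrac{1.96}{2}\right)
  = \Phi(0.98) - \Phi(-0.98)
  \approx 0.8365 - 0.1635
  \approx 0.67 .
\]
```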

“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible.”

“A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”

Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008)

“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”

Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)