TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Sub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test

May 17th, 2011

Both aggregation and disaggregation of outcomes pose difficult problems for statistical analysis, and for epidemiology.  If outcomes are bundled into a single composite outcome, there must be some basis for the bundling to make sense.  Even so, a composite outcome, such as all cardiovascular disease events, can easily hide an association in a component outcome.  For instance, studies of a drug under scrutiny may show no increased risk for all cardiovascular events, but closer inspection may show an increased risk for heart attacks along with a decreased risk for strokes.

The opposite problem arises when studies report multiple subgroups.  The opportunity for post hoc data mining runs rampant, and the existence of multiple subgroups means that the usual level of statistical significance becomes ineffective for ruling out chance as an explanation for an increased or decreased risk in a subgroup.  This problem is well known and extensively explored in the epidemiology literature, but it receives no attention in the Federal Judicial Center’s current Reference Manual on Scientific Evidence.  I hope that the authors of the Third Edition, which is due out in a few months, give some attention to the problem of subgroup analysis in epidemiology.  This seems to be an area where judges need a good deal of assistance, and where the Reference Manual lets them down.

Litigation tends to be a fertile field for data dredging, or the Texas Sharpshooter’s approach to epidemiology. (The Texas Sharpshooter shoots first and draws the target later.) When studies look at many outcomes, or many subgroups, chance alone will lead to results that have p-values less than the usual level for statistical significance (p < 0.05).  Accepting a result as “significant” when there is a multiplicity of testing or comparisons resulting from subgroup analyses is a form of “data torturing.” Mills, “Data Torturing,” 329 New Engl. J. Med. 1196, 1196 (1993) (“If you torture the data long enough, they will confess.”).

The multiple testing or comparison issue arises in both cohort and case-control studies.  Cohort studies can look at cancer morbidity or mortality in 20 different organs, with multiple histological subtypes for each cancer.  There are hundreds of diseases, by World Health Organization disease codes, each of which is a possible outcome in a cohort study.  The odds are very good that several disease outcomes will be significantly elevated or decreased by chance alone.  Similarly, in a case-control study, participants with the outcome of interest can be questioned about hundreds of lifestyle and exposure variables.  Again, the finding of a “risk factor” with statistical significance is not very compelling under these circumstances.
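The arithmetic of this inflation is easy to demonstrate.  Here is a minimal sketch, in Python, using purely hypothetical numbers of independent tests; the 40% figure for ten subgroup analyses, quoted from Stewart and Parmar below, falls out of the same calculation:

```python
# Chance of at least one false-positive "significant" finding when k
# independent outcomes are each tested at alpha = 0.05, with every null
# hypothesis in fact true.
alpha = 0.05
for k in (1, 10, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests -> P(at least one p < 0.05 by chance) = {p_any:.2f}")
```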

The problem of subgroup analyses is exacerbated by defense counsel’s emphasis on statistical significance as a “bright-line” test.  When subgroup analyses yield a statistically significant result, at the usual p < 0.05, as they often will by chance alone, plaintiffs’ counsel obtain a “gotcha” moment.  Having built up the importance of statistical significance, defense counsel are hard pressed to dismiss the “significant” finding, even though the study design makes it highly questionable, if not downright meaningless.

Although the Reference Manual ignores this recurrent problem, several authors have sounded stern warnings about it. For instance, Lisa Bero, who writes frequently on issues at the intersection of science and the law, admonishes:

“Specifying subgroup analysis after data collection for the review has already begun can be a ‘fishing expedition’ or ‘data dredging’ for statistically significant results and is not appropriate.”

L. Bero, “Evaluating Systematic Reviews and Meta-Analyses,” J. L. & Policy 569, 576 (2006).

Egger and Davey Smith, two well-respected authors on methodological issues in epidemiology, warn:

“Similarly, unplanned data-driven subgroup analyses are likely to produce spurious results.”

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, at 24, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

Stewart and Parmar explain the genesis of the problem and the result of diluting the protection that statistical significance usually provides against Type I errors:

“In general, the results of these subgroup analyses can be very misleading owing to the very high probability that any observed difference is due solely to chance.  For example, if 10 subgroup analyses are carried out, there is a 40% chance of finding at least one significant false-positive effect (5% significance level).  Further, when the results of subgroup analyses are reported, often only those that have yielded a significant result are presented, without noting that many other analyses have been performed.”

Stewart and Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).

“Such data dredging must be avoided and subgroup analyses should be limited to those that are specified a priori in the trial protocol.”

Id. at 272.

“Readers and reviewers should be aware that subgroup analyses, exploratory or otherwise, are likely to be particularly unreliable in situations where no overall effect of treatment has been observed.  In this case, if one subgroup exhibits a particularly positive effect of treatment, then another subgroup has to have a counteracting negative effect.”

* * *

“Consequently, perhaps the most sensible advice to readers and reviewers is to be very skeptical about the results of subgroup analyses.”

Id.  See also Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000) (“Analysis of subgroup results in a clinical trial is surprisingly unreliable, even in a large trial.  This is the result of a combination of reduced statistical power, increased variance and the play of chance.  Reliance on such analyses is more likely to be erroneous, and hence harmful, than application of the overall proportional (or relative) result in the whole trial to the estimate of absolute risk in that subgroup.  Plausible explanations can usually be found for effects that are, in reality, simply due to the play of chance.  When clinicians believe such subgroup analyses, there is a real danger of harm to the individual patient.”).

These warnings and admonitions are important caveats to statistical significance.  In emphasizing the importance of statistical significance in evaluating statistical evidence, defense lawyers are sometimes unwittingly hoist with their own petard, in the form of studies with results that meet the usual p-value threshold of less than 5%.  Courts see these defense lawyers as engaged in special pleading when counsel argue that study multiplicity requires changing the p-value threshold to preserve the desired rate of Type I error, but that is exactly what must be done.
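What “changing the p-value threshold” means in practice can be shown with a short sketch.  The Bonferroni adjustment below is the simplest such method; the p-values and the number of comparisons are hypothetical, and less conservative procedures (such as Holm’s) exist:

```python
# Bonferroni adjustment: to hold the family-wise Type I error rate near
# alpha across m comparisons, test each comparison at alpha / m.
alpha, m = 0.05, 10                     # hypothetical: ten subgroup analyses
threshold = alpha / m                   # 0.005, rather than 0.05
p_values = [0.003, 0.02, 0.04, 0.30]    # hypothetical subgroup p-values
for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f} -> {verdict} at family-wise alpha = {alpha}")
```

Under this adjustment, only the p = 0.003 subgroup result would survive; the “gotcha” findings at 0.02 and 0.04 would not.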

A few years ago, the New England Journal of Medicine published an article that detailed the problem and promulgated guidelines for avoiding the worst abuses.  R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).  Wang and colleagues provide some important insights for how subgroup analyses can lead to increased rates of Type I errors, and they provide guidelines for authors on appropriate descriptions of subgroup analyses:

“However, subgroup analyses also introduce analytic challenges and can lead to overstated and misleading results.”

Id. at 2189a.

“When multiple subgroup analyses are performed, the probability of a false positive finding can be substantial.”

Id. at 2190a.

“There are several methods for addressing multiplicity that are based on the use of more stringent criteria for statistical significance than the customary P < 0.05.”

Id. at 2190b.

“A pre-specified subgroup analysis is one that is planned and documented before any examination of the data, preferably in the study protocol.”

Id. at 2190b.

“Post hoc analyses refer to those in which the hypotheses being tested are not specified before any examination of the data. Such analyses are of particular concern because it is often unclear how many were undertaken and whether some were motivated by inspection of the data. However, both pre-specified and post hoc subgroup analyses are subject to inflated false positive rates arising from multiple testing. Investigators should avoid the tendency to pre-specify many subgroup analyses in the mistaken belief that these analyses are free of the multiplicity problem.”

Id. at 2190b.

“When properly planned, reported, and interpreted, subgroup analyses can provide valuable information.”

Id. at 2193b.

Although Wang and colleagues aim primarily at the abuse of subgroup analyses in randomized clinical trials, they make clear that the abuse is equally present in observational studies:

“In other settings, including observational studies, we encourage complete and thorough reporting of the subgroup analyses in the spirit of the guidelines listed.”

Id. at 2193b.

Wang and colleagues provide some very specific guidelines for reporting subgroup analyses.  These guidelines are a helpful resource for courts in making sober assessments of results from subgroup analyses.

Recently, STROBE, another guideline initiative, this one in the field of observational epidemiology, provided similar guidance to authors and journals for reporting subgroup analyses:

“[M]any debate the use and value of analyses restricted to subgroups of the study population. Subgroup analyses are nevertheless often done. Readers need to know which subgroup analyses were planned in advance, and which arose while analyzing the data. Also, it is important to explain what methods were used to examine whether effects or associations differed across groups … .”

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805, 817 (2007).

“There is debate about the dangers associated with subgroup analyses, and multiplicity of analyses in general.  In our opinion, there is too great a tendency to look for evidence of subgroup-specific associations, or effect-measure modification, when overall results appear to suggest little or no effect. On the other hand, there is value in exploring whether an overall association appears consistent across several, preferably pre-specified, subgroups, especially when a study is large enough to have sufficient data in each subgroup. A second area of debate is about interesting subgroups that arose during the data analysis. They might be important findings, but might also arise by chance. Some argue that it is neither possible nor necessary to inform the reader about all subgroup analyses done as future analyses of other data will tell to what extent the early exciting findings stand the test of time. We advise authors to report which analyses were planned, and which were not … . This will allow readers to judge the implications of multiplicity, taking into account the study’s position on the continuum from discovery to verification or refutation.”

Id. at 826-27.

Bibliography

E. Akl, M. Briel, J.J. You, et al., “LOST to follow-up Information in Trials (LOST-IT): a protocol on the potential impact,” 10 Trials 40 (2009).

Susan Assmann, Stuart Pocock, Laura Enos & Linda Kasten, “Subgroup analysis and other (mis)uses of baseline data in clinical trials,” 355 Lancet 1064 (2000).

M. Bhandari, P.J. Devereaux, P. Li, et al., “Misuse of baseline comparison tests and subgroup analyses in surgical trials,” 447 Clin. Orthoped. Relat. Res. 247 (2006).

S. T. Brookes, E. Whitely, M. Egger, et al., “Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test,” 57 J. Clin. Epid. 229 (2004).

A-W Chan, A. Hrobjartsson, K.J. Jorgensen, et al., “Discrepancies in sample size calculations and data analyses reported in randomised trials: comparison of publications with protocols,” 337 Brit. Med. J. a2299 (2008).

L. Cui, H.M. Hung, S.J. Wang, et al., “Issues related to subgroup analysis in clinical trials,” 12 J. Biopharm. Stat. 347 (2002).

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

J. Fletcher, “Subgroup analyses: how to avoid being misled,” 335 Brit. Med. J. 96 (2007).

Nick Freemantle, “Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?” 322 Brit. Med. J. 989 (2001).

G. Guyatt, P.C. Wyer, J. Ioannidis, “When to Believe a Subgroup Analysis,” in G. Guyatt, et al., eds., Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice 571-83 (2008).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Inconsistent trial assessments by the National Institute for Health and Clinical Excellence and IQWiG: standards for the performance and interpretation of subgroup analyses are needed,” 63 J. Clin. Epidem. 1298 (2010).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Standards for subgroup analyses are needed? We couldn’t agree more,”  64 J. Clin. Epidem. 451 (2011).

R. Hatala, S. Keitz, P. Wyer, et al., “Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results,” 172 Can. Med. Ass’n J. 661 (2005).

A.V. Hernandez, E.W. Steyerberg, G.S. Taylor, et al., “Subgroup analysis and covariate adjustment in randomized clinical trials of traumatic brain injury: a systematic review,” 57 Neurosurgery 1244 (2005).

A.V. Hernandez, E. Boersma, G.D. Murray, et al., “Subgroup analyses in therapeutic cardiovascular clinical trials: are most of them misleading?” 151 Am. Heart J. 257 (2006).

K. Hirji & M. Fagerland, “Outcome based subgroup analysis: a neglected concern,” 10 Trials 33 (2009).

Stephen W. Lagakos, “The Challenge of Subgroup Analyses — Reporting without Distorting,” 354 New Engl. J. Med. 1667 (2006).

C.M. Martin, G. Guyatt, V. M. Montori, “The sirens are singing: the perils of trusting trials stopped early and subgroup analyses,” 33 Crit. Care Med. 1870 (2005).

D. Moher, K. Schulz, D. Altman, et al., “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” 357 Lancet 1191 (2001).

V.M. Montori, R. Jaeschke, H.J. Schunemann, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004).

A.D. Oxman & G.H. Guyatt, “A consumer’s guide to subgroup analyses,” 116 Ann. Intern. Med. 78 (1992).

A. Oxman, G. Guyatt, L. Green, et al., “When to believe a subgroup analysis,” in G. Guyatt, et al., eds., Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice 553-65 (2008).

S. Pocock, M. D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials:  A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987).

S. Pocock, S. Assmann, L. Enos, et al., “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems,” 21 Statistics in Medicine 2917 (2002).

Peter Rothwell, “Subgroup analysis in randomised controlled trials:  importance, indications, and interpretation,” 365 Lancet 176 (2005).

Kenneth Schulz & David Grimes, “Multiplicity in randomised trials II: subgroup and interim analyses,” 365 Lancet 1657 (2005).

Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000).

Reuel Stallones, “The Use and Abuse of Subgroup Analysis in Epidemiological Research,” 16 Prev. Med. 183 (1987).

Stewart & Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).

Xin Sun, Matthias Briel, Jason Busse, Elie A. Akl, John J. You, Filip Mejza, Malgorzata Bala, Natalia Diaz-Granados, Dirk Bassler, Dominik Mertz, Sadeesh K. Srinathan, Per Olav Vandvik, German Malaga, Mohamed Alshurafa, Philipp Dahm, Pablo Alonso-Coello, Diane M. Heels-Ansdell, Neera Bhatnagar, Bradley C. Johnston, Li Wang, Stephen D. Walter, Douglas G. Altman, and Gordon Guyatt, “Subgroup Analysis of Trials Is Rarely Easy (SATIRE): a study protocol for a systematic review to characterize the analysis, reporting, and claim of subgroup effects in randomized trials,” 10 Trials 1010 (2009).

A. Trevor & G. Sheldon, “Criteria for the Implementation of Research Evidence in Policy and Practice,” in A. Haines, ed., Getting Research Findings Into Practice 11 (2d ed. 2008).

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805–835 (2007).

Erik von Elm & Matthias Egger, “The scandal of poor epidemiological research: reporting guidelines are needed for observational epidemiology,” 329 Brit. Med. J. 868 (2004).

R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).

S. Yusuf, J. Wittes, J. Probstfield, et al., “Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials,” 266 J. Am. Med. Ass’n 93 (1991).

De-Zincing the Matrixx

April 12th, 2011

Although the plaintiffs in Matrixx Initiatives, Inc. v. Siracusano generally were more accurate in defining statistical significance than the defendant, or than the so-called “statistical expert” amici (Ziliak and McCloskey), the plaintiffs’ brief went off the rails when it turned to discussing the requirements for proving causation.  Of course, the admissibility and sufficiency of evidence to show causation were not at issue in the case, but the plaintiffs got pulled down the rabbit hole dug by the defendant, in its bid to establish a legal bright-line rule about pleading.

Differential Diagnosis

In an effort to persuade the Court that statistical significance is not required, the plaintiffs/respondents threw science and legal principles to the wind.  They contended that statistical significance is not at all necessary to causal determinations because

“[c]ourts have recognized that a physician’s differential diagnosis (which identifies a likely cause of certain symptoms after ruling out other possibilities) can be reliable evidence of causation.”

Respondents’ Brief at 49.   Perhaps this is simply the Respondents’ naiveté, but it seems to suggest scienter to deceive. Differential diagnosis is not about etiology; it is about diagnosis, which rarely incorporates an assessment of etiology.  Even if the differentials were etiologies and not diagnoses, the putative causes in the differential must already be shown, independently, to be capable of causing the outcome in question. See, e.g., Tamraz v. Lincoln Electric Co., 620 F.3d 665 (6th Cir. 2010).  A physician cannot rule in an etiology in a specific person simply by positing it among the differentials, without independent, reliable evidence that the ruled in “specific cause” can cause the outcome in question, under the circumstances of the plaintiff’s exposure.  Furthermore, differential diagnosis or etiology is nothing more than a process of elimination to select a specific cause; it has nothing to do with statistical significance because it has nothing to do with general causation.

This error in the Respondents’ brief about differential diagnosis unfortunately finds its way into Justice Sotomayor’s opinion.

Daubert Denial and the Recrudescence of Ferebee

In their zeal, the Respondents go further than conflating general and specific causation, and advancing an erroneous view of what must be shown before a putative cause can be inserted into a set of differential (specific) causes.  They cite one of the most discredited cases in 20th-century American law of expert witnesses:

Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C. Cir. 1984) (“products liability law does not preclude recovery until a ‘statistically significant’ number of people have been injured”).

Respondents’ Brief at 50.  This is not a personal, subjective opinion about this 1984 pre-Daubert decision.  Ferebee was wrongly decided when announced, and it was soon abandoned by the very court that issued the opinion.  It has been a derelict on the sea of evidence law for over a quarter of a century.  Citing to Ferebee, without acknowledging its clearly overruled status, raises an interesting issue about candor to the Court, and the responsibilities of counsel in trash picking in the dustbin of expert witness law.

Along with its apparent rejection of statistical significance, Ferebee is known for articulating an “anything goes” philosophy toward the admissibility and sufficiency of expert witnesses:

“Judges, both trial and appellate, have no special competence to resolve the complex and refractory causal issues raised by the attempt to link low-level exposure to toxic chemicals with human disease.  On questions such as these, which stand at the frontier of current medical and epidemiological inquiry, if experts are willing to testify that such a link exists, it is for the jury to decide whether to credit such testimony.”

Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1534 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).  Within a few years, the nihilism of Ferebee was severely limited by the court that decided the case:

“The question whether Bendectin causes limb reduction defects is scientific in nature, and it is to the scientific community that the law must look for the answer.  For this reason, expert witnesses are indispensable in a case such as this.  But that is not to say that the court’s hands are inexorably tied, or that it must accept uncritically any sort of opinion espoused by an expert merely because his credentials render him qualified to testify… . Whether an expert’s opinion has an adequate basis and whether without it an evidentiary burden has been met, are matters of law for the court to decide.”

Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 829 (D.C. Cir. 1988).

Of course, several important decisions intervened between Ferebee and Richardson.  In 1986, the Fifth Circuit expressed a clear message to trial judges that it would no longer continue to tolerate the anything-goes approach to expert witness opinions:

“We adhere to the deferential standard for review of decisions regarding the admission of testimony by experts.  Nevertheless, we … caution that the standard leaves appellate judges with a considerable task.  We will turn to that task with a sharp eye, particularly in those instances, hopefully few, where the record makes it evident that the decision to receive expert testimony was simply tossed off to the jury under a ‘let it all in’ philosophy.  Our message to our able trial colleagues:  it is time to take hold of expert testimony in federal trials.”

In re Air Crash Disaster, 795 F.2d 1230, 1234 (5th Cir. 1986) (emphasis added).

In the same intervening period between Ferebee and Richardson, Judge Jack Weinstein, a respected evidence scholar and well-known liberal judge, announced:

“The expert is assumed, if he meets the test of Rule 702, to have the skill to properly evaluate the hearsay, giving it probative force appropriate to the circumstances.  Nevertheless, the court may not abdicate its independent responsibilities to decide if the bases meet minimum standards of reliability as a condition of admissibility.  See Fed. Rule Ev. 104(a).  If the underlying data are so lacking in probative force and reliability that no reasonable expert could base an opinion on them, an opinion which rests entirely upon them must be excluded.”

In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1245 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

The notion that technical decisions had to be evidence based, not opinion based, emerged elsewhere as well. For example, in the context of applying statistics, the federal courts pronounced that the ipse dixit of parties and witnesses did not count for much:

“When a litigant seeks to prove his point exclusively through the use of statistics, he is borrowing the principles of another discipline, mathematics, and applying these principles to the law. In borrowing from another discipline, a litigant cannot be selective in which principles are applied. He must employ a standard mathematical analysis. Any other requirement defies logic to the point of being unjust. Statisticians do not simply look at two statistics, such as the actual and expected percentage of blacks on a grand jury, and make a subjective conclusion that the statistics are significantly different. Rather, statisticians compare figures through an objective process known as hypothesis testing.”

Moultrie v. Martin, 690 F.2d 1078, 1082 (4th Cir. 1982) (citations omitted).

Of course, not long after the District of Columbia Circuit decided Ferebee, the Supreme Court, in 1993, decided Daubert, followed by its decisions in Joiner, Kumho Tire, and Weisgram.  In 2000, Congress approved a new Rule of Evidence 702, which incorporated the learning and experience in judicial gatekeeping from a wide range of cases and principles.

Do the Respondents have a defense for having cited an overruled, superseded, discredited precedent in the highest federal court?  Perhaps they would argue that they are in pari delicto with courts (Daubert-deniers), which remarkably have ignored the status of Ferebee, and cited it.  See, e.g., Betz v. Pneumo Abex LLC, 998 A.2d 962, 981 (Pa. Super. 2010); McCarrell v. Hoffmann-La Roche, Inc., 2009 WL 614484, *23 (N.J. Super. A.D. 2009).  See also Rubanick v. Witco Chemical Corp., 125 N.J. 421, 438-39 (1991) (quoting Ferebee before it was superseded by the Supreme Court’s Daubert decision, but after it was disregarded by the D.C. Circuit in Richardson).

Matrixx Galvanized – More Errors, More Comedy About Statistics

April 9th, 2011

Matrixx Initiatives is a rich case – rich in irony, comedy, tragedy, and error.  It is well worth further exploration, especially in terms of how this 9-0 decision was reached, what it means, and how it should be applied.

It pains me that the Respondents (plaintiffs) generally did a better job in explaining significance testing than did the Petitioner (defendant).

At least some of the Respondents’ definitional efforts are unexceptionable.  For instance:

“Researchers use the term ‘statistical significance’ to characterize a result from a test that satisfies a particular kind of test designed to show that the result is unlikely to have occurred by random chance.  See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 83, 122 (Fed. Judicial Ctr., 2d ed. 2000) (“Reference Manual”).”

Brief for Respondents, at 38-39 (Nov. 5, 2010).

“The purpose of significance testing in this context is to assess whether two events (here, taking Zicam and developing anosmia) occur together often enough to make it sufficiently implausible that no actual underlying relationship exists between them.”

Id. at 39.   These definitions seem acceptable as far as they go, as long as we realize that the relationship that remains, when chance is excluded, may not be causal, and indeed, it may well be a false-positive relationship that results from bias or confounding.

Rather than giving one good, clear definition, the Respondents felt obligated to repeat and restate their definitions, and thus wandered into error:

“To test for significance, the researcher typically develops a ‘null hypothesis’ – e.g., that there is no relationship between using intranasal Zicam and the onset of burning pain and subsequent anosmia. The researcher then selects a threshold (the ‘significance level’) that reflects an acceptably low probability of rejecting a true null hypothesis – e.g., of concluding that a relationship between Zicam and anosmia exists based on observations that in fact reflect random chance.”

Id. at 39.  Perhaps the Respondents were using the “cooking frogs” approach.  As the practical wisdom has it, dropping a frog into boiling water risks having the frog jump out, but if you put a frog into a pot of warm water, and gradually bring the pot to a boil, you will have a cooked frog.  Here the Respondents repeat and morph their definition of statistical significance until they have brought it around to their rhetorical goal of confusing statistical significance with causation.  Note that now the definition is muddled, and the Respondents are edging closer towards claiming that statistical significance signals the existence of a “relationship” between Zicam and anosmia, when in fact, the statistical significance simply means that chance is not a likely explanation for the observations.  Whether a “relationship” exists requires further analysis, and usually a good deal more evidence.

“The researcher then calculates a value (referred to as p) that reflects the probability that the observed data could have occurred even if the null hypothesis were in fact true.”

Id. at 39-40 (emphasis in original). Well, this is almost true.  It is not “even if,” but simply “if”; that is, the p-value is based upon the assumption that the null hypothesis is correct.  The “if” is not an incidental qualifier; it is essential to the definition of statistical significance.  “Even” here adds nothing but a slightly misleading rhetorical flourish.  And the p-value is not the probability that the observed data are correct; it is the probability of observing the data obtained, or data more extreme, assuming the null hypothesis is true.
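For concreteness, here is a minimal numerical sketch of that definition, using a hypothetical binomial example (the counts are invented purely for illustration):

```python
from math import comb

# A p-value is the probability, computed on the assumption that the null
# hypothesis is true, of data as extreme as, or more extreme than, the
# data observed.  Hypothetical example: 14 "successes" in 20 trials,
# where the null hypothesis puts the success probability at 0.5.
n, observed, p0 = 20, 14, 0.5
upper_tail = sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
                 for k in range(observed, n + 1))
p_value = 2 * upper_tail          # two-sided, by symmetry when p0 = 0.5
print(f"one-sided tail = {upper_tail:.4f}; two-sided p-value = {p_value:.4f}")
# prints roughly 0.0577 and 0.1153; neither number is
# "the probability that the observed data are correct"
```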

The Respondents’/plaintiffs’ efforts at serious explication ultimately succumb to their hyperbolic rhetoric.  They explained that statistical significance may not be “practical significance,” which is true enough.  There are, of course, instances in which a statistically significant difference is not particularly interesting.  A large clinical trial, testing two cancer medications head to head, may show that one extends life expectancy by a week or two, but has a worse side-effect profile.  The statistically significant “better” drug may be refused a license from regulatory agencies, or be rejected by knowledgeable oncologists and sensible patients, who are more concerned about quality-of-life issues.

The Respondents are also correct that invoking statistical significance does not provide the simple, bright-line test the Petitioner desired.  Someone would still have to specify the level of alpha, the acceptable level of Type I error, and this would further require a specification of either a one-sided or two-sided test.  To be sure, the two-sided test, with an alpha of 5%, is generally accepted in the world of biostatistics and biomedical research.  Regulatory agencies, including the FDA, however, lower the standard to implement their precautionary principles and goals.  Furthermore, evaluation of statistical significance requires additional analysis to determine whether the observed deviation from expected is due to bias or confounding, or whether the statistical test has been unduly diluted by multiple comparisons, subgroup analyses, or data-mining techniques.

Of course, statistical significance today usually occurs in conjunction with an assessment of “effect size,” usually through an analysis of a confidence interval around a point estimate of a risk ratio.  The Respondents’ complaint that the p-value does not convey the magnitude of the association is a bit off the mark, but not completely illegitimate.  For instance, if there were a statistically significant finding of anosmia from Zicam use, in the form of an elevated risk that was itself small, the FDA might well decide that the risk was manageable with a warning to users to discontinue the medication if they experienced a burning sensation upon use.
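To illustrate how the point estimate and the interval work together, here is a sketch of a conventional 95% confidence interval for a risk ratio, computed on the log scale from a hypothetical 2×2 table (the counts are invented):

```python
from math import exp, log, sqrt

# 95% confidence interval for a relative risk, computed on the log scale.
# Hypothetical 2x2 counts: events and totals in exposed and unexposed groups.
a, n1 = 30, 1000          # exposed: 30 events among 1,000 persons
b, n0 = 15, 1000          # unexposed: 15 events among 1,000 persons
rr = (a / n1) / (b / n0)                    # point estimate: 2.0
se = sqrt(1/a - 1/n1 + 1/b - 1/n0)          # standard error of log(RR)
lo, hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)
print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")   # about (1.08, 3.69)
```

Because the lower bound of this hypothetical interval exceeds 1.0, the result would be called statistically significant at the conventional two-sided 5% level; the p-value alone would not have conveyed the size of the estimate.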

The Respondents, along with their two would-be “statistical expert” amici, misrepresent the substance of many of the objections to statistical significance in the medical literature.  A telling example is the Respondents’ citation to an article by Professor David Savitz:

David A. Savitz, “Is Statistical Significance Testing Useful in Interpreting Data?” 7 Reproductive Toxicology 95, 96 (1993) (“[S]tatistical significance testing is not useful in the analysis or interpretation of scientific research.”).

Id. at 52, n. 40.

A more complete quotation from Professor Savitz’s article, however, reveals a more nuanced, and rather different, message:

“Although P values and statistical significance testing have become entrenched in the practice of biomedical research, their usefulness and drawbacks should be reconsidered, particularly in observational epidemiology. The central role for the null hypothesis, assuming an infinite number of replications, and the dichotomization of results as positive or negative are argued to be detrimental to the proper design and evaluation of research. As an alternative, confidence intervals for estimated parameters convey some information about random variation without several of these limitations. Elimination of statistical significance testing as a decision rule would encourage those who present and evaluate research to more comprehensively consider the methodologic features that may yield inaccurate results and shift the focus from the potential influence of random error to a broader consideration of possible reasons for erroneous results.”

Savitz, 7 Reproductive Toxicology at 95.  Respondents’ case would hardly have been helped by replacing a call for statistical significance with a call for confidence intervals, along with careful scrutiny of studies for possible sources of erroneous results.

“Regardless of what is taught in statistics courses or advocated by editorials, including the recent one in this journal, statistical tests are still routinely invoked as the primary criterion for assessing whether the hypothesized phenomenon has occurred.”

7 Reproductive Toxicology at 96 (internal citation omitted).

“No matter how carefully worded, “statistically significant” misleadingly conveys notions of causality and importance.”

Id. at 99.  This last quotation really unravels the Respondents’ fatuous use of citations.  Of course, the Savitz article is generally quite inconsistent with the message that the Respondents wished to convey to the Supreme Court, but intellectual honesty required a fuller acknowledgment of Prof. Savitz’s thinking about the matter.

Finally, there are some limited cases, in which the failure to obtain a conventionally statistically significant result is not fatal to an assessment of causality.  Such cases usually involve instances in which it is extremely difficult to find observational or experimental data to analyze for statistical significance, but other lines of evidence support the conclusion in a way that scientists accept.  Although these cases are much rarer than Respondents imagine, they may well exist, but they do not detract much from Sir Ronald Fisher’s original conception of statistical significance:

“In the investigation of living beings by biological methods statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of the many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.”

Ronald A. Fisher, “The Statistical Method in Psychical Research,” 39 Proceedings of the Society for Psychical Research 189, 191 (1929). Note that Fisher was talking about experiments, not observational studies, and that he hardly was advocating a mechanical, thoughtless criterion of significance.

The Supreme Court’s decision in Castaneda illustrates how misleading statistical significance can be.  In a five-to-four decision, the Court held that a prima facie case of ethnic discrimination could be made out on the basis of statistical significance alone.  In dictum, the Court suggested that statistical evidence alone sufficed when the observed outcome was more than two or three standard deviations from the expected outcome.  Castaneda v. Partida, 430 U.S. 482, 496 n.17 (1977).  The facts of Castaneda illustrate a compelling case in which the statistical significance observed was likely the result of confounding effects of reduced civic participation by poor, itinerant minorities, in a Texas county in which the ethnic minority controlled political power, and made up a majority of the petit jury that convicted Mr. Partida.
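The binomial calculation behind the Court’s dictum is simple to reproduce; the sketch below uses figures approximating those recited in the Court’s footnote 17.  Note that the enormous disparity, measured in standard deviations, says nothing about why the disparity exists; that is precisely the confounding problem:

```python
from math import sqrt

# The "two or three standard deviations" calculation from Castaneda's
# footnote 17, under a binomial model.  Figures approximate those recited
# by the Court: a 79.1% Mexican-American population share, 870 persons
# summoned for grand jury service, 339 of them Mexican-American.
n, p, observed = 870, 0.791, 339
expected = n * p                        # about 688
sd = sqrt(n * p * (1 - p))              # about 12
print(f"disparity = {(expected - observed) / sd:.1f} standard deviations")
# prints roughly 29 -- far past "two or three," yet silent on confounding
```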

The Matrixx – A Comedy of Errors

April 6th, 2011

1. Incubi Curiae

As I noted in “Matrixx Unloaded,” Justice Sotomayor’s scholarship, in discussing case law under Federal Rule of Evidence 702, was seriously off base.  Of course, Matrixx Initiatives was only a pleading case, and so there was no real reason to consider rules of admissibility or sufficiency, such as Rule 702.

Fortunately, Justice Sotomayor avoided further embarrassment by not discussing the fine details of significance or hypothesis testing.  Not so the two so-called “statistics experts” who submitted an amicus brief.

Consider the following statement by McCloskey and Ziliak, about adverse event reports (AER) and statistical significance.

“Suppose that a p-value for a particular test comes in at 9 percent.  Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).”

Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, at 18 (Nov. 18, 2010), 2010 WL 4657930 (U.S.) (emphasis added).

Of course, the whole enterprise of using statistical significance to evaluate AERs is suspect because there is no rate, either expected or observed.  A rate could be estimated from the number of AERs reported per total number of persons using the medication in some unit of time.  Pharmacoepidemiologists sometimes do engage in such speculative, blue-sky enterprises to determine whether a “signal” may have been generated by the AERs.  Even if a denominator were implied, and significance testing used, it would be incorrect to treat the association as causal.  Our statistics experts here have committed several serious mistakes; they have

  • treated the AERs as a rate, when they are simply counts;
  • treated the AERs as an observed rate that can be evaluated against a null hypothesis of no increase in rate, when there is no expected rate for the event in question; and
  • treated the pseudo-statistical analysis as if it provided a basis for causal assessment, when at best it would be a very weak observational study that raised an hypothesis for study.

Now that would be, and should be, enough error for any two “statistics experts” in a given day, and we might have hoped that these putative experts would have thought through their ideas before imposing themselves upon a very busy Court.  But there is another mistake, which is even more stunning for having come from self-styled “statistics experts.”  Their derivation of a probability (or an odds statement) that the null hypothesis of no increased rate of AER is false is statistically incorrect.  A p-value is based upon the assumption that the null hypothesis is true, and it measures the probability of having obtained data as extreme as, or more extreme than, the data seen in the study, measured from the expected value.  The p-value is thus a conditional probability statement of the probability of the data given the hypothesis.  As every first-year statistics student learns, you cannot reverse the order of the conditional probability statement without committing a transpositional fallacy.  In other words, you cannot obtain a statement of the probability of the hypothesis given the data from the probability of the data given the hypothesis.  Bayesians, of course, point to this limitation as a “failing” of frequentist statistics, but the limitation cannot be overcome by semantic fiat.
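The transpositional fallacy can be made concrete with Bayes’ theorem.  The sketch below uses wholly hypothetical likelihoods and priors; its only point is that the posterior probability of the null hypothesis depends on quantities that a p-value alone does not supply, so no “10-to-1 odds” can be conjured from p = 0.09:

```python
# A p-value is P(data | H0), not P(H0 | data).  Getting from one to the
# other requires a prior probability for H0, via Bayes' theorem.  All of
# the numbers below are hypothetical: a result with p = 0.09 under the
# null, and an assumed 50% chance of such a result if the alternative
# hypothesis were true.
p_data_h0 = 0.09       # probability of data this extreme, given H0
p_data_h1 = 0.50       # assumed probability of such data, given H1
for prior_h0 in (0.5, 0.9):
    posterior_h0 = (p_data_h0 * prior_h0) / (
        p_data_h0 * prior_h0 + p_data_h1 * (1 - prior_h0))
    print(f"prior P(H0) = {prior_h0:.1f} -> posterior P(H0) = {posterior_h0:.2f}")
# prints 0.15 and 0.62: the answer moves with the prior, which the
# p-value alone cannot supply
```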

No Confidence in Defendant’s Confidence Intervals

Lest anyone think I am picking on the “statistics experts,” consider the brief filed by Matrixx Initiatives.  In addition to the whole crazy business of relying upon statistical significance in the absence of a study that used a statistical test, there are the following two howlers.  You would probably think that the company putting forward a “no statistical significance” defense would want to state statistical concepts clearly, but take a look at the Petitioner’s brief:

“Various analytical methods can be used to determine whether data reflect a statistically significant result. One such method, calculating confidence intervals, is especially useful for epidemiological analysis of drug safety, because it allows the researcher to estimate the relative risk associated with taking a drug by comparing the incidence rate of an adverse event among a sample of persons who took a drug with the background incidence rate among those who did not. Dividing the former figure by the latter produces a relative risk figure (e.g., a relative risk of 2.0 indicates a 50% greater risk among the exposed population). The researcher then calculates the confidence interval surrounding the observed risk, based on the preset confidence level, to reflect the degree of certainty that the “true” risk falls within the calculated interval. If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate. Green et al., Reference Guide on Epidemiology, at 360-61. For further discussion, see id. at 348-61.”

Matrixx Initiatives Brief at p. 36 n. 18 (emphasis added). Both passages in bold are wrong.  The Federal Judicial Center’s Reference Manual does not support the bold statements. A relative risk of 2.0 represents a 100% increase in risk, not 50%, although Matrixx Initiatives may have been thinking of a very different risk metric – the attributable risk, which would be 50% when the relative risk is 2.0.
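For concreteness, the arithmetic in a two-line sketch:

```python
# The two risk metrics the Petitioner's brief conflated, for RR = 2.0.
rr = 2.0
percent_increase = (rr - 1) * 100             # 100% greater risk, not 50%
attributable_fraction = (rr - 1) / rr * 100   # 50% of risk in the exposed
print(f"excess risk: {percent_increase:.0f}%; "
      f"attributable fraction: {attributable_fraction:.0f}%")
```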

The second bold statement is much worse, because there is no possible word choice that might make the brief a correct understanding of a confidence interval (CI).  The CI does not permit us to make a direct probability statement about the truth of any point within the interval.  Although the interval does provide some insight into the true value of the parameter, the meaning of the confidence interval must be understood operationally.  For a 95% interval, if 100 samples were taken and 100(1 – α)% confidence intervals constructed, with α set at 0.05, we would expect about 95 of the intervals to cover, or include, the true value of the parameter.  (And α is our measure of Type I error, or the probability of a false positive.)
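That operational meaning is easy to verify by simulation.  The sketch below draws repeated samples of a binomial proportion and counts how often the usual interval covers the true value (the parameters are hypothetical, and the interval is the simple Wald interval):

```python
import random
from math import sqrt

# Operational meaning of a 95% interval: over repeated samples, about 95%
# of intervals constructed this way cover the true value.  Hypothetical
# setup: a binomial proportion with true value 0.3, simple Wald intervals.
random.seed(1)
true_p, n, trials, covered = 0.3, 500, 10_000, 0
for _ in range(trials):
    x = sum(random.random() < true_p for _ in range(n))
    p_hat = x / n
    half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half_width) <= true_p <= (p_hat + half_width)
print(f"coverage: {covered / trials:.3f}")    # close to 0.95
```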

To realize how wrong the Petitioner’s brief is, consider the following example.  The observed relative risk is 10, but it is not statistically significant on a two-tailed test of significance, with α set at 0.05.  Suppose further that the two-sided 95% confidence interval around the observed relative risk is (0.9 to 18).  Matrixx Initiatives asserts:

“If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate.”

The Petitioner would thus have the Court believe that, in the example of a relative risk of 10, with the CI noted above, the result should be interpreted to mean that it is equally probable that the true value is 1.0 or less.  This is statistical silliness.

I have collected below some statements about the CI from well-known statisticians, as an aid to avoiding such distortions of statistical concepts as we see in the Matrixx briefs.


“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval …”

Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)

“Nevertheless, the difference between population means is much more likely to be near to the middle of the confidence interval than towards the extremes. Although the confidence interval is wide, the best estimate of the population difference is 6.0 mm Hg, the difference between the sample means.

* * *

“The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word “limits” suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the “limits” will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.”

Martin Gardner & Douglas Altman, “Confidence intervals rather than P values: estimation rather than hypothesis testing,” 292 Brit. Med. J. 746, 748 (1986)

“The main purpose of confidence intervals is to indicate the (im)precision of the sample study estimates as population values. Consider the following points for example: a difference of 20% between the percentages improving in two groups of 80 patients having treatments A and B was reported, with a 95% confidence interval of 6% to 34%. Firstly, a possible difference in treatment effectiveness of less than 6% or of more than 34% is not excluded by such values being outside the confidence interval – they are simply less likely than those inside the confidence interval. Secondly, the middle half of the confidence interval (13% to 27%) is more likely to contain the population value than the extreme two quarters (6% to 13% and 27% to 34%) – in fact the middle half forms a 67% confidence interval. Thirdly, regardless of the width of the confidence interval, the sample estimate is the best indicator of the population value – in this case a 20% difference in treatment response.”

Martin Gardner & Douglas Altman, “Estimating with confidence,” 296 Brit. Med. J. 1210 (1988)

“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible.”

“A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”

Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008)

“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”

Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)

Matrixx Unloaded

March 29th, 2011

In writing for a unanimous Court in Matrixx Initiatives, Inc. v. Siracusano, Justice Sotomayor wandered far afield from the world of pleading rules to flyblow the world of expert witness jurisprudence.  How and why did this happen?  Why did Matrixx invoke the concept of statistical significance to counter case reports of adverse events? Did Matrixx oversell its scientific position, thereby handing Justice Sotomayor an opportunity to unravel decades of evolution of law on the admissibility of expert witness opinion testimony?  Inquiring minds want to know.

Still, whatever the occasion for the obiter dicta, the Court’s pronouncements on expert witnesses are stunning for their irrelevance and questionable scholarship:

“We note that courts frequently permit expert testimony on causation based on evidence other than statistical significance. See, e.g., Best v. Lowe’s Home Centers, Inc., 563 F. 3d 171, 178 (6th Cir. 2009); Westberry v. Gislaved Gummi AB, 178 F. 3d 257, 263–264 (4th Cir. 1999) (citing cases); Wells v. Ortho Pharmaceutical Corp., 788 F. 2d 741, 744–745 (11th Cir. 1986). We need not consider whether the expert testimony was properly admitted in those cases, and we do not attempt to define here what constitutes reliable evidence of causation.”

Id. at 12.  What is remarkable about this passage is that the first two cases cited involved differential etiology or diagnosis to assess specific causation, not general causation.  As most courts have recognized, this assessment strategy requires that general causation has already been established. See, e.g., Hall v. Baxter Healthcare, 947 F. Supp. 1387 (D. Ore. 1996).

The citation to the third case, Wells, is noteworthy because the case has nothing to do with adverse event reports or statistical significance.  Wells involved a claim of birth defects caused by the use of a spermicidal contraceptive jelly, which had been the subject of several studies, at least one of which yielded a statistically significant increase in detected birth defects over what was expected.  Wells v. Ortho Pharmaceutical Corp., 615 F. Supp. 262 (N.D. Ga. 1985), aff’d and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986).  Wells could thus hardly be an example of a case in which there was a judgment of causation based upon a scientific study that lacked statistical significance in its findings.  Of course, finding statistical significance is just the beginning of assessing the causality of an association; Wells was notorious for its poor assessment of all the determinants of scientific causation.

The citation to Wells is thus remarkable because the Wells decision was rightly and widely criticized for its failure to evaluate the entire evidentiary display, as well as for its failure to rule out bias and confounding in the studies relied upon by the plaintiff.  See, e.g., James L. Mills and Duane Alexander, “Teratogens and ‘Litogens’,” 315 New Engl. J. Med. 1234 (1986); Samuel R. Gross, “Expert Evidence,” 1991 Wis. L. Rev. 1113, 1121-24 (1991) (“Unfortunately, Judge Shoob’s decision is absolutely wrong. There is no scientifically credible evidence that Ortho-Gynol Contraceptive Jelly ever causes birth defects.”).  See also Editorial, “Federal Judges v. Science,” N.Y. Times, December 27, 1986, at A22 (unsigned editorial); David E. Bernstein, “Junk Science in the Courtroom,” Wall St. J. at A15 (Mar. 24, 1993) (pointing to Wells as a prominent example of how the federal judiciary had embarrassed the American judicial system with its careless, non-evidence-based approach to scientific evidence).  A few years later, another case in the same judicial district against the same defendant for the same product resulted in the grant of summary judgment.  Smith v. Ortho Pharmaceutical Corp., 770 F. Supp. 1561 (N.D. Ga. 1991) (supposedly distinguishing Wells on the basis of more recent studies).

Perhaps the most remarkable aspect of the Court’s citation to Wells is that the case, and all it stands for, was overruled sub silentio by the Supreme Court’s own decisions in Daubert, Joiner, Kumho Tire, and Weisgram.  And if that did not kill the concept, then there was the simple matter of a supervening statute: the 2000 amendment of Rule 702 of the Federal Rules of Evidence.

Citing a case as jurisprudentially dead and discredited as Wells could have been sloppy scholarship and lawyering.  The principle of charity, however, suggests it was purposeful, and that is a frightful prospect.

Courts and Commentators on the Use of Relative Risks to Infer Specific Causation

March 18th, 2011

Below, I have collected some of the case law and commentary on the issue of using relative and attributable risks to satisfy plaintiff’s burden of showing, more likely than not, that an exposure or condition caused his or her disease or injury.
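Before the cases, the underlying logic in a short sketch.  Under the usual assumptions (a valid relative risk, free of bias and confounding, that applies to the individual plaintiff), the probability that an exposed person’s disease is attributable to the exposure is (RR − 1)/RR, which first exceeds 50% when RR exceeds 2.0:

```python
# Attributable probability among the exposed: (RR - 1) / RR, which first
# exceeds 50% when the relative risk exceeds 2.0.  This assumes a valid
# relative risk -- free of bias and confounding -- that applies to the
# individual plaintiff.
for rr in (1.5, 2.0, 3.0, 10.0):
    print(f"RR = {rr:4.1f} -> P(exposure caused this case) = {(rr - 1) / rr:.0%}")
```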


Radiation

Johnston v. United States, 597 F. Supp. 374, 412, 425-26 (D. Kan. 1984)

Allen v. United States, 588 F. Supp. 247 (D. Utah 1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987)

In re TMI Litig., 193 F.3d 613, 629 (3d Cir. 1999) (rejecting the trial court’s “doubling dose” analysis), amended, 199 F.3d 158 (3d Cir. 2000)

In re Hanford Nuclear Reservation Litig., 1998 WL 775340, at *8 (E.D.Wash. Aug. 21, 1998), rev’d, 292 F.3d 1124, 1136-37 (9th Cir. 2002)


Swine Flu- GBS Cases

Cook v. United States, 545 F. Supp. 306, 308 (N.D. Cal. 1982)(“Whenever the relative risk to vaccinated persons is greater than two times the risk to unvaccinated persons, there is a greater than 50% chance that a given GBS case among vaccinees of that latency period is attributable to vaccination, thus sustaining plaintiff’s burden of proof on causation.”)

Padgett v. United States, 553 F. Supp. 794, 800 – 01 (W.D. Tex. 1982) (“From the relative risk, we can calculate the probability that a given case of GBS was caused by vaccination. . . . [A] relative risk of 2 or greater would indicate that it was more likely than not that vaccination caused a case of GBS.”);

Manko v. United States, 636 F. Supp. 1419, 1434 (W.D. Mo. 1986)(relative risk of 2, or less, means exposure not the probable cause of disease claimed), aff’d in relevant part, 830 F.2d 831 (8th Cir. 1987)


IUD Cases – Pelvic Inflammatory Disease

Marder v. G.D. Searle & Co., 630 F. Supp. 1087, 1092 (D. Md. 1986) (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”), aff’d mem. on other grounds sub nom. Wheelahan v. G.D. Searle & Co., 814 F.2d 655 (4th Cir. 1987) (per curiam)


Bendectin cases

Lynch v. Merrell-National Laboratories, 646 F. Supp. 856 (D. Mass. 1986) (granting summary judgment), aff’d, 830 F.2d 1190, 1197 (1st Cir. 1987) (distinguishing between chances that “somewhat favor” plaintiff and plaintiff’s burden of showing specific causation by “preponderant evidence”)

DeLuca v. Merrell Dow Pharm., Inc., 911 F.2d 941, 958-59 (3d Cir. 1990)

Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1321 (9th Cir. 1995)(“Daubert II”)(holding that for epidemiological testimony to be admissible to prove specific causation, there must have been a relative risk for the plaintiff of greater than 2) (“For an epidemiological study to show causation under a preponderance standard . . . the study must show that children whose mothers took Bendectin are more than twice as likely to develop limb reduction birth defects as children whose mothers did not.”), cert. denied, 516 U.S. 869 (1995)

DePyper v. Navarro, 1995 WL 788828 (Mich. Cir. Ct. Nov. 27, 1995)

Oxendine v. Merrell Dow Pharm., Inc., 1996 WL 680992 (D.C. Super. Ct. Oct. 24, 1996)

Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 716 (Tex. 1997) (holding, in accord with the weight of judicial authority, “that the requirement of a more than 50% probability means that epidemiological evidence must show that the risk of an injury or condition in the exposed population was more than double the risk in the unexposed or control population”); id. at 719 (rejecting isolated statistically significant associations when not consistently found among studies)


Silicone Cases

Hall v. Baxter Healthcare, 947 F.Supp. 1387, 1392, 1397, 1403-04 (D. Or. 1996)(discussing relative risk of 2.0)

Pick v. American Medical Systems, Inc., 958 F. Supp. 1151, 1160 (E.D.La. 1997) (noting, in penile implant case, that “any” increased risk suggests that the exposure “may” have played some causal role)

In re Breast Implant Litigation, 11 F. Supp. 2d 1217, 1226 -27 (D. Colo. 1998)(relative risk of 2.0 or less shows that the background risk is at least as likely to have given rise to the alleged injury)

Barrow v. Bristol-Myers Squibb Co., 1998 WL 812318, at *23 (M.D. Fla. Oct. 29, 1998)

Allison v. McGhan Med. Corp., 184 F.3d 1300, 1315 n.16, 1316 (11th Cir. 1999)(affirming exclusion of expert testimony based upon a study with a risk ratio of 1.24; noting that a statistically significant epidemiological study reporting a 1.24-fold increased risk of a disease marker in patients with breast implants was so close to 1.0 that it “was not worth serious consideration for proving causation”; the threshold for concluding that an agent more likely than not caused a disease is 2.0, citing Federal Judicial Center, Reference Manual on Scientific Evidence 168-69 (1994))

Grant v. Bristol-Myers Squibb, 97 F. Supp. 2d 986, 992 (D. Ariz. 2000)

Pozefsky v. Baxter Healthcare Corp., No. 92-CV-0314, 2001 WL 967608, at *3 (N.D.N.Y. August 16, 2001) (excluding causation opinion testimony given contrary epidemiologic studies; noting that sufficient epidemiologic evidence requires relative risk greater than two)

In re Silicone Gel Breast Implant Litig., 318 F. Supp. 2d 879, 893 (C.D. Cal. 2004)

Norris v. Baxter Healthcare Corp., 397 F.3d 878 (10th Cir. 2005) (discussing but not deciding specific causation and the need for relative risk greater than two; no reliable showing of general causation)

Minnesota Mining and Manufacturing v. Atterbury, 978 S.W.2d 183, 198 (Tex.App. – Texarkana 1998) (noting that “[t]here is no requirement in a toxic tort case that a party must have reliable evidence of a relative risk of 2.0 or greater”)


Asbestos

Washington v. Armstrong World Indus., Inc., 839 F.2d 1121 (5th Cir. 1988)(affirming grant of summary judgment on grounds that there was insufficient evidence that plaintiff’s colon cancer was caused by asbestos)

Lee v. Johns Manville Corp., slip op. at 3, Phila. Cty. Ct. C.P., Sept. Term 1978, No. 88 (123) (Oct. 26, 1983) (Forer, J.)(entering verdict in favor of defendants on grounds that plaintiff had failed to show that his colorectal cancer had been caused by asbestos exposure after adducing evidence of a relative risk less than two)

Primavera v. Celotex Corp., Phila. Cty. Ct. C.P., December Term, 1981, No. 1283 (bench op. of Hon. Berel Caesar) (Nov. 2, 1988) (granting compulsory nonsuit on the plaintiff’s claim that his colorectal cancer was caused by his occupational exposure to asbestos)

Grassis v. Johns-Manville Corp., 248 N.J.Super. 446, 455-56, 591 A.2d 671, 676 (App. Div. 1991)

Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992)

Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992)

In re Joint E. & S. Dist. Asbestos Litig., 758 F. Supp. 199 (S.D.N.Y. 1991), rev’d sub nom. Maiorano v. Owens Corning Corp., 964 F.2d 92 (2d Cir. 1992)

Maiorana v. National Gypsum, 827 F. Supp. 1014, 1043 (S.D.N.Y. 1993), aff’d in part and rev’d in part, 52 F.3d 1122, 1134 (2d Cir. 1995)

Jones v. Owens-Corning Fiberglas Corp., 288 N.J. Super. 258, 266, 672 A.2d 230, 235 (App. Div. 1996)

Keene Corp. v. Hall, 626 A.2d 997 (Md. Ct. Spec. App. 1993)(laryngeal cancer)

In re W.R. Grace & Co., 355 B.R. 462, 483 (Bankr. D. Del. 2006) (requiring showing of relative risk greater than two to support property damage claims based on unreasonable risks from asbestos insulation products).


Pharmaceutical Cases

Ambrosini v. Upjohn, 1995 WL 637650, at *4 (D.D.C. 1995)

Ambrosini v. Labarraque, 101 F.3d 129, 135 (D.C. Cir. 1996)(Depo-Provera, birth defects)

Miller v. Pfizer, 196 F. Supp. 2d 1062, 1079 (D. Kan. 2002) (acknowledging that most courts require a showing of RR > 2, but questioning their reasoning), aff’d, 356 F.3d 1326 (10th Cir. 2004)

Smith v. Wyeth-Ayerst Laboratories Co., 278 F. Supp. 2d 684, 691 (W.D.N.C. 2003) (recognizing that risk and cause are distinct concepts: “Epidemiologic data that shows a risk cannot support an inference of cause unless (1) the data are statistically significant according to scientific standards used for evaluating such associations; (2) the relative risk is sufficiently strong to support an inference of ‘more likely than not’; and (3) the epidemiologic data fits the plaintiff’s case in terms of exposure, latency, and other relevant variables.”)

Burton v. Wyeth-Ayerst Laboratories, 513 F. Supp. 2d 719 (N.D. Tex. 2007)

In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1172 (N.D. Cal. 2007)(observing that epidemiologic studies “can also be probative of specific causation, but only if the relative risk is greater than 2.0, that is, the product more than doubles the risk of getting the disease”)

In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1078 (D. Minn. 2008)(noting that some, but not all, courts have concluded that relative risks under two support finding an expert witness’s opinion inadmissible).


Toxic Tort Cases

In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 785, 836 (E.D.N.Y. 1984) (“A government administrative agency may regulate or prohibit the use of toxic substances through rulemaking, despite a very low probability of any causal relationship.  A court, in contrast, must observe the tort law requirement that a plaintiff establish a probability of more than 50% that the defendant’s action injured him. … This means that at least a two-fold increase in incidence of the disease attributable to Agent Orange exposure is required to permit recovery if epidemiological studies alone are relied upon.”), aff’d 818 F.2d 145, 150-51 (2d Cir. 1987)(approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 487 U.S. 1234 (1988)

Sanderson v. Int’l Flavors & Fragrances, Inc., 950 F. Supp. 981, 998 n.17, 999-1000, 1004 (C.D. Cal. 1996) (more than a doubling of risk is required in case involving aldehyde exposure and claimed multiple chemical sensitivities)

Wright v. Willamette Indus., Inc., 91 F.3d 1105 (8th Cir. 1996)(“Actions in tort for damages focus on the question of whether to transfer money from one individual to another, and under common-law principles (like the ones that Arkansas law recognizes) that transfer can take place only if one individual proves, among other things, that it is more likely than not that another individual has caused him or her harm.  It is therefore not enough for a plaintiff to show that a certain chemical agent sometimes causes the kind of harm that he or she is complaining of.  At a minimum, we think that there must be evidence from which the factfinder can conclude that the plaintiff was exposed to levels of that agent that are known to cause the kind of harm that the plaintiff claims to have suffered. See Abuan v. General Elec. Co., 3 F.3d at 333.  We do not require a mathematically precise table equating levels of exposure with levels of harm, but there must be evidence from which a reasonable person could conclude that a defendant’s emission has probably caused a particular plaintiff the kind of harm of which he or she complains before there can be a recovery.”)

McDaniel v. CSX Transp., Inc., 955 S.W.2d 257, 264 (Tenn. 1997) (doubling of risk is relevant but not required as a matter of law)

Lofgren v. Motorola, 1998 WL 299925 *14 (Ariz. Super. 1998) (TCE, cancer)

Berry v. CSX Transp., Inc., 709 So. 2d 552 (Fla. Dist. Ct. App. 1998)(solvents, toxic encephalopathy)

Bartley v. Euclid, Inc., 158 F.3d 261 (5th Cir. 1998)

Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 591-92 (D.N.J. 2002) (“the threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0”), aff’d, 68 F. App’x 356 (3d Cir. 2003)

Ferguson v. Riverside School Dist. No. 416, 2002 WL 34355958 (E.D. Wash. Feb. 6, 2002)(No. CS-00-0097-FVS)

Daniels v. Lyondell-Citgo Refining Co., 99 S.W.3d 722, 727 (Tex. App. – Houston [1st Dist.] 2003)

Graham v Lautrec Ltd., 2003 WL 23512133 (Mich. Cir. Ct., July 24, 2003)

Theofanis v. Sarrafi, 791 N.E.2d 38, 48 (Ill. App. 2003)(reversing and granting new trial to plaintiff who received an award of no damages when experts testified that relative risk was between 2.0 and 3.0)(“where the risk with the negligent act is at least twice as great as the risk in the absence of negligence, the evidence supports a finding that, more likely than not, the negligence in fact caused the harm”).

Cano v. Everest Minerals Corp., 362 F. Supp. 2d 814, 846 (W.D. Tex. 2005)(relative risk less than 3.0 represents only a weak association)

Mobil Oil Corp. v. Bailey, 187 S.W.3d 263, 268 (Tex. App. – Beaumont 2006)

Cook v. Rockwell Internat’l Corp., 580 F. Supp. 2d 1071, 1088-89 (D. Colo. 2006)

In re Lockheed Litig. Cases, 115 Cal. App. 4th 558 (2004), rev’d in part, 23 Cal. Rptr. 3d 762, 765 (Cal. App. 2d Dist. 2005), cert. dismissed, 192 P.3d 403 (Cal. 2007)

Watts v. Radiator Specialty Co., 990 So. 2d 143 (Miss. 2008)(“The threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0.”)

Henricksen v. Conocophillips Co., 605 F. Supp. 2d 1142, 1158 (E.D. Wash. 2009) (noting that under Circuit precedent, epidemiologic studies showing low-level risk may suffice to show general causation, but are sufficient to show specific causation only if the relative risk exceeds two) (excluding plaintiff’s expert witness’s testimony because the epidemiologic evidence is “contradictory and inconsistent”)

George v. Vermont League of Cities and Towns, 2010 VT 1, 993 A.2d 367, 375 (2010)

City of San Antonio v. Pollock, 284 S.W.3d 809, 818 (Tex. 2009) (holding testimony admitted insufficient as matter of law).


ACADEMIC COMMENTATORS

Michael Dore, “A Commentary on the Use of Epidemiological Evidence in Demonstrating Cause-in-Fact,” 7 Harv. Envt’l L. Rev. 429, 431-40 (1983)

Bert Black & David E. Lilienfeld, “Epidemiologic Proof in Toxic Tort Litigation,” 52 Fordham L. Rev. 732, 767-69 (1984)

David E. Lilienfeld & Bert Black, “The Epidemiologist in Court,” 123 Am. J. Epidemiology 961, 963 (1986)(a relative risk of 1.5 allows an inference of attributable risk of 33%, which means any individual case is less likely than not to be causally related)

Powell, “How to Tell the Truth With Statistics: A New Statistical Approach to Analyzing the Bendectin Epidemiological Data in the Aftermath of Daubert v. Merrell Dow Pharmaceuticals,” 31 Houston L. Rev. 1241, 1310 (1994) (“The plaintiff who wishes to reach the jury on the issue of causation must submit a statistical analysis indicating that exposure to the drug in question more likely than not caused the birth defects in question.  To support a finding of causation, the meta-analysis summary odds ratio must exceed two.”)

Linda Bailey, et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence at 121, 168-69 (Federal Judicial Ctr. 1st ed. 1994) (“The threshold for concluding that an agent was more likely the cause of a disease than not is a relative risk greater than 2.0 … .  A relative risk greater than 2.0 would permit an inference that an individual plaintiff’s disease was more likely than not caused by the implicated agent.”)

Ben Armstrong & Gilles Theriault, “Compensating Lung Cancer Patients Occupationally Exposed to Coal Tar Pitch Volatiles,” 53 Occup. Envt’l Med. 160 (1996)

Philip E. Enterline, “Toxic Torts:  Are They Poisoning Scientific Literature?” 30 Am. J. Indus. Med. 121 (1996)

Joseph V. Rodricks & Susan H. Rieth, “Toxicological Risk Assessment in the Court:  Are Available Methodologies Suitable for Evaluating Toxic Tort and Product Liability Claims?,” 27 Reg. Toxicol. & Pharmacol. 21, 25-30 (1998)

Michael Green et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 333, 381, 383 (Federal Judicial Center ed., 2d ed. 2000), available at http://www.fjc.gov (“[E]pidemiology addresses whether an agent can cause a disease, not whether an agent did cause a specific plaintiff’s disease.  * * *  Nevertheless, the specific causation issue is a necessary legal element in a toxic substance case. The plaintiff must establish not only that the defendant’s agent is capable of causing disease but also that it did cause the plaintiff’s disease.  Thus, a number of courts have confronted the legal question of what is acceptable proof of specific causation and the role that epidemiologic evidence plays in answering that question. This question is not a question that is addressed by epidemiology. Rather, it is a legal question a number of courts have grappled with.”) (“[t]he civil burden of proof is described most often as requiring the fact finder to believe that what is sought to be proved is more likely true than not true. The relative risk from epidemiologic studies can be adapted to this 50% plus standard to yield a probability or likelihood that an agent caused an individual’s disease.”)

David W. Barnes, “Too Many Probabilities:  Statistical Evidence of Tort Causation,” 64 Law and Contemp. Problems 191, 206 (2001) (criticizing the uncritical use of a relative risk greater than two as signifying the probability of specific causation, but acknowledging that sometimes a credible, precise RR greater than 1.0 will be too small to support specific causation, such as the RR of 1.24 seen in the Allison case)

Russellyn S. Carruth & Bernard D. Goldstein, “Relative Risk Greater than Two in Proof of Causation in Toxic Tort Litigation,” 41 Jurimetrics 195 (2001) (criticizing the use of a relative risk of two benchmark, but acknowledging that when a disease has multiple causes and a substantial base rate in the general population, “there is no objective means to determine if a particular person’s disease was caused by some other environmental exposure, or by a non-environmental cause.”)

Richard W. Clapp & David Ozonoff, “Environment and Health:  Vital Intersection or Contested Territory?” 36 Am. J. L. & Med. 189, 210 (2004) (incorrectly describing the meaning of a confidence interval:  “A relative risk of 1.8, with confidence interval of 1.3 to 2.9 could very likely represent a true relative risk greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”)

Erica Beecher-Monas, Evaluating Scientific Evidence 58, 67 (N.Y. 2007)(“No matter how persuasive epidemiological or toxicological studies may be, they could not show individual causation, although they might enable a (probabilistic) judgment about the association of a particular chemical exposure to human disease in general.”)(“While significance testing characterizes the probability that the relative risk would be the same as found in the study as if the results were due to chance, a relative risk of 2 is the threshold for a greater than 50 percent chance that the effect was caused by the agent in question.”)(incorrectly describing significance probability as a point probability as opposed to tail probabilities)

Andrew W. Jurs, “Daubert, Probabilities and Possibilities and the Ohio Solution:  A Sensible Approach to Relevance Under Rule 702 in Civil and Criminal Applications,” 41 Akron L. Rev. 609, 637 (2008)(acknowledging that relative risks less than 2.0 invite jury speculation about individual, specific causation)

Relative Risks and Individual Causal Attribution Using Risk Size

March 18th, 2011

The relative risk argument is simple.  A relative risk of 1.0 means that the rate of disease incidence or mortality is the same among the exposed and control populations.  A relative risk of 2.0 means that the incidence rate in the exposed population is twice that in the controls.  The existence of an observed rate among the non-exposed controls suggests that we are dealing with a disease of “ordinary life,” for which there is an expected rate of occurrence.  Most chronic diseases, such as cancer, autoimmune disease, and cardiovascular disease, fall into this category of diseases of ordinary life.

If a study of a disease that is prevalent in the general population, say colon cancer, is conducted in an exposed cohort of workers, say asbestos insulators, and the study finds a relative risk of 1.5, we would have to take several steps to assess the finding’s relevance in litigation.  First, this positive association would have to be evaluated for causality.  Bias and confounding would have to be ruled out as explanations for the apparent increase in risk.  Furthermore, the association would have to be evaluated for various indicia of causality, such as consistency with other studies, a dose-response relationship between exposure and outcome, biological plausibility and coherence, and support from experimental studies.  In the case of asbestos and colon cancer, the causal hypothesis has repeatedly failed to be supported by such evaluations, but even if we were to assume general causation, arguendo, we would be left without a way to infer causation in a given case.  If plaintiff supported his case with evidence of a relative risk of 1.5, we would have 50% more observed cases than expected.  So if the observed population was expected to experience 100 colon cancer cases over the observation period, a relative risk of 1.5 means that 150 such cases were observed:  100 expected cases and 50 putative excess cases.  Alas, there is no principled way to tell an excess case from an expected case, and the odds favor the defense two to one that any given case arose from the expected population as opposed to the excess group.  Stated as a probability, the chance that plaintiff’s case arose from the excess portion is 33%, well below what is needed to support a sustainable claim.  Again, this assumes many facts in plaintiff’s favor, such as a perfect epidemiologic study, without bias or confounding, and with consistency among the findings of similar studies.  (None of these assumptions is even close to satisfied for asbestos and colon cancer.)
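
The arithmetic is simple enough to put into a few lines of code.  The following Python fragment is a minimal sketch, using the hypothetical numbers above, and assuming the simple model on which excess cases are evenly mixed with, and indistinguishable from, expected cases:

```python
def probability_of_causation(relative_risk):
    """Excess fraction implied by a relative risk, on the simple model
    that excess cases are evenly mixed with expected cases: (RR - 1) / RR."""
    if relative_risk <= 1.0:
        return 0.0
    return (relative_risk - 1.0) / relative_risk

expected_cases = 100                              # hypothetical expected cases
relative_risk = 1.5                               # hypothetical study result
observed_cases = expected_cases * relative_risk   # 150 observed cases
excess_cases = observed_cases - expected_cases    # 50 putative excess cases

print(f"Observed: {observed_cases:.0f}; excess: {excess_cases:.0f}")
print(f"Probability of causation: {probability_of_causation(relative_risk):.0%}")  # 33%
```

On this model, the relative risk must exceed 2.0 before the excess fraction crosses the 50% line that the preponderance standard demands.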

In the Agent Orange litigation, Judge Weinstein implicitly recognized that very large relative risks suggest that an individual case was likely to have been related to its antecedent risks, while small relative risks suggest that any inference of specific causation from the antecedent risk is largely speculative, in the absence of some reliable marker of exposure-related causation.  See In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 785, 817 (E.D.N.Y. 1984)(plaintiffs must prove at least a two-fold increase in rate of disease allegedly caused by the exposure), aff’d, 818 F.2d 145, 150-51 (2d Cir. 1987)(approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988); see also In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1240, 1262 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

Ever since Judge Weinstein embraced the relative risk of two as an important benchmark to be exceeded if plaintiffs hoped to show specific causation, scientists who practice medicine for the redistribution of wealth have attacked the concept.  The challengers have urged that small relative risks, including relative risks of two or less, could suffice to support causal attribution in a given case, especially in the presence of relevant clinical findings.  The challengers, however, have been vague and evasive when it comes to identifying the relevant clinical findings and explaining how they show that the risk actually operated as part of the causal pathway that led to the individual’s injury or disease.

Among the most vociferous of the challengers has been Professor Sander Greenland, of the University of California Los Angeles School of Public Health.  Greenland has published his criticisms of the inference of a probability of individual causation from the relative risk on many occasions.  See, e.g., Sander Greenland & James Robins, “Conceptual Problems in the Definition and Interpretation of Attributable Fractions,” 128 Am. J. Epidem. 1185 (1988); James Robins & Sander Greenland, “The Probability of Causation Under a Stochastic Model for Individual Risk,” 45 Biometrics 1125 (1989); James Robins & Sander Greenland, “Estimability and Estimation of Excess and Etiologic Fractions,” 8 Statistics in Medicine 845 (1989); James Robins & Sander Greenland, “Estimability and Estimation of Expected Years of Life Lost Due to a Hazardous Exposure,” 10 Statistics in Medicine 79 (1991); Jan Beyea & Sander Greenland, “The Importance of Specifying the Underlying Biologic Model in Estimating the Probability of Causation,” 76 Health Physics 269 (1999); Sander Greenland, “Relation of Probability of Causation to Relative Risk and Doubling Dose:  A Methodologic Error That Has Become a Social Problem,” 89 Am. J. Pub. Health 1166 (1999); Sander Greenland & James Robins, “Epidemiology, Justice, and the Probability of Causation,” 40 Jurimetrics 321 (2000).

Greenland’s criticisms turn on various assumptions, such as that the risk may not be evenly distributed within the sampled population, or that the causal mechanism may accelerate the onset of disease in a way that leaves the relative risk unchanged in the study under consideration.  Greenland is correct that it is important to have a clear causal model in mind when evaluating the possibility of causal attributions in the light of population studies and their measures of relative risk.  He is also correct that his clever assumptions, if true, could affect the reasonableness of claiming that a relative risk of two or less supports the defense position in many toxic tort cases.  Unfortunately, Greenland’s clever assumptions and arguments prove too much, because in many, if not most, cases the causal model is not defined.  There is often no evidence to support the plaintiffs’ claims of acceleration, or of sequestration of risk within the sampled population, and certainly no basis for claiming that the plaintiff belongs to a subset of “vulnerable” exposed persons with a higher-than-average risk that is reflected in the study relative risk.  Without evidence to support Greenland’s various assumptions, even relative risks higher than 2.0, say in the range of 2.0 to 20.0, would be unhelpful in supporting a plaintiff’s case.  We would be thrown back to the early case law that held that risk can never support individual attributions, and Judge Weinstein’s rather pragmatic pronouncement in Agent Orange would be thrown aside, to the benefit of defendants in toxic tort cases.
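
The force of one of Greenland’s assumptions can be seen in a toy model of risk heterogeneity (my own illustrative construction, not an example taken from his papers).  If a study-wide relative risk of 1.5 were generated entirely by a susceptible ten percent of the exposed population, the subgroup’s relative risk, and with it the probability of causation for a subgroup member, would be far higher than the study-wide figure suggests:

```python
def subgroup_relative_risk(overall_rr, susceptible_fraction):
    """Solve overall_RR = f * RR_s + (1 - f) * 1.0 for RR_s, assuming all
    excess risk is concentrated in a susceptible subgroup of size f."""
    f = susceptible_fraction
    return (overall_rr - (1.0 - f)) / f

overall_rr = 1.5   # hypothetical study-wide relative risk
f = 0.10           # hypothetical susceptible fraction of the exposed group

rr_s = subgroup_relative_risk(overall_rr, f)   # 6.0
pc_s = (rr_s - 1.0) / rr_s                     # about 0.83

print(f"Susceptible subgroup relative risk: {rr_s:.1f}")
print(f"Probability of causation for a susceptible member: {pc_s:.0%}")
```

The rub is evidentiary:  without proof that the plaintiff belongs to such a subgroup, or that the subgroup exists at all, the calculation is sheer speculation.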

Last year, the Vermont Supreme Court reaffirmed the continuing vitality of the relative risk argument, on the original pragmatic justification offered by Judge Weinstein in the Agent Orange cases.  George v. Vermont League of Cities and Towns, 2010 VT 1, 993 A.2d 367 (Vt. 2010).  Indeed, George may well have been one of the best, and least heralded, decisions of 2010.

Mr. George had been a fireman before he died of non-Hodgkin’s lymphoma (NHL).  In administrative workers’ compensation proceedings, the Commissioner ruled that the widow had failed to show a causal connection between firefighting and NHL, although there was an “association.”  His widow appealed the denial of benefits.  On de novo review, the trial court excluded plaintiff’s expert witnesses on Rule 702 grounds.  (Vermont law follows federal law in requiring relevance and reliability of expert witnesses’ opinions.)  The case ended up before the Vermont Supreme Court, which had to review the trial court’s handling of the Rule 702 issues.

Several issues were in play.  The plaintiff had presented multiple expert witnesses, Drs. Tee Guidotti and James Lockey, who offered general and/or specific causation opinions on firefighting and NHL.  These witnesses relied upon epidemiologic studies, some of which had been incorporated into a meta-analysis, and upon a so-called “weight of the evidence” methodology.

The Vermont Supreme Court recognized the limits of using epidemiology to resolve the specific causation question in George. The Court found the Texas Supreme Court’s treatment of this issue to be persuasive: 

“epidemiological studies can assist in demonstrating a general association between a substance and a disease or condition, but they cannot prove that a substance actually caused a disease or condition in a particular individual.”

Id. at 374 (relying upon and quoting from Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 715 (Tex. 1997)).

The Court also quoted from, and relied upon, the pronouncement of the Federal Judicial Center’s Reference Manual, which explains that “epidemiology is concerned with the incidence of disease in populations and does not address the question of the cause of an individual’s disease.  This question, sometimes referred to as specific causation, is beyond the domain of the science of epidemiology.”  Id. at 375 (quoting from M. Green et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 333, 381 (2d ed. 2000); footnote omitted in court’s quotation of this source).

Faced with the academic and judicial criticisms of using the relative risk (sometimes referred to as “effect size”) in this way, the Court recognized the pragmatic compromise between science and the needs of the legal system that is embodied in using the relative risk of two as a benchmark showing for plaintiffs to make in toxic tort litigation:

“The trial court here adopted a relative risk factor of 2.0 as a benchmark, finding that it easily tied into Vermont’s ‘more likely than not’ civil standard and that such a benchmark was helpful in this case because the eight epidemiological studies relied upon by claimant’s experts reflected widely varying degrees of relative risk.”

Id. at 375.

“Given claimant’s burden of proof, however, and the inherent limitations of epidemiological data in addressing specific causation, the trial court reasonably found the 2.0 standard to be a helpful benchmark in evaluating the epidemiological evidence underlying Dr. Guidotti’s opinion.”

Id. at 377.

“Mindful of this balance, we conclude that the trial court did not abuse its discretion in considering a relative risk greater than 2.0 as a reasonable and helpful benchmark under the circumstances presented here.”

Id. at 378.

The Vermont Supreme Court was also clearly worried about how and why plaintiff’s expert witnesses selected some studies to include in their “weight of evidence” methodology.  Without an adequate explanation of selection and weighting criteria, the choices seemed like arbitrary “cherry picking.”  Id. at 389.  This worry is amply justified.  Weight of the evidence methodology is notoriously vague and indeterminate; unless the criteria for weighting are pre-specified and rigorously followed, claims based upon this methodology may be little more than subjective preferences.  See, e.g., Douglas L. Weed, “Weight of Evidence: A Review of Concept and Methods,” 25 Risk Analysis 1545 (2005).

In part, plaintiff’s expert witnesses also relied upon a meta-analysis of observational studies that looked at NHL risk among firefighters.  The Court was concerned about the expert witnesses’ failure to explain the selection and weighting of studies in the meta-analysis.  This criticism may amount to no more than the witnesses’ failure to explain the methodology of a published study, which in turn may have properly used an acceptable methodology to provide a summary estimate of the risk of NHL among firefighters.  The meta-analysis in question, however, appears to have found a summary risk estimate of 1.51, with a 95% confidence interval of 1.31-1.73.  G.K. LeMasters, et al., “Cancer risk among firefighters: a review and meta-analysis of 32 studies,” 48 J. Occup. Envt’l Med. 1189 (2006).  The plaintiff’s expert witnesses were thus relying upon a study that quantified the increased risk at 51%, with an upper bound, from sampling variability, of 73%.  To the extent that the plaintiff had succeeded in providing reliable evidence of increased risk, she had also succeeded in showing that a doubling, or more, of the risk for NHL was statistically unlikely.  This is hardly a propitious way to win a lawsuit.
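
The point can be made concrete with the reported numbers.  Here is a short Python sketch, using the summary estimate and confidence limits from the LeMasters meta-analysis and the simple excess-fraction model discussed above (an illustration only, not a substitute for the study itself):

```python
def excess_fraction(rr):
    """(RR - 1) / RR on the simple model that excess cases are evenly
    mixed with expected cases; zero when RR does not exceed 1.0."""
    return max(0.0, (rr - 1.0) / rr)

# Summary relative risk and 95% confidence limits from LeMasters et al. (2006)
estimates = [("point estimate", 1.51),
             ("lower 95% bound", 1.31),
             ("upper 95% bound", 1.73)]

for label, rr in estimates:
    print(f"{label}: RR = {rr:.2f} -> excess fraction = {excess_fraction(rr):.0%}")
# Even at the upper confidence bound, the excess fraction (~42%) falls short of 50%.
```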

Risk ≠ Causation

March 12th, 2011

Evidence of risk is not evidence of causation.  It never has been; it never will be. Risk and causation are distinct concepts.  Processes, events, or exposures may be risks; that is, they may be capable of causing an outcome of interest.  Risk, however, is an ex ante concept.  We can speak of a risk only before the outcome of interest has occurred.  After its occurrence, we are interested in what caused the outcome.

Before the tremendous development of epidemiology in the decades after World War II, most negligence and products liability cases involved mechanistic conceptions of causation.  Juries and courts considered claims of causation that conceptually were framed in the manner of billiard balls hitting one another until the final billiard ball of interest went into the pocket.  Litigants and courts did not need to consider statistical evidence when deciding whether a saw dismembered a plaintiff, or even whether chronic asbestos exposure caused inflammation and scarring in the lungs of workers.  In some instances, judicial efforts to cast causation as a mechanistic process smack of quackery.  Claims that blunt trauma caused malignant tumors at the site of the trauma, within days or weeks of the impact, come to mind as an example of magical thinking that plagued courts and juries in an era that was short on scientific gatekeeping, and long on deference to clinical judgment empty of meaningful scientific support.  See, e.g., Baker v. DeRosa, 413 Pa. 164, 196 A.2d 387 (1964)(holding that the question whether a car accident caused a tumor was for the jury).

The advent of epidemiologic evidence introduced an entirely different class of claims, ones that were based upon stochastic concepts of causation.  The exposure, event, or process that was a putative cause had a probabilistic element to its operation.  The putative cause exercised its contribution to the outcome through a random process, which changed the frequency of the harmful outcome among those who encountered the exposure.  In addition, the outcome that resulted from the “putative cause” was frequently indistinguishable from the outcomes that arose spontaneously, from other causes in the environment, or from normal human aging.  Discerning which risks (or “putative causes”) operated in a given case of chronic human disease (such as cancer, cardiovascular disease, or autoimmune disease) became a key issue for courts and litigants’ expert witnesses.  The black box of epidemiology, however, sheds little or no light on the issue, and no other light source was available.

Today, expert witnesses, typically for plaintiffs, equate risk with causation.  Because risk is an ex ante concept, the inference from risk to causation is problematic.  In rare instances, the risk is absolute under the circumstances of the plaintiff’s manifestation, such that the outcome can be tied to the exposure that created the risk.  In most cases, however, there will have been other competing risks, which alone could have operated to produce the outcome of which the plaintiff complains.  In toxic tort litigation, we frequently see a multiplicity of pre-existing risks for a chronic disease that is prevalent in the entire population.  When claimants attempt to show causation for such outcomes by epidemiologic evidence, the inference of causation from a particular prior risk is typically little more than a guess.

One well-known epidemiologist explained the limits of inferences with respect to stochastic causation:

“An elementary but essential principle that epidemiologists must keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between exposure and disease.”   ****

“In a courtroom, experts are asked to opine whether the disease of a given patient has been caused by a specific exposure.  This approach of assigning causation in a single person is radically different from the epidemiologic approach, which does not attempt to attribute causation in any individual instance.  Rather, the epidemiologic approach is to evaluate the proposition that the exposure is a cause of the disease in a theoretical sense, rather than in a specific person.”

Kenneth Rothman, Epidemiology: An Introduction 44 (Oxford 2002)(emphasis added). 

Another epidemiologist, who wrote the epidemiology chapter in the Federal Judicial Center’s Reference Manual on Scientific Evidence, put the matter thus:

“Epidemiology answers questions about groups, whereas the court often requires information about individuals.”

Leon Gordis, Epidemiology (3d ed. Philadelphia 2004)(emphasis in original).  Accord G. Friedman, Primer of Epidemiology 2 (2d ed. 1980) (epidemiologic studies address causes of disease in populations, not causation in individuals); Sander Greenland, “Relation of the Probability of Causation to Relative Risk and Doubling Dose:  A Methodologic Error that Has Become a Social Problem,” 89 Am. J. Pub. Health 1166, 1168 (1999)(“[a]ll epidemiologic measures (such as rate ratios and rate fractions) reflect only the net impact of exposure on a population”); Joseph V. Rodricks & Susan H. Rieth, “Toxicological Risk Assessment in the Courtroom:  Are Available Methodologies Suitable for Evaluating Toxic Tort and Product Liability Claims?” 27 Regulatory Toxicol. & Pharmacol. 21, 24-25 (1998)(noting that a population risk applies to individuals only if all persons within the population are the same with respect to the influence of the risk on outcome).

These cautionary notes are important reminders of the limits of epidemiologic method.  What these authors miss is that there may be no other principled way to connect one pre-existing risk, among several, to an outcome that is claimed to be tortious.  As the young, laconic Wittgenstein wrote: 

“Wovon man nicht sprechen kann, darüber muß man schweigen.” 

L. Wittgenstein, Tractatus Logico-Philosophicus, Proposition 7 (1921)(translated by Ogden as “Whereof one cannot speak, thereof one must be silent”).  Unfortunately, expert witnesses in legal proceedings sometimes do not feel the normative force of Wittgenstein’s Proposition 7, and they speak without restraint.  As a contemporary philosopher explained in a more accessible idiom,

“Bullshit is unavoidable whenever circumstances require someone to talk without knowing what he is talking about.  Thus the production of bullshit is stimulated whenever a person’s obligations or opportunities to speak about some topic exceed his knowledge of the facts that are relevant to that topic.”

Harry Frankfurt, On Bullshit 63 (Princeton University Press 2005).

Judicial Innumeracy and the MDL Process

February 26th, 2011

In writing previously about the Avandia MDL Court’s handling of the defendants’ Daubert motion, I noted the trial court’s erroneous interpretation of statistical evidence.  See “Learning to Embrace Flawed Evidence – The Avandia MDL’s Daubert Opinion” (Jan. 10, 2011).  In fact, the Avandia court badly misinterpreted the meaning of a p-value, a basic concept in statistics:

“The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance).”

In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576, at *12 (E.D. Pa. 2011) (internal citation omitted).  The Avandia MDL court was not, however, the first to commit this howler.  Professor David Kaye collected examples of statistical blunders from published cases in a 1986 law review article, and again, in his chapter on statistical evidence in the Federal Judicial Center’s Reference Manual on Scientific Evidence, he compiled a list of erroneous interpretations:

United States v. Georgia Power Co., 474 F.2d 906, 915 (5th Cir. 1973)

National Lime Ass’n v. EPA, 627 F.2d 416, 453 (D.C. Cir. 1980)

Rivera v. City of Wichita Falls, 665 F.2d 531, 545 n.22 (5th Cir. 1982) (“A variation of two standard deviations would indicate that the probability of the observed outcome occurring purely by chance would be approximately five out of 100; that is, it could be said with a 95% certainty that the outcome was not merely a fluke.”);

Vuyanich v. Republic Nat’l Bank, 505 F. Supp. 224, 272 (N.D. Tex. 1980) (“[I]f a 5% level of significance is used, a sufficiently large t-statistic for the coefficient indicates that the chances are less than one in 20 that the true coefficient is actually zero.”), vacated, 723 F.2d 1195 (5th Cir. 1984)

Craik v. Minnesota State Univ. Bd., 731 F.2d 465, 476 n.13 (8th Cir. 1984)(“[a] finding that a disparity is statistically significant at the 0.05 or 0.01 level means that there is a 5 percent or 1 percent probability, respectively, that the disparity is due to chance.”).  See also id. at 510 (Swygert, J., dissenting)(stating that coefficients were statistically significant at the 1% level, allowing him to say that “we can be 99% confident that each was different from zero.”)

Sheehan v. Daily Racing Form, Inc., 104 F.3d 940, 941 (7th Cir. 1997) (“An affidavit by a statistician . . . states that the probability that the retentions . . . are uncorrelated with age is less than 5 percent.”)

Waisome v. Port Authority, 948 F.2d 1370, 1376 (2d Cir. 1991) (“Social scientists consider a finding of two standard deviations significant, meaning there is about one chance in 20 that the explanation for a deviation could be random . . . .”)

David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” in Reference Manual on Scientific Evidence 83, 122-24 (2d ed. 2000); David H. Kaye, “Is Proof of Statistical Significance Relevant?” 61 Wash. L. Rev. 1333, 1347 (1986)(pointing out that before 1970, there were virtually no references to “statistical significance” or p-values in reported state or federal cases).
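
What all of these misstatements share is the transposition fallacy.  A p-value is computed on the assumption that the null hypothesis is true; it therefore cannot tell us the probability that the null hypothesis is true, or that chance produced the data.  A short simulation (a sketch of my own, not drawn from any of the cited opinions) shows what the 5% figure actually describes, namely, how often chance alone crosses the significance threshold when there is no true difference at all:

```python
import random

def false_alarm_rate(n_trials=10_000, n_per_group=50, z_cutoff=1.96):
    """Simulate two groups with NO true difference, and count how often
    the z-statistic for the difference in means exceeds the 5% cutoff."""
    false_alarms = 0
    for _ in range(n_trials):
        a = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        se = (2.0 / n_per_group) ** 0.5   # standard error; known unit variances
        if abs(diff / se) > z_cutoff:
            false_alarms += 1
    return false_alarms / n_trials

print(false_alarm_rate())   # about 0.05: chance alone "succeeds" 5% of the time
```

On this understanding, the Avandia court’s gloss, that p = .08 means a 92% likelihood that the difference was not the result of chance, gets the conditional probability exactly backwards.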

Notwithstanding the educational efforts of the Federal Judicial Center, the innumeracy continues, and with the ascent of the MDL model for addressing mass torts, many recent howlers have come from trial judges given responsibility for overseeing the pretrial coordination of thousands of lawsuits.  In addition to the Avandia MDL Court, here are some other recent erroneous statements that can be added to Professor Kaye’s lists: 

“Scientific convention defines statistical significance as “P ≤ .05,” i.e., no more than one chance in twenty of finding a false association due to sampling error.  Plaintiffs, however, need only prove that causation is more-probable-than-not.”

In re Ephedra Prods. Liab. Litig., 393 F.Supp.2d 181, 193 (S.D.N.Y. 2005)(confusing the standard for Type I statistical error with the burden of proof).

“More-probable-than-not might be likened to P < .5, so that preponderance of the evidence is nearly ten times less significant (whatever that might mean) than the scientific standard.”

Id. at 193 n.9 (same). 
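
The two standards are not even on the same scale.  The significance level speaks to sampling error under the null hypothesis; the preponderance standard, on the relative risk logic discussed above, speaks to the size of the association.  A hypothetical Python sketch (the numbers are invented for illustration) shows that a result can be overwhelmingly “significant” and still fall far short of the doubling of risk needed for specific causation:

```python
import math

def p_value_from_ci(rr, ci_low, ci_high):
    """Approximate two-sided p-value recovered from a ratio estimate and
    its 95% CI, using a normal approximation on the log scale."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = abs(math.log(rr)) / se
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# Hypothetical large study: a small but precisely estimated association
rr, ci_low, ci_high = 1.2, 1.10, 1.31

print(f"p-value: {p_value_from_ci(rr, ci_low, ci_high):.5f}")  # far below .05
print(f"excess fraction: {(rr - 1.0) / rr:.0%}")               # only about 17%
```

Conversely, a small study can report a relative risk above two that fails to reach statistical significance.  The p-value and the probability of specific causation answer different questions.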

In the Phenylpropanolamine litigation, the error was even more clearly stated, for both p-values and confidence intervals:

“P-values measure the probability that the reported association was due to chance… .”

“… while confidence intervals indicate the range of values within which the true odds ratio is likely to fall.”

In re Phenylpropanolamine Products Liab. Litig., 289 F. Supp. 2d 1230, 1236 n.1 (W.D. Wash. 2003)
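
The confidence interval misstatement is the same transposition error in another guise.  The 95% figure describes the long-run performance of the interval-generating procedure over repeated samples; it is not the probability that the true value lies within any one computed interval.  A simulation sketch of my own (not drawn from the cited opinion) illustrates the distinction:

```python
import random

def coverage_rate(n_trials=10_000, n=100, true_mean=0.0):
    """Fraction of 95% confidence intervals, over repeated samples, that
    cover the true mean.  Any single interval either covers it or it does not."""
    covered = 0
    for _ in range(n_trials):
        sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        half_width = 1.96 / n ** 0.5   # known unit variance
        if mean - half_width <= true_mean <= mean + half_width:
            covered += 1
    return covered / n_trials

print(coverage_rate())   # about 0.95, a property of the procedure, not of one interval
```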

These misstatements raise important questions about judicial competency for gatekeeping, the selection, education, and training of judges, the assignment of MDL cases to individual trial judges, and the aggregation of Rule 702 motions to a trial judge for a single, one-time decision that will control hundreds if not thousands of cases.

Recently, a student published a bold note that argued for the dismantling of judicial gatekeeping.  Note, “Admitting Doubt: A New Standard for Scientific Evidence,” 123 Harvard Law Review 2021 (2010).  With all the naiveté of someone who has never tried a jury trial, the student argued that juries are at least as good as judges, if not better, at handling technical questions.  The empirical evidence for such a suggestion is slim, and it ignores the geographic variability in jury pools.  The above instances of erroneous statistical interpretations might seem to support the student’s note, but that argument would miss two important points:

  • these errors are put on display for all to see, and for commentators to note and correct, whereas jury decisions obscure their mistakes; and
  • judges can be singled out for their technical competencies, and given appropriate assignments (which hardly ever happens at present), and judges can be required to partake in professional continuing legal education, which might well include training in technical areas to improve their decision making.

The Federal Judicial Center, and its state court counterparts, have work to do.  Lawyers also have an obligation to help courts get difficult, technical issues right.  Finally, courts, lawyers, and commentators need to rethink how the so-called Daubert process works, and does not work, especially in the high-stakes arena of multi-district litigation.

Can Daubert Survive the Multi-District Litigation Process?

February 23rd, 2011

The so-called Daubert process, by which each side in a lawsuit may challenge and seek preclusion of the other side’s expert witnesses, arose in the setting of common-law judges making rulings in individual cases.  Indeed, the Daubert case itself, although one of many cases involving claims of birth defects allegedly caused by Bendectin, was an individual case. 

In the silicone gel breast implant (SGBI) litigation, the process evolved over time, with decisions from different judges, each of whom saw the evidence differently.  The different judges brought different insights and aptitudes to bear on the evidence, and the expert witnesses themselves may have varied in their approaches and in their reliance upon different studies.  This incrementalist approach, in the context of the SGBI litigation, worked to the benefit of the defendants, in part because their counsel learned about the fraudulent evidence underlying certain studies, and about serious lapses in the standard of research care on the part of some investigators whose studies were prominently relied upon by plaintiffs’ counsel.  In the case of one dubious study, one of its authors, Marc Lappe, a prominent expert witness for plaintiffs, withdrew his support from the conclusions advanced in the study.

Early decisions in the SGBI cases (shortly after the Supreme Court’s decision in Daubert, in 1993) denied the defendants’ applications to preclude plaintiffs’ expert witnesses’ opinion testimony.  Later decisions converged upon the unavoidable truth that the case for SGBIs causing atypical or typical connective tissue diseases was a house of cards, built mostly with jokers.  If the Daubert process had been censored after the first hearing, the result would have been to deem all the breast implant cases trial and jury worthy, to the detriment of the judicial process, to the public’s interest in knowing the truth about silicone biomaterials, to the defendants’ reputational and financial interests, and to the interests of the claimants who had been manipulated by their counsel and support group leaders.

The evolutionary approach taken in the SGBI litigation was indirectly supported by the late Judge Sam Pointer, who presided over the SGBI federal multi-district litigation (MDL).  Judge Pointer strongly believed that the decision to exclude expert testimony belonged to the individual trial judges who received cases on remand from MDL 926, when the cases were ready for trial.  Judge Pointer ruled on expert witness challenges in cases set for trial before him, but he was not terribly enthusiastic about the Daubert process, and he denied most of the motions in a fairly perfunctory fashion.  Because of this procedural approach, Judge Pointer’s laissez-faire attitude towards expert witness testimony did not interfere with the evolutionary process that allowed other courts to see through the dense fog in the plaintiffs’ case.

Since MDL 926, the MDL process has absorbed the ritual of each side’s challenging the other’s expert witnesses, and MDL judges view their role as including hearing and deciding all pre-trial Daubert challenges.  It has been over 17 years since the Supreme Court decided Daubert, and in that time, the MDL model, both state and federal, has become dominant.  As a result, the Daubert process has often been truncated and abridged to a single motion, decided at one time, by one judge.  The results of this abridgement have not always been happy for ensuring reliable and accurate gatekeeping.

The MDL process appears to have broken the promise of Rule 702 in many cases.  By putting the first and only Rule 702 gatekeeping decision in the hands of a single judge, charged with making pre-trial rulings in the entire MDL, the MDL process has sapped the gatekeeping process of its dynamic, evolutionary character.  No longer can litigants and judges learn from previous efforts, as well as from commentary by scientists and legal scholars on the prior outcomes.  For judges who lack scientific and analytical acumen, this isolation from the scientific community works to the detriment of the entire process.

To be sure, the MDL process for deciding Rule 702 motions is efficient.  In many cases, expensive motions, briefings, and hearings are reduced to one event.  The incorporation of expert challenges into an MDL may improve fairness in some instances by allowing well-qualified plaintiffs’ counsel to wrest control of the process from unprepared plaintiffs’ counsel who are determined to control their individual cases.  Defendants may embrace the MDL process because it permits a single, unified document production and a single discovery schedule for corporate executives.  Perhaps defendants see the gains from the MDL process as sufficiently important to forgo the benefit of a fuller opportunity to litigate the expert witness issues.  Whatever can be said in favor of using the MDL forum to resolve expert witness challenges, it is clear that MDL procedures limit the parties’ ability to refine their challenges over time, and to incorporate new evidence and discovery gained after the first challenges are resolved.  In the SGBI litigation, for instance, the defendants learned of significant scientific malfeasance and misfeasance that undermined key studies relied upon by plaintiffs, including some studies done by apparently neutral, well-credentialed scientists.  The omnibus MDL Daubert motion prevents either side, and the judiciary, from learning from the first and only motion.

Another example of an evidentiary display that has changed over time comes from the asbestos litigation, where plaintiffs continue to claim that asbestos causes gastrointestinal cancer.  The first such cases were pressed by plaintiffs in the early 1980s, with the support of Dr. Selikoff and his cadre of testifying physicians and scientists.  A few years ago, however, the Institute of Medicine convened a committee to review non-pulmonary cancers and asbestos, and concluded that the studies, now accumulated over 35 years since Dr. Selikoff’s ipse dixit, do not support a conclusion that asbestos causes colorectal cancer.  Institute of Medicine of the National Academies, Asbestos: Selected Health Effects (2006).

Unfortunately, many trial judges view the admissibility and sufficiency of causation opinions on asbestos and colorectal cancer as “grandfathered” by virtue of the way business has been conducted in trial courts for over three decades.  Still, defendants have gained the opportunity to invoke an important systematic review, which shows that the available evidence does not reliably support the conclusion urged by plaintiffs’ expert witnesses. 

The current approach of using the MDL as the vehicle for resolving expert witness challenges raises serious questions about how MDLs are assigned to judges, and whether those judges have the analytical or quantitative skills to resolve Daubert challenges.  Assigning an MDL to a judge who will have to rule on the admissibility of expert witness opinion testimony that she or he does not understand does not inspire confidence in the judicial process.  At least under the ad hoc approach employed in the SGBI litigation, the parties could size up their trial judge, and decide whether to forgo their expert challenges based upon that assessment.  Furthermore, an anomalous outcome could be corrected over a series of decisions.  The MDL process, on the other hand, frequently places the Rule 702 decision in the discretion of a single judge.  The selection criteria for that sole decision maker become critical.  As equity in days of old varied with the size of the Chancellor’s foot, today’s scientific equity under Rule 702 may vary with the accuracy of the trial judge’s slide rule.