TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Reference Manual on Scientific Evidence on Relative Risk Greater Than Two For Specific Causation Inference

April 25th, 2015

The first edition of the Reference Manual on Scientific Evidence [Manual] was published in 1994, a year after the Supreme Court delivered its opinion in Daubert. The Federal Judicial Center organized and produced the Manual, in response to the consternation created by the Supreme Court’s mandate that federal trial judges serve as gatekeepers of the methodological propriety of testifying expert witnesses’ opinions. Considering the intellectual vacuum the Center had to fill, and the speed with which it had to work, the first edition was a stunning accomplishment.

In litigating specific causation in so-called toxic tort cases, defense counsel quickly embraced the Manual’s apparent endorsement of the doubling-of-the-risk argument, which would require relative risks in excess of two in order to draw inferences of specific causation in a given case. See Linda A. Bailey, Leon Gordis, and Michael D. Green, “Reference Guide on Epidemiology,” in Federal Judicial Center, Reference Manual on Scientific Evidence 123, 150, 168 (Washington, DC, 1st ed., 1994) (“The relative risk from an epidemiological study can be adapted to this 50% plus standard to yield a probability or likelihood that an agent caused an individual’s disease. The threshold for concluding that an agent was more likely than not the cause of a disease is a relative risk greater than 2.0.”) (internal citations omitted).
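
The arithmetic behind the defense argument is simple, and worth setting out because everything else in the dispute concerns the assumptions needed to make it work. If a valid study yields a relative risk RR, then, on the usual simplifying assumptions (the very assumptions the later editions of the Manual catalogue), the fraction of exposed cases attributable to the exposure is

(RR − 1) / RR,

which exceeds one half if and only if RR exceeds 2. A relative risk of 1.5 thus yields an attributable fraction of only one third; a relative risk of 3 yields two thirds.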

In the Second Edition of the Manual, the authorship of the epidemiology chapter shifted, and so did its treatment of doubling of the risk. By adopting a more nuanced analysis, the Second Edition deprived defense counsel of a readily citable source for the proposition that low relative risks do not support inferences of specific causation. The exact conditions for when and how the doubling argument should prevail were, however, left fuzzy and unspecified. See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, Reference Manual on Scientific Evidence 333, 348-49 (Washington, DC, 2d ed. 2000).

The latest edition of the Manual attempts to correct the failings of the Second Edition by introducing an explanation and a discussion of some of the conditions that might undermine an inference, or opposition thereto, of specific causation from the magnitude of relative risk. Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, Reference Manual on Scientific Evidence 549, 612 (Washington, DC, 3d ed., 2011).

The authors of the Manual now acknowledge that doubling of risk inference has “a certain logic as far as it goes,” but point out that there are some “significant assumptions and important caveats that require explication.” Id.

What are the assumptions, according to the Manual?

First, and foremost, there must be “[a] valid study and risk estimate.” Id. (emphasis in original). The identification of this predicate assumption is, of course, correct, but the authors overlook that the assumption is often trivially satisfied by the legal context in which the doubling argument arises. For instance, in the Landrigan and Caterinicchio cases, cited below, the doubling issue arose not as an admissibility question of expert witness opinion, but on motions for directed verdict. In both cases, plaintiffs’ expert witnesses committed to opinions about plaintiffs’ being at risk from asbestos exposure, based upon studies that they identified. Defense counsel in those cases did not concede the existence of risk, the size of the risk, or the validity of the study, but rather stipulated such facts solely for purposes of their motions to dismiss. In other words, even if the studies upon which plaintiffs relied were valid and the risk estimates accurate (with relative risks of 1.5), plaintiffs could not prevail because no reasonable jury could infer that plaintiffs’ colorectal cancers were caused by their occupational asbestos exposure. The procedural context of the doubling-of-risk argument thus often pretermits questions of validity, bias, and confounding.

Second, the Manual identifies that there must be “[s]imilarity among study subjects and plaintiff.” Id. at 613. Again, this assumption is often either pretermitted for purposes of lodging a dispositive motion, conceded, or included as part of the challenge to the admissibility of an expert witness’s opinion. For example, in some litigations, plaintiffs will rely upon high-dose or high-exposure studies that are not comparable to the plaintiff’s actual exposure, and the defense may have shown that the only reliable evidence is that there is a small (relative risk less than two) or no risk at all from the plaintiff’s exposure. External validity objections may well play a role in a contest under Rule 702, but the resolution of a doubling of risk issue will require an appropriate measure of risk for the plaintiff whose injury is at issue.

In the course of identifying this second assumption, the Manual now points out that the doubling argument turns on applying “an average risk for the group” to each individual in the group. Id. This point again is correct, but the Manual does not come to terms with the challenge often made to what I call the assumption of stochastic risk. The Manual authors quote a leading textbook on epidemiology:

“We cannot measure the individual risk, and assigning the average value to everyone in the category reflects nothing more than our ignorance about the determinants of lung cancer that interact with cigarette smoke. It is apparent from epidemiological data that some people can engage in chain smoking for many decades without developing lung cancer. Others are or will become primed by unknown circumstances and need only to add cigarette smoke to the nearly sufficient constellation of causes to initiate lung cancer. In our ignorance of these hidden causal components, the best we can do in assessing risk is to classify people according to measured causal risk indicators and then assign the average observed within a class to persons within the class.”

Id. at n.198, quoting Kenneth J. Rothman, Sander Greenland, and Tim L. Lash, Modern Epidemiology 9 (3d ed. 2008). Although the textbook on this point is unimpeachable, taken at face value, it would introduce an evidentiary nihilism for judicial determinations of specific causation in cases in which epidemiologic measures of risk size are the only basis for drawing probabilistic inferences of specific causation. See also Manual at 614 n.198, citing Ofer Shpilberg, et al., The Next Stage: Molecular Epidemiology, 50 J. Clin. Epidem. 633, 637 (1997) (“A 1.5-fold relative risk may be composed of a 5-fold risk in 10% of the population, and a 1.1-fold risk in the remaining 90%, or a 2-fold risk in 25% and a 1.1-fold for 75%, or a 1.5-fold risk for the entire population.”). The assumption of stochastic risk is, as Judge Weinstein recognized in Agent Orange, often the only assumption on which plaintiffs will ever have a basis for claiming individual causation from the typical datasets available to support health-effects claims. Elsewhere, the authors of the Manual’s chapter suggest that statistical “frequentists” would resist the adaptation of relative risk to provide a probability of causation because, for the frequentist, the individual case either is or is not caused by the exposure at issue. Manual at 611 n.188. This suggestion appears to mischaracterize the frequentist enterprise, which evaluates evidence on the basis of the probability of observing, in a sample, a departure from expectation at least as great as that observed, rather than by attempting to affix a probability to the population parameter. The doubling argument derives from the well-known “urn model” in probability theory, which is not really at issue in the frequentist-Bayesian wars.
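
The Shpilberg point can be illustrated with simple arithmetic. If a fraction p of the population carries a subgroup relative risk RR1 and the remaining 1 − p carries RR2, and the baseline (unexposed) risk is the same in both subgroups, the study will report the weighted average

RR = p × RR1 + (1 − p) × RR2,

so that, for example, 0.10 × 5 + 0.90 × 1.1 ≈ 1.5. A plaintiff who could show membership in the first subgroup would have an attributable fraction of (5 − 1)/5, or 80%; a plaintiff in the second subgroup, roughly 9%. The study-wide figure of 1.5 tells the factfinder nothing about which is which.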

Third, the Manual authors state that the doubling argument assumes the “[n]onacceleration of disease.” In many cases, this assumption is unproblematic because there is no evidence of acceleration, and because an acceleration-of-onset theory would diminish damages, defendants would typically have the burden of going forward with identifying the acceleration phenomenon. The authors go further, however, in stating that “for most of the chronic diseases of adulthood, it is not possible for epidemiologic studies to distinguish between acceleration of disease and causation of new disease.” Manual at 614. The inability to distinguish acceleration from causation of new cases would typically redound to the disadvantage of defendants making the doubling argument. In other words, the defendants would, by this supposed inability, be unable to mitigate damages by showing that the alleged harm would have occurred anyway, only later in time. See Manual at 615 n.199 (“If acceleration occurs, then the appropriate characterization of the harm for purposes of determining damages would have to be addressed. A defendant who only accelerates the occurrence of harm, say, chronic back pain, that would have occurred independently in the plaintiff at a later time is not liable for the same amount of damages as a defendant who causes a lifetime of chronic back pain.”). More important, the Manual appears to be wrong in asserting that epidemiologic studies and clinical trials cannot identify acceleration of onset of a particular disease. Many modern longitudinal epidemiologic studies and clinical trials use survival analysis and time windows to identify latency or time-lagged outcomes in association with identified exposures.
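
A minimal simulation illustrates why the Manual’s pessimism about distinguishing acceleration from new-case causation seems overstated. The Python sketch below uses wholly hypothetical numbers and an exponential onset-time model chosen only for convenience; nothing in it comes from any real study. It compares window-specific case ratios under a pure-acceleration model (the same cases, moved five years earlier) and a new-case model (an added, independent excess hazard).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200_000  # hypothetical cohort size for each arm

# Background onset times (years) for an unexposed cohort; purely illustrative.
t_unexposed = rng.exponential(scale=40.0, size=n)

# Scenario A -- pure acceleration: the same cases occur, but five years sooner.
t_accelerated = np.maximum(t_unexposed - 5.0, 0.0)

# Scenario B -- new cases: background hazard plus an independent excess hazard.
t_excess = rng.exponential(scale=80.0, size=n)
t_new_cases = np.minimum(t_unexposed, t_excess)

def window_ratio(t_exposed, t_reference, lo, hi):
    """Crude ratio of case counts within a follow-up window [lo, hi).
    Ignores depletion of the risk set; adequate for illustration only."""
    exposed = np.sum((t_exposed >= lo) & (t_exposed < hi))
    reference = np.sum((t_reference >= lo) & (t_reference < hi))
    return exposed / reference

for lo, hi in [(0, 10), (10, 20), (20, 30)]:
    rr_a = window_ratio(t_accelerated, t_unexposed, lo, hi)
    rr_b = window_ratio(t_new_cases, t_unexposed, lo, hi)
    print(f"years {lo:2d}-{hi:2d}:  acceleration ratio ~ {rr_a:.2f}   new-case ratio ~ {rr_b:.2f}")
```

Under pure acceleration, the ratio is elevated in the earliest window and falls below 1.0 thereafter, because the cases have merely been moved forward in time; under new-case causation, the ratio remains elevated throughout follow-up. Survival analyses with time windows exploit exactly this difference.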

The fourth assumption identified in the Manual is that the exposure under study acts independently of other exposures. The authors give the time-worn example of multiplicative synergy between asbestos and smoking, what elsewhere has been referred to as “The Mt. Sinai Catechism” (June 7, 2013). The example was improvidently chosen given that the multiplicative nature was doubtful when first advanced, and now has effectively been retracted or modified by the researchers following the health outcomes of asbestos insulators in the United States. More important for our purposes here, interactions can be quantified and added to the analysis of attributable risk; interactions are not insuperable barriers to reasonable apportionment of risk.
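
Interaction complicates, but does not defeat, apportionment. A purely illustrative calculation, with round numbers rather than any particular cohort’s estimates: suppose smoking alone carries a relative risk of 10, asbestos alone a relative risk of 5, and the joint exposure a relative risk of 50 (the multiplicative model). For a smoking insulator, one conventional measure of the probability that asbestos played a necessary role is

(RRjoint − RRsmoking) / RRjoint = (50 − 10) / 50 = 80%.

Whether the multiplicative model is itself sustainable is, as noted, a separate question; the point is only that a quantified interaction yields a quantified apportionment rather than an insuperable barrier.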

Fifth, the Manual identifies two additional assumptions: (a) that the exposure at issue is not responsible for another outcome that competes with morbidity or mortality, and (b) that the exposure does not provide a protective “effect” in a subpopulation of those studied. Manual at 615. On the first, the authors suggest that the assumption is required “because in the epidemiologic studies relied on, those deaths caused by the alternative disease process will mask the true magnitude of increased incidence of the studied disease when the study subjects die before developing the disease of interest.” Id. at 615 n.202. Competing causes, however, are frequently studied and can be treated as confounders in an appropriate regression or propensity score analysis to yield a risk estimate for each individual putative effect at issue. The second assumption is a rehash of the speculative assertion that the epidemiologic study (and the population it samples) may not have a stochastic distribution of risk. Although the stochastic assumption may not be correct, it often favors the party asserting the claim, who otherwise could not exclude the possibility that he was in a subpopulation not affected at all, or even benefitted, by the exposure. Again, modern epidemiology does not stop at identifying populations at risk, but continues to refine the assessment by trying to identify the subpopulations in which the risk exclusively resides. The existence of multimodal distributions of risk within a population is, again, not a barrier to the doubling argument.

With sufficiently large samples, epidemiologic studies may be able to identify subgroups that have very large relative risks, even when the overall sample under study had a relative risk under two. The possibility of such subgroups, however, should not be an invitation to wholesale speculation that a given plaintiff is in a “vulnerable” subgroup without reliable, valid evidence of what the risks for the identified subgroup are. Too often, the vulnerable plaintiff or subgroup claim is merely hand waving in an evidentiary vacuum. The Manual authors seem to adopt this hand-waving attitude when they give a speculative hypothetical example:

“For example, genetics might be known to be responsible for 50% of the incidence of a disease independent of exposure to the agent. If genetics can be ruled out in an individual’s case, then a relative risk greater than 1.5 might be sufficient to support an inference that the agent was more likely than not responsible for the plaintiff’s disease.”

Manual at 615-16 (internal citations omitted). The hypothetical does not make clear whether the “genetics” cases are part of the study that yielded a relative risk of 1.5, but of course if the “genetics” were uniformly distributed in the population, and also in the sample studied in the epidemiologic study, then the “genetics” would appear to drop out of playing any role in elevating risk. And as the authors pointed out in their caveats about interaction, there may well be interaction between the “genetics” and the exposure in the study, such that the “genetics” cases occurred earlier or added nothing to the disease burden that would have been caused by the exposure that reported out a relative risk of 1.5. The bottom line is that the plaintiff would need a study stratified on the “genetics” to see what relative risks are observed in people without the genes at issue.
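
One way to reconstruct the arithmetic that the Manual’s authors presumably have in mind shows where the hidden assumption lies. Let the background incidence among the unexposed be I, with half of it attributable to the “genetics,” and let the incidence among the exposed be 1.5 × I. For an exposed individual in whom the “genetics” can be ruled out, the remaining background incidence is 0.5 × I, while the exposure-attributable increment is still 1.5 × I − I = 0.5 × I, so the probability of causation is 0.5 I / (0.5 I + 0.5 I), or 50%. The hidden assumption is that the entire excess of 0.5 × I applies undiminished to the non-genetic subgroup, that is, that none of the exposure-related excess is concentrated among the “genetics” cases. Only a study stratified on the “genetics” could verify that assumption.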

The Third Edition of the Manual does add more nuance to the doubling of risk argument, but alas more nuance yet is needed. The chapter is an important source to include in any legal argument for or against inferences of specific causation, but it is hardly the final word.

Below is an updated reference list of cases that address the doubling argument.


Radiation

Johnston v. United States, 597 F. Supp. 374, 412, 425-26 (D. Kan. 1984) (rejecting even a relative risk of greater than two as supporting an inference of specific causation)

Allen v. United States, 588 F. Supp. 247, 418 (D. Utah 1984) (rejecting mechanical application of doubling of risk), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987), cert. denied, 484 U.S. 1004 (1988)

In re TMI Litig., 927 F. Supp. 834, 845, 864–66 (M.D. Pa. 1996), aff’d, 89 F.3d 1106 (3d Cir. 1996), aff’d in part, rev’d in part, 193 F.3d 613 (3d Cir. 1999) (rejecting the trial court’s “doubling dose” analysis), modified, 199 F.3d 158 (3d Cir. 2000) (stating that a dose below ten rems is insufficient to infer more likely than not the existence of a causal link)

In re Hanford Nuclear Reservation Litig., 1998 WL 775340, at *8 (E.D. Wash. Aug. 21, 1998) (“‘[d]oubling of the risk’ is the legal standard for evaluating the sufficiency of the plaintiffs’ evidence and for determining which claims should be heard by the jury,” citing Daubert II), rev’d, 292 F.3d 1124, 1136-37 (9th Cir. 2002) (general causation)

In re Berg Litig., 293 F.3d 1127 (9th Cir. 2002) (companion case to In re Hanford)

Cano v. Everest Minerals Corp., 362 F. Supp. 2d 814, 846 (W.D. Tex. 2005) (relative risk less than 3.0 represents only a weak association)

Cook v. Rockwell Internat’l Corp., 580 F. Supp. 2d 1071, 1083 n.8, 1084, 1088-89 (D. Colo. 2006) (citing Daubert II and “concerns” by Sander Greenland and David Egilman, plaintiffs’ expert witnesses in other cases), rev’d and remanded on other grounds, 618 F.3d 1127 (10th Cir. 2010), cert. denied, ___ U.S. ___ (May 24, 2012)

Cotroneo v. Shaw Envt’l & Infrastructure, Inc., No. H-05-1250, 2007 WL 3145791, at *3 (S.D. Tex. Oct. 25, 2007) (citing Havner, 953 S.W.2d at 717) (radioactive material)


Swine Flu – GBS Cases

Cook v. United States, 545 F. Supp. 306, 308 (N.D. Cal. 1982)(“Whenever the relative risk to vaccinated persons is greater than two times the risk to unvaccinated persons, there is a greater than 50% chance that a given GBS case among vaccinees of that latency period is attributable to vaccination, thus sustaining plaintiff’s burden of proof on causation.”)

Robinson v. United States, 533 F. Supp. 320, 325-28 (E.D. Mich. 1982) (finding for the government and against claimant who developed acute signs and symptoms of GBS 17 weeks after inoculation, in part because of relative and attributable risks)

Padgett v. United States, 553 F. Supp. 794, 800 – 01 (W.D. Tex. 1982) (“From the relative risk, we can calculate the probability that a given case of GBS was caused by vaccination. . . . [A] relative risk of 2 or greater would indicate that it was more likely than not that vaccination caused a case of GBS.”)

Manko v. United States, 636 F. Supp. 1419, 1434 (W.D. Mo. 1986) (relative risk of 2, or less, means exposure not the probable cause of disease claimed) (incorrectly suggesting that relative risk of two means that there was a 50% chance the disease was caused by “chance alone”), aff’d in relevant part, 830 F.2d 831 (8th Cir. 1987)


IUD Cases – Pelvic Inflammatory Disease

Marder v. G.D. Searle & Co., 630 F. Supp. 1087, 1092 (D.Md. 1986) (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”), aff’d mem. on other grounds sub nom. Wheelahan v. G.D.Searle & Co., 814 F.2d 655 (4th Cir. 1987) (per curiam)


Bendectin cases

Lynch v. Merrell-National Laboratories, 646 F.Supp. 856 (D. Mass. 1986)(granting summary judgment), aff’d, 830 F.2d 1190, 1197 (1st Cir. 1987)(distinguishing between chances that “somewhat favor” plaintiff and plaintiff’s burden of showing specific causation by “preponderant evidence”)

DeLuca v. Merrell Dow Pharm., Inc., 911 F.2d 941, 958-59 (3d Cir. 1990) (commenting that ‘‘[i]f New Jersey law requires the DeLucas to show that it is more likely than not that Bendectin caused Amy DeLuca’s birth defects, and they are forced to rely solely on Dr. Done’s epidemiological analysis in order to avoid summary judgment, the relative risk of limb reduction defects arising from the epidemiological data Done relies upon will, at a minimum, have to exceed ‘2’’’)

Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1321 (9th Cir.) (“Daubert II”) (holding that for epidemiological testimony to be admissible to prove specific causation, there must have been a relative risk for the plaintiff of greater than 2; testimony that the drug “increased somewhat the likelihood of birth defects” is insufficient) (“For an epidemiological study to show causation under a preponderance standard . . . the study must show that children whose mothers took Bendectin are more than twice as likely to develop limb reduction birth defects as children whose mothers did not.”), cert. denied, 516 U.S. 869 (1995)

DePyper v. Navarro, 1995 WL 788828 (Mich. Cir. Ct. Nov. 27, 1995)

Oxendine v. Merrell Dow Pharm., Inc., 1996 WL 680992 (D.C. Super. Ct. Oct. 24, 1996) (noting testimony by Dr. Michael Bracken, that had Bendectin doubled risk of birth defects, overall rate of that birth defect should have fallen 23% after manufacturer withdrew drug from market, when in fact the rate remained relatively steady)

Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 716 (Tex. 1997) (holding, in accord with the weight of judicial authority, “that the requirement of a more than 50% probability means that epidemiological evidence must show that the risk of an injury or condition in the exposed population was more than double the risk in the unexposed or control population”); id. at 719 (rejecting isolated statistically significant associations when not consistently found among studies)


Silicone Cases

Hall v. Baxter Healthcare, 947 F. Supp. 1387, 1392, 1397, 1403-04 (D. Ore. 1996) (discussing relative risk of 2.0)

Pick v. American Medical Systems, Inc., 958 F. Supp. 1151, 1160 (E.D. La. 1997) (noting, correctly but irrelevantly, in penile implant case, that “any” increased risk suggests that the exposure “may” have played some causal role)

In re Breast Implant Litigation, 11 F. Supp. 2d 1217, 1226-27 (D. Colo. 1998) (relative risk of 2.0 or less shows that the background risk is at least as likely to have given rise to the alleged injury)

Barrow v. Bristol-Myers Squibb Co., 1998 WL 812318, at *23 (M.D. Fla. Oct. 29, 1998)

Minnesota Mining and Manufacturing v. Atterbury, 978 S.W.2d 183, 198 (Tex.App. – Texarkana 1998) (noting that Havner declined to set strict criteria and that “[t]here is no requirement in a toxic tort case that a party must have reliable evidence of a relative risk of 2.0 or greater”)

Allison v. McGhan Med. Corp., 184 F.3d 1300, 1315 n.16, 1316 (11th Cir. 1999) (affirming exclusion of expert testimony based upon a study with a risk ratio of 1.24; noting that a statistically significant epidemiological study reporting an increased risk of a disease marker of 1.24 in patients with breast implants was so close to 1.0 that it “was not worth serious consideration for proving causation”; threshold for concluding that an agent more likely than not caused a disease is 2.0, citing Federal Judicial Center, Reference Manual on Scientific Evidence 168-69 (1994))

Grant v. Bristol-Myers Squibb, 97 F. Supp. 2d 986, 992 (D. Ariz. 2000)

Pozefsky v. Baxter Healthcare Corp., No. 92-CV-0314, 2001 WL 967608, at *3 (N.D.N.Y. August 16, 2001) (excluding causation opinion testimony given contrary epidemiologic studies; noting that sufficient epidemiologic evidence requires relative risk greater than two)

In re Silicone Gel Breast Implant Litig., 318 F. Supp. 2d 879, 893 (C.D. Cal. 2004) (“The relative risk is obtained by dividing the proportion of individuals in the exposed group who contract the disease by the proportion of individuals who contract the disease in the non-exposed group.”) (noting that relative risk must be more than doubled at a minimum to permit an inference that the risk was operating in plaintiff’s case)

Norris v. Baxter Healthcare Corp., 397 F.3d 878 (10th Cir. 2005) (discussing but not deciding specific causation and the need for relative risk greater than two; no reliable showing of general causation)


Asbestos

Lee v. Johns Manville Corp., slip op. at 3, Phila. Cty. Ct. C.P., Sept. Term 1978, No. 88 (123) (Oct. 26, 1983) (Forer, J.)(entering verdict in favor of defendants on grounds that plaintiff had failed to show that his colorectal cancer had been caused by asbestos exposure after adducing evidence of a relative risk less than two)

Washington v. Armstrong World Indus., Inc., 839 F.2d 1121 (5th Cir. 1988) (affirming grant of summary judgment on grounds that there was insufficient evidence that plaintiff’s colon cancer was caused by asbestos)

Primavera v. Celotex Corp., Phila. Cty. Ct. C.P., December Term, 1981, No. 1283 (Bench Op. of Hon. Berel Caesar, Nov. 2, 1988) (granting compulsory nonsuit on the plaintiff’s claim that his colorectal cancer was caused by his occupational exposure to asbestos)

In re Fibreboard Corp., 893 F.2d 706, 712 (5th Cir. 1990) (“It is evident that these statistical estimates deal only with general causation, for population-based probability estimates do not speak to a probability of causation in any one case; the estimate of relative risk is a property of the studied population, not of an individual’s case.” (internal quotation omitted) (emphasis in original))

Grassis v. Johns-Manville Corp., 248 N.J. Super. 446, 455-56, 591 A.2d 671, 676 (App. Div. 1991) (rejecting doubling of risk threshold in asbestos gastrointestinal cancer claim)

Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992) (reversing judgment entered on directed verdict for defendant on specific causation of claim that asbestos caused decedent’s colon cancer)

Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992) (reversing judgment entered on directed verdict for defendant on specific causation of claim that asbestos caused plaintiff’s colon cancer)

In re Joint E. & S. Dist. Asbestos Litig., 758 F. Supp. 199 (S.D.N.Y. 1991), rev’d sub nom. Maiorano v. Owens Corning Corp., 964 F.2d 92, 97 (2d Cir. 1992);

Maiorana v. National Gypsum, 827 F. Supp. 1014, 1043 (S.D.N.Y. 1993), aff’d in part and rev’d in part, 52 F.3d 1124, 1134 (2d Cir. 1995) (stating a preference for the district court’s instructing the jury on the science and then letting the jury weigh the studies)

Keene Corp. v. Hall, 626 A.2d 997 (Md. Ct. Spec. App. 1993) (laryngeal cancer)

Jones v. Owens-Corning Fiberglas Corp., 288 N.J. Super. 258, 266, 672 A.2d 230, 235 (App. Div. 1996) (rejecting doubling of risk threshold in asbestos gastrointestinal cancer claim)

In re W.R. Grace & Co., 355 B.R. 462, 483 (Bankr. D. Del. 2006) (requiring showing of relative risk greater than two to support property damage claims based on unreasonable risks from asbestos insulation products)

Kwasnik v. A.C. & S., Inc. (El Paso Cty., Tex. 2002)

Sienkiewicz v. Greif (U.K.) Ltd., [2009] EWCA (Civ) 1159, at ¶23 (Lady Justice Smith) (“In my view, it must now be taken that, saving the expression of a different view by the Supreme Court, in a case of multiple potential causes, a claimant can demonstrate causation in a case by showing that the tortious exposure has at least doubled the risk arising from the non-tortious cause or causes.”)

Sienkiewicz v. Greif (U.K.) Ltd., [2011] UKSC 10.

“Where there are competing alternative, rather than cumulative, potential causes of a disease or injury, such as in Hotson, I can see no reason in principle why epidemiological evidence should not be used to show that one of the causes was more than twice as likely as all the others put together to have caused the disease or injury.” (Lord Phillips, at ¶ 93)

(arguing that statistical evidence should be considered without clearly identifying the nature and extent of its role) (Baroness Hale, ¶ 172-73)

(insisting upon difference between fact and probability of causation, with statistical evidence not probative of the former) (Lord Rodger, at ¶143-59)

(“the law is concerned with the rights and wrongs of an individual situation, and should not treat people and even companies as statistics,” although epidemiologic evidence, he noted, can appropriately be used “in conjunction with specific evidence”) (Lord Mance, at ¶205)

(concluding that epidemiologic evidence can establish the probability, but not the fact of causation, and vaguely suggesting that whether epidemiologic evidence should be allowed was a matter of policy) (Lord Dyson, ¶218-19)

Dixon v. Ford Motor Co., 47 A.3d 1038, 1046-47 & n.11 (Md. Ct. Spec. App. 2012)(“we can explicitly derive the probability of causation from the statistical measure known as ‘relative risk’, as did the U.S. Court of Appeals for the Third Circuit in DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 958 (3d Cir. 1990), in a holding later adopted by several courts. For reasons we need not explore in detail, it is not prudent to set a singular minimum ‘relative risk’ value as a legal standard. But even if there were some legal threshold, Dr. Welch provided no information that could help the finder of fact to decide whether the elevated risk in this case was ‘substantial’.”)(internal citations omitted), rev’d, 433 Md. 137, 70 A.3d 328 (2013)


Pharmaceutical Cases

Ambrosini v. Upjohn, 1995 WL 637650, at *4 (D.D.C. Oct. 18, 1995) (excluding plaintiff’s expert witness, Dr. Brian Strom, who was unable to state that mother’s use of Depo-Provera to prevent miscarriage more than doubled her child’s risk of a birth defect)

Ambrosini v. Labarraque, 101 F.3d 129, 135 (D.C. Cir. 1996)(Depo-Provera, birth defects) (testimony “does not warrant exclusion simply because it fails to establish the causal link to a specified degree of probability”)

Siharath v. Sandoz Pharms. Corp., 131 F. Supp. 2d 1347, 1356 (N.D. Ga. 2001)

Cloud v. Pfizer Inc., 198 F. Supp. 2d 1118, 1134 (D. Ariz. 2001) (sertraline and suicide)

Miller v. Pfizer, 196 F. Supp. 2d 1062, 1079 (D. Kan. 2002) (acknowledging that most courts require a showing of RR > 2, but questioning their reasoning; “Court rejects Pfizer’s argument that unless Zoloft is shown to create a relative risk [of akathisia] greater than 2.0, [expert’s] testimony is inadmissible”), aff’d, 356 F. 3d 1326 (10th Cir.), cert. denied, 543 U.S. 917 (2004)

XYZ, et al. v. Schering Health Care Ltd., [2002] EWHC 1420, at ¶21, 70 BMLR 88 (QB 2002) (noting with approval that claimants had accepted the need to  prove relative risk greater than two; finding that most likely relative risk was 1.7, which required finding against claimants even if general causation were established)

Smith v. Wyeth-Ayerst Laboratories Co., 278 F. Supp. 2d 684, 691 (W.D.N.C. 2003) (recognizing that risk and cause are distinct concepts) (“Epidemiologic data that shows a risk cannot support an inference of cause unless (1) the data are statistically significant according to scientific standards used for evaluating such associations; (2) the relative risk is sufficiently strong to support an inference of ‘more likely than not’; and (3)  the epidemiologic data fits the plaintiff’s case in terms of exposure, latency, and other relevant variables.”) (citing FJC Reference Manual at 384 – 85 (2d ed. 2000))

Kelley v. Sec’y of Health & Human Servs., 68 Fed. Cl. 84, 92 (Fed. Cl. 2005) (quoting Kelley v. Sec’y of Health & Human Servs., No. 02-223V, 2005 WL 1125671, at *5 (Fed. Cl. Mar. 17, 2005) (opinion of Special Master explaining that epidemiology must show relative risk greater than two to provide evidence of causation), rev’d on other grounds, 68 Fed. Cl. 84 (2005))

Pafford v. Secretary of HHS, No. 01–0165V, 64 Fed. Cl. 19, 2005 WL 4575936, at *8 (2005) (expressing preference for “an epidemiologic study demonstrating a relative risk greater than two … or dispositive clinical or pathological markers evidencing a direct causal relationship”) (citing Stevens v. Secretary of HHS, 2001 WL 387418, at *12), aff’d, 451 F.3d 1352 (Fed. Cir. 2006)

Burton v. Wyeth-Ayerst Labs., 513 F. Supp. 2d 719, 730 (N.D. Tex. 2007) (affirming exclusion of expert witness testimony that did not meet Havner’s requirement of relative risks greater than two, Merrell Dow Pharm., Inc. v. Havner, 953 S.W.2d 706, 717–18 (Tex. 1997))

In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1172 (N.D. Calif. 2007) (observing that epidemiologic studies “can also be probative of specific causation, but only if the relative risk is greater than 2.0, that is, the product more than doubles the risk of getting the disease”)

In re Bextra & Celebrex, 2008 N.Y. Misc. LEXIS 720, *23-24, 239 N.Y.L.J. 27 (2008) (“Proof that a relative risk is greater than 2.0 is arguably relevant to the issue of specific, as opposed to general causation and is not required for plaintiffs to meet their burden in opposing defendants’ motion.”)

In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1078 (D. Minn. 2008) (noting that some but not all courts have concluded that relative risks under two support finding an expert witness’s opinion to be inadmissible)

Vanderwerf v. SmithKlineBeecham Corp., 529 F.Supp. 2d 1294, 1302 n.10 (D. Kan. 2008), appeal dism’d, 603 F.3d 842 (10th Cir. 2010) (“relative risk of 2.00 means that a particular event of suicidal behavior has a 50 per cent chance that is associated with the exposure to Paxil … .”)

Wright v. American Home Products Corp., 557 F. Supp. 2d 1032, 1035-36 (W.D. Mo. 2008) (fenfluramine case)

Beylin v. Wyeth, 738 F.Supp. 2d 887, 893 n.3 (E.D.Ark. 2010) (MDL court) (Wilson, J. & Montgomery, J.) (addressing relative risk of two argument in dictum; holding that defendants’ argument that for an opinion to be relevant it must show that the medication causes the relative risk to exceed two “was without merit”)

Merck & Co. v. Garza, 347 S.W.3d 256 (Tex. 2011), rev’g 2008 WL 2037350, at *2 (Tex. App. — San Antonio May 14, 2008, no pet. h.)

Scharff v. Wyeth, No. 2:10–CV–220–WKW, 2012 WL 3149248, *6 & n.9, 11 (M.D. Ala. Aug. 1, 2012) (post-menopausal hormone therapy case; “A relative risk of 2.0 implies a 50% likelihood that an exposed individual’s disease was caused by the agent. The lower relative risk in this study reveals that some number less than half of the additional cases could be attributed to [estrogen and progestin].”)

Cheek v. Wyeth, LLC (In re Diet Drugs), 890 F.Supp. 2d 552 (E.D. Pa. 2012)


Medical Malpractice – Failure to Prescribe; Delay in Treatment

Merriam v. Wanger, 757 A.2d 778, 2000 Me. 159 (2000) (reversing judgment on jury verdict for plaintiff on grounds that plaintiff failed to show that defendant’s failure to act was, more likely than not, a cause of harm)

Bonesmo v. The Nemours Foundation, 253 F. Supp. 2d 801, 809 (D. Del. 2003)

Theofanis v. Sarrafi, 791 N.E.2d 38,48 (Ill. App. 2003) (reversing and granting new trial to plaintiff who received an award of no damages when experts testified that relative risk was between 2.0 and 3.0)(“where the risk with the negligent act is at least twice as great as the risk in the absence of negligence, the evidence supports a finding that, more likely than not, the negligence in fact caused the harm”)

Cottrelle v. Gerrard, 67 OR (3d) 737 (2003), 2003 CanLII 50091 (ONCA), at ¶ 25 (Sharpe, J.A.) (less than a probable chance that timely treatment would have made a difference for plaintiff is insufficient), leave to appeal den’d SCC (April 22, 2004)

Joshi v. Providence Health System of Oregon Corp., 342 Or. 152, 156, 149 P. 3d 1164, 1166 (2006) (affirming directed verdict for defendants when expert witness testified that he could not state, to a reasonable degree of medical probability, beyond 30%, that administering t-PA, or other anti-coagulant would have changed the outcome and prevented death)

Ensink v. Mecosta County Gen. Hosp., 262 Mich. App. 518, 687 N.W.2d 143 (Mich. App. 2004) (affirming summary judgment for hospital and physicians when patient could not show a greater than 50% probability of obtaining a better result had emergency physician administered t-PA within three hours of stroke symptoms)

Lake Cumberland, LLC v. Dishman, 2007 WL 1229432, *5 (Ky. Ct. App. 2007) (unpublished) (confusing 30% with a “reasonable probability”; citing without critical discussion an apparently innumerate opinion of expert witness Dr. Lawson Bernstein)

Mich. Comp. Laws § 600.2912a(2) (2009) (“In an action alleging medical malpractice, the plaintiff has the burden of proving that he or she suffered an injury that more probably than not was proximately caused by the negligence of the defendant or defendants. In an action alleging medical malpractice, the plaintiff cannot recover for loss of an opportunity to survive or an opportunity to achieve a better result unless the opportunity was greater than 50%.”)

O’Neal v. St. John Hosp. & Med. Ctr., 487 Mich. 485, 791 N.W.2d 853 (Mich. 2010) (affirming denial of summary judgment when failure to administer therapy (not t-PA) in a timely fashion supposedly more than doubled the risk of stroke)

Kava v. Peters, 450 Fed. Appx. 470, 478-79 (6th Cir. 2011) (affirming summary judgment for defendants when plaintiffs’ expert witnesses failed to provide clear testimony that plaintiff’s specific condition would have been improved by timely administration of therapy)

Smith v. Bubak, 643 F.3d 1137, 1141–42 (8th Cir. 2011) (rejecting relative benefit testimony and suggesting in dictum that absolute benefit “is the measure of a drug’s overall effectiveness”)

Young v. Mem’l Hermann Hosp. Sys., 573 F.3d 233, 236 (5th Cir. 2009) (holding that Texas law requires a doubling of the relative risk of an adverse outcome to prove causation), cert. denied, ___ U.S. ___, 130 S.Ct. 1512 (2010)

Gyani v. Great Neck Medical Group, 2011 WL 1430037 (N.Y. Sup. Ct. Nassau Cty. April 4, 2011) (denying summary judgment to medical malpractice defendant on stroke patient’s claims of failure to administer t-PA, based upon naked assertions of proximate cause by plaintiff’s expert witness, and without considering the actual magnitude of risk increased by the alleged failure to treat)

Samaan v. St. Joseph Hospital, 670 F.3d 21 (1st Cir. 2012)

Goodman v. Viljoen, 2011 ONSC 821 (CanLII) (treating a risk ratio of 1.7 for harm, or 0.6 for prevention, as satisfying the “balance of probabilities” when taken with additional unquantified, unvalidated speculation), aff’d, 2012 ONCA 896 (CanLII), leave to appeal den’d, Supreme Court of Canada No. 35230 (July 11, 2013)

Briante v. Vancouver Island Health Authority, 2014 BCSC 1511, at ¶ 317 (plaintiff must show “on a balance of probabilities that the defendant caused the injury”)


Toxic Tort Cases

In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 785, 836 (E.D.N.Y. 1984) (“A government administrative agency may regulate or prohibit the use of toxic substances through rulemaking, despite a very low probability of any causal relationship.  A court, in contrast, must observe the tort law requirement that a plaintiff establish a probability of more than 50% that the defendant’s action injured him. … This means that at least a two-fold increase in incidence of the disease attributable to Agent Orange exposure is required to permit recovery if epidemiological studies alone are relied upon.”), aff’d 818 F.2d 145, 150-51 (2d Cir. 1987)(approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 487 U.S. 1234 (1988)

Wright v. Willamette Indus., Inc., 91 F.3d 1105 (8th Cir. 1996)(“Actions in tort for damages focus on the question of whether to transfer money from one individual to another, and under common-law principles (like the ones that Arkansas law recognizes) that transfer can take place only if one individual proves, among other things, that it is more likely than not that another individual has caused him or her harm.  It is therefore not enough for a plaintiff to show that a certain chemical agent sometimes causes the kind of harm that he or she is complaining of.  At a minimum, we think that there must be evidence from which the factfinder can conclude that the plaintiff was exposed to levels of that agent that are known to cause the kind of harm that the plaintiff claims to have suffered. See Abuan v. General Elec. Co., 3 F.3d at 333.  We do not require a mathematically precise table equating levels of exposure with levels of harm, but there must be evidence from which a reasonable person could conclude that a defendant’s emission has probably caused a particular plaintiff the kind of harm of which he or she complains before there can be a recovery.”)

Sanderson v. Internat’l Flavors & Fragrances, Inc., 950 F. Supp. 981, 998 n. 17,  999-1000, 1004 (C.D. Cal.1996) (more than a doubling of risk is required in case involving aldehyde exposure and claimed multiple chemical sensitivities)

McDaniel v. CSX Transp., Inc., 955 S.W.2d 257, 264 (Tenn. 1997) (doubling of risk is relevant but not required as a matter of law)

Schudel v. General Electric Co., 120 F.3d 991, 996 (9th Cir. 1997) (polychlorinated biphenyls)

Lofgren v. Motorola, 1998 WL 299925, at *14 (Ariz. Super. June 1, 1998) (suggesting that the relative risk requirement in trichloroethylene cancer medical monitoring case was arbitrary, but excluding plaintiffs’ expert witnesses on other grounds)

Berry v. CSX Transp., Inc., 709 So. 2d 552 (Fla. Dist. Ct. App. 1998) (reversing exclusion of plaintiff’s epidemiologist in case involving claims of toxic encephalopathy from solvent exposure, before Florida adopted Daubert standard)

Bartley v. Euclid, Inc., 158 F.3d 261 (5th Cir. 1998) (evidence at trial more than satisfied the relative risk greater than two requirement), rev’d on rehearing en banc, 180 F.3d 175 (5th Cir. 1999)

Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 591-92, 605 n.27, 606–07 (D.N.J. 2002) (“When the relative risk reaches 2.0, the risk has doubled, indicating that the risk is twice as high among the exposed group as compared to the non-exposed group. Thus, ‘the threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0’.”) (quoting FJC Reference Manual at 384), aff’d, 68 F. App’x 356 (3d Cir. 2003)

Allison v. Fire Ins. Exchange, 98 S.W.3d 227, 239 (Tex. App. — Austin 2002, no pet. h.)

Ferguson v. Riverside School Dist. No. 416, 2002 WL 34355958 (E.D. Wash. Feb. 6, 2002) (No. CS-00-0097-FVS)

Daniels v. Lyondell-Citgo Refining Co., 99 S.W.3d 722, 727 (Tex. App. – Houston [1st Dist.] 2003) (affirming exclusion of expert witness testimony that did not meet Havner’s requirement of relative risks greater than two)

Exxon Corp. v. Makofski, 116 S.W.3d 176, 184-85 (Tex. App. — Houston 2003)

Frias v. Atlantic Richfield Co., 104 S.W.3d 925 (Tex. App. — Houston 2003)

Graham v. Lautrec, Ltd., 2003 WL 23512133, at *1 (Mich. Cir. Ct. 2003) (mold)

Mobil Oil Corp. v. Bailey, 187 S.W.3d 263, 268 (Tex. App. – Beaumont 2006) (affirming exclusion of expert witness testimony that did not meet Havner’s requirement of relative risks greater than two)

In re Lockheed Litig. Cases, 115 Cal. App. 4th 558 (2004)(alleging brain, liver, and kidney damage), rev’d in part, 23 Cal. Rptr. 3d 762, 765 (Cal. App. 2d Dist. 2005) (“[A] court cannot exclude an epidemiological study from consideration solely because the study shows a relative risk of less than 2.0.”), rev. dismissed, 192 P.3d 403 (Cal. 2007)

Novartis Grimsby Ltd. v. Cookson, [2007] EWCA (Civ) 1261, at para. 74 (causation was successfully established by risk ratio greater than two; per Lady Justice Smith: “Put in terms of risk, the occupational exposure had more than doubled the risk [of the bladder cancer complained of] due to smoking. . . . if the correct test for causation in a case such as this is the “but for” test and nothing less will do, that test is plainly satisfied on the facts as found. . . . In terms of risk, if the occupational exposure more than doubles the risk due to smoking, it must, as a matter of logic, be probable that the disease was caused by the former.”)

Watts v. Radiator Specialty Co., 990 So. 2d 143 (Miss. 2008) (“The threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0.”)

King v. Burlington Northern Santa Fe Ry, 762 N.W.2d 24, 36-37 (Neb. 2009) (reversing exclusion of proffered testimony of Arthur Frank on claim that diesel exposure caused multiple myeloma, and addressing in dicta the ability of expert witnesses to speculate reasons why specific causation exists even with relative risk less than two) (“If a study shows a relative risk of 2.0, ‘the agent is responsible for an equal number of cases of disease as all other background causes.’ This finding ‘implies a 50% likelihood that an exposed individual’s disease was caused by the agent.’ If the relative risk is greater than 2.0, the study shows a greater than 50–percent likelihood that the agent caused the disease.”)(internal citations to Reference Manual on Scientific Evidence (2d ed. 2000) omitted)

Henricksen v. Conocophillips Co., 605 F. Supp. 2d 1142, 1158 (E.D. Wash. 2009) (noting that under Circuit precedent, epidemiologic studies showing low-level risk may be sufficient to show general causation but are sufficient to show specific causation only if the relative risk exceeds two) (excluding plaintiff’s expert witness’s testimony because the epidemiologic evidence is “contradictory and inconsistent”)

City of San Antonio v. Pollock, 284 S.W.3d 809, 818 (Tex. 2009) (holding testimony admitted insufficient as matter of law)

George v. Vermont League of Cities and Towns, 2010 Vt. 1, 993 A.2d 367, 375 (2010)

Blanchard v. Goodyear Tire & Rubber Co., No. 837-12-07 Wrcv (Eaton, J., June 28, 2010) (excluding expert witness, David Goldsmith, and entering summary judgment), aff’d, 190 Vt. 577, 30 A.3d 1271 (2011)

Pritchard v. Dow Agro Sciences, 705 F. Supp. 2d 471, 486 (W.D. Pa. 2010) (excluding opinions of Dr. Omalu on Dursban, in part because of low relative risk) (“Therefore, a relative risk of 2.0 is not dispositive of the reliability of an expert’s opinion relying on an epidemiological study, but it is a factor, among others, which the Court is to consider in its evaluation.”), aff’d, 430 Fed. Appx. 102, 2011 WL 2160456 (3d Cir. 2011)

Faust v. BNSF Ry., 337 S.W.3d 325, 337 (Tex. Ct. App. 2d Dist. 2011) (“To be considered reliable scientific evidence of general causation, an epidemiological study must (1) have a relative risk of 2.0 and (2) be statistically significant at the 95% confidence level.”) (internal citations omitted)

Nonnon v. City of New York, 88 A.D.3d 384, 398-99, 932 N.Y.S.2d 428, 437-38 (1st Dep’t 2011) (holding that the strength of the epidemiologic evidence, with relative risks greater than 2.0, permitted an inference of causation)

Milward v. Acuity Specialty Products Group, Inc., 969 F. Supp. 2d 101, 112-13 & n.7 (D. Mass. 2013) (avoiding doubling of risk issue and holding that plaintiffs’ expert witnesses failed to rely upon a valid exposure estimate and lacked sufficient qualifications to evaluate and weigh the epidemiologic studies that provided estimates of relative risk) (generalities about the “core competencies” of physicians or specialty practices cannot overcome an expert witness’s explicit admission of lacking the epidemiologic expertise needed to evaluate and weigh the epidemiologic studies and methods at issue in the case. Without the requisite qualifications, an expert witness cannot show that the challenged opinion has a sufficiently reliable scientific foundation in epidemiologic studies and method.)

Berg v. Johnson & Johnson, 940 F.Supp.2d 983 (D.S.D. 2013) (talc and ovarian cancer)


Other

In re Hannaford Bros. Co. Customer Data Sec. Breach Litig., 293 F.R.D. 21, 2:08-MD-1954-DBH, 2013 WL 1182733, *1 (D. Me. Mar. 20, 2013) (Hornby, J.) (denying motion for class certification) (“population-based probability estimates do not speak to a probability of causation in any one case; the estimate of relative risk is a property of the studied population, not of an individual’s case.”)

Cherry Picking; Systematic Reviews; Weight of the Evidence

April 5th, 2015

In a paper prepared for one of Professor Margaret Berger’s symposia on law and science, Lisa Bero, a professor of clinical pharmacy in the University of California San Francisco’s School of Pharmacy, identified a major source of error in published reviews of putative health effects:

“The biased citation of studies in a review can be a major source of error in the results of the review. Authors of reviews can influence their conclusions by citing only studies that support their preconceived, desired outcome.”

Lisa Bero, “Evaluating Systematic Reviews and Meta-Analyses,” 14 J. L. & Policy 569, 576 (2006). Biased citation, consideration, and reliance are major sources of methodological error in courtroom proceedings as well. Sometimes astute judges recognize and bar expert witnesses who would pass off their opinions as well considered when they are propped up only by biased citation. Unfortunately, courts have been inconsistent, sometimes rewarding cherry picking of studies by admitting biased opinions[1], sometimes unhorsing the would-be expert witnesses by excluding their opinions[2].

Given that cherry picking or “biased citation” is recognized in the professional community as a rather serious methodological sin, judges may be astonished to learn that neither phrase, “cherry picking” nor “biased citation,” appears in the third edition of the Reference Manual on Scientific Evidence. Of course, the Manual could have dealt with the underlying issue of biased citation by affirmatively promoting the procedure of systematic reviews, but here again, the Manual falls short. There is no discussion of systematic review in the chapters on toxicology[3], epidemiology[4], or statistics[5]. Only the chapter on clinical medicine discusses the systematic review, briefly[6]. The absence of support for the procedures of systematic review, combined with the occasional cheerleading for “weight of the evidence,” in which expert witnesses subjectively include and weight studies to reach pre-ordained opinions, tends to undermine the reliability of the latest edition of the Manual[7].


[1] Spray-Rite Serv. Corp. v. Monsanto Co., 684 F.2d 1226, 1242 (7th Cir. 1982) (failure to consider factors identified by opposing side’s expert did not make testimony inadmissible).

[2] In re Zoloft, 26 F. Supp. 3d 449 (E.D. Pa. 2014) (excluding perinatal epidemiologist, Anick Bérard, for biased cherry picking of data points); In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div. Atlantic Cty. Feb. 20, 2015) (excluding opinions of Drs. Arthur Kornbluth and David Madigan because of their unjustified dismissal of studies that contradicted or undermined their opinions); In re Bextra & Celebrex Mktg. Sales Practices & Prods. Liab. Litig., 524 F.Supp. 2d 1166, 1175–76, 1179 (N.D.Cal.2007) (holding that expert witnesses may not ‘‘cherry-pick[ ]’’ observational studies to support a conclusion that is contradicted by randomized controlled trials, meta-analyses of such trials, and meta-analyses of observational studies; excluding expert witness who ‘‘ignores the vast majority of the evidence in favor of the few studies that support her conclusion’’); Grant v. Pharmavite, LLC, 452 F. Supp. 2d 903, 908 (D. Neb. 2006) (excluding expert witness opinion testimony that plaintiff’s use of black cohosh caused her autoimmune hepatitis) (“Dr. Corbett’s failure to adequately address the body of contrary epidemiological evidence weighs heavily against admission of his testimony.”); Downs v. Perstorp Components, Inc., 126 F. Supp. 2d 1090, 1124-29 (E.D. Tenn. 1999) (expert’s opinion raised seven “red flags” indicating that his testimony was litigation biased), aff’d, 2002 U.S. App. Lexis 382 (6th Cir. Jan. 4, 2002).

[3] Bernard D. Goldstein & Mary Sue Henifin, “Reference Guide on Toxicology,” in Reference Manual on Scientific Evidence 633 (3d ed. 2011).

[4] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 549 (3d ed. 2011).

[5] David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” in Reference Manual on Scientific Evidence 209 (3d ed. 2011).

[6] John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, “Reference Guide on Medical Testimony,” in Federal Judicial Center and National Research Council, Reference Manual on Scientific Evidence 687 (3d ed. 2011).

[7] See Margaret A. Berger, “The Admissibility of Expert Testimony,” in Reference Manual on Scientific Evidence 11, 20 & n.51 (3d ed. 2011) (posthumously citing Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11, 26 (1st Cir. 2011), with approval, for reversing exclusion of expert witnesses who advanced “weight of the evidence” opinions).

The Misbegotten Judicial Resistance to the Daubert Revolution

December 8th, 2013

David Bernstein is a Professor at the George Mason University School of Law.  Professor Bernstein has been writing about expert witness evidentiary issues for almost as long as I have been litigating them.  I have learned much from his academic writings on expert witness issues, which include his contributions to two important multi-authored texts, The New Wigmore: Expert Evidence (2d ed. 2010), and Phantom Risk: Scientific Inference and the Law (MIT Press 1993).

Bernstein’s draft article on the Daubert Counter-revolution, which some might call a surge by judicial reactionaries, has been available on the Social Science Research Network, and on his law school’s website. See “David Bernstein on the Daubert Counterrevolution” (April 19, 2013).  Professor Bernstein’s article has now been published in the current issue of the Notre Dame Law Review, and is available at its website. David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27 (2013).  This article might well replace the outdated chapter by the late Professor Berger in the latest edition of the Reference Manual on Scientific Evidence.


Manganese Meta-Analysis Further Undermines Reference Manual’s Toxicology Chapter

October 15th, 2012

Last October, when the ink was still wet on the Reference Manual on Scientific Evidence (3d ed. 2011), I dipped into the toxicology chapter only to find the treatment of a number of key issues to be partial and biased.  See “Toxicology for Judges – The New Reference Manual on Scientific Evidence” (Oct. 5, 2011).

The chapter, “Reference Guide on Toxicology,” was written by Professor Bernard D. Goldstein, of the University of Pittsburgh Graduate School of Public Health, and Mary Sue Henifin, a partner in the law firm of Buchanan Ingersoll, P.C.  In particular, I noted the authors’ conflicts of interest, both financial and ideological, which may have resulted in an incomplete and tendentious presentation of important concepts in the chapter.  Important concepts in toxicology, such as hormesis, were omitted completely from the chapter.  See, e.g., Mark P. Mattson and Edward J. Calabrese, eds., Hormesis: A Revolution in Biology, Toxicology and Medicine (N.Y. 2009); Curtis D. Klaassen, Casarett & Doull’s Toxicology: The Basic Science of Poisons 23 (7th ed. 2008) (“There is considerable evidence to suggest that some non-nutritional toxic substances may also impart beneficial or stimulatory effects at low doses but that, at higher doses, they produce adverse effects. This concept of “hormesis” was first described for radiation effects but may also pertain to most chemical responses.”)(internal citations omitted); Philip Wexler, et al., eds., 2 Encyclopedia of Toxicology 96 (2005) (“This type of dose–response relationship is observed in a phenomenon known as hormesis, with one explanation being that exposure to small amounts of a material can actually confer resistance to the agent before frank toxicity begins to appear following exposures to larger amounts.  However, analysis of the available mechanistic studies indicates that there is no single hormetic mechanism. In fact, there are numerous ways for biological systems to show hormetic-like biphasic dose–response relationship. Hormetic dose–response has emerged in recent years as a dose–response phenomenon of great interest in toxicology and risk assessment.”).

The financial conflicts are perhaps more readily appreciated.  Goldstein has testified in any number of so-called toxic tort cases, including several in which courts had excluded his testimony as being methodologically unreliable.  These cases are not cited in the Manual.  See, e.g., Parker v. Mobil Oil Corp., 7 N.Y.3d 434, 857 N.E.2d 1114, 824 N.Y.S.2d 584 (2006) (dismissing leukemia (AML) claim based upon claimed low-level benzene exposure from gasoline), aff’g 16 A.D.3d 648 (App. Div. 2d Dep’t 2005); Exxon Corp. v. Makofski, 116 S.W.3d 176 (Tex. App.–Houston [14th Dist.] 2003, pet. denied) (benzene and ALL claim).

One of the disappointments of the toxicology chapter was its failure to remain neutral in substantive disputes, unless of course it could document its position against adversarial claims.  Table 1 in the chapter presents, without documentation or citation, a “Sample of Selected Toxicological End Points and Examples of Agents of Concern in Humans.” Although many of the agent/disease outcome relationships in the table are well accepted, one was curiously unsupported at the time; namely, the claim that manganese causes Parkinson’s disease (PD).  Reference Manual at 653. This tendentious claim undermines the Manual’s attempt to remain disinterested in what was then an ongoing litigation effort.  Last year, I noted that Goldstein’s scholarship was questionable at the time of publication because PD is generally accepted to have no known cause.  Claims that manganese can cause PD had been addressed in several reviews. See, e.g., Karin Wirdefeldt, Hans-Olaf Adami, Philip Cole, Dimitrios Trichopoulos, and Jack Mandel, “Epidemiology and etiology of Parkinson’s disease: a review of the evidence,” 26 European J. Epidemiol. S1, S20-21 (2011); Tomas R. Guilarte, “Manganese and Parkinson’s Disease: A Critical Review and New Findings,” 118 Environ Health Perspect. 1071, 1078 (2010) (“The available evidence from human and nonhuman primate studies using behavioral, neuroimaging, neurochemical, and neuropathological end points provides strong support to the hypothesis that, although excess levels of [manganese] accumulation in the brain results in an atypical form of parkinsonism, this clinical outcome is not associated with the degeneration of nigrostriatal dopaminergic neurons as is the case in PD.”).

More recently, three neuro-epidemiologists have published a systematic review and meta-analysis of the available analytical epidemiologic studies.  What they found was an inverse association between welding, a trade that involves manganese fume exposure, and Parkinson’s disease. James Mortimer, Amy Borenstein, and Lorene Nelson, “Associations of welding and manganese exposure with Parkinson disease: Review and meta-analysis,” 79 Neurology 1174 (2012).

The summary figures from the published meta-analysis illustrate this inverse association.

The Fourth Edition should aim at a better integration of toxicology into the evolving science of human health effects.

Pin the Tail on the Significance Test

July 14th, 2012

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  An adequate understanding of significance probability requires the recognition that the tail probability, which represents the probability of a result at least as extreme as the result obtained if the null hypothesis is true, may be the area under one or both sides of the probability distribution curve.  Specifying an attained significance probability thus requires us to specify whether the p-value is one-sided or two-sided; that is, whether we have counted the results at least as extreme as the one observed in one direction only or in both directions.


Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expect courts and counsel to look to the Reference Manual for guidance.  On this topic, as with the notion of statistical significance itself, the Manual is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

“One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3d ed. 2011).  This advice is pragmatic but a bit misleading.  The reason for the two-tailed test is not really tied to multiple testing.  If there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple-testing artifacts.  Rather, when the hypothesis test specifies an alternative hypothesis that the parameter simply differs from the null value, extreme results both above and below the null count in favor of rejecting the null, and a two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it should not drive the choice between one-tailed and two-tailed tests.
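The coin-toss example quoted above is easy to verify numerically.  The short Python sketch below (assuming SciPy is available) computes the exact binomial tail probabilities for 532 heads in 1,000 tosses of a fair coin; the results match the chapter’s 2.3% one-tailed and 4.6% two-tailed figures.

```python
from scipy.stats import binom

n, k, p = 1000, 532, 0.5

# One-tailed p-value: probability of 532 or more heads if the coin is fair.
p_one_tail = binom.sf(k - 1, n, p)                  # P(X >= 532)

# Two-tailed p-value: add the symmetric lower tail, 468 or fewer heads.
p_two_tail = p_one_tail + binom.cdf(n - k, n, p)    # + P(X <= 468)

print(f"one-tailed p = {p_one_tail:.3f}")           # ~0.023
print(f"two-tailed p = {p_two_tail:.3f}")           # ~0.046
```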

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what the FDA does when deciding whether a clinical trial has established the efficacy of a new medication.  Does the agency license the medication if the sponsor’s trial comes close to 5%, or does it demand 5%, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10.  This statement is a helpful bulwark against the oft-repeated suggestion that any p-value threshold would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of statistical test, whether one- or two-sided, may be tied to prior beliefs and to the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direction opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary; there are times when the use of a one-tailed or a two-tailed test is preferable, but the chapter withholds its guidance.  The statement that a “one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed value will not necessarily yield a correct two-tailed measure for asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and on whether the p-value is less than 0.05, then courts must realize that there may be alternative approaches to calculating significance probability, such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n. 49.  The legal citations, however, are limited, and given the lack of sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.
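The asymmetry point can be illustrated with a small numerical example.  The sketch below uses a hypothetical binomial test (30 trials, null success probability 0.1, 7 observed successes; the numbers are chosen only for illustration) to show that doubling the one-tailed p-value does not reproduce an exact two-tailed p-value, and that the mid-p value gives yet another answer.

```python
from scipy.stats import binom

# Hypothetical example: n = 30 trials, null success probability 0.1, 7 successes observed.
n, p0, k = 30, 0.1, 7

# One-tailed (upper) p-value: P(X >= 7) under the null.
p_one = binom.sf(k - 1, n, p0)

# The "doubling" shortcut described in the regression chapter.
p_doubled = 2 * p_one

# An exact two-tailed p-value (minimum-likelihood method): sum the probabilities
# of all outcomes no more probable than the observed one.  For this skewed
# distribution the lower tail contributes nothing, so the exact two-tailed
# value equals the one-tailed value and is roughly half the doubled figure.
pk = binom.pmf(k, n, p0)
p_two_exact = sum(binom.pmf(x, n, p0) for x in range(n + 1)
                  if binom.pmf(x, n, p0) <= pk)

# Mid-p value: count only half the probability of the observed outcome.
p_mid = p_one - 0.5 * pk

print(f"one-tailed: {p_one:.4f}, doubled: {p_doubled:.4f}, "
      f"exact two-tailed: {p_two_exact:.4f}, mid-p: {p_mid:.4f}")
```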

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably, the authors of the epidemiology chapter declare that because “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease), a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n. 83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and its authors fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite a questionable case involving over-the-counter allergy and cold decongestant medications that contained phenylpropanolamine (PPA).  Id. citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation used one-tailed tests when submitted to the FDA, but was revised to use two-tailed tests when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA case thus represents a case in which the one-tailed test was used for regulatory purposes, but the two-tailed test was used for a scientific and clinical audience.

The other case cited by the epidemiology chapter was a district court’s review of an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use a one-tailed test in assessing whether second-hand smoke was a carcinogen).  The EPA is a federal agency in the “protection” business, not in the business of investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action is often based upon precautionary-principle judgments, which differ from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y. 1984) (“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir. 1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  Another of plaintiffs’ expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation out of hand.  Yet another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-sided testing as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.) (citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual‘s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States [433 U.S. 299 (1977)], a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view, Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).
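Finkelstein’s description of the cutoffs is easy to confirm.  A minimal sketch (assuming SciPy is available) shows that 5% of the standard normal distribution lies beyond roughly 1.64 standard errors in one tail, while about 1.96 standard errors are needed to leave a total of 5% in the two tails combined:

```python
from scipy.stats import norm

# One-tailed 5% critical value: the z-score beyond which 5% of the distribution lies.
z_one = norm.ppf(0.95)        # ~1.645

# Two-tailed 5% critical value: 2.5% in each tail.
z_two = norm.ppf(0.975)       # ~1.960

# Equivalently, a result 1.645 standard errors from expectation has a
# one-tailed p of about 5%, but a two-tailed p of about 10%.
p_one = norm.sf(1.645)        # ~0.05
p_two = 2 * norm.sf(1.645)    # ~0.10

print(z_one, z_two, p_one, p_two)
```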

The treatise, Modern Scientific Evidence, addresses some of the caselaw involving disputes over one- versus two-tailed tests.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tailed test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter & Gamble Co. v. Chesebrough-Pond’s Inc., 747 F. 2d 114 (2d Cir. 1984) (discussing one-tail versus two-tail tests in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879 (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998) (rejecting Shanna Swan’s effort to reinterpret study data by using a one-tail test of significance; ‘‘Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.’’)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)

Love is Blind but What About Judicial Gatekeeping of Expert Witnesses? – Viagra Part I

July 7th, 2012

The Viagra litigation over claimed vision loss vividly illustrates the difficulties that trial judges have in understanding and applying the concept of statistical significance.  In this MDL, plaintiffs sued for a specific form of vision loss, non-arteritic ischemic optic neuropathy (NAION), which they claimed was caused by their use of defendant’s medication, Viagra.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071 (D. Minn. 2008).  Plaintiffs’ key expert witness, Gerald McGwin, considered three epidemiologic studies; none found a statistically significant elevation of risk of NAION after Viagra use.  Id. at 1076.  The defense filed a Rule 702 motion to exclude McGwin’s testimony, based in part upon the lack of statistical significance of the risk ratios he relied upon for his causal opinion.  The trial court held that this lack did not render McGwin’s testimony unreliable and inadmissible.  Id. at 1090.

One of the three studies considered by McGwin was his own published paper.  G. McGwin, Jr., M. Vaphiades, T. Hall, C. Owsley, ‘‘Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,’’ 90 Br. J. Ophthalmol. 154 (2006) [“McGwin 2006”].  The MDL court noted that McGwin had stated that his paper reported an odds ratio (OR) of 1.75, with a 95% confidence interval (CI) of 0.48 to 6.30.  Id. at 1080.  The study also presented multiple subgroup analyses of men who reported Viagra use and who had a history of heart attack (OR = 10.7) or hypertension (OR = 6.9), but the MDL court did not provide p-values or confidence intervals for the subgroup results.

Curiously, Judge Magnuson eschewed the guidance of the Reference Manual on Scientific Evidence in dealing with sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.  The new third edition again conveys the same basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The point of the RMSE‘s guidance is, of course, that the standard error, or the confidence interval (C.I.) based upon a specified number of standard errors, is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.

The statistical data on which McGwin based his opinion were readily available from McGwin 2006:

“Overall, males with NAION were no more likely to report a history of Viagra … use compared to similarly aged controls (odds ratio (OR) 1.75, 95% confidence interval (CI) 0.48 to 6.30).  However, for those with a history of myocardial infarction, a statistically significant association was observed (OR 10.7, 95% CI 1.3 to 95.8). A similar association was observed for those with a history of hypertension though it lacked statistical significance (OR 6.9, 95% CI 0.8 to 63.6).”

McGwin 2006, at 154.  Following the RMSE‘s guidance would have assisted the MDL court in its gatekeeping responsibility in several distinct ways.  First, the court would have focused on how wide the 95% confidence intervals were.  The width of the intervals pointed to statistical imprecision and instability in the point estimates urged by McGwin.  Second, the MDL court would have confronted the extent to which there were multiple ad hoc subgroup analyses in McGwin’s paper.  See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002) (“It is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns.”)  Third, the court would have confronted the extent to which the study’s validity was undermined by several potent biases.  Statistical significance was the least of the problems faced by McGwin 2006.
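The imprecision of the overall estimate in McGwin 2006 can be gauged from the published numbers alone.  The following sketch, which assumes the reported interval is a conventional Wald-type interval on the log-odds scale, back-calculates the approximate standard error and two-sided p-value from the odds ratio of 1.75 and its 95% confidence interval of 0.48 to 6.30:

```python
import math
from scipy.stats import norm

# Reported in McGwin 2006: overall OR 1.75, 95% CI 0.48 to 6.30.
or_hat, lo, hi = 1.75, 0.48, 6.30

# A 95% Wald interval spans 2 * 1.96 standard errors on the log scale.
se_log_or = (math.log(hi) - math.log(lo)) / (2 * 1.96)

# Two-sided p-value for the null hypothesis that the true OR is 1.
z = math.log(or_hat) / se_log_or
p_two_sided = 2 * norm.sf(abs(z))

print(f"SE(log OR) ~ {se_log_or:.2f}")        # roughly 0.66
print(f"two-sided p ~ {p_two_sided:.2f}")     # roughly 0.4, nowhere near 0.05
```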

The second study considered and relied upon by McGwin was referred to as Margo & French.  McGwin cited this paper for an “elevated OR of 1.10,” id. at 1081, but again, had the court engaged with the actual evidence, it would have found that McGwin had cherry-picked the data he chose to emphasize.  The Margo & French study was a retrospective cohort study using the National Veterans Health Administration’s pharmacy and clinical databases.  C. Margo & D. French, ‘‘Ischemic optic neuropathy in male veterans prescribed phosphodiesterase-5 inhibitors,’’ 143 Am. J. Ophthalmol. 538 (2007).  There were two outcomes ascertained:  NAION and “possible” NAION.  The relative risk of NAION among men prescribed a PDE-5 inhibitor (the class to which Viagra belongs) was 1.02 (95% confidence interval [CI]: 0.92 to 1.12).  In other words, the Margo & French paper had very high statistical precision, and it reported essentially no increased risk at all.  Judge Magnuson cited uncritically McGwin’s endorsement of a risk ratio that included ‘‘possible’’ NAION cases, which could not bode well for a gatekeeping process that is supposed to protect against speculative evidence and conclusions.

McGwin’s citation of Margo & French for the proposition that men who had taken the PDE-5 inhibitors had a 10% increased risk was wrong on several counts.  First, he relied upon an outcome measure that included ‘‘possible’’ cases of NAION.  Second, he completely ignored the sampling error that is captured in the confidence interval.  The MDL court failed to note or acknowledge the p-value or confidence interval for any result in Margo & French. The consideration of random error was not an optional exercise for the expert witness or the court; nor was ignoring it a methodological choice that simply went to the ‘‘disagreement among experts.’’

The Viagra MDL court not only lost its way by ignoring the guidance of the RMSE; it also appeared to confuse the magnitude of the associations with the concept of statistical significance.  In the midst of the discussion of statistical significance, the court digressed to address the notion that the small relative risk in Margo & French might mean that no plaintiff could show specific causation, and then in the same paragraph returned to state that ‘‘persuasive authority’’ supported the notion that the lack of statistical significance did not detract from the reliability of a study.  Id. at 1081 (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., MDL No. 1407, 289 F.Supp.2d 1230, 1241 (W.D.Wash. 2003)).  The magnitude of the observed odds ratio is a concept independent of whether an odds ratio as extreme, or more extreme, would have occurred by chance if there were really no elevation of risk.

Citing one case, at odds with a great many others, however, did not create an epistemic warrant for ignoring the lack of statistical significance.  The entire notion of citing caselaw for the meaning and importance of statistical significance in drawing inferences is wrongheaded.  Even more to the point, the lack of statistical significance in the key study in the PPA litigation did not detract from the reliability of the study, although other features of that study certainly did.  The lack of statistical significance in the PPA study did, however, detract from the reliability of the inference from the study’s estimate of ‘‘effect size’’ to a conclusion of causal association.  Indeed, nowhere in the key PPA study did its authors draw a causal conclusion with respect to PPA ingestion and hemorrhagic stroke.  See Walter Kernan, Catherine Viscoli, Lawrence Brass, Joseph Broderick, Thomas Brott, Edward Feldmann, Lewis Morgenstern, Janet Lee Wilterdink, and Ralph Horwitz, ‘‘Phenylpropanolamine and the Risk of Hemorrhagic Stroke,’’ 343 New England J. Med. 1826 (2000).

The MDL court did attempt to distinguish the Eighth Circuit’s decision in Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001), cited by the defense:

‘‘[I]n Glastetter … expert evidence was excluded because ‘rechallenge and dechallenge data’ presented statistically insignificant results and because the data involved conditions ‘quite distinct’ from the conditions at issue in the case. Here, epidemiologic data is at issue and the studies’ conditions are not distinct from the conditions present in the case. The Court does not find Glastetter to be controlling.’’

Id. at 1081 (internal citations omitted; emphasis in original).  This reading of Glastetter, however, misses important features of that case and the Parlodel litigation more generally.  First, the Eighth Circuit commented not only upon the rechallenge-dechallenge data, which involved arterial spasms, but upon an epidemiologic study of stroke, from which Ms. Glastetter suffered.  The Glastetter court did not review the epidemiologic evidence itself, but cited to another court, which did discuss and criticize the study for various ‘‘statistical and conceptual flaws.’’  See Glastetter, 252 F.3d at 992 (citing Siharath v. Sandoz Pharms.Corp., 131 F.Supp. 2d 1347, 1356-59 (N.D.Ga.2001)).  Glastetter was binding authority, and not so easily dismissed and distinguished.

The Viagra MDL court ultimately placed its holding upon the facts that:

‘‘the McGwin et al. and Margo et al. studies were peer-reviewed, published, contain known rates of error, and result from generally accepted epidemiologic research.’’

In re Viagra, 572 F. Supp. 2d at 1081 (citations omitted).  This holding was a judicial ipse dixit substituting for the expert witness’s ipse dixit.  There were no known rates of error for the systematic errors in the McGwin study, and the ‘‘known’’ rates of error for random error in McGwin 2006  were intolerably high.  The MDL court never considered any of the error rates, systematic or random, for the Margo & French study.  The court appeared to have abdicated its gatekeeping responsibility by delegating it to unknown peer reviewers, who never considered whether the studies at issue in isolation or together could support a causal health claim.

With respect to the last of the three studies considered, the Gorkin study, McGwin opined that it was  too small, and the data were not suited to assessing temporal relationship.  Id.  The court did not appear inclined to go beyond McGwin’s ipse dixit.  The Gorkin study was hardly small, in that it was based upon more than 35,000 patient-years of observation in epidemiologic studies and clinical trials, and provided an estimate of incidence for NAION among users of Viagra that was not statistically different from the general U.S. population.  See L. Gorkin, K. Hvidsten, R. Sobel, and R. Siegel, ‘‘Sildenafil citrate use and the incidence of nonarteritic anterior ischemic optic neuropathy,’’ 60 Internat’l J. Clin. Pract. 500, 500 (2006).

Judge Magnuson did proceed, in his 2008 opinion, to exclude all the other expert witnesses put forward by the plaintiffs.  McGwin survived the defendant’s Rule 702 challenge, largely because the court refused to consider the substantial random variability in the point estimates from the studies relied upon by McGwin. There was no consideration of the magnitude of random error, or for that matter, of the systematic error in McGwin’s study.  The MDL court found that the studies upon which McGwin relied had a known and presumably acceptable ‘‘rate of error.’’  In fact, the court did not consider the random or sampling error in any of the three cited studies; it failed to consider the multiple testing and interaction; and it failed to consider the actual and potential biases in the McGwin study.

Some legal commentators have argued that statistical significance should not be a litmus test.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 241 (‘‘Statistical significance should not be a litmus test. However, there are many situations where the lack of significance combined with other aspects of the research should be enough to exclude an expert’s testimony.’’)  While I agree that significance probability should not be evaluated in a mechanical fashion, without consideration of study validity, multiple testing, bias, confounding, and the like, hand waving about litmus tests does not excuse courts or commentators who totally ignore random variability in studies based upon population sampling.  The dataset in the Viagra litigation was not a close call.

Let’s Require Health Claims to Be Evidence Based

June 28th, 2012

Litigation arising from the FDA’s refusal to approve “health claims” for foods and dietary supplements is a fertile area for disputes over the interpretation of statistical evidence.  A ‘‘health claim’’ is ‘‘any claim made on the label or in labeling of a food, including a dietary supplement, that expressly or by implication … characterizes the relationship of any substance to a disease or health-related condition.’’ 21 C.F.R. § 101.14(a)(1); see also 21 U.S.C. § 343(r)(1)(A)-(B).

Unlike the federal courts exercising their gatekeeping responsibility, the FDA has committed to pre-specified principles of interpretation and evaluation. By regulation, the FDA gives notice of standards for evaluating complex evidentiary displays for the ‘‘significant scientific agreement’’ required for approving a food or dietary supplement health claim.  21 C.F.R. § 101.14.  See FDA – Guidance for Industry: Evidence-Based Review System for the Scientific Evaluation of Health Claims – Final (2009).

If the FDA’s refusal to approve a health claim requires pre-specified criteria of evaluation, then we should ask why the federal courts have failed to develop a set of criteria for evaluating health-effects claims as part of their Rule 702 (“Daubert“) gatekeeping responsibilities.  Why, close to 20 years after the Supreme Court decided Daubert, can lawyers make “health claims” without having to satisfy evidence-based criteria?

Although the FDA’s guidance is not always as precise as might be hoped, it is far better than the suggestion of the new Reference Manual on Scientific Evidence (3d ed. 2011) that there is no hierarchy of evidence.  See RMSE 3d at 564 & n.48 (citing and quoting an idiosyncratic symposium paper asserting that “[t]here should be no hierarchy [among different types of scientific methods to determine cancer causation]”); “Late Professor Berger’s Introduction to the Reference Manual on Scientific Evidence” (Oct. 23, 2011).

The FDA’s attempt to articulate an evidence-based hierarchy is noteworthy because the agency must evaluate a wide range of evidence, from in vitro studies, to animal studies, to observational studies of varying kinds, to clinical trials, to meta-analyses and reviews.  The FDA’s criteria are a good start, and I imagine that they will develop and improve over time.  Although imperfect, the criteria are light years ahead of the situation in federal and state court gatekeeping.  Unlike gatekeeping in civil actions, the FDA criteria are pre-stated and not devised post hoc.  The FDA’s attempt to implement evidence-based principles in the evaluation of health claims is a model that would much improve the Reference Manual on Scientific Evidence.  See Christopher Guzelian & Philip Guzelian, “Prevention of false scientific speech: a new role for an evidence-based approach,” 27 Human & Experimental Toxicol. 733 (2008).

The FDA’s evidence-based criteria need work in some areas.  For instance, the FDA’s Guidance on meta-analysis is not particularly specific or helpful:

Research Synthesis Studies

“Reports that discuss a number of different studies, such as review articles, do not provide sufficient information on the individual studies reviewed for FDA to determine critical elements such as the study population characteristics and the composition of the products used. Similarly, the lack of detailed information on studies summarized in review articles prevents FDA from determining whether the studies are flawed in critical elements such as design, conduct of studies, and data analysis. FDA must be able to review the critical elements of a study to determine whether any scientific conclusions can be drawn from it. Therefore, FDA intends to use review articles and similar publications to identify reports of additional studies that may be useful to the health claim review and as background about the substance/disease relationship. If additional studies are identified, the agency intends to evaluate them individually. Most meta-analyses, because they lack detailed information on the studies summarized, will only be used to identify reports of additional studies that may be useful to the health claim review and as background about the substance-disease relationship.  FDA, however, intends to consider as part of its health claim review process a meta-analysis that reviews all the publicly available studies on the substance/disease relationship. The reviewed studies should be consistent with the critical elements, quality and other factors set out in this guidance and the statistical analyses adequately conducted.”

FDA – Guidance for Industry: Evidence-Based Review System for the Scientific Evaluation of Health Claims – Final at 10 (2009).

The dismissal of review articles as a secondary source is welcome, but meta-analyses are quantitative reviews that can add additional insights and evidence, if methodologically appropriate, by providing a summary estimate of association, sensitivity analyses, meta-regression, etc.  The FDA’s guidance was applied in connection with the agency’s refusal to approve a health claim for vitamin C and lung cancer.  Proponents claimed that a particular meta-analysis supported their health claim, but the FDA disagreed.  The proponents sought injunctive relief in federal district court, which upheld the FDA’s decision on vitamin C and lung cancer.  Alliance for Natural Health US v. Sebelius, 786 F.Supp. 2d 1, 21 (D.D.C. 2011).  The district court found that the FDA’s refusal to approve the health claim was neither arbitrary nor capricious with respect to its evaluation of the cited meta-analysis:

‘‘The FDA discounted the Cho study because it was a ‘meta-analysis’ of studies reflected in a review article. FDA Decision at 2523. As explained in the 2009 Guidance Document, ‘research synthesis studies’, and ‘review articles’, including ‘most meta-analyses’, ‘do not provide sufficient information on the individual studies reviewed’ to determine critical elements of the studies and whether those elements were flawed. 2009 Guidance Document at A.R. 2432. The Guidance Document makes an exception for meta-analyses ‘that review[ ] all the publicly available studies on the substance/disease relationship’. Id. Based on the Court’s review of the Cho article, the FDA’s decision to exclude this article as a meta-analysis was not arbitrary and capricious.’’

Id. at 19.

The FDA’s Guidance was adequate for its task in the vitamin C/lung cancer health claim, but notably absent from the Guidance are any criteria to evaluate competing meta-analyses that do include “all the publicly available studies on the substance/disease relationship.”  The model assumptions of meta-analyses, such as fixed-effect versus random-effects models and the assessment of heterogeneity, as well as other considerations, will need to be spelled out in advance.  Still, not a bad start.  Implementing evidence-based criteria in Rule 702 gatekeeping has the potential to tame the gatekeeper’s discretion.

Meta-Meta-Analysis — The Gadolinium MDL — More Than Ix’se Dixit

June 8th, 2012

There is a tendency, for better or worse, for legal bloggers to be partisan cheerleaders over litigation outcomes.  I admit that most often I am dismayed by judicial failures or refusals to exclude dubious plaintiffs’ expert witnesses’ opinion testimony, and I have been known to criticize such decisions.  Indeed, I wouldn’t mind seeing courts exclude dubious defendants’ expert witnesses.  I have written approvingly about cases in which judges have courageously engaged with difficult scientific issues, seen through the smoke screen, and properly assessed the validity of the opinions expressed.  The Gadolinium MDL (No. 1909) Daubert motions and decision offer a fascinating case study of a challenge to an expert witness’s meta-analysis, an effective defense of the meta-analysis, and a judicial decision to admit the testimony based upon the meta-analysis.  In re Gadolinium-Based Contrast Agents Prods. Liab. Litig., 2010 WL 1796334 (N.D. Ohio May 4, 2010) [hereafter Gadolinium], reconsideration denied, 2010 WL 5173568 (June 18, 2010).

Plaintiffs proffered general causation opinions, on the claimed association between gadolinium contrast media and Nephrogenic Systemic Fibrosis (“NSF”), by a nephrologist, Joachim H. Ix, M.D., who has training in epidemiology.  Dr. Ix’s opinions were based in large part upon a meta-analysis he conducted on data in published observational studies.  Judge Dan Aaron Polster, the MDL judge, itemized the defendant’s challenges to Dr. Ix’s proposed testimony:

“The previously-used procedures GEHC takes issue with are:

(1) the failure to consult with experts about which studies to include;

(2) the failure to independently verify which studies to select for the meta-analysis;

(3) using retrospective and non-randomized studies;

(4) relying on studies with wide confidence intervals; and

(5) using a “more likely than not” standard for causation that would not pass scientific scrutiny.”

Gadolinium at *23.  Judge Polster confidently dispatched these challenges.  Dr. Ix, as a nephrologist, had subject-matter expertise with which to develop inclusionary and exclusionary criteria on his own.  The defendant never articulated what, if any, studies were inappropriately included or excluded.  The complaint that Dr. Ix had used retrospective and non-randomized studies also rang hollow in the absence of any showing that there were randomized clinical trials with pertinent data at hand.  Once a serious concern of nephrotoxicity arose, clinical trials were unethical, and the defendant never explained why observational studies were somehow inappropriate for inclusion in a meta-analysis.

Relying upon studies with wide confidence intervals can be problematic, but that is one of the reasons to conduct a meta-analysis, assuming the model assumptions for the meta-analysis can be verified.  The plaintiffs effectively relied upon a published meta-analysis, which pre-dated their expert witness’s litigation effort, in which the authors used less conservative inclusionary criteria, and reported a statistically significant summary estimate of risk, with an even wider confidence interval.  R. Agarwal, et al., ” Gadolinium-based contrast agents and nephrogenic systemic fibrosis: a systematic review and meta-analysis,” 24 Nephrol. Dialysis & Transplantation 856 (2009).  As the plaintiffs noted in their opposition to the challenge to Dr. Ix:

“Furthermore, while GEHC criticizes Dr. Ix’s CI from his meta-analysis as being “wide” at (5.18864 and 25.326) it fails to share with the court that the peer-reviewed Agarwal meta-analysis, reported a wider CI of (10.27–69.44)… .”

Plaintiff’s Opposition to GE Healthcare’s Motion to Exclude the Opinion Testimony of Joachim Ix at 28 (Mar. 12, 2010)[hereafter Opposition].

Wider confidence intervals certainly suggest greater levels of random error, but Dr. Ix’s intervals suggested statistical significance, and he had carefully considered statistical heterogeneity.  Opposition at 19. (Heterogeneity was never advanced by the defense as an attack on Dr. Ix’s meta-analysis).  Remarkably, the defendant never advanced a sensitivity analysis to suggest or to show that reasonable changes to the evidentiary dataset could result in loss of statistical significance, as might be expected from the large intervals.  Rather, the defendant relied upon the fact that Dr. Ix had published other meta-analyses in which the confidence interval was much narrower, and then claimed that he had “required” these narrower confidence intervals for his professional, published research.  Memorandum of Law of GE Healthcare’s Motion to Exclude Certain Testimony of Plaintiffs’ Generic Expert, Joachim H. Ix, MD, MAS, In re Gadolinium MDL No. 1909, Case: 1:08-gd-50000-DAP  Doc #: 668   (Filed Feb. 12, 2010)[hereafter Challenge].  There never was, however, a showing that narrower intervals were required for publication, and the existence of the published Agarwal meta-analysis contradicted the suggestion.
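A sensitivity analysis of the sort the defense never offered is not difficult to sketch.  The fragment below pools study-level odds ratios by fixed-effect, inverse-variance weighting on the log scale, and then recomputes the pooled estimate leaving each study out in turn.  The study data shown are hypothetical placeholders, not the five studies Dr. Ix actually pooled.

```python
import math

# Hypothetical study-level odds ratios with 95% CIs (placeholders for illustration only).
studies = [(12.0, 2.0, 72.0), (8.0, 1.5, 43.0), (15.0, 2.5, 90.0),
           (6.0, 0.9, 40.0), (20.0, 3.0, 133.0)]

def pool(data):
    """Fixed-effect, inverse-variance pooling on the log-odds scale."""
    logs = [math.log(or_) for or_, lo, hi in data]
    ses = [(math.log(hi) - math.log(lo)) / (2 * 1.96) for or_, lo, hi in data]
    weights = [1 / se ** 2 for se in ses]
    pooled_log = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - 1.96 * pooled_se),
            math.exp(pooled_log + 1.96 * pooled_se))

print("all studies:", pool(studies))

# Leave-one-out sensitivity analysis: does omitting any single study
# move the lower confidence bound across 1.0?
for i in range(len(studies)):
    subset = studies[:i] + studies[i + 1:]
    print(f"without study {i + 1}:", pool(subset))
```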

Interestingly, the defense did not call attention to Dr. Ix’s providing an incorrect definition of the confidence interval!  Here is how Dr. Ix described the confidence interval, in language quoted by plaintiffs in their Opposition:

“The horizontal lines display the “95% confidence interval” around this estimate. This 95% confidence interval reflects the range of odds ratios that would be observed 95 times if the study was repeated 100 times, thus the narrower these confidence intervals, the more precise the estimate.”

Opposition at 20.  The confidence interval does not provide a probability distribution for the parameter of interest; rather, over repeated sampling, the intervals generated by the procedure cover the true value of the parameter with the stated frequency.
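The coverage interpretation is easy to demonstrate by simulation.  In the sketch below, the true proportion is fixed at an arbitrary hypothetical value; what varies from sample to sample is the interval, and roughly 95% of the intervals constructed this way end up covering the fixed true value.

```python
import math
import random

random.seed(1)

true_p = 0.30      # the fixed parameter, unknown in practice
n = 500            # observations per sample
trials = 5_000     # number of repeated samples

covered = 0
for _ in range(trials):
    x = sum(random.random() < true_p for _ in range(n))
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - 1.96 * se <= true_p <= p_hat + 1.96 * se:
        covered += 1

# Coverage: the long-run proportion of intervals that contain the true value.
print(f"coverage ~ {covered / trials:.3f}")   # close to 0.95
```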

Finally, the defendant never showed any basis for suggesting that a scientific opinion on causation requires something more than a “more likely than not” basis.

Judge Polster also addressed some more serious challenges:

“Defendants contend that Dr. Ix’s testimony should also be excluded because the methodology he utilized for his generic expert report, along with varying from his normal practice, was unreliable. Specifically, Defendants assert that:

(1) Dr. Ix could not identify a source he relied upon to conduct his meta-analysis;

(2) Dr. Ix imputed data into the study;

(3) Dr. Ix failed to consider studies not reporting an association between GBCAs and NSF; and

(4) Dr. Ix ignored confounding factors.”

Gadolinium at *24

IMPUTATION

The first point, above – the alleged failure to identify a source for conducting the meta-analysis – rings fairly hollow, and Judge Polster easily deflected it.  The second point raised a more interesting challenge.  In the words of defense counsel:

“However, in arriving at this estimate, Dr. Ix imputed, i.e., added, data into four of the five studies.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  Specifically, Dr. Ix added a single case of NSF without antecedent GBCA exposure to the patient data in the underlying studies.

* * *

During his deposition, Dr. Ix could not provide any authority for his decision to impute the additional data into his litigation meta-analysis.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  When pressed for any authority supporting his decision, Dr. Ix quipped that ‘this may be a good question to ask a Ph.D level biostatistician about whether there are methods to [calculate an odds ratio] without imputing a case [of NSF without antecedent GBCA exposure]’.”

Challenge at 12-13.

The deposition reference suggests that the examiner had scored a debating point by catching Dr. Ix unprepared, but by the time the parties briefed the challenge, the plaintiffs had the issue well in hand, citing A. W. F. Edwards, “The Measure of Association in a 2 × 2 Table,” 126 J. Royal Stat. Soc. Series A 109 (1963); R.L. Plackett, “The Continuity Correction in 2 x 2 Tables,” 51 Biometrika 327 (1964).  Opposition at 36 (describing the process of imputation in the event of zero counts in the cells of a 2 x 2 table for odds ratios).  There are qualms to be stated about imputation, but the defense failed to make them.  As a result, the challenge overall lost momentum and credibility.  As the trial court stated the matter:

“Next, there is no dispute that Dr. Ix imputed data into his meta-analysis. However, as Defendants acknowledge, there are valid scientific reasons to impute data into a study. Here, Dr. Ix had a valid basis for imputing data. As explained by Plaintiffs, Dr. Ix’s imputed data is an acceptable technique for avoiding the calculation of an infinite odds ratio that does not accurately measure association.7 Moreover, Dr. Ix chose the most conservative of the widely accepted approaches for imputing data.8 Therefore, Dr. Ix’s decision to impute data does not call into question the reliability of his meta-analysis.”

Gadolinium at *24.
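For readers who want to see the mechanics, here is a minimal sketch of the conventional Haldane-Anscombe correction: adding 0.5 to each cell of a 2 × 2 table that contains a zero, so that a finite odds ratio and confidence interval can be computed.  The cell counts are hypothetical and are not Dr. Ix’s data.

```python
import math

# Hypothetical case-control 2 x 2 table with a zero cell:
#              exposed   unexposed
#   cases           12           0
#   controls         8          40
a, b, c, d = 12, 0, 8, 40

# Haldane-Anscombe correction: add 0.5 to every cell so the odds ratio is finite.
if 0 in (a, b, c, d):
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```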

FAILURE TO CONSIDER NULL STUDIES

The defense’s challenge included a claim that Dr. Ix had arbitrarily excluded studies in which there was no reported incidence of NSF.  The defense brief unfortunately does not describe the studies excluded, or what, if any, effect their inclusion in the meta-analysis would have had.  This was, after all, the crucial issue.  The abstract nature of the defense claim left the matter ripe for misrepresentation by the plaintiffs:

“GEHC continues to misunderstand the role of a meta-analysis and the need for studies that included patients both that did or did not receive GBCAs and reported on the incidence of NSF, despite Dr. Ix’s clear elucidation during his deposition. (Ix Depo. TR [Exh.1] at 97-98).  Meta-analyses such as performed by Dr. Ix and Dr. Agarwal search for whether or not there is a statistically valid association between exposure and disease event. In order to ascertain the relationship between the exposure and event one must have an event to evaluate. In other words, if you have a study in which the exposed group consists of 10,000 people that are exposed to GBCAs and none develop NSF, compared to a non-exposed group of 10,000 who were not exposed to GBCAs and did not develop NSF, the study provides no information about the association between GBCAs and NSF or the relative risk of developing NSF.”

Opposition at 37-38 (emphasis in original).  What is fascinating about this particular challenge, and the plaintiffs’ response, is the methodological hypocrisy exhibited.  In essence, the plaintiffs argued that imputation was appropriate in a case-control study in which one cell contained a zero, but that cohort studies reporting no cases could simply be ignored.  To be sure, case-control studies are more efficient than cohort studies for identifying and assessing risk ratios for rare outcomes.  Nevertheless, the plaintiffs could easily have been hoist with their own hypothetical petard.  No one among 10,000 gadolinium-exposed patients developed NSF; and no one in the control group did either.  The hypothetical study suggests that the rate of NSF is low and no different in the exposed and unexposed patients.  A risk ratio could be obtained by imputing a small value for the cells containing zero, and a confidence interval calculated.  The risk ratio, of course, would be 1.0.
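The arithmetic of that hypothetical is simple to confirm; a sketch, again using the conventional 0.5 imputation for the empty cells:

```python
# Plaintiffs' hypothetical: 10,000 exposed and 10,000 unexposed, no NSF in either group.
exposed_cases, exposed_total = 0, 10_000
unexposed_cases, unexposed_total = 0, 10_000

# Impute 0.5 for the zero event counts so a risk ratio can be computed.
risk_exposed = (exposed_cases + 0.5) / exposed_total
risk_unexposed = (unexposed_cases + 0.5) / unexposed_total

print(risk_exposed / risk_unexposed)   # 1.0: the data suggest no difference in risk
```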

Unfortunately, the defense did not make this argument; nor did it explore where the meta-analysis might have come out had Dr. Ix applied a more even-handed methodology.  The gap allowed the trial court to brush the challenge aside:

“The failure to consider studies not reporting an association between GBCAs and NSF also does not render Dr. Ix’s meta-analysis unreliable. The purpose of Dr. Ix’s meta-analysis was to study the strength of the association between an exposure (receiving GBCA) and an outcome (development of NSF). In order to properly do this, Dr. Ix necessarily needed to examine studies where the exposed group developed NSF.”

Gadolinium at *24.  Judge Polster, with no help from the defense brief, missed the irony of Dr. Ix’s willingness to impute data in the case-control 2 x 2 contingency tables, but not in the relative risk tables.

CONFOUNDING

Defendants complained that Dr. Ix had ignored the possibility that confounding factors had contributed to the development of NSF.  Challenge at 13.  Defendants went so far as to charge Dr. Ix with misleading the court by failing to consider other possible causative exposures or conditions.  Id.

Defendants never identified the existence, source, or likely magnitude of any confounding factor.  As a result, the trial court enthusiastically embraced, virtually verbatim from the plaintiffs’ Opposition (at 14), the plaintiffs’ argument, based upon the Reference Manual, that confounding was an unlikely explanation for a very large risk ratio:

“Finally, the Court rejects Defendants’ argument that Dr. Ix failed to consider confounding factors. Plaintiffs argued and Defendants did not dispute that, applying the Bradford Hill criteria, Dr. Ix calculated a pooled odds ratio of 11.46 for the five studies examined, which is higher than the 10 to 1 odds ratio of smoking and lung cancer that the Reference Manual on Scientific Evidence deemed to be “so high that it is extremely difficult to imagine any bias or confounding factor that may account for it.” Id. at 376.  Thus, from Dr. Ix’s perspective, the odds ratio was so high that a confounding factor was improbable. Additionally, in his deposition, Dr. Ix acknowledged that the cofactors that have been suggested are difficult to confirm and therefore he did not try to specifically quantify them. (Doc # : 772-20, at 27.) This acknowledgement of cofactors is essentially equivalent to the Agarwal article’s representation that “[t]here may have been unmeasured variables in the studies confounding the relationship between GBCAs and NSF,” cited by Defendants as a representative model for properly considering confounding factors. (See Doc # : 772, at 4-5.)”

Gadolinium at *24.

The real problem is that the defendants’ challenge pointed only to possible, unidentified causal agents.  The smoking/lung cancer analogy, provided by the Reference Manual, was inapposite.  Smoking is indeed a large risk factor for lung cancer, with relative risks over 20.  Although there are other human lung carcinogens, none is consistently in the same order of magnitude (not even asbestos), and as a result, confounding can generally be excluded as an explanation for the large risk ratios seen in smoking studies.  It is easy to imagine that there are confounders for NSF, especially given that the disease has been identified only relatively recently, and that they might be of the same or greater magnitude than the risk suggested for the gadolinium contrast media.  The defense, however, failed to identify confounders that actually threatened the validity of any of the individual studies, or of the meta-analysis.

CONCLUSION

The defense hinted at the general unreliability of meta-analysis, with references to the Reference Manual on Scientific Evidence at 381 (2d ed. 2000)(noting problems with meta-analysis), and other, relatively dated papers.  See, e.g., John Bailar, “Assessing Assessments,” 277 Science 529 (1997)(arguing that “problems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with [meta-analysis].”).  The Reference Manual language, carried over into the third edition, is out of date, and represents a failing of the new edition.  See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011).

The plaintiffs came forward with some descriptive statistics on the prevalence of meta-analysis in the contemporary biomedical literature.  The defendants gave mostly argument; there is a dearth of citation to defense expert witnesses, affidavits, consensus papers on meta-analysis, textbooks, papers by leading authors, and the like.  The defense challenge suffered from being diffuse and unfocused; it lost persuasiveness by including weak, collateral issues, such as the claims that Dr. Ix was opining “only” on a “more likely than not” basis, that he had not consulted with other experts, and that he had failed to use randomized trial data.  The defense was quick to attack perceived deficiencies, but it did not show how or why the alleged deficiencies threatened the validity of Dr. Ix’s meta-analysis.  Indeed, even when the defense made strong points, such as the exclusion of zero-event cohort studies, it failed to document that such studies existed, and that their inclusion might have made a difference.

 

WOE-fully Inadequate Methodology – An Ipse Dixit By Another Name

May 1st, 2012

Take all the evidence, throw it into the hopper, close your eyes, open your heart, and guess the weight.  You could be a lucky winner!  The weight of the evidence suggests that the weight-of-the-evidence (WOE) method is little more than subjective opinion, but why care if it helps you to get to a verdict?

The scientific community has never been seriously impressed by the so-called weight of the evidence (WOE) approach to determining causality.  The phrase is vague and ambiguous; its use, inconsistent. See, e.g., V. H. Dale, G.R. Biddinger, M.C. Newman, J.T. Oris, G.W. Suter II, T. Thompson, et al., “Enhancing the ecological risk assessment process,” 4 Integrated Envt’l Assess. Management 306 (2008)(“An approach to interpreting lines of evidence and weight of evidence is critically needed for complex assessments, and it would be useful to develop case studies and/or standards of practice for interpreting lines of evidence.”); Igor Linkov, Drew Loney, Susan M. Cormier, F. Kyle Satterstrom, Todd Bridges, “Weight-of-evidence evaluation in environmental assessment: review of qualitative and quantitative approaches,” 407 Science of Total Env’t 5199–205 (2009); Douglas L. Weed, “Weight of Evidence: A Review of Concept and Methods,” 25 Risk Analysis 1545 (2005)(noting the vague, ambiguous, indefinite nature of the concept of “weight of evidence” review); R.G. Stahl Jr., “Issues addressed and unaddressed in EPA’s ecological risk guidelines,” 17 Risk Policy Report 35 (1998)(noting that U.S. Environmental Protection Agency’s guidelines for ecological weight-of-evidence approaches to risk assessment fail to provide guidance); Glenn W. Suter II, Susan M. Cormier, “Why and how to combine evidence in environmental assessments:  Weighing evidence and building cases,” 409 Science of the Total Environment 1406, 1406 (2011)(noting arbitrariness and subjectivity of WOE “methodology”).

 

General Electric v. Joiner

Most savvy judges quickly figured out that weight of the evidence (WOE) was suspect methodology, woefully lacking, and indeed, not really a methodology at all.

The WOE method was part of the hand waving in Joiner by plaintiffs’ expert witnesses, including the frequent testifier Rabbi Teitelbaum.  The majority recognized that Rabbi Teitelbaum’s WOE weighed in at less than a peppercorn, and affirmed the district court’s exclusion of his opinions.  The Joiner Court’s assessment provoked a dissent from Justice Stevens, who was troubled by the Court’s undressing of the WOE methodology:

“Dr. Daniel Teitelbaum elaborated on that approach in his deposition testimony: ‘[A]s a toxicologist when I look at a study, I am going to require that that study meet the general criteria for methodology and statistical analysis, but that when all of that data is collected and you ask me as a patient, Doctor, have I got a risk of getting cancer from this? That those studies don’t answer the question, that I have to put them all together in my mind and look at them in relation to everything I know about the substance and everything I know about the exposure and come to a conclusion. I think when I say, “To a reasonable medical probability as a medical toxicologist, this substance was a contributing cause,” … to his cancer, that that is a valid conclusion based on the totality of the evidence presented to me. And I think that that is an appropriate thing for a toxicologist to do, and it has been the basis of diagnosis for several hundred years, anyway’.

* * * *

Unlike the District Court, the Court of Appeals expressly decided that a ‘weight of the evidence’ methodology was scientifically acceptable. To this extent, the Court of Appeals’ opinion is persuasive. It is not intrinsically “unscientific” for experienced professionals to arrive at a conclusion by weighing all available scientific evidence—this is not the sort of ‘junk science’ with which Daubert was concerned. After all, as Joiner points out, the Environmental Protection Agency (EPA) uses the same methodology to assess risks, albeit using a somewhat different threshold than that required in a trial.  Petitioners’ own experts used the same scientific approach as well. And using this methodology, it would seem that an expert could reasonably have concluded that the study of workers at an Italian capacitor plant, coupled with data from Monsanto’s study and other studies, raises an inference that PCB’s promote lung cancer.”

General Electric v. Joiner, 522 U.S. 136, 152-54 (1997)(Stevens, J., dissenting)(internal citations omitted)(confusing critical assessment of studies with WOE; and quoting Rabbi Teitelbaum’s attempt to conflate diagnosis with etiological attribution).  Justice Stevens could reach his assessment only by ignoring the serious lack of internal and external validity in the studies relied upon by Rabbi Teitelbaum.  Those studies did not support his opinion individually or collectively.

Justice Stevens was wrong as well about the claimed scientific adequacy of WOE.  Courts have long understood that precautionary, preventive judgments of regulatory agencies are different from scientific conclusions that are admissible in civil and criminal litigation.  See Allen v. Pennsylvania Engineering Corp., 102 F.3d 194 (5th Cir. 1996)(WOE, although suitable for regulatory risk assessment, is not appropriate in civil litigation).  Justice Stevens’ characterization of WOE was little more than judicial ipse dixit, and it was, in any event, not the law; it was the argument of a dissenter.

 

Milward v. Acuity Specialty Products

Admittedly, dissents can sometimes help lower court judges chart a path of evasion and avoidance of a higher court’s holding.  In Milward, Justice Stevens’ mischaracterization of WOE and scientific method was adopted as the legal standard for expert witness testimony by a panel of the United States Court of Appeals for the First Circuit.  Milward v. Acuity Specialty Products Group, Inc., 664 F.Supp. 2d 137 (D. Mass. 2009), rev’d, 639 F.3d 11 (1st Cir. 2011), cert. denied, U.S. Steel Corp. v. Milward, ___ U.S. ___, 2012 WL 33303 (2012).

Mr. Milward claimed that he was exposed to benzene as a refrigerator technician, and that he developed acute promyelocytic leukemia (APL) as a result.  664 F. Supp. 2d at 140. In support of his claim, Mr. Milward offered the testimony of Dr. Martyn T. Smith, a toxicologist, who testified that the “weight of the evidence” supported his opinion that benzene exposure causes APL. Id. Smith, in his litigation report, described his methodology as an application of WOE:

“The term WOE has come to mean not only a determination of the statistical and explanatory power of any individual study (or the combined power of all the studies) but the extent to which different types of studies converge on the hypothesis. In assessing whether exposure to benzene may cause APL, I have applied the Hill considerations. Nonetheless, application of those factors to a particular causal hypothesis, and the relative weight to assign each of them, is both context dependent and subject to the independent judgment of the scientist reviewing the available body of data. For example, some WOE approaches give higher weight to mechanistic information over epidemiological data.”

Smith Report at ¶¶19, 21 (citing Sheldon Krimsky, “The Weight of Scientific Evidence in Policy and Law,” 95(S1) Am. J. Public Health 5130, 5130-31 (2005))(March 9, 2009).  Smith marshaled several bodies of evidence, which he claimed collectively supported his opinion that benzene causes APL.  Milward, 664 F. Supp. 2d at 143.

Milward also offered the testimony of a philosophy professor, Carl F. Cranor, for the opinion that WOE was an acceptable methodology, and that all scientific inference is subject to judgment.  This is the same Cranor who, advocating for open admission of all putative scientific opinions, showcased his confusion between statistical significance probability and the posterior probability involved in a conclusion of causality.  Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (Oxford 1993)(“One can think of α, β (the chances of type I and type II errors, respectively) and 1 − β as measures of the ‘risk of error’ or ‘standards of proof.’”) See also id. at 44, 47, 55, 72-76.
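For what it is worth, the quantities Cranor invokes are conditional error rates, not measures of how probable a causal conclusion is.  A minimal statement of the distinction he elides, using standard statistical definitions rather than anything in his text:

```latex
\[
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta  = P(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad
1 - \beta = \text{power}.
\]
% None of these conditional error rates is the posterior probability,
% $P(\text{causation} \mid \text{data})$, that a burden of proof such as
% ``more likely than not'' addresses.
```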

After a four-day evidentiary hearing, the district court found that Martyn Smith’s opinion was merely a plausible hypothesis, and not admissible.  Milward, 664 F. Supp. 2d at 149.  The Court of Appeals, in an opinion by Chief Judge Lynch, however, reversed and ruled that an inference of general causation based on a WOE methodology satisfied the reliability requirement for admission under Federal Rule of Evidence 702.  639 F.3d at 26.  According to the Circuit, WOE methodology was scientifically sound.  Id. at 22-23.

 

WOE Cometh

Because the WOE methodology is not well described, either in the published literature or in Martyn Smith’s litigation report, it is difficult to understand exactly what the First Circuit approved by reversing Smith’s exclusion.  Usually the burden is on the proponent of the opinion testimony, and one would have thought that the vagueness of the described methodology would count against admissibility.  It is hard to escape the conclusion that the Circuit elevated a poorly described method, best characterized as hand waving, into a description of scientific method.

The Panel appeared to have been misled by Carl F. Cranor, who described “inference to the best explanation” as requiring a scientist to “consider all of the relevant evidence” and “integrate the evidence using professional judgment to come to a conclusion about the best explanation.” Id. at 18. The available explanations are then weighed, and a would-be expert witness is free to embrace the one he feels offers the “best” explanation.  The appellate court’s opinion takes WOE, combined with Cranor’s “inference to the best explanation,” to hold that an expert witness need only opine that he has considered the range of plausible explanations for the association, and that he believes that the causal explanation is the best or “most plausible.”  Id. at 20 (upholding this approach as “methodologically reliable”).

What is missing of course is the realization that plausible does not mean established, reasonably certain, or even more likely than not.  The Circuit’s invocation of plausibility also obscures the indeterminacy of the available data for supporting a reliable conclusion of causation in many cases.

Curiously, the Panel likened WOE to the use of differential diagnosis, which is a method for inferring the specific cause of a particular patient’s disease or disorder.  Id. at 18.  This is a serious confusion between a method concerned with general causation and one concerned with specific causation.  Even if, by the principle of charity, we allow that the First Circuit was thinking of some process of differential etiology rather than diagnosis (given that diagnoses, other than for infectious diseases and a few pathognomonic disorders, do not usually carry with them information about unique etiologic agents), such a process of differential etiology is itself a well-structured disjunctive syllogism of the form:

A ∨ B ∨ C

¬A ∧ ¬B

∴ C

There is nothing subjective about assigning weights or drawing inferences in applying such a syllogism.  In the Milward case, one of the propositional facts that might well have explained the available evidence was chance, but plaintiff’s expert witness Smith could not and did not rule out chance, because the studies upon which he relied were not statistically significant.  Smith could thus never get past “therefore” in any syllogism or in any other recognizable process of reasoning.

The Circuit Court provides no insight into the process Smith used to weigh the available evidence, and it failed to address the analytical gaps and evidentiary insufficiencies identified by the trial court, other than to invoke the mantra that all these issues go to “the weight, not the admissibility” of Smith’s opinions.  This, of course, is a conclusion, not an explanation or a legal theory.

There is also a cute semantic trick lurking in plaintiffs’ position in Milward, which results from their witnesses describing their methodology as “WOE.”  Since the jury is charged with determining the “weight of the evidence,” any evaluation of the WOE would be an invasion of the province of the jury.  Milward, 639 F.3d at 20. QED, by the semantic device of deliberately conflating the name of the putative scientific methodology with the term traditionally used to describe jury fact finding.

In any event, the Circuit’s chastisement of the district court for evaluating Smith’s implementation of the WOE methodology, his logical, mathematical, and epidemiological errors, and his result-driven reinterpretation of study data, threatens to read an Act of Congress (the Federal Rules of Evidence, and especially Rules 702 and 703) out of existence by judicial fiat.  The Circuit’s approach is also at odds with Supreme Court precedent (now codified in Rule 702) on the importance and the requirement of evaluating opinion testimony for analytical gaps and the ipse dixit of expert witnesses.  General Electric Co. v. Joiner, 522 U.S. 136, 146 (1997).

 

Smith’s Errors in Recalculating Odds Ratios of Published Studies

In the district court, the defendants presented testimony of an epidemiologist, Dr. David H. Garabrant, who took Smith to task for calculating risk ratios incorrectly.  Smith did not have any particular expertise in epidemiology, and his faulty calculations were problematic from the perspective of both Rule 702 and Rule 703.  The district court found the criticisms of Smith’s calculations convincing, 664 F. Supp. 2d at 149, but the appellate court held that the technical dispute was for the jury; “both experts’ opinions are supported by evidence and sound scientific reasoning,” Milward, 639 F.3d at 24.  This ruling is incomprehensible.  Plaintiffs had the burden of showing the admissibility of Smith’s opinion generally, and also the reasonableness of his reliance upon the calculated odds ratio.  The defendants had no burden of persuasion on the issue of Smith’s calculations, but they presented testimony, which apparently carried the day in the district court.  The appellate court had no basis for reversing the specific ruling with respect to the erroneously calculated risk ratio.

 

Smith’s Reliance upon Statistically Insignificant Studies

Smith relied upon studies that were not statistically significant at any accepted level.  An opinion of causality requires a showing that chance, bias, and confounding have been excluded in assessing an existing association.  Smith failed to exclude chance as an explanation for the association, and the burden to make this exclusion was on the plaintiffs. This failure was not something that could readily be patched by adverting to other evidence, such as studies in animals or in test tubes.  The Court of Appeals excused the important analytical gap in plaintiffs’ witness’s opinion because APL is rare, and data collection is difficult in the United States.  Id. at 24.  Evidence “consistent with” and “suggestive of” the challenged witness’s opinion thus suffices.  This is a remarkable homeopathic dilution of both legal and scientific causation.  Now we have a rule of law that allows plaintiffs to be excused from having to prove their case with reliable evidence if they allege a rare disease for which they lack evidence.

 

Leveling the Hierarchy of Evidence

Imagine trying to bring a medication to market with a small case-control study, with a non-statistically significant odds ratio!  Oh, but these clinical trials are so difficult and expensive; and they take such a long time.  Like a moment’s thought, when thinking is so hard and a moment such a long time.  We would be quite concerned if the FDA abridged the standard for causal efficacy in the licensing of new medications; we should be just as concerned about judicial abridgments of standards for causation of harm in tort actions.

Leveling the hierarchy of evidence has been an explicit or implicit goal of several law professors.  Some of the leveling efforts even show up in the new Reference Manual on Scientific Evidence (RMSE 3d ed. 2011).  See “New-Age Levellers – Flattening Hierarchy of Evidence.”

The Circuit, in Milward, quoted an article published in Cancer Research by Michele Carbone and others, who suggested that there should be no hierarchy, but the Court ignored a huge body of literature that explains and defends the need for recognizing that not all study designs or types are equal.  Interestingly, the RMSE chapter on epidemiology by Professor Green (see more below) cites the same article.  RMSE 3d at 564 & n.48 (citing and quoting a symposium paper for the proposition that “[t]here should be no hierarchy [among different types of scientific methods to determine cancer causation]. Epidemiology, animal, tissue culture and molecular pathology should be seen as integrating evidences in the determination of human carcinogenicity.” Michele Carbone et al., “Modern Criteria to Establish Human Cancer Etiology,” 64 Cancer Res. 5518, 5522 (2004).)  Carbone, of course, is best known for his advocacy of a viral cause (SV40) of human mesothelioma, a claim unsupported, and indeed contradicted, by epidemiologic studies.  Carbone’s statement does not support the RMSE chapter’s leveling of epidemiology and toxicology, and Carbone is, in any event, an unlikely source to cite.

The First Circuit, in Milward, studiously ignored a mountain of literature on evidence-based medicine, including the RMSE 3d chapter, “Reference Guide on Medical Testimony,” which teaches that leveling of study designs and types is inappropriate. The RMSE chapter devotes several pages to explaining the role of study design in assessing an etiological issue:

3. Hierarchy of medical evidence

With the explosion of available medical evidence, increased emphasis has been placed on assembling, evaluating, and interpreting medical research evidence.  A fundamental principle of evidence-based medicine (see also Section IV.C.5, infra) is that the strength of medical evidence supporting a therapy or strategy is hierarchical.

When ordered from strongest to weakest, systematic review of randomized trials (meta-analysis) is at the top, followed by single randomized trials, systematic reviews of observational studies, single observational studies, physiological studies, and unsystematic clinical observations.150 An analysis of the frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized controlled trials in the medical evidence hierarchy.151 Although they are at the bottom of the evidence hierarchy, unsystematic clinical observations or case reports may be the first signals of adverse events or associations that are later confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless, subsequent studies may not confirm initial reports (e.g., the putative association between coffee consumption and pancreatic cancer).154

John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, “Reference Guide on Medical Testimony,” RMSE 3d 687, 723-24 (2011).   The implication that there is no hierarchy of evidence in causal inference, and that tissue culture studies are as relevant as epidemiology, is patently absurd. The Circuit not only went out on a limb, it managed to saw the limb off, while “out there.”

 

Milward – Responses Critical and Otherwise

The First Circuit’s decision in Milward made an immediate impression upon those writers who have worked hard to dismantle or marginalize Rule 702.  The Circuit’s decision was mysteriously cited with obvious approval by Professor Margaret Berger, even though she had died before the decision was published!  Margaret A. Berger, “The Admissibility of Expert Testimony,” RMSE 3d at 20 & n.51 (2011).  Professor Michael Green, one of the reporters for the ALI’s Restatement (Third) of Torts, hyperbolically called Milward “[o]ne of the most significant toxic tort causation cases in recent memory.”  Michael D. Green, “Introduction: Restatement of Torts as a Crystal Ball,” 37 Wm. Mitchell L. Rev. 993, 1009 n.53 (2011).

The WOE approach, and its embrace in Milward, obscures the reality that sometimes the evidence does not logically or analytically support the offered conclusion, and that at other times, the best explanation is uncertainty.  By adopting the WOE approach, vague and ambiguous as it is, the Milward Court was beguiled into holding that WOE determinations are for the jury.  The lack of meaningful content of WOE means that decisions such as Milward effectively remove the gatekeeping function, or permit that function to be minimally satisfied by accepting an expert witness’s claim to have employed WOE.  The epistemic warrant required by Rule 702 is diluted if not destroyed.  Scientific hunch and speculation, proper in their place, can be passed off as scientific knowledge to gullible or result-oriented judges and juries.

Confidence in Intervals and Diffidence in the Courts

March 4th, 2012

Next year, the Supreme Court’s Daubert decision will turn 20.  The decision, in interpreting Federal Rule of Evidence 702, dramatically changed the landscape of expert witness testimony.  Still, there are many who would turn back the clock and disable the gatekeeping function.  In past posts, I have identified scholars, such as Erica Beecher-Monas and the late Margaret Berger, who tried to eviscerate judicial gatekeeping.  Recently a student note argued for the complete abandonment of all judicial control of expert witness testimony.  See Note, “Admitting Doubt: A New Standard for Scientific Evidence,” 123 Harv. L. Rev. 2021 (2010)(arguing that courts should admit all relevant evidence).

One advantage that comes from requiring trial courts to serve as gatekeepers is that the expert witnesses’ reasoning is approved or disapproved in an open, transparent, and rational way.  Trial courts subject themselves to public scrutiny in a way that jury decision making does not permit.  The critics of Daubert often engage in a cynical attempt to remove all controls over expert witnesses in order to empower juries to act on their populist passions and prejudices.  When courts misinterpret statistical and scientific evidence, there is some hope of changing subsequent decisions by pointing out their errors.  Jury errors, on the other hand, unless they involve determinations of issues for which there was “no evidence,” are immune to institutional criticism or correction.

Despite my whining, not all courts butcher statistical concepts.  There are many astute judges out there who see error and call it error.  Take for instance, the trial judge who was confronted with this typical argument:

“While Giles admits that a p-value of .15 is three times higher than what scientists generally consider statistically significant—that is, a p-value of .05 or lower—she maintains that this ‘‘represents 85% certainty, which meets any conceivable concept of preponderance of the evidence.’’ (Doc. 103 at 16).”

Giles v. Wyeth, Inc., 500 F.Supp. 2d 1048, 1056-57 (S.D.Ill. 2007), aff’d, 556 F.3d 596 (7th Cir. 2009).  Despite having case law cited to it (such as In re Ephedra), the trial court looked to the Reference Manual on Scientific Evidence, a resource that seems to be ignored by many federal judges, and rejected the bogus argument.  Unfortunately, the lawyers who made the bogus argument still are licensed, and at large, to incite the same error in other cases.
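The transpositional fallacy is easy to expose with a back-of-the-envelope Bayesian calculation.  The inputs below (the prior probability and the assumed power) are invented solely to make the arithmetic concrete; the point is that the probability that the causal hypothesis is true depends on quantities a p-value does not supply, so 1 − p is not “certainty” in any sense.

```python
# A p-value of 0.15 is the probability of data at least this extreme *given* the
# null hypothesis; it is not the probability that the alternative hypothesis is true.
# A toy Bayesian calculation (all inputs hypothetical) makes the difference concrete.

p_value = 0.15   # roughly, P(observed or more extreme data | null hypothesis)
power   = 0.50   # assumed probability of such data if the alternative were true
prior   = 0.10   # assumed prior probability that the alternative is true

# Posterior probability of the alternative, by Bayes' theorem (treating the
# p-value loosely as P(data | null) for illustration only):
posterior = (power * prior) / (power * prior + p_value * (1 - prior))

print(f"Posterior probability of the alternative: {posterior:.2f}")  # about 0.27, not 0.85
```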

This business would perhaps be amenable to an empirical analysis.  An enterprising sociologist of the law could conduct some survey research on the science and math training of the federal judiciary, on whether federal judges have read chapters of the Reference Manual before deciding cases involving statistics or science, and on whether federal judges expressed the need for further education.  This survey evidence could be capped by an analysis of the prevalence of certain kinds of basic errors, such as the transpositional fallacy committed by so many judges (but decisively rejected in the Giles case).  Perhaps such an empirical analysis would advance our understanding of whether we need specialty science courts.

One of the reasons that the Reference Manual on Scientific Evidence is worthy of so much critical attention is that the volume has the imprimatur of the Federal Judicial Center, and now the National Academies of Science.  Putting aside the idiosyncratic chapter by the late Professor Berger, the Manual clearly presents guidance on many important issues.  To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges.

Unfortunately, the Manual has competition from lesser authors whose work obscures, misleads, and confuses important issues.  Consider an article by two would-be expert witnesses, who testify for plaintiffs, and confidently misstate the meaning of a confidence interval:

“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”

Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004).  This misstatement was then cited and quoted with obvious approval by Professor Beecher-Monas, in her text on scientific evidence.  Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 60-61 n. 17 (2007).   Beecher-Monas goes on, however, to argue that confidence interval coefficients are not the same as burdens of proof, but then implies that scientific standards of proof are different from the legal preponderance of the evidence.  She provides no citation or support for the higher burden of scientific proof:

“Some commentators have attributed the causation conundrum in the courts to the differing burdens of proof in science and law.28 In law, the civil standard of ‘more probable than not’ is often characterized as a probability greater than 50 percent.29 In science, on the other hand, the most widely used standard is a 95 percent confidence interval (corresponding to a 5 percent level of significance, or p-level).30 Both sound like probabilistic assessment. As a result, the argument goes, civil judges should not exclude scientific testimony that fails scientific validity standards because the civil legal standards are much lower. The transliteration of the ‘more probable than not’ standard of civil factfinding into a quantitative threshold of statistical evidence is misconceived. The legal and scientific standards are fundamentally different. They have different goals and different measures.  Therefore, one cannot justifiably argue that evidence failing to meet the scientific standards nonetheless should be admissible because the scientific standards are too high for preponderance determinations.”

Id. at 65.  This seems to be on the right track, although Beecher-Monas does not state clearly whether she subscribes to the notion that the burdens of proof in science and law differ.  The argument then takes a wrong turn:

“Equating confidence intervals with burdens of persuasion is simply incoherent. The goal of the scientific standard – the 95 percent confidence interval – is to avoid claiming an effect when there is none (i.e., a false positive).31

Id. at 66.   But this is a crass error; confidence intervals are not burdens of persuasion, legal or scientific.  Beecher-Monas is not, however, content to leave this alone:

“Scientists using a 95 percent confidence interval are making a prediction about the results being due to something other than chance.”

Id. at 66 (emphasis added).  Other than chance?  Well, this implies causality, as well as bias and confounding, but the confidence interval, like the p-value, addresses only random or sampling error.  Beecher-Monas’s error is neither random nor scientific.  Indeed, she perpetuates the same error committed by the Fifth Circuit in a frequently cited Bendectin case, which interpreted the confidence interval as resolving questions of the role of matters “other than chance,” such as bias and confounding.  Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989)(“Fortunately, we do not have to resolve any of the above questions [as to bias and confounding], since the studies presented to us incorporate the possibility of these factors by the use of a confidence interval.”)(emphasis in original).  See, e.g., David H. Kaye, David E. Bernstein, and Jennifer L. Mnookin, The New Wigmore – A Treatise on Evidence:  Expert Evidence § 12.6.4, at 546 (2d ed. 2011); Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 86-87 (2009)(criticizing the overinterpretation of confidence intervals by the Brock court).

Clapp, Ozonoff, and Beecher-Monas are not alone in offering bad advice to judges who must help resolve statistical issues.  Déirdre Dwyer, a prominent scholar of expert evidence in the United Kingdom, manages to bundle up the transpositional fallacy and a misstatement of the meaning of the confidence interval into one succinct exposition:

“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”

Déirdre Dwyer, The Judicial Assessment of Expert Evidence 154-55 (Cambridge Univ. Press 2008).

Of course, Clapp, Ozonoff, Beecher-Monas, and Dwyer build upon a long tradition of academics’ giving errant advice to judges on this very issue.  See, e.g., Christopher B. Mueller, “Daubert Asks the Right Questions:  Now Appellate Courts Should Help Find the Right Answers,” 33 Seton Hall L. Rev. 987, 997 (2003)(describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 Judicature 12, 19 (2003)(“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”)(incorrectly citing the first edition of Rothman & Greenland, Modern Epidemiology 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New Reference Manual on Scientific Evidence,” 74 N.C.L.Rev. 1183, 1212 n.172 (1996)(“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

Who has prevailed?  The statistically correct authors of the statistics chapter of the Reference Manual on Scientific Evidence, or the errant commentators?  It would be good to have some empirical evidence to help evaluate the judiciary’s competence. Here are some cases, many drawn from the Manual‘s discussions, arranged chronologically, before and after the first appearance of the Manual:

Before First Edition of the Reference Manual on Scientific Evidence:

DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 948 (3d Cir. 1990)(“A 95% confidence interval is constructed with enough width so that one can be confident that it is only 5% likely that the relative risk attained would have occurred if the true parameter, i.e., the actual unknown relationship between the two studied variables, were outside the confidence interval.   If a 95% confidence interval thus contains ‘1’, or the null hypothesis, then a researcher cannot say that the results are ‘statistically significant’, that is, that the null hypothesis has been disproved at a .05 level of significance.”)(internal citations omitted)(citing in part, D. Barnes & J. Conley, Statistical Evidence in Litigation § 3.15, at 107 (1986), as defining a CI as “a limit above or below or a range around the sample mean, beyond which the true population is unlikely to fall”).

United States ex rel. Free v. Peters, 806 F. Supp. 705, 713 n.6 (N.D. Ill. 1992) (“A 99% confidence interval, for instance, is an indication that if we repeated our measurement 100 times under identical conditions, 99 times out of 100 the point estimate derived from the repeated experimentation will fall within the initial interval estimate … .”), rev’d in part, 12 F.3d 700 (7th Cir. 1993)

DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992)(“A 95% confidence interval means that there is a 95% probability that the ‘true’ relative risk falls within the interval”), aff’d, 6 F.3d 778 (3d Cir. 1993)

Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1353-54 & n.1 (6th Cir. 1992)(describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”)

Hilao v. Estate of Marcos, 103 F.3d 767, 787 (9th Cir. 1996)(Rymer, J., dissenting and concurring in part).

After the first publication of the Reference Manual on Scientific Evidence:

American Library Ass’n v. United States, 201 F.Supp. 2d 401, 439 & n.11 (E.D.Pa. 2002), rev’d on other grounds, 539 U.S. 194 (2003)

SmithKline Beecham Corp. v. Apotex Corp., 247 F.Supp.2d 1011, 1037-38 (N.D. Ill. 2003)(“the probability that the true value was between 3 percent and 7 percent, that is, within two standard deviations of the mean estimate, would be 95 percent”)(also confusing attained significance probability with posterior probability: “This need not be a fatal concession, since 95 percent (i.e., a 5 percent probability that the sign of the coefficient being tested would be observed in the test even if the true value of the sign was zero) is an  arbitrary measure of statistical significance.  This is especially so when the burden of persuasion on an issue is the undemanding ‘preponderance’ standard, which  requires a confidence of only a mite over 50 percent. So recomputing Niemczyk’s estimates as significant only at the 80 or 85 percent level need not be thought to invalidate his findings.”), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005)

In re Silicone Gel Breast Implants Prods. Liab. Litig, 318 F.Supp.2d 879, 897 (C.D. Cal. 2004) (interpreting a relative risk of 1.99, in a subgroup of women who had had polyurethane foam covered breast implants, with a 95% CI that ran from 0.5 to 8.0, to mean that “95 out of 100 a study of that type would yield a relative risk somewhere between on 0.5 and 8.0.  This huge margin of error associated with the PUF-specific data (ranging from a potential finding that implants make a woman 50% less likely to develop breast cancer to a potential finding that they make her 800% more likely to develop breast cancer) render those findings meaningless for purposes of proving or disproving general causation in a court of law.”)(emphasis in original)

Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc., 482 F.Supp. 2d 478, 495 (D.N.J.2007)(“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”)(apparently relying party’s expert witness’s report), aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd., 344 Fed.Appx. 595 (Fed. Cir. 2009)

Eli Lilly & Co. v. Teva Pharms, USA, 2008 WL 2410420, *24 (S.D.Ind. 2008)(stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”)

Benavidez v. City of Irving, 638 F.Supp. 2d 709, 720 (N.D. Tex. 2009)(interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”)

Estate of George v. Vermont League of Cities and Towns, 993 A.2d 367, 378 n.12 (Vt. 2010)(erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”)

Correct Statements

There is no reason for any of these courts to have struggled so with the concept of statistical significance or of the confidence interval.  These concepts are well elucidated in the Reference Manual on Scientific Evidence (RMSE):

“To begin with, ‘confidence’ is a term of art. The confidence level indicates the percentage of the time that intervals from repeated samples would cover the true value. The confidence level does not express the chance that repeated estimates would fall into the confidence interval.91

* * *

According to the frequentist theory of statistics, probability statements cannot be made about population characteristics: Probability statements apply to the behavior of samples. That is why the different term ‘confidence’ is used.”

RMSE 3d at 247 (2011).
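The frequentist meaning of “confidence” described in the RMSE passage can be demonstrated with a simple simulation, using arbitrary illustrative parameters: the 95% figure describes the long-run performance of the interval-constructing procedure over repeated samples, not the probability that any one computed interval contains the truth.

```python
import random
import statistics

# Simulate repeated sampling from a known population and count how often the
# normal-approximation 95% interval covers the true mean.  All parameters are
# arbitrary choices for illustration.
random.seed(1)
true_mean, sd, n, trials = 10.0, 2.0, 50, 10_000

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se
    if lo <= true_mean <= hi:
        covered += 1

print(f"Coverage over {trials} repeated samples: {covered / trials:.3f}")  # about 0.95
```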

Even before the Manual, many capable authors have tried to reach the judiciary to help judges learn and apply statistical concepts more confidently.  Professors Michael Finkelstein and Bruce Levin, of Columbia University’s Law School and Mailman School of Public Health, respectively, have worked hard to educate lawyers and judges in the important concepts of statistical analyses:

“It is the confidence limits PL and PU that are random variables based on the sample data. Thus, a confidence interval (PL, PU) is a random interval, which may or may not contain the population parameter P. The term ‘confidence’ derives from the fundamental property that, whatever the true value of P, the 95% confidence interval will contain P within its limits 95% of the time, or with 95% probability. This statement is made only with reference to the general property of confidence intervals and not to a probabilistic evaluation of its truth in any particular instance with realized values of PL and PU.”

Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers at 169-70 (2d ed. 2001)

Courts have no doubt been confused to some extent between the operational definition of a confidence interval and the role of the sample point estimate as an estimator of the population parameter.  In some instances, the sample statistic may be the best estimate of the population parameter, but that estimate may be rather crummy because of the sampling error involved.  See, e.g., Kenneth J. Rothman, Sander Greenland, Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008) (“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible. * * *  A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”); Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)(“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”); Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)(“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval … .”).
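The point made by Rothman, Jewell, and Poole, that parameter values inside a confidence interval are not all equally compatible with the data, can be seen by computing a two-sided p-value for several candidate relative risks against a single study result.  The study result below (RR of 1.8, with a 95% CI of 1.3 to 2.5) is hypothetical, chosen only because it is roughly symmetric on the log scale.

```python
import math
from statistics import NormalDist

# Hypothetical study result: RR = 1.8 with a 95% CI from 1.3 to 2.5.
# On the log scale, the standard error can be recovered from the interval width.
rr_hat, lo, hi = 1.8, 1.3, 2.5
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

# Two-sided p-value testing each candidate RR against the observed estimate.
for candidate in (1.8, 1.5, 1.3, 1.0):
    z = (math.log(rr_hat) - math.log(candidate)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"candidate RR {candidate}: p = {p:.2f}")

# Values near the point estimate (1.8) yield p near 1.0; values at the interval's
# edge (1.3) yield p near 0.05.  The values are hardly "equally compatible."
```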

Admittedly, I have given an impressionistic account, and I have used anecdotal methods, to explore the question whether the courts have improved in their statistical assessments in the 20 years since the Supreme Court decided Daubert.  Many decisions go unreported, and perhaps many errors are cut off from the bench in the course of testimony or argument.  I personally doubt that judges exercise greater care in their comments from the bench than they do in published opinions.  Still, the quality of care exercised by the courts would be a worthy area of investigation by the Federal Judicial Center, or perhaps by other sociologists of the law.