TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Matrixx Unloaded

March 29th, 2011

In writing for a unanimous Court in Matrixx Initiatives, Inc. v. Siracusano, Justice Sotomayor wandered far afield from the world of pleading rules to flyblow the world of expert witness jurisprudence.  How and why did this happen?  Why did Matrixx invoke the concept of statistical significance to counter case reports of adverse events? Did Matrixx oversell its scientific position, thereby handing Justice Sotomayor an opportunity to unravel decades of evolution of law on the admissibility of expert witness opinion testimony?  Inquiring minds want to know.

Still, whatever the occasion for the obiter dicta, the Court’s pronouncements on expert witnesses are stunning for their irrelevance and questionable scholarship:

“We note that courts frequently permit expert testimony on causation based on evidence other than statistical significance. See, e.g., Best v. Lowe’s Home Centers, Inc., 563 F. 3d 171, 178 (6th Cir. 2009); Westberry v. Gislaved Gummi AB, 178 F. 3d 257, 263–264 (4th Cir. 1999) (citing cases); Wells v. Ortho Pharmaceutical Corp., 788 F. 2d 741, 744–745 (11th Cir. 1986). We need not consider whether the expert testimony was properly admitted in those cases, and we do not attempt to define here what constitutes reliable evidence of causation.”

Id. at 12.  What is remarkable about this passage is that the first two cases cited involved differential etiology or diagnosis to assess specific causation, not general causation.  As most courts have recognized, this assessment strategy requires that general causation has already been established. See, e.g., Hall v. Baxter Healthcare, 947 F. Supp. 1387 (D. Ore. 1996).

The citation to the third case, Wells, is noteworthy because the case has nothing to do with adverse event reports or statistical significance.  Wells involved a claim of birth defects caused by the use of a spermicidal contraceptive jelly, which had been the subject of several studies, at least one of which yielded a statistically significant increase in detected birth defects over what was expected.  Wells v. Ortho Pharmaceutical Corp., 615 F. Supp. 262 (N.D.Ga. 1985), aff’d and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986).  Wells could thus hardly be an example of a case in which there was a judgment of causation based upon a scientific study that lacked statistical significance in its findings. Of course, finding statistical significance is just the beginning of assessing the causality of an association; Wells was notorious for its poor assessment of all the determinants of scientific causation.

The citation to Wells is thus remarkable because the Wells decision was rightly and widely criticized for its failure to evaluate the entire evidentiary display, as well as for its failure to rule out bias and confounding in the studies relied upon by the plaintiff.  See, e.g., James L. Mills and Duane Alexander, “Teratogens and ‘Litogens’,” 315 New Engl. J. Med. 1234 (1986); Samuel R. Gross, “Expert Evidence,” 1991 Wis. L. Rev. 1113, 1121-24 (1991) (“Unfortunately, Judge Shoob’s decision is absolutely wrong. There is no scientifically credible evidence that Ortho-Gynol Contraceptive Jelly ever causes birth defects.”). See also Editorial, “Federal Judges v. Science,” N.Y. Times, December 27, 1986, at A22 (unsigned editorial); David E. Bernstein, “Junk Science in the Courtroom,” Wall St. J. at A15 (Mar. 24, 1993) (pointing to Wells as a prominent example of how the federal judiciary had embarrassed the American judicial system with its careless, non-evidence-based approach to scientific evidence). A few years later, another case in the same judicial district against the same defendant for the same product resulted in the grant of summary judgment.  Smith v. Ortho Pharmaceutical Corp., 770 F. Supp. 1561 (N.D. Ga. 1991) (supposedly distinguishing Wells on the basis of more recent studies).

Perhaps the most remarkable aspect of the Court’s citation to Wells is that the case, and all it stands for, was overruled sub silentio by the Supreme Court’s own decisions in Daubert, Joiner, Kumho Tire, and Weisgram.  And if that did not kill the concept, then there was the simple matter of a supervening statute:  the 2000 amendment of Rule 702 of the Federal Rules of Evidence.

Citing a case as jurisprudentially dead and discredited as Wells could have been sloppy scholarship and lawyering.  The principle of charity, however, suggests it was purposeful, and that is a frightful prospect.

Courts and Commentators on the Use of Relative Risks to Infer Specific Causation

March 18th, 2011

Below, I have collected some of the case law and commentary on the issue of using relative and attributable risks to satisfy plaintiff’s burden of showing, more likely than not, that an exposure or condition caused his or her disease or injury.


Radiation

Johnston v. United States, 597 F. Supp. 374, 412, 425-26 (D. Kan. 1984)

Allen v. United States, 588 F. Supp. 247 (D. Utah 1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987)

In re TMI Litig., 193 F.3d 613, 629 (3d Cir. 1999)(rejecting trial court’s “doubling dose” analysis), amended, 199 F.3d 158 (3d Cir. 2000)

In re Hanford Nuclear Reservation Litig., 1998 WL 775340, at *8 (E.D.Wash. Aug. 21, 1998), rev’d, 292 F.3d 1124, 1136-37 (9th Cir. 2002)


Swine Flu – GBS Cases

Cook v. United States, 545 F. Supp. 306, 308 (N.D. Cal. 1982)(“Whenever the relative risk to vaccinated persons is greater than two times the risk to unvaccinated persons, there is a greater than 50% chance that a given GBS case among vaccinees of that latency period is attributable to vaccination, thus sustaining plaintiff’s burden of proof on causation.”)

Padgett v. United States, 553 F. Supp. 794, 800-01 (W.D. Tex. 1982) (“From the relative risk, we can calculate the probability that a given case of GBS was caused by vaccination. . . . [A] relative risk of 2 or greater would indicate that it was more likely than not that vaccination caused a case of GBS.”)

Manko v. United States, 636 F. Supp. 1419, 1434 (W.D. Mo. 1986)(relative risk of 2, or less, means exposure not the probable cause of disease claimed), aff’d in relevant part, 830 F.2d 831 (8th Cir. 1987)


IUD Cases – Pelvic Inflammatory Disease

Marder v. G.D. Searle & Co., 630 F. Supp. 1087, 1092 (D.Md. 1986) (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”), aff’d mem. on other grounds sub nom. Wheelahan v. G.D. Searle & Co., 814 F.2d 655 (4th Cir. 1987)(per curiam)


Bendectin cases

Lynch v. Merrell-National Laboratories, 646 F.Supp. 856 (D. Mass. 1986)(granting summary judgment), aff’d, 830 F.2d 1190, 1197 (1st Cir. 1987)(distinguishing between chances that “somewhat favor” plaintiff and plaintiff’s burden of showing specific causation by “preponderant evidence”)

DeLuca v. Merrell Dow Pharm., Inc., 911 F.2d 941, 958-9 (3d Cir. 1990)

Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1321 (9th Cir.)(“Daubert II”)(holding that for epidemiological testimony to be admissible to prove specific causation, there must have been a relative risk for the plaintiff of greater than 2) (“For an epidemiological study to show causation under a preponderance standard . . . the study must show that children whose mothers took Bendectin are more than twice as likely to develop limb reduction birth defects as children whose mothers did not.”), cert. denied, 516 U.S. 869 (1995)

DePyper v. Navarro, 1995 WL 788828 (Mich. Cir. Ct. Nov. 27, 1995)

Oxendine v. Merrell Dow Pharm., Inc., 1996 WL 680992 (D.C. Super. Ct. Oct. 24, 1996)

Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 716 (Tex. 1997) (holding, in accord with the weight of judicial authority, “that the requirement of a more than 50% probability means that epidemiological evidence must show that the risk of an injury or condition in the exposed population was more than double the risk in the unexposed or control population”); id. at 719 (rejecting isolated statistically significant associations when not consistently found among studies)


Silicone Cases

Hall v. Baxter Healthcare, 947 F.Supp. 1387, 1392, 1397, 1403-04 (D. Ore. 1996)(discussing relative risk of 2.0)

Pick v. American Medical Systems, Inc., 958 F. Supp. 1151, 1160 (E.D.La. 1997) (noting, in penile implant case, that “any” increased risk suggests that the exposure “may” have played some causal role)

In re Breast Implant Litigation, 11 F. Supp. 2d 1217, 1226-27 (D. Colo. 1998)(relative risk of 2.0 or less shows that the background risk is at least as likely to have given rise to the alleged injury)

Barrow v. Bristol-Myers Squibb Co., 1998 WL 812318, at *23 (M.D. Fla. Oct. 29, 1998)

Allison v. McGhan Med. Corp., 184 F.3d 1300, 1315 n.16, 1316 (11th Cir. 1999)(affirming exclusion of expert testimony based upon a study with a risk ratio of 1.24; noting that a statistically significant epidemiological study reporting an increased risk of a disease marker of 1.24 in patients with breast implants was so close to 1.0 that it “was not worth serious consideration for proving causation”; threshold for concluding that an agent more likely than not caused a disease is 2.0, citing Federal Judicial Center, Reference Manual on Scientific Evidence 168-69 (1994))

Grant v. Bristol-Myers Squibb, 97 F. Supp. 2d 986, 992 (D. Ariz. 2000)

Pozefsky v. Baxter Healthcare Corp., No. 92-CV-0314, 2001 WL 967608, at *3 (N.D.N.Y. August 16, 2001) (excluding causation opinion testimony given contrary epidemiologic studies; noting that sufficient epidemiologic evidence requires relative risk greater than two)

In re Silicone Gel Breast Implant Litig., 318 F. Supp. 2d 879, 893 (C.D. Cal. 2004)

Norris v. Baxter Healthcare Corp., 397 F.3d 878 (10th Cir. 2005) (discussing but not deciding specific causation and the need for relative risk greater than two; no reliable showing of general causation)

Minnesota Mining and Manufacturing v. Atterbury, 978 S.W.2d 183, 198 (Tex.App. – Texarkana 1998) (noting that “[t]here is no requirement in a toxic tort case that a party must have reliable evidence of a relative risk of 2.0 or greater”)


Asbestos

Washington v. Armstrong World Indus., Inc., 839 F.2d 1121 (5th Cir. 1988)(affirming grant of summary judgment on grounds that there was insufficient evidence that plaintiff’s colon cancer was caused by asbestos)

Lee v. Johns Manville Corp., slip op. at 3, Phila. Cty. Ct. C.P., Sept. Term 1978, No. 88 (123) (Oct. 26, 1983) (Forer, J.)(entering verdict in favor of defendants on grounds that plaintiff had failed to show that his colorectal cancer had been caused by asbestos exposure after adducing evidence of a relative risk less than two)

Primavera v. Celotex Corp., Phila. Cty. Ct. C.P., December Term, 1981, No. 1283 (bench op. of Hon. Berel Caesar, Nov. 2, 1988) (granting compulsory nonsuit on the plaintiff’s claim that his colorectal cancer was caused by his occupational exposure to asbestos)

Grassis v. Johns-Manville Corp., 248 N.J.Super. 446, 455-56, 591 A.2d 671, 676 (App. Div. 1991)

Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992)

Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992)

In re Joint E. & S. Dist. Asbestos Litig., 758 F. Supp. 199 (S.D.N.Y. 1991), rev’d sub nom. Maiorano v. Owens Corning Corp., 964 F.2d 92 (2d Cir. 1992)

Maiorana v. National Gypsum, 827 F. Supp. 1014, 1043 (S.D.N.Y. 1993), aff’d in part and rev’d in part, 52 F.3d 1122, 1134 (2d Cir. 1995)

Jones v. Owens-Corning Fiberglas Corp., 288 N.J. Super. 258, 266, 672 A.2d 230, 235 (App. Div. 1996)

Keene Corp. v. Hall, 626 A.2d 997 (Md. Ct. Spec. App. 1993)(laryngeal cancer)

In re W.R. Grace & Co., 355 B.R. 462, 483 (Bankr. D. Del. 2006) (requiring showing of relative risk greater than two to support property damage claims based on unreasonable risks from asbestos insulation products).


Pharmaceutical Cases

Ambrosini v. Upjohn, 1995 WL 637650, at *4 (D.D.C. 1995)

Ambrosini v. Labarraque, 101 F.3d 129, 135 (D.C. Cir. 1996)(Depo-Provera, birth defects)

Miller v. Pfizer, 196 F. Supp. 2d 1062, 1079 (D. Kan. 2002) (acknowledging that most courts require a showing of RR > 2, but questioning their reasoning), aff’d, 356 F. 3d 1326 (10th Cir. 2004)

Smith v. Wyeth-Ayerst Laboratories Co., 278 F. Supp. 2d 684, 691 (W.D.N.C. 2003) (recognizing that risk and cause are distinct concepts) (“Epidemiologic data that shows a risk cannot support an inference of cause unless (1) the data are statistically significant according to scientific standards used for evaluating such associations; (2) the relative risk is sufficiently strong to support an inference of ‘more likely than not’; and (3) the epidemiologic data fits the plaintiff’s case in terms of exposure, latency, and other relevant variables.”)

Burton v. Wyeth-Ayerst Laboratories, 513 F. Supp. 2d 719 (N.D. Tex. 2007)

In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1172 (N.D. Calif. 2007)(observing that epidemiologic studies “can also be probative of specific causation, but only if the relative risk is greater than 2.0, that is, the product more than doubles the risk of getting the disease”)

In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1078 (D. Minn. 2008)(noting that some but not all courts have concluded that relative risks under two support finding an expert witness’s opinion inadmissible)


Toxic Tort Cases

In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 785, 836 (E.D.N.Y. 1984) (“A government administrative agency may regulate or prohibit the use of toxic substances through rulemaking, despite a very low probability of any causal relationship.  A court, in contrast, must observe the tort law requirement that a plaintiff establish a probability of more than 50% that the defendant’s action injured him. … This means that at least a two-fold increase in incidence of the disease attributable to Agent Orange exposure is required to permit recovery if epidemiological studies alone are relied upon.”), aff’d 818 F.2d 145, 150-51 (2d Cir. 1987)(approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 487 U.S. 1234 (1988)

Sanderson v. Int’l Flavors & Fragrances, Inc., 950 F. Supp. 981, 998 n.17, 999-1000, 1004 (C.D. Cal. 1996) (more than a doubling of risk is required in case involving aldehyde exposure and claimed multiple chemical sensitivities)

Wright v. Willamette Indus., Inc., 91 F.3d 1105 (8th Cir. 1996)(“Actions in tort for damages focus on the question of whether to transfer money from one individual to another, and under common-law principles (like the ones that Arkansas law recognizes) that transfer can take place only if one individual proves, among other things, that it is more likely than not that another individual has caused him or her harm.  It is therefore not enough for a plaintiff to show that a certain chemical agent sometimes causes the kind of harm that he or she is complaining of.  At a minimum, we think that there must be evidence from which the factfinder can conclude that the plaintiff was exposed to levels of that agent that are known to cause the kind of harm that the plaintiff claims to have suffered. See Abuan v. General Elec. Co., 3 F.3d at 333.  We do not require a mathematically precise table equating levels of exposure with levels of harm, but there must be evidence from which a reasonable person could conclude that a defendant’s emission has probably caused a particular plaintiff the kind of harm of which he or she complains before there can be a recovery.”)

McDaniel v. CSX Transp., Inc., 955 S.W.2d 257, 264 (Tenn. 1997) (doubling of risk is relevant but not required as a matter of law)

Lofgren v. Motorola, 1998 WL 299925 *14 (Ariz. Super. 1998) (TCE, cancer)

Berry v. CSX Transp., Inc., 709 So. 2d 552 (Fla. Dist. Ct. App. 1998)(solvents, toxic encephalopathy)

Bartley v. Euclid, Inc., 158 F.3d 261 (5th Cir. 1998)

Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 591-92 (D.N.J. 2002) (“the threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0”), aff’d, 68 F. App’x 356 (3d Cir. 2003)

Ferguson v. Riverside School Dist. No. 416, 2002 WL 34355958 (E.D. Wash. Feb. 6, 2002)(No. CS-00-0097-FVS)

Daniels v. Lyondell-Citgo Refining Co., 99 S.W.3d 722, 727 (Tex. App. – Houston [1st Dist.] 2003)

Graham v. Lautrec Ltd., 2003 WL 23512133 (Mich. Cir. Ct., July 24, 2003)

Theofanis v. Sarrafi, 791 N.E.2d 38, 48 (Ill. App. 2003)(reversing and granting new trial to plaintiff who received an award of no damages when experts testified that relative risk was between 2.0 and 3.0)(“where the risk with the negligent act is at least twice as great as the risk in the absence of negligence, the evidence supports a finding that, more likely than not, the negligence in fact caused the harm”).

Cano v. Everest Minerals Corp., 362 F. Supp. 2d 814, 846 (W.D. Tex. 2005)(relative risk less than 3.0 represents only a weak association)

Mobil Oil Corp. v. Bailey, 187 S.W.3d 263, 268 (Tex. App. – Beaumont 2006)

Cook v. Rockwell Internat’l Corp., 580 F. Supp. 2d 1071, 1088-89 (D. Colo. 2006)

In re Lockheed Litig. Cases, 115 Cal. App. 4th 558 (2004), rev’d in part, 23 Cal. Rptr. 3d 762, 765 (Cal. App. 2d Dist. 2005), cert. dismissed, 192 P.3d 403 (Cal. 2007)

Watts v. Radiator Specialty Co., 990 So. 2d 143 (Miss. 2008)(“The threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0.”)

Henricksen v. Conocophillips Co., 605 F. Supp. 2d 1142, 1158 (E.D. Wash. 2009) (noting that under Circuit precedent, epidemiologic studies showing low-level risk may suffice to show general causation but are sufficient to show specific causation only if the relative risk exceeds two) (excluding plaintiff’s expert witness’s testimony because the epidemiologic evidence is “contradictory and inconsistent”)

George v. Vermont League of Cities and Towns, 2010 Vt. 1, 993 A.2d 367, 375 (2010)

City of San Antonio v. Pollock, 284 S.W.3d 809, 818 (Tex. 2009) (holding testimony admitted insufficient as matter of law).


ACADEMIC COMMENTATORS

Michael Dore, “A Commentary of the Use of Epidemiological Evidence in Demonstrating Cause-in-Fact,” 7 Harv. Envt’l L.Rev. 429, 431-40 (1983)

Bert Black & David E. Lilienfeld, “Epidemiologic Proof in Toxic Tort Litigation,” 52 Fordham L. Rev. 732, 767-69 (1984)

David E. Lilienfeld & Bert Black, “The Epidemiologist in Court,” 123 Am. J. Epidemiology 961, 963 (1986)(a relative risk of 1.5 allows an inference of attributable risk of 33%, which means any individual case is less likely than not to be causally related)

Powell, “How to Tell the Truth With Statistics: A New Statistical Approach to Analyzing the Bendectin Epidemiological Data in the Aftermath of Daubert v. Merrell Dow Pharmaceuticals,” 31 Houston L. Rev. 1241, 1310 (1994) (“The plaintiff who wishes to reach the jury on the issue of causation must submit a statistical analysis indicating that exposure to the drug in question more likely than not caused the birth defects in question.  To support a finding of causation, the meta-analysis summary odds ratio must exceed two.”)

Linda Bailey, et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence at 121, 168-69 (Federal Judicial Ctr. 1st ed. 1994) (“The threshold for concluding that an agent was more likely the cause of a disease than not is a relative risk greater than 2.0 … .  A relative risk greater than 2.0 would permit an inference that an individual plaintiff’s disease was more likely than not caused by the implicated agent.”)

Ben Armstrong & Gilles Theriault, “Compensating Lung Cancer Patients Occupationally Exposed to Coal Tar Pitch Volatiles,” 53 Occup. Envt’l Med. 160 (1996)

Philip E. Enterline, “Toxic Torts:  Are They Poisoning Scientific Literature?” 30 Am. J. Indus. Med. 121 (1996)

Joseph V. Rodricks & Susan H. Rieth, “Toxicological Risk Assessment in the Court:  Are Available Methodologies Suitable for Evaluating Toxic Tort and Product Liability Claims?,” 27 Reg. Toxicol. & Pharmacol. 21, 25-30 (1998)

Michael Green et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 333, 381, 383 (Federal Judicial Center ed., 2d ed. 2000), available at http://www.fjc.gov (“[E]pidemiology addresses whether an agent can cause a disease, not whether an agent did cause a specific plaintiff’s disease.  * * *  Nevertheless, the specific causation issue is a necessary legal element in a toxic substance case. The plaintiff must establish not only that the defendant’s agent is capable of causing disease but also that it did cause the plaintiff’s disease.  Thus, a number of courts have confronted the legal question of what is acceptable proof of specific causation and the role that epidemiologic evidence plays in answering that question. This question is not a question that is addressed by epidemiology. Rather, it is a legal question a number of courts have grappled with.”) (“[t]he civil burden of proof is described most often as requiring the fact finder to believe that what is sought to be proved is more likely true than not true. The relative risk from epidemiologic studies can be adapted to this 50% plus standard to yield a probability or likelihood that an agent caused an individual’s disease.”)

David W. Barnes, “Too Many Probabilities:  Statistical Evidence of Tort Causation,” 64 Law and Contemp. Problems 191, 206 (2001) (criticizing the uncritical use of a relative risk greater than two to signify the probability of causation, but acknowledging that sometimes a credible, precise RR, greater than 1.0, will be too small to support specific causation, such as the RR of 1.24 seen in the Allison case)

Russellyn S. Carruth & Bernard D. Goldstein, “Relative Risk Greater than Two in Proof of Causation in Toxic Tort Litigation,” 41 Jurimetrics 195 (2001) (criticizing the use of a relative risk of two benchmark, but acknowledging that when a disease has multiple causes and a substantial base rate in the general population, “there is no objective means to determine if a particular person’s disease was caused by some other environmental exposure, or by a non-environmental cause.”)

Richard W. Clapp & David Ozonoff, “Environment and Health:  Vital Intersection or Contested Territory?” 36 Am. J. L. & Med. 189, 210 (2004) (incorrectly describing the meaning of a confidence interval:  “A relative risk of 1.8, with confidence interval of 1.3 to 2.9 could very likely represent a true relative risk greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”)

Erica Beecher-Monas, Evaluating Scientific Evidence 58, 67 (N.Y. 2007)(“No matter how persuasive epidemiological or toxicological studies may be, they could not show individual causation, although they might enable a (probabilistic) judgment about the association of a particular chemical exposure to human disease in general.”)(“While significance testing characterizes the probability that the relative risk would be the same as found in the study as if the results were due to chance, a relative risk of 2 is the threshold for a greater than 50 percent chance that the effect was caused by the agent in question.”)(incorrectly describing significance probability as a point probability as opposed to tail probabilities)

Andrew W. Jurs, “Daubert, Probabilities and Possibilities and the Ohio Solution:  A Sensible Approach to Relevance Under Rule 702 in Civil and Criminal Applications,” 41 Akron L. Rev. 609, 637 (2008)(acknowledging that relative risks less than 2.0 invite jury speculation about individual, specific causation)

Relative Risks and Individual Causal Attribution Using Risk Size

March 18th, 2011

The relative risk argument is simple.  A relative risk of 1.0 means that the rate of disease incidence or mortality is the same among the exposed and control populations.  A relative risk of 2.0 means that the incidence rate in the exposed population is twice that in the controls.  The existence of an observed rate among the non-exposed controls suggests that we are dealing with a disease of “ordinary life,” for which there is an expected rate of occurrence.  Most chronic diseases, such as cancer, autoimmune disease, and cardiovascular disease, fall into this category of diseases of ordinary life.

If a study of a disease that is prevalent in the general population, say colon cancer, is conducted in an exposed cohort of workers, say asbestos insulators, and the study finds a relative risk of 1.5, we would have to take several steps to assess the finding’s relevance in litigation.  First, this positive association would have to be evaluated for causality.  Bias and confounding would have to be ruled out as explaining the apparent increase in risk.  Furthermore, the association would have to be evaluated for various indicia of causality, such as consistency with other studies, dose-response relationship between exposure and outcome, biological plausibility and coherence, and support from experimental studies.  In the case of asbestos and colon cancer, the causal hypothesis has repeatedly failed to be supported by such evaluations, but even if we were to assume general causation, arguendo, we would be left without a way to infer causation in a given case.  If plaintiff supported his case with evidence of a relative risk of 1.5, we would have 50% more observed cases than expected.  So if the observed population was expected to experience 100 colon cancer cases over the observation period, a relative risk of 1.5 means that 150 such cases were observed, or 100 expected cases and 50 putative excess cases.  Alas, there is no principled way to tell an excess case from an expected case, and the odds favor the defense two to one that any given case arose from the expected population as opposed to the excess group.  As a probability, the probability that plaintiff’s case arose from the excess portion is 33%, well below what is needed to support a sustainable claim.  Again, this assumes many facts in plaintiff’s favor, such as a perfect epidemiologic study, without bias or confounding, and with consistency among the findings of similar studies.  (None of these assumptions is even close to satisfied for asbestos and colon cancer.)
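For the arithmetically inclined, the computation just described can be put in a few lines of Python.  The inputs (100 expected cases; a relative risk of 1.5) are the hypothetical numbers from the paragraph above; the function name and structure are my own illustration, not anything drawn from the case law:

```python
# Minimal sketch of the relative-risk arithmetic described above.
# Assumes a valid, unconfounded relative risk; all numbers hypothetical.

def attributable_fraction(relative_risk: float) -> float:
    """Fraction of cases among the exposed attributable to the exposure,
    under the standard formula: (RR - 1) / RR."""
    return (relative_risk - 1.0) / relative_risk

rr = 1.5
expected = 100.0                 # cases expected absent the exposure
observed = expected * rr         # 150 cases observed in the exposed cohort
excess = observed - expected     # 50 putative excess cases

print(f"observed = {observed:.0f}; excess = {excess:.0f}")
print(f"P(a given case is an excess case) = {attributable_fraction(rr):.0%}")
# -> 33%, so the odds are 2 to 1 that any given case came from the
#    expected group; only when RR exceeds 2.0 does the attributable
#    fraction exceed the 50% preponderance threshold.
```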

In the Agent Orange litigation, Judge Weinstein implicitly recognized that very large relative risks suggest that an individual case was likely to have been related to its antecedent risks, while small relative risks suggest that any inference of specific causation from the antecedent risk is largely speculative, in the absence of some reliable marker of exposure-related causation.  See In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 785, 817 (E.D.N.Y. 1984)(plaintiffs must prove at least a two-fold increase in rate of disease allegedly caused by the exposure), aff’d, 818 F.2d 145, 150-51 (2d Cir. 1987)(approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988); see also In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1240, 1262 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

Ever since Judge Weinstein embraced the relative risk of two as an important benchmark to be exceeded if plaintiffs hoped to show specific causation, scientists who practice medicine for the redistribution of wealth have attacked the concept.  The challengers have urged that small relative risks, including relative risks of two or less, could suffice to support causal attribution in a given case, especially in the presence of relevant clinical findings.  The challengers, however, have been vague and evasive when it comes to identifying what the relevant clinical findings are, and how they operate to show that the risk has actually become part of the causal pathway that led to the individual’s injury or disease.

Among the most vociferous of the challengers has been Professor Sander Greenland, of the University of California Los Angeles School of Public Health.  Greenland has published his criticisms of the inference of a probability of individual causation from the relative risk on many occasions.  See, e.g., Sander Greenland & James Robins, “Conceptual Problems in the Definition and Interpretation of Attributable Fractions,” 128 Am. J. Epidem. 1185 (1988); James Robins & Sander Greenland, “The Probability of Causation Under a Stochastic Model for Individual Risk,” 45 Biometrics 1125 (1989); James Robins & Sander Greenland, “Estimability and Estimation of Excess and Etiologic Fractions,” 8 Statistics in Medicine 845 (1989); James Robins & Sander Greenland, “Estimability and Estimation of Expected Years of Life Lost Due to a Hazardous Exposure,” 10 Statistics in Medicine 79 (1991); Jan Beyea & Sander Greenland, “The Importance of Specifying the Underlying Biologic Model in Estimating the Probability of Causation,” 76 Health Physics 269 (1999); Sander Greenland, “Relation of Probability of Causation to Relative Risk and Doubling Dose:  A Methodologic Error That Has Become a Social Problem,” 89 Am. J. Pub. Health 1166 (1999); Sander Greenland & James Robins, “Epidemiology, Justice, and the Probability of Causation,” 40 Jurimetrics 321 (2000).

Greenland’s criticisms turn on various assumptions, such as that the risk may not be evenly distributed within the sampled population, or that the causal mechanism may accelerate onset of disease in such a way as to leave the relative risk unchanged in the study under consideration.  Greenland is correct that it is important to have a clear causal model in mind when evaluating the possibility of causal attributions in the light of population studies and their measures of relative risk.  He is also correct that his clever assumptions, if true, could affect the reasonableness of claiming that a relative risk of two or less supports the defense position in many toxic tort cases.  Unfortunately, Greenland’s clever assumptions and his arguments prove too much, because in many, if not most, cases the causal model is not defined.  There is often no evidence to support the plaintiffs’ claims of acceleration, or of sequestration of risk within the sampled population, and certainly no basis for claiming that the plaintiff belongs to a subset of “vulnerable” exposed persons with a higher than average risk that is reflected in the study relative risk.  Without evidence to support Greenland’s various assumptions, even relative risks higher than 2.0, say risks in the range of 2.0 to 20.0, would be unhelpful to support a plaintiff’s case.  We would be thrown back to the early case law that held that risk can never support individual attributions, and Judge Weinstein’s rather pragmatic pronouncement in Agent Orange would be thrown aside, to the benefit of defendants in toxic tort cases.
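To make the heterogeneity objection concrete, consider a short illustration with invented numbers; nothing below comes from Greenland’s papers or from any actual study.  A population-wide relative risk of 1.5 can coexist with a “vulnerable” subgroup whose members’ probability of causation exceeds 50%, but absent evidence placing a particular plaintiff in that subgroup, only the population-wide figure has evidentiary support:

```python
# Invented illustration of risk heterogeneity: a population RR of 1.5
# decomposed into two hypothetical subgroups with equal baseline risk.

def attributable_fraction(rr: float) -> float:
    return (rr - 1.0) / rr

# (share of exposed population, subgroup relative risk) -- assumed numbers
subgroups = [(0.20, 3.000),   # a vulnerable minority
             (0.80, 1.125)]   # everyone else

overall_rr = sum(share * rr for share, rr in subgroups)   # = 1.5
print(f"overall: RR = {overall_rr:.2f}, "
      f"PC = {attributable_fraction(overall_rr):.0%}")    # 33%

for share, rr in subgroups:
    print(f"subgroup ({share:.0%} of exposed): RR = {rr:.3f}, "
          f"PC = {attributable_fraction(rr):.0%}")
# The vulnerable subgroup's probability of causation is 67%, yet the
# study reports only the blended 1.5 -- which is Greenland's point, and
# also why the argument goes nowhere without evidence that a given
# plaintiff actually belongs to the high-risk subgroup.
```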

Last year, the Vermont Supreme Court reaffirmed the continuing vitality of the relative risk argument, on the original pragmatic justification offered by Judge Weinstein in the Agent Orange cases.  George v. Vermont League of Cities and Towns, 2010 Vt. 1, 993 A.2d 367 (Vt. 2010).  Indeed, George may well have been one of the best, and least heralded, decisions of 2010.

Mr. George had been a fireman before he died of non-Hodgkin’s lymphoma (NHL).  In administrative workman’s compensation proceedings, the Commissioner ruled that the widow failed to show a causal connection between firefighting and NHL, although there was an “association.”  His widow appealed the denial of benefits.  On de novo review, the trial court excluded plaintiffs’ expert witnesses on Rule 702 grounds.  (Vermont law follows federal law in requiring relevance and reliability of expert witnesses’ opinions.)  The case ended up before the Vermont Supreme Court, which had to review the trial court’s handling of the Rule 702 issues.

Several issues were at play.  The plaintiff had presented multiple expert witnesses, Drs. Tee Guidotti and James Lockey, who offered general and/or specific causation opinions on firefighting and NHL.  These witnesses relied upon epidemiologic studies, some of which had been incorporated into a meta-analysis, and a so-called “weight of the evidence” methodology.

The Vermont Supreme Court recognized the limits of using epidemiology to resolve the specific causation question in George. The Court found the Texas Supreme Court’s treatment of this issue to be persuasive: 

“epidemiological studies can assist in demonstrating a general association between a substance and a disease or condition, but they cannot prove that a substance actually caused a disease or condition in a particular individual.”

Id. at 374 (relying upon and quoting from Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 715 (Tex.1997)).

The Court also quoted from, and relied upon, the pronouncement of the Federal Judicial Center’s Reference Manual, which explains that “epidemiology is concerned with the incidence of disease in populations and does not address the question of the cause of an individual’s disease.  This question, sometimes referred to as specific causation, is beyond the domain of the science of epidemiology.” Id. at 375 (quoting from M. Green et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 333, 381 (2d ed. 2000); footnote omitted in court’s quotation of this source).

Faced with the academic and judicial criticisms of using the relative risk (which is sometimes referred to as “effect size”), the Court recognized the pragmatic compromise between science and the needs of the legal system embodied in using the relative risk as a benchmark showing for plaintiffs to make in toxic tort litigation:

“The trial court here adopted a relative risk factor of 2.0 as a benchmark, finding that it easily tied into Vermont’s ‘more likely than not’ civil standard and that such a benchmark was helpful in this case because the eight epidemiological studies relied upon by claimant’s experts reflected widely varying degrees of relative risk.”

 Id. at 375.

“Given claimant’s burden of proof, however, and the inherent limitations of epidemiological data in addressing specific causation, the trial court reasonably found the 2.0 standard to be a helpful benchmark in evaluating the epidemiological evidence underlying Dr. Guidotti’s opinion.”

Id. at 377.

“Mindful of this balance, we conclude that the trial court did not abuse its discretion in considering a relative risk greater than 2.0 as a reasonable and helpful benchmark under the circumstances presented here.”

 Id. at 378.

The Vermont Supreme Court was also clearly worried about how and why plaintiff’s expert witnesses selected some studies to include in their “weight of evidence” methodology.  Without an adequate explanation of selection and weighting criteria, the choices seemed like arbitrary “cherry picking.”  Id. at 389.  This worry is amply justified.  Weight of the evidence methodology is notoriously vague and indeterminate; unless the criteria for weighting are pre-specified and rigorously followed, claims based upon this methodology may be little more than subjective preferences.  See, e.g., Douglas L. Weed, “Weight of Evidence: A Review of Concept and Methods,” 25 Risk Analysis 1545 (2005).

In part, plaintiff’s expert witnesses also relied upon a meta-analysis of observational studies that looked at NHL risk among firefighters.  The Court was concerned about the plaintiffs’ expert witnesses’ failure to explain the selection and weighting of studies in the meta-analysis.  This criticism may reflect nothing more than the witnesses’ failure to explain the methodology of a published study, which in turn may have properly used an acceptable methodology to provide a summary estimate of risk of NHL among firefighters.  The meta-analysis in question, however, appears to have found a summary risk estimate of 1.51, with a 95% confidence interval of 1.31-1.73.  G.K. LeMasters, et al., “Cancer risk among firefighters: a review and meta-analysis of 32 studies,” 48 J. Occup. Envt’l Med. 1189 (2006).  The plaintiff’s expert witnesses were thus relying upon a study that quantified the increased risk at 51%, with an upper bound, from sampling variability, of 73%.  To the extent that the plaintiff had succeeded in providing reliable evidence of increased risk, she had also succeeded in showing that a doubling, or more, of the risk for NHL was statistically unlikely.  This is hardly a propitious way to win a lawsuit.
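A quick computation, applying the standard (RR − 1)/RR probability-of-causation formula discussed above to the summary estimate and confidence interval reported by LeMasters, shows just how badly these numbers cut against the claimant.  The code is merely an illustration of the arithmetic:

```python
# Probability of causation implied by the LeMasters (2006) summary
# relative risk of 1.51 (95% CI, 1.31-1.73) for NHL among firefighters.
for label, rr in [("point estimate", 1.51),
                  ("lower 95% bound", 1.31),
                  ("upper 95% bound", 1.73)]:
    print(f"{label}: RR = {rr:.2f} -> PC = {(rr - 1.0) / rr:.1%}")
# point estimate -> 33.8%; even at the CI's upper bound, only 42.2% --
# a doubling of risk (RR > 2.0, PC > 50%) lies outside the interval.
```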

Risk ≠ Causation

March 12th, 2011

Evidence of risk is not evidence of causation.  It never has been; it never will be. Risk and causation are distinct concepts.  Processes, events, or exposures may be risks; that is, they may be capable of causing an outcome of interest.  Risk, however, is an ex ante concept.  We can speak of a risk only before the outcome of interest has occurred.  After its occurrence, we are interested in what caused the outcome.

Before the tremendous development of epidemiology in the decades after World War II, most negligence and products liability cases involved mechanistic conceptions of causation.  Juries and courts considered claims of causation that conceptually were framed in the manner of billiard balls hitting one another until the final billiard ball of interest went into the pocket.  Litigants and courts did not need to consider statistical evidence when considering whether a saw dismembered a plaintiff, or even whether chronic asbestos exposure caused inflammation and scarring in the lungs of workers.  In some instances, judicial efforts to cast causation as a mechanistic process smack of quackery.  Claims that blunt trauma caused malignant tumors at the site of the trauma, within days or weeks of the impact, come to mind as an example of magical thinking that plagued courts and juries in an era that was short on scientific gatekeeping, and long on deferring to clinical judgment empty of meaningful scientific support.  See, e.g., Baker v. DeRosa, 413 Pa. 164, 196 A.2d 387 (1964)(holding that question whether car accident caused tumor was for the jury).

The advent of epidemiologic evidence introduced an entirely different class of claims, ones that were based upon stochastic concepts of causation.  The exposure, event, or process that was a putative cause had a probabilistic element to its operation.  The putative cause exercised its contribution to the outcome through a random process, which changed the frequency of the harmful outcome in those who encountered the exposure.  In addition, the outcome that resulted from the “putative cause” was frequently indistinguishable from those outcomes that arose spontaneously, from other causes in the environment, or from normal human aging.  Discerning which risks (or “putative causes”) operated in a given case of chronic human disease (such as cancer, cardiovascular disease, or autoimmune disease) became a key issue for courts and litigants’ expert witnesses.  The black box of epidemiology, however, sheds little or no light on the issue, and no other light source was available.

Today, expert witnesses, typically for plaintiffs, equate risk with causation.  Because risk is an ex ante concept, the inference from risk to causation is problematic.  In rare instances, the risk is absolute under the circumstances of the plaintiff’s manifestation, such that the outcome can be tied to the exposure that created the risk.  In most cases, however, there will have been other competing risks, which alone could have operated to produce the outcome of which the plaintiff complains.  In toxic tort litigation, we frequently see a multiplicity of pre-existing risks for a chronic disease that is prevalent in the entire population.  When claimants attempt to show causation for such outcomes by epidemiologic evidence, the inference of causation from a particular prior risk is typically little more than a guess.

One well-known epidemiologist explained the limits of inferences with respect to stochastic causation:

“An elementary but essential principle that epidemiologists must keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between exposure and disease.”   ****

“In a courtroom, experts are asked to opine whether the disease of a given patient has been caused by a specific exposure.  This approach of assigning causation in a single person is radically different from the epidemiologic approach, which does not attempt to attribute causation in any individual instance.  Rather, the epidemiologic approach is to evaluate the proposition that the exposure is a cause of the disease in a theoretical sense, rather than in a specific person.”

Kenneth Rothman, Epidemiology: An Introduction 44 (Oxford 2002)(emphasis added). 

Another epidemiologist, who co-wrote the chapter on epidemiology in the Federal Judicial Center’s Reference Manual on Scientific Evidence, put the matter thus:

“Epidemiology answers questions about groups, whereas the court often requires information about individuals.”

Leon Gordis, Epidemiology (Philadelphia 3d ed. 2004)(emphasis in original).  Accord G. Friedman, Primer of Epidemiology 2 (2d ed. 1980) (epidemiologic studies address causes of disease in populations, not causation in individuals); Sander Greenland, “Relation of the Probability of Causation to Relative Risk and Doubling Dose:  A Methodologic Error that Has Become a Social Problem,” 89 Am. J. Pub. Health 1166, 1168 (1999)(“[a]ll epidemiologic measures (such as rate ratios and rate fractions) reflect only the net impact of exposure on a population”); Joseph V. Rodricks & Susan H. Rieth, “Toxicological Risk Assessment in the Courtroom:  Are Available Methodologies Suitable for Evaluating Toxic Tort and Product Liability Claims?” 27 Regulatory Toxicol. & Pharmacol. 21, 24-25 (1998)(noting that a population risk applies to individuals only if all persons within the population are the same with respect to the influence of the risk on outcome).

These cautionary notes are important reminders of the limits of epidemiologic method.  What these authors miss is that there may be no other principled way to connect one pre-existing risk, among several, to an outcome that is claimed to be tortious.  As the young, laconic Wittgenstein wrote: 

“Wovon man nicht sprechen kann, darüber muß man schweigen.” 

L. Wittgenstein, Tractatus Logico-Philosophicus, Proposition 7 (1921)(translated by Ogden as “Whereof one cannot speak, thereof one must be silent”).  Unfortunately, expert witnesses in legal proceedings sometimes do not feel the normative force of Wittgenstein’s Proposition 7, and they speak without restraint.  As a contemporary philosopher explained in a more accessible idiom,

“Bullshit is unavoidable whenever circumstances require someone to talk without knowing what he is talking about.  Thus the production of bullshit is stimulated whenever a person’s obligations or opportunities to speak about some topic exceed his knowledge of the facts that are relevant to that topic.”

Harry Frankfurt, On Bullshit 63 (Princeton University Press 2005).

Judicial Innumeracy and the MDL Process

February 26th, 2011

In writing previously about the Avandia MDL Court’s handling of the defendants’ Daubert motion, I noted the trial court’s erroneous interpretation of statistical evidence.  See “Learning to Embrace Flawed Evidence – The Avandia MDL’s Daubert Opinion” (Jan. 10, 2011).  In fact, the Avandia court badly misinterpreted the meaning of a p-value, a basic concept in statistics:

“The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance).”

In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576, *12 (E.D. Pa. 2011) (internal citation omitted).  The Avandia MDL court was not, however, the first to commit this howler.  Professor David Kaye collected examples of statistical blunders from published cases in a 1986 law review article, and again, in his chapter on statistical evidence in the Federal Judicial Center’s Reference Manual on Scientific Evidence, compiled a list of erroneous interpretations:

United States v. Georgia Power Co., 474 F.2d. 906, 915 (5th Cir. 1973)

National Lime Ass’n v. EPA, 627 F.2d 416, 453 (D.C. Cir. 1980)

Rivera v. City of Wichita Falls, 665 F.2d 531, 545 n.22 (5th Cir. 1982) (“A variation of two standard deviations would indicate that the probability of the observed outcome occurring purely by chance would be approximately five out of 100; that is, it could be said with a 95% certainty that the outcome was not merely a fluke.”);

Vuyanich v. Republic Nat’l Bank, 505 F. Supp. 224, 272 (N.D. Tex. 1980) (“[I]f a 5% level of significance is used, a sufficiently large t-statistic for the coefficient indicates that the chances are less than one in 20 that the true coefficient is actually zero.”), vacated, 723 F.2d 1195 (5th Cir. 1984)

Craik v. Minnesota State Univ. Bd., 731 F.2d 465, 476 n.13 (8th Cir. 1984)(“[a] finding that a disparity is statistically significant at the 0.05 or 0.01 level means that there is a 5 percent or 1 percent probability, respectively, that the disparity is due to chance.”).  See also id. at 510 (Swygert, J., dissenting)(stating that coefficients were statistically significant at the 1% level, allowing him to say that “we can be 99% confident that each was different from zero.”)

Sheehan v. Daily Racing Form, Inc., 104 F.3d 940, 941 (7th Cir. 1997) (“An affidavit by a statistician . . . states that the probability that the retentions . . . are uncorrelated with age is less than 5 percent.”)

Waisome v. Port Authority, 948 F.2d 1370, 1376 (2d Cir. 1991) (“Social scientists consider a finding of two standard deviations significant, meaning there is about one chance in 20 that the explanation for a deviation could be random . . . .”)

David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” in Reference Manual on Scientific Evidence 83, 122-24 (2d ed. 2000); David H. Kaye, “Is Proof of Statistical Significance Relevant?” 61 Wash. L. Rev. 1333, 1347 (1986)(pointing out that before 1970, there were virtually no references to “statistical significance” or p-values in reported state or federal cases).
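The error in all of these formulations is the same: a p-value is computed on the assumption that chance alone is at work, so it cannot, without more, tell us the probability that chance produced the observed result.  A small, stdlib-only simulation (my own illustration, with arbitrary parameters, not anything from Kaye’s chapter) makes the point:

```python
# Simulate experiments in which the null hypothesis is TRUE by
# construction, and compute a two-sided p-value for each.
import random
from math import erf, sqrt

random.seed(1)

def null_experiment_p(n: int = 50) -> float:
    """Compare means of two samples drawn from the SAME N(0, 1)
    population (known variance), so any 'difference' is chance alone."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / sqrt(2.0 / n)
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)                      # two-sided p-value

pvals = [null_experiment_p() for _ in range(10_000)]
print(sum(p <= 0.05 for p in pvals) / len(pvals))   # ~0.05
# Chance is the ONLY thing operating here, yet ~5% of experiments reach
# p <= .05.  A p-value of .08 is the probability of data at least this
# extreme GIVEN chance alone -- not a "92% likelihood" that the observed
# difference was not the result of chance.
```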

Notwithstanding the educational efforts of the Federal Judicial Center, the innumeracy continues, and with the ascent of the MDL model for addressing mass torts, many recent howlers have come from trial judges given responsibility for overseeing the pretrial coordination of thousands of lawsuits.  In addition to the Avandia MDL Court, here are some other recent erroneous statements that can be added to Professor Kaye’s lists: 

“Scientific convention defines statistical significance as “P ≤ .05,” i.e., no more than one chance in twenty of finding a false association due to sampling error.  Plaintiffs, however, need only prove that causation is more-probable-than-not.”

In re Ephedra Prods. Liab. Litig., 393 F.Supp.2d 181, 193 (S.D.N.Y. 2005)(confusing the standard for Type I statistical error with the burden of proof).

“More-probable-than-not might be likened to P < .5, so that preponderance of the evidence is nearly ten times less significant (whatever that might mean) than the scientific standard.”

Id. at 193 n.9 (same). 

In the Phenylpropanolamine litigation, the error was even more clearly stated, for both p-values and confidence intervals:

“P-values measure the probability that the reported association was due to chance… .”

“… while confidence intervals indicate the range of values within which the true odds ratio is likely to fall.”

In re Phenylpropanolamine Products Liab. Litig., 289 F. Supp. 2d 1230, 1236 n.1 (W.D. Wash. 2003)
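The confidence-interval misstatement is the mirror image of the p-value error.  A 95% interval describes the long-run performance of the interval-generating procedure, not a 95% probability that any one computed interval contains the true value.  Another small simulation (again my own illustration, with arbitrary parameters) shows the distinction:

```python
# Coverage of a textbook 95% confidence interval for a known true mean.
import random
from math import sqrt

random.seed(2)
TRUE_MEAN, SD, N, Z95 = 10.0, 1.0, 50, 1.96

def interval(sample):
    m = sum(sample) / len(sample)
    half = Z95 * SD / sqrt(len(sample))   # known-sigma z interval
    return m - half, m + half

covered = 0
for _ in range(10_000):
    lo, hi = interval([random.gauss(TRUE_MEAN, SD) for _ in range(N)])
    covered += (lo <= TRUE_MEAN <= hi)
print(covered / 10_000)   # ~0.95
# The PROCEDURE captures the true mean in ~95% of repetitions; any one
# computed interval either contains it or it does not -- it is not a
# range within which the true value falls with 95% probability.
```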

These misstatements raise important questions about judicial competency for gatekeeping, the selection, education, and training of judges, the assignment of MDL cases to individual trial judges, and the aggregation of Rule 702 motions to a trial judge for a single, one-time decision that will control hundreds if not thousands of cases.

Recently, a student published a bold note that argued for the dismantling of judicial gatekeeping.  Note, “Admitting Doubt: A New Standard for Scientific Evidence,” 123 Harvard Law Review 2021 (2010).  With all the naiveté of someone who has never tried a jury trial, the student argued that juries are at least as good as judges, if not better, at handling technical questions.  The empirical evidence for such a suggestion is slim, and it ignores the geographic variability in jury pools.  The above instances of erroneous statistical interpretations might seem to support the student’s note, but that argument would miss two important points:

  • these errors are put on display for all to see, and for commentators to note and correct, whereas jury decisions obscure their mistakes; and
  • judges can be singled out for their technical competencies, and given appropriate assignments (which hardly ever happens at present), and judges can be required to partake in professional continuing legal education, which might well include training in technical areas to improve their decision making.

The Federal Judicial Center, and its state court counterparts, have work to do.  Lawyers also have an obligation to help courts get difficult, technical issues right.  Finally, courts, lawyers, and commentators need to rethink how the so-called Daubert process works, and does not work, especially in the high-stakes arena of multi-district litigation.

Can Daubert Survive the Multi-District Litigation Process?

February 23rd, 2011

The so-called Daubert process, by which each side in a lawsuit may challenge and seek preclusion of the other side’s expert witnesses, arose in the setting of common-law judges making rulings in individual cases.  Indeed, the Daubert case itself, although one of many cases involving claims of birth defects allegedly caused by Bendectin, was an individual case. 

In the silicone gel breast implant (SGBI) litigation, the process evolved over time, with decisions from different judges, each of whom saw the evidence differently.  The different judges brought different insights and aptitudes to bear on the evidence, and the expert witnesses themselves may have varied in their approaches and reliance upon different studies.  This incrementalist approach, in the context of the SGBI litigation, worked to the benefit of the defendants, in part because their counsel learned about the fraudulent evidence underlying certain studies, and about serious lapses in the standard of research care on the part of some investigators whose studies were prominently relied upon by plaintiffs’ counsel.  In the case of one dubious study, one of its authors, Marc Lappe, a prominent expert witness for plaintiffs, withdrew his support from the conclusions advanced in the study.

Early decisions in the SGBI cases (shortly after the Supreme Court’s decision in Daubert, in 1993) denied the defendants’ applications to preclude plaintiffs’ expert witnesses’ opinion testimony.  Later decisions converged upon the unavoidable truth that the case for SGBIs causing atypical or typical connective tissue diseases was a house of cards, built mostly with jokers.  If the Daubert process had been censored after the first hearing, the result would have been to deem all the breast implant cases trial and jury worthy, to the detriment of the judicial process, to the public’s interest in knowing the truth about silicone biomaterials, to the defendants’ reputational and financial interests, and to the interests of the claimants who had been manipulated by their counsel and support group leaders.

The evolutionary approach taken in the SGBI litigation was indirectly supported by the late Judge Sam Pointer, who presided over the SGBI federal multi-district litigation (MDL).  Judge Pointer strongly believed that the decision to exclude expert testimony belonged to individual trial judges, who received cases on remand from the MDL 926, when the cases were ready for trial.  Judge Pointer ruled on expert witness challenges in cases set for trial before him, but he was not terribly enthusiastic about the Daubert process, and denied most of the motions in a fairly perfunctory fashion.  Because of this procedural approach, Judge Pointer’s laissez-faire attitude towards expert witness testimony did not interfere with the evolutionary process that allowed other courts to see through the dense fog in the plaintiffs’ case.

Since MDL 926, the MDL process has absorbed the ritual of each side’s challenging the other’s expert witnesses, and MDL judges view their role as including the hearing and deciding all pre-trial Daubert challenges.  It has been over 17 years since the Supreme Court decided Daubert, and in that time, the MDL model, both state and federal, has become dominant.  As a result, the Daubert process has often been truncated and abridged to a single motion, decided at one time, by one judge.  The results of this abridgement have not always been happy for ensuring reliable and accurate gatekeeping. 

The MDL process appears to have broken the promise of Rule 702 in many cases.  By putting the first and only Rule 702 gatekeeping decision in the hands of a single judge, charged with making pre-trial rulings in the entire MDL, the MDL process has sapped the gatekeeping process of its dynamic, evolutionary character.  No longer can litigants and judges learn from previous efforts, as well as from commentary by scientists and legal scholars on the prior outcomes.  For judges who lack scientific and analytical acumen, this isolation from the scientific community works to the detriment of the entire process.

To be sure, the MDL process for deciding Rule 702 motions is efficient.  In many cases, expensive motions, briefings, and hearings are reduced to one event.  The incorporation of expert challenges into an MDL may improve fairness in some instances by allowing well-qualified plaintiffs’ counsel to wrest control of the process from unprepared plaintiffs’ counsel who are determined to control their individual cases.  Defendants may embrace the MDL process because it permits a single, unified document production and discovery schedule for corporate executives.  Perhaps defendants see the gains from the MDL process as sufficiently important to forgo the benefit of a fuller opportunity to litigate the expert witness issues.  Whatever can be said in favor of using the MDL forum to resolve expert witness challenges, it is clear that MDL procedures limit the parties’ ability to refine their challenges over time, and to incorporate new evidence and discovery gained after the first challenges are resolved.  In the SGBI litigation, for instance, the defendants learned of significant scientific malfeasance and misfeasance that undermined key studies relied upon by plaintiffs, including some studies done by apparently neutral, well-credentialed scientists.  The omnibus MDL Daubert motion prevents either side, or the judiciary, from learning from the first and only motion.

Another example of an evidentiary display that has changed over time comes from the asbestos litigation, where plaintiffs continue to claim that asbestos causes gastrointestinal cancer.  The first such cases were pressed by plaintiffs in the early 1980s, with the support of Dr. Selikoff and his cadre of testifying physicians and scientists.  A few years ago, however, the Institute of Medicine convened a committee to review non-pulmonary cancers and asbestos, and concluded that the studies, now accumulated over 35 years since Dr. Selikoff’s ipse dixit, do not support a conclusion that asbestos causes colorectal cancer.  Institute of Medicine of the National Academies, Asbestos: Selected Health Effects (2006).

Unfortunately, many trial judges view the admissibility and sufficiency of causation opinions on asbestos and colorectal cancer as “grandfathered” by virtue of the way business has been conducted in trial courts for over three decades.  Still, defendants have gained the opportunity to invoke an important systematic review, which shows that the available evidence does not reliably support the conclusion urged by plaintiffs’ expert witnesses. 

The current approach of using the MDL as the vehicle for resolving expert witness challenges raises serious questions about how MDLs are assigned to judges, and whether those judges have the analytical or quantitative skills to resolve Daubert challenges.  Assigning an MDL to a judge, who will have to rule on the admissibility of expert witness opinion testimony she or he does not understand, does not inspire confidence in the judicial process.  At least in the ad hoc approach employed in the SGBI litigation, the parties could size up their trial judge, and decide whether to forgo their expert challenges based upon that assessment.  Furthermore, an anomalous outcome could be corrected over a series of decisions.  The MDL process, on the other hand, frequently places the Rule 702 decision in the discretion of a single judge.  The selection criteria for that sole decision maker thus become critical.  As equity in days of old varied with the size of the Chancellor’s foot, today’s scientific equity under Rule 702 may vary with the accuracy of the trial judge’s slide rule.

The Other Shoe Drops for GSK in Avandia MDL — Hand Waving on Specific Causation

January 24th, 2011

For GSK, the other shoe dropped in the Avandia multi-district litigation, on January 13, 2011, when the presiding judge denied the defense challenge to plaintiff’s expert witnesses’ specific causation opinions, in the first case set for trial.  Burford v. GlaxoSmithKline, PLC, 2011 WL 135017 (E.D.Pa. 2011).

In the MDL court’s opinion on general causation, In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576 (E.D. Pa. 2011), Judge Rufe determined that she was bound to apply a “Third Circuit” approach to expert witness gatekeeping, which focused on the challenged expert witnesses’ methodology, not their conclusions.  In Burford, Judge Rufe, citing two Third Circuit cases that were decided after Daubert, but before Joiner, repeats this basic mistake.  Burford, 2011 WL 135017, *2.  Remarkably, the court’s opinion in Burford recites the current version of Federal Rule of Evidence 702, which states that the court must analyze expert witnesses’ conclusions for being based upon “sufficient facts or data,” as well as for being “the product of reliable principles and methods.”  The Rule mandates consideration of the reliability and validity of the witness’s conclusions, if those conclusions are part of his testimony.  This Rule, enacted by Congress in 2000, is a statute, and thus supersedes prior case law, although the Advisory Notes explain that the language of the rule draws heavily from the United States Supreme Court’s decisions in Daubert, Joiner, and Kumho Tire.  The Avandia MDL court ignored both the post-Daubert decisions of the Supreme Court, as well as the controlling language of the statute, in gatekeeping opinions on general and specific causation.

Two expert witnesses on specific causation were the subject of GSK’s challenge in Burford:  Dr. Nicholas DePace and Dr. Judy Melinek.  The court readily dispatched Dr. Melinek, who opined that Mr. Burford’s fatal cardiac event, which she characterized as a heart attack, was caused by Avandia because Avandia causes heart attacks.  The court correctly noted that this inference was improper because risk does not equal causation in a specific case.

As one well-known epidemiologist has put it:

“An elementary but essential principle that epidemiologists must keep in mind is that a person may be exposed to an agent and then develop disease without there being any causal connection between exposure and disease.”

* * *

“In a courtroom, experts are asked to opine whether the disease of a given patient has been caused by a specific exposure.  This approach of assigning causation in a single person is radically different from the epidemiologic approach, which does not attempt to attribute causation in any individual instance.  Rather, the epidemiologic approach is to evaluate the proposition that the exposure is a cause of the disease in a theoretical sense, rather than in a specific person.”

Kenneth Rothman, Epidemiology: An Introduction 44 (Oxford 2002)(emphasis added).

In addressing the admissibility of Dr. DePace’s expert opinion, however, the MDL Court is led astray by Dr. DePace’s handwaving about having considered and “ruled out” Mr. Burford’s other risk factors. 

To be sure, Dr. DePace has some ideas about how Avandia may, plausibly, cause heart attacks.  In particular, Dr. DePace identified three plausible mechanisms, each of which would have been accompanied by some biomarker (elevated blood lipids, elevated Lp-PLA2, or hypoglycemia).  This witness, however, could not opine that any of these mechanisms was in operation in producing Mr. Burford’s fatal cardiac event.  Burford, at *3.

Undaunted, Dr. DePace opined that he had ruled out Mr. Burford’s other risk factors, but his opinion, even as recounted in Judge Rufe’s narrative, is clearly hand waving and dissembling.  First, everyone, including every middle-aged man, has a risk of heart attack or cardiac arrest, although that risk may be modified – increased or lowered – by risk factors or preventive factors.  Mr. Burford had severe diabetes, which, in and of itself, is a risk factor, commonly recognized to equal the size of the risk from having had a previous heart attack.  So Mr. Burford was not at baseline risk; indeed, he started all his diabetes medications with the equivalent risk of someone who had already had a heart attack.

Dr. DePace apparently opined that Mr. Burford’s diabetes, his blood sugar level, was well controlled.  The court accepted this contention at face value, although the reader of the court’s opinion will know that it is rubbish.  Although the court does not recite any blood sugar levels, its narrative of facts includes the following course of medications for Mr. Burford:

  • June 2004, diagnosed with type II diabetes, and treated with metformin
  • April 2005, dose of metformin doubled
  • August 2005, Avandia added to double dose of metformin
  • December 2005, Avandia dose doubled as well
  • June 2006, metformin dose doubled again
  • October or November 2006, sulfonylurea added to Avandia and metformin

This narrative hardly suggests good control.  Mr. Burford was on a downward spiral of disease, which in a little over two years took him from diagnosis to three medications to try to control his diabetes.  Despite adding Avandia to metformin, doubling the dose of Avandia, and doubling and then quadrupling the dose of metformin, Mr. Burford still required yet another, third medication to achieve glycemic control.  Of course, an expert witness can say anything, but the federal district court is supposed to act as a gatekeeper, to protect juries and parties from such ipse dixit.  Many opinions will be difficult to evaluate, but here, Dr. DePace’s opinion about glycemic control in Mr. Burford comes with a banner headline, which shouts “bogus.”

The addition of a third medication, a sulfonylurea, known to cause hypoglycemia (dangerously low blood sugar), which in turn can cause cardiac events and myocardial infarction, is particularly troubling.  See “Sulfonylurea,” Wikipedia (accessed January 24, 2011).  Sulfonylureas act by stimulating the pancreas to produce more insulin, and the sudden addition of this medication to an already aggressive regimen of medication clearly had the ability to induce hypoglycemia in Mr. Burford.  Dr. DePace notes that there is no evidence of a hypoglycemic event, which is often true of diabetic patients who experience sudden death, but the gatekeeping court should have noticed that Dr. DePace’s lack of evidence did not equate to evidence that the risk, or an actual causal role, of hypoglycemia was lacking.  Again, the trial court appeared to be snookered by an expert witness’s hand waving.  Surely gatekeepers must be made of sterner stuff.

Perhaps most wrongheaded is the MDL court’s handling, or failure to handle, risk as causation in Dr. DePace’s testimony.

In his deposition, Dr. DePace testified that a heart attack in a 49-year-old man was “very unusual.”  Such a qualitative opinion does not help the finder of fact.  A heart attack is more likely in any 49-year-old man than in any 21-year-old man, although men of both ages can and do suffer heart attacks.  Clearly, a heart attack is more likely in a 49-year-old man who has had diabetes, which has required intensive medication for even a semblance of control, than in a 49-year-old man who has never had diabetes.  Dr. DePace’s opinions fail to show that Mr. Burford had no baseline risk in the absence of one particular medication, or that this baseline risk was not itself sufficient to produce his alleged heart attack.

Rather than being in a high-risk group with respect to his Avandia use, according to the FDA’s 2007 meta-analysis, Mr. Burford and other patients on “triple therapy” (Avandia + metformin + sulfonylurea) would have had an odds ratio of 1.1 for any myocardial ischemic event, not statistically significant, as a result of their Avandia use.  Mr. Burford’s additional use of an ACE-inhibitor, along with his three diabetes medications, would place him into yet another sub-subgroup.  Whatever modification or interaction this additional medication created in combination with Avandia, the confidence intervals, which were wide for the odds ratio of 1.1, would become extremely wide, allowing no meaningful inference.  In any event, the court in Burford does not tell us what the risk was opined to be, and whether there were good data and facts to support such an opinion.  Remarkably absent from the court’s opinion in Burford is any consideration of the actual magnitude of the claimed risk (in terms of a hazard ratio, relative risk, odds ratio, risk difference, etc.) for patients like Mr. Burford.  Further absent is any consideration of whether any study showing risk has further shown the risk to be statistically different from 1.0 (no increased risk at all).
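Some simple arithmetic drives the point home.  The sketch below, in Python, computes a Woolf-type 95% confidence interval around an odds ratio from a 2×2 table.  The counts are hypothetical, chosen only to show how sparse sub-subgroup data yield an interval far too wide to support any meaningful inference; they are not the FDA’s actual subgroup data.

  from math import exp, log, sqrt
  from statistics import NormalDist

  def odds_ratio_ci(a, b, c, d, alpha=0.05):
      # a, b = events / non-events (exposed); c, d = events / non-events (unexposed)
      or_hat = (a * d) / (b * c)
      se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)    # standard error of ln(OR), Woolf's method
      z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for a 95% interval
      return or_hat, exp(log(or_hat) - z * se_log_or), exp(log(or_hat) + z * se_log_or)

  # Hypothetical sparse subgroup: 11 events among 501 exposed, 10 among 499 unexposed.
  print(odds_ratio_ci(11, 490, 10, 489))   # OR about 1.10; 95% CI roughly 0.46 to 2.61

An interval running from well below 1.0 to well above 2.0 is compatible with a halving of risk, with no effect, and with more than a doubling of risk; that is precisely why slicing patients into ever-finer sub-subgroups forecloses meaningful inference.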

As Ted Frank has noted on PointofLaw Forum, the Avandia MDL raises serious questions about the allocation of technical multi-district litigation cases to judges in the federal system.  “It is hard to escape the conclusion that the MDL denied GSK intellectual due process of law” (January 21, 2011).  The Avandia experience also raises questions about the efficacy of the Federal Judicial Center’s program to train judges in the basic analytical, statistical, and scientific disciplines needed in their gatekeeping capacity. 

Although the Avandia MDL court’s assessment that Dr. DePace’s opinion was suboptimal, Burford at *4, may translate into GSK’s ability to win before a jury, the point of Rule 702 is that a party should not have to stand trial on such shoddy evidence.

Power in the Courts — Part Two

January 21st, 2011

Post hoc calculations of power were once in vogue, but they are now routinely condemned by biostatisticians and epidemiologists for studies that report confidence intervals around estimates of associations, or “effect sizes.”  Power calculations require an alternative hypothesis against which to measure the rejection of the null hypothesis, and the choice of the alternative is subjective and often arbitrary.  Furthermore, the power calculation must make assumptions about the anticipated variance of the data to be obtained.  Once the data are in fact obtained, those assumptions may turn out to be wrong.  In other words, sometimes the investigators are “lucky,” and their data are less variable than anticipated.  The variance of the data actually obtained, rather than hypothesized, can best be appreciated from the confidence interval around the actually measured point estimate of risk.

In Part One of “Power in the Courts,” I addressed the misplaced emphasis the Avandia MDL court put upon the concept of statistical power.  The court apparently accepted at face value the plaintiffs’ argument that GSK’s clinical trials were “underpowered,” a claim that was very misleading.  Power calculations were no doubt done to choose sample sizes for GSK’s clinical trials, but those a priori estimates were based upon assumptions.  In the case of one very large trial, RECORD, many fewer events occurred than anticipated (which is generally a good thing to happen, and not unusual in the context of a clinical trial that gives patients in all arms of the trial better healthcare than is available to the general population).  In one sense, those plaintiffs’ expert witnesses are correct to say that RECORD was “underpowered,” but once the study is done, the real measure of statistical precision is given by the confidence interval.

Because the Avandia MDL is not the only litigation in which courts and lawyers have mistakenly urged power concepts for studies that have already been completed, I have collected some key statements that reflect the general consensus and reasoning against what the Court did.

To be fair, the Avandia court did not fault the defense for not having calculated the post-hoc power of the clinical trials, all of which failed to find statistically significant associations between Avandia and heart attacks.  The court, however, did appear to embrace the plaintiffs’ rhetoric that all the Avandia trials were underpowered, without any consideration given to the width and the upper bounds of the confidence intervals around those trials’ estimates of risk ratios for heart attack.  Remarkably, the Avandia court did not present any confidence intervals for any estimates of effect size, although it did present p-values, which it then badly misinterpreted.  Many of the Avandia trials (and the resulting meta-analyses) confidently ruled out risk ratios, for heart attacks, in excess of 2.0.  The court’s conclusions about power are thus misleading at best.
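The arithmetic behind that observation is straightforward.  Here is a minimal sketch, with made-up trial counts rather than data from any actual Avandia trial, of how the upper bound of a 95% confidence interval around a risk ratio can rule out a doubling of risk even though the point estimate is not statistically significant:

  from math import exp, log, sqrt
  from statistics import NormalDist

  def risk_ratio_ci(events_t, n_t, events_c, n_c, alpha=0.05):
      # Katz log method: confidence interval for a risk ratio from two binomial arms
      rr = (events_t / n_t) / (events_c / n_c)
      se = sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)   # standard error of ln(RR)
      z = NormalDist().inv_cdf(1 - alpha / 2)
      return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

  # Hypothetical trial: 30 heart attacks among 2,500 treated; 25 among 2,500 controls.
  print(risk_ratio_ci(30, 2500, 25, 2500))   # RR 1.20; 95% CI roughly 0.71 to 2.03

The interval includes 1.0, so the result is not statistically significant, but its upper bound sits at about 2.0; the completed trial thus speaks directly to which risk ratios remain plausible, something a pre-trial power figure cannot do.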

Several consensus statements address whether considerations of power, after studies are completed and the data are analyzed, are appropriate.  The issue has also been addressed extensively in textbooks and in articles.  I have collected some of the relevant statements, below.  To the extent that the Federal Judicial Center’s Reference Manual on Scientific Evidence appears to urge post hoc power calculations, I hope that the much-anticipated Third Edition will correct the error.

CONSENSUS STATEMENTS

CONSORT

The CONSORT group (Consolidated Standards of Reporting Trials) is a worldwide group that sets quality standards for the reporting of randomized trials in the testing of pharmaceuticals.  CONSORT’s lead author is Douglas Altman, a well-respected biostatistician from Oxford University.  The advice of the CONSORT group is clear:

“There is little merit in calculating the statistical power once the results of the trial are known, the power is then appropriately indicated by confidence intervals.”

Douglas Altman, et al., “The Revised CONSORT Statement for Reporting Randomized Trials:  Explanation and Elaboration,” 134 Ann. Intern. Med. 663, 670 (2001).  See also Douglas Altman, et al., “Reporting power calculations is important,” 325 Br. Med. J. 1304 (2002).

STROBE

An effort similar to that of the CONSORT group has been put together by investigators interested in observational studies, the STROBE group (Strengthening the Reporting of Observational Studies in Epidemiology).  The STROBE group was made up of leading epidemiologists and biostatisticians, who addressed persistent issues and errors in the reporting of observational studies.  Their advice was equally unequivocal on the issue of post hoc power considerations:

“Do not bother readers with post hoc justifications for study size or retrospective power calculations. From the point of view of the reader, confidence intervals indicate the statistical precision that was ultimately obtained. It should be realized that confidence intervals reflect statistical uncertainty only, and not all uncertainty that may be present in a study (see item 20).”

Vandenbroucke, et al., “Strengthening the reporting of observational studies in epidemiology (STROBE):  Explanation and elaboration,” 18 Epidemiology 805, 815 (2007) (Section 10, sample size).

American Psychological Association

In 1999, a committee of the American Psychological Association met to discuss various statistical issues in psychological research papers.  With respect to power analysis, the committee concluded:

“Once the study is analyzed, confidence intervals replace calculated power in describing the results.”

Leland Wilkinson & the Task Force on Statistical Inference, “Statistical methods in psychology journals:  guidelines and explanations,” 54 Am. Psychol. 594-604 (1999)

TEXTBOOKS

Modern Epidemiology

Kenneth Rothman and Sander Greenland are known for many contributions, not the least of which is their textbook on epidemiology.  In the second edition of Modern Epidemiology, the authors explain how and why confidence intervals replace power considerations, once the study is completed and the data are analyzed:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis.  The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect.  * * *  In planning a study, it is reasonable to make conjectures about the magnitude of an effect in order to compute sample-size requirements or power.

In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with sample-size or power calculations (Smith & Bates 1992; Goodman & Berlin 1994). * * * Confidence limits convey much more of the essential information by indicating a range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level).  They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman & Sander Greenland, Modern Epidemiology 192-93 (2d ed. 1998)

And in 2008, with the addition of Timothy Lash as a co-author, Modern Epidemiology continued its guidance on power as only a pre-study consideration:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis. The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect. In planning a study, it is reasonable to make conjectures about the magnitude of an effect to compute study-size requirements or power. In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with study-size or power calculations (Smith and Bates, 1992; Goodman and Berlin, 1994; Hoenig and Heisey, 2001). Confidence limits and (even more so) P-value functions convey much more of the essential information by indicating the range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level), assuming the statistical model is correct. They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”

Kenneth Rothman, Sander Greenland, and Timothy Lash, Modern Epidemiology 160 (3d ed. 2008)

A Short Introduction to Epidemiology

Neil Pearce, an epidemiologist, citing Smith & Bates (1992) and Goodman & Berlin (1994), infra, describes the standard method:

“Once a study has been completed, there is little value in retrospectively performing power calculations since the confidence limits of the observed measure of effect provide the best indication of the range of likely values for the true association.”

Neil Pearce, A Short Introduction to Epidemiology (2d ed. 2005)

Statistics at Square One

The British Medical Journal publishes a book, Statistics at Square One, which addresses the issue of post hoc power:

“The concept of power is really only relevant when a study is being planned.  After a study has been completed, we wish to make statements not about hypotheses but about the data, and the way to do this is with estimates and confidence intervals.”

T. Swinscow, Statistics at Square One 42 (9th ed. London 1996) (citing a book by Martin Gardner and Douglas Altman, both highly accomplished biostatisticians).

How to Report Statistics in Medicine

Two authors from the Cleveland Clinic, in a guidebook published by the American College of Physicians:

“Until recently, authors were urged to provide ‘post hoc power calculations’ for non-significant differences.  That is, if the results of the study were negative, a power calculation was to be performed after the fact to determine the adequacy of the sample size.  Confidence intervals also reflect sample size, however, and are more easily interpreted, so the requirement of a post hoc power calculation for non-statistically significant results has given way to reporting the confidence interval (32).”

Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 58 (2d ed. 2006)(citing to Goodman & Berlin, infra).  See also Thomas Lang & Michelle Secic, How to Report Statistics in Medicine 78 (1st ed. 1996)

Clinical Epidemiology:  The Essentials

The Fletchers, both respected clinical epidemiologists, describe standard method and practice:

“Statistical Power Before and After a Study is Done

Calculation of statistical power based on the hypothesis testing approach is done by the researchers before a study is undertaken to ensure that enough patients will be entered to have a good chance of detecting a clinically meaningful effect if it is present.  However, after the study is completed this approach is no longer relevant.  There is no need to estimate effect size, outcome event rates, and variability among patients; they are now known.

Therefore, for researchers who report the results of clinical research and readers who try to understand their meaning, the confidence interval approach is more relevant.  One’s attention should shift from statistical power for a somewhat arbitrarily chosen effect size, which may be relevant in the planning stage, to the actual effect size observed in the study and the statistical precision of that estimate of the true value.”

R. Fletcher, et al., Clinical Epidemiology: The Essentials at 200 (3d ed. 1996)

The Planning of Experiments

Sir David Cox is one of the leading statisticians in the world.  In his classic 1958 text, The Planning of Experiments, Sir David wrote:

“Power is important in choosing between alternative methods of analyzing data and in deciding on an appropriate size of experiment.  It is quite irrelevant in the actual analysis of data.”

David Cox, The Planning of Experiments 161 (1958)

ARTICLES

Cummings & Rivara (2003)

“Reporting of power calculations makes little sense once the study has been done.  We think that reviewers who request such calculations are misguided.”

* * *

“Point estimates and confidence intervals tell us more than any power calculations about the range of results that are compatible with the data.”

Cummings & Rivara, “Reporting statistical information in medical journal articles,” 157 Arch. Pediatric Adolesc. Med. 321, 322 (2003)

Senn (2002)

“Power is of no relevance in interpreting a completed study.”

* * *

“The definition of a medical statistician is one who will not accept that Columbus discovered America because he said he was looking for India in the trial plan.  Columbus made an error in his power calculation – he relied on an estimate of the size of the Earth that was too small – but he made one none the less, and it turned out to have very fruitful consequences.”

Senn, “Power is indeed irrelevant in interpreting completed studies,” 325 Br. Med. J. 1304 (2002).

Hoenig & Heisey (2001)

“Once we have constructed a C.I., power calculations yield no additional insight.  It is pointless to perform power calculations for hypotheses outside of the C.I. because the data have already told us that these are unlikely values.”  p. 22a

Hoenig & Heisey, “The Abuse of Power:  The Pervasive Fallacy of Power Calculations for Data Analysis,” 55 Am. Statistician 19 (2001)

Zumbo & Hubley (1998)

In The Statistician, published by the Royal Statistical Society, these authors roundly condemn post hoc power calculations:

“We suggest that it is nonsensical to make power calculations after a study has been conducted and a statistical decision has been made.  Instead, the focus after a study has been conducted should be on effect size . . . .”

Zumbo & Hubley, “A note on misconceptions concerning prospective and retrospective power,” 47-2 The Statistician 385 (1998)

Goodman & Berlin (1994)

Professor Steven Goodman is a professor of epidemiology at Johns Hopkins University, and the Statistical Editor for the Annals of Internal Medicine.  Interestingly, Professor Goodman appeared as an expert witness, opposite Sander Greenland, in hearings on Thimerosal.  His article, with Jesse Berlin, has been frequently cited in support of the irrelevance of post hoc power considerations:

“Power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant.”

(p. 200a, ¶1)

“Studies with low statistical power have sample sizes that are too small, producing results that have high statistical variability (low precision).  Confidence intervals are a convenient way to express that variability.”

(p. 200a, ¶2)

“Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected . . . .”

(p. 200 b, top)

“Power is exclusively a pretrial concept; it is the probability of a group of possible results (namely all statistically significant outcomes) under a specified alternative hypothesis.  A study produces only one result.”

(p. 201a, ¶2)

“The perspective after the experiment differs from that before that experiment simply because the result is known.  That may seem obvious, but what is less apparent is that we cannot cross back over the divide and use pre-experiment numbers to interpret the result.  That would be like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit a lottery jackpot (the after-experiment perspective).”

(p. 201a-b)

“For interpretation of observed results, the concept of power has no place, and confidence intervals, likelihood, or Bayesian methods should be used instead.”

(p. 205)

Goodman & Berlin, “The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results,” 121 Ann. Intern. Med. 200, 200, 201, 205 (1994).

Smith & Bates (1992)

This article was published in the journal, Epidemiology, which was founded and edited by Professor Kenneth Rothman:

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.”

“Generally, a negative study with low power will be regarded as providing little evidence against the existence of a causal association.  Often overlooked, however, is that otherwise well-conducted studies of low power can be informative:  the upper bound of the (1 – α)% confidence intervals provides a limit on the likely magnitude of any actual effect.

The purpose of this paper is to extend this argument to show that the use of traditional power calculations in causal inference (that is, after a study has been carried out) can be misleading and inferior to the use of upper confidence limits of estimates of effect.  The replacement of post-study power calculations with confidence interval estimates is not a new idea.”

(p. 449a)

* * *

“It is clear, then, that the use of the upper confidence limit conveys considerable information for the purposes of causal inference; by contrast, the power calculation can be quite misleading.”

(p. 451b)

* * *

“In conclusion, we recommend that post-study epidemiologic power calculations be abandoned.  As we have demonstrated, they have little, if any, value.  We propose that, in their place, (1 – α)%  upper confidence limits be calculated.”

(p. 451b)

Smith & Bates, “Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies,” 3 Epidemiology 449-52 (1992)

Greenland (1988)

“the arbitrariness of power specification is of course absent once the data are collected, since the statistical power refers to the probability of obtaining a particular type of data.  It is thus not a property of particular data sets.  Statistical power of collected data, as the probability of heads on a coin toss that has already taken place, can, at best, meaningfully refer only to one’s ignorance of the result and loses all meaning when one examines the result.”

Greenland, “On Sample Size and Power Calculations for Studies Using Confidence Limits,” 128 Am. J. Epidem. 231, 236 (1988)

Simon (1986)

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

***

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

Rothman (1986)

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986)

Makuch & Johnson (1986)

“[the] confidence interval approach, the method we recommend for interpreting completed trials in order to judge the range of true treatment differences that is reasonably consistent with the observed data.”

Robert W. Makuch & Mary F. Johnson, “Some Issues in the Design and Interpretation of ‘Negative’ Clinical Studies,” 146 Arch. Intern. Med. 986, 986 (1986).

Detsky & Sackett (1985)

“Negative clinical trials that conclude that neither of the treatments is superior are often criticized for having enrolled too few patients.  These criticisms usually are based on formal sample size calculations that compute the number of patients required prospectively, as if the trial had not yet been carried out.  We suggest that this ‘prospective’ sample size calculation is incorrect, for once the trial is over we have ‘hard’ data from which to estimate the actual size of the treatment effect.  We can either generate confidence limits around the observed treatment effect or retrospectively compare it with the effect hypothesized before the trial.”

Detsky & Sackett, “When was a ‘negative’ clinical trial big enough?  How many patients you need depends on what you found,” 145 Arch. Intern. Med. 709 (1985).

Power in the Courts — Part One

January 18th, 2011

The Avandia MDL court, in its recent decision to permit plaintiffs’ expert witnesses to testify about general causation, placed substantial emphasis on the statistical concept of power.  Plaintiffs’ key claim is that Avandia causes heart attacks, yet no clinical trial of the oral anti-diabetic medication Avandia found a statistically significant increased risk of heart attacks.  Plaintiffs’ expert witnesses argued that all the clinical trials of Avandia were “underpowered,” and thus the failure to find an increased risk was a Type II (false-negative) error that resulted from the small size of the clinical trials:

“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs upon which GSK relies are all underpowered to study cardiac risks.”

In re Avandia Marketing, Sales Practices, and Products Liab. Litig., MDL 1871, Mem. Op. and Order (E.D.Pa. Jan. 3, 2011)(emphasis in original).

The true effect, according to plaintiffs’ expert witnesses, could be seen only through aggregating the data, across clinical trials, in a meta-analysis.  The proper conduct, reporting, and interpretation of meta-analyses were thus crucial issues for the Avandia MDL court, which appeared to have difficulty with statistical concepts.  The court’s difficulty, however, may have had several sources beyond misleading plaintiffs’ expert witness testimony, and the defense’s decision not to call an expert in biostatistics and meta-analysis at the Rule 702 hearing.

Another source of confusion about statistical power may well have come from the very reference work designed to help judges address statistical and scientific evidence in their judicial capacities:  The Reference Manual on Scientific Evidence.

Statistical power is discussed in both the statistics and the epidemiology chapters of The Reference Manual on Scientific Evidence.  The chapter on epidemiology, however, provides misleading guidance on the use of power:

“When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent’s toxicity or is essentially inconclusive with regard to toxicity. The concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.79  The power of a study expresses the probability of finding a statistically significant association of a given magnitude (if it exists) in light of the sample sizes used in the study. The power of a study depends on several factors: the sample size; the level of alpha, or statistical significance, specified; the background incidence of disease; and the specified relative risk that the researcher would like to detect.80 Power curves can be constructed that show the likelihood of finding any given relative risk in light of these factors. Often power curves are used in the design of a study to determine what size the study populations should be.81”

Michael D. Green, D. Michael Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000).  See also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000)

This guidance is misleading in the context of epidemiologic studies because power curves are rarely used any more to assess completed studies.  Power calculations are, of course, used to help determine sample size for a planned study.  After the data are collected, however, the appropriate method to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of risk size.

The authors of the chapter on epidemiology cite to a general review paper, id. at 362 n.79, which does indeed address the concept of statistical power, but the author, a well-known statistician, addresses the issue primarily in the context of planning a statistical analysis, and in discrimination litigation, where the test result will be expressed as a p-value, without a measure of “effect size,” and, more important, without a “confidence interval” around the estimate of effect size:

“The chance of rejecting the false null hypothesis, under the assumptions of an alternative, is called the power of the test. Simply put, among many ways in which we can test a null hypothesis, we want to select a test that has a large power to correctly distinguish between two alternatives. Generally speaking, the power of a test increases with the size of the sample, and tests have greater power, and therefore perform better, the more extreme the alternative considered becomes.

Often, however, attention is focused on the first type of error and the level of significance. If the evidence, then, is not statistically significant, it may be because the null hypothesis is true or because our test did not have sufficient power to discern a difference between the null hypothesis and an alternative explanation. In employment discrimination cases, for example, separate tests for small samples of employees may not yield statistically significant results because each test may not have the ability to discern the null hypothesis of nondiscriminatory employment from illegal patterns of discrimination that are not extreme. On the other hand, a test may be so powerful, for example, when the sample size is very large, that the null hypothesis may be rejected in favor of an alternative explanation that is substantively of very little difference.  ***

Attention must be paid to both types of errors and the risks of each, the level of significance, and the power. The trier of fact can better interpret the result of a significance test if he or she knows how powerful the test is to discern alternatives. If the power is too low against alternative explanations that are illegal practices, then the test may fail to achieve statistical significance even though the illegal practices may be operating. If the power is very large against a substantively small and legally permissible difference from the null hypothesis, then the test may achieve statistical significance even though the employment practices are legal.”

Stephen E. Fienberg, Samuel H. Krislov, and Miron L. Straf, “Understanding and Evaluating Statistical Evidence in Litigation,” 36 Jurimetrics J. 1, 22-23 (1995).

Professor Fienberg’s characterization is accurate, but his description of “post-hoc” assessment of power was not offered for the context of epidemiologic studies, which today virtually always report confidence intervals around the studies’ estimates of effect size.  These confidence intervals allow a concerned reader to evaluate what can reasonably be ruled out by the data in a given study.  Post-hoc power calculations fail to provide meaningful information because they require a specified alternative hypothesis.  A wily plaintiff’s expert witness can always arbitrarily select a sufficiently low alternative hypothesis, say a relative risk of 1.01, such that any study would have a vanishingly small probability of correctly distinguishing the null and alternative hypotheses.
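To make the point concrete, consider the following sketch, which computes the power of a two-sided test on the log relative risk scale.  The standard error is an assumption made for illustration, not a figure taken from any Avandia study:

  from math import log
  from statistics import NormalDist

  def power_two_sided(rr_alt, se_log_rr, alpha=0.05):
      # Power to detect a true relative risk rr_alt, given the realized SE of ln(RR)
      nd = NormalDist()
      z = nd.inv_cdf(1 - alpha / 2)
      shift = log(rr_alt) / se_log_rr          # standardized effect under the alternative
      return (1 - nd.cdf(z - shift)) + nd.cdf(-z - shift)

  se = 0.15                                    # hypothetical standard error of ln(RR)
  print(power_two_sided(1.01, se))             # about 0.05: no better than the false-positive rate
  print(power_two_sided(2.00, se))             # about 0.996: detection is virtually certain

Against a relative risk of 1.01, the same study is “underpowered” by construction; against a doubling of risk, it is all but certain to reject the null.  The label tells us about the alternative the advocate chose, not about the study.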

The Reference Manual is now undergoing a revision, for an anticipated third edition.  A saner appreciation of the concept of power as it is used in epidemiologic studies and clinical trials would be helpful to courts and to lawyers who litigate cases involving this kind of statistical evidence.

Learning to Embrace Flawed Evidence – The Avandia MDL’s Daubert Opinion

January 10th, 2011

If GlaxoSmithKline (GSK) did not have bad luck when it comes to its oral anti-diabetic medication Avandia, it would have no luck at all.

On January 4, 2011, the federal judge who oversees the Avandia multi-district litigation (MDL) in Philadelphia entered an order denying GSK’s motion to exclude the causation opinion testimony of plaintiffs’ expert witnesses.  In re Avandia Marketing, Sales Practices, and Products Liab. Litig., MDL 1871, Mem. Op. and Order (E.D.Pa. Jan. 3, 2011)(Rufe, J.)[cited as “Op.”].  The decision is available on the CBS Interactive Business Network news blog, BNET.

Based largely upon a meta-analysis of randomized clinical trials (RCTs) by Dr Steven Nissen and Ms Kathleen Wolski, plaintiffs’ witnesses opined that Avandia (rosiglitazone) causes heart attacks and strokes.  Because meta-analysis has received so little serious judicial attention in connection with Rule 702 or 703 motions, this opinion by the Hon. Cynthia Rufe deserves careful attention by all students of “Daubert” law.  Unfortunately, that attention is likely to be critical — Judge Rufe’s opinion fails to engage the law and facts of the case, while committing serious mistakes on both fronts.

The Law

The reader will know that things are not going well for a sound legal analysis when the trial court begins by misstating the controlling law for decision:

“Under the Third Circuit framework, the focus of the Court’s inquiry must be on the experts’ methods, not their conclusions. Therefore, the fact that Plaintiffs’ experts and defendants’ experts reach different conclusions does not factor into the Court’s assessment of the reliability of their methods.”

Op. at 2 (internal citation omitted).

and

“As noted, the experts are not required to use the best possible methods, but rather are required to use scientifically reliable methods.”

Op. at 26.

Although the United States Supreme Court attempted, in Daubert, to draw a distinction between the reliability of an expert witness’s methodology and conclusion, that Court soon realized that the distinction is flawed.  If an expert witness’s proffered testimony is discordant with regulatory and scientific conclusions, a reasonable, disinterested scientist would be led to question the reliability of the testimony’s methodology and its inferences from facts and data to its conclusion.  The Supreme Court recognized this connection in General Electric v. Joiner, and the connection between methodology and conclusions was ultimately incorporated into a statute, the revised Federal Rule of Evidence 702:

“[I]f scientific, technical or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training or education, may testify thereto in the form of an opinion or otherwise, if

  1. the testimony is based upon sufficient facts or data,
  2. the testimony is the product of reliable principles and methods; and
  3. the witness has applied the principles and methods reliably to the facts.”

The Avandia MDL court thus ignored the clear mandate of a statute, Rule 702(1), and applied an unspecified “Third Circuit” framework, which is legally invalid to the extent it departs from the statute.

The Avandia court’s ruling, however, goes beyond this clear error in applying the wrong law.  Judge Rufe notes that:

“The experts must use good grounds to reach their conclusions, but not necessarily the best grounds or unflawed methods.”

Op. at 2-3 (internal citations omitted).

Here the trial court’s double negative is confusing.  The court clearly suggests that plaintiffs’ experts must use “good grounds,” but that their methods can be flawed and still survive challenge.  We can certainly hope that the trial court did not intend to depart so far from the statute, scientific method, and common sense, but the court’s own language suggests that it abused its discretion in applying a clearly incorrect standard.

Misstatements of Fact

The apparent errors of the Avandia decision transcend mistaken legal standards, and go to key facts of the case.  Some errors perhaps show inadvertence or inattention, for instance, when the court states that the RECORD trial, an RCT conducted by GSK, set out “specifically to compare the cardiovascular safety of Avandia to that of Actos (a competitor medication in the same class).”  Op. at 4.  In fact, Actos (or pioglitazone) was not involved in the RECORD trial, which involved Avandia, along with two other oral anti-diabetic medications, metformin and sulfonylurea.

Erroneous Reliance upon p-values to the exclusion of Confidence Intervals

Other misstatements of fact, however, suggest that the trial court did not understand the scientific evidence in the case.  By way of example, the trial court erroneously over-emphasized p-values, and ignored the important interpretative value of the corresponding confidence intervals.  For example, we are told that “[t]he NISSEN meta-analysis combined 42 clinical trials, including the RECORD trial and other RCTs, and found that Avandia increased the risk of myocardial infarction by 43%, a statistically significant result (p = .031).”  Op. at 5.  Ignoring for the moment that the cited meta-analysis did not include the RECORD RCT, the Court should have reported the p-value along with the corresponding two-sided 95% confidence interval:

“the odds ratio for myocardial infarction was 1.43 (95% confidence interval [CI], 1.03 to 1.98; P = 0.03).”

Steven E. Nissen, M.D., and Kathy Wolski, M.P.H., “Effect of Rosiglitazone on the Risk of Myocardial Infarction and Death from Cardiovascular Causes,” 356 New Engl. J. Med. 2457, 2457 (2007).

The Court repeats this error later in its opinion:

“In 2007, the New England Journal of Medicine published the NISSEN meta-analysis, which combined results from 42 double-blind RCTs and found that patients taking Avandia had a statistically significant 43% increase in myocardial ischemic events. NISSEN used all publicly available data from double-blind RCTs of Avandia in which cardiovascular disease events were recorded, thereby eliminating one major drawback of meta-analysis: the biased selection of studies.”

Op. at 17.  The second time around, however, the Court introduced new factual errors.  The Court erred in suggesting that Nissen used all publicly available data.  There were, in fact, studies available to Nissen and to the public, which met Nissen’s inclusion criteria, but which he failed to include in his meta-analysis.  Nissen’s meta-analysis was thus biased by its failure to have conducted a complete, thorough review of the medical literature for qualifying RCTs.  Furthermore, contrary to the Court’s statement, Nissen included non-double-blinded RCTs, as his own published paper makes clear.

Erroneous Interpretation of p-values

The court erred in its interpretation of p-values:

 “The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance). “

Op. at 25 (internal citation omitted).  The p-value is, of course, the probability of observing results as large as, or larger than, those actually observed, given the truth of the null hypothesis that there is no difference between Avandia and its comparator medications.  The p-value does not permit a probabilistic assessment of the correctness of the null hypothesis; nor does it permit a straightforward probabilistic assessment of the alternative hypothesis.

See Federal Judicial Center, Reference Manual on Scientific Evidence 122, 357 (2d ed. 2000).
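A short simulation shows why the court’s gloss fails.  Under a true null hypothesis, p-values are uniformly distributed, so roughly 8% of truly null studies will produce p ≤ .08; the p-value describes the data on the assumption that the null is true, and is nothing like a 92% probability that the null is false.  The numbers below are illustrative only:

  import random
  from statistics import NormalDist

  random.seed(1)
  nd = NormalDist()
  trials = 100_000
  # Simulate test statistics from studies in which the null hypothesis is true,
  # and count how often the two-sided p-value comes out at .08 or smaller.
  hits = sum(
      1 for _ in range(trials)
      if 2 * (1 - nd.cdf(abs(random.gauss(0.0, 1.0)))) <= 0.08
  )
  print(hits / trials)   # about 0.08: 8% of null studies yield p <= .08 by chance alone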

Hand Waving over Statin Use

The Court appeared to have been confused by plaintiffs’ rhetoric that statin use masked a real risk of heart attacks in the Avandia RCTs. 

“It is not clear whether statin use was allowed in the DREAM study.”

Op. at 25.  The problem is that the Court fails to point to any evidence that the use of statins differed between the Avandia and comparator arms of the RCTs.  Statins have been one of the great pharmaceutical success stories of the last 15 years, and it is reasonable to believe that today most diabetic patients (who often have high blood fats) would be taking statins.  At the time of the DREAM study, the prevalence of statin use would have been lower than it is today, but there was no evidence mentioned that the use differed between the Avandia and other arms of the DREAM trial.

Errors in Interpreting RCTs by Intention to Treat Analyses

For unexplained reasons, the court was impressed by what it called a high dropout rate in one of the larger Avandia RCTs:

“The ADOPT study was marred by a very high dropout rate (more than 40% of the subjects did not complete the four year follow up) and the use of statins during the trial.”

Op. at 25.  Talk about being hoisted with one’s own petard!  The high dropout rate in ADOPT resulted from the fact that this RCT was a long-term test of “glycemic control.”  Avandia did better with respect to durable glycemic control than two major, accepted medications, metformin and sulfonylurea, and thus the dropouts came mostly in the comparator arms, as patients not taking Avandia required more and stronger medications, or even injected insulin.  The study investigators were obligated to analyze their data in accord with “intention to treat” principles, and so patients removed from the trial due to lack of glycemic control could no longer be counted with respect to any outcome of interest.  Avandia patients thus had longer follow-up time, and more opportunity to have events due to their underlying pathologic physiology (diabetes and diabetes-related heart attacks).
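A toy calculation, with hypothetical numbers throughout, illustrates the follow-up asymmetry just described: with identical true event rates in both arms, the arm whose patients remain in the trial longer accrues more raw events, even though the incidence per person-year is exactly the same.

  true_rate = 0.01             # events per person-year, assumed identical in both arms
  n_per_arm = 1000
  avandia_years = 4.0          # longer average follow-up (durable glycemic control)
  comparator_years = 2.5       # shorter follow-up (dropouts for loss of glycemic control)

  avandia_events = true_rate * avandia_years * n_per_arm         # 40 expected events
  comparator_events = true_rate * comparator_years * n_per_arm   # 25 expected events
  print(avandia_events / comparator_events)                      # 1.6: a spurious 60% "excess"
  print(avandia_events / (avandia_years * n_per_arm))            # 0.01 per person-year
  print(comparator_events / (comparator_years * n_per_arm))      # 0.01 per person-year

Comparing raw event counts rather than rates per person-time would thus falsely suggest a 60% excess in the Avandia arm, although the underlying risk is identical.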

Ignoring Defense Arguments

GSK may have hurt itself by electing not to call an expert witness at the Daubert hearing in this MDL.  Still, the following statement by the Court is hard to square with the opening argument given at the hearing:

“GSK points out no specific flaws or limitations in the design or implementation of the NISSEN meta-analysis”

Op. at 6.  If true, then shame on GSK; but somehow this statement seems too incredible to be true.

Ignoring the Difference between myocardial ischemic events and myocardial infarction (MI)

MI occurs when heart muscle dies as a result of a blockage in a blood vessel that supplies it with oxygenated blood.  An ischemic event is defined very broadly in GSK’s study:

“To minimize the possibility of missing events of interest, all events coded with broadly inclusive AE terms captured from investigator reports were reviewed. SAEs identified from the trials database included cardiac failure, angina pectoris, acute pulmonary edema, all cases of chest pain without a clear non-cardiac etiology and myocardial infarction/myocardial ischemia.”

Alexander Cobitz MD, PhD, et al., “A retrospective evaluation of congestive heart failure and myocardial ischemia events in 14 237 patients with type 2 diabetes mellitus enrolled in 42 short-term, double-blind, randomized clinical studies with rosiglitazone,” 17 Pharmacoepidem. & Drug Safety 769, 770 (2008).

In its pooled analysis, GSK was clearly erring on the side of safety in creating its composite end point, but the crucial point is that GSK included events that had nothing to do with MI.  The MDL court appears to have accepted uncritically the plaintiffs’ expert witnesses’ claim that the difference between myocardial ischemic events and MI is only a matter of degree.  The Court found “that the experts were able to draw reliable conclusions about myocardial infarction” from a meta-analysis about a different end point, “by virtue of their expertise and the available data.”  Op. at 10.  This is hand waving or medical alchemy.

Uncritical Acceptance of Mechanistic Evidence Related to Increased Congestive Heart Failure (CHF) in Avandia Users

The court noted that plaintiffs’ expert witnesses relied upon a well-established relationship between Avandia and congestive heart failure (CHF).  Op. at 14.  True, true, but immaterial.  Avandia causes fluid retention, but so do other drugs in this class.  Actos causes fluid retention, and carries the same warning for CHF, but there is no evidence that Actos causes MI or stroke.  Although the Court’s desire to have a mechanism of causation is understandable, that desire cannot substitute for actual evidence.

Misuse of Power Analyses

The Avandia MDL Court mistakenly referred to inadequate statistical power in the context of interpreting data on heart attacks in Avandia RCTs.

“If the sample size is too small to adequately assess whether the substance is associated with the outcome of interest, statisticians say that the study lacks the power necessary to test the hypothesis. Plaintiffs’ experts argue, among other points, that the RCTs upon which GSK relies are all underpowered to study cardiac risks.”

Op. at 5.

The Court might have helped itself by adverting to the Reference Manual on Scientific Evidence:

“Power is the chance that a statistical test will declare an effect when there is an effect to declare. This chance depends on the size of the effect and the size of the sample.”

Federal Judicial Center, Reference Manual on Scientific Evidence 125-26, 357 (2d ed. 2000) (internal citations omitted).  In other words, you cannot assess the power of a study unless you specify, among other things, the size of the association posited by the alternative hypothesis and the sample size.  It is true that most of the Avandia trials were not powered to detect heart attacks, but the concept of power requires the user to specify at least the alternative hypothesis against which the study is being assessed for power.  Once the studies were completed, and the data became available, there was no longer any need or use for the consideration of power; the statistical precision of the studies’ results was given by their confidence intervals.
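A rough planning-stage sketch makes the dependence explicit.  Using a standard approximation for a test on the log relative risk (all of the numbers are illustrative assumptions, not the design parameters of any actual Avandia trial), the patients needed per arm swing by a factor of four depending on the alternative specified:

  from math import ceil, log
  from statistics import NormalDist

  def n_per_arm(rr_alt, baseline_risk, power=0.80, alpha=0.05):
      # Approximate patients per arm to detect risk ratio rr_alt with the stated power
      nd = NormalDist()
      z_a, z_b = nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(power)
      p1, p0 = baseline_risk * rr_alt, baseline_risk
      var_per_subject = (1 - p1) / p1 + (1 - p0) / p0   # variance of ln(RR) per subject per arm
      return ceil((z_a + z_b) ** 2 * var_per_subject / log(rr_alt) ** 2)

  print(n_per_arm(1.43, 0.01))   # about 10,300 per arm to detect a 43% increase in a rare event
  print(n_per_arm(2.00, 0.01))   # about 2,400 per arm to detect a doubling of risk

Trials sized to test glycemic control will rarely be large enough to detect modest increases in rare cardiac events; once such trials are done, however, their confidence intervals, not recycled power labels, carry the information.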

Incorrect Use of the Concept of Replication

The MDL court erred in accepting the plaintiffs’ expert witnesses’ bolstering of Nissen’s meta-analytic results by their claim that Nissen’s results had been “replicated”:

“[T]he NISSEN results have been replicated by other researchers. For example, the SINGH meta-analysis pooled data from four long-term clinical trials, and also found a statistically significant increase in the risk of myocardial infarction for patients taking Avandia. GSK and the FDA have also replicated the results of NISSEN through their own meta-analyses.”

Op. at 6 (internal citations omitted).

“The SINGH, GSK and FDA meta-analyses replicated the key findings of the NISSEN study.43”

Op. at 17.

These statements mistakenly suggest that Nissen’s meta-analysis was able to generate a reliable conclusion that there was a statistically significant association between Avandia use and MI.  The Court’s insistence that Nissen was replicated does not become more true for having been stated twice.  Nissen’s meta-analysis was not an observational study in the usual sense.  His publication made very clear what studies were included (and not at all clear what studies were excluded), and the meta-analytic model that he used.  Thus, it is trivially true that anyone could have replicated his analysis, and indeed, several researchers did so.  See, e.g., George A. Diamond, MD, et al., “Uncertain Effects of Rosiglitazone on the Risk for Myocardial Infarction and Cardiovascular Death,” 147 Ann. Intern. Med. 578 (2007).

But Nissen’s results were not replicated by Singh, GSK, or the FDA, because these other meta-analyses used different methods, different endpoints (in GSK’s analysis), different inclusion criteria, different data, and different interpretative methods.  Most important, GSK and FDA could not reproduce the statistically significant finding for their summary estimate of association between Avandia and heart attacks.

One definition of replication that the MDL court might have consulted makes clear that replication is a repeat of the same experiment to determine whether the same (or a consistent) result is obtained:

“REPLICATION — The execution of an experiment or survey more than once so as to confirm the findings, increase precision, and obtain a closer estimation of sampling error.  Exact replication should be distinguished from consistency of results on replication.  Exact replication is often possible in the physical sciences, but in the biological and behavioral sciences, to which epidemiology belongs, consistency of results on replication is often the best that can be attained. Consistency of results on replication is perhaps the most important criterion in judgments of causality.”

Miquel Porta, Sander Greenland, and John M. Last, eds., A Dictionary of Epidemiology 214 (5th ed. 2008).  The meta-analyses of Singh, GSK, and FDA did not, and could not, replicate Nissen’s.  Singh’s meta-analysis obtained a result similar to Nissen’s, but the other meta-analyses, by GSK, FDA, and Mannucci, failed to yield a statistically significant result for MI.  This is replication only in Wonderland.

It is hard to escape the conclusion that the MDL denied GSK intellectual due process of law.