TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

The Dow-Bears Debate the Decline of Daubert

August 10th, 2012

Last month, I posted a short screenplay about how judicial gatekeeping of expert witnesses has slackened recently.  See “Daubert Approaching the Age of Majority” (July 17, 2012).

Dr. David Schwartz, of Innovative Science Solutions, has adapted the screenplay to the cinematic screen, and directed a full-length feature movie, The Daubert Will Set Your Client Free, using text-to-talk technology. Dr. Schwartz is not only a first-rate scientist, but he is also an aspiring film maker and artist.

OK; full-length is only a little more than 90 seconds, but you may still enjoy our movie-making debut.  And it is coming to a YouTube screen near you, now.

Discovery of Statistician Expert Witnesses

July 19th, 2012

This post has been updated and superseded by “

Daubert Approaching the Age of Majority

July 19th, 2012

PLAINTIFF LAWYER:  WHY DID YOU FILE THE DAUBERT?

DEFENSE LAWYER:  THE DAUBERT WILL SET MY CLIENT FREE.

PLAINTIFF LAWYER:  THE DAUBERT WILL COST YOUR CLIENT A LOT OF MONEY WHICH COULD COMPENSATE YOUR VICTIMS.

DEFENSE LAWYER:  THE DAUBERT WILL BAR YOUR EXPERTS AND GIVE US THE SUMMARY JUDGMENT.

PLAINTIFF LAWYER:  YOU WILL LOSE THE DAUBERT.

DEFENSE LAWYER:  NO; WE WILL WIN THE DAUBERT BECAUSE YOU DO NOT HAVE THE BRADFORD HILL.

PLAINTIFF LAWYER:  THE BRADFORD HILL ARE NINE THINGS.  WHICH ONE ARE YOU TALKING ABOUT?

DEFENSE LAWYER:  I AM TALKING ABOUT ALL NINE.

PLAINTIFF LAWYER:  NO, WE HAVE THE BRADFORD HILL LOCKED UP.

DEFENSE LAWYER:  BUT YOU DO NOT HAVE ANY OF THE NINE.

PLAINTIFF LAWYER:  THE BRADFORD HILL SAYS NONE OF THE NINE IS NECESSARY; THEREFORE WE HAVE ALL SATISFIED.

DEFENSE LAWYER:  BUT WAIT; YOU DO NOT EVEN GET TO THE BRADFORD HILL.  YOUR P-VALUE IS TOO HIGH.

PLAINTIFF LAWYER:  I AM THE PLAINTIFF LAWYER.  I CANNOT BE TOO RICH, TOO POWERFUL, OR HAVE TOO HIGH A P-VALUE.

DEFENSE LAWYER:  NO; YOU NEED TO SHOW STATISTICAL SIGNIFICANCE BEFORE YOU GET TO THE BRADFORD HILL.

PLAINTIFF LAWYER:  STATISTICAL SIGNIFICANCE IS NOT A LITMUS TEST; SO I HAVE THAT SATISFIED AS WELL.

DEFENSE LAWYER:  YOU DON’T KNOW WHAT YOU’RE TALKING ABOUT.

PLAINTIFF LAWYER:  YOU NEED TO KEEP UP WITH THE CASE LAW.  LET’S GO PICK A JURY.

Pin the Tail on the Significance Test

July 14th, 2012

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  An adequate understanding of significance probability requires recognizing that the tail probability (the probability of a result at least as extreme as the result obtained, if the null hypothesis is true) may be the area under one or both tails of the probability distribution curve.  Specifying an attained significance probability therefore requires us to state whether the p-value is one-sided or two-sided; that is, whether we have counted results at least as extreme as the one observed in one direction or in both directions.

 

Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expect courts and counsel to look to the Reference Manual for guidance.  As with the notion of statistical significance itself, the Manual’s treatment here is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

“One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3d ed. 2011). This advice is pragmatic but a bit misleading.  The reason for a two-tailed test, however, is not really tied to multiple testing.  If there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple testing artifacts. In some cases, where the hypothesis test specifies an alternative hypothesis that is simply unequal to the null hypothesis, extreme values both above and below the null hypothesis count in favor of rejecting the null, and a two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it really should not drive our choice between one-tailed and two-tailed tests.
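To make the arithmetic concrete, here is a minimal sketch, in Python with scipy (my own illustration, not anything from the Manual), that reproduces the chapter’s 2.3% and 4.6% figures from an exact binomial calculation:

```python
# Reproduce the Reference Manual's coin-toss example: 1,000 tosses, 532 heads,
# null hypothesis that the coin is fair (p = 0.5).
from scipy.stats import binom

n, k, p_null = 1000, 532, 0.5

# One-tailed p-value: the probability of 532 or more heads if the coin is fair.
p_one_tail = binom.sf(k - 1, n, p_null)                # P(X >= 532), about 0.023

# Two-tailed p-value: add the symmetric lower tail, 468 or fewer heads.
p_two_tail = p_one_tail + binom.cdf(n - k, n, p_null)  # + P(X <= 468), about 0.046

print(f"one-tailed p-value: {p_one_tail:.3f}")
print(f"two-tailed p-value: {p_two_tail:.3f}")
```

The doubling works neatly here only because the fair-coin binomial distribution is symmetrical; as discussed below, the equivalence does not hold for skewed distributions.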

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what the FDA does when deciding whether a clinical trial has established the efficacy of a new medication.  Does the agency license the medication if the sponsor’s trial comes close to 5%, or does it demand 5%, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10. This statement is a helpful bulwark against the oft-repeated suggestion that any p-value would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of statistical test, whether one- or two-sided, may be tied to prior beliefs and to the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direction opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary; there are times when the use of a one-tailed or a two-tailed test is preferable, but the chapter withholds its guidance. The statement that a “one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed p-value will not necessarily yield a correct two-tailed measure for some asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and on whether the p-value is less than 0.05, then courts must realize that there may be alternative approaches to calculating significance probability, such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n.49.  The legal citations, however, are limited, and given the lack of sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.
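The point about doubling can be illustrated with a small numerical sketch (the numbers are hypothetical and of my own choosing, not drawn from the Manual): for a skewed binomial null distribution, the doubled one-tailed p-value, a two-tailed p-value computed by a minimum-likelihood convention, and the mid-p-value can all differ:

```python
# Hypothetical example: 20 trials, a null success probability of 0.10, and 5
# successes observed.  The null distribution is skewed, so the conventions for
# a "two-tailed" p-value no longer agree with simple doubling.
from scipy.stats import binom

n, p0, observed = 20, 0.1, 5

probs = binom.pmf(range(n + 1), n, p0)        # null probabilities for 0..20 successes
p_obs = binom.pmf(observed, n, p0)            # probability of exactly 5 successes

p_one_tail = binom.sf(observed - 1, n, p0)    # exact one-tailed: P(X >= 5), ~0.043
p_doubled  = 2 * p_one_tail                   # "double it" convention, ~0.086
p_min_lik  = probs[probs <= p_obs].sum()      # two-tailed by summing all outcomes no
                                              # more likely than the observed one, ~0.043
p_mid = binom.sf(observed, n, p0) + 0.5 * p_obs   # mid-p: half weight on X = 5, ~0.027

print(p_one_tail, p_doubled, p_min_lik, p_mid)
```

In this hypothetical, doubling the one-tailed value roughly doubles the apparent random error, while the minimum-likelihood two-tailed value coincides with the one-tailed value, which is exactly the kind of discrepancy the text describes for asymmetrical distributions.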

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably, the authors of the epidemiology chapter declare that because “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease), a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n.83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and its authors fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite a questionable case involving over-the-counter medications containing phenylpropanolamine (PPA), used for allergy and cold decongestion. Id., citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation used one-tailed tests when submitted to the FDA, but was revised to use two-tailed tests when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA litigation thus represents a case in which, for regulatory purposes, the one-tailed test was used, but for a scientific and clinical audience, the two-tailed test was used.

The other case cited by the epidemiology chapter was the District Court for the District of Columbia’s review of an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use a one-tailed test in assessing whether second-hand smoke was a carcinogen). The EPA is a federal agency in the “protection” business, not in the business of investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action is often based upon precautionary-principle judgments, which are different from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y. 1984) (“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir. 1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  One of plaintiffs’ own expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation out of hand.  Another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-sided testing as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.) (citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual‘s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States, [433 U.S. 299 (1977)] a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view, Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).
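The 1.64 versus 1.96 standard-error cutoffs that Finkelstein describes follow directly from the standard normal distribution; a short check (my own sketch, assuming the usual normal approximation) makes the relationship explicit:

```python
# Critical values of the standard normal distribution at the 5% significance level.
from scipy.stats import norm

print(norm.ppf(0.95))    # ~1.645: cutoff for a one-tailed 5% test
print(norm.ppf(0.975))   # ~1.960: cutoff for a two-tailed 5% test (2.5% in each tail)

# Equivalently, a disparity of 1.64 standard errors has a one-tailed p-value of
# about 0.05 but a two-tailed p-value of about 0.10.
z = 1.64
print(norm.sf(z), 2 * norm.sf(z))   # ~0.0505, ~0.1010
```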

The treatise, Modern Scientific Evidence, addresses some of the case law that has confronted disputes over one- versus two-tailed tests.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tail test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter & Gamble Co. v. Chesebrough-Pond’s Inc., 747 F.2d 114 (2d Cir. 1984) (discussing one-tail versus two in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879 (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998) (rejecting Shanna Swan’s effort to reinterpret study data by using a one-tail test of significance; “Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.”)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)

Tal Golan’s Preliminary History of Epidemiologic Evidence in U.S. Courts

July 10th, 2012

Tal Golan is an historian, with a special interest in the history of science in the 18th and 19th centuries, and in the historical relationships among science, technology, and the law.  He now teaches history at the University of California, San Diego.  Golan’s book on the history of expert witnesses in the common law is an important starting place for understanding the evolution of the adversarial expert witness system in English and American courts.  Tal Golan, Laws of Man and Laws of Nature: A History of Scientific Expert Testimony (Harvard 2004).

Last year, Golan led a faculty seminar at the University of Haifa’s Law School on the history of epidemiologic evidence in 20th century American litigation.  A draft of Golan’s paper is available at the school’s website, and for those interested in the evolution of the American courts’ treatment of statistical and epidemiologic evidence, the paper is worth a look.  Tal Golan, “A preliminary history of epidemiological evidence in the twentieth-century American Courtroom” manuscript (2011) [Golan 2011].

There are problems, however, with Golan’s historical narrative.  Golan points to tobacco cases as the earliest forays into the use of epidemiologic evidence to prove health claims in court:

“I found only four toxic tort cases in the 1960s that involved epidemiological evidence – two tobacco and two vaccine cases. In the tobacco cases, the plaintiffs tried and failed to establish a causal relation between smoking and cancer via the testimony of epidemiological experts. In both cases the judges dismissed the epidemiological evidence and directed summary verdicts for the tobacco companies.”38

Golan 2011 at 11 & n. 38 (citing Pritchard v. Liggett & Myers Tobacco Co., 295 F.2d 292 (1961); Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (1963)).  Golan may be correct that some of the early tobacco cases were dismissive of statistical and epidemiologic evidence, but these citations do not support his assertion.  The Lartigue case resulted in a defense verdict after a jury trial.  The judgment for the defendant was affirmed on appeal, with specific reference to the plaintiff’s use of epidemiologic evidence.  Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (5th Cir. 1963) (“The plaintiff contends that the jury’s verdict was contrary to the manifest weight of the evidence. The record consists of twenty volumes, not to speak of exhibits, most of it devoted to medical opinion. The jury had the benefit of chemical studies, epidemiological studies, reports of animal experiments, pathological evidence, reports of clinical observations, and the testimony of renowned doctors. The plaintiff made a convincing case, in general, for the causal connection between tobacco and cancer and, in particular, for the causal connection between Lartigue’s smoking and his cancer. The defendants made a convincing case for the lack of any causal connection.”), cert. denied, 375 U.S. 865 (1963), and cert. denied, 379 U.S. 869 (1964).  Golan is thus wrong to suggest that the plaintiffs in Lartigue suffered a summary judgment or a directed verdict on their causation claims.

In Pritchard, the plaintiff had three trials in the course of litigating his tobacco-related claims.  See Pritchard v. Liggett & Myers Tobacco Co., 134 F. Supp. 829 (W.D. Pa. 1955), rev’d, 295 F.2d 292, 294 (3d Cir. 1961), 350 F.2d 479 (3d Cir. 1965), cert. denied, 382 U.S. 987 (1966), amended, 370 F.2d 95 (3d Cir. 1966), cert. denied, 386 U.S. 1009 (1967).  The Pritchard case ultimately turned on liability more than causation issues.  In both cases, Golan’s citations are abridged and incorrect.

Golan also wades into a discussion of statistical significance in which he misstates the meaning of the concept and he incorrectly describes how it was handled in at least one important case:

“Statistics provides such an assurance by calculating the probability of false association, and the epidemiological dogma demands it to be smaller than 5% (i.e, less than 1 in 20) for the association to be considered statistically significant.”

Golan 2011, at 18.  This statement is wrong.  Statistics do not provide a probability of the truth or falsity of the association.  The significance probability to which Golan refers measures the probability of data at least as extreme as those observed if the null hypothesis of no difference is correct.

Having misunderstood and misstated the meaning of significance probability, Golan proceeds to make the classic misidentification of statistical significance probability with the probability of either the null hypothesis or the observed result.  Frequentist statistical testing cannot do this, and Golan’s error has no place in a history of these concepts other than to point out that courts have frequently made this mistake:

“The ‘statistical significance‘ standard is far more demanding than the ‘preponderance of the evidence‘ or ‘more likely than not‘ standard used in civil law. It reflects the cautious attitude of scientists who wish to be 95% certain that their measurements are not spurious.

**********

Epidemiologists have considered the price well worth paying. So has criminal law, which emphasizes the minimization of false conviction, even at the price of overlooking true crime. But civil law does not share this concern.”

This narrative misstates what epidemiologists are doing when they use significance probability and null hypothesis significance testing.  The conflation of epidemiologic statistical standards with the burden of proof in criminal cases is a serious error.

Golan compares and contrasts the approaches of the trial judges in Allen v. United States, and in In re Agent Orange:

“Judge Weinstein, on the other hand, was far less concerned with the strictness of the epidemiology. A scholar of evidence, member of the Advisory Committee that drafted the Federal Rules of Evidence during the early 1970s, and a critic of the partisan deployment of science in the adversarial courtroom, Weinstein embraced the stringent 95% significance threshold as a ready-made admissibility test that could validate the veracity of the statistical evidence used in court. Thus, while he referred to epidemiological studies as ‘the best (if not the sole) available evidence in mass exposure cases,’ he nevertheless refused to accept them in evidence, unless they were statistically significant.”64

Golan at 19.  Weinstein is all that and more, but he never simplistically embraced statistical significance as a “ready-made admissibility test.”  Of course, 95% is the coefficient of confidence, the complement of an alpha of 0.05, but this alpha is not a particularly stringent threshold unless it is misunderstood as a burden of proof.  Contrary to Golan’s suggestion, Judge Weinstein was not being conservative or restrictive in his approach in In re Agent Orange.

Golan’s “preliminary” history is a good start, but it misses an important perspective.  After World War II, biological science, in the form of genetics, as well as epidemiology and other areas, grew to encompass stochastic processes as well as mechanistic processes.  To a large extent, in permitting judgments to be based upon statistical and epidemiologic evidence, the law was struggling to catch up with developments in science.   There is quite a bit of evidence that the law is still struggling.

Reference Manual on Scientific Evidence (3d edition) on Statistical Significance

July 8th, 2012

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation is nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value?  This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

Berger’s citations in support are curiously inaccurate.  Footnote 62 cites the Cook case:

“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”

The expert witness in Cook, Dr. Clapp, did rely upon his own study, which did not obtain a statistically significant result, but the trial court denied the Rule 702 challenge and permitted Clapp to testify about his statistically non-significant ecological study; the testimony was admitted, not excluded as the footnote suggests.

Footnote 63 is no better:

“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”

With respect to the concept of statistical significance, the Viagra case centered on the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis.  The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies.  When the defendant brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself and granted the motion to exclude McGwin’s testimony.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  Berger’s characterization of the review as extensive is incorrect, and her failure to cite the subsequent procedural history is disturbing.

 

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3d ed. 2011).  Although the chapter confuses and conflates the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
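Kaye and Freedman’s warning about multiple testing is easy to demonstrate by simulation.  The following sketch (my own illustration, not the Manual’s) runs one hundred comparisons on pure noise; roughly five of them will be “statistically significant” at the 5% level by chance alone:

```python
# Multiple-testing artifact: run 100 null comparisons and count how many reach
# "statistical significance" at alpha = 0.05 even though no real effect exists.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_tests, alpha = 100, 0.05

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=50)          # both samples drawn from the same distribution,
    b = rng.normal(size=50)          # so the null hypothesis is true by construction
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(false_positives)               # on the order of 5 out of 100, as expected
```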

 

Chapter on Multiple Regression

The chapter on regression does not add much to the earlier and later discussions.  The author asks rhetorically what is the appropriate level of statistical significance, and answers:

“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47”

Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.

 

Chapter on Epidemiology

The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573.  The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics. The problem is that “results” is ambiguous as between the data as extreme or more so than what was observed, and the point estimate of the mean or proportion in the sample.  Furthermore, the chapter’s statement here omits reference to the conditional nature of the probability that makes it dependent upon the assumption of correctness of the null hypothesis.

The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary“ to mean “I will get reversed.”  The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary.  Some young adults, age 17.8 years old, may be better educated, better engaged in politics, and better informed about current events than some 35 year olds, but the law must set a cut-off.  Two year olds are demonstrably unfit, and 82 year olds are surely past the threshold of maturity requisite for political participation.  A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the disparity observed had a p = 0.15.

The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”

Id. at 573 n.68.  Actually, this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing.  The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance.  See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The epidemiology chapter also chastises courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct:

“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).”

“Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”

Id. at 577 n.81.  The footnote goes on to explain further the difference between the alpha probability and the burden-of-proof probability, but it incorrectly asserts that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.”  Id.  The significance probability does not address the probability that the observed statistic is the result of random chance; rather, it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true.  Kaye and Freedman’s chapter on statistics does much better at describing and avoiding the transpositional fallacy when describing p-values.

When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct.  What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.

The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87

Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”

Id. at 578-79.  Mostly true, but again rather  unhelpful to judges and lawyers.  The authors divide the world up into “strict” testers and those critical of “strict” testing.  Where is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all?  I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis.  Meta-analysis in such a situation might not just ameliorate concerns about random error, it might eliminate them.  But what of those critical of strict testing?  This certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).  The chapter’s reference to confidence intervals is correct in part; they permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of magnitude of association, as well as the point estimate of the association obtained from the sample.  Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
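The meta-analysis point can be made concrete with a short sketch.  The numbers below are invented for illustration (they come from no actual study), but the method, inverse-variance pooling of log relative risks, is standard: three studies whose confidence intervals each just fail to exclude the null can produce a pooled estimate whose interval does exclude it.

```python
# Fixed-effect (inverse-variance) pooling of three hypothetical relative risks,
# each individually "not statistically significant" because its 95% CI includes 1.0.
import math

studies = [(1.40, 0.95, 2.06), (1.50, 0.97, 2.32), (1.45, 0.98, 2.15)]  # (RR, CI low, CI high)

weights, log_rrs = [], []
for rr, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # standard error recovered from the CI
    weights.append(1 / se ** 2)                       # inverse-variance weight
    log_rrs.append(math.log(rr))

pooled_log = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
pooled_se  = math.sqrt(1 / sum(weights))

pooled_rr = math.exp(pooled_log)
ci_low, ci_high = (math.exp(pooled_log - 1.96 * pooled_se),
                   math.exp(pooled_log + 1.96 * pooled_se))
print(f"pooled RR {pooled_rr:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
# -> roughly a pooled RR of 1.45 with a CI of about (1.15, 1.83), which excludes 1.0
```

Whether pooling is appropriate in a given litigation is, of course, a separate question of study comparability and heterogeneity; the sketch only shows how combining studies reduces random error.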

In the final analysis, the epidemiology chapter is unclear and imprecise.  I believe it confuses matters more than it clarifies.  There is clearly room for improvement in the Fourth Edition.

Viagra, Part II — MDL Court Sees The Light – Bad Data Trump Nuances of Statistical Inference

July 8th, 2012

In the Viagra vision loss MDL, the first Daubert hearing did not end well for the defense.  Judge Magnuson refused to go beyond conclusory statements by the plaintiffs’ expert witness, Gerald McGwin, and to examine the qualitative and quantitative evaluative errors invoked to support plaintiffs’ health claims.  The weakness of McGwin’s evidence, however, appeared to  encourage Judge Magnuson to authorize extensive discovery into McGwin’s study.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1090 (D. Minn. 2008).

The discovery into McGwin’s study had already been underway, with subpoenas to him and to his academic institution.  As it turned out, defendant’s discovery into the data and documents underlying McGwin’s study won the day.  Although Judge Magnuson struggled with inferential statistics, he understood the direct attack on the integrity of McGwin’s data.  Over a year after denying defendant’s Rule 702 motion to exclude Gerald McGwin, the MDL court reconsidered and granted the motion.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).

The basic data on prior exposures and risk factors for the McGwin study were collected by telephone surveys, from which the information was coded into an electronic dataset.  In analyzing the data, McGwin used the electronic dataset and not the survey forms.  Id. at 939.  The transfer from survey forms to electronic dataset did not go smoothly; about 11 patients were miscoded as “exposed” when their use of Viagra post-dated the onset of NAION. Id. at 942.  Furthermore, the published article incorrectly reported a personal history of heart attack as a “risk factor”; the survey had inquired about family, not personal, history of heart attack. Id. at 944.

The plaintiffs threw several bombs in response, but without legal effect.  First, the plaintiffs claimed that the study participants had been recontacted and the database had been corrected, but they were unable to document this process or the alleged corrections.  Id. at 433.  Furthermore, the plaintiffs could not explain how, if their contention had been true, McGwin would not have committed serious violations of his university’s institutional review board’s regulations with respect to deviations from the original protocol.  Id. at 943 n.7.

Second, the plaintiffs argued that the underlying survey forms were “inadmissible” and thus the defense could not use them to impeach the McGwin study.  Some might think this a duplicitous argument, utterly at odds with Rule 703 – rely upon a study but prevent use of the underlying data and documents to explain that the study does not show what it purports to show.  The MDL court spared the plaintiffs the embarrassment of ruling that the documents on which McGwin had based his study were inadmissible, and found that the forms were business records, admissible under Federal Rule of Evidence 803(6).  The court could have gone further to point out that McGwin’s reliance upon hearsay, in the form of his study, McGwin 2006, opened the door to impeaching the hearsay relied upon with other hearsay.  See Rule 806.

When defense counsel sat down with McGwin in a deposition, they found that he had not undertaken any new analyses of corrected data.  Plaintiffs’ counsel had directed him not to do so.  Id. at 940-41.  But then, after the deposition was over, McGwin submitted a letter to the journal to report a corrected analysis.  Pfizer’s counsel obtained the letter in response to their subpoena to McGwin’s university, the University of Alabama, Birmingham.  Mirabile dictu; now the increased risk appeared limited only to the defendant’s medication, Viagra!

The trial court was not amused.  First, the new analysis was no longer peer reviewed, and the court had placed a great deal of emphasis on peer review in denying the first challenge to McGwin.  Second, the new analysis was no longer that of an independent scientist, but was conducted and submitted as a letter to the editor, while McGwin was working for plaintiffs’ counsel.  Third, the plaintiffs and McGwin conceded that the data were not accurate.  Last, but not least, the trial court clearly was not pleased that the plaintiffs’ counsel had deliberately delayed McGwin’s further analyses until after the deposition, and then tried to submit yet another supplemental report with those further analyses. In sum:

“the Court finds good reason to vacate its original Daubert Order permitting Dr. McGwin to testify as a general causation expert based on the McGwin Study as published. Almost every indicia of reliability the Court relied on in its previous Daubert Order regarding the McGwin Study has been shown now to be unreliable.  Peer review and publication mean little if a study is not based on accurate underlying data. Likewise, the known rate of error is also meaningless if it is based on inaccurate data. Even if the McGwin Study as published was conducted according to generally accepted epidemiologic research and did not result from post-litigation research, the fact that the McGwin Study appears to have been based on data that cannot now be documented or supported renders it inadmissibly unreliable. The Court concludes that under Daubert, Dr. McGwin’s opinion, to the extent that it is based on the McGwin Study as published, lacks sufficient indicia of reliability to be admitted as a general causation opinion.”

Id. at 945-46.  The remaining evidence was the Margo & French study, but McGwin had previously criticized that study as lacking data that ensured that Viagra use preceded onset of NAION.  In the end, McGwin was left with bupkes, and the plaintiffs were left with even less.

*******************

McGwin 2006 Was Also A Pain in the Rear End for McGwin

The Rule 702 motions and hearings on McGwin’s proposed testimony had consequences in the scientific world itself.  In 2011, the British Journal of Ophthalmology retracted McGwin’s 2006 paper.  “Retraction: Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction, ” 95 Brit. J. Ophthalmol. 595 (2011).

Interestingly, the retraction was reported in the Retraction Watch blog, “Retractile dysfunction? Author says journal yanked paper linking Viagra, Cialis to vision problem after legal threats.”  The blog treated the retraction as routine except for the hint of “legal pressure”:

“One of the authors of the paper, a researcher at the University of Alabama named Gerald McGwin Jr., told us that the journal retracted the article because it had become a tool in a lawsuit involving Pfizer, which makes Viagra, and, presumably, men who’d developed blindness after taking the drug:

‘The article just became too much of a pain in the rear end. It became one of those things where we couldn’t provide all the relevant documentation [to the university, which had to provide records for attorneys].’

Ultimately, however, McGwin said that the BJO pulled the plug on the paper.”

Id. The legal threat is hard to discern other than the fact that lawyers wanted to see something that peer reviewers almost never see – the documentation underlying the published paper.  So now, the study that formed the basis for the original ruling against Pfizer floats aimlessly as a derelict on the sea of science.  McGwin is, however, still at his craft.  In a study he published in 2010, he claimed that Viagra but not Cialis use was associated with hearing impairment.  Gerald McGwin, Jr, “Phosphodiesterase Type 5 Inhibitor Use and Hearing Impairment,” 136 Arch. Otolaryngol. Head & Neck Surgery 488 (2010).

Where are Senator Grassley and Congressman Waxman when you need them?

Love is Blind but What About Judicial Gatekeeping of Expert Witnesses? – Viagra Part I

July 7th, 2012

The Viagra litigation over claimed vision loss vividly illustrates the difficulties that trial judges have in understanding and applying the concept of statistical significance.  In this MDL, plaintiffs sued for a specific form of vision loss, non-arteritic ischemic optic neuropathy (NAION), which they claimed was caused by their use of defendant’s medication, Viagra.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071 (D. Minn. 2008).  Plaintiffs’ key expert witness, Gerald McGwin, considered three epidemiologic studies; none found a statistically significant elevation of risk of NAION after Viagra use.  Id. at 1076.  The defense filed a Rule 702 motion to exclude McGwin’s testimony, based in part upon the lack of statistical significance of the risk ratios he relied upon for his causal opinion.  The trial court held that this lack did not render McGwin’s testimony unreliable and inadmissible.  Id. at 1090.

One of the three studies considered by McGwin was his own published paper.  G. McGwin, Jr., M. Vaphiades, T. Hall, C. Owsley, ‘‘Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,’’ 90 Br. J. Ophthalmol. 154 (2006) [“McGwin 2006”].  The MDL court noted that McGwin had stated that his paper reported an odds ratio (OR) of 1.75, with a 95% confidence interval (CI) of 0.48 to 6.30.  Id. at 1080.  The study also presented multiple subgroup analyses of men who had reported Viagra use after a history of heart attack (OR = 10.7) or hypertension (OR = 6.9), but the MDL court did not provide p-values or confidence intervals for these subgroup results.

Curiously, Judge Magnuson eschewed the guidance of the Reference Manual on Scientific Evidence in dealing with sample estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.  The new third edition again conveys the same basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The point of the RMSE‘s guidance is, of course, that the standard error, or the confidence interval (C.I.) based upon a specified number of standard errors, is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.
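To make the Manual’s guidance concrete, here is a minimal sketch (in Python, with made-up numbers that do not come from any study discussed in these posts) of how a sample proportion, its standard error, and an approximate 95% confidence interval fit together:

```python
import math

# Hypothetical example: 150 "exposed" subjects observed in a sample of 500.
# These counts are illustrative only, not drawn from any study.
successes = 150
n = 500
p_hat = successes / n

# Standard error of a sample proportion: sqrt(p(1 - p) / n).
se = math.sqrt(p_hat * (1 - p_hat) / n)

# A conventional (Wald) 95% confidence interval is the estimate
# plus or minus about 1.96 standard errors.
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(f"estimate = {p_hat:.3f}; standard error = {se:.4f}")
print(f"approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

Reporting the estimate together with its standard error or interval, as the Manual urges, is what lets a reader – or a gatekeeping court – see how much random error surrounds the point estimate.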

The statistical data on which McGwin based his opinion were readily available from McGwin 2006:

“Overall, males with NAION were no more likely to report a history of Viagra … use compared to similarly aged controls (odds ratio (OR) 1.75, 95% confidence interval (CI) 0.48 to 6.30).  However, for those with a history of myocardial infarction, a statistically significant association was observed (OR 10.7, 95% CI 1.3 to 95.8). A similar association was observed for those with a history of hypertension though it lacked statistical significance (OR 6.9, 95% CI 0.8 to 63.6).”

McGwin 2006, at 154.  Following the RMSE‘s guidance would have assisted the MDL court in its gatekeeping responsibility in several distinct ways.  First, the court would have focused on how wide the 95% confidence intervals were.  The width of the intervals pointed to statistical imprecision and instability in the point estimates urged by McGwin.  Second, the MDL court would have confronted the extent to which there were multiple ad hoc subgroup analyses in McGwin’s paper.  See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002) (“It is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns.”)  Third, the court would have confronted the extent to which the study’s validity was undermined by several potent biases.  Statistical significance was the least of the problems faced by McGwin 2006.
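On the multiple-comparisons point, a short illustration (generic Python; the numbers of tests are hypothetical, not a count taken from McGwin 2006, and the calculation assumes independent tests, which only roughly describes overlapping subgroups) shows how quickly the chance of at least one spuriously “significant” finding grows when several subgroup analyses are each tested at the conventional 0.05 level:

```python
# Family-wise false-positive probability for k independent tests at level
# alpha, when the null hypothesis is in fact true for all of them:
# P(at least one "significant" result) = 1 - (1 - alpha) ** k.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests at alpha = 0.05 -> "
          f"P(at least one false positive) = {p_any_false_positive:.2f}")
```

This is why post hoc subgroup findings call for considerable caution before they are treated as evidence of anything, let alone causation.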

The second study considered and relied upon by McGwin was referred to as Margo & French.  McGwin cited this paper for an “elevated OR of 1.10,” id. at 1081, but again, had the court engaged with the actual evidence, it would have found that McGwin had cherry-picked the data he chose to emphasize.  The Margo & French study was a retrospective cohort study using the National Veterans Health Administration’s pharmacy and clinical databases.  C. Margo & D. French, ‘‘Ischemic optic neuropathy in male veterans prescribed phosphodiesterase-5 inhibitors,’’ 143 Am. J. Ophthalmol. 538 (2007).  There were two outcomes ascertained: NAION and “possible” NAION.  The relative risk of NAION among men prescribed a PDE-5 inhibitor (the class to which Viagra belongs) was 1.02 (95% confidence interval [CI]: 0.92 to 1.12).  In other words, the Margo & French paper had very high statistical precision, and it reported essentially no increased risk at all.  Judge Magnuson uncritically cited McGwin’s endorsement of a risk ratio that included ‘‘possible’’ NAION cases, which could not bode well for a gatekeeping process that is supposed to protect against speculative evidence and conclusions.

McGwin’s citation of Margo & French for the proposition that men who had taken the PDE-5 inhibitors had a 10% increased risk was wrong on several counts.  First, he relied upon an outcome measure that included ‘‘possible’’ cases of NAION.  Second, he completely ignored the sampling error that is captured in the confidence interval.  The MDL court failed to note or acknowledge the p-value or confidence interval for any result in Margo & French. The consideration of random error was not an optional exercise for the expert witness or the court; nor was ignoring it a methodological choice that simply went to the ‘‘disagreement among experts.’’
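The contrast the MDL court never drew is easy to see from the published intervals themselves.  The following sketch (a rough Python illustration, assuming the reported intervals were computed in the usual way on the log scale with a normal approximation) back-calculates the approximate standard error of each log ratio and the implied two-sided p-value for a null ratio of 1.0, using the figures quoted above from McGwin 2006 and Margo & French:

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function (no external libraries).
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def se_from_ci(lower, upper, z=1.96):
    # Approximate SE of the log ratio, assuming the reported 95% CI was
    # constructed as exp(log estimate +/- 1.96 * SE).
    return (math.log(upper) - math.log(lower)) / (2 * z)

def two_sided_p(ratio, lower, upper):
    # Approximate two-sided p-value for the null hypothesis ratio = 1.0.
    z = math.log(ratio) / se_from_ci(lower, upper)
    return 2 * (1 - normal_cdf(abs(z)))

# Point estimates and 95% CIs as reported in the opinion and the papers:
estimates = {
    "McGwin 2006, overall OR":              (1.75, 0.48, 6.30),
    "McGwin 2006, myocardial infarction OR": (10.7, 1.3, 95.8),
    "McGwin 2006, hypertension OR":          (6.9, 0.8, 63.6),
    "Margo & French, NAION RR":              (1.02, 0.92, 1.12),
}

for label, (ratio, lo, hi) in estimates.items():
    print(f"{label}: SE(log ratio) ~ {se_from_ci(lo, hi):.2f}, "
          f"two-sided p ~ {two_sided_p(ratio, lo, hi):.2f}")
```

The exercise shows the point in miniature: the McGwin intervals are enormously wide (large standard errors on the log scale), while the Margo & French interval is tight around a ratio of essentially 1.0.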

The Viagra MDL court not only lost its way by ignoring the guidance of the RMSE; it also appeared to confuse the magnitude of the associations with the concept of statistical significance.  In the midst of the discussion of statistical significance, the court digressed to address the notion that the small relative risk in Margo & French might mean that no plaintiff could show specific causation, and then in the same paragraph returned to state that ‘‘persuasive authority’’ supported the notion that the lack of statistical significance did not detract from the reliability of a study.  Id. at 1081 (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., MDL No. 1407, 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003)).  The magnitude of the observed odds ratio is a concept independent of whether an odds ratio at least as extreme would have occurred by chance if there were really no elevation of risk.

Citing one case, at odds with a great many others, however, did not create an epistemic warrant for ignoring the lack of statistical significance.  The entire notion of citing caselaw for the meaning and importance of statistical significance in drawing inferences is wrongheaded.  Even more to the point, the lack of statistical significance in the key study in the PPA litigation did not detract from the reliability of the study, although other features of that study certainly did.  The lack of statistical significance in the PPA study did, however, detract from the reliability of the inference from the study’s estimate of ‘‘effect size’’ to a conclusion of causal association.  Indeed, nowhere in the key PPA study did its authors draw a causal conclusion with respect to PPA ingestion and hemorrhagic stroke.  See Walter Kernan, Catherine Viscoli, Lawrence Brass, Joseph Broderick, Thomas Brott, Edward Feldmann, Lewis Morgenstern, Janet Lee Wilterdink, and Ralph Horwitz, ‘‘Phenylpropanolamine and the Risk of Hemorrhagic Stroke,’’ 343 New England J. Med. 1826 (2000).

The MDL court did attempt to distinguish the Eighth Circuit’s decision in Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001), cited by the defense:

‘‘[I]n Glastetter … expert evidence was excluded because ‘rechallenge and dechallenge data’ presented statistically insignificant results and because the data involved conditions ‘quite distinct’ from the conditions at issue in the case. Here, epidemiologic data is at issue and the studies’ conditions are not distinct from the conditions present in the case. The Court does not find Glastetter to be controlling.’’

Id. at 1081 (internal citations omitted; emphasis in original).  This reading of Glastetter, however, misses important features of that case and the Parlodel litigation more generally.  First, the Eighth Circuit commented not only upon the rechallenge-dechallenge data, which involved arterial spasms, but upon an epidemiologic study of stroke, from which Ms. Glastetter suffered.  The Glastetter court did not review the epidemiologic evidence itself, but cited to another court, which did discuss and criticize the study for various ‘‘statistical and conceptual flaws.’’  See Glastetter, 252 F.3d at 992 (citing Siharath v. Sandoz Pharms. Corp., 131 F. Supp. 2d 1347, 1356-59 (N.D. Ga. 2001)).  Glastetter was binding authority, and not so easily dismissed and distinguished.

The Viagra MDL court ultimately placed its holding upon the facts that:

‘‘the McGwin et al. and Margo et al. studies were peer-reviewed, published, contain known rates of error, and result from generally accepted epidemiologic research.’’

In re Viagra, 572 F. Supp. 2d at 1081 (citations omitted).  This holding was a judicial ipse dixit substituting for the expert witness’s ipse dixit.  There were no known rates of error for the systematic errors in the McGwin study, and the ‘‘known’’ rates of error for random error in McGwin 2006  were intolerably high.  The MDL court never considered any of the error rates, systematic or random, for the Margo & French study.  The court appeared to have abdicated its gatekeeping responsibility by delegating it to unknown peer reviewers, who never considered whether the studies at issue in isolation or together could support a causal health claim.

With respect to the last of the three studies considered, the Gorkin study, McGwin opined that it was  too small, and the data were not suited to assessing temporal relationship.  Id.  The court did not appear inclined to go beyond McGwin’s ipse dixit.  The Gorkin study was hardly small, in that it was based upon more than 35,000 patient-years of observation in epidemiologic studies and clinical trials, and provided an estimate of incidence for NAION among users of Viagra that was not statistically different from the general U.S. population.  See L. Gorkin, K. Hvidsten, R. Sobel, and R. Siegel, ‘‘Sildenafil citrate use and the incidence of nonarteritic anterior ischemic optic neuropathy,’’ 60 Internat’l J. Clin. Pract. 500, 500 (2006).

Judge Magnuson did proceed, in his 2008 opinion, to exclude all the other expert witnesses put forward by the plaintiffs.  McGwin survived the defendant’s Rule 702 challenge, largely because the court refused to consider the substantial random variability in the point estimates from the studies relied upon by McGwin.  There was no consideration of the magnitude of random error, or, for that matter, of the systematic error in McGwin’s study.  The MDL court found that the studies upon which McGwin relied had a known and presumably acceptable ‘‘rate of error.’’  In fact, the court did not consider the random or sampling error in any of the three cited studies; it failed to consider the multiple testing and subgroup interactions; and it failed to consider the actual and potential biases in the McGwin study.

Some legal commentators have argued that statistical significance should not be a litmus test.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 241 (‘‘Statistical significance should not be a litmus test. However, there are many situations where the lack of significance combined with other aspects of the research should be enough to exclude an expert’s testimony.’’)  While I agree that significance probability should not be evaluated in a mechanical fashion, without consideration of study validity, multiple testing, bias, confounding, and the like, hand waving about litmus tests does not excuse courts or commentators from totally ignoring random variability in studies based upon population sampling.  The dataset in the Viagra litigation was not a close call.

Maryland Puts the Brakes on Each and Every Asbestos Exposure

July 3rd, 2012

Last week, the Maryland Court of Special Appeals reversed a plaintiffs’ verdict in Dixon v. Ford Motor Company, 2012 WL 2483315 (Md. App. June 29, 2012).  Jane Dixon died of pleural mesothelioma.  The plaintiffs, her survivors, claimed that her last illness and death were caused by asbestos exposure from her household improvement projects, which involved spackling/joint compound, and from her husband’s work with car parts and brake linings, which involved “take home” exposure on his clothes.  Id. at *1.

All the expert witnesses appeared to agree that mesothelioma is a “dose-response disease,” meaning that the more the exposure, the greater the likelihood that a person exposed will develop the disease. Id. at *2.  Plaintiffs’ expert witness, Dr. Laura Welch, testified that “every exposure to asbestos is a substantial contributing cause and so brake exposure would be a substantial cause even if [Mrs. Dixon] had other exposures.” On cross-examination, Dr. Welch elaborated upon her opinion to explain that any “discrete” exposure would be a contributing factor. Id.

Welch, of course, criticized the entire body of epidemiology of car mechanics and brake repairmen, which generally finds no increased risk of mesothelioma above overall population rates.  With respect to the take-home exposure, Welch had to acknowledge that there were no epidemiologic studies that investigated the risk to wives of brake mechanics.  Welch argued that the studies of car mechanics did not involve exposure to brake shoes as would have been experienced by brake repairmen, but her argument only served to make her attribution based upon take-home exposure to brake linings seem more preposterous.  Id. at *3.  The court recognized that Dr. Welch’s opinion may have been trivially true, but it was still unhelpful.  Each discrete exposure, even one as attenuated as take-home exposure from the repair of a single brake shoe, may have “contributed,” but that opinion did not help the jury assess whether the contribution was substantial.

The court sidestepped the issues of fiber type and threshold, and homed in on the agreement that mesothelioma risk showed a dose-response relationship with asbestos exposure.  (There is a sense that the court confused the dose-response concept with the absence of a threshold.)  The court credited hyperbolic risk-assessment figures from the United States Environmental Protection Agency, which suggested that even ambient air exposure to asbestos leads to an increase in mesothelioma risk.  The court then realized, however, that such claims made it all the more important for the plaintiffs to characterize the risk from the defendant’s product before the jury could reasonably conclude that any particular exposure experienced by Ms. Dixon was “a substantial contributing factor.”  Id. at *5.

Having recognized that the best the plaintiffs could offer was a claim of increased risk, and perhaps a crude quantification of the relative risks resulting from each product’s exposure, the court could not escape the conclusion that Dr. Welch’s empty recitation that “every exposure” is substantial was nothing more than an unscientific and empty assertion.  Welch’s claim was either tautologically true or empirical nonsense.  The court also recognized that risk substituting for causation opened the door to essentially probabilistic evidence:

“If risk is our measure of causation, and substantiality is a threshold for risk, then it follows—as intimated above—that ‘substantiality’ is essentially a burden of proof. Moreover, we can explicitly derive the probability of causation from the statistical measure known as ‘relative risk’ … .  For reasons we need not explore in detail, it is not prudent to set a singular minimum ‘relative risk’ value as a legal standard.12 But even if there were some legal threshold, Dr. Welch provided no information that could help the finder of fact to decide whether the elevated risk in this case was ‘substantial’.”

Id. at *7.  The court’s discussion here of “the elevated risk” seems wrong unless we understand it to mean the elevated risk attributable to the particular defendant’s product, in the context of an overall exposure that we accept as having been sufficient to cause the decedent’s mesothelioma.  Despite the lack of any quantification of relative risks in the case, overall or from particular products, and the court’s own admonition against setting a minimum relative risk as a legal standard, the court proceeded to discuss relative risks at length.  For instance, the court criticized Judge Kozinski’s opinion in Daubert, upon remand from the Supreme Court, for not going far enough:

“In other words, the Daubert court held that a plaintiff’s risk of injury must have at least doubled in order to hold that the defendant’s action was ‘more likely than not’ the actual cause of the plaintiff’s injury. The problem with this holding is that relative risk does not behave like a ‘binary’ hypothesis that can be deemed ‘true’ or ‘false’ with some degree of confidence; instead, the uncertainty inherent in any statistical measure means that relative risk does not resolve to a certain probability of specific causation. In order for a study of relative risk to truly fulfill the preponderance standard, it would have to result in 100% confidence that the relative risk exceeds two, which is a statistical impossibility. In short, the Daubert approach to relative risk fails to account for the twin statistical uncertainty inherent in any scientific estimation of causation.”

Id. at *7 n.12 (citing Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320-21 (9th Cir. 1995) (holding that a preponderance standard requires causation to be shown by probabilistic evidence of relative risk greater than two) (opinion on remand from Daubert v. Merrell Dow Pharms., 509 U.S. 579 (1993))).  The statistical impossibility derives from the asymptotic nature of the normal distribution, but the court failed to explain why a relative risk of two must be excluded as statistically implausible based upon the sample statistic.  After all, a relative risk greater than two, with a lower bound of a 95% confidence interval above one, based upon unbiased sampling, suggests that our best evidence is that the population parameter is greater than two as well.  The court, however, insisted upon stating the relative-risk-greater-than-two rule with a vengeance:

“All of this is not to say, however, that any and all attempts to establish a burden of proof of causation using relative risk will fail. Decisions can be – and in science or medicine are – premised on the lower limit of the relative risk ratio at a requisite confidence level. The point of this minor discussion is that one cannot apply the usual, singular ‘preponderance’ burden to the probability of causation when the only estimate of that probability is statistical relative risk. Instead, a statistical burden of proof of causation must consist of two interdependent parts: a requisite confidence of some minimum relative risk. As we explain in the body of our discussion, the flaws in Dr. Welch’s testimony mean we need not explore this issue any further.44”

Id. (emphasis in original).

And despite having declared the improvidence of addressing the relative risk issue, and then the lack of necessity for addressing the issue given Dr. Welch’s flawed testimony, the court nevertheless tackled the issue once more, a couple of pages later:

“It would be folly to require an expert to testify with absolute certainty that a plaintiff was exposed to a specific dose or suffered a specific risk. Dose and risk fall on a spectrum and are not ‘true or false’. As such, any scientific estimate of those values must be expressed as one or more possible intervals and, for each interval, a corresponding confidence that the true value is within that interval.”

Id. at *9 (emphasis in original; internal citations omitted).  The court captured the frequentist concept of the confidence interval as defined operationally by repeated samplings and their random variability.  But the “confidence” of a confidence interval means that the specified coefficient represents the percentage of all such intervals that would include the “true” value, not the probability that a particular interval, calculated from a given sample, contains the true value.  The true value is either in, or not in, the interval generated from a single sample statistic.  Again, it is unclear why the court was weighing in on this aspect of probabilistic evidence when plaintiffs’ expert witness, Welch, offered no quantification of the overall risk or of the risk attributable to a specific product exposure.
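The distinction is easier to see in a simulation.  The sketch below (illustrative Python only, with an arbitrary “true” proportion chosen for the example) draws many samples from a population with a known proportion, computes a conventional 95% interval from each sample, and counts how often the intervals cover the truth.  Roughly 95% of the intervals do, but any single interval either contains the true value or it does not:

```python
import math
import random

random.seed(12345)

TRUE_P = 0.30   # the "true" population proportion (arbitrary, for illustration)
N = 400         # sample size per simulated study
TRIALS = 2000   # number of repeated samples

covered = 0
for _ in range(TRIALS):
    successes = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = successes / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)
    lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
    if lower <= TRUE_P <= upper:
        covered += 1

print(f"Fraction of 95% intervals covering the true value: {covered / TRIALS:.3f}")
```

The 95% figure describes the long-run behavior of the interval-generating procedure, not the probability that any one published interval contains the parameter.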

The court indulged the plaintiffs’ no-threshold fantasy but recognized that the risks of low-level asbestos exposure were low, and likely below a doubling of risk, an issue that the court stressed it wanted to avoid.  The court cited one study that suggested a risk (odds) ratio of 1.1 for exposures less than 0.5 fiber/ml-years.  See id. at *5 (citing Y. Iwatsubo et al., “Pleural mesothelioma: dose-response relation at low levels of asbestos exposure in a French population-based case-control study,” 148 Am. J. Epidemiol. 133 (1998) (estimating an odds ratio of 1.1 for exposures less than 0.5 fiber/ml-years)).  But the court, which tried to be precise elsewhere, appears to have lost its way in citing Iwatsubo here.  After all, how can a single odds ratio of 1.1 describe all exposures from 0 all the way up to 0.5 fiber/ml-years?  How can a single odds ratio describe all exposures in this range, regardless of fiber type, when chrysotile asbestos carries little to no risk for mesothelioma, and certainly orders of magnitude less risk than amphibole fibers such as amosite and crocidolite?  And if a low-level exposure has a risk ratio of 1.1, how can plaintiffs’ hired expert witness, Welch, even make the attribution of Dixon’s mesothelioma to the entirety of her exposure, let alone to the speculative take-home chrysotile exposure from Ford’s brake linings?  Obviously, had the court posed these questions, it would have realized that “it is not possible” to permit Welch’s testimony at all.
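The arithmetic behind that rhetorical question is simple.  Under the usual simplifications behind the doubling-of-risk heuristic (an unbiased relative risk and a homogeneous exposed population, assumptions stated here only for illustration), the probability that an exposure caused a given case is approximated by the attributable fraction among the exposed, (RR − 1)/RR.  A minimal sketch:

```python
def probability_of_causation(rr):
    """Attributable fraction among the exposed, (RR - 1) / RR, the usual
    simplification behind the 'relative risk greater than two' rule.
    Assumes an unbiased ratio and a homogeneous exposed population."""
    return 0.0 if rr <= 1.0 else (rr - 1.0) / rr

# Illustrative relative risks (hypothetical values, not attributed to any study):
for rr in (1.1, 1.5, 2.0, 3.0):
    print(f"RR = {rr}: probability of causation ~ {probability_of_causation(rr):.0%}")
```

A ratio of 1.1 translates into a probability of causation of roughly 9%, far short of “more likely than not,” which is the point of the court’s own doubling-of-risk discussion.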

The court further lost its way in addressing the exculpatory epidemiology put forward by the defense expert witnesses:

“Furthermore, the leading epidemiological report cited by Ford and its amici that specifically studied ‘brake mechanics’, P.A. Hessel et al., ‘Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study’, 24 Risk Analysis 547 (2004), does not at all dispel the notion that this population faced an increased risk of mesothelioma due to their industrial asbestos exposure. … When calculated at the 95% confidence level, Hessel et al. estimated that the odds ratio of mesothelioma could have been as low as 0.01 or as high as 4.71, implying a nearly quintupled risk of mesothelioma among the population of brake mechanics. 24 Risk Analysis at 550–51.”

Id. at *8.  Again, the court fixated on the confidence interval, to the exclusion of the estimated magnitude of the association!  This time, after earlier shouting that it is the lower bound of the interval that matters scientifically, the court emphasized the upper bound.  The court here strayed far from the actual data, and from any plausible interpretation of them:

“The odds ratio (OR) for employment in brake installation or repair was 0.71 (95% CI: 0.30-1.60) when controlled for insulation or shipbuilding. When a history of employment in any of the eight occupations with potential asbestos exposure was controlled, the OR was 0.82 (95% CI: 0.36-1.80). ORs did not increase with increasing duration of brake work. Exclusion of those with any of the eight exposures resulted in an OR of 0.62 (95% CI: 0.01-4.71) for occupational brake work.”

P.A. Hessel et al., “Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study,” 24 Risk Analysis 547, 547 (2004).  All of the effect-size estimates in the Hessel study were below 1.0, and the authors found no trend with duration of brake work.  Cherry-picking the upper bound of a single subgroup analysis for emphasis was unwarranted, and hardly did justice to the facts or the science.

Dr. Welch’s conclusion that the exposure and risk in this case were “substantial” simply was not a scientific conclusion, and without it her testimony did not provide information the jury could use in reaching its conclusion as to substantial-factor causation.  Id. at *7.  The court noted that Welch, and the plaintiffs, may have lacked scientific data to provide estimates of Dixon’s exposure to asbestos or relative risk of mesothelioma, but ignorance or uncertainty hardly warranted an expert witness’s belief that the relevant exposures and risks were “substantial.”  Id. at *10.  The court was well justified in being discomforted by the conclusory, unscientific opinion rendered by Laura Welch.

In the final puzzle of the Dixon case, the court vacated the judgment, and remanded for a new trial, “either without her opinion on substantiality or else with some quantitative testimony that will help the jury fulfill its charge.”  Id. at *10.  The court thus seemed to imply that an expert witness need not utter the magic word, “substantial,” for the case to be submitted to the jury against a brake defendant in a take-home exposure case.  Given the state of the record, the court should have simply reversed and rendered judgment for Ford.

Ecological Fallacy Goes to Court

June 30th, 2012

In previous posts, I have bemoaned the judiciary’s tin ear for important qualitative differences between and among different research study designs.  The Reference Manual on Scientific Evidence (3d ed. 2011) (RMSE3d) offers inconsistent advice, ranging from Margaret Berger’s counsel to abandon any hierarchy of evidence, to other chapters’ emphasis on the importance of such a hierarchy.

The Cook case is one of the more aberrant decisions; it elevated an ecological study, without a statistically significant result, into an acceptable basis for a causal conclusion under Rule 702.  Senior Judge Kane’s decision in the litigation over radioactive contamination from the Colorado Rocky Flats nuclear weapons plant illustrates a judicial refusal to engage with the substantive differences among study designs, and a willingness to ignore the inability of some designs to support causal inferences.  See Cook v. Rockwell Internat’l Corp., 580 F. Supp. 2d 1071, 1097-98 (D. Colo. 2006) (“Defendants assert that ecological studies are inherently unreliable and therefore inadmissible under Rule 702.  Ecological studies, however, are one of several methods of epidemiological study that are well-recognized and accepted in the scientific community.”), rev’d and remanded on other grounds, 618 F.3d 1127 (10th Cir. 2010), cert. denied, ___ U.S. ___ (May 24, 2012).  Senior Judge Kane’s point about the recognition and acceptance of ecological studies has nothing to do with their ability to support conclusions of causality.  This basic non sequitur led the trial judge into ruling that the challenge “goes to the weight, not the admissibility” of the challenged opinion testimony.  This is a bit like using an election-day exit poll, with 5% returns, as “reliable” evidence to support a prediction of the winner.  The poll may have been conducted most expertly, but it lacks the ability to predict the winner.

The issue is not whether ecological studies are “scientific”; they are part of the epidemiologists’ toolkit.  The issue is whether they warrant inferences of causation.  Some so-called scientific studies are merely hypothesis-generating, preliminary, tentative, or data-dredging exercises.  Judge Kane opined that ecological studies are merely “less probative” than other studies, and that the relative weights of studies do not render them inadmissible.  Id.  This is a misunderstanding or an abdication of gatekeeping responsibility.  First, studies themselves are not admitted into evidence; it is the expert witness whose testimony is challenged.  Second, Rule 702 requires that the proffered opinion be “scientific knowledge,” and ecological studies simply lack the necessary epistemic warrant.

The legal sources cited by Senior Judge Kane provide, at best, only equivocal and minimal support for his decision.  The court pointed to RMSE2d at 344-45, for the proposition that ecological studies are useful for establishing associations, but are weak evidence for causality.  The other legal citations seem equally unhelpful.  In re Hanford Nuclear Reservation Litig., No. CY–91–3015–AAM, 1998 WL 775340, at *106 (E.D. Wash. Aug. 21, 1998) (citing RMSE2d and the National Academy of Science Committee on Radiation Dose Reconstruction for Epidemiological Uses, which states that “ecological studies are usually regarded as hypothesis generating at best, and their results must be regarded as questionable until confirmed with cohort or case‑control studies.” National Research Council, Radiation Dose Reconstruction for Epidemiologic Uses at 70 (1995)), rev’d on other grounds, 292 F.3d 1124 (9th Cir. 2002); Ruff v. Ensign-Bickford Indus., Inc., 168 F. Supp. 2d 1271, 1282 (D. Utah 2001) (reviewing evidence that consisted of a case-control study in addition to an ecological study; “It is well established in the scientific community that ecological studies are correlational studies and generally provide relatively weak evidence for establishing a conclusive cause and effect relationship.’’); see also id. at 1274 n.3 (“Ecological studies tend to be less reliable than case–control studies and are given little evidentiary weight with respect to establishing causation.”).

 

ERROR COMPOUNDED

The new edition of the RMSE cites the Cook case in several places.  In an introductory chapter, the late Professor Margaret Berger cites the case incorrectly for having excluded expert witness testimony.  See Margaret A. Berger, “The Admissibility of Expert Testimony,” 11, 24 n.62, in RMSE3d (“See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”).  The chapter on epidemiology cites Cook correctly for having refused to exclude the plaintiffs’ expert witness, Dr. Richard Clapp, who relied upon an ecological study of two cancer outcomes in the area adjacent to the Rocky Flats Nuclear Weapons Plant.  See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” 549, 561 n.34, in Reference Manual on Scientific Evidence (3d ed. 2011).  The authors, however, abstain from any judgmental comment about the Cook case, which is curious given their careful treatment of ecological studies and their limitations:

“4. Ecological studies

Up to now, we have discussed studies in which data on both exposure and health outcome are obtained for each individual included in the study.33 In contrast, studies that collect data only about the group as a whole are called ecological studies.34 In ecological studies, information about individuals is generally not gathered; instead, overall rates of disease or death for different groups are obtained and compared. The objective is to identify some difference between the two groups, such as diet, genetic makeup, or alcohol consumption, that might explain differences in the risk of disease observed in the two groups.35 Such studies may be useful for identifying associations, but they rarely provide definitive causal answers.36”

Id. at 561.  The epidemiology chapter proceeds to note that the lack of information about individual exposure and disease outcome in an ecological study “detracts from the usefulness of the study,” and renders it prone to erroneous inferences about the association between exposure and outcome, “a problem known as an ecological fallacy.”  Id. at 562.  The chapter authors define the ecological fallacy:

“Also, aggregation bias, ecological bias. An error that occurs from inferring that a relationship that exists for groups is also true for individuals.  For example, if a country with a higher proportion of fishermen also has a higher rate of suicides, then inferring that fishermen must be more likely to commit suicide is an ecological fallacy.”

Id. at 623.  Although the ecological study design is weak and generally unsuitable to support causal inferences, the authors note that such studies can be useful in generating hypotheses for future research using studies that gather data about individuals. Id. at 562.  See also David Kaye & David Freedman, “Reference Guide on Statistics,” 211, 266 n.130 (citing the epidemiology chapter “for suggesting that ecological studies of exposure and disease are ‘far from conclusive’ because of the lack of data on confounding variables (a much more general problem) as well as the possible aggregation bias”); Leon Gordis, Epidemiology 205-06 (3d ed. 2004)(ecologic studies can be of value to suggest future research, but “[i]n and of themselves, however, they do not demonstrate conclusively that a causal association exists”).
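The fishermen example can be made concrete with a toy simulation.  The sketch below (illustrative Python only; the figures are fabricated solely to demonstrate the fallacy, not to model any real population) constructs regions in which the group-level suicide rate rises with the share of fishermen even though, within every region, fishermen face exactly the same risk as everyone else:

```python
# Fabricated data: three regions.  Within each region, fishermen and
# non-fishermen share exactly the same risk, but regions with more
# fishermen happen to have higher baseline risk (a region-level confounder).
regions = [
    # (population, fraction fishermen, risk for everyone in the region)
    (100_000, 0.05, 0.001),
    (100_000, 0.20, 0.002),
    (100_000, 0.40, 0.004),
]

print("Group-level view (all an ecological study can see):")
for pop, frac_fish, risk in regions:
    expected_deaths = pop * risk
    print(f"  fishermen share {frac_fish:.0%} -> overall rate {expected_deaths / pop:.4f}")

print("\nIndividual-level view (within each region):")
for pop, frac_fish, risk in regions:
    fishermen = pop * frac_fish
    others = pop - fishermen
    fish_rate = (fishermen * risk) / fishermen    # identical by construction
    other_rate = (others * risk) / others
    print(f"  fishermen rate {fish_rate:.4f} vs non-fishermen rate {other_rate:.4f} "
          f"(relative risk = {fish_rate / other_rate:.1f})")
```

The group-level gradient is real, but it reflects something about the regions, not about fishermen; inferring individual risk from the aggregate comparison is precisely the ecological fallacy the Manual describes.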

The views about ecological studies expressed in the Reference Manual on Scientific Evidence are hardly unique.  The following quotes show how ecological studies are typically evaluated in epidemiology texts:

“Ecological fallacy

An ecological fallacy or bias results if inappropriate conclusions are drawn on the basis of ecological data. The bias occurs because the association observed between variables at the group level does not necessarily represent the association that exists at the individual level (see Chapter 2).

***

Such ecological inferences, however limited, can provide a fruitful start for more detailed epidemiological work.”

R. Bonita, R. Beaglehole, and T. Kjellström, Basic Epidemiology 43 (2d ed., WHO 2006).

“A first observation of a presumed relationship between exposure and disease is often done at the group level by correlating one group characteristic with an outcome, i.e. in an attempt to relate differences in morbidity or mortality of population groups to differences in their local environment, living habits or other factors. Such correlational studies that are usually based on existing data are prone to the so-called ‘ecological fallacy’ since the compared populations may also differ in many other uncontrolled factors that are related to the disease. Nevertheless, ecological studies can provide clues to etiological hypotheses and may serve as a gateway towards more detailed investigations.”

Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 17-18 (2005).

The Cook case is a wonderful illustration of the judicial mindset that avoids and evades gatekeeping by resorting to the conclusory reasoning that a challenge “goes to the weight, not the admissibility” of an expert witness’s opinion.