TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Pin the Tail on the Significance Test

July 14th, 2012

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  See .  An adequate understanding of significance probability requires recognizing that the tail probability, which is the probability of a result at least as extreme as the one obtained if the null hypothesis is true, may be the area under one or both tails of the probability distribution curve.  Specifying an attained significance probability therefore requires us to specify further whether the p-value is one- or two-sided; that is, whether we have counted the observed result and the more extreme results in one direction or in both.

 

Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expect courts and counsel to look to the Reference Manual for guidance.  As with the notion of statistical significance itself, the Manual is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

“One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3d ed. 2011). This advice is pragmatic but a bit misleading.  The reason for the two-tailed test is not really tied to multiple testing.  If there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple testing artifacts. In some cases, where the hypothesis test specifies an alternative hypothesis that is simply not equal to the null hypothesis, extreme values both above and below the null hypothesis count in favor of rejecting the null.  A two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it really should not drive our choice between one-tailed and two-tailed tests.
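The Manual's coin-tossing figures are easy to check. The following is a minimal Python sketch, using only the standard library, that computes the exact binomial tail areas for 532 heads in 1,000 tosses of a fair coin, and then illustrates why doubling a p-value offers little protection when 20 independent tests are run. The function names are illustrative, and the 20-test calculation assumes that every null hypothesis is true and that the tests are independent.

```python
from math import comb


def fair_coin_upper_tail(n, k):
    """Exact P(X >= k) when X counts heads in n tosses of a fair coin."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def chance_of_any_false_positive(alpha, tests):
    """P(at least one rejection) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** tests


# One-tailed: 532 or more heads in 1,000 tosses.
one_tailed = fair_coin_upper_tail(1000, 532)

# A fair coin is symmetric, so "468 or fewer" is exactly as probable as
# "532 or more", and the two-tailed value is exactly double.
two_tailed = 2 * one_tailed

print(f"one-tailed p = {one_tailed:.3f}")
print(f"two-tailed p = {two_tailed:.3f}")

# Twenty independent tests, each run at 5% and then at 2.5%.
print(f"any false positive at 0.05:  {chance_of_any_false_positive(0.05, 20):.2f}")
print(f"any false positive at 0.025: {chance_of_any_false_positive(0.025, 20):.2f}")
```

The tail areas agree with the Manual's 2.3% and 4.6%. Under the stated assumptions, halving the per-test level from 5% to 2.5% lowers the chance of at least one false positive across 20 tests only from roughly 64% to roughly 40%.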

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what the FDA does when looking to see whether a clinical trial has established efficacy of a new medication.  Does it license the medication if the sponsor’s trial comes close to 5%, or does it demand 5%, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law, when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10. This statement is a helpful bulwark against the oft-repeated suggestion that any p-value would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of the statistical tests, whether one- or two-sided, may be tied to prior beliefs and the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direction opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary; there are times when the use of a one-tail or a two-tail test is preferable, but the chapter withholds its guidance. The statement that “one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed test value will not necessarily yield a correct two-tailed measure for some asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and on whether the p-value is less than 0.05, then courts must realize that there may be alternative approaches to calculating significance probability, such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n. 49.  The legal citations, however, are limited, and given the lack of sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.
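The point about asymmetrical distributions can be made concrete with a small exact calculation. The sketch below uses hypothetical numbers (8 events in 20 trials, tested against a null probability of 0.2; none of these figures comes from the Manual or from any case). It compares the one-tailed p-value, that value doubled, and a two-tailed p-value formed by summing the probabilities of all outcomes no more probable than the one observed. It also computes a one-sided mid-p-value, which counts only half the probability of the observed outcome. These are common conventions, not the only ones.

```python
from math import comb


def binom_pmf(n, p, i):
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)


# Hypothetical inputs, chosen only to show the asymmetry.
n, p0, observed = 20, 0.2, 8

pmf = [binom_pmf(n, p0, i) for i in range(n + 1)]

# One-tailed: the observed count or anything larger.
one_tailed = sum(pmf[observed:])

# Convention 1: double the one-tailed value (capped at 1).
doubled = min(1.0, 2 * one_tailed)

# Convention 2: add up every outcome, in either tail, that is no more
# probable than the observed one (small tolerance for rounding error).
cutoff = pmf[observed] * (1 + 1e-9)
two_tailed_by_probability = sum(q for q in pmf if q <= cutoff)

# One-sided mid-p: strictly larger outcomes plus half the observed one.
mid_p = sum(pmf[observed + 1:]) + 0.5 * pmf[observed]

print(f"one-tailed:                 {one_tailed:.4f}")
print(f"doubled:                    {doubled:.4f}")
print(f"two-tailed, by probability: {two_tailed_by_probability:.4f}")
print(f"one-sided mid-p:            {mid_p:.4f}")
```

With these hypothetical inputs, the doubled value falls above 0.05 while the probability-based two-tailed value falls below it, so the choice of convention alone decides on which side of the conventional line the result lands.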

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably the authors of the epidemiology chapter declare that “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease), a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n. 83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and its authors fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite to a questionable case involving over-the-counter medications that contained phenylpropanolamine (PPA), for allergy and cold decongestion. Id. citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation used one-tailed tests when submitted to the FDA, but was revised to use two-tailed tests when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA case thus represents a case in which, for regulatory purposes, the one-tailed test was used, but for a scientific and clinical audience, the two-tailed test was used.

The other case cited by the epidemiology chapter was the District of Columbia Circuit’s review of an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen). The EPA is a federal agency in the “protection” business, not in investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action is often based upon precautionary principle judgments, and is different from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y.1984)(“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir.1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  One of plaintiffs’ expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation, out of hand.  Another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-sided testing as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.)(citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual’s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. 
The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States,4[433 U.S. 299 (1977)] a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view, Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).
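The 1.96 and 1.64 cutoffs that Finkelstein describes follow directly from the tail areas of the standard normal curve. A short Python sketch, assuming a normally distributed test statistic (the function name is illustrative), shows the correspondence:

```python
from math import erfc, sqrt


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))


for z in (1.64, 1.645, 1.96):
    one_tailed = normal_upper_tail(z)
    # The normal curve is symmetric, so the two-tailed area is double.
    two_tailed = 2 * one_tailed
    print(f"z = {z}: one-tailed {one_tailed:.4f}, two-tailed {two_tailed:.4f}")
```

A result about 1.64 standard errors from expectation sits at roughly the 5% level one-tailed, but at roughly the 10% level two-tailed; only at about 1.96 standard errors does the two-tailed area fall to 5%.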

The treatise, Modern Scientific Evidence, addresses some of the caselaw that faced disputes over one- versus two-tailed tests.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tail test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter & Gamble Co. v. Chesebrough-Pond’s Inc., 747 F.2d 114 (2d Cir. 1984)(discussing one-tail versus two in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879 (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998)(rejecting Shanna Swan’s effort to reinterpret study data by using a one-tail test of significance; “Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.”)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)

Tal Golan’s Preliminary History of Epidemiologic Evidence in U.S. Courts

July 10th, 2012

Tal Golan is an historian, with a special interest in the history of science in the 18th and 19th centuries, and in historical relationships among science, technology, and the law.  He now teaches history at the University of California, San Diego.  Golan’s book on the history of expert witnesses in the common law is an important starting place in understanding the evolution of the adversarial expert witness system in English and American courts.  Tal Golan, Laws of Man and Laws of Nature: A History of Scientific Expert Testimony (Harvard 2004).

Last year, Golan led a faculty seminar at the University of Haifa’s Law School on the history of epidemiologic evidence in 20th century American litigation.  A draft of Golan’s paper is available at the school’s website, and for those interested in the evolution of the American courts’ treatment of statistical and epidemiologic evidence, the paper is worth a look.  Tal Golan, “A preliminary history of epidemiological evidence in the twentieth-century American Courtroom” manuscript (2011) [Golan 2011].

There are problems, however, with Golan’s historical narrative.  Golan points to tobacco cases as the earliest forays into the use of epidemiologic evidence to prove health claims in court:

“I found only four toxic tort cases in the 1960s that involved epidemiological evidence – two tobacco and two vaccine cases. In the tobacco cases, the plaintiffs tried and failed to establish a causal relation between smoking and cancer via the testimony of epidemiological experts. In both cases the judges dismissed the epidemiological evidence and directed summary verdicts for the tobacco companies.38”

Golan 2011 at 11 & n. 38 (citing Pritchard v. Liggett & Myers Tobacco Co., 295 F.2d 292 (1961); Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (1963)).  Golan may be correct that some of the early tobacco cases were dismissive of statistical and epidemiologic evidence, but these citations do not support his assertion.  The Lartigue case resulted in a defense verdict after a jury trial.  The judgment for the defendant was affirmed on appeal, with specific reference to the plaintiff’s use of epidemiologic evidence.  Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (5th Cir. 1963) (“The plaintiff contends that the jury’s verdict was contrary to the manifest weight of the evidence. The record consists of twenty volumes, not to speak of exhibits, most of it devoted to medical opinion. The jury had the benefit of chemical studies, epidemiological studies, reports of animal experiments, pathological evidence, reports of clinical observations, and the testimony of renowned doctors. The plaintiff made a convincing case, in general, for the causal connection between tobacco and cancer and, in particular, for the causal connection between Lartigue’s smoking and his cancer. The defendants made a convincing case for the lack of any causal connection.”), cert. denied, 375 U.S. 865 (1963), and cert. denied, 379 U.S. 869 (1964).  Golan is thus wrong to suggest that the plaintiffs in Lartigue suffered a summary judgment or a directed verdict on their causation claims.

In Pritchard, the plaintiff had three trials in the course of litigating his tobacco-related claims.  See Pritchard v. Liggett & Myers Tobacco Co., 134 F. Supp. 829 (W.D. Pa. 1955), rev’d, 295 F.2d 292, 294 (3d Cir. 1961), 350 F.2d 479 (3d Cir. 1965), cert. denied, 382 U.S. 987 (1966), amended, 370 F.2d 95 (3d Cir. 1966), cert. denied, 386 U.S. 1009 (1967).  The Pritchard case ultimately turned on liability more than causation issues.  In both cases, Golan’s citations are abridged and incorrect.

Golan also wades into a discussion of statistical significance in which he misstates the meaning of the concept and he incorrectly describes how it was handled in at least one important case:

“Statistics provides such an assurance by calculating the probability of false association, and the epidemiological dogma demands it to be smaller than 5% (i.e, less than 1 in 20) for the association to be considered statistically significant.”

Golan 2011, at 18.  This statement is wrong.  Statistics do not provide a probability of the truth or falsity of the association.  The significance probability to which Golan refers measures the probability of data at least as extreme as those observed if the null hypothesis of no difference is correct.

Having misunderstood and misstated the meaning of significance probability, Golan proceeds to make the classic misidentification of statistical significance probability with the probability of either the null hypothesis or the observed result.  Frequentist statistical testing cannot do this, and Golan’s error has no place in a history of these concepts other than to point out that courts have frequently made this mistake:

“The ‘statistical significance’ standard is far more demanding than the ‘preponderance of the evidence’ or ‘more likely than not’ standard used in civil law. It reflects the cautious attitude of scientists who wish to be 95% certain that their measurements are not spurious.

**********

Epidemiologists have considered the price well worth paying. So has criminal law, which emphasizes the minimization of false conviction, even at the price of overlooking true crime. But civil law does not share this concern.”

This narrative misstates what epidemiologists are doing in using significance probability and null hypothesis significance testing.  The confusion between epidemiologic statistical standards and burden of proof in criminal cases is a serious error.

Golan compares and contrasts the approaches of the trial judges in Allen v. United States, and in In re Agent Orange:

“Judge Weinstein, on the other hand, was far less concerned with the strictness of the epidemiology. A scholar of evidence, member of the Advisory Committee that drafted the Federal Rules of Evidence during the early 1970s, and a critic of the partisan deployment of science in the adversarial courtroom, Weinstein embraced the stringent 95% significance threshold as a ready-made admissibility test that could validate the veracity of the statistical evidence used in court. Thus, while he referred to epidemiological studies as ‘the best (if not the sole) available evidence in mass exposure cases,’ he nevertheless refused to accept them in evidence, unless they were statistically significant.64”

Golan at 19.  Weinstein is all that and more, but he never simplistically embraced statistical significance as a “ready-made admissibility test.”  Of course 95% is the coefficient of confidence, and the complement of an alpha of 0.05 (5%), but this alpha is not a particularly stringent threshold unless it is misunderstood as a burden of proof.  Contrary to Golan’s suggestion, Judge Weinstein was not being conservative or restrictive in his approach in In re Agent Orange.

Golan’s “preliminary” history is a good start, but it misses an important perspective.  After World War II, biological science, in the form of genetics, as well as epidemiology and other areas, grew to encompass stochastic processes as well as mechanistic processes.  To a large extent, in permitting judgments to be based upon statistical and epidemiologic evidence, the law was struggling to catch up with developments in science.   There is quite a bit of evidence that the law is still struggling.

Reference Manual on Scientific Evidence (3d edition) on Statistical Significance

July 8th, 2012

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

Berger’s citations in support are curiously inaccurate.  Footnote 62 cites the Cook case:

“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”

The expert witness, Dr. Clapp, in Cook did rely upon his own study, which did not obtain a statistically significant result, but the trial court admitted the expert witness’s testimony; the court denied the Rule 702 challenge to Clapp, and permitted him to testify about a statistically non-significant ecological study.

Footnote 63 is no better:

“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”

With respect to the concept of statistical significance, the Viagra case centered around the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis.  The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies.  When the defendant brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself, and granted the motion to exclude McGwin’s testimony.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  Berger’s characterization of the review is incorrect, and her failure to cite the subsequent procedural history disturbing.

 

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3d ed. 2011).  Although the chapter confuses and conflates the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
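
The arithmetic behind the multiple-testing warning can be sketched in a few lines.  What follows is only an illustration, not anything taken from the Manual: it assumes the tests are independent, that every null hypothesis is true, and that alpha is 0.05; the function names are hypothetical.

```python
import random

def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests

def simulate(num_tests: int, alpha: float = 0.05,
             trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: under a true null, a p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One "study" runs num_tests comparisons; count it if any is significant.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials

for k in (1, 5, 20, 100):
    print(k, round(familywise_error_rate(k), 3), round(simulate(k), 3))
```

On these assumptions, an analyst who runs 20 comparisons has roughly a 64% chance of finding at least one nominally significant result when nothing at all is going on.  Real analyses rarely involve independent tests, so the figure is a rough guide rather than a calculation that applies to any particular study.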

 

Chapter on Multiple Regression

The chapter on regression does not add much to the earlier and later discussions.  The author asks rhetorically what is the appropriate level of statistical significance, and answers:

“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47”

Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.

 

Chapter on Epidemiology

The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573.  The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics. The problem is that “results” is ambiguous as between the data as extreme or more so than what was observed, and the point estimate of the mean or proportion in the sample.  Furthermore, the chapter’s statement here omits reference to the conditional nature of the probability that makes it dependent upon the assumption of correctness of the null hypothesis.

The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary” to mean “I will get reversed.”  The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary.  Some young adults, age 17.8 years old, may be better educated, better engaged in politics, better informed about current events, than 35-year-olds, but the law must set a cutoff.  Two-year-olds are demonstrably unfit, and 82-year-olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or admitting an opinion based upon a study, in which the disparity observed had a p = 0.15.

The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”

Id. at 573 n.68.  Actually this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing.  The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance.  See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The epidemiology chapter also chastises courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct:

“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).”

Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose  as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”

Id. at 577 n.81.  The footnote goes on to explain further the difference between alpha probability and burden of proof probability, but incorrectly asserts that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.”  Id.  The significance probability does not address the probability that the observed statistic is the result of random chance; rather it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true.  Kaye and Freedman’s chapter on statistics does much better at describing and avoiding the transpositional fallacy when describing p-values.
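
A small sketch may help show why transposing the conditional is a fallacy.  The probability of a significant result given the null hypothesis is alpha; the probability of the null hypothesis given a significant result is a different quantity, which cannot be computed without further inputs that a significance test does not supply.  The prior probability and the power used below are purely hypothetical inputs chosen for illustration, and the function name is my own.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """Bayes' rule: P(null true | result significant).

    prior_null and power are hypothetical inputs; a frequentist
    significance test provides neither of them.
    """
    false_positive = alpha * prior_null          # null true, test significant
    true_positive = power * (1.0 - prior_null)   # null false, test significant
    return false_positive / (false_positive + true_positive)

# Same alpha (0.05) and power (0.8), different assumed priors:
for prior in (0.5, 0.9, 0.99):
    print(prior, round(prob_null_given_significant(prior, 0.05, 0.8), 3))
```

With alpha held at 0.05, the transposed probability moves from about 6% to 36% to 86% as the assumed prior changes, which is the point: alpha by itself says nothing about how likely the null hypothesis is to be true.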

When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct.  What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.

The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87

Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”

Id. at 578-79.  Mostly true, but again rather unhelpful to judges and lawyers.  The authors divide the world up into “strict” testers and those critical of “strict” testing.  Where is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all?  I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis.  Meta-analysis in such a situation might not just ameliorate concerns about random error, it might eliminate them.  But what of those critical of strict testing?  This certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).  The chapter’s reference to confidence intervals is correct in part; they permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of magnitude of association, as well as the point estimate of the association obtained from the sample.  Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
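
The meta-analysis point can be made concrete with a minimal sketch of fixed-effect, inverse-variance pooling on the log scale.  The three studies below are invented for illustration (each with a risk ratio of 1.5 and a 95% interval of 0.98 to 2.30, so that each interval just fails to exclude 1.0); they are not drawn from any case or publication, and the sketch assumes the studies estimate the same underlying effect and are free of bias.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal

def log_se_from_ci(lower: float, upper: float) -> float:
    """Standard error of the log ratio, backed out of a 95% interval."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)

def pool_fixed_effect(studies):
    """studies: list of (ratio, lower, upper); returns pooled (ratio, lower, upper)."""
    weights, weighted_logs = [], []
    for ratio, lower, upper in studies:
        w = 1.0 / log_se_from_ci(lower, upper) ** 2  # inverse-variance weight
        weights.append(w)
        weighted_logs.append(w * math.log(ratio))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - Z95 * pooled_se),
            math.exp(pooled_log + Z95 * pooled_se))

# HYPOTHETICAL data: three identical studies, none significant on its own.
HYPOTHETICAL_STUDIES = [(1.5, 0.98, 2.30)] * 3
pooled = pool_fixed_effect(HYPOTHETICAL_STUDIES)
print([round(x, 2) for x in pooled])
```

In this contrived example the pooled interval excludes 1.0 even though no single study does.  Pooling of this kind addresses only random error; it does nothing to cure bias or confounding in the underlying studies.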

In the final analysis, the epidemiology chapter is unclear and imprecise.  I believe it confuses matters more than it clarifies.  There is clearly room for improvement in the Fourth Edition.

Viagra, Part II — MDL Court Sees The Light – Bad Data Trump Nuances of Statistical Inference

July 8th, 2012

In the Viagra vision loss MDL, the first Daubert hearing did not end well for the defense.  Judge Magnuson refused to go beyond conclusory statements by the plaintiffs’ expert witness, Gerald McGwin, and to examine the qualitative and quantitative evaluative errors invoked to support plaintiffs’ health claims.  The weakness of McGwin’s evidence, however, appeared to encourage Judge Magnuson to authorize extensive discovery into McGwin’s study.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1090 (D. Minn. 2008).

The discovery into McGwin’s study had already been underway, with subpoenas to him and to his academic institution.  As it turned out, defendant’s discovery into the data and documents underlying McGwin’s study won the day.  Although Judge Magnuson struggled with inferential statistics, he understood the direct attack on the integrity of McGwin’s data.  Over a year after denying defendant’s Rule 702 motion to exclude Gerald McGwin, the MDL court reconsidered and granted the motion.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).

The basic data on prior exposures and risk factors for the McGwin study was collected by telephone surveys, from which the information was coded into an electronic dataset.  In analyzing the data, McGwin used the electronic dataset and not the survey forms.  Id. at 939.  The transfer from survey forms to electronic dataset did not go smoothly; about 11 patients were miscoded as “exposed” when their use of Viagra post-dated the onset of NAION. Id. at 942.  Furthermore, the published article incorrectly stated personal history of heart attack as a “risk factor”; the survey inquired about family not personal history of heart attack. Id. at 944.

The plaintiffs threw several bombs in response, but without legal effect.  First, the plaintiffs claimed that the study participants had been recontacted and the database had been corrected, but they were unable to document this process or the alleged corrections.  Id. at 433.  Furthermore, the plaintiffs could not explain how, if their contention had been true, McGwin would have not committed serious violations of his university’s institutional review board’s regulations with respect to deviations from the original protocol.  Id. at 943 n.7.

Second, the plaintiffs argued that the underlying survey forms were “inadmissible” and thus the defense could not use them to impeach the McGwin study.  Some might think this a duplicitous argument, utterly at odds with Rule 703 – rely upon a study but prevent use of underlying data and documents to explain that the study does not show what it purports to show.  The MDL court spared the plaintiffs the embarrassment of ruling that the documents on which McGwin had based his study were inadmissible, and found that the forms were business records and admissible under Federal Rule of Evidence 803(6).  The court could have gone further to point out that McGwin’s reliance upon hearsay in the form of his study, McGwin 2006, opened the door to impeaching the hearsay relied upon with other hearsay.  See Rule 806.

When defense counsel sat down with McGwin in a deposition, they found that he had not undertaken any new analyses of corrected data.  Plaintiffs’ counsel directed him not to do so.  Id. at 940-41.  But then after the deposition was over, McGwin submitted a letter to the journal to report a corrected analysis.  Pfizer’s counsel obtained the letter in response to their subpoena to McGwin’s university, the University of Alabama, Birmingham.  Mirabile dictu; now the increased risk appeared limited only to the defendant’s medication, Viagra!

The trial court was not amused.  First, the new analysis was no longer peer reviewed, and the court had placed a great deal of emphasis on peer review in denying the first challenge to McGwin.  Second, the new analysis was no longer that of an independent scientist, but was conducted and submitted as a letter to the editor, while McGwin was working for plaintiffs’ counsel.  Third, the plaintiffs and McGwin conceded that the data were not accurate.  Last, but not least, the trial court clearly was not pleased that the plaintiffs’ counsel had deliberately delayed McGwin’s further analyses until after the deposition, and then tried to submit yet another supplemental report with those further analyses. In sum:

“the Court finds good reason to vacate its original Daubert Order permitting Dr. McGwin to testify as a general causation expert based on the McGwin Study as published. Almost every indicia of reliability the Court relied on in its previous Daubert Order regarding the McGwin Study has been shown now to be unreliable.  Peer review and publication mean little if a study is not based on accurate underlying data. Likewise, the known rate of error is also meaningless if it is based on inaccurate data. Even if the McGwin Study as published was conducted according to generally accepted epidemiologic research and did not result from post-litigation research, the fact that the McGwin Study appears to have been based on data that cannot now be documented or supported renders it inadmissibly unreliable. The Court concludes that under Daubert, Dr. McGwin’s opinion, to the extent that it is based on the McGwin Study as published, lacks sufficient indicia of reliability to be admitted as a general causation opinion.”

Id. at 945-46.  The remaining evidence was the Margo & French study, but McGwin had previously criticized that study as lacking data that ensured that Viagra use preceded onset of NAION.  In the end, McGwin was left with bupkes, and the plaintiffs were left with even less.

*******************

McGwin 2006 Was Also A Pain in the Rear End for McGwin

The Rule 702 motions and hearings on McGwin’s proposed testimony had consequences in the scientific world itself.  In 2011, the British Journal of Ophthalmology retracted McGwin’s 2006 paper.  “Retraction: Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,” 95 Brit. J. Ophthalmol. 595 (2011).

Interestingly, the retraction was reported in the Retraction Watch blog, “Retractile dysfunction? Author says journal yanked paper linking Viagra, Cialis to vision problem after legal threats.”  The blog treated the retraction as routine except for the hint of “legal pressure”:

“One of the authors of the paper, a researcher at the University of Alabama named Gerald McGwin Jr., told us that the journal retracted the article because it had become a tool in a lawsuit involving Pfizer, which makes Viagra, and, presumably, men who’d developed blindness after taking the drug:

‘The article just became too much of a pain in the rear end. It became one of those things where we couldn’t provide all the relevant documentation [to the university, which had to provide records for attorneys].’

Ultimately, however, McGwin said that the BJO pulled the plug on the paper.”

Id. The legal threat is hard to discern other than the fact that lawyers wanted to see something that peer reviewers almost never see – the documentation underlying the published paper.  So now, the study that formed the basis for the original ruling against Pfizer floats aimlessly as a derelict on the sea of science.  McGwin is, however, still at his craft.  In a study he published in 2010, he claimed that Viagra but not Cialis use was associated with hearing impairment.  Gerald McGwin, Jr, “Phosphodiesterase Type 5 Inhibitor Use and Hearing Impairment,” 136 Arch. Otolaryngol. Head & Neck Surgery 488 (2010).

Where are Senator Grassley and Congressman Waxman when you need them?

Love is Blind but What About Judicial Gatekeeping of Expert Witnesses? – Viagra Part I

July 7th, 2012

The Viagra litigation over claimed vision loss vividly illustrates the difficulties that trial judges have in understanding and applying the concept of statistical significance.  In this MDL, plaintiffs sued for a specific form of vision loss, non-arteritic ischemic optic neuropathy (NAION), which they claimed was caused by their use of defendant’s medication, Viagra.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071 (D. Minn. 2008).  Plaintiffs’ key expert witness, Gerald McGwin, considered three epidemiologic studies; none found a statistically significant elevation of risk of NAION after Viagra use.  Id. at 1076. The defense filed a Rule 702 motion to exclude McGwin’s testimony, based in part upon the lack of statistical significance of the risk ratios he relied upon for his causal opinion.  The trial court held that this lack did not render McGwin’s testimony unreliable and inadmissible.  Id. at 1090.

One of the three studies considered by McGwin was his own published paper.  G. McGwin, Jr., M. Vaphiades, T. Hall, C. Owsley, ‘‘Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,’’ 90 Br. J. Ophthalmol. 154 (2006)[“McGwin 2006”].    The MDL court noted that McGwin had stated that his paper reported an odds ratio (OR) of 1.75, with a 95% confidence interval (CI), 0.48 to 6.30.  Id. at 1080.  The study also presented multiple subgroup analyses of men who had reported Viagra use after a history of heart attack (OR = 10.7) or hypertension (OR = 6.9), but the MDL court did not provide p-values or confidence intervals for the subgroup analysis results.

Curiously, Judge Magnuson eschewed the guidance of the Reference Manual on Scientific Evidence, in dealing with statistics of sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.  The new third edition again conveys the same basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The point of the RMSE’s guidance is, of course, that the standard error, or the confidence interval (C.I.) based upon a specified number of standard errors, is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.

The statistical data on which McGwin was basing his opinion was readily available from McGwin 2006:

“Overall, males with NAION were no more likely to report a history of Viagra … use compared to similarly aged controls (odds ratio (OR) 1.75, 95% confidence interval (CI) 0.48 to 6.30).  However, for those with a history of myocardial infarction, a statistically significant association was observed (OR 10.7, 95% CI 1.3 to 95.8). A similar association was observed for those with a history of hypertension though it lacked statistical significance (OR 6.9, 95% CI 0.8 to 63.6).”

McGwin 2006, at 154.  Following the RMSE’s guidance would have assisted the MDL court in its gatekeeping responsibility in several distinct ways.  First, the court would have focused on how wide the 95% confidence intervals were.  The width of the intervals pointed to statistical imprecision and instability in the point estimates urged by McGwin.  Second, the MDL court would have confronted the extent to which there were multiple ad hoc subgroup analyses in McGwin’s paper.  See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002)(“It is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns.”) Third, the court would have confronted the extent to which the study’s validity was undermined by several potent biases.  Statistical significance was the least of the problems faced by McGwin 2006.
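
The imprecision in the reported intervals can be quantified with a rough back-calculation.  The sketch below takes the three odds ratios and 95% confidence intervals quoted from McGwin 2006, and assumes that the intervals are approximately symmetric on the log scale (a Wald-type normal approximation).  The paper may well have used a different method, so the standard errors and p-values that result are approximations of my own, not figures reported by the authors or the court.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def summarize(odds_ratio: float, lower: float, upper: float):
    """Approximate SE of log(OR), z statistic, two-sided p, and upper/lower ratio."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    z = math.log(odds_ratio) / se
    p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z)))
    return se, z, p_two_sided, upper / lower

# Odds ratios and intervals as quoted from McGwin 2006.
REPORTED = {
    "overall": (1.75, 0.48, 6.30),
    "history of myocardial infarction": (10.7, 1.3, 95.8),
    "history of hypertension": (6.9, 0.8, 63.6),
}

for label, (or_, lo, hi) in REPORTED.items():
    se, z, p, width = summarize(or_, lo, hi)
    print(f"{label}: SE(log OR)={se:.2f}, z={z:.2f}, p~{p:.2f}, upper/lower={width:.0f}")
```

On this approximation, the overall result has a two-sided p-value of roughly 0.4, and the upper bound of each subgroup interval is more than seventy times its lower bound.  The approximate p-values track the paper’s own characterization of which results were and were not statistically significant.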

The second study considered and relied upon by McGwin was referred to as Margo & French.  McGwin cited this paper for an “elevated OR of 1.10,” id. at 1081, but again, had the court engaged with the actual evidence, it would have found that McGwin had cherry picked the data he chose to emphasize.  The Margo & French study was a retrospective cohort study using the National Veterans Health Administration’s pharmacy and clinical databases.  C. Margo & D. French, ‘‘Ischemic optic neuropathy in male veterans prescribed phosphodiesterase-5 inhibitors,’’ 143 Am. J. Ophthalmol. 538 (2007).  There were two outcomes ascertained:  NAION and “possible” NAION.  The relative risk of NAION among men prescribed a PDE-5 inhibitor (the class to which Viagra belongs) was 1.02 (95% confidence interval [CI]: 0.92 to 1.12).  In other words, the Margo & French paper had very high statistical precision, and it reported essentially no increased risk at all.  Judge Magnuson cited uncritically McGwin’s endorsement of a risk ratio that included ‘‘possible’’ NAION cases, which could not bode well for a gatekeeping process that is supposed to protect against speculative evidence and conclusions.

McGwin’s citation of Margo & French for the proposition that men who had taken the PDE-5 inhibitors had a 10% increased risk was wrong on several counts.  First, he relied upon an outcome measure that included ‘‘possible’’ cases of NAION.  Second, he completely ignored the sampling error that is captured in the confidence interval.  The MDL court failed to note or acknowledge the p-value or confidence interval for any result in Margo & French. The consideration of random error was not an optional exercise for the expert witness or the court; nor was ignoring it a methodological choice that simply went to the ‘‘disagreement among experts.’’

The Viagra MDL court not only lost its way by ignoring the guidance of the RMSE, it appeared to confuse the magnitude of the associations with the concept of statistical significance.  In the midst of the discussion of statistical significance, the court digressed to address the notion that the small relative risk in Margo & French might mean that no plaintiff could show specific causation, and then in the same paragraph returned to state that ‘‘persuasive authority’’ supported the notion that the lack of statistical significance did not detract from the reliability of a study.  Id. at 1081 (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., MDL No. 1407, 289 F.Supp.2d 1230, 1241 (W.D.Wash. 2003)).  The magnitude of the observed odds ratio is an independent concept from that of whether an odds ratio as extreme or more so would have occurred by chance if there really was no elevation.

Citing one case, at odds with a great many others, however, did not create an epistemic warrant for ignoring the lack of statistical significance.  The entire notion of citing caselaw for the meaning and importance of statistical significance for drawing inferences is wrong-headed.  Even more to the point, the lack of statistical significance in the key study in the PPA litigation did not detract from the reliability of the study, although other features of that study certainly did.  The lack of statistical significance in the PPA study did, however, detract from the reliability of the inference from the study’s estimate of ‘‘effect size’’ to a conclusion of causal association. Indeed, nowhere in the key PPA study did its authors draw a causal conclusion with respect to PPA ingestion and hemorrhagic stroke.  See Walter Kernan, Catherine Viscoli, Lawrence Brass, Joseph Broderick, Thomas Brott, Edward Feldmann, Lewis Morgenstern, Janet Lee Wilterdink, and Ralph Horwitz, ‘‘Phenylpropanolamine and the Risk of Hemorrhagic Stroke,’’ 343 New England J. Med. 1826 (2000).

The MDL court did attempt to distinguish the Eighth Circuit’s decision in Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001), cited by the defense:

‘‘[I]n Glastetter … expert evidence was excluded because ‘rechallenge and dechallenge data’ presented statistically insignificant results and because the data involved conditions ‘quite distinct’ from the conditions at issue in the case. Here, epidemiologic data is at issue and the studies’ conditions are not distinct from the conditions present in the case. The Court does not find Glastetter to be controlling.’’

Id. at 1081 (internal citations omitted; emphasis in original).  This reading of Glastetter, however, misses important features of that case and the Parlodel litigation more generally.  First, the Eighth Circuit commented not only upon the rechallenge-dechallenge data, which involved arterial spasms, but upon an epidemiologic study of stroke, from which Ms. Glastetter suffered.  The Glastetter court did not review the epidemiologic evidence itself, but cited to another court, which did discuss and criticize the study for various ‘‘statistical and conceptual flaws.’’  See Glastetter, 252 F.3d at 992 (citing Siharath v. Sandoz Pharms.Corp., 131 F.Supp. 2d 1347, 1356-59 (N.D.Ga.2001)).  Glastetter was binding authority, and not so easily dismissed and distinguished.

The Viagra MDL court ultimately placed its holding upon the facts that:

‘‘the McGwin et al. and Margo et al. studies were peer-reviewed, published, contain known rates of error, and result from generally accepted epidemiologic research.’’

In re Viagra, 572 F. Supp. 2d at 1081 (citations omitted).  This holding was a judicial ipse dixit substituting for the expert witness’s ipse dixit.  There were no known rates of error for the systematic errors in the McGwin study, and the ‘‘known’’ rates of error for random error in McGwin 2006  were intolerably high.  The MDL court never considered any of the error rates, systematic or random, for the Margo & French study.  The court appeared to have abdicated its gatekeeping responsibility by delegating it to unknown peer reviewers, who never considered whether the studies at issue in isolation or together could support a causal health claim.

With respect to the last of the three studies considered, the Gorkin study, McGwin opined that it was  too small, and the data were not suited to assessing temporal relationship.  Id.  The court did not appear inclined to go beyond McGwin’s ipse dixit.  The Gorkin study was hardly small, in that it was based upon more than 35,000 patient-years of observation in epidemiologic studies and clinical trials, and provided an estimate of incidence for NAION among users of Viagra that was not statistically different from the general U.S. population.  See L. Gorkin, K. Hvidsten, R. Sobel, and R. Siegel, ‘‘Sildenafil citrate use and the incidence of nonarteritic anterior ischemic optic neuropathy,’’ 60 Internat’l J. Clin. Pract. 500, 500 (2006).

Judge Magnuson did proceed, in his 2008 opinion, to exclude all the other expert witnesses put forward by the plaintiffs.  McGwin survived the defendant’s Rule 702 challenge, largely because the court refused to consider the substantial random variability in the point estimates from the studies relied upon by McGwin. There was no consideration of the magnitude of random error, or for that matter, of the systematic error in McGwin’s study.  The MDL court found that the studies upon which McGwin relied had a known and presumably acceptable ‘‘rate of error.’’  In fact, the court did not consider the random or sampling error in any of the three cited studies; it failed to consider the multiple testing and interaction; and it failed to consider the actual and potential biases in the McGwin study.

Some legal commentators have argued that statistical significance should not be a litmus test.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 241 (‘‘Statistical significance should not be a litmus test. However, there are many situations where the lack of significance combined with other aspects of the research should be enough to exclude an expert’s testimony.’’).  While I agree that significance probability should not be evaluated in a mechanical fashion, without consideration of study validity, multiple testing, bias, confounding, and the like, hand waving about litmus tests does not excuse courts or commentators from totally ignoring random variability in studies based upon population sampling.  The dataset in the Viagra litigation was not a close call.

Maryland Puts the Brakes on Each and Every Asbestos Exposure

July 3rd, 2012

Last week, the Maryland Court of Special Appeals reversed a plaintiffs’ verdict in Dixon v. Ford Motor Company, 2012 WL 2483315 (Md. App. June 29, 2012).  Jane Dixon died of pleural mesothelioma.  The plaintiffs, her survivors, claimed that her last illness and death were caused by her household improvement projects, which involved exposure to spackling/joint compound, and by her husband’s work with car parts and brake linings, which involved “take home” exposure on his clothes.  Id. at *1.

All the expert witnesses appeared to agree that mesothelioma is a “dose-response disease,” meaning that the more the exposure, the greater the likelihood that a person exposed will develop the disease. Id. at *2.  Plaintiffs’ expert witness, Dr. Laura Welch, testified that “every exposure to asbestos is a substantial contributing cause and so brake exposure would be a substantial cause even if [Mrs. Dixon] had other exposures.” On cross-examination, Dr. Welch elaborated upon her opinion to explain that any “discrete” exposure would be a contributing factor. Id.

Welch, of course, criticized the entire body of epidemiology of car mechanics and brake repairmen, which generally finds no increased risk of mesothelioma above overall population rates.  With respect to the take-home exposure, Welch had to acknowledge that there were no epidemiologic studies that investigated the risk of wives of brake mechanics.  Welch argued that the studies of car mechanics did not involve exposure to brake shoes as would have been experienced by brake repairmen, but her argument only served to make her attribution based upon take-home exposure to brake linings seem more preposterous.  Id. at *3.  The court recognized that Dr. Welch’s opinion may have been trivially true, but still unhelpful.  Each discrete exposure, even one as attenuated as a take-home exposure from having repaired a single brake shoe, may have “contributed,” but that opinion did not help the jury assess whether the contribution was substantial.

The court sidestepped the issues of fiber type and threshold, and homed in on the agreement that mesothelioma risk showed a dose-response relationship with asbestos exposure.  (There is a sense that the court confused the dose-response concept to mean no threshold.)  The court credited hyperbolic risk assessment figures from the United States Environmental Protection Agency, which suggested that even ambient air exposure to asbestos leads to an increase in mesothelioma risk, but then realized that such claims made the legal need to characterize the risk from the defendant’s product all the more important before the jury could reasonably have concluded that any particular exposure experienced by Ms. Dixon was “a substantial contributing factor.”  Id. at *5.

Having recognized that the best the plaintiffs could offer was a claim of increased risk, and perhaps crude quantification of the relative risks resulting from each product’s exposure, the court could not escape the conclusion that Dr. Welch’s recitation that “every exposure” is substantial was nothing more than an unscientific and empty assertion.  Welch’s claim was either tautologically true or empirical nonsense.  The court also recognized that risk substituting for causation opened the door to essentially probabilistic evidence:

“If risk is our measure of causation, and substantiality is a threshold for risk, then it follows—as intimated above—that ‘substantiality’ is essentially a burden of proof. Moreover, we can explicitly derive the probability of causation from the statistical measure known as ‘relative risk’ … .  For reasons we need not explore in detail, it is not prudent to set a singular minimum ‘relative risk’ value as a legal standard.12 But even if there were some legal threshold, Dr. Welch provided no information that could help the finder of fact to decide whether the elevated risk in this case was ‘substantial’.”

Id. at *7.  The court’s discussion here of “the elevated risk” seems wrong unless we understand it to mean the elevated risk attributable to the particular defendant’s product, in the context of an overall exposure that we accept as having been sufficient to cause the decedent’s mesothelioma.  Despite the lack of any quantification of relative risks in the case, overall or from particular products, and the court’s own admonition against setting a minimum relative risk as a legal standard, the court proceeded to discuss relative risks at length.  For instance, the court criticized Judge Kozinski’s opinion in Daubert, upon remand from the Supreme Court, for not going far enough:

“In other words, the Daubert court held that a plaintiff’s risk of injury must have at least doubled in order to hold that the defendant’s action was ‘more likely than not’ the actual cause of the plaintiff’s injury. The problem with this holding is that relative risk does not behave like a ‘binary’ hypothesis that can be deemed ‘true’ or ‘false’ with some degree of confidence; instead, the uncertainty inherent in any statistical measure means that relative risk does not resolve to a certain probability of specific causation. In order for a study of relative risk to truly fulfill the preponderance standard, it would have to result in 100% confidence that the relative risk exceeds two, which is a statistical impossibility. In short, the Daubert approach to relative risk fails to account for the twin statistical uncertainty inherent in any scientific estimation of causation.”

Id. at *7 n.12 (citing Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320-21 (9th Cir.1995) (holding that a preponderance standard requires causation to be shown by probabilistic evidence of relative risk greater than two) (opinion on remand from Daubert v. Merrell Dow Pharms., 509 U.S. 579 (1993))).  The statistical impossibility derives from the asymptotic nature of the normal distribution, but the court failed to explain why a relative risk of two must be excluded as statistically implausible based upon the sample statistic.  After all, a relative risk greater than two, with a lower bound of a 95% confidence interval above one, based upon an unbiased sampling, suggests that our best evidence is that the population parameter is greater than two, as well.  The court, however, insisted upon stating the relative-risk-greater-than-two rule with a vengeance:

“All of this is not to say, however, that any and all attempts to establish a burden of proof of causation using relative risk will fail. Decisions can be – and in science or medicine are – premised on the lower limit of the relative risk ratio at a requisite confidence level. The point of this minor discussion is that one cannot apply the usual, singular ‘preponderance’ burden to the probability of causation when the only estimate of that probability is statistical relative risk. Instead, a statistical burden of proof of causation must consist of two interdependent parts: a requisite confidence of some minimum relative risk. As we explain in the body of our discussion, the flaws in Dr. Welch’s testimony mean we need not explore this issue any further.”

Id. (emphasis in original).
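The arithmetic behind the doubling-of-risk argument can be sketched briefly.  Under the usual simplifying assumptions (an unbiased, unconfounded relative risk that applies uniformly to the individual plaintiff), the attributable fraction among the exposed is (RR − 1)/RR, which exceeds 50% only when the relative risk exceeds two.  The function name and the example values below are illustrative assumptions of mine, not figures from Dixon or Daubert.

```python
def probability_of_causation(relative_risk: float) -> float:
    """Attributable fraction among the exposed: (RR - 1) / RR.

    Assumes an unbiased relative risk that applies to the individual;
    returns 0.0 when the relative risk shows no elevation.
    """
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    return max(0.0, (relative_risk - 1.0) / relative_risk)


# Illustrative values only (not taken from the record in any case).
for rr in (1.1, 2.0, 4.0):
    pc = probability_of_causation(rr)
    verdict = "more likely than not" if pc > 0.5 else "not more likely than not"
    print(f"RR = {rr:.1f}: attributable fraction = {pc:.1%} ({verdict})")
```

Applying the same function to the lower confidence limit, rather than to the point estimate, is one way to read the court's call for "a requisite confidence of some minimum relative risk."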

And despite having declared the improvidence of addressing the relative risk issue, and then the lack of necessity for addressing the issue given Dr. Welch’s flawed testimony, the court nevertheless tackled the issue once more, a couple of pages later:

“It would be folly to require an expert to testify with absolute certainty that a plaintiff was exposed to a specific dose or suffered a specific risk. Dose and risk fall on a spectrum and are not ‘true or false’. As such, any scientific estimate of those values must be expressed as one or more possible intervals and, for each interval, a corresponding confidence that the true value is within that interval.”

Id. at *9 (emphasis in original; internal citations omitted).  The court captured the frequentist concept of the confidence interval as being defined operationally by repeated samplings and their random variability, but the confidence of the confidence interval means that the specified coefficient represents the percentage of all such intervals that include the “true” value, not the probability that a particular interval, calculated from a given sample, contains the true value.  The true value is either in or not in the interval generated from a single sample risk statistic.  Again, it is unclear why the court was weighing in on this aspect of probabilistic evidence when plaintiffs’ expert witness, Welch, offered no quantitation of the overall risk or of the risk attributable to a specific product exposure.

The court indulged the plaintiffs’ no-threshold fantasy but recognized that the risks of low-level asbestos exposure were low, and likely below a doubling of risk, an issue that the court stressed it wanted to avoid.  The court cited one study that suggested a risk (odds) ratio of 1.1 for exposures less than 0.5 fiber/ml-years.  See id. at *5 (citing Y. Iwatsubo et al., “Pleural mesothelioma: dose-response relation at low levels of asbestos exposure in a French population-based case-control study,” 148 Am. J. Epidemiol. 133 (1998) (estimating an odds ratio of 1.1 for exposures less than 0.5 fibers/ml-years)).  But the court, which tried to be precise elsewhere, appears to have lost its way in citing Iwatsubo here.  After all, how can a single odds ratio of 1.1 describe all exposures from 0 all the way up to 0.5 f/ml-years?  How can a single odds ratio describe all exposures in this range, regardless of fiber type, when chrysotile asbestos carries little to no risk for mesothelioma, and certainly orders of magnitude less risk than amphibole fibers such as amosite and crocidolite?  And if a low-level exposure has a risk ratio of 1.1, how can plaintiffs’ hired expert witness, Welch, even make the attribution of Dixon’s mesothelioma to the entirety of her exposure, let alone the speculative take-home chrysotile exposure involved from Ford’s brake linings?  Obviously, had the court posed these questions, it would have realized that “it is not possible” to permit Welch’s testimony at all.

The court further lost its way in addressing the exculpatory epidemiology put forward by the defense expert witnesses:

“Furthermore, the leading epidemiological report cited by Ford and its amici that specifically studied ‘brake mechanics’, P.A. Hessel et al., ‘Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study’, 24 Risk Analysis 547 (2004), does not at all dispel the notion that this population faced an increased risk of mesothelioma due to their industrial asbestos exposure. … When calculated at the 95% confidence level, Hessel et al. estimated that the odds ratio of mesothelioma could have been as low as 0.01 or as high as 4.71, implying a nearly quintupled risk of mesothelioma among the population of brake mechanics. 24 Risk Analysis at 550–51.”

Id. at *8.  Again, the court is fixated with the confidence interval, to the exclusion of the estimated magnitude of the association!  This time, after earlier shouting that it was the lower bound of the interval that matters scientifically, the court emphasizes the upper bound.  The court here has strayed far from the actual data, and any plausible interpretation of them:

“The odds ratio (OR) for employment in brake installation or repair was 0.71 (95% CI: 0.30-1.60) when controlled for insulation or shipbuilding. When a history of employment in any of the eight occupations with potential asbestos exposure was controlled, the OR was 0.82 (95% CI: 0.36-1.80). ORs did not increase with increasing duration of brake work. Exclusion of those with any of the eight exposures resulted in an OR of 0.62 (95% CI: 0.01-4.71) for occupational brake work.”

P.A. Hessel et al., “Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study,” 24 Risk Analysis 547, 547 (2004).  All of Dr. Hessel’s estimates of effect sizes were below 1.0, and he found no trend for duration of brake work.  Cherry picking out the upper bound of a single subgroup analysis for emphasis was unwarranted, and hardly did justice to the facts or the science.
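A few lines of arithmetic make the point about the Hessel abstract plain.  The sketch below uses only the three odds ratios and intervals quoted above; the variable and function names are my own, and the upper-to-lower ratio is simply a rough gauge of imprecision, not a statistic reported by the authors.

```python
# Odds ratios and 95% confidence intervals as quoted from the abstract above.
hessel_results = [
    ("controlled for insulation or shipbuilding", 0.71, 0.30, 1.60),
    ("controlled for any of eight occupations", 0.82, 0.36, 1.80),
    ("excluding those with any of eight exposures", 0.62, 0.01, 4.71),
]


def includes_null(lower: float, upper: float, null: float = 1.0) -> bool:
    """True when the interval contains the null value (no association)."""
    return lower <= null <= upper


for label, odds_ratio, lower, upper in hessel_results:
    direction = "below" if odds_ratio < 1.0 else "at or above"
    print(
        f"{label}: OR {odds_ratio:.2f} is {direction} 1.0; "
        f"interval {lower:.2f}-{upper:.2f} includes 1.0: "
        f"{includes_null(lower, upper)}; upper/lower = {upper / lower:.0f}"
    )
```

Every point estimate is below 1.0, every interval includes 1.0, and the subgroup interval the court seized upon is far wider than the other two.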

Dr. Welch’s conclusion that the exposure and risk in this case were “substantial” simply was not a scientific conclusion, and without it her testimony did not provide information for the jury to use in reaching its conclusion as to substantial factor causation. Id. at *7.  The court noted that Welch, and the plaintiffs, may have lacked scientific data to provide estimates of Dixon’s exposure to asbestos or relative risk of mesothelioma, but ignorance or uncertainty was hardly the basis to warrant an expert witness’s belief that the relevant exposures and risks are “substantial.” Id. at *10.  The court was well justified in being discomforted by the conclusory, unscientific opinion rendered by Laura Welch.

In the final puzzle of the Dixon case, the court vacated the judgment, and remanded for a new trial, “either without her opinion on substantiality or else with some quantitative testimony that will help the jury fulfill its charge.”  Id. at *10.  The court thus seemed to imply that an expert witness need not utter the magic word, “substantial,” for the case to be submitted to the jury against a brake defendant in a take-home exposure case.  Given the state of the record, the court should have simply reversed and rendered judgment for Ford.

Meta-Meta-Analysis — The Gadolinium MDL — More Than Ix’se Dixit

June 8th, 2012

There is a tendency, for better or worse, for legal bloggers to be partisan cheerleaders over litigation outcomes.  I admit that most often I am dismayed by judicial failures or refusals to exclude dubious plaintiffs’ expert witnesses’ opinion testimony, and I have been known to criticize such decisions.  Indeed, I wouldn’t mind seeing courts exclude dubious defendants’ expert witnesses.  I have written approvingly about cases in which judges have courageously engaged with difficult scientific issues, seen through the smoke screen, and properly assessed the validity of the opinions expressed.  The Gadolinium MDL (No. 1909) Daubert motions and decision offer a fascinating case study of a challenge to an expert witness’s meta-analysis, an effective defense of the meta-analysis, and a judicial decision to admit the testimony, based upon the meta-analysis.  In re Gadolinium-Based Contrast Agents Prods. Liab. Litig., 2010 WL 1796334 (N.D. Ohio May 4, 2010) [hereafter Gadolinium], reconsideration denied, 2010 WL 5173568 (June 18, 2010).

Plaintiffs proffered general causation opinions (between gadolinium contrast media and Nephrogenic Systemic Fibrosis (“NSF”)) by a nephrologist, Joachim H. Ix, M.D., with training in epidemiology.  Dr. Ix’s opinions were based in large part upon a meta-analysis he conducted on data in published observational studies.  Judge Dan Aaron Polster, the MDL judge, itemized the defendant’s challenges to Dr. Ix’s proposed testimony:

“The previously-used procedures GEHC takes issue with are:

(1) the failure to consult with experts about which studies to include;

(2) the failure to independently verify which studies to select for the meta-analysis;

(3) using retrospective and non-randomized studies;

(4) relying on studies with wide confidence intervals; and

(5) using a “more likely than not” standard for causation that would not pass scientific scrutiny.”

Gadolinium at *23.  Judge Polster confidently dispatched these challenges.  Dr. Ix, as a nephrologist, had subject-matter expertise with which to develop inclusionary and exclusionary criteria on his own.  The defendant never articulated what, if any, studies were inappropriately included or excluded.  The complaint that Dr. Ix had used retrospective and non-randomized studies also rang hollow in the absence of any showing that there were randomized clinical trials with pertinent data at hand.  Once a serious concern of nephrotoxicity arose, clinical trials were unethical, and the defendant never explained why observational studies were somehow inappropriate for inclusion in a meta-analysis.

Relying upon studies with wide confidence intervals can be problematic, but that is one of the reasons to conduct a meta-analysis, assuming the model assumptions for the meta-analysis can be verified.  The plaintiffs effectively relied upon a published meta-analysis, which pre-dated their expert witness’s litigation effort, in which the authors used less conservative inclusionary criteria, and reported a statistically significant summary estimate of risk, with an even wider confidence interval.  R. Agarwal, et al., “Gadolinium-based contrast agents and nephrogenic systemic fibrosis: a systematic review and meta-analysis,” 24 Nephrol. Dialysis & Transplantation 856 (2009).  As the plaintiffs noted in their opposition to the challenge to Dr. Ix:

“Furthermore, while GEHC criticizes Dr. Ix’s CI from his meta-analysis as being “wide” at (5.18864 and 25.326) it fails to share with the court that the peer-reviewed Agarwal meta-analysis, reported a wider CI of (10.27–69.44)… .”

Plaintiff’s Opposition to GE Healthcare’s Motion to Exclude the Opinion Testimony of Joachim Ix at 28 (Mar. 12, 2010)[hereafter Opposition].

Wider confidence intervals certainly suggest greater levels of random error, but Dr. Ix’s intervals suggested statistical significance, and he had carefully considered statistical heterogeneity.  Opposition at 19. (Heterogeneity was never advanced by the defense as an attack on Dr. Ix’s meta-analysis).  Remarkably, the defendant never advanced a sensitivity analysis to suggest or to show that reasonable changes to the evidentiary dataset could result in loss of statistical significance, as might be expected from the large intervals.  Rather, the defendant relied upon the fact that Dr. Ix had published other meta-analyses in which the confidence interval was much narrower, and then claimed that he had “required” these narrower confidence intervals for his professional, published research.  Memorandum of Law of GE Healthcare’s Motion to Exclude Certain Testimony of Plaintiffs’ Generic Expert, Joachim H. Ix, MD, MAS, In re Gadolinium MDL No. 1909, Case: 1:08-gd-50000-DAP  Doc #: 668   (Filed Feb. 12, 2010)[hereafter Challenge].  There never was, however, a showing that narrower intervals were required for publication, and the existence of the published Agarwal meta-analysis contradicted the suggestion.
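The two intervals quoted in the Opposition can be compared on the log scale, where intervals for odds ratios are ordinarily constructed.  The sketch below assumes, without confirmation from the record, that both are Wald-type 95% intervals symmetric about the log odds ratio; on that assumption the geometric midpoint recovers the point estimate, and the width yields a standard error and an approximate two-sided p-value.  The function and key names are mine.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def summarize_interval(lower: float, upper: float) -> dict:
    """Back out log-scale quantities from a 95% CI for an odds ratio.

    Assumes a Wald-type interval, symmetric on the log scale.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    log_estimate = (log_lower + log_upper) / 2.0  # geometric midpoint
    std_error = (log_upper - log_lower) / (2.0 * Z_95)
    z = log_estimate / std_error
    # Two-sided tail probability under the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "estimate": math.exp(log_estimate),
        "std_error": std_error,
        "z": z,
        "p_two_sided": p_two_sided,
    }


# Interval limits as quoted in the plaintiffs' Opposition.
ix = summarize_interval(5.18864, 25.326)
agarwal = summarize_interval(10.27, 69.44)

for name, s in (("Ix", ix), ("Agarwal", agarwal)):
    print(
        f"{name}: implied OR {s['estimate']:.2f}, "
        f"SE(log OR) {s['std_error']:.3f}, z {s['z']:.2f}, "
        f"two-sided p {s['p_two_sided']:.2g}"
    )
```

On these assumptions the midpoint of Dr. Ix's interval agrees with the pooled odds ratio of 11.46 quoted by the court, and the Agarwal interval is the wider of the two on the log scale, as the plaintiffs said.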

Interestingly, the defense did not call attention to Dr. Ix’s providing an incorrect definition of the confidence interval!  Here is how Dr. Ix described the confidence interval, in language quoted by plaintiffs in their Opposition:

“The horizontal lines display the “95% confidence interval” around this estimate. This 95% confidence interval reflects the range of odds ratios that would be observed 95 times if the study was repeated 100 times, thus the narrower these confidence intervals, the more precise the estimate.”

Opposition at 20.  The confidence interval does not provide a probability distribution of the parameter of interest; rather the distribution of confidence intervals has a probability of covering the hypothesized “true value” of the parameter.
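The difference between the two readings can be seen in a small simulation.  This is a sketch under stated assumptions, not anything in the record: a normally distributed estimate with known standard deviation, a hypothetical true value of zero, and 10,000 simulated studies.  It counts how often the intervals cover the true value (the correct reading) and how often a repeated study's estimate lands inside the previous study's interval (the reading in Dr. Ix's description).

```python
import random

Z_95 = 1.959964          # two-sided 95% standard normal quantile
TRUE_MEAN = 0.0          # hypothetical "true value" of the parameter
SIGMA = 1.0              # known standard deviation (simplifying assumption)
SAMPLE_SIZE = 25
HALF_WIDTH = Z_95 * SIGMA / SAMPLE_SIZE ** 0.5


def run_study(rng: random.Random) -> float:
    """Return the sample mean from one simulated study."""
    draws = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(SAMPLE_SIZE)]
    return sum(draws) / SAMPLE_SIZE


def simulate(n_studies: int = 10_000, seed: int = 2012):
    rng = random.Random(seed)
    estimates = [run_study(rng) for _ in range(n_studies)]
    # Correct reading: share of intervals that cover the true value.
    coverage = sum(abs(e - TRUE_MEAN) <= HALF_WIDTH for e in estimates) / n_studies
    # Mistaken reading: share of replications whose estimate lands inside
    # the interval computed from the immediately preceding study.
    captured = sum(
        abs(later - earlier) <= HALF_WIDTH
        for earlier, later in zip(estimates, estimates[1:])
    ) / (n_studies - 1)
    return coverage, captured


coverage, captured = simulate()
print(f"intervals covering the true value: {coverage:.1%}")
print(f"replicate estimates inside the prior study's interval: {captured:.1%}")
```

Under this simple model the first figure should sit near 95 percent, while the second should fall well short of it (roughly 83 percent in expectation), because the earlier study's interval is centered on its own noisy estimate rather than on the true value.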

Finally, the defendant never showed any basis for suggesting that a scientific opinion on causation requires something more than a “more likely than not” basis.

Judge Polster also addressed some more serious challenges:

“Defendants contend that Dr. Ix’s testimony should also be excluded because the methodology he utilized for his generic expert report, along with varying from his normal practice, was unreliable. Specifically, Defendants assert that:

(1) Dr. Ix could not identify a source he relied upon to conduct his meta-analysis;

(2) Dr. Ix imputed data into the study;

(3) Dr. Ix failed to consider studies not reporting an association between GBCAs and NSF; and

(4) Dr. Ix ignored confounding factors.”

Gadolinium at *24.

IMPUTATION

The first point, above – the alleged failure to identify a source for conducting the meta-analysis – rings fairly hollow, and Judge Polster easily deflected it.  The second point raised a more interesting challenge.  In the words of defense counsel:

“However, in arriving at this estimate, Dr. Ix imputed, i.e., added, data into four of the five studies.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  Specifically, Dr. Ix added a single case of NSF without antecedent GBCA exposure to the patient data in the underlying studies.

* * *

During his deposition, Dr. Ix could not provide any authority for his decision to impute the additional data into his litigation meta-analysis.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  When pressed for any authority supporting his decision, Dr. Ix quipped that ‘this may be a good question to ask a Ph.D level biostatistician about whether there are methods to [calculate an odds ratio] without imputing a case [of NSF without antecedent GBCA exposure]’.”

Challenge at 12-13.

The deposition reference suggests that the examiner had scored a debating point by catching Dr. Ix unprepared, but by the time the parties briefed the challenge, the plaintiffs had the issue well in hand, citing A. W. F. Edwards, “The Measure of Association in a 2 × 2 Table,” 126 J. Royal Stat. Soc. Series A 109 (1963); R.L. Plackett, “The Continuity Correction in 2 x 2 Tables,” 51 Biometrika 327 (1964).  Opposition at 36 (describing the process of imputation in the event of zero counts in the cells of a 2 x 2 table for odds ratios).  There are qualms to be stated about imputation, but the defense failed to make them.  As a result, the challenge overall lost momentum and credibility.  As the trial court stated the matter:

“Next, there is no dispute that Dr. Ix imputed data into his meta-analysis. However, as Defendants acknowledge, there are valid scientific reasons to impute data into a study. Here, Dr. Ix had a valid basis for imputing data. As explained by Plaintiffs, Dr. Ix’s imputed data is an acceptable technique for avoiding the calculation of an infinite odds ratio that does not accurately measure association.7 Moreover, Dr. Ix chose the most conservative of the widely accepted approaches for imputing data.8 Therefore, Dr. Ix’s decision to impute data does not call into question the reliability of his meta-analysis.”

Gadolinium at *24.
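The mechanics of the imputation are simple enough to sketch.  The counts below are hypothetical, invented solely for illustration, and are not drawn from any study in Dr. Ix's meta-analysis; the comparison with the add-0.5-to-every-cell correction is likewise my own, since the record quoted here does not say which alternatives were considered.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio; returned as infinite when the
    denominator (unexposed cases x exposed controls) is zero."""
    denominator = unexposed_cases * exposed_controls
    if denominator == 0:
        return math.inf
    return (exposed_cases * unexposed_controls) / denominator


# Hypothetical counts, invented for illustration only.
exposed_cases, unexposed_cases = 12, 0
exposed_controls, unexposed_controls = 88, 200

raw = odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)

# Approach described in the defense brief: add a single unexposed case.
one_case = odds_ratio(
    exposed_cases, unexposed_cases + 1, exposed_controls, unexposed_controls
)

# A common alternative (an assumption here): add 0.5 to every cell.
half_cell = odds_ratio(
    exposed_cases + 0.5,
    unexposed_cases + 0.5,
    exposed_controls + 0.5,
    unexposed_controls + 0.5,
)

print(f"no imputation:          {raw}")
print(f"one unexposed case:     {one_case:.2f}")
print(f"0.5 added to each cell: {half_cell:.2f}")
```

With these made-up counts, adding a whole unexposed case pulls the odds ratio closer to 1.0 than the half-cell correction does, which is the sense in which the former is the more conservative choice.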

FAILURE TO CONSIDER NULL STUDIES

The defense’s challenge included a claim that Dr. Ix had arbitrarily excluded studies in which there was no reported incidence of NSF. The defense brief unfortunately does not describe the studies excluded, and what, if any, effect their inclusion in the meta-analysis would have had.  This was, after all, the crucial issue. The abstract nature of the defense claim left the matter ripe for misrepresentation by the plaintiffs:

“GEHC continues to misunderstand the role of a meta-analysis and the need for studies that included patients both that did or did not receive GBCAs and reported on the incidence of NSF, despite Dr. Ix’s clear elucidation during his deposition. (Ix Depo. TR [Exh.1] at 97-98).  Meta-analyses such as performed by Dr. Ix and Dr. Agarwal search for whether or not there is a statistically valid association between exposure and disease event. In order to ascertain the relationship between the exposure and event one must have an event to evaluate. In other words, if you have a study in which the exposed group consists of 10,000 people that are exposed to GBCAs and none develop NSF, compared to a non-exposed group of 10,000 who were not exposed to GBCAs and did not develop NSF, the study provides no information about the association between GBCAs and NSF or the relative risk of developing NSF.”

Opposition at 37 – 38 (emphasis in original).  What is fascinating about this particular challenge, and the plaintiffs’ response, is the methodological hypocrisy exhibited.  In essence, the plaintiffs argued that imputation was appropriate in a case-control study, in which one cell contained a zero, but they would ignore a great deal of data from a cohort study in which no events occurred.  To be sure, case-control studies are more efficient than cohort studies for identifying and assessing risk ratios for rare outcomes.  Nevertheless, the plaintiffs could easily have been hoist with their own hypothetical petard.  No one in 10,000 gadolinium-exposed patients developed NSF; and no one in a control group did either.  The hypothetical study suggests that the rate of NSF is low and not different in the exposed and in the unexposed patients.  The risk ratio could be obtained by imputing an integer for the cells containing zero, and a confidence interval calculated.  The risk ratio, of course, would be 1.0.
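The calculation for the plaintiffs' own hypothetical takes only a few lines.  The sketch below adds a chosen constant to every cell of the 2 x 2 table (1, following the suggestion of imputing an integer; 0.5 is the more common convention) and uses the standard large-sample standard error for the log risk ratio.  The function name and the choice of a Wald interval are my assumptions.

```python
import math

Z_95 = 1.959964  # two-sided 95% standard normal quantile


def risk_ratio_with_correction(
    events_exposed, n_exposed, events_unexposed, n_unexposed, correction=1.0
):
    """Risk ratio and Wald 95% CI after adding `correction` to every cell."""
    a = events_exposed + correction
    c = events_unexposed + correction
    # Event and non-event cells are both incremented, so each row total
    # grows by twice the correction.
    n1 = n_exposed + 2 * correction
    n0 = n_unexposed + 2 * correction
    rr = (a / n1) / (c / n0)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    lower = math.exp(math.log(rr) - Z_95 * se_log_rr)
    upper = math.exp(math.log(rr) + Z_95 * se_log_rr)
    return rr, lower, upper


# The plaintiffs' hypothetical: 10,000 exposed and 10,000 unexposed, no NSF.
rr, lower, upper = risk_ratio_with_correction(0, 10_000, 0, 10_000)
print(f"risk ratio = {rr:.2f}, 95% CI {lower:.3f} to {upper:.1f}")
```

The point estimate is 1.0 whatever constant is imputed; the interval is very wide, and symmetric about 1.0 on the log scale.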

Unfortunately, the defense did not make this argument; nor did it explore where the meta-analysis might have come out had a more even-handed methodology been taken by Dr. Ix.  The gap allowed the trial court to brush the challenge aside:

“The failure to consider studies not reporting an association between GBCAs and NSF also does not render Dr. Ix’s meta-analysis unreliable. The purpose of Dr. Ix’s meta-analysis was to study the strength of the association between an exposure (receiving GBCA) and an outcome (development of NSF). In order to properly do this, Dr. Ix necessarily needed to examine studies where the exposed group developed NSF.”

Gadolinium at *24.  Judge Polster, with no help from the defense brief, missed the irony of Dr. Ix’s willingness to impute data in the case-control 2 x 2 contingency tables, but not in the relative risk tables.

CONFOUNDING

Defendants complained that Dr. Ix had ignored the possibility that confounding factors had contributed to the development of NSF.  Challenge at 13.  Defendants went so far as to charge Dr. Ix with misleading the court by failing to consider other possible causative exposures or conditions.  Id.

Defendants never identified the existence, source, and likely magnitude of confounding factors.  As a result, the plaintiffs’ argument, based in the Reference Manual, that confounding was an unlikely explanation for a very large risk ratio was enthusiastically embraced by the trial court, virtually verbatim from the plaintiffs’ Opposition (at 14):

“Finally, the Court rejects Defendants’ argument that Dr. Ix failed to consider confounding factors. Plaintiffs argued and Defendants did not dispute that, applying the Bradford Hill criteria, Dr. Ix calculated a pooled odds ratio of 11.46 for the five studies examined, which is higher than the 10 to 1 odds ratio of smoking and lung cancer that the Reference Manual on Scientific Evidence deemed to be “so high that it is extremely difficult to imagine any bias or confounding factor that may account for it.” Id. at 376.  Thus, from Dr. Ix’s perspective, the odds ratio was so high that a confounding factor was improbable. Additionally, in his deposition, Dr. Ix acknowledged that the cofactors that have been suggested are difficult to confirm and therefore he did not try to specifically quantify them. (Doc # : 772-20, at 27.) This acknowledgement of cofactors is essentially equivalent to the Agarwal article’s representation that “[t]here may have been unmeasured variables in the studies confounding the relationship between GBCAs and NSF,” cited by Defendants as a representative model for properly considering confounding factors. (See Doc # : 772, at 4-5.)”

Gadolinium at *24.

The real problem is that the defendants’ challenge pointed only to possible, unidentified causal agents.  The smoking/lung cancer analogy, provided by the Reference Manual, was inapposite.  Smoking is indeed a large risk factor for lung cancer, with relative risks over 20.  Although there are other human lung carcinogens, none is consistently in the same order of magnitude (not even asbestos), and as a result, confounding can generally be excluded as an explanation for the large risk ratios seen in smoking studies.  It would be easy to imagine that there are confounders for NSF, especially given that it has only recently been identified, and that they might be of the same or greater magnitude as that suggested for the gadolinium contrast media.  The defense, however, failed to identify confounders that actually threatened the validity of any of the individual studies, or of the meta-analysis.

CONCLUSION

The defense hinted at the general unreliability of meta-analysis, with references to the Reference Manual on Scientific Evidence at 381 (2d ed. 2000)(noting problems with meta-analysis), and other, relatively dated papers.  See, e.g., John Bailar, “Assessing Assessments,” 277 Science 529 (1997)(arguing that “problems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with [meta-analysis].”).  The Reference Manual language, carried over into the third edition, is out of date, and represents a failing of the new edition.  See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011).

The plaintiffs came forward with some descriptive statistics of the prevalence of meta-analysis in contemporary biomedical literature.  The defendants gave mostly argument; there is a dearth of citation to defense expert witnesses, affidavits, consensus papers on meta-analysis, textbooks, papers by leading authors, and the like.  The defense challenge suffered from being diffuse and unfocused; it lost persuasiveness by including weak, collateral issues such as claiming that Dr. Ix was opining “only” on a “more likely than not” basis, and that he had not consulted with other experts, and that he had failed to use randomized trial data.  The defense was quick to attack perceived deficiencies, but it did not illustrate how or why the alleged deficiencies threatened the validity of Dr. Ix’s meta-analysis.  Indeed, even when the defense made strong points, such as the exclusion of zero-event cohort studies, it failed to document that such studies existed, and that their inclusion might have made a difference.

 

On the Importance of Showing Relative Risks Greater Than Two – Haack’s Arguments

May 23rd, 2012

Professor Susan Haack has set out, repeatedly, to criticize the judicial requirement of relative risks greater than two to support findings that exposure to a substance, process, or medication was a specific cause of a plaintiff’s injury.  If for no other reason than the frequency with which Haack has published on this same issue, her views are worth examining more closely.

Haack’s argument, typically, proceeds along the lines that requiring a relative risk greater than two (RR > 2) is improper because a RR > 2 is neither necessary nor sufficient for finding specific causation.  See, e.g., Susan Haack, “Warrant, Causation, and the Atomism of Evidence Law,” 5 Episteme 253, 261 (2008)[hereafter “Warrant“];  “Proving Causation: The Holism of Warrant and the Atomism of Daubert,” 4 J. Health & Biomedical Law 273, 304 (2008)[hereafter “Proving Causation“].

Unlike the more sophisticated reasons offered by Professor Sander Greenland, Professor Haack’s reasoning fails to understand both the law and the science.
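The arithmetic behind the RR > 2 threshold can be sketched briefly.  The Python snippet below is an illustration only, not anything drawn from Haack or from the case law.  It assumes the textbook attributable fraction among the exposed, (RR - 1)/RR, and it further assumes that the relative risk is free of bias and confounding and applies uniformly to the plaintiff.  The function name is hypothetical.

```python
def attributable_fraction(rr: float) -> float:
    """Attributable fraction among the exposed, (RR - 1) / RR.

    Assumption (not from the post): the relative risk is unbiased,
    unconfounded, and applies uniformly to the individual plaintiff.
    """
    if rr <= 0:
        raise ValueError("relative risk must be positive")
    # With RR <= 1 there is no excess risk to attribute to the exposure.
    return max(0.0, (rr - 1.0) / rr)


# Hypothetical relative risks, chosen only to bracket the threshold.
for rr in (1.2, 1.5, 2.0, 3.0, 10.0):
    print(f"RR = {rr:>4}: attributable fraction = {attributable_fraction(rr):.3f}")
```

On these assumptions, a relative risk of exactly 2 corresponds to an attributable fraction of one half, which is why a RR > 2 is tied to the “more likely than not” standard.  Whether the assumptions hold in a given case is a separate question.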

Haack:  RR > 2 Not Sufficient

Haack argues that RR > 2 is not sufficient for two reasons:

“Epidemiological evidence of a doubling of risk is not sufficient for specific causation: first, because if the study showing a doubling of risk is poorly-designed or poorly-executed, we would have only a low epistemological likelihood of a greater than 50% statistical probability; and second, because even a well-designed and well-conducted study might also show that those subjects who develop D [some claimed causally related disease] when exposed to S [some substance] have some characteristic in common – older patients rather than younger, perhaps, or women rather than men, or the sedentary rather than the active – and our plaintiff might be an elderly, sedentary female.”

Proving Causation at 304 (emphasis added).

The first argument is largely irrelevant to the legal context in which the RR > 2 rationale arises.  Typically, plaintiffs assert general and specific causation on the basis of a complex evidentiary display.  This display includes evidence of an epidemiologic association, but the magnitude of the association is weak, with RR > 1, but < 2.  Thus the defendants challenge the attributability in the plaintiff’s individual case.  The overall evidentiary display may or may not support general causation, but even if general causation were conceded, specific causation would remain as an independent factual issue.  Haack’s first “argument” is that the RR > 2 argument is insufficient because the study with RR > 2 may lack internal validity on grounds that it was poorly designed, poorly conducted, or poorly analyzed.  True, true, but immaterial.  On motions for summary judgment or directed verdict, the trial court would resolve any factual issues about disputed validity in favor of the non-moving party.  The defense may have better studies that show the RR = 1, but these would not factor in the decision to grant or refuse the motion.  (If the defense can show that the plaintiffs’ studies with RR > 2 are fatally flawed, then the plaintiffs might be relegated to their studies with lower risk.)

Haack’s second reason appears to go to external validity.  She suggests that a study at issue may be in a population that shares key risk factors with the plaintiff.  Why this similarity would suggest that RR > 2 is not sufficient is quite mysterious.  External validity would support the applicability of the study, with its RR > 2, not militate against its sufficiency.  If the “characteristic in common” is the basis for an interaction with the exposure to S, then we would expect that to be shown by the data in the study; it would not, and should not, be a matter of conjecture or speculation.

Haack:  RR > 2 Not Necessary

Similarly, Haack argues that RR > 2 is not necessary for two reasons:

“And epidemiological evidence of a doubling of risk is not necessary for specific causation, either: first, because studies that fail to show a doubling of risk may be flawed – for example, by failing to take account of the period of pregnancy in which subjects are exposed to S, or by failing to take account of the fact that subjects are included who may have been exposed to S in cold medication or sleep-aids; 99 and second, because even a good epidemiological study indicating to a high degree of epistemic likelihood that there is a doubling of risk may also indicate that those subjects who develop D have some characteristic (such as being over 50 or sedentary or subject to allergies or whatever) that this plaintiff lacks.100”

Proving Causation at 304 (emphasis added).

Again, Haack’s reasoning is nothing other than an invitation to speculate.  Sure, studies with RR < 2 may be flawed, but the existence of flaws in the studies is hardly a warrant for the true RR > 2.  The evidence is the thing; and she is quick to point out elsewhere:  absence of evidence is not evidence of absence.  And so a flawed study is not particularly probative of anything; it cannot be made into affirmative evidence of the opposite result by the existence of a flaw.  Haack seems to be suggesting that the studies at issue, with RR < 2, may be biased low by misclassification or other systematic bias.  Again, true, true, and immaterial.  An epidemiologic study may suffer bias (or not), but if it does, the usual path is to conduct the study again without the previous bias.  Sometimes the data may be re-analyzed, and the march of progress is in the direction of having underlying data accessible to permit some degree of re-analysis.  In any event, cases with RR < 2, or RR = 2, are not transformed into cases of RR > 2, solely by hand waving or speculation over the existence of potential bias.  The existence and direction of the bias remain something that must be shown by competent evidence.

As for the second argument, again, Haack invokes external invalidity as a possible reason that a RR > 2 does not necessarily require a finding for plaintiff.  The plaintiff may be sufficiently different from study participants such that the RR > 2 is not relevant.  This argument hardly undermines a requirement for a RR > 2, based upon a relevant study.

These arguments are repeated virtually verbatim in Warrant, where Haack asserts for the same reasons that a RR > 2 is neither necessary nor sufficient for showing specific causation.  Warrant at 261.

In an unpublished paper, which Haack has presented several times over the last few years, she has criticized the RR > 2 argument as an example of flawed “probabilism” in the law.  Susan Haack, “Risky Business:  Statistical Proof of Individual Causation,” in Jordi Ferrer Beltrán, ed., Casuación y atribución de responsibilidad (Madrid: Marcial Pons, forthcoming)[hereafter Risky Business]; Presentation at the Hastings Law School (Jan. 20, 2012);  Presentation at University of Girona (May 24, 2011).

While there is some merit to Haack’s criticisms of probabilism, they miss the important point, which is that sometimes probabilistic inference is all there is.  Haack cites the New Jersey Supreme Court’s decision in Landrigan as supporting her notion that “other evidence,” presumably particularistic, plaintiff-specific evidence, plus a RR < 2 will suffice:

“The following year (1992), in Landrigan, the Supreme Court of New Jersey briskly observed that ‘a relative risk of 2.0 is not so much a password to a finding of causation as one piece of evidence among others’.”

Risky Business at 22 (citing and quoting Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992)).

Haack, however, advances a common, but mistaken reading of Landrigan, where the Court blurred the distinction between sufficiency and admissibility of expert witness opinion on specific causation.  Landrigan, and another case, Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992), were both tried to juries, about the same time, in different counties in New Jersey.  (My former partner Terri Keeley tried Landrigan; I tried Caterinicchio.)  There was no motion to exclude expert witness testimony in either case; nor was there a motion for summary judgment ever lodged pre-trial.  Both cases involved motions for directed verdict, in which the defense invited the trial courts to accept the plaintiffs’ expert witnesses’ opinions, arguendo, and to focus on the inference of specific causation, which was drawn upon the assertion that both Mr. Landrigan and Mr. Caterinicchio had an increased risk of colorectal cancer as a result of their occupational asbestos exposure.  Admissibility was never in issue.

There were no valid biomarkers, no “fingerprints” of causation; no evidence of either plaintiff’s individual, special vulnerability.  The plaintiffs had put in their cases and rested; the trial courts were required to assume that the facts were as presented by the plaintiffs.  All the plaintiffs had offered, however, of any possible relevance, was a relative risk statistic. The trial courts in both cases granted the directed verdicts, and separate panels of the New Jersey Appellate Division affirmed.  Riding roughshod over the evidence, the New Jersey Supreme Court granted certification in both cases, and reversed and remanded for new trials.

Haack does an admirable job of echoing the speculation advanced by plaintiffs on appeal, in both Landrigan and Caterinicchio.  She speculates that the plaintiffs may have had greater than average exposure, or that they were somehow more vulnerable than the average exposed person in the relevant studies.

To paraphrase a Rumsfeldian bon mot:  The litigants must go to trial with the evidence that they have.

Both cases were remanded for new trials.  What is often not reported or discussed in connection with these two cases is that plaintiffs’ counsel dismissed Landrigan before proceeding with a new trial.  Caterinicchio was indeed retried — to a defense verdict.

Haack Attack on Legal Probabilism

May 6th, 2012

Last year, Professor Susan Haack presented a lecture on “legal probabilism,” at a conference on Standards of Proof and Scientific Evidence, held at the University of Girona, in Spain.  The lecture can be viewed on-line, and a manuscript of Haack’s paper is available, as well.  Susan Haack, “Legal Probabilism:  An Epistemological Dissent” (2011)(cited here as “Haack”).   Professor Haack has franked her paper as a draft, with an admonition “do not cite without permission,” an imperative that has no moral or legal force.  Her imperative certainly has no epistemic warrant.  We will ignore it.

As I have noted previously, here and there, Professor Haack is a Professor of philosophy and of law, at the University of Miami, Florida.  She has written widely on the philosophy of science, in the spirit of Peirce’s pragmatism.  Despite her frequent untutored judgments about legal matters, much of what she has written is a useful corrective to formalistic writings on “the scientific method,” and is worthy of study by lawyers interested in the intersection of science and the law.

The video of Professor Haack’s presentation is worth watching to get an idea of how ad hominem her style is.  I won’t repeat her aspersions and pejorative comments here.  They are not in her paper, and I will take her paper, which she posted online, as the expression of her mature thinking.

Invoking Lord Russell and Richard von Mises, Haack criticizes the reduction of epistemology to a calculus of probability.  Russell, for instance, cautioned against confusing the credibility of a claim with the probability that the claim is true:

“[I]t is clear that some things are almost certain, while others are matters of hazardous conjecture. For a rational man, there is a scale of doubtfulness, from simple logical and arithmetical propositions and perceptive judgments, at one end, to such questions as what language the Myceneans spoke or “what song the Sirens sang” at the other … [T]he rational man, who attaches to each proposition the right degree of credibility, will be guided by the mathematical theory of probability when it is applicable. … The concept ‘degree of credibility’, however, is applicable much more widely than that of mathematical probability.”

Bertrand Russell, Human Knowledge, Its Scope and Limits 381 (N.Y. 1948)(quoted in Haack, supra, at 1).   Haack argues that ordinary language is beguiling.  We use “probably” to hedge our commitment to the truth of a prediction or a proposition of fact.  We insert the adverb “probably” to recognize that our statement might turn out false, although we have no idea of how likely, and no way of quantifying the probability of error.  Thus,

“[w]e commonly use the language of probability or likelihood when we talk about the credibility or warrant of a claim-about how likely is it, given this evidence, that the claim is true, or, unconditionally, about how probable the claim is.”

Haack at 14.

Epistemology is the “thing,” and psychology is not.  Haack admits that legal language is inconsistent:  sometimes the law appears to embrace psychological states of mind as relevant criteria for decisions; sometimes the law is expressly looking at epistemic warrant for the truth of a claim.  Flipping the philosophical bird to Derrida and Feyerabend, Haack argues that trials are searches for the truth, and that our notions of substantial justice require replacement of psychological standards of proof, to the extent that they are merely subjective and non-epistemic, with a clear theory of epistemic warrant.  Haack at 6 (citing Tehan v. United States, 383 U.S. 406, 416 (1966)(“the purpose of a trial is to determine the truth”)); id. at 7 (citing In re Winship, 397 U.S. 358, 368, 370 (1970) (Harlan, J., concurring)(the standard of proof is meant to “instruct the factfinder concerning the degree of confidence our society thinks he should have in the correctness of factual conclusions for a particular type of adjudication”)).

Haack points out that there are instances where evidence seems to matter more than subjective state of mind, although the law sometimes equivocates.  She cautions us that “we shouldn’t simply assume, just because the word ‘probable’ or ‘probability’ occurs in legal contexts, that we are dealing with mathematical, rather than epistemological, probabilities.”  Haack at 16 (citing and quoting Thomas Starkie, et al., A Practical Treatise of the Law of Evidence and Digest of Proofs in Civil and Criminal Proceedings vol. I, 579 (Philadelphia 1842)(“That … moral probabilities … could ever be represented by numbers … and thus be subject to numerical analysis,” … “cannot but be regarded as visionary and chimerical.”)).  Thus the criminal standard, “beyond a reasonable doubt,” seems to be about state of mind, but it is described, at least some of the time, as about the quality and strength of the evidence needed to attain such a state of mind.  The standards of “preponderance of the evidence” and “clear and convincing evidence,” on the other hand, appear to be directly related to the strength of the evidentiary display offered by the party with the burden of proof.

An example that Haack might have used, but did not, is the requirement that an expert witness express an opinion to a “reasonable degree of medical or scientific certainty.”  The law is not particularly concerned about the psychological state of certainty possessed by the witness:  the witness may be a dogmatist with absolute certainty but no epistemic warrant; and that simply will not do.

Of course, the preponderance standard is alternatively expressed as the burden to show the disputed fact is “more likely than not” correct, and that brings us back to explicit probabilisms in the law.  Haack’s argument would be bolstered by acknowledging the work of Professor Kahneman, who makes the interesting point, at several places, that experts, or for that matter anyone making decisions, are not necessarily expert at determining their level of certainty.  Can someone really say that he believes one set of claims has been shown to be 50.1% true, and have an intelligent discussion with another person, who adamantly believes that the claims have been shown to be 49.9% true?  Do they resolve their differences by splitting the difference?  Unless we are dealing with an explicit set of frequencies or proportions, the language of probability is metaphorical.

Haack appropriates the term warrant for her epistemological theory, but the use seems much older and not novel with Haack.  In any event, Haack sets out her theory of “warrants”:

“(i) How supportive the evidence is; analogue: how well a crossword entry fits with the clue and intersecting completed entries. Evidence may be supportive (positive, favorable), undermining (negative, unfavorable), or neutral (irrelevant) with respect to some conclusion.

(ii) How secure the reasons are, independent of the claim in question; analogue:  how reasonable the completed intersecting entries are, independent of the entry in question. The better the independent security of positive reasons, the more warranted the conclusion, but the better the independent security of negative reasons, the less warranted the conclusion.

(iii) How comprehensive the evidence is, i.e., how much of the relevant evidence it includes; analogue: how much of the crossword has been completed. More comprehensive evidence gives more warrant to a conclusion than less comprehensive evidence does iff the additional evidence is at least as favorable as the rest.”

Haack at 18 (internal citation omitted).  According to Haack, the calculus of probabilities does not help in computing degrees of epistemic warrant.  Id. at 20. Her reasons are noteworthy:

  • “since quality of evidence has several distinct dimensions (supportiveness, independent security, comprehensiveness), and there is no way to rank relative success and failure across these different factors, there is no guarantee even of a linear ordering of degrees of warrant;
  • while the probability of p and the probability of not-p must add up to 1, when there is no evidence, or only very weak evidence, either way, neither p nor not-p may be warranted to any degree; and
  • while the probability of p and q (for independent p and q) is the product of the two, and hence, unless both are 1, less than the probability of either, the warrant of a conjunction may be higher than the warrant of its components”

Id. at 20-21.  The third bullet appears to have been a misfire.  If we were to use Bayes’ theorem, the two pieces of evidence would require sequential adjustments to our posterior odds or probability; we would not multiply the two probabilities directly.
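A minimal sketch of what sequential updating looks like, in Python, may help.  The numbers are made up for illustration, the function names are hypothetical, and the sketch assumes that the two pieces of evidence are conditionally independent given the hypothesis, so that each contributes its own likelihood ratio to the odds.

```python
def to_odds(probability: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    return probability / (1.0 - probability)


def to_probability(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def update_sequentially(prior: float, likelihood_ratios: list[float]) -> float:
    """Apply Bayes' theorem in odds form, one piece of evidence at a time.

    Assumption: the pieces of evidence are conditionally independent
    given the hypothesis, so their likelihood ratios simply multiply.
    """
    odds = to_odds(prior)
    for lr in likelihood_ratios:
        odds *= lr  # posterior odds = prior odds x likelihood ratio
    return to_probability(odds)


# Hypothetical inputs: an even prior and two supportive pieces of evidence.
prior = 0.5
after_first = update_sequentially(prior, [3.0])
after_both = update_sequentially(prior, [3.0, 4.0])
print(f"after first piece: {after_first:.3f}; after both: {after_both:.3f}")
```

With supportive evidence (likelihood ratios greater than one), the posterior after both pieces is higher than after either alone.  What gets multiplied are likelihood ratios applied to the odds, not the probabilities of the two pieces of evidence.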

Haack’s attack on legal probabilism blinds her to the reality that sometimes all there is in a legal case is probabilistic evidence.  For instance, in the litigation over claims that asbestos causes colorectal cancer, plaintiffs had only a relative risk statistic to support their desired inference that asbestos had caused their colorectal cancers.  There was no other evidence.  (On general causation, the animal studies failed to find colorectal cancer from asbestos ingestion, and the “weight of evidence” was against an association in any event.)  Nonetheless, Haack cites one case as a triumph of her anti-probabilistic viewpoint:

“Here I am deliberately echoing the words of the Supreme Court of New Jersey in Landrigan, rejecting the idea that epidemiological evidence of a doubling of risk is sufficient to establish specific causation in a toxic-tort case: ‘a relative risk of 2.0 is not so much a password to a finding of causation as one piece of evidence among many’.114 This gets the key epistemological point right.”

Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992).  Well, not really.  Had Haack read the Landrigan decision, including the lower courts’ opinions, she would be aware that there were no other pieces of evidence.  There were no biomarkers, no “fingerprints” of causation; no evidence of Mr. Landrigan’s individual, special vulnerability.  The case went up to the New Jersey Supreme Court, along with a companion case, as a result of directed verdicts.  Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992). The plaintiffs had put in their cases and rested; the trial courts were required to assume that the facts were as presented by the plaintiffs.  All the plaintiffs had offered, however, of any possible relevance, was a relative risk statistic.

Haack’s fervent anti-probabilism obscures the utility of probability concepts, especially when probabilities are all we have.   In another jarring example, Haack seems to equate any use of Bayes’ theorem, or any legal analysis that invokes an assessment of probability, with misguided “legal probabilism.”  For instance, Haack writes:

“Mr. Raymond Easton was arrested for a robbery on the basis of a DNA “cold hit”; statistically, the probability was very low that the match between Mr. Easton’s DNA (on file after an arrest for domestic violence) and DNA found at the crime scene was random. But Mr. Easton, who suffered from Parkinson’s disease, was too weak to dress himself or walk more than a few yards-let alone to drive to the crime scene, or to commit the crime.”

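Haack at 37 (internal citation omitted).  Bayes’ Theorem requires the inclusion of a base rate, or prior probability, in the complete analysis, and that requirement supplies the answer to Haack’s mistaken criticism about DNA cold hits.  Evidence that Mr. Easton was physically incapable of committing the crime belongs in the prior probability, where it can overwhelm even a very small random match probability.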
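The role of the base rate can be illustrated with a short Python sketch.  Every number below is hypothetical; none is taken from the Easton case or from Haack’s paper.  The sketch also assumes, for simplicity, that the true source of the crime-scene sample would match with certainty, and the function name is mine.

```python
def posterior_source_probability(prior: float, random_match_prob: float) -> float:
    """Posterior probability that the matching person is the source.

    Assumptions (illustrative only): the true source matches with
    probability 1, and a non-source matches with probability equal
    to the random match probability.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if not 0.0 < random_match_prob <= 1.0:
        raise ValueError("random match probability must be in (0, 1]")
    numerator = prior * 1.0
    denominator = numerator + (1.0 - prior) * random_match_prob
    return numerator / denominator


# Hypothetical random match probability of one in a million.
rmp = 1e-6

# Hypothetical priors: strong independent suspicion, one person among
# ten million candidates, and a prior cut further by other evidence
# (for instance, physical incapacity).
for label, prior in [("strong suspect", 1e-3),
                     ("cold hit", 1e-7),
                     ("cold hit, incapacitated", 1e-9)]:
    post = posterior_source_probability(prior, rmp)
    print(f"{label:<24} prior = {prior:.0e}  posterior = {post:.4f}")
```

The same match statistic yields very different posterior probabilities depending on the prior, which is the point of insisting on the base rate.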

 

Judge Posner’s Digression on Regression

April 6th, 2012

Cases that deal with linear regression are not particularly exciting except to a small band of “quant” lawyers who see such things “differently.”  Judge Posner, the author of several books, including Economic Analysis of Law (8th ed. 2011), is a judge who sees things differently as well.

In a case decided late last year, Judge Posner took the occasion to chide the district court and the parties’ legal counsel for failing to assess critically a regression analysis offered by an expert witness on the quantum of damages in a contract case.  ATA Airlines Inc. (ATA), a subcontractor of Federal Express Corporation, sued FedEx for breaching an alleged contract to include ATA in a lucrative U.S. military deal.

Remarkably, the contract liability was a non-starter; the panel of the Seventh Circuit reversed the judgment that had been entered in favor of the plaintiff.  There never was a contract, and so the case should never have gone to trial.  ATA Airlines, Inc. v. Federal Exp. Corp., 665 F.3d 882, 888-89 (7th Cir. 2011).

End of Story?

In a diversity case, based upon state law, with no liability, you would think that the panel would and perhaps should stop once it reached the conclusion that there was no contract upon which to predicate liability.  Anything more would be, of course, pure obiter dictum, but Judge Posner could not resist the teaching moment, for the trial judge below, the parties, their counsel, and the bar alike:

“But we do not want to ignore the jury’s award of damages, which presents important questions that have been fully briefed and are bound to arise in future cases.”

Id. at 889. That award of damages was based upon plaintiff’s expert witness’s regression analysis.  Judge Posner was perhaps generous in suggesting that the damages issue, as it involved a regression analysis, had been fully briefed.  Neither party addressed the regression with the level of scrutiny given by Judge Posner and his colleagues, Judges Wood and Easterbrook.

The Federal Express defense lawyers were not totally asleep at the wheel; they did object on Rule 702 grounds to the regression analysis offered by plaintiff’s witness, Lawrence D. Morriss, a forensic accountant.

“There were, as we’re about to see, grave questions concerning the reliability of Morriss’s application of regression analysis to the facts. Yet in deciding that the analysis was admissible, all the district judge said was that FedEx’s objections ‘that there is no objective test performed, and that [Morriss] used a subjective test, and [gave] no explanation why he didn’t consider objective criteria’, presented issues to be explored on cross-examination at trial, and that ‘regression analysis is accepted, so this is not “junk science.” [Morriss] appears to have applied it. Although defendants disagree, he has applied it and come up with a result, which apparently is acceptable in some areas under some models. Simple regression analysis is an accepted model.”

Id. (quoting District Judge Richard L. Young).

Apparently it is not enough for trial judges within the Seventh Circuit to wave their hands and proclaim that objections go to weight not admissibility; nor is it sufficient to say that a generally accepted technique was involved in formulating an opinion without exploring whether the technique was employed properly and reliably.  Judge Posner’s rebuke was short on subtlety and tact in describing the district judge’s response to FedEx’s Rule 702 objections:

“This cursory, and none too clear, response to FedEx’s objections to Morriss’s regression analysis did not discharge the duty of a district judge to evaluate in advance of trial a challenge to the admissibility of an expert’s proposed testimony. The evaluation of such a challenge may not be easy; the ‘principles and methods’ used by expert witnesses will often be difficult for a judge to understand. But difficult is not impossible. The judge can require the lawyer who wants to offer the expert’s testimony to explain to the judge in plain English what the basis and logic of the proposed testimony are, and the judge can likewise require the opposing counsel to explain his objections in plain English.”

Id. The lawyers, including Federal Express’s lawyers, also came in for admonishment:

“This might not have worked in the present case; neither party’s lawyers, judging from the trial transcript and the transcript of the Rule 702 hearing and the briefs and oral argument in this court, understand regression analysis; or if they do understand it they are unable to communicate their understanding in plain English.”

Id.

The court and counsel are not without resources, as Judge Posner pointed out.  The trial court can appoint its own expert to assist in evaluating the parties’ expert witnesses’ opinions.  Alternatively, the trial judge could roll up his sleeves and read the chapter on regression analysis in the Reference Manual on Scientific Evidence (3d ed. 2011). Id. at 889-890.  Judge Posner’s opinion makes clear that had the trial court taken any of these steps, Morriss’s regression analysis would not have survived the Rule 702 challenge.

Morriss’s analysis was, to be sure, a rather peculiar regression of costs on revenues.  Inexplicably, ATA’s witness made cost the dependent variable, with revenue the independent variable.  Common sense would have told the judge that revenue (gained or lost) should have been the dependent term in the analysis.  ATA’s expert witness attempted to justify this peculiar regression by claiming that the more plausible variables that make up costs (personnel, labor, fuel, equipment) were not available.  Judge Posner would have none of this incredible excuse mongering:

“In any event, a plaintiff’s failure to maintain adequate records is not a justification for an irrational damages theory.”

Id. at 893.

Judge Posner proceeded to dissect Morriss’s regression in detail, both in terms of its design and implementation.  Interestingly, FedEx had a damages expert witness, who was not called at trial.  Judge Posner correctly observed that defendants frequently do not call their damages witnesses at trial lest the jury infer that they are less than sincere in their protestations about no liability.  The FedEx damages expert, however, had calculated a 95 percent confidence interval for Morriss’s prediction for ATA’s costs in a year after the alleged breach of contract.  (It is unclear whether the interval calculated was truly a confidence interval, or a prediction interval, which would have been wider.)  In any event, the interval included costs at the high end, which would have resulted in net losses, rather than net profits, as Morriss had opined.  “All else aside, the confidence interval is so wide that there can be no reasonable confidence in the jury’s damages award.”  Id. at 896.
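The difference between a confidence interval and a prediction interval can be made concrete with a small Python sketch of a simple linear regression of cost on revenue.  The data are invented for illustration and have nothing to do with ATA’s actual figures; the function name is hypothetical.  Both intervals use the same t multiplier (with n - 2 degrees of freedom), and so the sketch compares only the two standard errors.

```python
from math import sqrt


def interval_standard_errors(x, y, x0):
    """Fit y = a + b*x by least squares and return, at x0, the fitted value,
    the standard error of the estimated mean response (confidence interval),
    and the standard error of a new observation (prediction interval).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * sqrt(leverage)        # uncertainty in the fitted line
    se_pred = s * sqrt(1.0 + leverage)  # adds scatter of a single new year
    return intercept + slope * x0, se_mean, se_pred


# Hypothetical annual revenue and cost figures, in arbitrary units.
revenue = [10.0, 12.0, 15.0, 18.0, 20.0, 24.0]
cost = [9.0, 11.5, 13.0, 17.5, 18.0, 23.0]

fitted, se_mean, se_pred = interval_standard_errors(revenue, cost, x0=30.0)
print(f"fitted cost at revenue 30: {fitted:.2f}")
print(f"standard error, mean response: {se_mean:.2f}")
print(f"standard error, new observation: {se_pred:.2f}")
```

Because the prediction standard error adds the residual variance of a single new observation to the uncertainty in the fitted line, the prediction interval is always the wider of the two.  If the FedEx witness reported a true confidence interval, then the uncertainty about ATA’s costs in any particular year was larger still.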

After summarizing the weirdness of Morriss’s regression analysis, Judge Posner delivered his coup de grâce:

“This is not nitpicking. Morriss’s regression had as many bloody wounds as Julius Caesar when he was stabbed 23 times by the Roman Senators led by Brutus. We have gone on at such length about the deficiencies of the regression analysis in order to remind district judges that, painful as it may be, it is their responsibility to screen expert testimony, however technical; we have suggested aids to the discharge of that responsibility. The responsibility is especially great in a jury trial, since jurors on average have an even lower comfort level with technical evidence than judges. The examination and cross-examination of Morriss were perfunctory and must have struck most, maybe all, of the jurors as gibberish. It became apparent at the oral argument of the appeal that even ATA’s lawyer did not understand Morriss’s analysis; he could not answer our questions about it but could only refer us to Morriss’s testimony. And like ATA’s lawyer, FedEx’s lawyer, both at the trial and in his appellate briefs and at argument, could only parrot his expert.

***

If a party’s lawyer cannot understand the testimony of the party’s own expert, the testimony should be withheld from the jury. Evidence unintelligible to the trier or triers of fact has no place in a trial. See Fed.R.Evid. 403, 702.”

Id. at 896.  Ouch! Even being the victor can be a joyless occasion before Judge Posner.  For those who are interested in such things, the appellate briefs of the parties can be found online, both for ATA and for FedEx.

It is interesting to compare Judge Posner’s close scrutiny and analysis of the plaintiff’s expert witness’s regression with how the United States Supreme Court treated a challenge to the use of multiple regression in a race discrimination case in the mid-1980s.  In Bazemore v. Friday, 478 U.S. 385 (1986), the defendant criticized the plaintiffs’ regression on grounds that it omitted variables for major factors in any fair, sensible model of salary.  The Fourth Circuit had treated the omissions as fatal, but the Supreme Court excused the omissions by shifting the burden of producing a sensible, reliable regression model to the defense:

“The Court of Appeals erred in stating that petitioners’ regression analyses were ‘unacceptable as evidence of discrimination’, because they did not include ‘all measurable variables thought to have an effect on salary level’. The court’s view of the evidentiary value of the regression analysis was plainly incorrect. While the omission of variables from a regression analysis may render the analysis less probative than it otherwise might be, it can hardly be said, absent some other infirmity, that an analysis which accounts for the major factors ‘must be considered unacceptable as evidence of discrimination’. Ibid. Normally, failure to include variables will affect the analysis’ probativeness, not its admissibility.”

Id. at 400.  In a footnote, the Court made an abstract concession that “there may, of course, be some regressions so incomplete as to be inadmissible as irrelevant; but such was clearly not the case here.” Id. at 400 n.15.  When the Court decided Bazemore, the federal courts were still enthralled with their libertine approach to expert witness evidence.  It is unclear whether a straightforward review of the plaintiffs’ regressions in Bazemore under current Rule 702, without the incendiary claims of racism, would have yielded a more dispassionate assessment of the proffered evidence.
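Why an omitted variable can matter so much, or so little, is easier to see with numbers. The sketch below uses invented, noise-free data in which salary depends on a group indicator and on years of experience; none of the figures or names comes from the Bazemore record. It compares the group coefficient from a regression that omits experience with the coefficient from a regression that includes it. To avoid any library dependency, the fuller regression is computed by the standard partialling-out (Frisch-Waugh) device: regress the group indicator on experience, then regress salary on what is left over.

```python
def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def residuals(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    n = len(x)
    b = slope(x, y)
    a = sum(y) / n - b * sum(x) / n
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def salary(group, experience):
    # Hypothetical salary rule: the true group effect is 2, the experience effect is 5.
    return [30 + 2 * g + 5 * e for g, e in zip(group, experience)]


group = [0, 0, 0, 1, 1, 1]

# Case 1: experience is correlated with group membership.
exp_correlated = [1, 2, 3, 4, 5, 6]
pay_1 = salary(group, exp_correlated)
short_1 = slope(group, pay_1)                            # experience omitted
long_1 = slope(residuals(exp_correlated, group), pay_1)  # experience controlled

# Case 2: experience is distributed identically in both groups.
exp_balanced = [1, 2, 3, 1, 2, 3]
pay_2 = salary(group, exp_balanced)
short_2 = slope(group, pay_2)                            # experience omitted

print("correlated case, experience omitted: ", short_1)
print("correlated case, experience included:", long_1)
print("balanced case, experience omitted:   ", short_2)
```

In the first case the short regression attributes to group membership a difference that is mostly the work of experience, while the fuller regression recovers the effect built into the data. In the second case the omission does no harm to the group coefficient, because the omitted factor is unrelated to group membership. Whether an omission goes merely to weight or undermines the analysis entirely thus depends on facts about the omitted variable, which is the kind of inquiry a gatekeeping court could make.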