TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Statistical Significance – Will Judicial Notice Substitute for An Expert Witness?

July 23rd, 2012

Do litigants in civil and criminal proceedings need statistical expert witnesses to present statistical analyses? Or, can lawyers take the data that are in evidence, and present their own statistical analyses?

Surely, lawyers could add figures to arrive at a sum that is relevant to the issues in dispute.  Some lawyers and judges might be able to take model assumptions and compare two means or two proportions to show that the statistics likely did not come from the same population.  Indeed, some lawyers may be able to do such analyses better than some expert witnesses, but that observation leaves the question unanswered:  is it legally permissible?

In In re Pfizer Inc. Securities Litig., 584 F.Supp. 2d 621 (S.D.N.Y. 2008), defendant Pfizer filed a motion to dismiss a securities class action complaint.  The court found that Pfizer’s motion would require it to interpret statistical significance, and that it could not accept the parties’ non-expert assertions of the meaning of the concept; nor could the court take judicial notice of the meaning:

“The Court declines to take judicial notice of the meaning of statistical significance or of the data interpretations proffered by Defendants in the context of this motion practice. Rule 201 of the Federal Rules of Evidence provides that courts may only take notice of facts ‘either (1) generally known . . . or (2) capable of accurate and ready determination by resort to sources whose accuracy cannot reasonably be questioned’. Fed. R. Evid. 201(b). While statistical significance may have certain characteristics capable of general abstraction, it is far beyond the scope of Rule 201 to accept as fact the particular definitions of statistical significance proffered by Defendants as either facts generally known or as drawn from sources whose accuracy cannot reasonably be questioned. It is one thing to take notice of the fact that an author has written that 5% is the threshold for statistical significance. It is quite another thing entirely to use that 5% figure as a basis for rejecting the significance of complicated medical studies.”

Id. at 634. Similarly, the court refused to look at specific studies and conclude that they failed to find a statistically significant association between Celebrex and cardiovascular adverse events:

“A motion to dismiss a complaint is not an appropriate vehicle for determination as to the weight of the evidence, expert or otherwise. Clearly, the Court cannot take judicial notice that the three studies show a lack of any statistically significant link between Celebrex/Bextra and adverse cardiovascular events because that supposed fact is neither generally known nor capable of accurate and ready determination by reference to unquestionably accurate sources. Moreover, the Court cannot determine as a matter of law whether such links were statistically insignificant because statistical significance is a question of fact.”

Id. at 635.

In Bristol-Myers Squibb v. AIU Insurance Co., et al.,  Cause No. A-145,672, Jefferson County, 58th Judicial District, Texas, plaintiff’s counsel made a Batson challenge to the defendants’ exercise of peremptory challenges.  See Daily Transcript (May 13, 1997).  Not having expected the defense counsel to exercise their peremptory challenges in an apparently discriminatory fashion, the plaintiff’s counsel did not have a statistician ready to analyze the pattern of challenges.  One of the plaintiff’s counsel presented the analysis in his oral argument to the court.  The venire panel was made up of 49 persons, 18 black and 31 white.  The defense exercised 6 of their 7 peremptory challenges against black veniremen.  Based upon these numbers, plaintiff’s counsel presented a calculation of the probability that defense counsel would have exercised their challenges in such an extreme fashion if they had made their choices independently of race.  The defense objected to plaintiff’s counsel’s calculations, but the trial court overruled the objection and noted that the laws of probability were subject to judicial notice.  Id. at 828-30.
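For readers curious about the arithmetic, the calculation can be run as a simple hypergeometric tail probability, treating the seven strikes as draws without replacement from the 49-person panel.  The sketch below is mine, in Python with scipy, and is offered only as an illustration of the kind of computation counsel presented; the transcript does not reveal the precise method actually used.

```python
# Probability of striking at least 6 of 18 black veniremen when exercising
# 7 peremptory challenges "at random" from a 49-person panel (18 black, 31 white).
# Model: hypergeometric draws without replacement, under a race-neutral null hypothesis.
from scipy.stats import hypergeom

panel_size, black_members, strikes = 49, 18, 7
observed_black_strikes = 6

# sf(k - 1) returns P(X >= k) under the race-neutral null.
p_upper_tail = hypergeom.sf(observed_black_strikes - 1, panel_size, black_members, strikes)
print(f"P(6 or more of 7 strikes fall on black veniremen | race-neutral): {p_upper_tail:.4f}")
# Prints roughly 0.007, i.e., well under one percent.
```

On those numbers, the chance of six or more of seven strikes falling on black veniremen under a race-neutral model is well under one percent.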

The Texas trial court found a prima facie case of discrimination, and permitted plaintiff’s counsel to cross-examine defense counsel about their peremptory and selection decisions.  Id. at 858.  The case settled shortly afterwards.  See also Andrew T. Berry, “Selecting Jurors,” 24 Litigation 8, 9 (Fall 1997)(“For example, in a recent (unreported) large civil case in the Southwest, the defendants successfully defeated a Batson challenge to their use of 85 percent of their peremptory challenges against protected class members. The successful defense? That the two dozen eminent counsel …, given less than a quarter-hour to exercise their peremptories were simply too disorganized to have struck jurors in violation of Batson.”)

In the welding fume MDL 1535, the plaintiffs persisted in challenges to a particular industry-funded, published epidemiologic study, which reported findings of no increased risks for Parkinson’s disease and parkinsonism among non-shipyard Danish welders.  Jon Fryzek, J. Hansen, S. Cohen, J. Bonde, et al., “A cohort study of Parkinson’s disease and other neurodegenerative disorders in Danish welders,” 47 J. Occup. & Envt’l Med. 466 (2005).  Plaintiffs’ counsel went to the extreme of traveling to Denmark, with one of their expert witnesses in tow, to analyze the underlying data for this study. Upon returning to the United States, the plaintiffs moved to bar reliance upon the Fryzek study, on the theory that the statistical analysis concerning the article’s finding of no statistically significant difference in the age of onset was incorrect.  In support of their argument, one of the plaintiff’s counsel, a law professor who was assisting plaintiffs in the welding fume litigation, submitted an affidavit in support of the motion in limine to bar defense witness’s testimony.  See Affidavit of Theodore Eisenberg, in In re Welding Fume Products Liability Litigation, Case No.: 1:03-cv-17000, MDL No. 1535, Document 1862 Filed 08/07/2006.

Eisenberg’s affidavit reported analyses of the Danish data, apparently based upon work done by an unnamed “programmer” at the Danish Cancer Society.  The affidavit included truncated computer program output, without identification of the statistical tests, or of the statistical software, used. Eisenberg interpreted the p-value result of the attached statistical analysis to show that there was a statistically significant difference in the age of onset of Parkinson’s disease between welders and non-welders.

The defense opposed the motion on grounds that Eisenberg’s affidavit was an ethically impermissible attempt by a lawyer in the case to present an expert witness opinion.  The defense also countered substantively with an affidavit from one of its expert witnesses, who analyzed the affidavit and realized that Eisenberg and the anonymous programmer had not presented the complete software output from their analyses, and that they had used a different test from that used in the published paper.  Eisenberg’s affidavit therefore had not identified an error in the published paper.  Declaration of Timothy L. Lash (Sept. 11, 2006), filed in In re Welding Fume Products Liability Litigation, Case No.: 1:03-cv-17000, MDL No. 1535.  The trial court denied the plaintiff’s motion to bar reliance upon the Fryzek study, without comment on the propriety of Eisenberg’s affidavit.

The MTBE mass tort litigation gave rise to a peculiar instance in which a trial court held that a real estate value appraiser had departed from the level of intellectual rigor used in assessing property value changes, claimed to have resulted from a gas station’s pollution of the ground water in a small town in Orange County, New York.  The witness opined that the plaintiffs’ property suffered a 15% decline in market value, but he failed to identify the methods he used to arrive at his opinion. In re Methyl Tertiary Butyl Ether (“MTBE”) Prods. Liab. Litig., 2008 U.S. Dist. LEXIS 44216 (S.D.N.Y. June 4, 2008)(Scheindlin, J.).  The expert witness did explain that there were so few sales in the affected town that he could not use regression analysis, and that it was thus necessary to look at “trend data on sales by sub-markets, sales/list price analysis and days on the market comparisons.” Id. at *5.  Even with that explanation, the trial court could not discern what method the witness actually used:

“In this case, I am unable to discern any method — much less a reliable method — that Langer used to reach his conclusion that the value of plaintiffs’ property decreased by fifteen percent because of MTBE contamination. Rather, Langer has merely compiled market data and then offered his conclusions, yet he has failed to explain the relationship between the two.”

Id. at *11.

Although the expert witness’s departure from the professional standard of care rendered his opinion inadmissible, the trial court decided that the would-be expert witness could still testify as a fact witness to the facts that he had collected about sales trends in the affected community and elsewhere.  According to the court, the statistics gathered by this witness were relevant, and the plaintiffs’ counsel could argue plausible inferences to the jury from the sales figures.  Id. at *16-17.  The court thus remarkably permitted the plaintiffs’ counsel to provide the statistical analysis that his designated expert witness had failed to give in a legally reliable form.

Copywrongs – Plagiarism in the Law

July 20th, 2012

Previously I have written about the ethical and practical issues involved in lawyers’ plagiarism.  See also Copycat – Further Thoughts.  Professor Douglas E. Abrams, of the University of Missouri School of Law, has written an interesting article on issues raised by lawyers’ plagiarism, “Plagiarism in Lawyers’ Advocacy: Imposing Discipline for Conduct Prejudicial to the Administration of Justice,” which is due out next year in the Wake Forest Law Review.  For now, a draft is available for download from the Social Science Research Network.

Abrams details some recent cases in which counsel were chastised for copying published material, prior judicial opinions, and other counsel’s briefs.  There are still many gray areas.  Abrams does not deal with legal forms.  However flattering it is for judges to adopt language from lawyers’ briefs, is it plagiarism for them to do so?  Does a judge commit plagiarism by adopting language wholesale from a law clerk’s draft? Does a senior lawyer commit plagiarism by leaving off the names of junior lawyers and law clerks who contributed portions of the brief? Does it matter if the senior lawyer’s writing is an article for publication rather than a brief to the court?  If a lawyer takes language from another’s brief and uses it in an article, does she commit plagiarism?  If a lawyer discovers plagiarism committed by another lawyer, is there an ethical obligation to report the plagiarizer?

Discovery of Statistician Expert Witnesses

July 19th, 2012

This post has been updated and superseded by “

Daubert Approaching the Age of Majority

July 19th, 2012

PLAINTIFF LAWYER:  WHY DID YOU FILE THE DAUBERT?

DEFENSE LAWYER:  THE DAUBERT WILL SET MY CLIENT FREE.

PLAINTIFF LAWYER:  THE DAUBERT WILL COST YOUR CLIENT A LOT OF MONEY WHICH COULD COMPENSATE YOUR VICTIMS.

DEFENSE LAWYER:  THE DAUBERT WILL BAR YOUR EXPERTS AND GIVE US THE SUMMARY JUDGMENT.

PLAINTIFF LAWYER:  YOU WILL LOSE THE DAUBERT.

DEFENSE LAWYER:  NO; WE WILL WIN THE DAUBERT BECAUSE YOU DO NOT HAVE THE BRADFORD HILL.

PLAINTIFF LAWYER:  THE BRADFORD HILL ARE NINE THINGS.  WHICH ONE ARE YOU TALKING ABOUT?

DEFENSE LAWYER:  I AM TALKING ABOUT ALL NINE.

PLAINTIFF LAWYER:  NO, WE HAVE THE BRADFORD HILL LOCKED UP.

DEFENSE LAWYER:  BUT YOU DO NOT HAVE ANY OF THE NINE.

PLAINTIFF LAWYER:  THE BRADFORD HILL SAYS NONE OF THE NINE IS NECESSARY; THEREFORE WE HAVE ALL SATISFIED.

DEFENSE LAWYER:  BUT WAIT; YOU DO NOT EVEN GET TO THE BRADFORD HILL.  YOUR P-VALUE IS TOO HIGH.

PLAINTIFF LAWYER:  I AM THE PLAINTIFF LAWYER.  I CANNOT BE TOO RICH, TOO POWERFUL, OR HAVE TOO HIGH A P-VALUE.

DEFENSE LAWYER:  NO; YOU NEED TO SHOW STATISTICAL SIGNIFICANCE BEFORE YOU GET TO THE BRADFORD HILL.

PLAINTIFF LAWYER:  STATISTICAL SIGNIFICANCE IS NOT A LITMUS TEST; SO I HAVE THAT SATISFIED AS WELL.

DEFENSE LAWYER:  YOU DON’T KNOW WHAT YOU’RE TALKING ABOUT.

PLAINTIFF LAWYER:  YOU NEED TO KEEP UP WITH THE CASE LAW.  LET’S GO PICK A JURY.

Pin the Tail on the Significance Test

July 14th, 2012

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  An adequate understanding of significance probability requires the recognition that the tail probability, which represents the probability of a result at least as extreme as the result obtained if the null hypothesis is true, may be the area under one or both sides of the probability distribution curve.  Specifying an attained significance probability therefore requires us to specify whether the p-value is one- or two-sided; that is, whether we have counted the observed result and the more extreme results in one direction or in both.

 

Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expect courts and counsel to look to the Reference Manual for guidance.  On this issue, as with the notion of statistical significance itself, the Manual is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3d ed. 2011). This advice is pragmatic but a bit misleading.  The rationale for the two-tailed test is not really tied to multiple testing; if there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple-testing artifacts. In some cases, where the hypothesis test specifies an alternative hypothesis that is simply different from the null hypothesis, extreme values both above and below the null count in favor of rejecting the null, and a two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it really should not drive our choice between one-tailed and two-tailed tests.
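The chapter’s coin-tossing arithmetic is easy to verify.  Here is a minimal sketch in Python (mine, not the Manual’s) that computes the exact binomial tail probabilities for 532 heads in 1,000 tosses of a fair coin:

```python
# One-tailed and two-tailed p-values for 532 heads in 1,000 tosses of a fair coin.
from scipy.stats import binom

n, k, p_null = 1000, 532, 0.5

# One-tailed: P(X >= 532) under the null hypothesis of a fair coin.
p_one_tailed = binom.sf(k - 1, n, p_null)

# Two-tailed: add the symmetric lower tail, P(X <= 468).
p_two_tailed = p_one_tailed + binom.cdf(n - k, n, p_null)

print(f"one-tailed p = {p_one_tailed:.3f}")   # about 0.023
print(f"two-tailed p = {p_two_tailed:.3f}")   # about 0.046
```

The printed values match the 2.3% and 4.6% figures given in the quoted passage.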

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what the FDA does when deciding whether a clinical trial has established the efficacy of a new medication.  Does the agency license the medication if the sponsor’s trial comes close to 5%, or does it demand significance at the 5% level, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law, when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10. This statement is a helpful bulwark against the oft-repeated suggestion that any p-value would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of statistical test, whether one- or two-sided, may be tied to prior beliefs and to the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direction opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary; there are times when the use of a one-tailed or a two-tailed test is preferable, but the chapter withholds its guidance.  The statement that a one-tailed test “produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed value will not necessarily yield a correct two-tailed measure for some asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and on whether the p-value is less than 0.05, then courts must realize that there may be alternative approaches to calculating significance probability, such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n.49.  The legal citations, however, are limited, and given the lack of sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.
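To see why doubling can mislead for skewed distributions, consider a small sketch with hypothetical numbers of my own choosing (five events in twenty trials, against a null proportion of ten percent); the one-sided exact p-value, its doubled value, scipy’s exact two-sided p-value, and the one-sided mid-p come apart:

```python
# Hypothetical example: 5 events observed in 20 trials, null hypothesis p0 = 0.10.
# Because the binomial(20, 0.10) distribution is asymmetrical, doubling the
# one-sided p-value is not the same as an exact two-sided test.
# Requires scipy >= 1.7 for scipy.stats.binomtest.
from scipy.stats import binom, binomtest

k, n, p0 = 5, 20, 0.10

p_one_sided = binom.sf(k - 1, n, p0)             # P(X >= 5 | p0)
p_doubled   = min(1.0, 2 * p_one_sided)          # naive doubling of the one-sided value
p_two_sided = binomtest(k, n, p0, alternative="two-sided").pvalue  # exact two-sided test
p_mid       = binom.sf(k, n, p0) + 0.5 * binom.pmf(k, n, p0)       # one-sided mid-p-value

print(f"one-sided exact p : {p_one_sided:.4f}")
print(f"doubled one-sided : {p_doubled:.4f}")
print(f"exact two-sided p : {p_two_sided:.4f}")
print(f"one-sided mid-p   : {p_mid:.4f}")
```

On these invented inputs, the doubled one-sided value is roughly twice the exact two-sided value, because no outcome in the lower tail is as improbable as the observed count; the mid-p-value is smaller than the one-sided exact value.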

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably, the authors of the epidemiology chapter declare that because “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease), a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n.83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and its authors fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite a questionable case involving over-the-counter medications that contained phenylpropanolamine (PPA), used for allergy and cold decongestion. Id. (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case)).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation used one-tailed tests when submitted to the FDA, but was revised to use two-tailed tests when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA litigation thus presents a case in which the one-tailed test was used for regulatory purposes, but the two-tailed test was used for a scientific and clinical audience.

The other case cited by the epidemiology chapter was a decision from the District of Columbia reviewing an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use a one-tailed test in assessing whether second-hand smoke was a carcinogen). The EPA is a federal agency in the “protection” business, not in the business of investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action is often based upon precautionary-principle judgments, which differ from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y. 1984) (“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir. 1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  One of plaintiffs’ expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation, out of hand.  Another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-sided testing as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.) (citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual‘s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States, 433 U.S. 299 (1977), a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view, Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).
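A short sketch in Python, using invented hiring figures rather than the Hazelwood data, shows how the 1.64 versus 1.96 standard-error cutoffs can decide the outcome for one and the same disparity:

```python
# One-tailed vs. two-tailed cutoffs at the 5% significance level, applied to a
# hypothetical hiring disparity (numbers invented for illustration only).
from math import sqrt
from scipy.stats import norm

z_one_tailed = norm.ppf(0.95)    # about 1.645 standard errors
z_two_tailed = norm.ppf(0.975)   # about 1.960 standard errors

# Hypothetical: 100 hires from a labor pool that is 15% black, with 9 black hires.
hires, pool_share, black_hires = 100, 0.15, 9
expected = hires * pool_share
se = sqrt(hires * pool_share * (1 - pool_share))
z = (expected - black_hires) / se   # shortfall below expectation, in standard errors

print(f"one-tailed 5% cutoff: {z_one_tailed:.2f} standard errors")
print(f"two-tailed 5% cutoff: {z_two_tailed:.2f} standard errors")
print(f"observed shortfall  : {z:.2f} standard errors")  # about 1.68
```

With these made-up figures the shortfall sits at roughly 1.7 standard errors: enough to reject the null hypothesis at 5% with a one-tailed test, but not with a two-tailed test, which is the same kind of gap that separated the Hazelwood majority from Justice Stevens.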

The treatise, Modern Scientific Evidence, addresses some of the case law presenting disputes over one- versus two-tailed tests.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tailed test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter & Gamble Co. v. Chesebrough-Pond’s Inc., 747 F.2d 114 (2d Cir. 1984) (discussing one-tail versus two-tail tests in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879 (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998) (rejecting Shanna Swan’s effort to reinterpret study data by using a one-tailed test of significance; “Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.”)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)

Tal Golan’s Preliminary History of Epidemiologic Evidence in U.S. Courts

July 10th, 2012

Tal Golan is an historian, with a special interest in the history of science in the 18th and 19th centuries, and in the historical relationships among science, technology, and the law.  He now teaches history at the University of California, San Diego.  Golan’s book on the history of expert witnesses in the common law is an important starting place for understanding the evolution of the adversarial expert witness system in English and American courts.  Tal Golan, Laws of Man and Laws of Nature: A History of Scientific Expert Testimony (Harvard 2004).

Last year, Golan led a faculty seminar at the University of Haifa’s Law School on the history of epidemiologic evidence in 20th century American litigation.  A draft of Golan’s paper is available at the school’s website, and for those interested in the evolution of the American courts’ treatment of statistical and epidemiologic evidence, the paper is worth a look.  Tal Golan, “A preliminary history of epidemiological evidence in the twentieth-century American Courtroom” manuscript (2011) [Golan 2011].

There are problems, however, with Golan’s historical narrative.  Golan points to tobacco cases as the earliest forays into the use of epidemiologic evidence to prove health claims in court:

“I found only four toxic tort cases in the 1960s that involved epidemiological evidence – two tobacco and two vaccine cases. In the tobacco cases, the plaintiffs tried and failed to establish a causal relation between smoking and cancer via the testimony of epidemiological experts. In both cases the judges dismissed the epidemiological evidence and directed summary verdicts for the tobacco companies.38

Golan 2011 at 11 & n. 38 (citing Pritchard v. Liggett & Myers Tobacco Co., 295 F.2d 292 (1961); Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (1963)).  Golan may be correct that some of the early tobacco cases were dismissive of statistical and epidemiologic evidence, but these citations do not support his assertion.  The Lartigue case resulted in a defense verdict after a jury trial.  The judgment for the defendant was affirmed on appeal, with specific reference to the plaintiff’s use of epidemiologic evidence.  Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (5th Cir. 1963) (“The plaintiff contends that the jury’s verdict was contrary to the manifest weight of the evidence. The record consists of twenty volumes, not to speak of exhibits, most of it devoted to medical opinion. The jury had the benefit of chemical studies, epidemiological studies, reports of animal experiments, pathological evidence, reports of clinical observations, and the testimony of renowned doctors. The plaintiff made a convincing case, in general, for the causal connection between tobacco and cancer and, in particular, for the causal connection between Lartigue’s smoking and his cancer. The defendants made a convincing case for the lack of any causal connection.”), cert. denied, 375 U.S. 865 (1963), and cert. denied, 379 U.S. 869 (1964).  Golan is thus wrong to suggest that the plaintiffs in Lartigue suffered a summary judgment or a directed verdict on their causation claims.

In Pritchard, the plaintiff had three trials in the course of litigating his tobacco-related claims.  See Pritchard v. Liggett & Myers Tobacco Co., 134 F. Supp. 829 (W.D. Pa. 1955), rev’d, 295 F.2d 292, 294 (3d Cir. 1961), 350 F.2d 479 (3d Cir. 1965), cert. denied, 382 U.S. 987 (1966), amended, 370 F.2d 95 (3d Cir. 1966), cert. denied, 386 U.S. 1009 (1967).  The Pritchard case ultimately turned on liability more than causation issues.  In both cases, Golan’s citations are abridged and incorrect.

Golan also wades into a discussion of statistical significance in which he misstates the meaning of the concept and incorrectly describes how it was handled in at least one important case:

“Statistics provides such an assurance by calculating the probability of false association, and the epidemiological dogma demands it to be smaller than 5% (i.e, less than 1 in 20) for the association to be considered statistically significant.”

Golan 2011, at 18.  This statement is wrong.  Statistics do not provide a probability of the truth or falsity of the association.  The significance probability to which Golan refers measures the probability of data at least as extreme as those observed if the null hypothesis of no difference is correct.

Having misunderstood and misstated the meaning of significance probability, Golan proceeds to make the classic misidentification of the statistical significance probability with the probability of either the null hypothesis or the observed result.  Frequentist statistical testing cannot do this, and Golan’s error has no place in a history of these concepts, other than to point out that courts have frequently made this mistake:

“The ‘statistical significance‘ standard is far more demanding than the ‘preponderance of the evidence‘ or ‘more likely than not‘ standard used in civil law. It reflects the cautious attitude of scientists who wish to be 95% certain that their measurements are not spurious.

**********

Epidemiologists have considered the price well worth paying. So has criminal law, which emphasizes the minimization of false conviction, even at the price of overlooking true crime. But civil law does not share this concern.”

This narrative misstates what epidemiologists are doing when they use significance probability and null hypothesis significance testing.  The confusion between epidemiologic statistical standards and the burden of proof in criminal cases is a serious error.

Golan compares and contrasts the approaches of the trial judges in Allen v. United States, and in In re Agent Orange:

“Judge Weinstein, on the other hand, was far less concerned with the strictness of the epidemiology. A scholar of evidence, member of the Advisory Committee that drafted the Federal Rules of Evidence during the early 1970s, and a critic of the partisan deployment of science in the adversarial courtroom, Weinstein embraced the stringent 95% significance threshold as a ready-made admissibility test that could validate the veracity of the statistical evidence used in court. Thus, while he referred to epidemiological studies as ‘the best (if not the sole) available evidence in mass exposure cases,’ he nevertheless refused to accept them in evidence, unless they were statistically significant.64”

Golan at 19.  Weinstein is all that and more, but he never simplistically embraced statistical significance as a “ready-made admissibility test.”  Of course, 95% is the coefficient of confidence, the complement of an alpha of 0.05, but this alpha is not a particularly stringent threshold unless it is misunderstood as a burden of proof.  Contrary to Golan’s suggestion, Judge Weinstein was not being conservative or restrictive in his approach in In re Agent Orange.

Golan’s “preliminary” history is a good start, but it misses an important perspective.  After World War II, biological science, in the form of genetics, as well as epidemiology and other areas, grew to encompass stochastic processes as well as mechanistic processes.  To a large extent, in permitting judgments to be based upon statistical and epidemiologic evidence, the law was struggling to catch up with developments in science.   There is quite a bit of evidence that the law is still struggling.

Reference Manual on Scientific Evidence (3d edition) on Statistical Significance

July 8th, 2012

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value?  This is an obviously invalid argument, or perhaps a passage badly in need of an editor.

Berger’s citations in support are curiously inaccurate.  Footnote 62 cites the Cook case:

“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”

The expert witness, Dr. Clapp, in Cook did rely upon his own study, which did not obtain a statistically significant result, but the trial court admitted his testimony; the court denied the Rule 702 challenge to Clapp and permitted him to testify about a statistically non-significant ecological study.

Footnote 63 is no better:

“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”

With respect to the concept of statistical significance, the Viagra case centered on the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis.  The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies.  When the defendant brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself, and granted the motion to exclude McGwin’s testimony.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  Berger’s characterization of the review is incorrect, and her failure to cite the subsequent procedural history is disturbing.

 

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3d ed. 2011).  Although the chapter confuses and conflates the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.

 

Chapter on Multiple Regression

The chapter on regression does not add much to the earlier and later discussions.  The author asks rhetorically what the appropriate level of statistical significance is, and answers:

“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47”

Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.

 

Chapter on Epidemiology

The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573.  The suggestion that a statistically significant study has results unlikely to be due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics.  The problem is that “results” is ambiguous between the data (as extreme as, or more extreme than, what was observed) and the point estimate of the mean or proportion in the sample.  Furthermore, the chapter’s statement here omits any reference to the conditional nature of the probability, which depends upon the assumption that the null hypothesis is correct.

The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary“ to mean “I will get reversed.”  The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary.  Some young adults, age 17.8 years, may be better educated, better engaged in politics, and better informed about current events than some 35 year olds, but the law must set a cut-off.  Two year olds are demonstrably unfit, and 82 year olds are surely past the threshold of maturity requisite for political participation.  A court might admit an opinion based upon a study of a rare disease, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the disparity observed had a p = 0.15.

The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”

Id. at 573 n.68.  Actually, this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing.  The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance.  See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The epidemiology chapter also chastises courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct:

“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).”

Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose  as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”

Id. at 577 n.81.  The footnote goes on to explain further the difference between alpha probability and burden of proof probability, but incorrectly asserts that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.”  Id.  The significance probability does not address the probability that the observed statistic is the result of random chance; rather it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true.  Kaye and Freedman’s chapter on statistics does much better at describing and avoiding the transpositional fallacy when describing p-values.
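To make the distinction concrete, here is a minimal simulation sketch, in Python and with purely hypothetical numbers, of what a p-value does and does not say.  It estimates how often a departure at least as large as the one observed would arise if the null hypothesis were true; it says nothing about the probability that the null hypothesis, or the observed estimate, is correct.

import random

random.seed(1)
n_cases = 40                 # hypothetical number of cases
observed_rate = 0.30         # hypothetical exposure rate among the cases
null_rate = 0.20             # exposure rate assumed under the null hypothesis
observed_departure = abs(observed_rate - null_rate)

# Repeatedly draw samples of 40 cases under the null hypothesis and count how often
# the departure from the null rate is at least as large as the departure observed.
trials = 100_000
extreme = 0
for _ in range(trials):
    exposed = sum(random.random() < null_rate for _ in range(n_cases))
    if abs(exposed / n_cases - null_rate) >= observed_departure:
        extreme += 1

print(f"simulated two-sided p-value, assuming the null is true: {extreme / trials:.3f}")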

When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct.  What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.

The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87

Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”

Id. at 578-79.  Mostly true, but again rather unhelpful to judges and lawyers.  The authors divide the world up into “strict” testers and those critical of “strict” testing.  Where is the boundary?  Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all?  I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis.  Meta-analysis in such a situation might not just ameliorate concerns about random error; it might eliminate them, as the simple pooling calculation sketched below illustrates.  But what of those critical of strict testing?  Such criticism certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).  The chapter’s reference to confidence intervals is correct in part; confidence intervals permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of the magnitude of association, as well as the point estimate of the association obtained from the sample.  Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
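For illustration only, the following sketch (Python, with hypothetical studies) pools log odds ratios by fixed-effect, inverse-variance weighting.  Three studies whose 95% confidence intervals each just fail to exclude 1.0 can yield a pooled interval that does exclude it, which is the sense in which meta-analysis can do more than merely ameliorate random error.

import math

# Hypothetical studies: (odds ratio, lower 95% bound, upper 95% bound)
studies = [(1.40, 0.95, 2.06), (1.50, 0.98, 2.30), (1.35, 0.93, 1.96)]

weights, weighted_logs = [], []
for or_, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # back out the standard error
    w = 1 / se ** 2                                    # inverse-variance weight
    weights.append(w)
    weighted_logs.append(w * math.log(or_))

pooled_log = sum(weighted_logs) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
lo = math.exp(pooled_log - 1.96 * pooled_se)
hi = math.exp(pooled_log + 1.96 * pooled_se)
print(f"pooled OR = {math.exp(pooled_log):.2f}, 95% CI {lo:.2f} to {hi:.2f}")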

In the final analysis, the epidemiology chapter is unclear and imprecise.  I believe it confuses matters more than it clarifies.  There is clearly room for improvement in the Fourth Edition.

Viagra, Part II — MDL Court Sees The Light – Bad Data Trump Nuances of Statistical Inference

July 8th, 2012

In the Viagra vision loss MDL, the first Daubert hearing did not end well for the defense.  Judge Magnuson refused to go beyond conclusory statements by the plaintiffs’ expert witness, Gerald McGwin, and to examine the qualitative and quantitative evaluative errors invoked to support plaintiffs’ health claims.  The weakness of McGwin’s evidence, however, appeared to  encourage Judge Magnuson to authorize extensive discovery into McGwin’s study.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1090 (D. Minn. 2008).

The discovery into McGwin’s study had already been underway, with subpoenas to him and to his academic institution.  As it turned out, defendant’s discovery into the data and documents underlying McGwin’s study won the day.  Although Judge Magnuson struggled with inferential statistics, he understood the direct attack on the integrity of McGwin’s data.  Over a year after denying defendant’s Rule 702 motion to exclude Gerald McGwin, the MDL court reconsidered and granted the motion.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).

The basic data on prior exposures and risk factors for the McGwin study were collected by telephone surveys, from which the information was coded into an electronic dataset.  In analyzing the data, McGwin used the electronic dataset and not the survey forms.  Id. at 939.  The transfer from survey forms to electronic dataset did not go smoothly; about 11 patients were miscoded as “exposed” when their use of Viagra post-dated the onset of NAION.  Id. at 942.  Furthermore, the published article incorrectly stated personal history of heart attack as a “risk factor”; the survey had inquired about family, not personal, history of heart attack.  Id. at 944.

The plaintiffs threw several bombs in response, but without legal effect.  First, the plaintiffs claimed that the study participants had been recontacted and the database had been corrected, but they were unable to document this process or the alleged corrections.  Id. at 433.  Furthermore, the plaintiffs could not explain how, if their contention had been true, McGwin would not have committed serious violations of his university’s institutional review board’s regulations with respect to deviations from the original protocol.  Id. at 943 n.7.

Second, the plaintiffs argued that the underlying survey forms were “inadmissible” and thus the defense could not use them to impeach the McGwin study.  Some might think this a duplicitous argument, utterly at odds with Rule 703 – rely upon a study but prevent use of the underlying data and documents to explain that the study does not show what it purports to show.  The MDL court spared the plaintiffs the embarrassment of ruling that the documents on which McGwin had based his study were inadmissible, and found that the forms were business records, admissible under Federal Rule of Evidence 803(6).  The court could have gone further to point out that McGwin’s reliance upon hearsay in the form of his study, McGwin 2006, opened the door to impeaching the hearsay relied upon with other hearsay.  See Rule 806.

When defense counsel sat down with McGwin in a deposition, they found that he had not undertaken any new analyses of corrected data.  Plaintiffs’ counsel had directed him not to do so.  Id. at 940-41.  But then, after the deposition was over, McGwin submitted a letter to the journal to report a corrected analysis.  Pfizer’s counsel obtained the letter in response to their subpoena to McGwin’s university, the University of Alabama, Birmingham.  Mirabile dictu; now the increased risk appeared limited only to the defendant’s medication, Viagra!

The trial court was not amused.  First, the new analysis was not peer reviewed, and the court had placed a great deal of emphasis on peer review in denying the first challenge to McGwin.  Second, the new analysis was no longer that of an independent scientist; it was conducted and submitted as a letter to the editor while McGwin was working for plaintiffs’ counsel.  Third, the plaintiffs and McGwin conceded that the data were not accurate.  Last, but not least, the trial court clearly was not pleased that the plaintiffs’ counsel had deliberately delayed McGwin’s further analyses until after the deposition, and had then tried to submit yet another supplemental report with those further analyses.  In sum:

“the Court finds good reason to vacate its original Daubert Order permitting Dr. McGwin to testify as a general causation expert based on the McGwin Study as published. Almost every indicia of reliability the Court relied on in its previous Daubert Order regarding the McGwin Study has been shown now to be unreliable.  Peer review and publication mean little if a study is not based on accurate underlying data. Likewise, the known rate of error is also meaningless if it is based on inaccurate data. Even if the McGwin Study as published was conducted according to generally accepted epidemiologic research and did not result from post-litigation research, the fact that the McGwin Study appears to have been based on data that cannot now be documented or supported renders it inadmissibly unreliable. The Court concludes that under Daubert, Dr. McGwin’s opinion, to the extent that it is based on the McGwin Study as published, lacks sufficient indicia of reliability to be admitted as a general causation opinion.”

Id. at 945-46.  The remaining evidence was the Margo & French study, but McGwin had previously criticized that study as lacking data that ensured that Viagra use preceded onset of NAION.  In the end, McGwin was left with bupkes, and the plaintiffs were left with even less.

*******************

McGwin 2006 Was Also A Pain in the Rear End for McGwin

The Rule 702 motions and hearings on McGwin’s proposed testimony had consequences in the scientific world itself.  In 2011, the British Journal of Ophthalmology retracted McGwin’s 2006 paper.  “Retraction: Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,” 95 Brit. J. Ophthalmol. 595 (2011).

Interestingly, the retraction was reported in the Retraction Watch blog, “Retractile dysfunction? Author says journal yanked paper linking Viagra, Cialis to vision problem after legal threats.”  The blog treated the retraction as routine except for the hint of “legal pressure”:

“One of the authors of the paper, a researcher at the University of Alabama named Gerald McGwin Jr., told us that the journal retracted the article because it had become a tool in a lawsuit involving Pfizer, which makes Viagra, and, presumably, men who’d developed blindness after taking the drug:

‘The article just became too much of a pain in the rear end. It became one of those things where we couldn’t provide all the relevant documentation [to the university, which had to provide records for attorneys].’

Ultimately, however, McGwin said that the BJO pulled the plug on the paper.”

Id. The legal threat is hard to discern other than the fact that lawyers wanted to see something that peer reviewers almost never see – the documentation underlying the published paper.  So now, the study that formed the basis for the original ruling against Pfizer floats aimlessly as a derelict on the sea of science.  McGwin is, however, still at his craft.  In a study he published in 2010, he claimed that Viagra but not Cialis use was associated with hearing impairment.  Gerald McGwin, Jr, “Phosphodiesterase Type 5 Inhibitor Use and Hearing Impairment,” 136 Arch. Otolaryngol. Head & Neck Surgery 488 (2010).

Where are Senator Grassley and Congressman Waxman when you need them?

Love is Blind but What About Judicial Gatekeeping of Expert Witnesses? – Viagra Part I

July 7th, 2012

The Viagra litigation over claimed vision loss vividly illustrates the difficulties that trial judges have in understanding and applying the concept of statistical significance.  In this MDL, plaintiffs sued for a specific form of vision loss, non-arteritic ischemic optic neuropathy (NAION), which they claimed was caused by their use of defendant’s medication, Viagra.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071 (D. Minn. 2008).  Plaintiffs’ key expert witness, Gerald McGwin, considered three epidemiologic studies; none found a statistically significant elevation of risk of NAION after Viagra use.  Id. at 1076.  The defense filed a Rule 702 motion to exclude McGwin’s testimony, based in part upon the lack of statistical significance of the risk ratios he relied upon for his causal opinion.  The trial court held that this lack did not render McGwin’s testimony unreliable and inadmissible.  Id. at 1090.

One of the three studies considered by McGwin was his own published paper.  G. McGwin, Jr., M. Vaphiades, T. Hall, C. Owsley, ‘‘Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,’’ 90 Br. J. Ophthalmol. 154 (2006)[“McGwin 2006”].    The MDL court noted that McGwin had stated that his paper reported an odds ratio (OR) of 1.75, with a 95% confidence interval (CI), 0.48 to 6.30.  Id. at 1080.  The study also presented multiple subgroup analyses of men who had reported Viagra use after a history of heart attack (OR = 10.7) or hypertension (OR = 6.9), but the MDL court did not provide p-values or confidence intervals for the subgroup analysis results.

Curiously, Judge Magnuson eschewed the guidance of the Reference Manual on Scientific Evidence in dealing with the statistics of sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.  The new third edition again conveys the same basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The point of the RMSE’s guidance is, of course, that the standard error, or the confidence interval (C.I.) based upon a specified number of standard errors, is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.
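Here is a minimal sketch, in Python and with hypothetical 2×2 counts, of the Manual’s point: report the estimate together with its standard error, or with the confidence interval built from that standard error, because a wide interval flags an unstable point estimate.

import math

# Hypothetical case-control counts
a, b = 12, 88    # exposed cases, unexposed cases
c, d = 7, 93     # exposed controls, unexposed controls

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of the log odds ratio
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f}, SE(log OR) = {se_log_or:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

With these hypothetical counts, the odds ratio of roughly 1.8 sounds impressive on its own; reported with an interval running from roughly 0.7 to 4.8, it sounds like what it is, an unstable estimate.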

The statistical data on which McGwin based his opinion were readily available from McGwin 2006:

“Overall, males with NAION were no more likely to report a history of Viagra … use compared to similarly aged controls (odds ratio (OR) 1.75, 95% confidence interval (CI) 0.48 to 6.30).  However, for those with a history of myocardial infarction, a statistically significant association was observed (OR 10.7, 95% CI 1.3 to 95.8). A similar association was observed for those with a history of hypertension though it lacked statistical significance (OR 6.9, 95% CI 0.8 to 63.6).”

McGwin 2006, at 154.  Following the RMSE’s guidance would have assisted the MDL court in its gatekeeping responsibility in several distinct ways.  First, the court would have focused on how wide the 95% confidence intervals were.  The width of the intervals pointed to statistical imprecision and instability in the point estimates urged by McGwin.  Second, the MDL court would have confronted the extent to which there were multiple ad hoc subgroup analyses in McGwin’s paper, a practice that inflates the chance of a spurious “significant” finding, as sketched below.  See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002) (“It is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns.”).  Third, the court would have confronted the extent to which the study’s validity was undermined by several potent biases.  Statistical significance was the least of the problems faced by McGwin 2006.
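The multiple-comparison problem can be stated in a single line.  Assuming, purely for illustration, k independent subgroup looks at alpha = 0.05, the chance of at least one spurious “significant” finding is 1 − 0.95^k:

# With k independent subgroup analyses at alpha = 0.05, the chance of at least one
# false-positive "significant" result is 1 - 0.95**k (real subgroups are rarely
# independent, so this only illustrates the direction of the problem).
for k in (1, 5, 10, 20):
    print(f"{k:2d} subgroup analyses -> chance of at least one false positive: {1 - 0.95 ** k:.0%}")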

The second study considered and relied upon by McGwin was referred to as Margo & French.  McGwin cited this paper for an “elevated OR of 1.10,” id. at 1081, but again, had the court engaged with the actual evidence, it would have found that McGwin had cherry picked the data he chose to emphasize.  The Margo & French study was a retrospective cohort study using the National Veterans Health Administration’s pharmacy and clinical databases.  C. Margo & D. French, ‘‘Ischemic optic neuropathy in male veterans prescribed phosphodiesterase-5 inhibitors,’’ 143 Am. J. Ophthalmol. 538 (2007).  There were two outcomes ascertained:  NAION and “possible” NAION.  The relative risk of NAION among men prescribed a PDE-5 inhibitor (the class to which Viagra belongs) was 1.02 (95% confidence interval [CI]: 0.92 to 1.12).  In other words, the Margo & French paper had very high statistical precision, and it reported essentially no increased risk at all.  Judge Magnuson uncritically cited McGwin’s endorsement of a risk ratio that included ‘‘possible’’ NAION cases, which could not bode well for a gatekeeping process that is supposed to protect against speculative evidence and conclusions.
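The court did not need raw data to see the point; the published interval is enough.  A small sketch, in Python, backs the standard error and an approximate two-sided p-value out of the reported figures (RR 1.02, 95% CI 0.92 to 1.12), on the usual assumption that the interval was computed on the log scale with a normal approximation:

import math

rr, lo, hi = 1.02, 0.92, 1.12                      # as reported by Margo & French
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)    # standard error of log RR
z = math.log(rr) / se                              # distance from the null value (RR = 1)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # two-sided normal p-value
print(f"SE(log RR) = {se:.3f}, z = {z:.2f}, two-sided p ~ {p:.2f}")

A point estimate hugging 1.0, a tight interval, and a p-value in the neighborhood of 0.7: high precision, and essentially no signal of increased risk.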

McGwin’s citation of Margo & French for the proposition that men who had taken the PDE-5 inhibitors had a 10% increased risk was wrong on several counts.  First, he relied upon an outcome measure that included ‘‘possible’’ cases of NAION.  Second, he completely ignored the sampling error that is captured in the confidence interval.  The MDL court failed to note or acknowledge the p-value or confidence interval for any result in Margo & French. The consideration of random error was not an optional exercise for the expert witness or the court; nor was ignoring it a methodological choice that simply went to the ‘‘disagreement among experts.’’

The Viagra MDL court not only lost its way by ignoring the guidance of the RMSE; it also appeared to confuse the magnitude of the associations with the concept of statistical significance.  In the midst of the discussion of statistical significance, the court digressed to address the notion that the small relative risk in Margo & French might mean that no plaintiff could show specific causation, and then in the same paragraph returned to state that ‘‘persuasive authority’’ supported the notion that the lack of statistical significance did not detract from the reliability of a study.  Id. at 1081 (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., MDL No. 1407, 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003)).  The magnitude of the observed odds ratio is a concept independent of whether an odds ratio as extreme or more so would have occurred by chance if there really were no elevation.
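The independence of the two concepts is easy to demonstrate.  In the sketch below (Python, hypothetical counts), a trivial association reaches statistical significance in a large study, while a substantial association fails to do so in a small one:

import math

def or_with_ci(a, b, c, d):
    """Odds ratio with a 95% confidence interval from a 2x2 table (Wald method)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, math.exp(math.log(or_) - 1.96 * se), math.exp(math.log(or_) + 1.96 * se)

# Large study, small effect: "statistically significant" but of trivial magnitude
print("large study:  OR = %.2f, 95%% CI %.2f to %.2f" % or_with_ci(1100, 9000, 1000, 9100))
# Small study, large effect: sizeable odds ratio, but the interval easily includes 1.0
print("small study:  OR = %.2f, 95%% CI %.2f to %.2f" % or_with_ci(8, 42, 3, 47))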

Citing one case, at odds with a great many others, however, did not create an epistemic warrant for ignoring the lack of statistical significance.  The entire notion of citing caselaw for the meaning and importance of statistical significance in drawing inferences is wrongheaded.  Even more to the point, the lack of statistical significance in the key study in the PPA litigation did not detract from the reliability of the study, although other features of that study certainly did.  The lack of statistical significance in the PPA study did, however, detract from the reliability of the inference from the study’s estimate of ‘‘effect size’’ to a conclusion of causal association.  Indeed, nowhere in the key PPA study did its authors draw a causal conclusion with respect to PPA ingestion and hemorrhagic stroke.  See Walter Kernan, Catherine Viscoli, Lawrence Brass, Joseph Broderick, Thomas Brott, Edward Feldmann, Lewis Morgenstern, Janet Lee Wilterdink, and Ralph Horwitz, ‘‘Phenylpropanolamine and the Risk of Hemorrhagic Stroke,’’ 343 New England J. Med. 1826 (2000).

The MDL court did attempt to distinguish the Eighth Circuit’s decision in Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001), cited by the defense:

‘‘[I]n Glastetter … expert evidence was excluded because ‘rechallenge and dechallenge data’ presented statistically insignificant results and because the data involved conditions ‘quite distinct’ from the conditions at issue in the case. Here, epidemiologic data is at issue and the studies’ conditions are not distinct from the conditions present in the case. The Court does not find Glastetter to be controlling.’’

Id. at 1081 (internal citations omitted; emphasis in original).  This reading of Glastetter, however, misses important features of that case and the Parlodel litigation more generally.  First, the Eighth Circuit commented not only upon the rechallenge-dechallenge data, which involved arterial spasms, but also upon an epidemiologic study of stroke, from which Ms. Glastetter suffered.  The Glastetter court did not review the epidemiologic evidence itself, but cited to another court, which did discuss and criticize the study for various ‘‘statistical and conceptual flaws.’’  See Glastetter, 252 F.3d at 992 (citing Siharath v. Sandoz Pharms. Corp., 131 F. Supp. 2d 1347, 1356-59 (N.D. Ga. 2001)).  Glastetter was binding authority, and not so easily dismissed and distinguished.

The Viagra MDL court ultimately placed its holding upon the facts that:

‘‘the McGwin et al. and Margo et al. studies were peer-reviewed, published, contain known rates of error, and result from generally accepted epidemiologic research.’’

In re Viagra, 572 F. Supp. 2d at 1081 (citations omitted).  This holding was a judicial ipse dixit substituting for the expert witness’s ipse dixit.  There were no known rates of error for the systematic errors in the McGwin study, and the ‘‘known’’ rates of error for random error in McGwin 2006  were intolerably high.  The MDL court never considered any of the error rates, systematic or random, for the Margo & French study.  The court appeared to have abdicated its gatekeeping responsibility by delegating it to unknown peer reviewers, who never considered whether the studies at issue in isolation or together could support a causal health claim.

With respect to the last of the three studies considered, the Gorkin study, McGwin opined that it was too small, and that the data were not suited to assessing temporal relationship.  Id.  The court did not appear inclined to go beyond McGwin’s ipse dixit.  The Gorkin study was hardly small: it was based upon more than 35,000 patient-years of observation in epidemiologic studies and clinical trials, and it provided an estimate of the incidence of NAION among users of Viagra that was not statistically different from that in the general U.S. population.  See L. Gorkin, K. Hvidsten, R. Sobel, and R. Siegel, ‘‘Sildenafil citrate use and the incidence of nonarteritic anterior ischemic optic neuropathy,’’ 60 Internat’l J. Clin. Pract. 500, 500 (2006).

Judge Magnuson did proceed, in his 2008 opinion, to exclude all the other expert witnesses put forward by the plaintiffs.  McGwin survived the defendant’s Rule 702 challenge, largely because the court refused to consider the substantial random variability in the point estimates from the studies relied upon by McGwin. There was no consideration of the magnitude of random error, or for that matter, of the systematic error in McGwin’s study.  The MDL court found that the studies upon which McGwin relied had a known and presumably acceptable ‘‘rate of error.’’  In fact, the court did not consider the random or sampling error in any of the three cited studies; it failed to consider the multiple testing and interaction; and it failed to consider the actual and potential biases in the McGwin study.

Some legal commentators have argued that statistical significance should not be a litmus test.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 241 (‘‘Statistical significance should not be a litmus test. However, there are many situations where the lack of significance combined with other aspects of the research should be enough to exclude an expert’s testimony.’’).  While I agree that significance probability should not be evaluated in a mechanical fashion, without consideration of study validity, multiple testing, bias, confounding, and the like, hand waving about litmus tests does not license courts or commentators to ignore altogether the random variability in studies based upon population sampling.  The dataset in the Viagra litigation was not a close call.

Maryland Puts the Brakes on Each and Every Asbestos Exposure

July 3rd, 2012

Last week, the Maryland Court of Special Appeals reversed a plaintiffs’ verdict in Dixon v. Ford Motor Company, 2012 WL 2483315 (Md. App. June 29, 2012).  Jane Dixon died of pleural mesothelioma.  The plaintiffs, her survivors, claimed that her last illness and death were caused by her household improvement projects, which involved exposure to spackling/joint compound, and by her husband’s work with car parts and brake linings, which involved “take home” exposure on his clothes.  Id. at *1.

All the expert witnesses appeared to agree that mesothelioma is a “dose-response disease,” meaning that the more the exposure, the greater the likelihood that a person exposed will develop the disease. Id. at *2.  Plaintiffs’ expert witness, Dr. Laura Welch, testified that “every exposure to asbestos is a substantial contributing cause and so brake exposure would be a substantial cause even if [Mrs. Dixon] had other exposures.” On cross-examination, Dr. Welch elaborated upon her opinion to explain that any “discrete” exposure would be a contributing factor. Id.

Welch, of course, criticized the entire body of epidemiology of car mechanics and brake repairmen, which generally finds no increased risk of mesothelioma above overall population rates.  With respect to the take-home exposure, Welch had to acknowledge that there were no epidemiologic studies that investigated the risk to wives of brake mechanics.  Welch argued that the studies of car mechanics did not involve exposure to brake shoes as would have been experienced by brake repairmen, but her argument only served to make her attribution based upon take-home exposure to brake linings seem more preposterous.  Id. at *3.  The court recognized that Dr. Welch’s opinion may have been trivially true, but still unhelpful.  Each discrete exposure, even one as attenuated as a take-home exposure from having repaired a single brake shoe, may have “contributed,” but that opinion did not help the jury assess whether the contribution was substantial.

The court sidestepped the issues of fiber type and threshold, and homed in on the agreement that mesothelioma risk showed a dose-response relationship with asbestos exposure.  (There is a sense that the court confused the dose-response concept with the absence of any threshold.)  The court credited hyperbolic risk-assessment figures from the United States Environmental Protection Agency, which suggested that even ambient air exposure to asbestos leads to an increase in mesothelioma risk, but then realized that such claims made it all the more important for the plaintiffs to characterize the risk from the defendant’s product before the jury could reasonably have concluded that any particular exposure experienced by Ms. Dixon was “a substantial contributing factor.”  Id. at *5.

Having recognized that the best the plaintiffs could offer was a claim of increased risk, and perhaps a crude quantification of the relative risks resulting from each product’s exposure, the court could not escape the conclusion that Dr. Welch’s recitation that “every exposure” is substantial was nothing more than an unscientific and empty assertion.  Welch’s claim was either tautologically true or empirical nonsense.  The court also recognized that substituting risk for causation opened the door to essentially probabilistic evidence:

“If risk is our measure of causation, and substantiality is a threshold for risk, then it follows—as intimated above—that ‘substantiality’ is essentially a burden of proof. Moreover, we can explicitly derive the probability of causation from the statistical measure known as ‘relative risk’ … .  For reasons we need not explore in detail, it is not prudent to set a singular minimum ‘relative risk’ value as a legal standard.12 But even if there were some legal threshold, Dr. Welch provided no information that could help the finder of fact to decide whether the elevated risk in this case was ‘substantial’.”

Id. at *7.  The court’s discussion here of “the elevated risk” seems wrong unless we understand it to mean the elevated risk attributable to the particular defendant’s product, in the context of an overall exposure that we accept as having been sufficient to cause the decedent’s mesothelioma.  Despite the lack of any quantification of relative risks in the case, overall or from particular products, and the court’s own admonition against setting a minimum relative risk as a legal standard, the court proceeded to discuss relative risks at length.  For instance, the court criticized Judge Kozinski’s opinion in Daubert, upon remand from the Supreme Court, for not going far enough:

“In other words, the Daubert court held that a plaintiff’s risk of injury must have at least doubled in order to hold that the defendant’s action was ‘more likely than not’ the actual cause of the plaintiff’s injury. The problem with this holding is that relative risk does not behave like a ‘binary’ hypothesis that can be deemed ‘true’ or ‘false’ with some degree of confidence; instead, the uncertainty inherent in any statistical measure means that relative risk does not resolve to a certain probability of specific causation. In order for a study of relative risk to truly fulfill the preponderance standard, it would have to result in 100% confidence that the relative risk exceeds two, which is a statistical impossibility. In short, the Daubert approach to relative risk fails to account for the twin statistical uncertainty inherent in any scientific estimation of causation.”

Id. at *7 n.12 (citing Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320-21 (9th Cir. 1995) (holding that a preponderance standard requires causation to be shown by probabilistic evidence of a relative risk greater than two) (opinion on remand from Daubert v. Merrell Dow Pharms., 509 U.S. 579 (1993))).  The statistical impossibility derives from the asymptotic nature of the normal distribution, but the court failed to explain why a relative risk of two must be excluded as statistically implausible based upon the sample statistic.  After all, a relative risk greater than two, with a lower bound of a 95% confidence interval above one, based upon an unbiased sampling, suggests that our best evidence is that the population parameter is greater than two as well.  The court, however, insisted upon stating the relative-risk-greater-than-two rule with a vengeance:

“All of this is not to say, however, that any and all attempts to establish a burden of proof of causation using relative risk will fail. Decisions can be – and in science or medicine are – premised on the lower limit of the relative risk ratio at a requisite confidence level. The point of this minor discussion is that one cannot apply the usual, singular ‘preponderance’ burden to the probability of causation when the only estimate of that probability is statistical relative risk. Instead, a statistical burden of proof of causation must consist of two interdependent parts: a requisite confidence of some minimum relative risk. As we explain in the body of our discussion, the flaws in Dr. Welch’s testimony mean we need not explore this issue any further.”

Id. (emphasis in original).
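For what it is worth, the arithmetic behind the relative-risk-greater-than-two rule the court was wrestling with is simple, even if the assumptions behind it are contestable.  On the standard (and debatable) assumptions, the probability that an exposed person’s disease is attributable to the exposure is (RR - 1)/RR, which exceeds 50% only when the relative risk exceeds 2:

# Attributable probability among the exposed, (RR - 1) / RR, under the usual
# simplifying assumptions; it crosses 50% only when the relative risk exceeds 2.
for rr in (1.1, 1.5, 2.0, 3.0, 10.7):
    print(f"RR = {rr:4.1f} -> probability of attribution = {(rr - 1) / rr:.0%}")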

And despite having declared the improvidence of addressing the relative risk issue, and then the lack of necessity for addressing the issue given Dr. Welch’s flawed testimony, the court nevertheless tackled the issue once more, a couple of pages later:

“It would be folly to require an expert to testify with absolute certainty that a plaintiff was exposed to a specific dose or suffered a specific risk. Dose and risk fall on a spectrum and are not ‘true or false’. As such, any scientific estimate of those values must be expressed as one or more possible intervals and, for each interval, a corresponding confidence that the true value is within that interval.”

Id. at *9 (emphasis in original; internal citations omitted).  The court captured the frequentist concept of the confidence interval as being defined operationally by repeated samplings and their random variability, but the “confidence” of the confidence interval means that the specified coefficient represents the percentage of all such intervals that include the “true” value, not the probability that a particular interval, calculated from a given sample, contains the true value.  The true value is either in, or not in, the interval generated from a single sample.  Again, it is unclear why the court was weighing in on this aspect of probabilistic evidence when plaintiffs’ expert witness, Welch, offered no quantitation of the overall risk or of the risk attributable to a specific product exposure.
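A short simulation, in Python with hypothetical parameters, illustrates the frequentist reading: “95%” describes the long-run share of intervals, constructed the same way from repeated samples, that cover the true value, not the probability that any one computed interval contains it.

import math
import random

random.seed(2)
true_p, n, trials = 0.30, 200, 10_000   # hypothetical true proportion and sample size
covered = 0
for _ in range(trials):
    x = sum(random.random() < true_p for _ in range(n))
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - 1.96 * se <= true_p <= p_hat + 1.96 * se:
        covered += 1

print(f"share of 95% intervals that cover the true value: {covered / trials:.1%}")
# Any single interval either contains the true value or it does not; the 95% figure
# describes the procedure, not the particular interval.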

The court indulged the plaintiffs’ no-threshold fantasy but recognized that the risks of low-level asbestos exposure were low, and likely below a doubling of risk, an issue that the court stressed it wanted to avoid.  The court cited one study that suggested a risk (odds) ratio of 1.1 for exposures less than 0.5 fibers/ml-years.  See id. at *5 (citing Y. Iwatsubo et al., “Pleural mesothelioma: dose-response relation at low levels of asbestos exposure in a French population-based case-control study,” 148 Am. J. Epidemiol. 133 (1998) (estimating an odds ratio of 1.1 for exposures less than 0.5 fibers/ml-years)).  But the court, which tried to be precise elsewhere, appears to have lost its way in citing Iwatsubo here.  After all, how can a single odds ratio of 1.1 describe all exposures from 0 all the way up to 0.5 fibers/ml-years?  How can a single odds ratio describe all exposures in this range regardless of fiber type, when chrysotile asbestos carries little to no risk for mesothelioma, and certainly carries orders of magnitude less risk than amphibole fibers such as amosite and crocidolite?  And if a low-level exposure has a risk ratio of 1.1, how could plaintiffs’ hired expert witness, Welch, even make the attribution of Dixon’s mesothelioma to the entirety of her exposure, let alone to the speculative take-home chrysotile exposure from Ford’s brake linings?  Obviously, had the court posed these questions, it would have realized that “it is not possible” to permit Welch’s testimony at all.

The court further lost its way in addressing the exculpatory epidemiology put forward by the defense expert witnesses:

“Furthermore, the leading epidemiological report cited by Ford and its amici that specifically studied ‘brake mechanics’, P.A. Hessel et al., ‘Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study’, 24 Risk Analysis 547 (2004), does not at all dispel the notion that this population faced an increased risk of mesothelioma due to their industrial asbestos exposure. … When calculated at the 95% confidence level, Hessel et al. estimated that the odds ratio of mesothelioma could have been as low as 0.01 or as high as 4.71, implying a nearly quintupled risk of mesothelioma among the population of brake mechanics. 24 Risk Analysis at 550–51.”

Id. at *8.  Again, the court fixated on the confidence interval, to the exclusion of the estimated magnitude of the association!  This time, after earlier shouting that it was the lower bound of the interval that matters scientifically, the court emphasized the upper bound.  The court here strayed far from the actual data, and from any plausible interpretation of them:

“The odds ratio (OR) for employment in brake installation or repair was 0.71 (95% CI: 0.30-1.60) when controlled for insulation or shipbuilding. When a history of employment in any of the eight occupations with potential asbestos exposure was controlled, the OR was 0.82 (95% CI: 0.36-1.80). ORs did not increase with increasing duration of brake work. Exclusion of those with any of the eight exposures resulted in an OR of 0.62 (95% CI: 0.01-4.71) for occupational brake work.”

P.A. Hessel et al., “Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study,” 24 Risk Analysis 547, 547 (2004).  All of Hessel’s estimates of effect size were below 1.0, and the study found no trend with increasing duration of brake work.  Cherry-picking the upper bound of a single subgroup analysis for emphasis was unwarranted, and hardly did justice to the facts or the science.

Dr. Welch’s conclusion that the exposure and risk in this case were “substantial” simply was not a scientific conclusion, and without it her testimony did not provide information for the jury to use in reaching its conclusion as to substantial factor causation.  Id. at *7.  The court noted that Welch, and the plaintiffs, may have lacked scientific data from which to estimate Dixon’s exposure to asbestos or her relative risk of mesothelioma, but ignorance or uncertainty is hardly a warrant for an expert witness’s belief that the relevant exposures and risks were “substantial.”  Id. at *10.  The court was well justified in being discomforted by the conclusory, unscientific opinion rendered by Laura Welch.

In the final puzzle of the Dixon case, the court vacated the judgment, and remanded for a new trial, “either without her opinion on substantiality or else with some quantitative testimony that will help the jury fulfill its charge.”  Id. at *10.  The court thus seemed to imply that an expert witness need not utter the magic word, “substantial,” for the case to be submitted to the jury against a brake defendant in a take-home exposure case.  Given the state of the record, the court should have simply reversed and rendered judgment for Ford.