Pin the Tail on the Significance Test

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  See .  An adequate understanding of significance probability requires the recognition that the tail probability that represents the probability of a result at least as extreme as the result obtained if the null hypothesis is true could be the area under one or both sides of the probability distribution curve.  Specifying an attained significance probability requires us to specify further whether the p-value is one- or two-sided; that is, whether we have ascertained the result and the more extreme results in one or both directions.

 

Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expert courts and counsel to look to the Reference Manual for guidance.  As with the notion of statistical significance itself, the Manual is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3ed 2011). This advice is pragmatic but a bit misleading.  The reason for the two-tailed test, however, is not really tied to multiple testing.  If there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple testing artifacts. In some cases, where the hypothesis test specifies an alternative hypothesis that is not equal to the null hypothesis, extreme values both  above and below the null hypothesis count in favor of rejecting the null.  A two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it really should not drive our choice between one-tailed and two-tailed tests.

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what does the FDA do when looking to see whether a clinical trial has established efficacy of a new medication.  Does it license the medication if the sponsor’s trial comes close to 5%, or does it demand 5%, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law, when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10. This statement is a helpful bulwark against the oft-repeated suggestion that any p-value would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of the statistical tests, whether one- or two-sided, may be tied to prior beliefs and the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direct opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary, there are times when the use of a one-tail or a two-tail test is preferable, but the chapter withholds its guidance. The statement that “one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed test value will not necessarily yield a correct two-tailed measure for some asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and whether the p-value is less than 0.05, then courts must realize that there may alternative approaches to calculating significance probability such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n. 49.  The legal citations, however, are limited, and given the lack sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably the authors of the epidemiology chapter declare that “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease), a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n. 83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and they fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite to a questionable case involving over-the-counter medications that contained phenylpropanolamine (PPA), for allergy and cold decongestion. Id. citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation, used one-tailed tests, when submitted to the FDA, but was revised to use two-tailed tests, when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA case thus represents a case which, for regulatory purposes, the one-tail test was used, but for a scientific and clinical audience, the two-tailed test was used.

The other case cited by the epidemiology chapter was the District of Columbia Circuit’s review of an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen). The EPA is a federal agency in the “protection” business, not in investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action if often based upon precautionary principle judgments, and are different from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y.1984)(“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir.1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004  (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  One of plaintiffs’ expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation, out of hand.  Another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-side testing  as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.)(citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual‘s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States,4[433 U.S. 299 (1977)] a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view. Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).

The treatise, Modern Scientific Evidence, addresses some of the caselaw that faced disputes over one- versus two-tailed tests.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tail test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter Gamble Co. v. Chesebrough-Pond’s Inc., 747 F. 2d 114 (2d Cir. 1984)(discussing one-tail versus two in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879, (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998)(rejecting Shanna Swan’s effort to reinterpret study data by using a one-tail test of significance; ‘‘Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.’’)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)