TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Siracusano Dicta Infects Daubert Decisions

September 22nd, 2012

Gatekeeping is sometimes  intellectually challenging, but the challenge does not excuse sloppy thinking.  Understandably, judges will sometimes misunderstand the relevant science.  The process, however, allows the public and the scientific community to see what is happening in court cases, rather than allowing the critical scientific reasoning to be hidden in the black box of jury determinations.  This transparency can and should invite criticism, commentary, corrections, and consensus, when possible.

Bad legal reasoning is much harder to excuse.  The Supreme Court, in Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309 (2011), unanimously affirmed the reversal of a trial court’s Rule 12(b)(6) dismissal of a securities fraud class action.  The corporate defendant objected that the plaintiffs failed to plead statistical significance in alleging causation between Zicam and the loss of the sense of smell.  The Supreme Court, however, made clear that causation was not required to make out a claim of securities fraud.  It was, and would be, sufficient for the company’s product to have raised regulatory concerns serious enough to invite regulatory scrutiny and action that would affect the product’s marketability.

The Supreme Court could have disposed of the essential issue in a two page per curiam opinion.  Instead the Court issued an opinion signed by Justice Sotomayor, who waxed carelessly about causation and statistical significance, which discussion was not necessary to the holding.  Not only was Justice Sotomayor’s discussion obiter dicta, but the dicta were demonstrably incorrect. Matrixx Unloaded (Mar. 29, 2011).

The errant dicta in Siracusano have already led one MDL court astray:

“While the defendant repeatedly harps on the importance of statistically significant data, the United States Supreme Court recently stated that ‘[a] lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events …. medical experts rely on other evidence to establish an inference of causation.’ Matrixx Initiatives, Inc. v. Siracsano, 131 S.Ct. 1309, 1319 (2011).”

Memorandum Opinion and Order at 22, In re Chantix (Varenicline) Products Liability Litigation, MDL No. 2092, Case 2:09-cv-02039-IPJ Document 642 (N.D. Ala. Aug. 21, 2012)[hereafter cited as Chantix].  See Open Admissions for Expert Witnesses in Chantix Litigation.

It was only a matter of time before the Supreme Court’s dictum would be put to this predictably erroneous interpretation.  See “The Matrixx Oversold” (April 4, 2011).  Within two weeks, the error in Chantix propagated itself in another MDL case, with another trial court succumbing to the misleading dicta in Justice Sotomayor’s opinion.  See Memorandum in Support of Separate Pretrial Order No. 8933, Cheek v. Wyeth Pharm. Inc. (E.D. Pa. Aug. 30, 2012) (Bartle, J.).

In Cheek, Judge Harvey Bartle rejected a Rule 702 challenge to plaintiffs’ expert witness’s opinion.  I confess that I do not know enough about the expert witness’s opinion or the challenge to assess Judge Bartle’s conclusion.  Judge Bartle, however, invoked the Matrixx decision for the dubious proposition that:

“Daubert does not require that an expert opinion regarding causation be based on statistical evidence in order to be reliable. Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1319 (2011). In fact, many courts have recognized that medical professionals often base their opinions on data other than statistical evidence from controlled clinical trials or epidemiological studies. Id. at 1320.”

Cheek at 16.  The Cheek decision is a welter of non-sequiturs.  The fact that in some instances statistical evidence is not necessary is hardly a warrant to excuse the lack of statistical evidence in every case. The truly disturbing gaps in reasoning, however, are not scientific, but legal. Siracusano was not a “Daubert” opinion; and Siracusano does not, and cannot, support the refusal to inquire whether statistical evidence was necessary to a causation opinion, mainly because causation was not at issue in Siracusano.

Open Admissions for Expert Witnesses in Chantix Litigation

September 1st, 2012

Chantix is a medication that helps people stop smoking.  Smoking kills people, but make a licensed drug and the lawsuits will come.

Earlier this month, Judge Inge Prytz Johnson, the MDL trial judge in the Chantix litigation, filed an opinion that rejected Pfizer’s challenges to plaintiffs’ general causation expert witnesses.  Memorandum Opinion and Order, In re Chantix (Varenicline) Products Liability Litigation, MDL No. 2092, Case 2:09-cv-02039-IPJ Document 642 (N.D. Ala. Aug. 21, 2012)[hereafter cited as Chantix].

Plaintiffs claimed that Chantix causes depression and suicidality, sometimes severe enough to result in suicide, attempted or completed.  Chantix at 3-4.  Others have written about Judge Johnson’s decision.  See Lacayo, “Win Some, Lose Some: Recent Federal Court Rulings on Daubert Challenges to Plaintiffs’ Experts,” (Aug. 30, 2012).

The breadth and depth of error of the trial court’s analysis, or lack thereof, remains, however, to be explored.

 

STATISTICAL SIGNIFICANCE

The Chantix MDL court notes several times that the defendant “harped” on this or that issue; the reader might think the defendant was a music label rather than a pharmaceutical manufacturer.  One of the defendant’s chords that failed to resonate with the trial judge was the point that the plaintiffs’ expert witnesses relied upon statistically non-significant results.  Here is how the trial court reported the issue:

“While the defendant repeatedly harps on the importance of statistically significant data, the United States Supreme Court recently stated that ‘[a] lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events …. medical experts rely on other evidence to establish an inference of causation.’ Matrixx Initiatives, Inc. v. Siracsano, 131 S.Ct. 1309, 1319 (2011).”

Chantix at 22.

Well, it was only a matter of time before the Supreme Court’s dictum would be put to this predictably erroneous interpretation.  See “The Matrixx Oversold” (April 4, 2011).

Matrixx involved a motion to dismiss the complaint, which the trial court granted, but the Ninth Circuit reversed.  No evidence was offered; nor was any ruling that evidence was unreliable or insufficient at issue. The Supreme Court affirmed the Circuit on the issue whether pleading statistical significance was necessary.  Matrixx Initiatives took this position in the hopes of avoiding the merits, and so the issue of causation was never before the Supreme Court.  A unanimous Supreme Court held that because FDA regulatory action does not require reliable evidence to support a causal conclusion, pleading materiality for a securities fraud suit does not require an allegation of causation, and thus does not require an allegation of statistically significant evidence. Everything that the Court said about statistical significance and causation was obiter dictum, and rather ill-considered dictum at that.

The Supreme Court thus wandered far beyond its holding to suggest that courts “frequently permit expert testimony on causation based on evidence other than statistical significance.” Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309, 1319 (2011) (citing Wells v. Ortho Pharm. Corp., 788 F.2d 741, 744-745 (11th Cir.1986)).  But the Supreme Court’s citation to Wells, in Justice Sotomayor’s opinion, failed to support the point she was trying to make, or the decision that the trial court announced in Chantix.

Wells involved a claim of birth defects caused by the use of a spermicidal contraceptive jelly.  At least one study reported a statistically significant increase in detected birth defects over the expected rate.  Wells v. Ortho Pharmaceutical Corp., 615 F. Supp. 262 (N.D. Ga. 1985), aff’d, and rev’d in part on other grounds, 788 F.2d 741 (11th Cir.), cert. denied, 479 U.S. 950 (1986).  Wells is not an example of a case in which an expert witness opined about causation in the absence of a scientific study with statistical significance. Of course, finding statistical significance is just the beginning of assessing the causality of an association; the Wells case was and remains notorious for the expert witness’s poor assessment of all the determinants of scientific causation, including the validity of the studies relied upon.

The Wells decision was met with severe criticism in the 1980s, for its failure to evaluate the entire evidentiary display, as well as for its failure to rule out bias and confounding in the studies relied upon by the plaintiff.  See, e.g., James L. Mills and Duane Alexander, “Teratogens and ‘Litogens’,” 315 New Engl. J. Med. 1234 (1986); Samuel R. Gross, “Expert Evidence,” 1991 Wis. L. Rev. 1113, 1121-24 (1991) (“Unfortunately, Judge Shoob’s decision is absolutely wrong. There is no scientifically credible evidence that Ortho-Gynol Contraceptive Jelly ever causes birth defects.”). See also Editorial, “Federal Judges v. Science,” N.Y. Times, December 27, 1986, at A22 (unsigned editorial); David E. Bernstein, “Junk Science in the Courtroom,” Wall St. J. at A15 (Mar. 24, 1993) (pointing to Wells as a prominent example of how the federal judiciary had embarrassed the American judicial system with its careless, non-evidence based approach to scientific evidence). A few years later, another case in the same judicial district, against the same defendant, for the same product, resulted in the grant of summary judgment.  Smith v. Ortho Pharmaceutical Corp., 770 F. Supp. 1561 (N.D. Ga. 1991) (supposedly distinguishing Wells on the basis of more recent studies).

Neither the Justices in Matrixx Initiatives nor the trial court in Chantix can be excused for their poor scholarship, or their failure to note that Wells was overruled sub silentio by the Supreme Court’s own subsequent decisions in Daubert, Joiner, Kumho Tire, and Weisgram.  And if the weight of precedent did not kill the concept, then there is the simple matter of a supervening statute:  the 2000 amendment of Rule 702 of the Federal Rules of Evidence.

 

CONFUSING REGULATORY ACTION WITH CAUSAL ASSESSMENTS

The Supreme Court in Matrixx Initiatives was careful to distinguish causal judgments from regulatory action, but then went on in dictum to conflate the two.  The trial judge in Chantix showed no similar analytical care.  Judge Johnson held that the asserted absence of statistical significance was not a basis for excluding plaintiffs’ expert witnesses’ opinions on general causation.  Her Honor adverted to the Matrixx Initiatives dictum that the FDA “does not apply any single metric for determining when additional inquiry or action is necessary.” Matrixx, 131 S.Ct. at 1320.  Chantix at 22.  Judge Johnson noted

“that ‘[n]ot only does the FDA rely on a wide range of evidence of causation, it sometimes acts on the basis of evidence that suggests, but does not prove, causation…. the FDA may make regulatory decisions against drugs based on postmarketing evidence that gives rise to only a suspicion of causation’.  Matrixx, id. The court declines to hold the plaintiffs’ experts to a more exacting standard as the defendant requests.”

Chantix at 23.

In the trial court’s analysis, the difference between regulatory action and civil litigation fact adjudication is obliterated.  This, however, is not the law of the United States, which has consistently acknowledged the difference. See, e.g., Indus. Union Dep’t v. Am. Petroleum Inst., 448 U.S. 607, 656 (1980) (“agency is free to use conservative assumptions in interpreting the data on the side of overprotection rather than underprotection.”).

As the Second Edition of the Reference Manual on Scientific Evidence (which was the outdated edition cited by the court in Chantix) explains:

“[p]roof of risk and proof of causation entail somewhat different questions because risk assessment frequently calls for a cost-benefit analysis. The agency assessing risk may decide to bar a substance or product if the potential benefits are outweighed by the possibility of risks that are largely unquantifiable because of presently unknown contingencies. Consequently, risk assessors may pay heed to any evidence that points to a need for caution, rather than assess the likelihood that a causal relationship in a specific case is more likely than not.”

Margaret A. Berger, “The Supreme Court’s Trilogy on the Admissibility of Expert Testimony,” in Reference Manual On Scientific Evidence at 33 (Fed. Jud. Ctr. 2d. ed. 2000).

 

CONCLUSIONS VS. METHODOLOGY

Judge Johnson insisted that the “court’s focus was solely on the principles and methodology, not on the conclusions they generate.” Chantix at 9.  This insistence, however, is contrary to the established law of Rule 702.

Although the United States Supreme Court attempted, in Daubert, to draw a distinction between the reliability of an expert witness’s methodology and conclusion, that Court soon realized that the distinction was flawed. If an expert witness’s proffered testimony is discordant with regulatory and scientific conclusions, a reasonable, disinterested scientist would be led to question the reliability of the testimony’s methodology and of its inferences from facts and data to its conclusion.  The Supreme Court recognized this connection in General Electric v. Joiner, and the connection between methodology and conclusions was ultimately incorporated into a statute, the revised Federal Rule of Evidence 702:

“[I]f scientific, technical or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training or education, may testify thereto in the form of an opinion or otherwise, if

  1. the testimony is based upon sufficient facts or data,
  2. the testimony is the product of reliable principles and methods; and
  3. the witness has applied the principles and methods reliably to the facts of the case.”

When the testimony is a conclusion about causation, Rule 702 directs an inquiry into whether that conclusion is based upon sufficient facts or data, and whether that conclusion is the product of reliable principles and methods.  The court’s focus should indeed be on the conclusion as well as the methodology claimed to generate the conclusion.  The Chantix MDL court thus ignored the clear mandate of a statute, Rule 702(1), and applied dictum from Daubert that was superseded by Joiner and by an Act of Congress.  The ruling is thus legally invalid to the extent it departs from the statute.

 

EPIDEMIOLOGY

For obscure reasons, Judge Johnson sought to deprecate the need to rely upon epidemiologic studies, whether placebo-controlled clinical trials or observational studies.  See Chantix at 25 (citing Rider v. Sandoz Pharm. Corp., 295 F.3d 1194, 1198-99 (11th Cir. 2002)). Of course, the language cited in Rider came from a pre-Daubert, pre-Joiner case, Wells v. Ortho Pharm. Corp., 788 F.2d 741, 745 (11th Cir. 1986) (holding that “a cause-effect relationship need not be clearly established by animal or epidemiological studies”).  This dubious legal lineage cannot support the glib dismissal of the need for epidemiologic evidence.

 

WEIGHT OF THE EVIDENCE (WOE)

According to Judge Johnson, plaintiffs’ expert witness Shira Kramer considered all the evidence relevant to Chantix and neuropsychiatric side effects, in what Kramer described as a “weight of the evidence” analysis.  Chantix at 26.  In her report, Kramer had written that determinations about the weight of evidence are “subjective interpretations” based upon “various lines of scientific evidence.” Id. (citing and quoting Kramer’s report). Kramer also claimed that every scientist “brings a unique set of experiences, training and expertise …. Philosophical differences exist between experts…. Therefore, it is not surprising that differences of opinion exist among scientists. Such differences of opinion are not necessarily evidence of flawed scientific reasoning or methodology, but rather differences in judgment between scientists.” Id.

Without any support from the scientific literature, or the Reference Manual on Scientific Evidence, Judge Johnson accepted Kramer’s explanation of a totally subjective, unprincipled approach as a scientific methodology.  Not surprisingly, Judge Johnson cited the First Circuit’s similarly vacuous embrace of a WOE analysis in Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11, 22 (1st Cir. 2011).  Chantix at 51.

 

CHERRY PICKING

Judge Johnson noted, contrary to her earlier suggestion that Shira Kramer had considered all the studies, that Kramer had excluded data from her analysis.  Kramer’s exclusions may have rested upon pre-specified exclusionary principles, or they may have been completely ad hoc, like the missing weighting principles in her WOE analysis.  In its gatekeeping role, however, the trial court expressed complete indifference to Kramer’s selectivity in excluding data.  “Why Dr. Kramer chose to include or exclude data from specific clinical trials is a matter for cross-examination.”  Chantix at 27.  This indifference is an abdication of the court’s gatekeeping responsibility.

 

POWER

The trial court attempted to justify its willingness to mute defendant’s harping on statistical significance by adverting to the concept of statistical power:

“Oftentimes, epidemiological studies lack the statistical power needed for definitive conclusions, either because they are small or the suspected adverse effect is particularly rare. Id. [Michael D. Green et al., “Reference Guide on Epidemiology,” in Reference Manual on Scientific Evidence 333, 335 (Fed. Judicial Ctr. 2d ed. 2000)] … .”

Chantix at 29 n.16.

To be fair to the trial court, the Reference Manual invited this illegitimate use of statistical power because it, at times, omits the specification that statistical power requires not only a level of statistical significance to be attained, but also a specified alternative hypothesis against which power is measured.  See Power in the Courts — Part One; Power in the Courts — Part Two.  The trial court offered no alternative hypothesis against which any measure of power was to be assessed.

Judge Johnson did not report any power analyses, and she certainly did not report any quantification of power or lack thereof against some specific alternative hypothesis.  Judge Johnson’s invocation of power was just that – power used arbitrarily, without data, evidence, or reason.
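
To see why an invocation of “power” is empty without a specified alternative hypothesis, consider the following sketch in Python (the numbers and the function are hypothetical illustrations of mine, not anything in the Chantix record).  The same study can have low power against a modest alternative and high power against a large one.

from math import sqrt
from scipy.stats import norm

def power_two_proportions(p0, rr_alt, n_exposed, n_unexposed, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions,
    against the alternative that the exposed risk equals p0 * rr_alt
    (normal approximation)."""
    p1 = p0 * rr_alt                      # exposed risk under the alternative
    p_bar = (p0 * n_unexposed + p1 * n_exposed) / (n_exposed + n_unexposed)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / n_exposed + 1 / n_unexposed))
    se_alt = sqrt(p1 * (1 - p1) / n_exposed + p0 * (1 - p0) / n_unexposed)
    z_crit = norm.ppf(1 - alpha / 2)      # about 1.96 for a two-sided 5% test
    # probability of exceeding the critical value if the alternative is true
    return norm.sf((z_crit * se_null - (p1 - p0)) / se_alt)

# hypothetical study: 1% baseline risk, 2,000 subjects in each arm
for rr in (1.5, 2.0, 3.0):
    print(f"power against a relative risk of {rr}: "
          f"{power_two_proportions(0.01, rr, 2000, 2000):.2f}")

Run with these hypothetical inputs, the calculation shows modest power against a relative risk of 1.5 and nearly complete power against a relative risk of 3.0.  Saying that a study “lacked power,” without saying against what alternative, conveys no usable information.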

 

CONFIDENCE INTERVALS

As with the invocation of statistical power, the trial court also invoked the concept of confidence intervals to suggest that such intervals provide a more refined approach to assessing statistical significance:

“A study found to have ‘results that are unlikely to be the result of random error’ is ‘statistically significant’. Reference Guide on Epidemiology, supra, at 354. Statistical significance, however, does not indicate the strength of an association found in a study. Id. at 359. ‘A study may be statistically significant but may find only a very weak association; conversely, a study with small sample sizes may find a high relative risk but still not be statistically significant.’ Id. To reach a ‘more refined assessment of appropriate inferences about the association found in an epidemiologic study’, researchers rely on another statistical technique known as a ‘confidence interval’. Id. at 360.”

Chantix at 30 n.17.  True, true, but immaterial.  The trial court, again, never carries through with the direction given by the Reference Manual.  Not a single confidence interval is presented.  No confidence intervals are subjected to this more refined assessment.  Why have more refined assessments when even the cruder assessments are not done?
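
For readers who want to see what that “more refined assessment” looks like in practice, here is a minimal sketch (in Python, with hypothetical counts of my own invention, not data from the Chantix record) of a relative risk and its 95% confidence interval computed on the log scale:

from math import log, exp, sqrt

def relative_risk_ci(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=1.96):
    """Point estimate of relative risk with an approximate 95% confidence
    interval, using the standard log-relative-risk standard error."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = sqrt(1 / cases_exposed - 1 / n_exposed
                     + 1 / cases_unexposed - 1 / n_unexposed)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper

# hypothetical cohort: 30 events among 1,000 exposed, 20 among 1,000 unexposed
rr, lower, upper = relative_risk_ci(30, 1000, 20, 1000)
print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")

With these invented counts, the point estimate is 1.5, but the interval runs from below 1.0 to above 2.5, so the “more refined assessment” shows an elevation that is compatible with chance.  That is exactly the sort of analysis the Chantix opinion describes but never performs.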

 

OPEN ADMISSIONS IN SCHOOL OF EXPERT WITNESSING

The trial court somehow had the notion that all it had to do was state that every disputed fact and opinion went to the weight, not the admissibility, of the evidence, and then pass the dispute to a presumably more scientifically literate jury.  To be sure, the court engaged in a good deal of hand waving, going through the motions of deciding contested issues.  Not only did Judge Johnson smash poor Pfizer’s harp, Her Honor unhinged the gate that federal judges are supposed to keep.  Chantix declares that it is now open admissions for expert witnesses testifying to causation in federal cases.  This is a judgment in search of an appeal.

The Dow-Bears Debate the Decline of Daubert

August 10th, 2012

Last month, I posted a short screenplay about how judicial gatekeeping of expert witnesses has slackened recently.  See “Daubert Approaching the Age of Majority” (July 17, 2012).

Dr. David Schwartz, of Innovative Science Solutions, has adapted the screenplay to the cinematic screen, and directed a full-length feature movie, The Daubert Will Set Your Client Free, using text-to-talk technology. Dr. Schwartz is not only a first-rate scientist, but he is also an aspiring film maker and artist.

OK; full-length is only a little more than 90 seconds, but you may still enjoy our movie-making debut.  And it is coming to a YouTube screen near you, now.

Eighth Circuit Holds That Increased Risk Is Not Cause

August 4th, 2012

The South Dakota legislature took it upon itself to specify the “risks” to be included in the informed consent required by state law for an abortion procedure:

(1) A statement in writing providing the following information:
* * *
(e) A description of all known medical risks of the procedure and statistically significant risk factors to which the pregnant woman would be subjected, including:
(i) Depression and related psychological distress;
(ii) Increased risk of suicide ideation and suicide;
* * *

S.D.C.L. § 34-23A-10.1(1)(e)(i)(ii).  Planned Parenthood challenged the law on constitutional grounds, and the district court granted a preliminary injunction against the South Dakota statute, which a panel of the Eighth Circuit affirmed, only to have that Circuit en banc reverse and remand the case for further proceedings.  Planned Parenthood Minn. v. Rounds, 530 F.3d 724 (8th Cir. 2008) (en banc).

On remand, the parties filed cross-motions for summary judgment.  The district court held that the so-called suicide advisory was unconstitutional.  On the second appeal to the Eighth Circuit, a divided panel affirmed the trial court’s holding on the suicide advisory. 653 F.3d 662 (8th Cir. 2011).  The Circuit, however, again granted rehearing en banc, and reversed the summary judgment for Planned Parenthood on the advisory.  Planned Parenthood Minnesota v. Rounds, Slip op. July 24, 2012 (en banc) [Slip op.].

In support of the injunction, Planned Parenthood argued that the state’s mandatory suicide advisory violated women’s abortion rights and physicians’ free speech rights. The en banc court rejected this argument, holding that the required advisory was “truthful, non-misleading information,” which did not unduly burden abortion rights, even if it might cause women to forgo abortion.  See Planned Parenthood of Southeastern Pennsylvania v. Casey, 505 U.S. 833, 882-83 (1992).

Risk  ≠ Cause

Planned Parenthood’s success in the trial court turned on its identification of risk (or increased risk) with cause, and its expert witness evidence that causation had not been accepted in the medical literature. In other words, Planned Parenthood argued that the advisory required disclosure of a conclusive causal “link” between abortion and suicide or suicidal ideation.  See 650 F. Supp. 2d 972, 982 (D.S.D. 2009).  The en banc court, on the second appeal, sought to save the statute by rejecting Planned Parenthood’s reading.  The court parsed the statute to suggest that the term “increased risk” is more precise and limited than the umbrella term of “risk,” standing alone.  Slip op. at 6.  The statute does not define “increased risk,” which the en banc court noted had various meanings in medicine.  Id. at 7.

Reviewing the medical literature, the en banc court held that the term “increased risk” does not refer to causation but to a much more modest finding of “a relatively higher probability of an adverse outcome in one group compared to other groups—that is, to ‘relative risk’.”  Id.  The en banc majority seemed to embroil itself in considerable semantic confusion.  On the one hand, the majority, in a rhetorical riff, proclaimed that:

“It would be nonsensical for those in the field to distinguish a relationship of ‘increased risk’ from one of causation if the term ‘risk’ itself was equivalent to causation.”

Id. at 9.  The majority’s nonsensical labeling is, well, … nonsensical.  There is a compelling difference between assessments of risk and of causation.  Risk is an ex ante concept, applied before the effect has occurred. Assessment or attribution of causation takes place after the effect. Of course, there is a sense of risk or “increased risk” that is epistemologically more modest, but that hardly makes the more rigorous use of risk, as an ex ante cause, nonsensical.

The majority, however, is not content to leave the matter alone.  Elsewhere, the en banc court contradicts itself, and endorses a view that risk = causation.  For instance, in citing to a civil action involving a claimed causal relationship between Bendectin and a birth defect, the Eighth Circuit reduces risk to cause.  See Slip op. at 26 n. 9 (citing Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307, 312, modified on reh’g, 884 F.2d 166 (5th Cir. 1989)).  The en banc court’s “explanatory” parenthetical explains the depths of its confusion:

“explaining that if studies establish, within an acceptable confidence interval, that those who use a pharmaceutical have a relative risk of greater than 1.0—that is, an increased risk—of an adverse outcome, those studies might be considered sufficient to support a jury verdict of liability on a failure-to-warn claim.”

This reading of Brock is wrong on two counts.  First, the Fifth Circuit, in Brock, and consistently since, has required the relative risk greater than 1.0 to be statistically significant at the conventional significance probability, as well as other indicia of causality, such as the Bradford Hill factors.  So Brock and its progeny did not confuse or conflate risk with cause, or dilute the meaning of cause such that it could be satisfied by a mere showing of an increased relative risk.

Second, Brock itself made a serious error in interpreting statistical significance and confidence intervals. The Bendectin studies at issue in Brock were not statistically significant, and the confidence intervals did not exclude the value indicating no association (relative risk = one). Brock, however, in notoriously incorrect dicta claimed that the computation of confidence intervals took into account bias and confounding as well as sampling variability.  Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989) (“Fortunately, we do not have to resolve any of the above questions [as to bias and confounding], since the studies presented to us incorporate the possibility of these factors by the use of a confidence interval.”) (emphasis in original).  See, e.g., David H. Kaye, David E. Bernstein, and Jennifer L. Mnookin, The New Wigmore – A Treatise on Evidence:  Expert Evidence § 12.6.4, at 546 (2d ed. 2011); Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 86-87 (2009) (criticizing the over-interpretation of confidence intervals by the Brock court); Schachtman, “Confidence in Intervals and Diffidence in the Courts” (Mar. 4, 2012).

The en banc majority’s discussion of the studies of abortion and suicidality makes clear that the presence of bias and confounding in a study may prevent an inference of causation, but does not undermine the conclusion that the studies show an increased risk.  A conclusion that the body of epidemiologic studies was inconclusive, and that it failed “to disentangle confounding factors and establish relative risks of abortion compared to its alternatives,” did not, therefore, render the suicide advisory about risk or increased risk unsupported, untruthful, or misleading.  Slip op. at 20.  Indeed, the en banc court provided an example, outside the context of abortion, to illustrate its meaning.  The en banc court’s use of the example of prolonged television viewing and “increased risk” of mortality suggests that the court took risk to mean any association, no matter how likely it was the result of bias or confounding.  See id. at 10 n. 3 (citing Anders Grøntved, et al., “Television Viewing and Risk of Type 2 Diabetes, Cardiovascular Disease, and All-Cause Mortality,” 305 J. Am. Med. Ass’n 2448 (2011)). The en banc majority held that the advisory would be misleading only if Planned Parenthood could show that the available epidemiologic studies conclusively ruled out causation.  Slip op. at 24-25.

The Suicide Advisory Has Little Content Because Risk Is Not Cause

The majority decision clarified that the mandatory disclosure does not require a physician to inform a patient that abortion causes suicide or suicidal thoughts.  Slip op. at 25.  The en banc court took solace in its realization that physicians reviewing the available studies could provide a disclosure that captures the difference between risk, relative risk, and causation.  In other words, physicians are free to tell patients that this thing called increased risk is not concerning because the studies are highly confounded, and they do not show causation.  Id. at 25-26.  Indeed, it would be hard to imagine an ethical physician telling patients anything else.

Dissent

Four of the Eighth Circuit judges dissented, pointing to evidence that the South Dakota legislators intended to mandate a disclosure about causality.  Slip op. at 29.  Putting aside whether the truthfulness of the suicide advisory can be saved by reverting to a more modest interpretation of risk or of increased risk, the dissenters appear to have the better argument that the advisory is misleading.  The majority, however, by driving its wedge between causation and increased risk, has allowed physicians to explain that the advisory has little or no meaning.

NOCEBO

The nocebo effect is the dark side of the placebo effect.  As pointed out recently in the Journal of the American Medical Association, nocebos can induce harmful outcomes because of the expectation of injury from the “psychosocial context or therapeutic environment” affecting patients’ perception of their health.  Luana Colloca & Damien Finniss, “Nocebo Effects, Patient-Clinician Communication, and Therapeutic Outcomes,” 307 J. Am. Med. Ass’n 567, 567 (2012).  It is fairly well accepted that clinicians can inadvertently prejudice health outcomes by how they frame outcome information to patients.  Colloca and Finniss note that the negative expectations created by nocebo communication can take place in the process of obtaining informed consent.

Unfortunately, there is no discussion of nocebo effects in the Eighth Circuit’s decision. Planned Parenthood might well consider the role the nocebo effect plays in the risk-benefit calculus of an informed consent disclosure about a risk that really is not a risk, or at least not a risk in the sense of a factor that will bring about the putative harm, but rather only something that is under study and that cannot be separated from many confounding factors.  Surely, physicians in South Dakota will figure out how to give truthful, non-misleading disclosures that incorporate the mandatory suicide advisory, as well as the scientific evidence.

Statistical Significance – Will Judicial Notice Substitute for An Expert Witness?

July 23rd, 2012

Do litigants in civil and criminal proceedings need statistical expert witnesses to present statistical analyses? Or, can lawyers take the data that are in evidence, and present their own statistical analyses?

Surely, lawyers could add figures to arrive at a sum, which is relevant to the issues in dispute.  Some lawyers and judges might be able to take model assumptions, and compare two means or two proportions, to show that the statistics likely did not come from the same population. Indeed, some lawyers may be able to do such analyses better than some expert witnesses, but this raises the question:  is it legally permissible?

In In re Pfizer Inc. Securities Litig., 584 F.Supp. 2d 621 (S.D.N.Y. 2008), defendant Pfizer filed a motion to dismiss a securities class action complaint.  The court found that Pfizer’s motion would require it to interpret statistical significance, and that it could not accept the parties’ non-expert assertions of the meaning of the concept; nor could the court take judicial notice of the meaning:

“The Court declines to take judicial notice of the meaning of statistical significance or of the data interpretations proffered by Defendants in the context of this motion practice. Rule 201 of the Federal Rules of Evidence provides that courts may only take notice of facts ‘either (1) generally known . . . or (2) capable of accurate and ready determination by resort to sources whose accuracy cannot reasonably be questioned’. Fed. R. Evid. 201(b). While statistical significance may have certain characteristics capable of general abstraction, it is far beyond the scope of Rule 201 to accept as fact the particular definitions of statistical significance proffered by Defendants as either facts generally known or as drawn from sources whose accuracy cannot reasonably be questioned. It is one thing to take notice of the fact that an author has written that 5% is the threshold for statistical significance. It is quite another thing entirely to use that 5% figure as a basis for rejecting the significance of complicated medical studies.”

Id. at 634. Similarly, the court refused to look at specific studies and conclude that they failed to find a statistically significant association between Celebrex and cardiovascular adverse events:

“A motion to dismiss a complaint is not an appropriate vehicle for determination as to the weight of the evidence, expert or otherwise. Clearly, the Court cannot take judicial notice that the three studies show a lack of any statistically significant link between Celebrex/Bextra and adverse cardiovascular events because that supposed fact is neither generally known nor capable of accurate and ready determination by reference to unquestionably accurate sources. Moreover, the Court cannot determine as a matter of law whether such links were statistically insignificant because statistical significance is a question of fact.”

Id. at 635.

In Bristol-Myers Squibb v. AIU Insurance Co., et al., Cause No. A-145,672, Jefferson County, 58th Judicial District, Texas, plaintiff’s counsel made a Batson challenge to the defendants’ exercise of peremptory challenges.  See Daily Transcript (May 13, 1997).  Not having expected the defense counsel to exercise their peremptory challenges in an apparently discriminatory fashion, the plaintiff’s counsel did not have a statistician ready to analyze the pattern of challenges.  One of the plaintiff’s counsel presented the analysis in his oral argument to the court.  The venire panel was made up of 49 persons, 18 black, and 31 white.  The defense exercised 6 of their 7 peremptory challenges against black veniremen.  Based upon these numbers, plaintiff’s counsel presented a calculation of the probability that defense counsel would have exercised their challenges in such an extreme fashion if they had made their choices independently of race. The defense objected to plaintiff’s counsel’s calculations, but the trial court overruled the objection and noted that the laws of probability were subject to judicial notice.  Id. at 828-30.
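
The transcript does not reveal exactly how counsel framed the calculation, but a natural reconstruction (a sketch only, with the hypergeometric model as my assumption) asks how often 6 or more of 7 strikes would fall on black members of a 49-person venire containing 18 black members, if the strikes were exercised without regard to race:

from scipy.stats import hypergeom

venire_size, black_members, strikes = 49, 18, 7
# P(X >= 6), where X is the number of strikes against black veniremen
# under race-neutral (random) striking
p_extreme = hypergeom.sf(5, venire_size, black_members, strikes)
print(f"probability of 6 or more of 7 strikes against black veniremen: {p_extreme:.4f}")

Under that model, so lopsided a pattern would occur by chance well under one percent of the time.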

The Texas trial court found a prima facie case of discrimination, and permitted plaintiff’s counsel to cross-examine defense counsel about their peremptory and selection decisions.  Id. at 858.  The case settled shortly afterwards.  See also Andrew T. Berry, “Selecting Jurors,” 24 Litigation 8, 9 (Fall 1997)(“For example, in a recent (unreported) large civil case in the Southwest, the defendants successfully defeated a Batson challenge to their use of 85 percent of their peremptory challenges against protected class members. The successful defense? That the two dozen eminent counsel …, given less than a quarter-hour to exercise their peremptories were simply too disorganized to have struck jurors in violation of Batson.”)

In the welding fume MDL 1535, the plaintiffs persisted in challenges to a particular industry-funded, published epidemiologic study, which reported findings of no increased risks for Parkinson’s disease and parkinsonism among non-shipyard Danish welders.  Jon Fryzek, J. Hansen, S. Cohen, J. Bonde, et al., “A cohort study of Parkinson’s disease and other neurodegenerative disorders in Danish welders,” 47 J. Occup. & Envt’l Med. 466 (2005).  Plaintiffs’ counsel went to the extreme of traveling to Denmark, with one of their expert witnesses in tow, to analyze the underlying data for this study. Upon returning to the United States, the plaintiffs moved to bar reliance upon the Fryzek study, on the theory that the statistical analysis concerning the article’s finding of no statistically significant difference in the age of onset was incorrect.  In support of their argument, one of the plaintiffs’ counsel, a law professor who was assisting plaintiffs in the welding fume litigation, submitted an affidavit in support of the motion in limine to bar the defense expert witnesses’ reliance upon the study.  See Affidavit of Theodore Eisenberg, in In re Welding Fume Products Liability Litigation, Case No.: 1:03-cv-17000, MDL No. 1535, Document 1862 Filed 08/07/2006.

Eisenberg’s affidavit reported analyses of the Danish data, apparently based upon work done by an unnamed “programmer” at the Danish Cancer Society.  The affidavit included truncated computer program output, without identification of the statistical tests, or of the statistical software, used. Eisenberg interpreted the p-value result of the attached statistical analysis to show that there was a statistically significant difference in the age of onset of Parkinson’s disease between welders and non-welders.

The defense opposed the motion on grounds that Eisenberg’s affidavit was an ethically impermissible attempt by a lawyer in the case to present an expert witness opinion.  The defense also countered substantively with an affidavit from one of its expert witnesses, who analyzed the affidavit and realized that Eisenberg and the anonymous programmer had not presented the complete software output from their analyses, and that they had used a different test from that used in the published paper.  Eisenberg’s affidavit therefore had not identified an error in the published paper.  Declaration of Timothy L. Lash (Sept. 11, 2006), filed in In re Welding Fume Products Liability Litigation, Case No.: 1:03-cv-17000, MDL No. 1535.  The trial court denied the plaintiff’s motion to bar reliance upon the Fryzek study, without comment on the propriety of Eisenberg’s affidavit.

The MTBE mass tort litigation gave rise to a peculiar instance in which a trial court held that a real estate value appraiser had departed from the level of intellectual rigor used in assessing property value changes, claimed to have resulted from a gas station’s pollution of the ground water in a small town in Orange County, New York.  The witness opined that the plaintiffs’ property suffered a 15% decline in market value, but he failed to identify the methods he used to arrive at his opinion. In re Methyl Tertiary Butyl Ether (“MTBE”) Prods. Liab. Litig., 2008 U.S. Dist. LEXIS 44216 (S.D.N.Y. June 4, 2008) (Scheindlin, J.).  The expert witness did explain that there were so few sales in the affected town that he could not use regression analysis, and that it was thus necessary to look at “trend data on sales by sub-markets, sales/list price analysis and days on the market comparisons.” Id. at *5.  Even so, the trial court could not otherwise discern what method the witness did use:

“In this case, I am unable to discern any method — much less a reliable method — that Langer used to reach his conclusion that the value of plaintiffs’ property decreased by fifteen percent because of MTBE contamination. Rather, Langer has merely compiled market data and then offered his conclusions, yet he has failed to explain the relationship between the two.”

Id. at *11.

Although the expert witness’s departure from the professional standard of care rendered his opinion inadmissible, the trial court decided that the would-be expert witness could still testify as a fact witness to the facts that he had collected about sales trends in the affected community and elsewhere.  According to the court, the statistics gathered by this witness were relevant, and the plaintiffs’ counsel could argue plausible inferences from the sales figures to the jury.  Id. at *16-17.  The court thus remarkably permitted the plaintiffs’ counsel to provide the statistical analysis that his designated expert witness had failed to give in a legally reliable form.

Discovery of Statistician Expert Witnesses

July 19th, 2012

This post has been updated and superseded by “

Pin the Tail on the Significance Test

July 14th, 2012

Statistical significance has proven a difficult concept for many judges and lawyers to understand and apply.  An adequate understanding of significance probability requires the recognition that the tail probability, which represents the probability of a result at least as extreme as the result obtained if the null hypothesis is true, could be the area under one or both sides of the probability distribution curve.  Specifying an attained significance probability requires us to specify further whether the p-value is one-sided or two-sided; that is, whether we have counted the probability of the observed result and of more extreme results in one direction or in both directions.

 

Reference Manual on Scientific Evidence

As with many other essential statistical concepts, we can expect courts and counsel to look to the Reference Manual for guidance.  As with the notion of statistical significance itself, the Manual is not entirely consistent or accurate.

Statistics Chapter

The statistics chapter in the Reference Manual on Scientific Evidence provides a good example of one- versus two-tail statistical tests:

One tail or two?

In many cases, a statistical test can be done either one-tailed or two-tailed; the second method often produces a p-value twice as big as the first method. The methods are easily explained with a hypothetical example. Suppose we toss a coin 1000 times and get 532 heads. The null hypothesis to be tested asserts that the coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.

That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468 heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%. Because small p-values are evidence against the null hypothesis, the one-tailed test seems to produce stronger evidence than its two-tailed counterpart. However, the advantage is largely illusory, as the example suggests. (The two-tailed test may seem artificial, but it offers some protection against possible artifacts resulting from multiple testing—the topic of the next section.)

Some courts and commentators have argued for one or the other type of test, but a rigid rule is not required if significance levels are used as guidelines rather than as mechanical rules for statistical proof.110 One-tailed tests often make it easier to reach a threshold such as 5%, at least in terms of appearance. However, if we recognize that 5% is not a magic line, then the choice between one tail and two is less important—as long as the choice and its effect on the p-value are made explicit.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 255-56 (3d ed. 2011). This advice is pragmatic but a bit misleading.  The reason for the two-tailed test is not really tied to multiple testing.  If there were 20 independent tests, doubling the p-value would hardly be “some protection” against multiple testing artifacts. Where the alternative hypothesis is simply that the parameter differs from the value specified by the null hypothesis, in either direction, extreme values both above and below the null value count in favor of rejecting the null, and a two-tailed test results.  Multiple testing may be a reason for modifying our interpretation of the strength of a p-value, but it really should not drive our choice between one-tailed and two-tailed tests.
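
The Manual’s coin example itself is easy to check.  A minimal sketch in Python (using an exact binomial calculation, which is my choice; the Manual does not say how its figures were computed) reproduces the 2.3% and 4.6% values:

from scipy.stats import binom

n_tosses, heads = 1000, 532
one_tailed = binom.sf(heads - 1, n_tosses, 0.5)                       # P(X >= 532)
two_tailed = one_tailed + binom.cdf(n_tosses - heads, n_tosses, 0.5)  # plus P(X <= 468)
print(f"one-tailed p-value: {one_tailed:.3f}")   # approximately 0.023
print(f"two-tailed p-value: {two_tailed:.3f}")   # approximately 0.046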

The authors of the statistics chapter are certainly correct that 5% is not “a magic line,” but they might ask what the FDA does when looking to see whether a clinical trial has established the efficacy of a new medication.  Does it license the medication if the sponsor’s trial comes close to 5%, or does it demand 5%, two-tailed, as a minimal showing?  There are times in science, industry, regulation, and law, when a dichotomous test is needed.

Kaye and Freedman provide an important further observation, which is ignored in the subsequent epidemiology chapter’s discussion:

“One-tailed tests at the 5% level are viewed as weak evidence—no weaker standard is commonly used in the technical literature.  One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.”

Id. at 255 n.10. This statement is a helpful bulwark against the oft-repeated suggestion that any p-value threshold would be an arbitrary cut-off for rejecting null hypotheses.

 

Chapter on Multiple Regression

This chapter explains how the choice of statistical test, whether one-sided or two-sided, may be tied to prior beliefs and to the selection of the alternative hypothesis in the hypothesis test.

“3. Should statistical tests be one-tailed or two-tailed?

When the expert evaluates the null hypothesis that a variable of interest has no linear association with a dependent variable against the alternative hypothesis that there is an association, a two-tailed test, which allows for the effect to be either positive or negative, is usually appropriate. A one-tailed test would usually be applied when the expert believes, perhaps on the basis of other direct evidence presented at trial, that the alternative hypothesis is either positive or negative, but not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement on the price of the infringed product was either zero or negative. (The sales of the infringing product competed with the sales of the infringed product, thereby lowering the price.) By using a one-tailed test, the expert is in effect stating that prior to looking at the data it would be very surprising if the data pointed in the direction opposite to the one posited by the expert.

Because using a one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test, the choice of a one-tailed test makes it easier for the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed test makes null hypothesis rejection less likely. Because there is some arbitrariness involved in the choice of an alternative hypothesis, courts should avoid relying solely on sharply defined statistical tests.49 Reporting the p-value or a confidence interval should be encouraged because it conveys useful information to the court, whether or not a null hypothesis is rejected.”

Id. at 321.  This statement is not quite consistent with the chapter on statistics, and it introduces new problems.  The choice of the alternative hypothesis is not always arbitrary; there are times when the use of a one-tailed or a two-tailed test is preferable, but the chapter withholds its guidance. The statement that a “one-tailed test produces p-values that are one-half the size of p-values using a two-tailed test” is true for Gaussian distributions, which of necessity are symmetrical.  Doubling the one-tailed p-value, however, will not necessarily yield a correct two-tailed measure for asymmetrical binomial or hypergeometric distributions.  If great weight must be placed on the exactness of the p-value for legal purposes, and on whether the p-value is less than 0.05, then courts must realize that there may be alternative approaches to calculating the significance probability, such as the mid-p-value.  The author of the chapter on multiple regression goes on to note that most courts have shown a preference for two-tailed tests.  Id. at 321 n. 49.  The legal citations, however, are limited, and given the lack of sophistication in many courts, it is not clear what prescriptive effect such a preference, if correct, should have.
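
A small numerical sketch illustrates the point about asymmetrical distributions (the counts are hypothetical and of my own choosing).  For a skewed binomial distribution, doubling the one-sided p-value, summing the probabilities of all outcomes no more likely than the one observed, and the one-sided mid-p approach give appreciably different answers:

from scipy.stats import binom

n, p_null, observed = 20, 0.1, 5          # 5 events where 2 were expected
pmf = [binom.pmf(k, n, p_null) for k in range(n + 1)]

one_sided = binom.sf(observed - 1, n, p_null)          # P(X >= 5)
doubled = 2 * one_sided                                # naive two-sided p-value
min_likelihood = sum(p for p in pmf if p <= binom.pmf(observed, n, p_null))
mid_p = binom.sf(observed, n, p_null) + 0.5 * binom.pmf(observed, n, p_null)

print(f"one-sided p-value:            {one_sided:.4f}")
print(f"doubled (two-sided) p-value:  {doubled:.4f}")
print(f"minimum-likelihood two-sided: {min_likelihood:.4f}")
print(f"one-sided mid-p value:        {mid_p:.4f}")

For these invented numbers, the doubled p-value is roughly twice the minimum-likelihood two-sided value, and the mid-p value is smaller still; none of the choices is self-evidently “the” p-value, which is the point.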

 

Chapter on Epidemiology

The chapter on epidemiology appears to be substantially at odds with the chapters on statistics and multiple regression.  Remarkably, the authors of the epidemiology chapter declare that because “most investigators of toxic substances are only interested in whether the agent increases the incidence of disease (as distinguished from providing protection from the disease),” “a one-tailed test is often viewed as appropriate.” Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 577 n. 83 (3d ed. 2011).

The chapter cites no support for what “most investigators” are “only interested in,” and its authors fail to provide a comprehensive survey of the case law.  I believe that the authors’ suggestion about the interest of “most investigators” is incorrect.  The chapter authors cite a questionable case involving over-the-counter medications that contained phenylpropanolamine (PPA), used for allergy and cold decongestion. Id. citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a one-tailed test for statistical significance in a toxic substance case).  The PPA case cited another case, Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002), which explicitly rejected the use of the one-tailed test.  More important, the preliminary report of the key study in the PPA litigation used one-tailed tests when submitted to the FDA, but was revised to use two-tailed tests when the authors prepared their manuscript for publication in the New England Journal of Medicine.  The PPA litigation thus represents a case in which, for regulatory purposes, the one-tailed test was used, but, for a scientific and clinical audience, the two-tailed test was used.

The other case cited by the epidemiology chapter was the District of Columbia district court’s review of an EPA risk assessment of second-hand smoke.  United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use a one-tailed test in assessing whether second-hand smoke was a carcinogen). The EPA is a federal agency in the “protection” business, not in the business of investigating scientific claims.  As widely acknowledged in many judicial decisions, regulatory action is often based upon precautionary-principle judgments, and is different from scientific causal claims.  See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y. 1984) (“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d in relevant part, 818 F.2d 145 (2d Cir. 1987), cert. denied sub nom. Pinkney v. Dow Chemical Co., 484 U.S. 1004 (1988).

 

Litigation

In the securities fraud class action against Pfizer over Celebrex, one of plaintiffs’ expert witnesses criticized a defense expert witness’s meta-analysis for not using a one-sided p-value.  According to Nicholas Jewell, Dr. Lee-Jen Wei should have used a one-sided test for his summary meta-analytic estimates of association.  In his deposition testimony, however, Jewell was unable to identify any published or unpublished studies of NSAIDs that used a one-sided test.  One of plaintiffs’ expert witnesses, Prof. Madigan, rejected the use of one-sided p-values in this situation, out of hand.  Another plaintiffs’ expert witness, Curt Furberg, referred to Jewell’s one-sided testing as “cheating” because it assumes an increased risk and artificially biases the analysis against Celebrex.  Pfizer’s Mem. of Law in Opp. to Plaintiffs’ Motion to Exclude Expert Testimony by Dr. Lee-Jen Wei at 2, filed Sept. 8, 2009, in In re Pfizer, Inc. Securities Litig., Nos. 04 Civ. 9866(LTS)(JLC), 05 md 1688(LTS), Doc. 153 (S.D.N.Y.) (citing Markel Decl., Ex. 18 at 223, 226, 229 (Jewell Dep., In re Bextra); Ex. 7, at 123 (Furberg Dep., Haslam v. Pfizer)).

 

Legal Commentary

One of the leading texts on statistical analyses in the law provides important insights into the choice between one-tail and two-tail statistical tests.  While scientific studies will almost always use two-tail tests of significance probability, there are times, especially in discrimination cases, when a one-tail test is appropriate:

“Many scientific researchers recommend two-tailed tests even if there are good reasons for assuming that the result will lie in one direction. The researcher who uses a one-tailed test is in a sense prejudging the result by ignoring the possibility that the experimental observation will not coincide with his prior views. The conservative investigator includes that possibility in reporting the rate of possible error. Thus routine calculation of significance levels, especially when there are many to report, is most often done with two-tailed tests. Large randomized clinical trials are always tested with two-tails.

In most litigated disputes, however, there is no difference between non-rejection of the null hypothesis because, e.g., blacks are represented in numbers not significantly less than their expected numbers, or because they are in fact overrepresented. In either case, the claim of underrepresentation must fail. Unless whites also sue, the only Type I error possible is that of rejecting the null hypothesis in cases of underrepresentation when in fact there is no discrimination: the rate of this error is controlled by a one-tailed test. As one statistician put it, a one-tailed test is appropriate when ‘the investigator is not interested in a difference in the reverse direction from the hypothesized’. Joseph Fleiss, Statistical Methods for Rates and Proportions 21 (2d ed. 1981).”

Michael Finkelstein & Bruce Levin, Statistics for Lawyers at 121-22 (2d ed. 2001).  These authors provide a useful corrective to the Reference Manual‘s quirky suggestion that scientific investigators are not interested in two-tailed tests of significance.  As Finkelstein and Levin point out, however, discrimination cases may involve probability models for which we care only about random error in one direction.

Professor Finkelstein elaborates further in his basic text, with an illustration from a Supreme Court case, in which the choice of the two-tailed test was tied to the outcome of the adjudication:

“If intended as a rule for sufficiency of evidence in a lawsuit, the Court’s translation of social science requirements was imperfect. The mistranslation  relates to the issue of two-tailed vs. one-tailed tests. In most social science pursuits investigators recommend two-tailed tests. For example, in a sociological study of the wages of men and women the question may be whether their earnings are the same or different. Although we might have a priori reasons for thinking that men would earn more than women, a departure from equality in either direction would count as evidence against the null hypothesis; thus we should use a two-tailed test. Under a two-tailed test, 1.96 standard errors is associated with a 5% level of significance, which is the convention. Under a one-tailed test, the same level of significance is 1.64 standard errors. Hence if a one-tailed test is appropriate, the conventional cutoff would be 1.64 standard errors instead of 1.96. In the social science arena a one-tailed test would be justified only if we had very strong reasons for believing that men did not earn less than women. But in most settings such a prejudgment has seemed improper to investigators in scientific or academic pursuits; and so they generally recommend two-tailed tests. The setting of a discrimination lawsuit is different, however. There, unless the men also sue, we do not care whether women earn the same or more than men; in either case the lawsuit on their behalf is correctly dismissed. Errors occur only in rejecting the null hypothesis when men do not earn more than women; the rate of such errors is controlled by one-tailed test. Thus when women earn at least as much as men, a 5% one-tailed test in a discrimination case with the cutoff at 1.64 standard deviations has the same 5% rate of errors as the academic study with a cutoff at 1.96 standard errors. The advantage of the one-tailed test in the judicial dispute is that by making it easier to reject the null hypothesis one makes fewer errors of failing to reject it when it is false.

The difference between one-tailed and two-tailed tests was of some consequence in Hazelwood School District v. United States [433 U.S. 299 (1977)], a case involving charges of discrimination against blacks in the hiring of teachers for a suburban school district.  A majority of the Supreme Court found that the case turned on whether teachers in the city of St. Louis, who were predominantly black, had to be included in the hiring pool and remanded for a determination of that issue. The majority based that conclusion on the fact that, using a two-tailed test and a hiring pool that excluded St. Louis teachers, the underrepresentation of black hires was less than two standard errors from expectation, but if St. Louis teachers were included, the disparity was greater than five standard errors. Justice Stevens, in dissent, used a one-tailed test, found that the underrepresentation was statistically significant at the 5% level without including the St. Louis teachers, and concluded that a remand was unnecessary because discrimination was proved with either pool. From our point of view, Justice Stevens was right to use a one-tailed test and the remand was unnecessary.”

Michael Finkelstein, Basic Concepts of Probability and Statistics in the Law 57-58 (N.Y. 2009).  See also William R. Rice & Stephen D. Gaines, “Heads I Win, Tails You Lose: Testing Directional Alternative Hypotheses in Ecological and Evolutionary Research,” 9 Trends in Ecology & Evolution 235‐237, 235 (1994) (“The use of such one‐tailed test statistics, however, poses an ongoing philosophical dilemma. The problem is a conflict between two issues: the large gain in power when one‐tailed tests are used appropriately versus the possibility of ‘surprising’ experimental results, where there is strong evidence of non‐compliance with the null hypothesis (Ho) but in the unanticipated direction.”); Anthony McCluskey & Abdul Lalkhen, “Statistics IV: Interpreting the Results of Statistical Tests,” 7 Continuing Education in Anesthesia, Critical Care & Pain 221 (2007) (“It is almost always appropriate to conduct statistical analysis of data using two‐tailed tests and this should be specified in the study protocol before data collection. A one‐tailed test is usually inappropriate. It answers a similar question to the two‐tailed test but crucially it specifies in advance that we are only interested if the sample mean of one group is greater than the other. If analysis of the data reveals a result opposite to that expected, the difference between the sample means must be attributed to chance, even if this difference is large.”).
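For readers who want to see where the 1.96 and 1.64 cutoffs come from, here is a minimal sketch in Python (using scipy's normal distribution; the 1.80 test statistic is a made-up number, chosen only for illustration) that reproduces the conventional critical values Finkelstein describes:

from scipy.stats import norm

alpha = 0.05

# Two-tailed test: alpha is split between both tails of the normal curve.
two_tailed_cutoff = norm.ppf(1 - alpha / 2)   # roughly 1.96 standard errors

# One-tailed test: all of alpha sits in the single tail of interest.
one_tailed_cutoff = norm.ppf(1 - alpha)       # roughly 1.64 standard errors

# A hypothetical disparity of 1.80 standard errors is "significant" under
# the one-tailed test but not under the two-tailed test -- the kind of gap
# that separated the Hazelwood majority from Justice Stevens.
z = 1.80
print(two_tailed_cutoff, one_tailed_cutoff)
print(z > one_tailed_cutoff, z > two_tailed_cutoff)   # True, False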

The treatise, Modern Scientific Evidence, addresses some of the case law in which disputes over one- versus two-tailed tests have arisen.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 240.  In discussing a Texas case, Kelley, cited infra, these authors note that the court correctly rejected an expert witness’s attempt to claim statistical significance on the basis of a one-tail test of data in a study of silicone and autoimmune disease.

The following is an incomplete review of cases that have addressed the choice between one- and two-tailed tests of statistical significance.

First Circuit

Chang v. University of Rhode Island, 606 F.Supp. 1161, 1205 (D.R.I.1985) (comparing one-tail and two-tail test results).

Second Circuit

Procter & Gamble Co. v. Chesebrough-Pond’s Inc., 747 F.2d 114 (2d Cir. 1984)(discussing one-tail versus two-tail tests in the context of a Lanham Act claim of product superiority)

Ottaviani v. State University of New York at New Paltz, 679 F.Supp. 288 (S.D.N.Y. 1988) (“Defendant’s criticism of a one-tail test is also compelling: since under a one-tail test 1.64 standard deviations equal the statistically significant probability level of .05 percent, while 1.96 standard deviations are required under the two-tailed test, the one-tail test favors the plaintiffs because it requires them to show a smaller difference in treatment between men and women.”) (“The small difference between a one-tail and two-tail test of probability is not relevant. The Court will not treat 1.96 standard deviation as the dividing point between valid and invalid claims. Rather, the Court will examine the statistical significance of the results under both one and two tails and from that infer what it can about the existence of discrimination against women at New Paltz.”)

Third Circuit

United States v. Delaware, 2004 U.S. Dist. LEXIS 4560, at *36 n.27 (D. Del. Mar. 22, 2004) (stating that for a one-tailed test to be appropriate, “one must assume … that there will only be one type of relationship between the variables”)

Fourth Circuit

Equal Employment Opportunity Comm’n v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir. 1983)(“We repeat, however, that we are not persuaded that it is at all proper to use a test such as the “one-tail” test which all opinion finds to be skewed in favor of plaintiffs in discrimination cases, especially when the use of all other neutral analyses refutes any inference of discrimination, as in this case.”), rev’d on other grounds, sub nom. Cooper v. FRB of Richmond, 467 U.S. 867 (1984)

Hoops v. Elk Run Coal Co., Inc., 95 F.Supp.2d 612 (S.D.W.Va. 2000)(“Some, including our Court of Appeals, suggest a one-tail test favors a plaintiff’s point of view and might be inappropriate under some circumstances.”)

Fifth Circuit

Kelley v. American Heyer-Schulte Corp., 957 F. Supp. 873, 879 (W.D. Tex. 1997), appeal dismissed, 139 F.3d 899 (5th Cir. 1998)(rejecting Shanna Swan’s effort to reinterpret study data by using a one-tail test of significance; “Dr. Swan assumes a priori that the data tends to show that breast implants have negative health effects on women—an assumption that the authors of the Hennekens study did not feel comfortable making when they looked at the data.”)

Brown v. Delta Air Lines, Inc., 522 F.Supp. 1218, 1229, n. 14 (S.D.Texas 1980)(discussing how one-tailed test favors plaintiff’s viewpoint)

Sixth Circuit

Dobbs-Weinstein v. Vanderbilt Univ., 1 F.Supp.2d 783 (M.D. Tenn. 1998) (rejecting one-tailed test in discrimination action)

Seventh Circuit

Mozee v. American Commercial Marine Service Co., 940 F.2d 1036, 1043 & n.7 (7th Cir. 1991)(noting that district court had applied one-tailed test and that plaintiff did not challenge that application on appeal), cert. denied, ___ U.S. ___, 113 S.Ct. 207 (1992)

Premium Plus Partners LLP v. Davis, 653 F.Supp. 2d 855 (N.D. Ill. 2009)(rejecting challenge based in part upon use of a one-tailed test), aff’d on other grounds, 648 F.3d 533 (7th Cir. 2011)

Ninth Circuit

In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (refusing to reject reliance upon a study of stroke and PPA use, which was statistically significant only with a one-tailed test)

Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1242-43 (E.D. Wash. 2002) (rejecting use of one-tailed test when its use assumes fact in dispute)

Stender v. Lucky Stores, Inc., 803 F.Supp. 259, 323 (N.D.Cal. 1992)(“Statisticians can employ either one or two-tailed tests in measuring significance levels. The terms one-tailed and two-tailed indicate whether the significance levels are calculated from one or two tails of a sampling distribution. Two-tailed tests are appropriate when there is a possibility of both overselection and underselection in the populations that are being compared.  One-tailed tests are most appropriate when one population is consistently overselected over another.”)

District of Columbia Circuit

United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use one-tailed test in assessing whether second-hand smoke was a carcinogen)

Palmer v. Shultz, 815 F.2d 84, 95-96 (D.C.Cir.1987)(rejecting use of one-tailed test; “although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion.”)

Moore v. Summers, 113 F. Supp. 2d 5, 20 & n.2 (D.D.C. 2000)(stating preference for two-tailed test)

Hartman v. Duffey, 88 F.3d 1232, 1238 (D.C.Cir. 1996)(“one-tailed analysis tests whether a group is disfavored in hiring decisions while two-tailed analysis tests whether the group is preferred or disfavored.”)

Csicseri v. Bowsher, 862 F. Supp. 547, 565, 574 (D.D.C. 1994)(noting that a one-tailed test is “not without merit,” but a two-tailed test is preferable)

Berger v. Iron Workers Reinforced Rodmen Local 201, 843 F.2d 1395 (D.C. Cir. 1988)(describing but avoiding choice between one-tail and two-tail tests as “nettlesome”)

Segar v. Civiletti, 508 F.Supp. 690 (D.D.C. 1981)(“Plaintiffs analyses are one tailed. In discrimination cases of this kind, where only a positive disparity is of interest, the one tailed test is superior.”)

Tal Golan’s Preliminary History of Epidemiologic Evidence in U.S. Courts

July 10th, 2012

Tal Golan is an historian, with a special interest in the history of science in the 18th and 19th centuries, and in the historical relationships among science, technology, and the law.  He now teaches history at the University of California, San Diego.  Golan’s book on the history of expert witnesses in the common law is an important starting place in understanding the evolution of the adversarial expert witness system in English and American courts.  Tal Golan, Laws of Man and Laws of Nature: A History of Scientific Expert Testimony (Harvard 2004).

Last year, Golan led a faculty seminar at the University of Haifa’s Law School on the history of epidemiologic evidence in 20th century American litigation.  A draft of Golan’s paper is available at the school’s website, and for those interested in the evolution of the American courts’ treatment of statistical and epidemiologic evidence, the paper is worth a look.  Tal Golan, “A preliminary history of epidemiological evidence in the twentieth-century American Courtroom” manuscript (2011) [Golan 2011].

There are problems, however, with Golan’s historical narrative.  Golan points to tobacco cases as the earliest forays into the use of epidemiologic evidence to prove health claims in court:

“I found only four toxic tort cases in the 1960s that involved epidemiological evidence – two tobacco and two vaccine cases. In the tobacco cases, the plaintiffs tried and failed to establish a causal relation between smoking and cancer via the testimony of epidemiological experts. In both cases the judges dismissed the epidemiological evidence and directed summary verdicts for the tobacco companies.38

Golan 2011 at 11 & n. 38 (citing Pritchard v. Liggett & Myers Tobacco Co., 295 F.2d 292 (1961); Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (1963)).  Golan may be correct that some of the early tobacco cases were dismissive of statistical and epidemiologic evidence, but these citations do not support his assertion.  The Lartigue case resulted in a defense verdict after a jury trial.  The judgment for the defendant was affirmed on appeal, with specific reference to the plaintiff’s use of epidemiologic evidence.  Lartigue v. R.J. Reynolds Tobacco Co., 317 F.2d 19 (5th Cir. 1963) (“The plaintiff contends that the jury’s verdict was contrary to the manifest weight of the evidence. The record consists of twenty volumes, not to speak of exhibits, most of it devoted to medical opinion. The jury had the benefit of chemical studies, epidemiological studies, reports of animal experiments, pathological evidence, reports of clinical observations, and the testimony of renowned doctors. The plaintiff made a convincing case, in general, for the causal connection between tobacco and cancer and, in particular, for the causal connection between Lartigue’s smoking and his cancer. The defendants made a convincing case for the lack of any causal connection.”), cert. denied, 375 U.S. 865 (1963), and cert. denied, 379 U.S. 869 (1964).  Golan is thus wrong to suggest that the plaintiffs in Lartigue suffered a summary judgment or a directed verdict on their causation claims.

In Pritchard, the plaintiff had three trials in the course of litigating his tobacco-related claims.  See Pritchard v. Liggett & Myers Tobacco Co., 134 F. Supp. 829 (W.D. Pa. 1955), rev’d, 295 F.2d 292, 294 (3d Cir. 1961), 350 F.2d 479 (3d Cir. 1965), cert. denied, 382 U.S. 987 (1966), amended, 370 F.2d 95 (3d Cir. 1966), cert. denied, 386 U.S. 1009 (1967).  The Pritchard case ultimately turned on liability more than causation issues.  In both cases, Golan’s citations are abridged and incorrect.

Golan also wades into a discussion of statistical significance in which he misstates the meaning of the concept and incorrectly describes how it was handled in at least one important case:

“Statistics provides such an assurance by calculating the probability of false association, and the epidemiological dogma demands it to be smaller than 5% (i.e, less than 1 in 20) for the association to be considered statistically significant.”

Golan 2011, at 18.  This statement is wrong.  Statistics do not provide a probability of the truth or falsity of the association.  The significance probability to which Golan refers measures the probability of data at least as extreme as those observed if the null hypothesis of no difference is correct.
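To make the distinction concrete, consider a minimal sketch in Python (using scipy's binomial test; the counts are hypothetical and chosen only for illustration).  The p-value it returns is computed on the assumption that the null hypothesis is true; it is not a probability that the association is real or spurious:

from scipy.stats import binomtest

# Hypothetical data: 60 events among 100 subjects, where the null
# hypothesis says each subject has a 50% chance of an event.
result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")

# The p-value answers: if the null (p = 0.5) were true, how probable are
# data at least as extreme as 60 out of 100?  It says nothing about the
# probability that the null hypothesis itself is true or false.
print(result.pvalue)   # roughly 0.06 for this hypothetical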

Having misunderstood and misstated the meaning of significance probability, Golan proceeds to make the classic misidentification of the statistical significance probability with the probability of either the null hypothesis or the observed result.  Frequentist statistical testing cannot do this, and Golan’s error has no place in a history of these concepts other than to point out that courts have frequently made this mistake:

“The ‘statistical significance’ standard is far more demanding than the ‘preponderance of the evidence’ or ‘more likely than not’ standard used in civil law. It reflects the cautious attitude of scientists who wish to be 95% certain that their measurements are not spurious.

**********

Epidemiologists have considered the price well worth paying. So has criminal law, which emphasizes the minimization of false conviction, even at the price of overlooking true crime. But civil law does not share this concern.”

This narrative misstates what epidemiologists are doing when they use significance probability and null hypothesis significance testing.  The confusion between epidemiologic statistical standards and the burden of proof in criminal cases is a serious error.

Golan compares and contrasts the approaches of the trial judges in Allen v. United States, and in In re Agent Orange:

“Judge Weinstein, on the other hand, was far less concerned with the strictness of the epidemiology. A scholar of evidence, member of the Advisory Committee that drafted the Federal Rules of Evidence during the early 1970s, and a critic of the partisan deployment of science in the adversarial courtroom, Weinstein embraced the stringent 95% significance threshold as a ready-made admissibility test that could validate the veracity of the statistical evidence used in court. Thus, while he referred to epidemiological studies as ‘the best (if not the sole) available evidence in mass exposure cases,’ he nevertheless refused to accept them in evidence, unless they were statistically significant.64

Golan 2011 at 19.  Weinstein is all that and more, but he never simplistically embraced statistical significance as a “ready-made admissibility test.”  Of course, 95% is the coefficient of confidence, the complement of an alpha of 5% (0.05), but this alpha is not a particularly stringent threshold unless it is misunderstood as a burden of proof.  Contrary to Golan’s suggestion, Judge Weinstein was not being conservative or restrictive in his approach in In re Agent Orange.

Golan’s “preliminary” history is a good start, but it misses an important perspective.  After World War II, biological science, in the form of genetics, as well as epidemiology and other areas, grew to encompass stochastic processes as well as mechanistic processes.  To a large extent, in permitting judgments to be based upon statistical and epidemiologic evidence, the law was struggling to catch up with developments in science.   There is quite a bit of evidence that the law is still struggling.

Reference Manual on Scientific Evidence (3d edition) on Statistical Significance

July 8th, 2012

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value?  This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

Berger’s citations in support are curiously inaccurate.  Footnote 62 cites the Cook case:

“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”

The expert witness, Dr. Clapp, in Cook did rely upon his own study, which did not obtain a statistically significant result, but the trial court denied the Rule 702 challenge to Clapp and permitted him to testify about a statistically non-significant ecological study.

Footnote 63 is no better:

“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”

With respect to the concept of statistical significance, the Viagra case centered on the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis.  The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies.  When the defendant brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself, and granted the motion to exclude McGwin’s testimony.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  Berger’s characterization of the review is incorrect, and her failure to cite the subsequent procedural history is disturbing.

 

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (2011).  Although the chapter confuses and conflates the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
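The multiple-testing problem that Kaye and Freedman describe can be illustrated with a short simulation in Python (numpy; the choice of twenty tests is arbitrary).  When an analyst runs twenty independent tests of hypotheses that are all truly null, the chance of at least one nominally “significant” finding is roughly 64%, not 5%:

import numpy as np

rng = np.random.default_rng(0)
n_trials, n_tests, alpha = 10_000, 20, 0.05

# Under a true null hypothesis the p-value is uniformly distributed on [0, 1].
p_values = rng.uniform(size=(n_trials, n_tests))

# Fraction of simulated "studies" in which at least one of the twenty
# tests crosses the 0.05 threshold by chance alone.
at_least_one = (p_values < alpha).any(axis=1).mean()

print(at_least_one)                 # simulated value, close to 0.64
print(1 - (1 - alpha) ** n_tests)   # exact value: 1 - 0.95**20, about 0.64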

 

Chapter on Multiple Regression

The chapter on regression does not add much to the earlier and later discussions.  The author asks rhetorically what the appropriate level of statistical significance is, and answers:

“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47”

Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.

 

Chapter on Epidemiology

The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573.  The suggestion that a statistically significant study has results unlikely to be due to chance probably crosses the line into the transpositional fallacy so nicely described and warned against in the chapter on statistics.  The problem is that “results” is ambiguous between the data (at least as extreme as those observed) and the point estimate of the mean or proportion in the sample.  Furthermore, the chapter’s statement here omits the conditional nature of the probability, which depends upon the assumption that the null hypothesis is correct.
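A short sketch may help show how the two “techniques” fit together.  The following Python fragment (scipy and the standard library; the 2x2 counts are hypothetical) computes a relative risk and the usual large-sample 95% confidence interval for it; whether the interval includes the null value of 1.0 tracks whether a two-sided test at the 5% level would call the result statistically significant:

import math
from scipy.stats import norm

# Hypothetical cohort data: 30 cases among 1,000 exposed subjects,
# 20 cases among 1,000 unexposed subjects.
a, n1 = 30, 1000    # exposed cases, exposed total
c, n0 = 20, 1000    # unexposed cases, unexposed total

rr = (a / n1) / (c / n0)                          # relative risk, here 1.5
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)    # standard error of log(RR)
z = norm.ppf(0.975)                               # about 1.96 for a 95%, two-sided interval

lower = math.exp(math.log(rr) - z * se_log_rr)
upper = math.exp(math.log(rr) + z * se_log_rr)

# For these hypothetical numbers: RR = 1.5, 95% CI roughly 0.86 to 2.62.
# The interval includes 1.0, so a two-sided test at the 5% level would not
# declare the association statistically significant.
print(rr, lower, upper)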

The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary” to mean “I will get reversed.”  The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary.  Some young adults, age 17.8 years old, may be better educated, better engaged in politics, and better informed about current events than 35-year-olds, but the law must set a cutoff.  Two-year-olds are demonstrably unfit, and 82-year-olds are surely past the threshold of maturity requisite for political participation.  A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the disparity observed had a p = 0.15.

The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”

Id. at 573 n.68.  Actually, this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing.  The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance.  See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The epidemiology chapter also chastises courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct:

“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).”

“Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”

RMSE3d at 577 n.81.  The footnote goes on to explain further the difference between the alpha probability and the burden-of-proof probability, but it incorrectly asserts that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.”  Id.  The significance probability does not address the probability that the observed statistic is the result of random chance; rather, it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true.  Kaye and Freedman’s chapter on statistics does much better at describing and avoiding the transpositional fallacy when describing p-values.

When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct.  What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.
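A toy calculation drives the point home (all of the numbers below are hypothetical; the prior probability and the power are invented solely to illustrate the arithmetic).  Even when a result is “significant” at an alpha of 0.05, the probability that the null hypothesis is nevertheless true need not be anywhere near 5%:

# Hypothetical inputs, chosen only to show that alpha and the probability
# of the null hypothesis are different quantities.
prior_null = 0.5    # assumed chance, before the study, that the null is true
alpha = 0.05        # false-positive rate when the null is true
power = 0.5         # assumed chance of a "significant" result when the null is false

# Bayes' theorem: probability the null is true given a "significant" result.
p_significant = prior_null * alpha + (1 - prior_null) * power
p_null_given_significant = (prior_null * alpha) / p_significant

print(p_null_given_significant)   # about 0.09 under these assumptions, not 0.05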

The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87

Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”

Id. at 578-79.  Mostly true, but again rather  unhelpful to judges and lawyers.  The authors divide the world up into “strict” testers and those critical of “strict” testing.  Where is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all?  I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis.  Meta-analysis in such a situation might not just ameliorate concerns about random error, it might eliminate them.  But what of those critical of strict testing?  This certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).  The chapter’s reference to confidence intervals is correct in part; they permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of magnitude of association, as well as the point estimate of the association obtained from the sample.  Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
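The meta-analysis point can also be made concrete with a minimal fixed-effect (inverse-variance) pooling sketch in Python (numpy; the four study results below are invented).  Each hypothetical study’s 95% confidence interval narrowly includes the null relative risk of 1.0, yet the pooled estimate excludes it:

import numpy as np

# Invented log relative risks and standard errors from four hypothetical
# studies, each with a 95% CI that just barely includes RR = 1.0.
log_rr = np.array([0.30, 0.25, 0.35, 0.28])
se = np.array([0.17, 0.16, 0.19, 0.18])

# Fixed-effect (inverse-variance) pooling.
weights = 1 / se**2
pooled_log_rr = (weights * log_rr).sum() / weights.sum()
pooled_se = (1 / weights.sum()) ** 0.5

lower = np.exp(pooled_log_rr - 1.96 * pooled_se)
upper = np.exp(pooled_log_rr + 1.96 * pooled_se)

# Pooled RR is about 1.34 with a 95% CI of roughly 1.13 to 1.59 -- the
# interval now excludes 1.0 even though no single study did so on its own.
print(np.exp(pooled_log_rr), lower, upper)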

In the final analysis, the epidemiology chapter is unclear and imprecise.  I believe it confuses matters more than it clarifies.  There is clearly room for improvement in the Fourth Edition.

Viagra, Part II — MDL Court Sees The Light – Bad Data Trump Nuances of Statistical Inference

July 8th, 2012

In the Viagra vision loss MDL, the first Daubert hearing did not end well for the defense.  Judge Magnuson refused to go beyond conclusory statements by the plaintiffs’ expert witness, Gerald McGwin, and to examine the qualitative and quantitative evaluative errors invoked to support plaintiffs’ health claims.  The weakness of McGwin’s evidence, however, appeared to  encourage Judge Magnuson to authorize extensive discovery into McGwin’s study.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1090 (D. Minn. 2008).

The discovery into McGwin’s study had already been underway, with subpoenas to him and to his academic institution.  As it turned out, defendant’s discovery into the data and documents underlying McGwin’s study won the day.  Although Judge Magnuson struggled with inferential statistics, he understood the direct attack on the integrity of McGwin’s data.  Over a year after denying defendant’s Rule 702 motion to exclude Gerald McGwin, the MDL court reconsidered and granted the motion.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).

The basic data on prior exposures and risk factors for the McGwin study were collected by telephone surveys, from which the information was coded into an electronic dataset.  In analyzing the data, McGwin used the electronic dataset and not the survey forms.  Id. at 939.  The transfer from survey forms to electronic dataset did not go smoothly; about 11 patients were miscoded as “exposed” when their use of Viagra post-dated the onset of NAION. Id. at 942.  Furthermore, the published article incorrectly reported personal history of heart attack as a “risk factor”; the survey had inquired about family, not personal, history of heart attack. Id. at 944.

The plaintiffs threw several bombs in response, but without legal effect.  First, the plaintiffs claimed that the study participants had been recontacted and the database had been corrected, but they were unable to document this process or the alleged corrections.  Id. at 433.  Furthermore, the plaintiffs could not explain how, if their contention had been true, McGwin would not have committed serious violations of his university’s institutional review board’s regulations with respect to deviations from the original protocol.  Id. at 943 n.7.

Second, the plaintiffs argued that the underlying survey forms were “inadmissible” and thus the defense could not use them to impeach the McGwin study.  Some might think this a duplicitous argument, utterly at odds with Rule 703 – rely upon a study but prevent use of the underlying data and documents to explain that the study does not show what it purports to show.  The MDL court spared the plaintiffs the embarrassment of ruling that the documents on which McGwin had based his study were inadmissible, and found that the forms were business records, admissible under Federal Rule of Evidence 803(6).  The court could have gone further to point out that McGwin’s reliance upon hearsay in the form of his study, McGwin 2006, opened the door to impeaching the hearsay relied upon with other hearsay.  See Rule 806.

When defense counsel sat down with McGwin in a deposition, they found that he had not undertaken any new analyses of corrected data.  Plaintiffs’ counsel had directed him not to do so.  Id. at 940-41.  But then, after the deposition was over, McGwin submitted a letter to the journal to report a corrected analysis.  Pfizer’s counsel obtained the letter in response to their subpoena to McGwin’s university, the University of Alabama, Birmingham.  Mirabile dictu: now the increased risk appeared limited only to the defendant’s medication, Viagra!

The trial court was not amused.  First, the new analysis was no longer peer reviewed, and the court had placed a great deal of emphasis on peer review in denying the first challenge to McGwin.  Second, the new analysis was no longer that of an independent scientist, but was conducted and submitted as a letter to the editor, while McGwin was working for plaintiffs’ counsel.  Third, the plaintiffs and McGwin conceded that the data were not accurate.  Last, but not least, the trial court clearly was not pleased that the plaintiffs’ counsel had deliberately delayed McGwin’s further analyses until after the deposition, and then tried to submit yet another supplemental report with those further analyses. In sum:

“the Court finds good reason to vacate its original Daubert Order permitting Dr. McGwin to testify as a general causation expert based on the McGwin Study as published. Almost every indicia of reliability the Court relied on in its previous Daubert Order regarding the McGwin Study has been shown now to be unreliable.  Peer review and publication mean little if a study is not based on accurate underlying data. Likewise, the known rate of error is also meaningless if it is based on inaccurate data. Even if the McGwin Study as published was conducted according to generally accepted epidemiologic research and did not result from post-litigation research, the fact that the McGwin Study appears to have been based on data that cannot now be documented or supported renders it inadmissibly unreliable. The Court concludes that under Daubert, Dr. McGwin’s opinion, to the extent that it is based on the McGwin Study as published, lacks sufficient indicia of reliability to be admitted as a general causation opinion.”

Id. at 945-46.  The remaining evidence was the Margo & French study, but McGwin had previously criticized that study as lacking data that ensured that Viagra use preceded onset of NAION.  In the end, McGwin was left with bupkes, and the plaintiffs were left with even less.

*******************

McGwin 2006 Was Also A Pain in the Rear End for McGwin

The Rule 702 motions and hearings on McGwin’s proposed testimony had consequences in the scientific world itself.  In 2011, the British Journal of Ophthalmology retracted McGwin’s 2006 paper.  “Retraction: Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,” 95 Brit. J. Ophthalmol. 595 (2011).

Interestingly, the retraction was reported in the Retraction Watch blog, “Retractile dysfunction? Author says journal yanked paper linking Viagra, Cialis to vision problem after legal threats.”  The blog treated the retraction as routine except for the hint of “legal pressure”:

“One of the authors of the paper, a researcher at the University of Alabama named Gerald McGwin Jr., told us that the journal retracted the article because it had become a tool in a lawsuit involving Pfizer, which makes Viagra, and, presumably, men who’d developed blindness after taking the drug:

‘The article just became too much of a pain in the rear end. It became one of those things where we couldn’t provide all the relevant documentation [to the university, which had to provide records for attorneys].’

Ultimately, however, McGwin said that the BJO pulled the plug on the paper.”

Id. The legal threat is hard to discern other than the fact that lawyers wanted to see something that peer reviewers almost never see – the documentation underlying the published paper.  So now, the study that formed the basis for the original ruling against Pfizer floats aimlessly as a derelict on the sea of science.  McGwin is, however, still at his craft.  In a study he published in 2010, he claimed that Viagra but not Cialis use was associated with hearing impairment.  Gerald McGwin, Jr, “Phosphodiesterase Type 5 Inhibitor Use and Hearing Impairment,” 136 Arch. Otolaryngol. Head & Neck Surgery 488 (2010).

Where are Senator Grassley and Congressman Waxman when you need them?