TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Traditional, Frequentist Statistics Still Hegemonic

March 25th, 2017

The Defense Fallacy

In civil actions, defendants and their legal counsel sometimes argue that the absence of statistical significance across multiple studies requires a verdict of “no cause” for the defense. This argument is fallacious, as can be seen where there are many studies, say eight or nine, all of which consistently find elevated risk ratios, but with p-values slightly higher than 5%. The probability that eight studies, free of bias, would all find an elevated risk ratio if there were truly no association is itself very small, regardless of the individual studies’ p-values. If the studies were amenable to meta-analysis, the summary estimate of the risk ratio in this hypothetical would itself likely be highly statistically significant.
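
A back-of-the-envelope calculation illustrates the point. The sketch below, using entirely hypothetical numbers, pools eight studies, each finding a risk ratio of 1.5 with an individual p-value just above 0.05, in a fixed-effect, inverse-variance meta-analysis on the log risk-ratio scale; the summary estimate comes out highly significant.

```python
# A sketch, with entirely hypothetical numbers, of a fixed-effect,
# inverse-variance meta-analysis on the log risk-ratio scale.
import math
from scipy import stats

# Eight hypothetical studies: each finds RR = 1.5, each with an
# individual z of about 1.9, i.e., a p-value just above 0.05.
log_rr = [math.log(1.5)] * 8
se = [math.log(1.5) / 1.9] * 8   # standard errors chosen to give z ~ 1.9

weights = [1 / s ** 2 for s in se]                  # inverse-variance weights
pooled = sum(w * x for w, x in zip(weights, log_rr)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
z = pooled / pooled_se                              # ~ 1.9 * sqrt(8), about 5.4
p = 2 * stats.norm.sf(abs(z))                       # two-sided p-value

print(f"summary RR = {math.exp(pooled):.2f}, z = {z:.1f}, p = {p:.1e}")
```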

The Plaintiffs’ Fallacy

The plaintiffs’ fallacy derives from instances, such as the hypothetical one above, in which statistical significance, taken as a property of individual studies, is lacking. Even though we can hypothesize such instances, plaintiffs fallaciously extrapolate from them to the conclusion that statistical significance, or any other measure of the precision of sampling estimates, is unnecessary to support a conclusion of causation.

In courtroom proceedings, epidemiologist Kenneth Rothman is frequently cited by plaintiffs as having shown or argued that statistical significance is unimportant. For instance, in the Zoloft multi-district birth defects litigation, plaintiffs argued in a motion for reconsideration of the exclusion of their epidemiologic witness that the trial court had failed to give appropriate weight to the Supreme Court’s decision in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27 (2011), as well as to the Third Circuit’s invocation of the so-called “Rothman” approach in a Bendectin birth defects case, DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941 (3d Cir. 1990). According to the plaintiffs’ argument, their excluded epidemiologic witness, Dr. Anick Bérard, had used this approach in arriving at her novel conclusion that sertraline causes virtually every kind of birth defect.

The Zoloft plaintiffs did not call Rothman as a witness; nor did they even present an expert witness to explain what Rothman’s arguments were. Instead, the plaintiffs’ counsel sneaked references and vague conclusions into their cross-examinations of defense expert witnesses, and submitted snippets from Rothman’s textbook, Modern Epidemiology.

If the plaintiffs had called Dr. Rothman to testify, he probably would have insisted that statistical significance is not a criterion for causation. Such insistence is not as helpful to plaintiffs in cases such as the Zoloft birth defects litigation as their lawyers might have thought or hoped. Consider the situations in which causal inferences are drawn without formal statistical analysis; they are rarely relevant to mass tort litigation, which typically involves a prevalent exposure and a prevalent outcome.

Rothman also would likely have insisted that consideration of random variation and bias is essential to the assessment of causation, and that many apparently or nominally statistically significant associations do not and cannot support valid inferences of causation. Furthermore, he might have been given the opportunity to explain that his criticisms of significance testing are directed as much at the creation of false positives as at false negatives in observational epidemiology. In keeping with his publications, Rothman would have challenged strict significance testing with p-values, as opposed to the use of sample statistical estimates in conjunction with confidence intervals. The irony of the Zoloft case, and of many other litigations, was that the defense was not using significance testing in the way Rothman had criticized; rather, the plaintiffs were over-endorsing statistical significance that was nominal, plagued by multiple testing, and inconsistent across studies.

Judge Rufe, who presided over the Zoloft MDL, pointed out that the Third Circuit in DeLuca had never affirmatively endorsed Professor Rothman’s “approach,” but had reversed and remanded the Bendectin case to the district court for a hearing under Rule 702:

“by directing such an overall evaluation, however, we do not mean to reject at this point Merrell Dow’s contention that a showing of a .05 level of statistical significance should be a threshold requirement for any statistical analysis concluding that Bendectin is a teratogen regardless of the presence of other indicia of reliability. That contention will need to be addressed on remand. The root issue it poses is what risk of what type of error the judicial system is willing to tolerate. This is not an easy issue to resolve and one possible resolution is a conclusion that the system should not tolerate any expert opinion rooted in statistical analysis where the results of the underlying studies are not significant at a .05 level.”

2015 WL 314149, at *4 (quoting DeLuca, 911 F.2d at 955). And in DeLuca, after remand, the district court excluded the DeLuca plaintiffs’ expert witnesses and granted summary judgment, based upon the dubious methods employed by the plaintiffs’ expert witnesses (including the infamous Dr. Done, and Shanna Swan) in cherry-picking data, recalculating risk ratios in published studies, and ignoring bias and confounding in studies. On subsequent appeal, the Third Circuit affirmed the judgment for Merrell Dow. DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993).

Judge Rufe similarly rebuffed the plaintiffs’ use of the Rothman approach, their reliance upon Matrixx, and their attempt to banish consideration of random error in the interpretation of epidemiologic studies. In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2015 WL 314149 (E.D. Pa. Jan. 23, 2015) (Rufe, J.) (denying PSC’s motion for reconsideration). See “Zoloft MDL Relieves Matrixx Depression” (Feb. 4, 2015).

Some Statisticians’ Errors

Recently, Dr. Rothman and three other epidemiologists set out to track the change, over time, from 1975 to 2014, of the use of various statistical methodologies. Andreas Stang, Markus Deckert, Charles Poole & Kenneth J. Rothman, “Statistical inference in abstracts of major medical and epidemiology journals 1975–2014: a systematic review,” 32 Eur. J. Epidem. 21 (2017) [cited below as Stang]. They made clear that their preferred methodological approach was to avoid strictly dichotomous null hypothesis significance testing (NHST), which has evolved from Fisher’s significance testing (ST) and Neyman’s null hypothesis testing (NHT), in favor of the use of estimation with confidence intervals (CI). The authors conducted a meta-study, that is, a study of studies, to track the trends in the use of NHST, ST, NHT, and CI reporting in the major biomedical journals.

Unfortunately, the authors limited their data and analysis to abstracts, which makes their results very likely misleading and incomplete. Even when abstracts reported using so-called CI-only approaches, the authors of the underlying papers may well have reasoned, in the full text, that point estimates with CIs spanning no association were “non-significant.” Similarly, authors who found elevated risk ratios with very wide confidence intervals may well have properly acknowledged that their studies did not provide credible evidence of an association. See W. Douglas Thompson, “Statistical criteria in the interpretation of epidemiologic data,” 77 Am. J. Public Health 191, 191 (1987) (discussing the over-interpretation of skimpy data).
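
To make the point concrete, here is a minimal sketch, with hypothetical counts, of how a risk ratio and its 95% confidence interval are computed from a 2×2 table, and of why an elevated point estimate with a very wide interval spanning 1.0 conveys little evidence of association.

```python
# A sketch, with hypothetical counts, of a risk ratio and its 95% CI
# computed from a 2x2 table via the usual log-RR normal approximation.
import math

def rr_ci(a, n1, b, n0, z=1.96):
    """Risk ratio and 95% CI; a/n1 = exposed cases/total, b/n0 = unexposed."""
    rr = (a / n1) / (b / n0)
    se = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# An elevated point estimate resting on sparse data:
# RR = 2.0, but the CI runs from roughly 0.4 to 10.7, spanning 1.0.
print(rr_ci(4, 100, 2, 100))
```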

Rothman and colleagues found that while a few epidemiologic journals had a rising prevalence of CI-only reports in abstracts, for many biomedical journals the NHST approach remained more common. Interestingly, at three of the major clinical medical journals, the Journal of the American Medical Association, the New England Journal of Medicine, and the Lancet, NHST has prevailed over the almost four decades of observation.

The clear implication of Rothman’s meta-study is that consideration of significance probability, whether or not treated as a dichotomous outcome, and whether expressed as a p-value or as a point estimate with a confidence interval, is absolutely critical to how biomedical research is conducted, analyzed, and reported. In Rothman’s words:

“Despite the many cautions, NHST remains one of the most prevalent statistical procedures in the biomedical literature.”

Stang at 22. See also David Chavalarias, Joshua David Wallach, Alvin Ho Ting Li & John P. A. Ioannidis, “Evolution of Reporting P Values in the Biomedical Literature, 1990-2015,” 315 J. Am. Med. Ass’n 1141 (2016) (noting the absence of the use of Bayes’ factors, among other techniques).

There is one aspect of the Stang article that is almost Trump-like in its citation of an inappropriate, unknowledgeable source, whose author is then treated as having meaningful knowledge of the subject. As part of their rhetorical goals, Stang and colleagues declare that:

“there are some indications that it has begun to create a movement away from strict adherence to NHT, if not to ST as well. For instance, in the Matrixx decision in 2011, the U.S. Supreme Court unanimously ruled that admissible evidence of causality does not have to be statistically significant [12].”

Stang at 22. Whence comes this claim? Footnote 12 takes us to what could well be fake news of a legal holding, an article by a statistician about a legal case:

Joseph L. Gastwirth, “Statistical considerations support the Supreme Court’s decision in Matrixx Initiatives v. Siracusano,” 52 Jurimetrics J. 155 (2012).

Citing a secondary source when the primary source is readily available, and is the very thing at issue, seems like poor scholarship. Professor Gastwirth is a statistician, not a lawyer, and his exegesis of the Supreme Court’s decision is wildly off target. As any first-year law student could discern, the Matrixx case could not have been about the admissibility of evidence because the case had been dismissed on the pleadings, and no evidence had ever been admitted or excluded. The only issue on appeal was the adequacy of the allegations, not the admissibility of evidence.

Although the Court managed to muddle its analysis by wandering off into dicta about causation, the holding of the case is that alleging causation was not required to plead materiality in a securities fraud action. Having dispatched causality from the case, the Court had no serious business in setting out the considerations for alleging in pleadings, or proving at trial, the elements of causation. Indeed, the Court made it clear that its frolic and detour into causation could not be taken seriously:

“We need not consider whether the expert testimony was properly admitted in those cases [cited earlier in the opinion], and we do not attempt to define here what constitutes reliable evidence of causation.”

Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27, 131 S.Ct. 1309, 1319 (2011).

The words “admissible” and “admissibility” never appear in the Court’s opinion, and the quotation above explains that admissibility was not considered. Laughably, the Court went on to cite three cases as examples of supposed causation opinions in the absence of statistical significance. Two of the three were specific causation, differential etiology cases that involved known general causation. The third case involved a claim of birth defects from contraceptive jelly, in which the plaintiffs’ expert witnesses actually relied upon statistically significant (but thoroughly flawed and invalid) associations.1

When it comes to statistical testing, the legal world would be much improved if lawyers actually and carefully read what statisticians write, and if statisticians and scientists actually read court opinions.

Washington Legal Foundation’s Paper on Statistical Significance in Rule 702 Proceedings

March 13th, 2017

The Washington Legal Foundation has released a Working Paper, No. 201, by Kirby Griffis, entitled “The Role of Statistical Significance in Daubert / Rule 702 Hearings,” in its Critical Legal Issues Working Paper Series (Mar. 2017) [cited below as Griffis]. I am a fan of many of the Foundation’s Working Papers (having written one some years ago), but this one gives me pause.

Griffis’s paper manages to avoid many of the common errors of lawyers writing about this topic, but adds little to the statistics chapter in the Reference Manual on Scientific Evidence (3d ed. 2011), and he propagates some new, unfortunate misunderstandings. On the positive side, Griffis studiously avoids the transposition fallacy in defining significance probability, and he notes that multiplicity from subgroups and multiple comparisons often undermines claims of statistical significance. Griffis gets both points right. These are woefully common errors, and they deserve the emphasis Griffis gives to them in this working paper.
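
The transposition fallacy deserves a brief illustration. The sketch below, with assumed and purely illustrative prior and likelihood values, uses Bayes’ theorem to show that the significance probability, the probability of the data given the null, is not the probability that the null hypothesis is true given the data.

```python
# A sketch, with purely illustrative numbers, of why P(data | null)
# is not P(null | data): Bayes' theorem with an assumed 50% prior.
prior_null = 0.50          # assumed prior probability that the null is true
p_data_given_null = 0.05   # probability of data this extreme under the null
p_data_given_alt = 0.50    # assumed probability of such data under the alternative

posterior_null = (p_data_given_null * prior_null) / (
    p_data_given_null * prior_null + p_data_given_alt * (1 - prior_null)
)
print(f"P(null | data) = {posterior_null:.2f}")   # about 0.09, not 0.05
```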

On the negative side, however, Griffis falls into error on several points. Griffis helpfully narrates the Supreme Court’s evolution in Daubert and then in Joiner, but he fails to address the serious mischief and devolution introduced by the Court’s opinion in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27, 131 S.Ct. 1309 (2011). See Schachtman, “The Matrixx – A Comedy of Errors” (April 6, 2011); David Kaye, “Trapped in the Matrixx: The U.S. Supreme Court and the Need for Statistical Significance,” BNA Product Safety & Liability Reporter 1007 (Sept. 12, 2011). With respect to statistical practice, this Working Paper is at times wide of the mark.

Non-Significance

Although avoiding the transposition fallacy, Griffis falls into another mistake in interpreting tests of significance; he states that a non-significant result tells us that an hypothesis is “perfectly consistent with mere chance”! Griffis at 9. This is, of course, wrong, or at least seriously misleading. A failure to reject the null hypothesis does not prove the null; it does not license the claim that a study’s “null result” was perfectly consistent with chance. The test may have lacked power to detect an “effect size” of interest. Furthermore, tests of significance cannot rule out systematic bias or confounding, and that limitation alone ensures that Griffis’s interpretation is mistaken. A null result may have resulted from bias or confounding that obscured a measurable association.
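
A minimal power calculation, under assumed study parameters, shows why. With a hypothetical standard error of 0.3 on the log risk-ratio scale, a study has only about a 27% chance of detecting a true risk ratio of 1.5 at the 5% level; a “non-significant” result is the expected outcome even when a real effect exists.

```python
# A sketch, with assumed design parameters, of a power calculation
# for detecting a true risk ratio with a two-sided test at alpha = 0.05.
import math
from scipy import stats

def power_log_rr(rr_true, se_log_rr, alpha=0.05):
    """Approximate two-sided power, given the SE of the log risk ratio."""
    z_crit = stats.norm.isf(alpha / 2)            # 1.96 for alpha = 0.05
    shift = math.log(rr_true) / se_log_rr         # expected z under the alternative
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

# With SE = 0.3 on the log scale, a true RR of 1.5 is detected only
# about 27% of the time; a "null" result is the expected outcome.
print(f"power = {power_log_rr(1.5, 0.3):.2f}")
```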

Griffis states that p-values are expressed as percentages “usually 95% or 99%, corresponding to 0.05 or 0.01,” but this states things backwards. Griffis at 10. The p-value that is pre-specified as “significant” is a probability or percentage that is low, typically 5% or 1%; the coefficient of confidence used to construct a confidence interval is the complement of that significance level. An alpha, or pre-specified statistical significance level, of 5% thus corresponds to a coefficient of confidence of 95% (1.0 – 0.05).
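
The correspondence is easy to demonstrate. The following sketch, with a hypothetical estimate and standard error, shows that a two-sided test at alpha = 0.05 rejects the null exactly when the 95% confidence interval excludes the null value.

```python
# A sketch, with a hypothetical estimate and standard error, of the
# complement relationship: a two-sided test at level alpha rejects
# exactly when the (1 - alpha) confidence interval excludes the null.
from scipy import stats

alpha = 0.05
z = stats.norm.isf(alpha / 2)            # 1.96, the two-sided critical value

estimate, se, null = 0.50, 0.20, 0.0     # hypothetical values
ci = (estimate - z * se, estimate + z * se)
p = 2 * stats.norm.sf(abs((estimate - null) / se))

# Both criteria give the same answer: reject at alpha = 0.05.
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p:.3f}")
print(p < alpha, not (ci[0] <= null <= ci[1]))
```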

The Mid-p Controversy

In discussing the emerging case law, Griffis rightly points to cases that chastise Dr. Nicholas Jewell for the many liberties he has taken in various litigations as an expert witness for the lawsuit industry. One instance cited by Griffis is the Lipitor diabetes litigation, in which the MDL court suggested that Jewell improperly switched from a Fisher’s exact test to a mid-p test. Griffis at 18-19. Griffis seems to agree, but as I have explained elsewhere, Fisher’s exact test generates a one-tailed measure of significance probability, and the analyst is left to one of several ways of calculating a two-tailed test. See “Lipitor Diabetes MDL’s Inexact Analysis of Fisher’s Exact Test” (April 21, 2016). The mid-p is one legitimate approach for asymmetric distributions, and it is more favorable to the defense than passing off the one-tailed measure as the result of the test. The mere fact that a statistical software package does not automatically report the mid-p for a Fisher’s exact analysis does not make invoking this measure p-hacking or other misconduct. Doubling the attained significance probability of a particular Fisher’s exact test result is generally considered less accurate than a mid-p calculation, even though some software packages use doubling as a default. As much as we might dislike bailing Jewell out of Daubert limbo, on this one, limited point, he deserved a better hearing.
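
For readers who want to see the arithmetic, here is a minimal sketch, with a hypothetical 2×2 table, of the one-sided Fisher exact p-value and two common routes to a two-tailed figure: doubling the one-sided p, and the mid-p, which counts only half the probability of the observed table.

```python
# A sketch, with a hypothetical 2x2 table, of Fisher's exact test:
# the one-sided p, the doubled p, and the one-sided mid-p, which
# counts only half the probability of the observed table.
from scipy import stats

#               cases  non-cases
table = [[10, 90],   # exposed (hypothetical)
         [ 3, 97]]   # unexposed (hypothetical)

a = table[0][0]                       # observed exposed cases
total = sum(map(sum, table))          # grand total
cases = table[0][0] + table[1][0]     # total cases (column margin)
exposed = sum(table[0])               # exposed total (row margin)

one_sided = stats.fisher_exact(table, alternative="greater")[1]
doubled = min(1.0, 2 * one_sided)
mid_p = (stats.hypergeom.sf(a, total, cases, exposed)
         + 0.5 * stats.hypergeom.pmf(a, total, cases, exposed))

print(f"one-sided p = {one_sided:.4f}, doubled = {doubled:.4f}, mid-p = {mid_p:.4f}")
```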

Mis-Definitions

In recounting the Bendectin litigation, Griffis refers to the epidemiologic studies of birth defects and Bendectin as “experiments,” Griffis at 7, and then describes such studies as comparing “populations,” when he clearly meant “samples.” Griffis at 8.

Griffis conflates personal bias with bias as a scientific concept of systematic error in research, a confusion usually perpetuated by plaintiffs’ counsel. See Griffis at 9 (“Coins are not the only things that can be biased: scientists can be, too, as can their experimental subjects, their hypotheses, and their manipulations of the data.”) Of course, the term has multiple connotations, but too often an accusation of personal bias, such as conflict of interest, is used to avoid engaging with the merits of a study.

Relative Risks

Griffis correctly describes the measure known as “relative risk” as a determination of “the strength of a particular association.” Griffis at 10. The discussion then lapses into using a given relative risk as a measure of the likelihood that an individual with the studied exposure developed the disease because of it. Sometimes this general-to-specific inference is warranted, but without further analysis, it is impossible to tell whether Griffis lapsed from general to specific, deliberately or inadvertently, in describing the interpretation of relative risk.
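
The inference at stake can be stated precisely. Courts sometimes convert a group-level relative risk into an individual “probability of causation” through the attributable fraction among the exposed, (RR − 1)/RR, which exceeds 50% only when the relative risk exceeds two. The sketch below shows the arithmetic, which holds only under strong assumptions: an unbiased, unconfounded relative risk and a uniform effect across the exposed group.

```python
# A sketch of the attributable fraction among the exposed,
# AF = (RR - 1) / RR, the usual bridge from a group-level relative
# risk to an individual "probability of causation"; the conversion
# assumes an unbiased, unconfounded RR and a uniform effect.
def attributable_fraction(rr):
    """Fraction of exposed cases attributable to the exposure."""
    return (rr - 1) / rr

for rr in (1.5, 2.0, 3.0):
    print(f"RR = {rr}: AF = {attributable_fraction(rr):.0%}")
# AF reaches 50% exactly at RR = 2.0, the oft-cited legal threshold.
```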

Conclusion

Griffis is right in his chief contention that the proper planning, conduct, and interpretation of statistical tests is hugely important to judicial gatekeeping of some expert witness opinion testimony under Federal Rule of Evidence 702 (and under Rule 703, too). Judicial and lawyer aptitude in this area is low, and needs to be bolstered.

Statistical Analysis Requires an Expert Witness with Statistical Expertise

November 13th, 2016

Christina K. Connearney sued her employer, Main Line Hospitals, for age discrimination. Main Line charged Connearney with fabricating medical records, but Connearney replied that the charge was merely a pretext. Connearney v. Main Line Hospitals, Inc., Civ. Action No. 15-02730, 2016 WL 6569292 (E.D. Pa. Nov. 4, 2016) [cited as Connearney]. Connearney’s legal counsel engaged Christopher Wright, an expert witness on “human resources,” for a variety of opinions, most of which were not relevant to the action. Alas for Ms. Connearney, the few relevant opinions proffered by Wright were unreliable. On a Rule 702 motion, Judge Pappert excluded Wright from testifying at trial.

Although not a statistician, Wright sought to offer a statistical analysis in support of the age discrimination claim. Connearney at *4. According to Judge Pappert’s opinion, Wright had taken just two classes in statistics, but perhaps His Honor meant two courses. (Wright Dep., at 10:3–4.) If the latter, then Wright had more statistical training than most physicians, who are often permitted to give bogus statistical opinions in health effects litigation. In 2015, the Medical College Admission Test apparently started to include some very basic questions on statistical concepts. Some medical schools now require an undergraduate course in statistics. See Harvard Medical School Requirements for Admission (2016). Most medical schools, however, still do not require statistical training of their entering students. See Veritas Prep, “How to Select Undergraduate Premed Coursework” (Dec. 5, 2011); “Georgetown College Course Requirements for Medical School” (2016).

Regardless of formal training, or lack thereof, Christopher Wright demonstrated a profound ignorance of, and disregard for, statistical concepts. (Wright Dep., at 10:15–12:10; 28:6–14.) Wright was shown to be the wrong expert witness for the job by his inability to define statistical significance. When asked what he understood to be a “statistically significant sample,” Wright gave a meaningless, incoherent answer:

“I think it depends on the environment that you’re analyzing. If you look at things like political polls, you and I wouldn’t necessarily say that serving [sic] 1 percent of a population is a statistically significant sample, yet it is the methodology that’s used in the political polls. In the HR field, you tend to not limit yourself to statistical sampling because you then would miss outliers. So, most HR statistical work tends to be let’s look at the entire population of whatever it is we’re looking at and go from there.”

Connearney at *5 (Wright Dep., at 10:15–11:7). When questioned again, more specifically on the meaning of statistical significance, Wright demonstrated his complete ignorance of the subject:

“Q: And do you recall the testimony it’s generally around 85 to 90 employees at any given time, the ER [emergency room]?

A: I don’t recall that specific number, no.

Q: And four employees out of 85 or 90 is about what, 5 or 6 percent?

A: I’m agreeing with your math, yes.

Q: Is that a statistically significant sample?

A: In the HR [human resources] field it sure is, yes.

Q: Based on what?

A: Well, if one employee had been hit, physically struck, by their boss, that’s less than 5 percent. That’s statistically significant.”

Connearney at *5 n.5 (Wright Dep., at 28:6–14)

In support of his opinion about “disparate treatment,” Wright’s report contained nothing more than a naked comparison of two raw percentages and a causal conclusion, without any statistical analysis. Even for this simplistic comparison of rates, Wright failed to explain how he obtained the percentages in a way that would permit the parties and the trial court to understand his computations and comparisons. Without a statistical analysis, the trial court concluded, Wright had failed to show that the disparity in termination rates between younger and older employees was unlikely to be the result of random chance. See also Moultrie v. Martin, 690 F.2d 1078 (4th Cir. 1982) (rejecting a petition for a writ of habeas corpus when the petitioner failed to support his claim of grand jury race discrimination with anything other than the numbers of white and black grand jurors).
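
For illustration only, with hypothetical counts of my own invention, the kind of analysis Wright never performed might look like this: an exact test of whether termination rates differ between older and younger employees beyond what random variation would produce.

```python
# A sketch, with hypothetical counts of my own invention, of the
# analysis Wright never performed: an exact test of whether termination
# rates differ between older and younger employees beyond chance.
from scipy import stats

#            terminated  retained
table = [[4, 16],   # older employees (hypothetical: 4 of 20, or 20%)
         [6, 64]]   # younger employees (hypothetical: 6 of 70, or ~8.6%)

odds_ratio, p = stats.fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, p = {p:.3f}")
# With samples this small, even a doubled raw termination rate can be
# consistent with random variation.
```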

Although Wright gave the wrong definition of statistical significance, the trial court relied upon judges of the Third Circuit who also did not get the definition quite right. The trial court cited a 2010 case in the Circuit, which conflated substantive and statistical significance and then gave a questionable definition of statistical significance:

“The Supreme Court has not provided any definitive guidance about when statistical evidence is sufficiently substantial, but a leading treatise notes that ‘[t]he most widely used means of showing that an observed disparity in outcomes is sufficiently substantial to satisfy the plaintiff’s burden of proving adverse impact is to show that the disparity is sufficiently large that it is highly unlikely to have occurred at random.’ This is typically done by the use of tests of statistical significance, which determine the probability of the observed disparity obtaining by chance.”

See Connearney at *6 & n.7, citing and quoting from Stagi v. National RR Passenger Corp., 391 Fed. Appx. 133, 137 (3d Cir. 2010) (emphasis added) (internal citation omitted). Ultimately, however, this was all harmless error on the way to the right result.