
TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Welding Litigation – Another Positive Example of Litigation-Generated Science

July 11th, 2017

In a recent post¹, I noted Samuel Tarry’s valuable article² for its helpful, contrarian discussion of the importance of some scientific articles with litigation provenances. Public health debates can spill over to the courtroom, and developments in the courtroom can, on occasion, inform and even resolve those public health debates that gave rise to the litigation. Tarry provided an account of three such articles, and I provided a brief account of another article, a published meta-analysis, from the welding fume litigation.

The welding litigation actually accounted for several studies, but in this post, I detail the background of another published study, this one an epidemiologic study by a noted Harvard epidemiologist. Not every expert witness’s report has the makings of a published paper. In theory, if the expert witness has conducted a systematic review, and reached a conclusion that is not found among already published papers, we might well expect that the witness had achieved the “least publishable unit.” The reality is that most causal claims are not based upon what could even remotely be called a systematic review. Given the lack of credibility of such causal claims, rebuttal reports are likely to hold little interest for serious scientists.

Martin Wells

In the welding fume cases, one of plaintiffs’ hired expert witnesses, Martin Wells, a statistician, proffered an analysis of Parkinson’s disease (PD) mortality among welders and welding tradesmen. Using the National Center for Health Statistics (NCHS) database, Wells aggregated data from 1993 to 1999, for PD among welders and compared this to PD mortality among non-welders. Wells claimed to find an increased risk of PD mortality among younger (under age 65 at death) welders and welding tradesmen in this dataset.

The defense sought discovery of Wells’s methods and materials, and obtained the underlying data from the NCHS. Wells had no protocol, no pre-stated commitment to which years in the dataset he would use, and no pre-stated statistical analysis plan. At a Rule 702 hearing, Wells was unable to state how many welders were included in his analysis, why he selected some years but not others, or why he had selected age 65 as the cut off. His analyses appeared to be pure data dredging.

As the defense discovered, the NCHS dataset contained mortality data for many more years than the limited range employed by Wells in his analysis. Working with an expert witness at the Harvard School of Public Health, the defense discovered that Wells had gerrymandered the years included (and excluded) in his analysis in a way that just happened to generate a marginally, nominally statistically significant association.

NCHS Welder Age Distribution

The defense was thus able to show that the data overall, and in each year, were very sparse. For most years, the value was either 0 or 1, for PD deaths under age 65. Because of the huge denominators, however, the calculated mortality odds ratios were nominally statistically significant. The value of four PD deaths in 1998 is clearly an outlier. If the value were three rather than four, the statistical significance of the calculated OR would have been lost. Alternatively, a simple sensitivity test suggests that if instead of overall n = 7, n were 6, statistical significance would have been lost. The chart below, prepared at the time with help from Dr. David Schwartz, of Innovative Science Solutions, shows the actual number of “underlying cause” PD deaths that were in the dataset for each year in the NCHS dataset, and how sparse and “granular” these data were:

A couple of years later, Wells’s litigation analysis showed up as a manuscript, with only minor changes in its analyses, and with authors listed as Martin T. Wells and Katherine W. Eisenberg, in the editorial offices of Neurology. Katherine W. Eisenberg, AB and Martin T. Wells, Ph.D., “A Mortality Odds Ratio Study of Welders and Parkinson Disease.” Wells disclosed that he had testified for plaintiffs in the welding fume litigation, but Eisenberg declared no conflicts. Having only an undergraduate degree, and attending medical school at the time of submission, Ms. Eisenberg would not seem to have had the opportunity to accumulate any conflicts of interest. Undisclosed to the editors of Neurology, however, was that Ms. Eisenberg was the daughter of Theodore (Ted) Eisenberg, a lawyer who taught at Cornell University and who represented plaintiffs in the same welding MDL as the one in which Wells testified. Inquiring minds might have wondered whether Ms. Eisenberg’s tuition, room, and board were subsidized by Ted’s earnings in the welding fume and other litigations. Ted Eisenberg and Martin Wells had collaborated on many other projects, but in the welding fume litigation, Ted worked as an attorney for MDL welding plaintiffs, and Martin Wells was compensated handsomely as an expert witness. The acknowledgment at the end of the manuscript thanked Theodore Eisenberg for his thoughtful comments and discussion, without noting that he had been a paid member of the plaintiffs’ litigation team. Nor did Wells and Eisenberg tell the Neurology editors that the article had grown out of Wells’s 2005 litigation report in the welding MDL.

The disclosure lapses and oversights by Wells and the younger Eisenberg proved harmless error because Neurology rejected the Wells and Eisenberg paper for publication, and it was never submitted elsewhere. The paper used the same restricted set of years of NCHS data, 1993-1999. The defense had already shown, through its own expert witness’s rebuttal report, that the manuscript’s analysis achieved statistical significance only because it omitted years from the analysis. For instance, if the authors had analyzed 1992 through 1999, their Parkinson’s disease mortality point estimate for younger welding tradesmen would no longer have been statistically significant.

Robert Park

One reason that Wells and Eisenberg may have abandoned their gerrymandered statistical analysis of the NCHS dataset was that an ostensibly independent group³ of investigators published a paper that presented a competing analysis. Robert M. Park, Paul A. Schulte, Joseph D. Bowman, James T. Walker, Stephen C. Bondy, Michael G. Yost, Jennifer A. Touchstone, and Mustafa Dosemeci, “Potential Occupational Risks for Neurodegenerative Diseases,” 48 Am. J. Ind. Med. 63 (2005) [cited as Park (2005)]. The authors accessed the same NCHS dataset, and looked at hundreds of different occupations, including welding tradesmen, and four neurodegenerative diseases.

Park, et al., claimed that they looked at occupations that had shown elevated proportional mortality ratios (PMR) in a previous publication of the NIOSH. A few other occupations were included; in all there were hundreds of independent analyses, without any adjustment for multiple testing. Welding occupations⁴ were included “[b]ecause of reports of Parkinsonism in welders [Racette et al., 2001; Levy and Nassetta, 2003], possibly attributable to manganese exposure (from welding rods and steel alloys)… .”⁵ Racette was a consultant for the Lawsuit Industry, which had funded his research on parkinsonism among welders. Levy was a testifying expert witness for Lawsuit, Inc. A betting person would conclude that Park had consulted with Wells and Eisenberg, and their colleagues.
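To make the multiple-testing concern concrete, here is a minimal sketch in Python. The count of 200 analyses is purely an illustrative assumption (the text above says only "hundreds"), as is the conventional 0.05 threshold, and the arithmetic assumes independent tests in which every null hypothesis is true.

```python
# Illustrative only: m = 200 and alpha = 0.05 are assumptions,
# not figures taken from Park (2005).

def expected_false_positives(m: int, alpha: float) -> float:
    """Expected number of nominally significant results if all m nulls are true."""
    return m * alpha


def prob_at_least_one(m: int, alpha: float) -> float:
    """Probability of one or more false positives among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m: int, alpha: float) -> float:
    """Per-test threshold that holds the family-wise error rate at alpha."""
    return alpha / m


m, alpha = 200, 0.05
print("expected false positives:", expected_false_positives(m, alpha))
print("P(at least one false positive):", prob_at_least_one(m, alpha))
print("Bonferroni per-test threshold:", bonferroni_threshold(m, alpha))
```

Under these assumed numbers, about ten nominally significant findings would be expected by chance alone, which is why an unadjusted handful of elevated odds ratios carries little weight.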

These authors looked at four neurological degenerative diseases (NDDs), Alzheimer’s disease, Parkinson’s disease, motor neuron disease, and pre-senile dementia. The authors looked at NCHS death certificate occupational information from 1992 to 1998, which was remarkable because Wells had insisted that 1992 somehow was not available for inclusion in his analyses. During 1992 to 1998, in 22 states, there were 2,614,346 deaths with 33,678 from Parkinson’s disease. (p. 65b). Then for each of the four disease outcomes, the authors conducted an analysis for deaths below age 65. For the welding tradesmen, none of the four NDDs showed any associations. Park went on to conduct subgroup analyses for each of the four NDDs for death below age 65. In these subgroup analyses for welding tradesmen, the authors purported to find an association only with Parkinson’s disease:

“Of the four NDDs under study, only PD was associated with occupations where arc-welding of steel is performed, and only for the 20 PD deaths below age 65 (MOR=1.77, 95% CI=1.08-2.75) (Table V).”

Park (2005), at 70.

The exact nature of the subgroup was obscure, to say the least. Remarkably, Park and his colleagues had not calculated an odds ratio for welding tradesmen under age 65 at death compared with non-welding tradesmen under age 65 at death. The table’s legend attempts to explain the authors’ calculation:

“Adjusted for age, race, gender, region and SES. Model contains multiplicative terms for exposure and for exposure if age at death <65; thus MOR is estimate for deaths occurring age 65+, and MOR, age <65 is estimate of enhanced risk: age <65 versus age 65+”

In other words, Park looked to see whether welding tradesmen who died at a younger age (below age 65) were more likely to have a PD cause of death than welding tradesmen who died at an older age (over age 65). The meaning of this internal comparison is totally unclear, but it cannot represent a comparison of welders with non-welders. Indeed, every time Park and his colleagues calculated and reported this strange odds ratio for any occupational group in the published paper, the odds ratio was elevated. If the odds ratio means anything, it is that younger Parkinson’s patients, regardless of occupation, are more likely to die of their neurological disease than older patients. Older men, regardless of occupation, are more likely to die of cancer, cardiovascular disease, and other chronic diseases. Furthermore, this age association within (not between) occupational groups may be nothing other than a reflection of the greater severity of early-onset Parkinson’s disease in anyone, regardless of occupation.

Like the manuscript by Eisenberg and Wells, the Park paper was an exercise in data dredging. The Park study reported increased odds ratios for Parkinson’s disease among the following groups on the primary analysis:

biological, medical scientists [MOR 2.04 (95% CI, 1.37-2.92)]

clergy [MOR 1.79 (95% CI, 1.58-2.02)]

religious workers [MOR 1.70 (95% CI, 1.27-2.21)]

college teachers [MOR 1.61 (95% CI, 1.39-1.85)]

social workers [MOR 1.44 (95% CI, 1.14-1.80)]

As noted above, the Park paper reported that all of the internal mortality odds ratios for below versus above age 65, within occupational groups, were nominally statistically significantly elevated. Nonetheless, the Park authors were on a mission, and determined to make something out of nothing, at least when it came to welding and Parkinson’s disease among younger patients. The authors’ conclusion reflected stunningly poor scholarship:

“Studies in the US, Europe, and Korea implicate manganese fumes from arc-welding of steel in the development of a Parkinson’s-like disorder, probably a manifestation of manganism [Sjogren et al., 1990; Kim et al., 1999; Luccini, et al., 1999; Moon et al., 1999]. The observation here that PD mortality is elevated among workers with likely manganese exposures from welding, below age 65 (based on 20 deaths), supports the welding-Parkinsonism connection.”

Park (2005) at 73.

Stunningly bad because the cited papers by Sjogren, Luccini, Kim, and Moon did not examine Parkinson’s disease as an outcome; indeed, they did not even examine a parkinsonian movement disorder. More egregious, however, was the authors’ assertion that their analysis, which compared the odds of Parkinson’s disease mortality between welders under age 65 to that mortality for welders over age 65, supported an association between welding and “Parkinsonism.”

Every time the authors conducted this analysis internal to an occupational group, they found an elevation among under age 65 deaths compared with over age 65 deaths within the occupational group. They did not report comparisons of any age-defined subgroup of a single occupational group with similarly aged mortality in the remaining dataset.

Elan Louis

The plaintiffs’ lawyers used the Park paper as “evidence” of an association that they claimed was causal. They were aided by a cadre of expert witnesses who could cite to a paper’s conclusions, but could not understand its methods. Occasionally, one of the plaintiffs’ expert witnesses would confess ignorance about exactly what Robert Park had done in this paper. Elan Louis, one of the better qualified expert witnesses on the side of claimants, for instance, testified in the plaintiffs’ attempt to certify a national medical monitoring class action for welding tradesmen. His testimony about what to make of the Park paper was more honest than most of the plaintiffs’ expert witnesses:

Q. My question to you is, is it true that that 1.77 point estimate of risk, is not a comparison of this welder and allied tradesmen under this age 65 mortality, compared with non-welders and allied tradesmen who die under age 65?

A. I think it’s not clear that the footnote — I think that the footnote is not clearly written. When you read the footnote, you didn’t read the punctuation that there are semicolons and colons and commas in the same sentence. And it’s not a well constructed sentence. And I’ve gone through this sentence many times. And I’ve gone through this sentence with Ted Eisenberg many times. This is a topic of our discussion. One of the topics of our discussions. And it’s not clear from this sentence that that’s the appropriate interpretation. * * * However, the footnote, because it’s so poorly written, it obscures what he actually did. And then I think it opens up alternative interpretations.

Q. And if we can pursue that for a moment. If you look at other tables for other occupational titles, or exposure related variables, is it true that every time that Mr. Park reports on that MOR age under 65, that the estimate is elevated and statistically significantly so?

A. Yes. And he uses the same footnote every time. He’s obviously cut and paste that footnote every single time, down to the punctuation is exactly the same. And I would agree that if you look for example at table 4, the mortality odds ratios are elevated in that manner for Parkinson’s Disease, with reference to farming, with reference to pesticides, and with reference to farmers excluding horticultural deaths.

Deposition testimony of Elan Louis, at p. 401-04, in Steele v. A. O. Smith Corp., no. 1:03 CV-17000, MDL 1535 (Jan. 18, 2007). Other less qualified, or less honest expert witnesses on the plaintiffs’ side were content to cite Park (2005) as support for their causal opinions.

Meir Stampfer

The empathetic MDL trial judge denied the plaintiffs’ request for class certification in Steele, but individual personal injury cases continued to be litigated. Steele v. A.O. Smith Corp., 245 F.R.D. 279 (N.D. Ohio 2007) (denying class certification); In re Welding Fume Prods. Liab. Litig., No. 1:03-CV-17000, MDL 1535, 2008 WL 3166309 (N.D. Ohio Aug. 4, 2008) (striking pendent state-law class action claims).

Although Elan Louis was honest enough to acknowledge his own confusion about the Park paper, other expert witnesses continued to rely upon it, and plaintiffs’ counsel continued to cite the paper in their briefs and to use the apparently elevated point estimate for welders in their cross-examinations of defense expert witnesses. With the NCHS data in hand (on a DVD), defense counsel returned to Meir Stampfer, who had helped them unravel Martin Wells’s litigation analysis. The question for Professor Stampfer was whether Park’s reported point estimate for PD mortality odds ratio was truly a comparison of welders versus non-welders, or whether it was some uninformative internal comparison of younger welders versus older welders.

The one certainty available to the defense was that it had the same dataset that had been used by Martin Wells in the earlier litigation analysis, and now by Robert Park and his colleagues in their published analysis. Using the NCHS dataset, and Park’s definition of a welder or a welding tradesman, Professor Stampfer calculated PD mortality odds ratios for each definition, as well as for each definition for deaths under age 65. None of these analyses yielded statistically significant associations. Park’s curious results could not be replicated from the NCHS dataset.

For welders, the overall PD mortality odds ratio (MOR) was 0.85 (95% CI, 0.77–0.94), for years 1985 through 1999, in the NCHS dataset. If the definition of welders was expanded to include welding tradesmen, as used by Robert Park, the MOR was 0.83 (95% CI, 0.78–0.88) for all years available in the NCHS dataset.

When Stampfer conducted an age-restricted analysis, which properly compared welders or welding tradesmen with non-welding tradesmen, with death under age 65, he similarly obtained no associations for PD MOR. For the years 1985-1991, death under 65 from PD, Stampfer found MORs of 0.99 (95% CI, 0.44–2.22) for just welders, and 0.83 (95% CI, 0.48–1.44) for all welding tradesmen.

And for 1992-1999, the years used by Park (2005), and similar to the date range used by Martin Wells, for PD deaths at under age 65, for welders only, Stampfer found a MOR of 1.44 (95% CI, 0.79–2.62), and for all welding tradesmen, 1.20 (95% CI, 0.79–1.84).
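Whether each of these estimates is nominally significant can be read straight off its interval: a 95% confidence interval for a ratio that includes 1.0 is not significant at the two-sided 0.05 level. A minimal Python sketch follows, using only the figures quoted in this post; the row labels and the helper name `classify` are mine, added for illustration.

```python
def classify(lower: float, upper: float) -> str:
    """Say where a ratio's confidence interval sits relative to the null value 1.0."""
    if lower > 1.0:
        return "entirely above 1.0"
    if upper < 1.0:
        return "entirely below 1.0"
    return "includes 1.0"


# (label, MOR, lower 95% bound, upper 95% bound), as quoted in the text above
reported = [
    ("Stampfer, welders, all ages, 1985-1999", 0.85, 0.77, 0.94),
    ("Stampfer, welding tradesmen, all ages", 0.83, 0.78, 0.88),
    ("Stampfer, welders, under 65, 1985-1991", 0.99, 0.44, 2.22),
    ("Stampfer, welding tradesmen, under 65, 1985-1991", 0.83, 0.48, 1.44),
    ("Stampfer, welders, under 65, 1992-1999", 1.44, 0.79, 2.62),
    ("Stampfer, welding tradesmen, under 65, 1992-1999", 1.20, 0.79, 1.84),
    ("Park (2005), under-65 term", 1.77, 1.08, 2.75),
]

for label, mor, lower, upper in reported:
    print(f"{label}: MOR {mor:.2f} ({lower:.2f}-{upper:.2f}) -> {classify(lower, upper)}")
```

On these figures, every age-restricted interval from Stampfer's reanalysis includes 1.0, the two all-ages intervals lie wholly below 1.0, and only Park's reported under-65 estimate sits wholly above it.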

None of Park’s slicing, dicing, and subgrouping of welding and PD results could be replicated. Although Dr. Stampfer submitted a report in Steele, there remained the problem that Park (2005) was a peer-reviewed paper, and that plaintiffs’ counsel, expert witnesses, and other published papers were citing it for its claimed results and errant discussion. The defense asked Dr. Stampfer whether the “least publishable unit” had been achieved, and Stampfer reluctantly agreed. He wrote up his analysis, and published it in 2009, with an appropriate disclosure⁶. Meir J. Stampfer, “Welding Occupations and Mortality from Parkinson’s Disease and Other Neurodegenerative Diseases Among United States Men, 1985–1999,” 6 J. Occup. & Envt’l Hygiene 267 (2009).

Professor Stampfer’s paper may not be the most important contribution to the epidemiology of Parkinson’s disease, but it corrected the distortions and misrepresentations of data in Robert Park’s paper. His paper has since been cited by well-known researchers in support of their conclusion that there is no association between welding and Parkinson’s disease⁷. Park’s paper has been criticized on PubPeer, with no rebuttal⁸.

Almost comically, Park has cited Stampfer’s study tendentiously for a claim that there is a healthy worker bias present in the available epidemiology of welding and PD, without noting, or responding to, the devastating criticism of his own Park (2005) work:

“For a mortality study of neurodegenerative disease deaths in the United States during 1985 – 1999, Stampfer [61] used the Cause of Death database of the US National Center for Health Statistics and observed adjusted mortality odds ratios for PD of 0.85 (95% CI, 0.77 – 0.94) and 0.83 (95% CI, 0.78 – 0.88) in welders, using two definitions of welding occupations [61]. This supports the presence of a significant HWE [healthy worker effect] among welders. An even stronger effect was observed in welders for motor neuron disease (amyotrophic lateral sclerosis, OR 0.71, 95% CI, 0.56 – 0.89), a chronic condition that clearly would affect welders’ ability to work.”

Robert M. Park, “Neurobehavioral Deficits and Parkinsonism in Occupations with Manganese Exposure: A Review of Methodological Issues in the Epidemiological Literature,” 4 Safety & Health at Work 123, 126 (2013). Amyotrophic lateral sclerosis has a sudden onset, usually in middle age, without any real prodromal signs or symptoms, which would keep a young man from entering welding as a trade. Just shows you can get any opinion published in a peer-reviewed journal, somewhere. Stampfer’s paper, along with Mortimer’s meta-analysis, helped put the kibosh on welding fume litigation.

Addendum

A few weeks ago, the Sixth Circuit affirmed the dismissal of a class action that was attempted based upon claims of environmental manganese exposure. Abrams v. Nucor Steel Marion, Inc., Case No. 3:13 CV 137, 2015 WL 6872511 (N. D. Ohio Nov. 9, 2015) (finding testimony of neurologist Jonathan Rutchik to be nugatory, and excluding his proffered opinions), aff’d, 2017 U.S. App. LEXIS 9323 (6th Cir. May 25, 2017). Class plaintiffs employed one of the regulars, Jonathan Rutchik, from the welding fume parkinsonism litigation.

1 See “Samuel Tarry’s Protreptic for Litigation-Sponsored Publications” (July 9, 2017).

2 Samuel L. Tarry, Jr., “Can Litigation-Generated Science Promote Public Health?” 33 Am. J. Trial Advocacy 315 (2009).

3 Ostensibly, but not really. Robert M. Park was an employee of NIOSH, but he had spent most of his career working as an employee for the United Autoworkers labor union. The paper acknowledged help from Ed Baker, David Savitz, and Kyle Steenland. Baker is a colleague and associate of B.S. Levy, who was an expert witness for plaintiffs in the welding fume litigation, as well as many others. The article was published in the “red” journal, the American Journal of Industrial Medicine.

4 The welding tradesmen included in the analyses were welders and cutters, boilermakers, structural metal workers, millwrights, plumbers, pipefitters, and steamfitters. Robert M. Park, Paul A. Schulte, Joseph D. Bowman, James T. Walker, Stephen C. Bondy, Michael G. Yost, Jennifer A. Touchstone, and Mustafa Dosemeci, “Potential Occupational Risks for Neurodegenerative Diseases,” 48 Am. J. Ind. Med. 63, 65a, ¶2 (2005).

5 Id.

6 “The project was supported in part through a consulting agreement with a group of manufacturers of welding consumables who had no role in the analysis, or in preparing this report, did not see any draft of this manuscript prior to submission for publication, and had no control over any aspect of the work or its publication.” Stampfer, at 272.

7 Karin Wirdefeldt, Hans-Olov Adami, Philip Cole, Dimitrios Trichopoulos, and Jack Mandel, “Epidemiology and etiology of Parkinson’s disease: a review of the evidence,” 26 Eur. J. Epidemiol. S1 (2011).

8 The criticisms can be found at <https://pubpeer.com/publications/798F9D98B5D2E5A832136C0A4AD261>, last visited on July 10, 2017.


Slemp Trial Part 3 – The Defense Expert Witness – Huh

July 9th, 2017

On June 19, 2017, the U.S. Supreme Court curtailed the predatory jurisdictional practices of the lawsuit industry in seeking out favorable trial courts with no meaningful connection to their claims. See Bristol-Myers Squibb Co. v. Superior Court, No. 16-466, 582 U.S. ___ (June 19, 2017). The same day, the defendants in a pending talc cancer case in St. Louis filed a motion for a mistrial. Swann v. Johnson & Johnson, Case No. 1422-CC09326-01, Division 10, Circuit Court of St. Louis City, Missouri. Missouri law may protect St. Louis judges from having to get involved in gatekeeping scientific expert witness testimony, but when the Supreme Court speaks to the requirements of the federal constitution’s due process clause, even St. Louis judges must listen. Bristol-Myers held that the constitution limits the practice of suing defendants in jurisdictions unrelated to the asserted claims, and the St. Louis trial judge, Judge Rex Burlison, granted the requested mistrial in Swann. As a result, there will not be another test of plaintiffs’ claims that talc causes ovarian cancer, and the previous Slemp case will remain an important event to interpret.

The Sole Defense Expert Witness

Previous posts¹ addressed some of the big picture issues as well as the opening statements in Slemp. This post turns to the defense expert witness, Dr. Walter Huh, in an attempt to understand how and why the jury returned its egregious verdict. Juries can, of course, act out of sympathy, passion, or prejudice, but their verdicts are usually black boxes when it comes to discerning their motivations and analyses. A more interesting and fruitful exercise is to ask whether a reasonable jury could have reached the conclusion in the case. The value of this exercise is limited, however. A reasonable jury should have reasonable expertise in the subject matter, and in our civil litigation system, this premise is usually not satisfied.

Dr. Walter Huh, a gynecologic oncologist, was the only expert witness who testified for the defense. As the only defense witness, and as a clinician, Huh had a terrible burden. He had to meet and rebut testimony outside his fields of expertise, including pathology, toxicology, and most important, epidemiology. Huh was by all measures well-spoken, articulate, and well-qualified as a clinical gynecologic oncologist. Defense counsel and Huh, however, tried to make the case that Huh was qualified to speak to all issues in the case. The initial examination on qualifications was long and tedious, and seemed to overcompensate for the obvious gaps in Dr. Huh’s qualifications. In my view, the defense never presented much in the way of credible explanations about where Huh had obtained the training, experience, and expertise to weigh in on areas outside clinical medicine. Ultimately, the cross-examination is the crucial test of whether this strategy of one witness for all subjects can hold. The cross-examination of Dr. Huh, however, exposed the gaps in qualifications, and more important, Dr. Huh made substantive errors that were unnecessary and unhelpful to the defense of the case.

The defense pitched the notion that Dr. Huh somehow trumped all the expert witnesses called by plaintiff because Huh was the “only physician heard by the jury” in court. Somehow, I wonder whether the jury was so naïve. It seems like a poor strategic choice to hope that the biases of the jury in favor of the omniscience of physicians (over scientists) will carry the day.

There were, to be sure, some difficult clinical issues, which Dr. Huh could address within his competence. Cancer causation itself is a multi-disciplinary science, but in the case of a disease, such as ovarian cancer, with a substantial base-rate in the general population and without any biomarker of a causal pathway between exposure and outcome, epidemiology will be a necessary tool. Huh was thus forced to “play” on the plaintiffs’ expert witnesses’ home court, much to his detriment.

General Causation

Don’t confuse causation with links, association, and risk factors

The defense strong point is that virtually no one, other than the plaintiffs’ expert witnesses themselves, and only in the context of litigation, has causally attributed ovarian cancer to talc exposure. There are, however, some ways that this point can be dulled in the rough and tumble of trial. Lawyers, like journalists, and even some imprecise scientists, use a variety of terms such as “risk,” “risk factor,” “increased risk,” and “link,” for something less than causation. Sometimes these terms are used deliberately to try to pass off something less than causation as causation; sometimes the speaker is confused; and sometimes the speaker is simply being imprecise. It seems incumbent upon the defense to explain the differences between and among these terms, and to stick with a consistent, appropriate terminology.

One instance in which Dr. Huh took his eye off the “causation ball,” arose when plaintiffs’ counsel showed him a study conclusion that talc use among African American women was statistically significantly associated with ovarian cancer. Huh answered, non-responsively, “I disagree with the concept that talc causes ovarian cancer.” The study, however, did not advance a causal conclusion and there was no reason to suggest to the jury that he disagreed with anything in the paper; rather it was the opportunity to repeat that association is not causation, and the article did not contradict anything he had said.

Similarly, Dr. Huh was confronted with several precautionary recommendations that women “may” benefit from avoiding talc. Remarkably, Huh simply disagreed, rather than making the obvious point that the recommendation was not stated as something that would in fact benefit women.

When witnesses answer long, involved questions, with a simple “yes,” then they may have made every implied proposition in the questions into facts in the case. In an exchange between plaintiff’s counsel and Huh, counsel asked whether a textbook listed talc as a risk factor.² Huh struggled to disagree, which disagreement tended to impair his credibility, for disagreeing with a textbook he acknowledged using and relying upon. Disagreement, however, was not necessary; the text merely stated that “talc … may increase risk.” If “increased risk” had been defined and explained as something substantially below causation, then Huh could have answered simply “yes, but that quotation does not support a causal claim.”

At another point, plaintiffs’ counsel, realizing that none of the individual studies reached a causal conclusion, asked whether it would be improper for a single study to give such a conclusion. It was a good question, with a solid premise, but Dr. Huh missed the opportunity for explaining that the authors of all the various individual studies had not conducted systematic reviews that advanced the causal conclusion that plaintiffs would need. Certainly, the authors of individual studies were not prohibited from taking the next step to advance a causal conclusion in a separate paper with the appropriate analysis.

Bradford Hill’s Factors

Dr. Huh’s testimony provided the jury with some understanding of Sir Austin Bradford Hill’s nine factors, but Dr. Huh would have helped himself by acknowledging several important points. First, as Hill explained, the nine factors are invoked only after there is a clear-cut (valid) association beyond that which we care to attribute to chance. Second, establishing all nine factors is not necessary. Third, some of the nine factors are more important than others.

Study validity

In the epidemiology of talc and ovarian cancer, statistical power and significance are not the crucial issues; study validity is. It should have been the plaintiff’s burden to rule out bias, and confounding, as well as chance. Hours had passed in the defense examination of Dr. Huh before study validity was raised, and it was never comprehensively explained. Dr. Huh explained recall bias as a particular problem of case-control studies, which made up the bulk of evidence upon which plaintiffs’ expert witnesses relied. A more sophisticated witness on epidemiology might well have explained that the selection of controls can be a serious problem without obvious solutions in case-control studies.

On cross-examination, plaintiffs’ counsel, citing Kenneth Rothman, asked whether misclassification bias always yields a lower risk ratio. Dr. Huh resisted with “not necessarily,” but failed to dig into whether the conditions for rejecting plaintiffs’ generalization (such as polychotomous exposure classification) obtained in the relevant cohort studies. More importantly, Huh missed the opportunity to point out that the most recent, most sophisticated cohort study reported a risk ratio below 1.0, which on the plaintiffs’ theory about misclassification would mean that the true risk ratio was even further below 1.0 than the value reported in the published paper. Again, a qualified epidemiologist would not have failed to make these points.

Dr. Huh never read the testimony of one of the plaintiffs’ expert witnesses on epidemiology, Graham Colditz, and offered no specific rebuttal of Colditz’s opinions. With respect to the other of plaintiffs’ epidemiology expert witnesses, Dr. Cramer, Huh criticized him for engaging in post-hoc secondary analyses and asserted that Cramer’s meta-analysis could not be validated. Huh never attempted to validate the meta-analysis himself; nor did Huh offer his own meta-analysis or explain why a meta-analysis of seriously biased studies was unhelpful. These omissions substantially blunted Huh’s criticisms.

On the issue of study validity, Dr. Huh seemed to intimate that cohort studies were necessarily better than case-control studies because of recall bias, but also because there are more women involved in the cohort studies than in the case-control studies. The latter point, although arithmetically correct, is epidemiologically bogus. There are often fewer ovarian cancer cases in the cohort study, especially if the cohort is not followed for a very long time. The true test comes in the statistical precision of the point estimate, relative risk or odds ratio, in the different types of study. The case-control studies often generate much more precise point estimates, as seen from their narrower confidence intervals. Of course, the real issue is not precision here, but accuracy. Still, Dr. Huh appeared to have endorsed defense counsel’s misleading argument about study size, a consideration that will not help the defense when the contentions of the parties are heard in scientific fora.
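The point about precision can be illustrated with a rough sketch. Every count below is hypothetical and chosen only for illustration; none comes from the talc studies. The sketch uses the standard large-sample standard errors for the log risk ratio (cohort) and log odds ratio (case-control), and shows how a cohort with 60,000 women but few cases can yield a wider 95% interval than a case-control study with 2,000 women.

```python
import math

def se_log_risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    # Large-sample standard error of log(RR) in a cohort study.
    return math.sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)

def se_log_odds_ratio(a, b, c, d):
    # Large-sample standard error of log(OR) from a 2x2 table:
    # a = exposed cases, b = unexposed cases,
    # c = exposed controls, d = unexposed controls.
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

def interval(estimate, se, z=1.96):
    # Approximate 95% confidence interval on the ratio scale.
    return estimate * math.exp(-z * se), estimate * math.exp(z * se)

# HYPOTHETICAL cohort: 60,000 women, only 165 cases in total.
rr = (90 / 30000) / (75 / 30000)
cohort_ci = interval(rr, se_log_risk_ratio(90, 30000, 75, 30000))

# HYPOTHETICAL case-control study: 2,000 women, 1,000 of them cases.
odds_ratio = (400 * 643) / (600 * 357)
case_control_ci = interval(odds_ratio, se_log_odds_ratio(400, 600, 357, 643))

print("cohort RR", round(rr, 2), cohort_ci)
print("case-control OR", round(odds_ratio, 2), case_control_ci)
```

Precision is driven largely by the number of cases, not by the head count of the study. Nothing in the sketch addresses bias, which remains the real issue.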

Statistical Significance

Huh appeared at times to stake out a position that if a study does not have statistical significance, then we must accept the null hypothesis. I believe that most careful scientists would reject this position. Null studies simply fail to reject the null hypothesis.

Although there seems to be no end to fallacious reasoning by plaintiffs, there is a particular defense fallacy seen in some cases that turn on epidemiology. What if we had 10 studies that each found an elevated risk ratio of 1.5, with two-tailed 95 percent confidence intervals of 0.92 – 2.18, or so? Can the defense claim victory because no study is statistically significant? Huh seemed to suggest so, but this is clearly wrong. Of course, we might ask why no one conducted the 11th study, with sufficient power to detect a risk ratio of 1.5, at the desired level of significance. But parties go to trial with the evidence they have, not what they might want to have. On the above 10-study hypothetical, a meta-analysis might well be done (assuming the studies could be appropriately included), and the summary risk ratio for all studies would be 1.5, and highly statistically significant.
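The 10-study hypothetical can be worked through with a minimal sketch of a fixed-effect, inverse-variance meta-analysis. The assumptions are mine: the studies are independent and unbiased, each standard error is backed out of the reported interval on the log scale (the hypothetical interval is only roughly symmetric there, so the result is approximate), and the function name is made up for this illustration.

```python
import math
from statistics import NormalDist

def pool_fixed_effect(studies, z_crit=1.96):
    # studies: list of (risk_ratio, ci_low, ci_high) tuples with 95% intervals.
    total_weight = 0.0
    weighted_sum = 0.0
    for rr, low, high in studies:
        # Back out the standard error of log(RR) from the interval width.
        se = (math.log(high) - math.log(low)) / (2 * z_crit)
        weight = 1 / se ** 2
        total_weight += weight
        weighted_sum += weight * math.log(rr)
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    z = pooled_log / pooled_se
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z_crit * pooled_se),
        math.exp(pooled_log + z_crit * pooled_se),
        p_two_sided,
    )

# Ten identical hypothetical studies: RR 1.5, 95% CI 0.92 to 2.18.
summary_rr, summary_low, summary_high, summary_p = pool_fixed_effect(
    [(1.5, 0.92, 2.18)] * 10
)
print(summary_rr, summary_low, summary_high, summary_p)
```

On these assumptions, the summary interval excludes 1.0 by a wide margin even though every individual interval includes it. The pooling says nothing, of course, about whether the ten studies share a common bias.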

On the question of talc and ovarian cancer, there were several meta-analyses at issue, and so the role of statistical significance of individual studies was less relevant. The real issue was study validity. This issue was muddled by assertions that risk ratios such as 2.05 (95% CI, 0.94 – 4.47) were “chance findings.” Chance may not have been ruled out, but the defense can hardly assert that chance and chance alone produced the findings; otherwise, it will be sunk by the available meta-analyses.

Strength of Association

The risk ratios involved in most of the talc ovarian cancer studies are small, and that is obviously an important factor to consider in evaluating the studies for causal conclusions. Still, it is also obvious that sometimes real causal associations can be small in magnitude. Dr. Huh could and should have conceded in direct that small associations can be causal, but explained that validity concerns about the studies that show small associations become critical. Examples would have helped, such as the body of observational epidemiology that suggested that estrogen replacement therapy in post-menopausal women provided cardiovascular benefit, only to be reversed by higher quality clinical trials. Similarly, observational studies suggested that lung cancer rates were reduced by Vitamin A intake, but again clinical trial data showed the opposite.

Consistency of Studies

Are studies that have statistically non-significant risk ratios above 1.0 inconsistent with studies that find statistically significant elevated risk ratios? At several points, Huh appears to say that such a group of studies is inconsistent, but that is not necessarily so. Huh’s assertion provoked a good bit of harmful cross-examination, in which he seemed to resist the notion that meta-analysis could help answer whether a group of studies is statistically consistent. Huh could have conceded the point readily but emphasized that a group of biased studies would give only a consistently biased estimate of association.

Authority

One of the cheapest tricks in the trial lawyers’ briefcase is the “learned treatise” exception to the rule against hearsay.³ The lawyer sets up witnesses in deposition by obtaining their agreement that a particular author or text is “authoritative.” Then at trial, the lawyer confronts the witnesses with a snippet of text, which appears to disagree with the expert witnesses’ testimony. Under the rule, in federal and in some state courts, the jury may accept the snippet or sound bite as true, and also accept that the witnesses do not know what they are talking about when they disagree with the “authoritative” text.

The rule is problematic and should have been retired long ago. Since 1663, the Royal Society has sported the motto: “Nullius in verba.” Disputes in science are resolved with data, from high-quality, reproducible experimental or observational studies, not with appeals to the prestige of the speaker. And yet, we lawyers will try, and sometimes succeed, with this greasy kidstuff approach to cross-examination. Indeed, when there is an opportunity to use it, we may even have an obligation to use so-called learned treatises to advance our clients’ cause.

In the Slemp trial, the plaintiff’s counsel apparently had gotten a concession from Dr. Huh that plaintiff’s expert witness on epidemiology, Dr. Daniel Cramer, was “credible and authoritative.” Plaintiff’s counsel then used Huh’s disagreement with Cramer’s testimony as well as his published papers to undermine Huh’s credibility.

This attack on Huh was a self-inflicted wound. The proper response to a request for a concession that someone or some publication is “authoritative” is that this word really has no meaning in science. “Nullius in verba,” and all that. Sure, someone can be a respected researcher based upon past success, but past performance is no guarantee of future success. Look at Linus Pauling and Vitamin C. The truth of a conclusion rests on the data and the soundness of the inferences therefrom.

Collateral Attacks

The plaintiff’s lawyer in Slemp was particularly adept at another propaganda routine – attacking the witness on the stand for having cited another witness, whose credibility in turn was attacked by someone else, even if that someone else was a crackpot. Senator McCarthy (Joseph not Eugene) would have been proud of plaintiff’s lawyer’s use of the scurrilous attack on Paolo Boffetta for his views on EMF and cancer, as set out in Microwave News, a fringe publication that advances EMF-cancer claims. Now, the claim that non-ionizing radiation causes cancer has not met with much if any acceptance, and Boffetta’s criticisms of the claims are hardly unique or unsupported. Yet plaintiff’s counsel used this throw-away publication’s characterization of Boffetta as “the devil’s advocate,” to impugn Boffetta’s publications and opinions on EMF, as well as Huh’s opinions that relied upon some aspect of Boffetta’s work on talc. Not that “authority” counts, but Boffetta is the Associate Director for Population Sciences of the Tisch Cancer Institute and Chief of the Division of Cancer Prevention and Control of the Department of Oncological Sciences, at the Mt. Sinai School of Medicine in New York. He has published many epidemiologic studies, as well as a textbook on the epidemiology of cancer.⁴

The author from the Microwave News was never identified, but almost certainly lacks the training, experience, and expertise of Paolo Boffetta. The point, however, is that this cross-examination was extremely collateral, had nothing to do with Huh, or the issues in the Slemp case, and warranted an objection and admonition to plaintiff’s counsel for the scurrilous attack. An alert trial judge, who cared about substantial justice, might have shut down this frivolous, highly collateral attack, sua sponte. When Huh was confronted with the “devil’s advocate” characterization, he responded “OK,” seemingly affirming the premise of the question.

Specific Causation

Dr. Huh and the talc defendants took the position that epidemiology never informs assessment of individual causation. This opinion is hard to sustain. Elevated risk ratios reflect more individual cases than expected in a sample. Epidemiologic models are used to make individual predictions of risk for purposes of clinical monitoring and treatment. Population-based statistics are used to define the range of normal function and to assess individuals as impaired or disabled, or not.

At one point in the cross-examination, plaintiffs’ counsel suggested the irrelevance of the size of relative risk by asking whether Dr. Huh would agree that a 20% increased risk was not small if you are someone who has gotten the disease. Huh answered “Well, if it is a real association.” This answer fails on several levels. First, it conflates “increased risk” and “real association” with causation. The point was for Huh to explain that an increased risk, if statistically significant, may be an association, but it is not necessarily causal.

Second, and equally important, Huh missed the opportunity to explain that even if the 20% increased risk was real and causal, it would still mean that an individual patient’s ovarian cancer was most likely not caused by the exposure. See David H. Schwartz, “The Importance of Attributable Risk in Toxic Tort Litigation,” (July 5, 2017).
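The arithmetic behind this second point can be sketched with the usual attributable fraction among the exposed, (RR - 1) / RR. The sketch assumes the risk ratio is valid, causal, and applies uniformly across exposed individuals, and it ignores the possibility that exposure merely accelerates onset. These are strong assumptions, and the function name is mine.

```python
def attributable_fraction_exposed(risk_ratio):
    # Share of exposed cases in excess of the expected number: (RR - 1) / RR.
    # Treating this as a probability of causation for an individual requires
    # the ASSUMPTIONS stated above (valid, causal, uniform risk ratio).
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    return (risk_ratio - 1) / risk_ratio

# A 20% increased risk corresponds to a risk ratio of 1.2.
fraction_at_1_2 = attributable_fraction_exposed(1.2)
# A doubling of risk is the point at which the fraction reaches one half.
fraction_at_2_0 = attributable_fraction_exposed(2.0)
print(fraction_at_1_2, fraction_at_2_0)
```

With a risk ratio of 1.2, about one exposed case in six is in excess of expectation, which is why a real and causal 20% increase still leaves any given case more likely than not unrelated to the exposure.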

Conclusion

The defense strategy of eliciting all their scientific and medical testimony from a single witness was dangerous at best. As good a clinician as Dr. Huh appears to be, the defense strategy did not bode well for a good outcome when many of the scientific issues were outside of Dr. Huh’s expertise.

1 “The Slemp Case, Part I – Jury Verdict for Plaintiff – 10 Initial Observations” (May 13, 2017); “The Slemp Case, Part 2 – Openings” (June 10, 2017).

2 Jonathan S. Berek & Neville F. Hacker, Gynecologic Oncology at 231 (6th ed. 2014).

3 See “Trust-Me Rules of Evidence” (Oct. 18, 2012).

4 See, e.g., Paolo Boffetta, Stefania Boccia, Carlo La Vecchia, A Quick Guide to Cancer Epidemiology (2014).


Traditional, Frequentist Statistics Still Hegemonic

March 25th, 2017

The Defense Fallacy

In civil actions, defendants and their legal counsel sometimes argue that the absence of statistical significance across multiple studies requires a verdict of “no cause” for the defense. This argument is fallacious, as can be seen where there are many studies, say eight or nine, which all consistently find elevated risk ratios, but with p-values slightly higher than 5%. The probability that eight studies, free of bias, would consistently find an elevated risk ratio, regardless of the individual studies’ p-values, is itself very small. If the studies were amenable to meta-analysis, the summary estimate of the risk ratio would itself likely be highly statistically significant in this hypothetical.
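How small is “very small”? A minimal sketch, assuming the eight studies are independent and unbiased, and that under the null hypothesis each is equally likely to land above or below a risk ratio of 1.0, treats the count of elevated results as a binomial variable (a simple sign test). The function name is mine.

```python
from math import comb

def prob_at_least_k_elevated(n_studies, k, p_elevated=0.5):
    # Binomial upper tail: chance that k or more of n independent studies
    # report a risk ratio above 1.0 when each does so with probability
    # p_elevated (0.5 under the ASSUMED null of no association, no bias).
    return sum(
        comb(n_studies, i) * p_elevated ** i * (1 - p_elevated) ** (n_studies - i)
        for i in range(k, n_studies + 1)
    )

# All eight of eight studies elevated, under the null.
p_all_eight = prob_at_least_k_elevated(8, 8)
print(p_all_eight)  # 0.5 ** 8, a little under 0.4%
```

The calculation ignores the size of each study and the magnitude of each estimate, which is why a meta-analysis, where appropriate, is the better tool. It also collapses if the studies share a common bias.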

The Plaintiffs’ Fallacy

The plaintiffs’ fallacy derives from instances, such as the hypothetical one above, in which statistical significance, taken as a property of individual studies, is lacking. Even though we can hypothesize such instances, plaintiffs fallaciously extrapolate from them to the conclusion that statistical significance, or any other measure of sampling estimate precision, is unnecessary to support a conclusion of causation.

In courtroom proceedings, epidemiologist Kenneth Rothman is frequently cited by plaintiffs as having shown or argued that statistical significance is unimportant. For instance, in the Zoloft multi-district birth defects litigation, plaintiffs argued in a motion for reconsideration of the exclusion of their epidemiologic witness that the trial court had failed to give appropriate weight to the Supreme Court’s decision in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27 (2011), as well as to the Third Circuit’s invocation of the so-called “Rothman” approach in a Bendectin birth defects case, DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941 (3d Cir. 1990). According to the plaintiffs’ argument, their excluded epidemiologic witness, Dr. Anick Bérard, had used this approach in arriving at her novel conclusion that sertraline causes virtually every kind of birth defect.

The Zoloft plaintiffs did not call Rothman as a witness; nor did they even present an expert witness to explain what Rothman’s arguments were. Instead, the plaintiffs’ counsel sneaked some references and vague conclusions into their cross-examinations of defense expert witnesses, and submitted snippets from Rothman’s textbook, Modern Epidemiology.

If plaintiffs had called Dr. Rothman to testify, he would have probably insisted that statistical significance is not a criterion for causation. Such insistence is not as helpful to plaintiffs in cases such as the Zoloft birth defects cases as their lawyers might have thought or hoped. Consider for instance the cases in which causal inferences are arrived at without formal statistical analysis. These instances are often not relevant to mass tort litigation that involves prevalent exposure and a prevalent outcome.

Rothman also would have likely insisted that consideration of random variation and bias is essential to the assessment of causation, and that many apparently or nominally statistically significant associations do not and cannot support valid inferences of causation. Furthermore, he might have been given the opportunity to explain that his criticisms of significance testing are as much directed to the creation of false positive as to false negative rates in observational epidemiology. In keeping with his publications, Rothman would have challenged strict significance testing with p-values as opposed to the use of sample statistical estimates in conjunction with confidence intervals. The irony of the Zoloft case and many other litigations was that the defense was not using significance testing in the way that Rothman had criticized; rather the plaintiffs were over-endorsing statistical significance that was nominal, plagued by multi-testing, and inconsistent.

Judge Rufe, who presided over the Zoloft MDL, pointed out that the Third Circuit in DeLuca had never affirmatively endorsed Professor Rothman’s “approach,” but had reversed and remanded the Bendectin case to the district court for a hearing under Rule 702:

“By directing such an overall evaluation, however, we do not mean to reject at this point Merrell Dow’s contention that a showing of a .05 level of statistical significance should be a threshold requirement for any statistical analysis concluding that Bendectin is a teratogen regardless of the presence of other indicia of reliability. That contention will need to be addressed on remand. The root issue it poses is what risk of what type of error the judicial system is willing to tolerate. This is not an easy issue to resolve and one possible resolution is a conclusion that the system should not tolerate any expert opinion rooted in statistical analysis where the results of the underlying studies are not significant at a .05 level.”

2015 WL 314149, at *4 (quoting from DeLuca, 911 F.2d at 955). And in DeLuca, after remand, the district court excluded the DeLuca plaintiffs’ expert witnesses, and granted summary judgment, based upon the dubious methods employed by plaintiffs’ expert witnesses (including the infamous Dr. Done, and Shanna Swan), in cherry picking data, recalculating risk ratios in published studies, and ignoring bias and confounding in studies. On subsequent appeal, the Third Circuit affirmed the judgment for Merrell Dow. DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993).

Judge Rufe similarly rebuffed the plaintiffs’ use of the Rothman approach, their reliance upon Matrixx, and their attempt to banish consideration of random error in the interpretation of epidemiologic studies. In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2015 WL 314149 (E.D. Pa. Jan. 23, 2015) (Rufe, J.) (denying PSC’s motion for reconsideration). See “Zoloft MDL Relieves Matrixx Depression” (Feb. 4, 2015).

Some Statisticians’ Errors

Recently, Dr. Rothman and three other epidemiologists set out to track the change, over time, from 1975 to 2014, of the use of various statistical methodologies. Andreas Stang, Markus Deckert, Charles Poole & Kenneth J. Rothman, “Statistical inference in abstracts of major medical and epidemiology journals 1975–2014: a systematic review,” 32 Eur. J. Epidem. 21 (2017) [cited below as Stang]. They made clear that their preferred methodological approach was to avoid the strictly dichotomous null hypothesis significance testing (NHST), which has evolved from Fisher’s significance testing (ST) and Neyman’s null hypothesis testing (NHT), in favor of the use of estimation with confidence intervals (CI). The authors conducted a meta-study, that is, a study of studies, to track the trends in use of NHST, ST, NHT and CI reporting in the major bio-medical journals.

Unfortunately, the authors limited their data and analysis to abstracts, which makes their results very likely misleading and incomplete. Even when abstracts reported using so-called CI-only approaches, the authors may well have reasoned that point estimates with CIs that spanned no association were “non-significant.” Similarly, authors who found elevated risk ratios with very wide confidence intervals may well have properly acknowledged that their study did not provide credible evidence of an association. See W. Douglas Thompson, “Statistical criteria in the interpretation of epidemiologic data,” 77 Am. J. Public Health 191, 191 (1987) (discussing the over-interpretation of skimpy data).

Rothman and colleagues found that while a few epidemiologic journals had a rising prevalence of CI-only reports in abstracts, for many biomedical journals the NHST approach remained more common. Interestingly, at three of the major clinical medical journals, the Journal of the American Medical Association, the New England Journal of Medicine, and Lancet, the NHST has prevailed over the almost four decades of observation.

The clear implication of Rothman’s meta-study is that consideration of significance probability, whether or not treated as a dichotomous outcome, and whether or not treated as a p-value or a point estimate with a confidence interval, is absolutely critical to how biomedical research is conducted, analyzed, and reported. In Rothman’s words:

“Despite the many cautions, NHST remains one of the most prevalent statistical procedures in the biomedical literature.”

Stang at 22. See also David Chavalarias, Joshua David Wallach, Alvin Ho Ting Li & John P. A. Ioannidis, “Evolution of Reporting P Values in the Biomedical Literature, 1990-2015,” 315 J. Am. Med. Ass’n 1141 (2016) (noting the absence of the use of Bayes’ factors, among other techniques).

There is one aspect to the Stang article that is almost Trump-like in its citing to an inappropriate, unknowledgeable source and then treating its author as having meaningful knowledge of the subject. As part of their rhetorical goals, Stang and colleagues declare that:

“there are some indications that it has begun to create a movement away from strict adherence to NHT, if not to ST as well. For instance, in the Matrixx decision in 2011, the U.S. Supreme Court unanimously ruled that admissible evidence of causality does not have to be statistically significant [12].”

Stang at 22. Whence comes this claim? Footnote 12 takes us to what could well be fake news of a legal holding, an article by a statistician about a legal case:

Joseph L. Gastwirth, “Statistical considerations support the Supreme Court’s decision in Matrixx Initiatives v. Siracusano,” 52 Jurimetrics J. 155 (2012).

Citing a secondary source when the primary source is readily available, and what is at issue, seems like poor scholarship. Professor Gastwirth is a statistician, not a lawyer, and his exegesis of the Supreme Court’s decision is wildly off target. As any first year law student could discern, the Matrixx case could not have been about the admissibility of evidence because the case had been dismissed on the pleadings, and no evidence had ever been admitted or excluded. The only issue on appeal was the adequacy of the allegations, not the admissibility of evidence.

Although the Court managed to muddle its analysis by wandering off into dicta about causation, the holding of the case is that alleging causation was not required to plead a case of materiality for a securities fraud case. Having dispatched causality from the case, the Court had no serious business in setting the considerations for alleging in pleadings or proving at trial the elements of causation. Indeed, the Court made it clear that its frolic and detour into causation could not be taken seriously:

“We need not consider whether the expert testimony was properly admitted in those cases [cited earlier in the opinion], and we do not attempt to define here what constitutes reliable evidence of causation.”

Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27, 131 S.Ct. 1309, 1319 (2011).

Neither “admissible” nor “admissibility” appears anywhere in the Court’s opinion, and the above quote explains that admissibility was not considered. Laughably, the Court went on to cite three cases as examples of supposed causation opinions in the absence of statistical significance. Two of the three were specific causation, differential etiology cases that involved known general causation. The third case involved a claim of birth defects from contraceptive jelly, when the plaintiffs’ expert witnesses actually relied upon statistically significant (but thoroughly flawed and invalid) associations.¹

When it comes to statistical testing, the legal world would be much improved if lawyers actually and carefully read statistics authors, and if statisticians and scientists actually read court opinions.

1 See “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 1”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 2”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 3”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 4”; “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 5”; and “Wells v. Ortho Pharmaceutical Corp. Reconsidered – Part 6”


Washington Legal Foundation’s Paper on Statistical Significance in Rule 702 Proceedings

March 13th, 2017

The Washington Legal Foundation has released a Working Paper, No. 201, by Kirby Griffis, entitled “The Role of Statistical Significance in Daubert / Rule 702 Hearings,” in its Critical Legal Issues Working Paper Series, (Mar. 2017) [cited below as Griffis]. I am a fan of many of the Foundation’s Working Papers (having written one some years ago), but this one gives me pause.

Griffis’s paper manages to avoid many of the common errors of lawyers writing about this topic, but adds little to the statistics chapter in the Reference Manual on Scientific Evidence (3d ed. 2011), and he propagates some new, unfortunate misunderstandings. On the positive side, Griffis studiously avoids the transposition fallacy in defining significance probability, and he notes that multiplicity from subgroups and multiple comparisons often undermines claims of statistical significance. Griffis gets both points right. These are woefully common errors, and they deserve the emphasis Griffis gives to them in this working paper.

On the negative side, however, Griffis falls into error on several points. Griffis helpfully narrates the Supreme Court’s evolution in Daubert and then in Joiner, but he fails to address the serious mischief and devolution introduced by the Court’s opinion in Matrixx Initiatives, Inc. v. Siracusano, 563 U.S. 27, 131 S.Ct. 1309 (2011). See Schachtman, “The Matrixx – A Comedy of Errors” (April 6, 2011); David Kaye, “Trapped in the Matrixx: The U.S. Supreme Court and the Need for Statistical Significance,” BNA Product Safety & Liability Reporter 1007 (Sept. 12, 2011). With respect to statistical practice, this Working Paper is at times wide of the mark.

Non-Significance

Although avoiding the transposition fallacy, Griffis falls into another mistake in interpreting tests of significance; he states that a non-significant result tells us that an hypothesis is “perfectly consistent with mere chance”! Griffis at 9. This is, of course, wrong, or at least seriously misleading. A failure to reject the null hypothesis does not prove the null such that we can say that the “null results” in one study were perfectly consistent with chance. The test may have lacked power to detect an “effect size” of interest. Furthermore, tests of significance cannot rule out systematic bias or confounding, and that limitation alone ensures that Griffis’s interpretation is mistaken. A null result may have resulted from bias or confounding that obscured a measurable association.

Griffis states that p-values are expressed as percentages “usually 95% or 99%, corresponding to 0.05 or 0.01,” but this states things backwards. The p-value that is pre-specified to be “significant” is a probability or percentage that is low; it is the coefficient of confidence used to construct a confidence interval that is the complement of the significance probability. Griffis at 10. An alpha, or pre-specified statistical significance level, of 5% thus corresponds to a coefficient of confidence of 95% (or 1.0 – 0.05).
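The relationship is simple enough to sketch. The following assumes a two-sided test and a normal approximation; the function names are mine, not anything found in the Griffis paper or the Reference Manual.

```python
from statistics import NormalDist

def confidence_coefficient(alpha):
    # The coefficient of confidence is the complement of the significance level.
    return 1.0 - alpha

def two_sided_critical_z(alpha):
    # Standard normal quantile that leaves alpha / 2 in each tail.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.05, 0.01):
    print(alpha, confidence_coefficient(alpha), two_sided_critical_z(alpha))
```

An alpha of 0.05 pairs with a 95% interval built from a multiplier of roughly 1.96 standard errors; an alpha of 0.01 pairs with a 99% interval and a larger multiplier.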

The Mid-p Controversy

In discussing the emerging case law, Griffis rightly points to cases that chastise Dr. Nicholas Jewell for the many liberties he has taken in various litigations as an expert witness for the lawsuit industry. One instance cited by Griffis is the Lipitor diabetes litigation, where the MDL court suggested that Jewell switched improperly from a Fisher’s exact test to a mid-p test. Griffis at 18-19. Griffis seems to agree, but as I have explained elsewhere, Fisher’s exact test generates a one-tailed measure of significance probability, and the analyst is left to one of several ways of calculating a two-tailed test. See “Lipitor Diabetes MDL’s Inexact Analysis of Fisher’s Exact Test” (April 21, 2016). The mid-p is one legitimate approach for asymmetric distributions, and is more favorable to the defense than passing off the one-tailed measure as the result of the test. The mere fact that a statistical software package does not automatically specify the mid-p for a Fisher’s exact analysis does not make invoking this measure into p-hacking or other misconduct. Doubling the attained significance probability of a particular Fisher’s exact test result is generally considered less accurate than a mid-p calculation, even though some software packages use doubling of the attained significance probability as a default. As much as we might dislike bailing Jewell out of Daubert limbo, on this one, limited point, he deserved a better hearing.
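The competing calculations can be compared in a minimal sketch. The 2x2 table is hypothetical, not data from the Lipitor litigation, and the two-sided mid-p shown here (twice the one-sided mid-p) is only one of several conventions in use. I make no claim that it reproduces any witness’s or software package’s method.

```python
from math import comb

def fisher_variants(a, b, c, d):
    # 2x2 table with fixed margins:
    #   exposed:   a events, b non-events
    #   unexposed: c events, d non-events
    row1, row2, col1 = a + b, c + d, a + c
    largest = min(row1, col1)

    def pmf(x):
        # Hypergeometric probability of x events among the exposed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)

    upper_tail = sum(pmf(x) for x in range(a, largest + 1))
    one_sided = upper_tail
    doubled = min(1.0, 2 * upper_tail)
    # Mid-p counts only half the probability of the observed table.
    mid_p_two_sided = min(1.0, 2 * (upper_tail - 0.5 * pmf(a)))
    return one_sided, mid_p_two_sided, doubled

# HYPOTHETICAL counts: 9 of 20 exposed with events, 3 of 20 unexposed.
one_sided_p, mid_p, doubled_p = fisher_variants(9, 11, 3, 17)
print(one_sided_p, mid_p, doubled_p)
```

Whatever the table, the two-sided mid-p computed this way falls between the one-sided value and the doubled value: larger than the one-tailed figure, but less conservative than simple doubling.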

Mis-Definitions

In recounting the Bendectin litigation, Griffis refers to the epidemiologic studies of birth defects and Bendectin as “experiments,” Griffis at 7, and then describes such studies as comparing “populations,” when he clearly meant “samples.” Griffis at 8.

Griffis conflates personal bias with bias as a scientific concept of systematic error in research, a confusion usually perpetuated by plaintiffs’ counsel. See Griffis at 9 (“Coins are not the only things that can be biased: scientists can be, too, as can their experimental subjects, their hypotheses, and their manipulations of the data.”) Of course, the term has multiple connotations, but too often an accusation of personal bias, such as conflict of interest, is used to avoid engaging with the merits of a study.

Relative Risks

Griffis correctly describes the measure known as “relative risk” as a determination of “the strength of a particular association.” Griffis at 10. The discussion then lapses into using a given relative risk as a measure of the likelihood that an individual with the exposure studied will develop the disease. Sometimes this general-to-specific inference is warranted, but without further analysis, it is impossible to tell whether Griffis lapsed from general to specific, deliberately or inadvertently, in describing the interpretation of relative risk.

Conclusion

Griffis is right in his chief contention that the proper planning, conduct, and interpretation of statistical tests is hugely important to judicial gatekeeping of some expert witness opinion testimony under Federal Rule of Evidence 702 (and under Rule 703, too). Judicial and lawyer aptitude in this area is low, and needs to be bolstered.


Statistical Analysis Requires an Expert Witness with Statistical Expertise

November 13th, 2016

Christina K. Connearney sued her employer, Main Line Hospitals, for age discrimination. Main Line charged Connearney with fabricating medical records, but Connearney replied that the charge was merely a pretext. Connearney v. Main Line Hospitals, Inc., Civ. Action No. 15-02730, 2016 WL 6569292 (E.D. Pa. Nov. 4, 2016) [cited as Connearney]. Connearney’s legal counsel engaged Christopher Wright, an expert witness on “human resources,” for a variety of opinions, most of which were not relevant to the action. Alas, for Ms. Connearney, the few relevant opinions proffered by Wright were unreliable. On a Rule 702 motion, Judge Pappert excluded Wright from testifying at trial.

Although not a statistician, Wright sought to offer his statistical analysis in support of the age discrimination claim. Connearney at *4. According to Judge Pappert’s opinion, Wright had taken just two classes in statistics, but perhaps His Honor meant two courses. (Wright Dep., at 10:3–4.) If the latter, then Wright had more statistical training than most physicians who are often permitted to give bogus statistical opinions in health effects litigation. In 2015, the Medical College Admission Test apparently started to include some very basic questions on statistical concepts. Some medical schools now require an undergraduate course in statistics. See Harvard Medical School Requirements for Admission (2016). Most medical schools, however, still do not require statistical training for their entering students. See Veritas Prep, “How to Select Undergraduate Premed Coursework” (Dec. 5, 2011); “Georgetown College Course Requirements for Medical School” (2016).

Regardless of formal training, or lack thereof, Christopher Wright demonstrated a profound ignorance of, and disregard for, statistical concepts. (Wright Dep., at 10:15–12:10; 28:6–14.) Wright was shown to be the wrong expert witness for the job by his inability to define statistical significance. When asked what he understood to be a “statistically significant sample,” Wright gave a meaningless, incoherent answer:

“I think it depends on the environment that you’re analyzing. If you look at things like political polls, you and I wouldn’t necessarily say that serving [sic] 1 percent of a population is a statistically significant sample, yet it is the methodology that’s used in the political polls. In the HR field, you tend to not limit yourself to statistical sampling because you then would miss outliers. So, most HR statistical work tends to be let’s look at the entire population of whatever it is we’re looking at and go from there.”

Connearney at *5 (Wright Dep., at 10:15–11:7). When questioned again, more specifically on the meaning of statistical significance, Wright demonstrated his complete ignorance of the subject:

“Q: And do you recall the testimony it’s generally around 85 to 90 employees at any given time, the ER [emergency room]?

A: I don’t recall that specific number, no.

Q: And four employees out of 85 or 90 is about what, 5 or 6 percent?

A: I’m agreeing with your math, yes.

Q: Is that a statistically significant sample?

A: In the HR [human resources] field it sure is, yes.

Q: Based on what?

A: Well, if one employee had been hit, physically struck, by their boss, that’s less than 5 percent. That’s statistically significant.”

Connearney at *5 n.5 (Wright Dep., at 28:6–14)

In support of his opinion about “disparate treatment,” Wright’s report contained nothing more than a naked comparison of two raw percentages and a causal conclusion, without any statistical analysis. Even for this simplistic comparison of rates, Wright failed to explain how he obtained the percentages in a way that permitted the parties and the trial court to understand his computation and his comparisons. Without a statistical analysis, the trial court concluded that Wright had failed to show that the disparity in termination rates among younger and older employees was not likely consistent with random chance. See also Moultrie v. Martin, 690 F.2d 1078 (4th Cir. 1982) (rejecting writ of habeas corpus when petitioner failed to support claim of grand jury race discrimination with anything other than the numbers of white and black grand jurors).
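
For readers who wonder what even a minimal analysis of two termination rates might look like, here is a rough sketch of a two-sided Fisher exact test on a 2 x 2 table. The counts are invented for illustration and are not taken from the Connearney record; the function name and the tie tolerance are my own choices, and a real analysis would also have to justify the comparison groups.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 "events" fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# HYPOTHETICAL counts (not from the case):
# 4 of 30 older employees terminated vs. 2 of 60 younger employees.
older_fired, older_kept = 4, 26
younger_fired, younger_kept = 2, 58

p_value = fisher_exact_two_sided(older_fired, older_kept,
                                 younger_fired, younger_kept)
print(f"older termination rate   = {older_fired / 30:.1%}")
print(f"younger termination rate = {younger_fired / 60:.1%}")
print(f"two-sided Fisher exact p = {p_value:.3f}")
```

The point of the sketch is only that two raw percentages, standing alone, say nothing about how readily chance could produce the same gap; some calculation of this kind is the least that should accompany them.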

Although Wright gave the wrong definition of statistical significance, the trial court relied upon judges of the Third Circuit who also did not get the definition quite right. The trial court cited a 2010 case in the Circuit, which conflated substantive and statistical significance and then gave a questionable definition of statistical significance:

“The Supreme Court has not provided any definitive guidance about when statistical evidence is sufficiently substantial, but a leading treatise notes that ‘[t]he most widely used means of showing that an observed disparity in outcomes is sufficiently substantial to satisfy the plaintiff’s burden of proving adverse impact is to show that the disparity is sufficiently large that it is highly unlikely to have occurred at random.’ This is typically done by the use of tests of statistical significance, which determine the probability of the observed disparity obtaining by chance.”

See Connearney at *6 & n.7, citing and quoting from Stagi v. National RR Passenger Corp., 391 Fed. Appx. 133, 137 (3d Cir. 2010) (emphasis added) (internal citation omitted). Ultimately, however, this was all harmless error on the way to the right result.


Benhaim v. St. Germain – Supreme Court of Canada Wrestles With Probability

November 11th, 2016

On November 10, 2016, the Supreme Court of Canada handed down a divided (four-to-three) decision in a medical malpractice case, which involved statistical evidence, or rather probabilistic inference. Benhaim v. St-Germain, 2016 SCC 48 (Nov. 10, 2016). The case involved an appeal from a Quebec trial court, and the Quebec Court of Appeal, and some issues peculiar to Canadian lawyers. For one thing, Canadian law does not appear to follow the lost-chance doctrine outlined in the American Law Institute’s Restatement. The consequence seems to be that negligent omissions in the professional liability context are assessed for their causal effect by the Canadian “balance of probabilities” standard.

The facts were reasonably clear, although their interpretation was disputed. In November 2005, Mr. Émond was 44 years old, a lifelong non-smoker, and in good health. At his annual physical with general practitioner Dr. Albert Benhaim, Émond had a chest X-ray (CXR). Benhaim at 11, ¶ 6. Remarkably, neither the majority nor the dissent commented upon the lack of reasonable medical necessity for a CXR in a healthy, non-smoking 40-something male. Few insurers in the United States would have paid for such a procedure. Maybe Canadian healthcare is more expansive than what we see in the United States.

The radiologist reviewing Mr. Émond’s CXR reported a 1.5 to 2.0 cm solitary lesion, and suggested a review with previous CXRs and a recommendation for a CT scan of the thorax. Dr. Benhaim did not follow the radiologist’s suggestions, but Mr. Émond did have a repeat CXR two months later, on January 17, 2006, which was interpreted as unchanged. A recommendation for a follow-up third CXR in four months was not acted upon. Benhaim at 11, ¶ 7. The trial court found that the defendant physicians deviated from the professional standard of care, a finding from which there was no appeal.

Mr. Émond did have a follow-up CXR at the end of 2006, on December 4, 2006, which showed that the solitary lung nodule had grown. Follow-up CT and PET scans confirmed that Mr. Émond had Stage IV lung cancer. Id.

The issues in controversy turned on the staging of Mr. Émond’s lung cancer at the time of his first CXR, in November 2005, and the medical consequences of the delay in diagnosis. Plaintiffs presented expert witness opinion testimony that Mr. Émond’s lung cancer was only Stage I (or at most IIA), at initial radiographic discovery of a nodule, and that he was at Stage III or IV in December 2006, when CT and PET scans confirmed the actual diagnosis of lung cancer. In the view of plaintiff’s expert witnesses, the delay in diagnosis, and the accompanying growth of the tumor and change from Stage I to IV, dramatically decreased Émond’s chance of survival. Id. at 13, ¶¶ 15-16. Indeed, plaintiff’s expert witnesses opined that had Mr. Émond been timely diagnosed and treated in November 2005, he probably would have been cured.

The defense expert witness, Dr. Ferraro, testified that Mr. Émond’s lung cancer was Stage III or IV in November 2005, when the radiographic nodule was first seen, and his chances of survival at that time were already quite poor. According to Dr. Ferraro, earlier intervention and treatment would probably not have been successful in curing Mr. Émond, and the delay in diagnosis was not a cause of his death.

The trial court rejected plaintiffs’ expert witnesses’ opinions on factual grounds. These witnesses had argued that Mr. Émond’s lung cancer was at Stage I in November 2005 because the lung nodule was less than 3 cm., and because Mr. Émond was asymptomatic and in good health. These three points of contention were clearly unreliable because they were all present in January 2007, when Mr. Émond was diagnosed with Stage IV cancer, according to all the expert witnesses. Every point cited by plaintiffs’ expert witnesses in support of their staging failed to discriminate Stage I from Stage III. In Her Honor’s opinion, the lung cancer was probably Stage III in November 2005, and this staging implied a poor prognosis on all the expert witnesses’ opinions. The failure to diagnose until late 2006 was thus not, on the “balance of probabilities,” a cause of death. Id. at 15, ¶21.

The intermediate appellate court reversed on grounds of a presumption of causation, which comes into being when the defendant’s negligence interferes with plaintiff’s ability to show causation, and there is some independent evidence of causation to support the case. I will leave this presumption, which the Supreme Court of Canada held inappropriate on the facts of this case, to Canadian lawyers to debate. What was more interesting was the independent evidence adduced by plaintiffs. This evidence consisted of statistical evidence in the form of a generality that 78 percent of fortuitously discovered lung cancers are at Stage I, which in turn is associated with a cure rate of 70 percent. Id. at 18 ¶30.

The plaintiffs’ witnesses hoped to apply this generality to this case, notwithstanding that Émond’s nodule was close to 2 cm. on CXR, that the general statistic was based upon more sensitive CT studies, and that Émond had been a non-smoker (which may have influenced tumor growth and staging). Furthermore, there was an additional, ominous finding in Mr. Émond’s first CXR, of hilar prominence, which supported the defense’s differentiation of his case from the generality of fortuitously discovered tumors (presumably small, solitary lung nodules without hilar involvement). Id. at 44 ¶83.

The trial court rejected the inference from the group statistic of 70% survival to the conclusion that Mr. Émond had a 70% probability of survival. Tellingly, there was no discussion of the variance for the 70% figure; nor any mention of relevant subgroups. The Court of Appeal, however, would have turned this statistic into a binding presumption by virtue of accepting the 78 percent as providing strong evidence that the 70% survival figure pertained to Mr. Émond. The intermediate appellate court would then have taken the group survival rate as providing a more likely than not conclusion about Mr. Émond, while rejecting the defense expert witness’s statistics as mere speculation. Id. at 36 ¶67.
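
The arithmetic behind the two group figures is worth a quick, admittedly crude, sketch. The function below simply applies the law of total probability across two staging categories. The 78% and 70% values come from the opinion; the assumption that later-stage disease contributes nothing to cure, and the 0.40 figure for a subgroup resembling Mr. Émond, are mine and are invented purely to show how sensitive the conclusion is to the reference class.

```python
def cure_probability(p_stage1, cure_given_stage1, cure_given_later=0.0):
    """Law of total probability over two staging categories."""
    return p_stage1 * cure_given_stage1 + (1 - p_stage1) * cure_given_later

# Figures quoted in the opinion for the general run of fortuitously
# discovered lung cancers.
group = cure_probability(0.78, 0.70)

# HYPOTHETICAL: the share of Stage I cases among patients with a nodule
# near 2 cm seen on CXR with hilar prominence is not in the record;
# 0.40 is an invented value used only to illustrate sensitivity.
adjusted = cure_probability(0.40, 0.70)

# Smallest P(Stage I) that still clears 50% when later-stage cures are
# (by assumption) ignored.
threshold = 0.5 / 0.70

print(f"group-level cure probability:      {group:.3f}")
print(f"with hypothetical P(Stage I)=0.40: {adjusted:.3f}")
print(f"P(Stage I) needed to exceed 50%:   {threshold:.3f}")
```

On these assumptions the group figures barely clear the balance of probabilities, and any case-specific fact that pulls the Stage I probability much below 78 percent pushes the result under 50 percent, which is one way to see why the trial judge wanted an evidentiary bridge to the particular patient.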

Adopting a skeptical stance with respect to probabilistic evidence, the Supreme Court reversed the Quebec Court of Appeal’s reversal of the trial court’s judgment. The Court cited Richard Wright and Jonathan Cohen’s criticisms of probabilistic evidence (and Cohen’s Gatecrasher’s Paradox), and urged caution in applying class or group statistics to generate probabilities that class members share the group characteristic.

“Appellate courts should generally not interfere with a trial judge’s decision not to draw an inference from a general statistic to a particular case. Statistics themselves are silent about whether the particular parties before the court would have conformed to the trend or been an exception from it. Without an evidentiary bridge to the specific circumstances of the plaintiff, statistical evidence is of little assistance. For this reason, such general trends are not determinative in particular cases. What inferences follow from such evidence — whether the generalization that a statistic represents is instantiated in the particular case — is a matter for the trier of fact. This determination must be made with reference to the whole of the evidence.”

Benhaim at 39, ¶74, 75 (internal citations omitted).

To some extent, the Supreme Court’s comments about statistical evidence were rather wide of the mark. The 78% statistic was based upon a high level of generality, namely all cases, without regard for the size of the radiographically discovered lesion, the manner of discovery (CXR versus CT), presence or absence of hilar pathology, or the group’s or individual’s smoking status. In the context of the facts of the case, however, the trial court clearly had a factual basis for resisting the application of the group statistic (78% of fortuitously discovered tumors were Stage I with 70% five-year survival).

The Canadian Supreme Court seems to have navigated these probabilistic waters fairly adeptly, although the majority opinion contains broad brush generalities and inaccuracies, which will, no doubt, show up in future lower court cases. For instance:

“This is because the law requires proof of causation only on a balance of probabilities, whereas scientific or medical experts often require a higher degree of certainty before drawing conclusions on causation (p. 330). Simply put, scientific causation and factual causation for legal purposes are two different things.”

Benhaim at 24, ¶47. The Court cited legal precedent for its observation, and not any scientific treatises. And then, the Supreme Court suggested that all one needs to prevail in a tort case in Canada is a medical expert witness who speculates:

“Trial judges are empowered to make legal determinations even where medical experts are not able to express an opinion with certainty.”

Benhaim at 37, ¶72. Clearly dictum on the facts of Benhaim, but it seems that judges in Canada are like those in the United States. Black robes empower them to do what mere scientists could not do. If we were to ignore the holding of Benhaim, we might think that all one needs in Canada is a medical expert who speculates.


Lawyer and Economist Expert Witnesses Fail the t-Test

July 7th, 2016

Chad L. Staller is a lawyer and James Markham is an economist. The two testify frequently in litigation. They are principals in a litigation-mill known as the Center for Forensic Economic Studies (CFES), which has been a provider of damages opinions-for-hire for decades.

According to its website, the CFES is:

“a leading provider of expert economic analysis and testimony. Our economists and statisticians consult on matters arising in litigation, with a focus on the analysis of economic loss and expert witness testimony on damages.

We assist with discovery, uncover key data, critique opposing claims and produce clear, credible reports and expert testimony. Attorneys and their clients have relied on our expertise in thousands of cases in jurisdictions across the country.”

Modesty was never CFES’s strong suit. CFES was founded by Chad Staller’s father, the late Jerome M. Staller, who infused the run-away inflation of the early 1980s into his reports for plaintiffs in personal injury actions. When this propensity for inflation brought in a large volume of litigation consulting, Staller brought on Brian P. Sullivan. The CFES website notes that Sullivan’s “courtroom demeanor was a model of modesty and good humor, yet he was known to be merciless when cross examined by an opposing attorney.” My personal recollection is that Sullivan sweated profusely on cross-examination. In one case, in which I cross-examined him, Sullivan had added several figures incorrectly to the plaintiff’s detriment. My cross-examination irked the trial judge (Judge Dowling, who was easily irked) to the point that he interrupted me to ask why I was wasting time to point out an error that favored the defense. The question allowed me to give a short summation about how I thought the jury might want to know that the witness, Sullivan, had such difficulty in adding uncomplicated numbers.

In Butt v. United Brotherhood of Carpenters & Joiners of America, 2016 WL 3365772 (E.D. Pa. June 16, 2016) [cited as Butt], plaintiffs, women union members, sued for alleged disparate treatment, which treatment supposedly caused them to have lower incomes than male union members. To support their claims, the women produced reports prepared by CFES’s Chad Staller and James Markham. Counsel for the union challenged the admissibility of the proffered opinions under Rule 702. The magistrate judge sustained the Rule 702 challenges, in an opinion that questioned the reliability and ability of the challenged putative expert witnesses.[1]

Staller and Markham apparently had proffered a “t-test,” which, in their opinion, showed a statistically significant disparity in male and female hours worked, “not attributable to chance.” Butt at *1. Staller and Markham failed, however, to explain or justify their use of the t-test. The sample size in their analysis included 17 women and 388 men on average across ten years. The magistrate judge noted serious reservations over the CFES analysis’s failure to specify how many men or women were employed in any given year. Plaintiffs’ counsel improvidently attempted to support the CFES analysis by adverting to the Reference Manual on Scientific Evidence (3d ed. 2011), which properly notes that the t-test is designed for small samples, but also issues the caveat that “[a] t-test is not appropriate for small samples drawn from a population that is not normal.” Butt at *1 n.2. The CFES reports, submitted without statistical analysis output, apparently did not attempt to justify the assumption of normality; nor did they proffer a non-parametric analysis.
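
To make the missing non-parametric alternative concrete, here is a minimal sketch of a permutation test on a difference in mean hours, which does not lean on a normality assumption. The group sizes (17 and 388) follow the opinion’s description; the hours themselves are simulated from a skewed distribution and are entirely hypothetical, as are the function name and the number of shuffles. Nothing here is the analysis CFES performed or should have performed; it is only one commonly used option.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=5_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)

# HYPOTHETICAL data: annual hours drawn from the same right-skewed
# distribution for both groups, so any gap here is chance by construction.
data_rng = random.Random(42)
women_hours = [data_rng.expovariate(1 / 900) for _ in range(17)]
men_hours = [data_rng.expovariate(1 / 900) for _ in range(388)]

p = permutation_p_value(women_hours, men_hours)
print(f"mean hours, women: {sum(women_hours) / 17:.0f}")
print(f"mean hours, men:   {sum(men_hours) / 388:.0f}")
print(f"permutation p-value: {p:.3f}")
```

Even a clean result from a test like this would address only chance, not the omitted variables that troubled the magistrate judge.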

Putting aside the plaintiffs’ expert witnesses’ failure to explain and justify their use of the t-test, the magistrate judge took issue with the assumption that a comparison of average salaries between the genders was an appropriate analysis in the first place. Butt at *2.

First, the CFES reports assigned damages beyond the years used in their data analysis, which ended in 2012. This extrapolation was especially speculative and unwarranted given that union carpenter working hours were trending downward after 2009. Butt at *3. Second, and even more seriously, the magistrate judge saw that no useful comparison could be made between male and female salaries without taking into account several important additional variables such as their individual skills, the extent that individual carpenters solicited employment, or used referral systems, or accepted out-of-town employment. Butt at *3.[2] Without an appropriate multivariate analysis, the CFES reports could not conclude that the discrepancy in hours worked was caused by, rather than merely correlated with, gender. Butt at *4.[3]

[1] See Calhoun v. Yamaha Motor Corp., U.S.A., 350 F.3d 316, 322 (3d Cir. 2003) (affirming exclusion of “speculative and unreliable” expert evidence).

[2] Citing Stair v. Lehigh Valley Carpenters Local Union No. 600 of United Brotherhood of Carpenters and Joiners of America, No. Civ. A. 91-1507, 1993 WL 235491, at *7, *18 (E.D. Pa. July 24, 1993) (Huyett, J.), aff’d, 43 F.3d 1463 (3d Cir. 1994) (“Many variables determine the number of hours worked by a carpenter: whether the carpenter solicits employment, whether he or she uses the referral system, whether an employer asks for that carpenter by name, whether the carpenter will accept out of town employment, and whether the carpenter has the skills requested by an employer when that employer calls the Union for a referral.”).

[3] Interesting cases cited by the magistrate judge in support included Molthan v. Temple University, 778 F.2d 955, 963 (3d Cir. 1985) (“Because the considerations affecting promotion decisions may differ greatly from one department to another, statistical evidence of a general underrepresentation of women in the position of full professor adds little to a disparate treatment claim.”); Riding v. Kaufmann’s Dep’t Store, 220 F. Supp. 2d 442, 459 (W.D. Pa. 2002) (“Plaintiff’s statistical evidence is mildly interesting, but she does not put the data in context (how old were the women?) [or] tell us what to do with it or what inferences should be gathered from it…”); Brown v. Cost Co., No. Civ. A. 03-224 ERIE, 2006 WL 544296, at *3 (W.D. Pa. Mar. 3, 2006) (excluding statistical evidence proffered in support of claims of disparate treatment).


National Academies’ Teaching Modules on Scientific Policy Issues

June 30th, 2016

Today, the National Academies of Sciences, Engineering, and Medicine announced its release of nine teaching modules to help public policy decision makers and students in professional schools understand the role of science in policy decision making.[1] The modules were developed by university faculty members for the use of other faculty who want to help their students appreciate the complexity and nuances of the evidence for and against scientific claims.

A group within the Academies’ Committee on Science, Technology and the Law supervised the development of the teaching modules, which are now publicly available at the Academies’ website. The Committee was chaired by Paul Brest, former dean and professor emeritus (active), Stanford Law School, and Saul Perlmutter, Franklin W. and Karen Weber Dabby Chair, University of California, Berkeley, and senior scientist, E.O. Lawrence Berkeley National Laboratory. The Gordon and Betty Moore Foundation and the National Biomedical Research Foundation sponsored the development of the modules.

The modules use case studies to illustrate basic scientific and statistical principles involved in contemporary scientific issues that have significant policy implications. The modules are designed to help future policy and decision makers understand and evaluate the scientific evidence that they will doubtlessly encounter. To date, nine modules have been developed and released, in the hope that they will serve as references and examples for future teaching modules.

The nine modules prepared to date are:

Models: Scientific Practice in Context

prepared by:

– Elizabeth Fisher, Professor of Environmental Law, Faculty of Law and Corpus Christi College, Oxford University
– Pasky Pascual, Environmental Protection Agency
– Wendy Wagner, Joe A. Worsham Centennial Professor, University of Texas at Austin School of Law

The Interpretation of DNA Evidence: A Case Study in Probabilities

prepared by:

– David H. Kaye, Associate Dean for Research and Distinguished Professor, The Pennsylvania State University (Penn State Law)

Translating Science into Policy: The Role of Decision Science

prepared by:

– Paul Brest, Former Dean and Professor Emeritus (active), Stanford Law School

Placing a Bet: A New Therapy for Parkinson’s Disease

prepared by:

– Kevin W. Sharer, Senior Lecturer, Harvard Business School, Harvard University

Shale Gas Development

prepared by:

– John D. Graham, Dean, School of Public and Environmental Affairs, Indiana University
– John A. Rupp, Adjunct Instructor, School of Public and Environmental Affairs, and Senior Research Scientist, Indiana Geological Survey, Indiana University
– Adam V. Maltese, Associate Professor of Science Education, School of Education, and Adjunct Faculty in Department of Geological Sciences, Indiana University

Drug-Induced Birth Defects: Exploring the Intersection of Regulation, Medicine, Science, and Law

prepared by:

– Nathan A. Schachtman, Lecturer in Law, Columbia Law School

Vaccines

prepared by:

– Arturo Casadevall, Professor and Chair, W. Harry Feinstone Department of Molecular Microbiology and Immunology, Johns Hopkins University Bloomberg School of Public Health

Forensic Pattern Recognition Evidence

prepared by:

– Simon A. Cole, Professor, Department of Criminology, Law, and Society, Director, Newkirk Center for Science and Society, University of California, Irvine
– Alyse Berthental, Ph.D. Candidate, Department of Criminology, Law, and Society, University of California, Irvine
– Jaclyn Seelagy, Scholar, PULSE (Program on Understanding Law, Science, and Evidence), University of California, Los Angeles School of Law

Scientific Evidence of Factual Causation

prepared by:

– Steve C. Gold, Professor of Law, Rutgers School of Law-Newark
– Michael D. Green, Williams Professor of Law, Wake Forest University School of Law
– Joseph Sanders, A.A. White Professor of Law, University of Houston Law Center

[1] See “Academies Release Educational Modules to Help Future Policymakers and Other Professional-School Students Understand the Role of Science in Decision Making” (June 30, 2016).


Reinventing the Burden of Proof

April 27th, 2016

If lawyers make antic claims that keep the courtrooms busy, law professors make antic proposals to suggest that the law is conceptually confused and misguided, to keep law reviews full.

A few years ago, an article by Professor Edward Cheng claimed that common law courts have failed to grasp the true meaning of burdens of proof. Edward K. Cheng, “Reconceptualizing the Burden of Proof,” 122 Yale L. J. 1254 (2013) [Cheng]. Every law student knows that the preponderance-of-the-evidence standard requires the party with the burden of proof to establish each element of the claim or defense to a probability greater than 50%. Cheng acknowledges that courts know this as well (citations omitted), but then he goes on to state some remarkable assertions.

First, Cheng suggests that the legal system has engaged in a “casual recharacterization of the burden of proof into p > 0.5 and p > 0.95.” Cheng at 1258. Being charitable, let’s say “characterization” rather than “recharacterization,” for Cheng cites nothing for his suggestion that there was some prior characterization that the law mischievously changed. Cheng at 1258.

Second, Cheng claims that the failure to deal with quantified posterior probabilities is the result of an educational or psychological deficiency of judges and lawyers:

“By comparison, the criminal beyond-a-reasonable-doubt standard is akin to a probability greater than 0.9 or 0.95. Perhaps, as most courts have ruled, the prosecution is not allowed to quantify ‘reasonable doubt’, but that is only an odd quirk of the math-phobic legal system.”

Cheng at 1256 (internal citations omitted). Cheng’s “recharacterization” has given way to his own mischaracterization of the legal system. There is a pandemic math phobia in the legal system, but the refusal to quantify the burden of proof in criminal cases has nothing to do with fear or mathematical incompetence. Most cases simply do not permit any rational or principled quantification of posterior probabilities. And even if they were to allow such a cognitive maneuver, most people, and even judges, cannot map practical certainty, or something like “beyond a reasonable doubt,” onto a probability scale of 0 to 1. No less than Judge Jack Weinstein, certainly a friend to the notion that “all evidence is probabilistic,” showed in his informal survey of federal judges of the Eastern District of New York, that judges have no idea of what probability corresponds to the criminal burden of proof:

[Image: U.S. v. Fatico, burden of proof survey]

U.S. v. Fatico, 458 F.Supp. 388 (E.D.N.Y. 1978). Judge Weinstein’s informal survey showed well enough that there is no real understanding of how to map reasonable doubt or its complement onto a scale of 0 to 1. Furthermore, for the vast majority of cases, there is simply no way to assign meaningful probabilities to events, causes, and states of mind, which make up the elements of claims and defenses in our legal system.

Third, Cheng makes much of the non-existence of absolute probabilities in legal contexts. The word “absolute” is used 14 times in his essay. This point is confusing as stated because no one, to my knowledge, has claimed that the burden of proof is an absolute probability that is stated or arrived at independently of evidence in the case. Plaintiffs and defendants can have burdens of proof and claims and defenses, respectively, but for the sake of simplicity, let’s follow Cheng and describe the civil burden of proof as the plaintiff’s burden. The relevant probability is not the absolute probability P(Hπ), but rather the conditional posterior probability: P(Hπ | E).

Fourth, Cheng’s principal innovation, the introduction of a probability ratio as the true meaning and model of the burden of proof, has little or no support in case law or in evidence theory. Cheng cites virtually no cases, and only a few selected publications from the world of law reviews. Cheng proposes to recast burdens of proof as a ratio of conditional probabilities of the plaintiff’s and defendant’s “stories.” If the posterior probability of the plaintiff’s story at trial’s end is P(Hπ | E)¹, and the defendant’s story is represented as P(Hδ | E), then Cheng argues that the plaintiff has carried his burden of proof whenever

P(Hπ | E) / P(Hδ | E) > 1.0

This innovation seems fundamentally wrong for several reasons. Again, assuming that the plaintiff or the State has the burden of proof, the defendant has none. If the plaintiff presents no evidence, then the numerator will be zero, and the ratio will be zero. The defendant prevails, and Cheng’s theory holds. But if the plaintiff presents some evidence and the defendant presents none, then the ratio is undefined. Alternatively, we may see the ratio in this situation as approaching infinity as a limit as the probability of the defendant’s “story” based upon his evidence approaches zero. On either interpretation of this scenario, the ratio Cheng invents is huge, and yet the plaintiff may well lose, as, for instance, when the plaintiff’s case is insufficient as a matter of law.

Cheng’s ratio theory thus fails as a descriptive theory. The theory appears to fail prescriptively as well. In most civil and criminal cases, the finder of fact is instructed that the defendant has no burden of proof and need not present any evidence at all. Even when the defendant has remained silent, and the plaintiff has presented a legally sufficient case, the fact finder may return a verdict for the defendant when the P(Hπ | E) seems too low with respect to the burden of proof.

Let’s consider an example, perhaps not too far-fetched in some American courtrooms. The plaintiff claims that drug A has caused him to develop Syndrome Z. Plaintiff has no clinical trial, or analytical epidemiologic, or animal evidence to support his claim. All the plaintiff can adduce is a so-called disproportionality analysis based upon the reporting of adverse events to the FDA. The defendant does not present any evidence of safety. The end point of interest in the lawsuit, Syndrome Z, was not observed in the trials, and was never looked for in any epidemiologic or toxicologic study. The defendant thus has no affirmative evidence of safety that counts for P(Hδ | E).

Assuming that the trial court does not toss this claim pretrial on a Rule 702 motion, or on a directed verdict, the defendant must address the plaintiff’s claim and the assertion that P(Hπ | E) > 0. The plaintiff supports his claim and assertion by presenting an expert witness who endorses the validity, accuracy, and probativeness of the disproportionality analysis. The defendant confronts this evidence solely on cross-examination, and not by trying to suggest that the plaintiff’s expert witness’s analysis is actually evidence of safety. The point of the cross-examination is to show that the proffered analysis is not a valid tool and lacks validity, accuracy, and probativeness.

In this situation, the plaintiff’s P(Hπ | E) might have been greater than 0.5 at the end of direct examination, but if defense counsel has done his job, then at the end of the cross-examination, the P(Hπ | E) < 0.5. Perhaps at this stage of the proceedings, P(Hπ | E) < 0.01.

The defendant, having no affirmative evidence of safety, rests without presenting any evidence. P(Hδ | E) = 0. Alas, we cannot say that P(Hδ | E) is the complement of P(Hπ | E). There is, in most cases, far too much room for ignorance, indeterminacy, or unknown probability with respect to P(Hδ). In this hypothetical, however, there is no evidence adduced for safety at all, only very weak and unreliable evidence of harm. The ratio is undefined, but the law would allow the dismissal of the plaintiff’s case, or would affirm a rational fact finder’s return of a defense verdict. And the law should do those things.
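
The hypothetical can be restated in a few lines of code, offered only as a toy illustration of how a ratio rule and a threshold rule can part company once the two “stories” are not treated as complements. The function names are mine; the 0.01 figure comes from the hypothetical above, and the 0.001 figure in the second scenario is an invented value standing in for a defendant’s story with a tiny but non-zero probability.

```python
def cheng_ratio(p_plaintiff, p_defendant):
    """Ratio of the two posterior probabilities; None when undefined."""
    if p_defendant == 0:
        return None  # division by zero: the ratio is undefined
    return p_plaintiff / p_defendant

def meets_preponderance(p_plaintiff):
    """Conventional reading: the plaintiff's story must exceed 0.5."""
    return p_plaintiff > 0.5

# Scenario from the hypothetical: weak evidence of harm survives
# cross-examination at 0.01, and the defense offers no evidence of safety.
print(cheng_ratio(0.01, 0.0))          # None (undefined)
print(meets_preponderance(0.01))       # False

# HYPOTHETICAL variant: the defense story has a tiny non-zero probability.
ratio = cheng_ratio(0.01, 0.001)
print(ratio is not None and ratio > 1)  # True: the ratio rule favors plaintiff
print(meets_preponderance(0.01))        # False: the threshold rule does not
```

In both scenarios most of the probability mass sits with ignorance rather than with either story, which is exactly the circumstance in which the ratio misleads.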

Fifth, Cheng commits other errors along the way to arriving at his ratio theory. In one instance, he commits a serious category mistake:

“Looking at the statistical world, we immediately see that characterizing any decision rule as a 0.5 probability threshold is odd. Statisticians rarely attempt to prove the truth of a proposition or hypothesis by using its absolute probability. Instead, hypothesis testing is usually comparative. There is a null hypothesis and an alternative hypothesis, and one is rejected in favor of the other depending on the evidence observed and the consistency of that evidence with the two hypotheses.”

Cheng at 1259 (internal citations omitted; emphasis added).

Again, Cheng is correct insofar as he suggests that statisticians do not often use absolute probabilities. Attained levels of significance probabilities, whether used in hypothesis testing or otherwise, are conditional probabilities that describe the probability of observing the sample statistic, or one more extreme, based upon the statistical model and posited null hypothesis. Indeed, many methodologically rigorous statisticians and scientists would resist placing a quantified posterior probability on the truth of a proposition or hypothesis. The measures of probability may be helpful in identifying uncertainties due to random error, or even on occasion due to bias, but these measures do not translate into assigning the quantified posterior probabilities that Cheng wants and needs to make his ratio theory work. There is nothing, however, odd about using the quantified posterior probability of greater than 50% as a metaphor.

But whence comes rejecting one hypothesis “in favor of” another, as a matter of statistics? The null hypothesis is not accepted in the hypothesis test; rather it was assumed in order to conduct the test. The inference Cheng describes would be improper. In a footnote, Cheng asserts that “classical hypothesis testing strongly favors the null hypothesis,” but this conflates attained level of significance with posterior probabilities. Cheng at 1259 n. 12. Cheng states that “the null hypothesis can be given no specific preference,” in legal contexts, id., but this statement seems to ignore what it means for a party to have a burden of proving facts needed to establish its claim or defense.

Of course, over the course of multiple studies, which look at the issue repeatedly with increasingly precise and valid experiments and studies, and which consistently fail to reject a given null hypothesis, we sometimes do, as a matter of judgment, accept the null hypothesis. This situation has little to do with Cheng’s ratio theory, however.

1 Where P stands for probability, Hπ for the plaintiff’s “story,” Hδ for the defendant’s story, P(Hπ | E) represents the posterior probability at trial’s end of the plaintiff’s story given the evidence, and P(Hδ | E) represents the posterior probability at trial’s end of the defendant’s story given the evidence.


Lipitor Diabetes MDL’s Inexact Analysis of Fisher’s Exact Test

April 21st, 2016

Muriel Bristol was a biologist who studied algae at the Rothamsted Experimental Station in England, after World War I. In addition to her knowledge of plant biology, Bristol claimed the ability to tell whether tea had been added to milk, or whether the tea had been poured first and the milk then added. Bristol, as a scientist and a proper English woman, preferred the latter.

Ronald Fisher, who also worked at Rothamsted, expressed his skepticism over Dr. Bristol’s claim. Fisher set about to design a randomized experiment that would efficiently and effectively test her claim. Bristol was presented with eight cups of tea, four of which were prepared with milk added to tea, and four prepared with tea added to milk. Bristol, of course, was blinded to which was which, but was required to label each according to its manner of preparation. Fisher saw his randomized experiment as a 2 x 2 contingency table, from which he could calculate the probability of the observed outcome (and of any more extreme outcomes) using the assumption of fixed marginal totals and the hypergeometric probability distribution. Fisher’s Exact Test was born at tea time.[1]

Fisher described the origins of his Exact Test in one of his early texts, but he neglected to report whether his experiment vindicated Bristol’s claim. According to David Salsburg, H. Fairfield Smith, one of Fisher’s colleagues, acknowledged that Bristol nailed Fisher’s Exact test, with all eight cups correctly identified. The test has gone on to become an important tool in the statistician’s armamentarium.
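The arithmetic behind the eight-cup design is simple enough to check by hand or by machine. What follows is a minimal sketch in Python, offered only as an illustration; the function name is hypothetical, and the sketch assumes that the taster knows that exactly four cups were prepared each way, so that the margins of the 2 x 2 table are fixed.

```python
from fractions import Fraction
from math import comb

def tea_point_probability(correct, cups_each=4):
    """Chance, by guessing alone, of picking exactly `correct` of the
    milk-first cups when selecting cups_each cups out of 2 * cups_each."""
    total = comb(2 * cups_each, cups_each)  # 70 equally likely selections
    ways = comb(cups_each, correct) * comb(cups_each, cups_each - correct)
    return Fraction(ways, total)

# No outcome is more extreme than a perfect score, so the one-sided
# p-value for all cups correct is just this single point probability.
p_all_correct = tea_point_probability(4)
print(p_all_correct, float(p_all_correct))
```

On these assumptions, a perfect score has a probability of 1 in 70, or about 0.014, under the null hypothesis of pure guessing.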

Fisher’s Exact, like any statistical test, has model assumptions and preconditions. For one thing, the test is designed for categorical data, with binary outcomes. The test allows us to evaluate whether two proportions are likely different by chance alone, by calculating the probability of the observed outcome, as well as more extreme outcomes.

The calculation of an exact attained significance probability, using Fisher’s approach, provides a one-sided p-value, with no unique solution to calculating a two-sided attained significance probability. In discrimination cases, the one-sided p-value may well be more appropriate for the issue at hand, and so the lack of a unique two-sided value is not a particular problem. Fisher’s Exact Test has thus played an important role in showing the judiciary that small sample size need not be an insuperable barrier to meaningful statistical analysis.[2]

The difficulty of using Fisher’s Exact for small sample sizes is that the hypergeometric distribution, upon which the test is based, is highly asymmetric. The observed one-sided p-value does not measure the probability of a result equally extreme in the opposite direction. There are at least three ways to calculate the two-sided p-value:

Double the one-sided p-value.
Add the point probabilities from the opposite tail that are more extreme than the observed point probability.
Use the mid-P value; that is, add all values more extreme (smaller) than the observed point probability from both sides of the distribution, PLUS ½ of the observed point probability.

Some software programs will proceed in one of these ways by default, but their doing so does not guarantee the most accurate measure of two-tailed significance probability.
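The three approaches can be compared side by side on a small table. The sketch below is illustrative only: the function name and the cell counts are hypothetical (they are not data from any case), the second approach is implemented as the sum of all point probabilities no larger than the observed one, and a point in the opposite tail that exactly ties the observed probability is left out of the mid-p, which is one reading among several.

```python
from math import comb

def fisher_p_values(a, b, c, d):
    """Exact p-values for the 2 x 2 table [[a, b], [c, d]] with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Hypergeometric point probability for each possible top-left cell count.
    pmf = {k: comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
           for k in range(lo, hi + 1)}
    p_obs = pmf[a]
    one_sided = min(sum(p for k, p in pmf.items() if k >= a),
                    sum(p for k, p in pmf.items() if k <= a))
    tol = 1e-9 * p_obs  # guard against floating-point noise in comparisons
    return {
        "one_sided": one_sided,
        # Approach 1: double the one-sided value (capped at 1 here).
        "doubled": min(1.0, 2 * one_sided),
        # Approach 2: add every outcome as likely as, or less likely than, the observed.
        "as_or_less_likely": sum(p for p in pmf.values() if p <= p_obs + tol),
        # Approach 3: mid-p, strictly less likely outcomes plus half the observed.
        "mid_p": sum(p for p in pmf.values() if p < p_obs - tol) + 0.5 * p_obs,
    }

# Hypothetical counts, chosen only to give an asymmetric distribution.
for name, value in fisher_p_values(8, 2, 3, 12).items():
    print(f"{name}: {value:.4f}")
```

As written, the mid-p in this sketch is always smaller than the second two-sided value, because it counts only half of the observed point probability.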

In the Lipitor MDL for diabetes litigation, Judge Gergel generally used sharp analyses to cut through the rancid fat of litigation claims, to get to the heart of the matter. By and large, he appears to have done a splendid job. In the course of gatekeeping under Federal Rule of Evidence 702, however, Judge Gergel may have misunderstood the nature of Fisher’s Exact Test.

Nicholas Jewell is a well-credentialed statistician at the University of California. In the courtroom, Jewell is a well-known expert witness for the litigation industry. He is no novice at generating unreliable opinion testimony. See In re Zoloft Prods. Liab. Litig., No. 12–md–2342, 2015 WL 7776911 (E.D. Pa. Dec. 2, 2015) (excluding Jewell’s opinions as scientifically unwarranted and methodologically flawed). In the Lipitor cases, some of Jewell’s opinions seemed outlandish indeed, and Judge Gergel generally excluded them. See In re Lipitor Marketing, Sales Practices and Prods. Liab. Litig., MDL No. 2:14-mn-02502-RMG, ___ F.Supp. 3d ___ (2015), 2015 WL 7422613 (D.S.C. Nov. 20, 2015) [Lipitor Jewell], reconsideration den’d, 2016 WL 827067 (D.S.C. Feb. 29, 2016) [Lipitor Jewell Reconsidered].

As Judge Gergel explained, Jewell calculated a relative risk for abnormal blood glucose in a Lipitor group to be 3.0 (95% C.I., 0.9 to 9.6), using STATA software. Also using STATA, Jewell obtained an attained significance probability of 0.0654, based upon Fisher’s Exact Test. Lipitor Jewell at *7.

Judge Gergel did not report whether Jewell’s reported p-value of 0.0654 was one- or two-sided, but he did state that the attained probability “indicates a lack of statistical significance.” Id. & n. 15. The rest of His Honor’s discussion of the challenged opinion, however, makes clear that 0.0654 must have been a two-sided value. If it had been a one-sided p-value, then there would have been no way of invoking the mid-p to generate a two-sided p-value below 5%. The mid-p will always be larger than the one-tailed exact p-value generated by Fisher’s Exact Test.

The court noted that Dr. Jewell had testified that he believed that STATA generated this confidence interval by “flip[ping]” the Taylor series approximation. The STATA website notes that it calculates confidence intervals for odds ratios (which are different from the relative risk that Jewell testified he computed), by inverting the Fisher exact test.[3] Id. at *7 & n. 17. Of course, this description suggests that the confidence interval is not based upon exact methods.

STATA does not provide a mid p-value calculation, and so Jewell used an on-line calculator, to obtain a mid p-value of 0.04, which he declared statistically significant. The court took Jewell to task for using the mid p-value as though it were a different analysis or test. Id. at *8. Because the mid-p value will always be larger than the one-sided exact p-value from Fisher’s Exact Test, the court’s explanation does not really make sense:

“Instead, Dr. Jewell turned to the mid-p test, which would ‘[a]lmost surely’ produce a lower p-value than the Fisher exact test.”

Id. at *8. The mid-p test, however, is not different from the Fisher’s exact; rather it is simply a way of dealing with the asymmetrical distribution that underlies the Fisher’s exact, to arrive at a two-tailed p-value that more accurately captures the rate of Type I error.

The MDL court acknowledged that the mid-p approach was not inherently unreliable, but questioned Jewell’s inconsistent, selective use of the approach for only one test.[4] Jewell certainly did not help the plaintiffs’ cause and his standing by having discarded the analyses that were not incorporated into his report, thus leaving the MDL court to guess at how much selection went on in his process of generating his opinions. Id. at *9 & n. 19.

None of Jewell’s other calculated p-values involved the mid-p approach, but the court’s criticism raises the question whether the other p-values came from a Fisher’s Exact Test with small sample size, or other highly asymmetrical distribution. Id. at *8. Although Jewell had shown himself willing to engage in other dubious, result-oriented analyses, Jewell’s use of the mid-p for this one comparison may have been within acceptable bounds after all.

The court also noted that Jewell had obtained the “exact p-value and that this p-value was not significant.” Id. The court’s notation here, however, does not report the important detail whether that exact, unreported p-value was merely the doubling of the one-sided p-value given by Fisher’s Exact Test. As the STATA website, cited by the MDL court, explains:

“The test naturally gives a one-sided p-value, and there are at least four different ways to convert it to a two-sided p-value (Agresti 2002, 93). One way, not implemented in Stata, is to double the one-sided p-value; doubling is simple but can result in p-values larger than one.”

Wesley Eddings, “Fisher’s exact test two-sided idiosyncrasy” (Jan. 2009) (citing Alan Agresti, Categorical Data Analysis 93 (2d ed. 2002)).
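The caveat about doubling is easy to verify on the eight-cup tea design. The following sketch uses a hypothetical outcome, in which the taster identifies only two of the four milk-first cups; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Eight-cup design: 4 milk-first cups, and the taster selects 4 cups.
# ways[k] counts the selections that get exactly k milk-first cups right.
ways = [comb(4, k) * comb(4, 4 - k) for k in range(5)]  # [1, 16, 36, 16, 1]
total = sum(ways)                                        # 70 selections in all

# Hypothetical outcome: only 2 of the 4 milk-first cups identified.
one_sided = Fraction(sum(ways[2:]), total)  # P(2 or more correct) = 53/70
doubled = 2 * one_sided                     # 53/35, which exceeds 1
print(one_sided, doubled)
```

Because the observed outcome sits at the center of the distribution, the one-sided value already exceeds one half, and doubling it yields a "probability" greater than one.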

On plaintiffs’ motion for reconsideration, the MDL court reaffirmed its findings with respect to Jewell’s use of the mid-p. Lipitor Jewell Reconsidered at *3. In doing so, the court insisted that the one instance in which Jewell used the mid-p stood in stark contrast to all the other instances in which he had used Fisher’s Exact Test. The court then cited to the record to identify 21 other instances in which Jewell used a p-value rather than a mid-p value. The court, however, did not provide the crucial detail whether these 21 other instances actually involved small-sample applications of Fisher’s Exact Test. As result-oriented as Jewell can be, it seems safe to assume that not all his statistical analyses involved Fisher’s Exact Test, with its attendant ambiguity for how to calculate a two-tailed p-value.

Post-Script (Aug. 9, 2017)

The defense argument and the judicial error were echoed in a Washington Legal Foundation paper that pilloried Nicholas Jewell for the surfeit of methodological flaws in his expert witness opinions in In re Lipitor. Unfortunately, the paper uncritically recited the defense’s theory about the Fisher’s Exact Test:

“In assessing Lipitor data, even after all of the liberties that [Jewell] took with selecting data, he still could not get a statistically-significant result employing a Fisher’s exact test, so he switched to another test called a mid-p test, which generated a (barely) statistically significant result.”

Kirby Griffis, “The Role of Statistical Significance in Daubert/Rule 702 Hearings,” at 19, Wash. Leg. Foundation Critical Legal Issues Working Paper No. 201 (Mar. 2017). See Kirby Griffis, “Beware the Weak Argument: The Rule of Thirteen,” For the Defense 72 (July 2013) (quoting Justice Frankfurter, “A bad argument is like the clock striking thirteen. It puts in doubt the others.”). The fallacy of Griffis’ argument is that it assumes that a mid-p calculation is a different statistical test from the Fisher’s Exact test, which yields a one-tailed significance probability. Unfortunately, Griffis’ important paper is marred by this and other misstatements about statistics.

[1] Sir Ronald A. Fisher, The Design of Experiments at chapter 2 (1935); see also Stephen Senn, “Tea for three: Of infusions and inferences and milk in first,” Significance 30 (Dec. 2012); David Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (2002).

[2] See, e.g., Dendy v. Washington Hosp. Ctr., 431 F. Supp. 873 (D.D.C. 1977) (denying preliminary injunction), rev’d, 581 F.2d 99 (D.C. Cir. 1978) (reversing denial of relief, and remanding for reconsideration). See also National Academies of Science, Reference Manual on Scientific Evidence 255 n.108 (3d ed. 2011) (“Well-known small sample techniques [for testing significance and calculating p-values] include the sign test and Fisher’s exact test.”).

[3] See Wesley Eddings, “Fisher’s exact test two-sided idiosyncrasy” (Jan. 2009), available at <http://www.stata.com/support/faqs/statistics/fishers-exact-test/>, last visited April 19, 2016 (“Stata’s exact confidence interval for the odds ratio inverts Fisher’s exact test.”). This article by Eddings contains a nice discussion of why the Fisher’s Exact Test attained significance probability disagrees with the calculated confidence interval. Eddings points out the asymmetry of the hypergeometric distribution, which complicates arriving at an exact p-value for a two-sided test.

[4] See Barber v. United Airlines, Inc., 17 Fed.Appx. 433, 437 (7th Cir. 2001) (“Because in formulating his opinion Dr. Hynes cherry-picked the facts he considered to render an expert opinion, the district court correctly barred his testimony because such a selective use of facts fails to satisfy the scientific method and Daubert.”).
