TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Zoloft MDL Relieves Matrixx Depression

January 30th, 2015

When the Supreme Court delivered its decision in Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309 (2011), a colleague, David Venderbush from Alston & Bird LLP, and I wrote a Washington Legal Foundation Legal Backgrounder, in which we predicted that plaintiffs’ counsel would distort the holding, and inflate the dicta of the opinion. Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” 26 (14) Legal Backgrounder (June 17, 2011)[1]. Our prediction was sadly all-too accurate. Not only was the context of the Matrixx distorted, but several district courts appeared to adopt the dicta on statistical significance as though it represented the holding of the case[2].

The Matrixx decision, along with the few district court opinions that had embraced its dicta[3], was urged as the basis for denying a defense challenge to the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist, in the Zoloft MDL. The trial court, however, correctly discerned several methodological shortcomings and failures, including Dr. Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons. See In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (Rufe, J.).

Plaintiffs (through their Plaintiffs’ Steering Committee (PSC) in the Zoloft MDL) were undaunted and moved for reconsideration, asserting that the MDL trial court had failed to give appropriate weight to the Supreme Court’s decision in Matrixx, and a Third Circuit decision in DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941 (3d Cir. 1990). The MDL trial judge, however, deftly rebuffed the plaintiffs’ use of Matrixx, and their attempt to banish consideration of random error in the interpretation of epidemiologic studies. In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2015 WL 314149 (E.D. Pa. Jan. 23, 2015) (Rufe, J.) (denying PSC’s motion for reconsideration).

In rejecting the motion for reconsideration, the Zoloft MDL trial judge noted that the PSC had previously cited Matrixx, and that the Court had addressed the case in its earlier ruling. 2015 WL 314149, at *2-3. The MDL Court then proceeded to expand upon its earlier ruling, and to explain how Matrixx was largely irrelevant to the Rule 702 context of Pfizer’s challenge to Dr. Bérard. There were, to be sure, some studies with nominal statistically significant results, for some birth defects, among children of mothers who took Zoloft in their first trimester of pregnancy. As Judge Rufe explained, statistical significance, or the lack thereof, was only one item in a fairly long list of methodological deficiencies in Dr. Bérard’s causation opinions:

“The [original] opinion set forth a detailed and multi-faceted rationale for finding Dr. Bérard’s testimony unreliable, including her inattention to the principles of replication and statistical significance, her use of certain principles and methods without demonstrating either that they are recognized by her scientific community or that they should otherwise be considered scientifically valid, the unreliability of conclusions drawn without adequate hypothesis testing, the unreliability of opinions supported by a ‛cherry-picked’ sub-set of research selected because it was supportive of her opinions (without adequately addressing non-supportive findings), and Dr. Bérard’s failure to reconcile her currently expressed opinions with her prior opinions and her published, peer-reviewed research. Taking into account all these factors, as well as others discussed in the Opinion, the Court found that Dr. Bérard departed from well-established epidemiological principles and methods, and that her opinion on human causation must be excluded.”

Id. at *1.

In citing the multiple deficiencies of the proffered expert witness, the Zoloft MDL Court thus put its decision well within the scope of the Third Circuit’s recent precedent of affirming the exclusion of Dr. Bennet Omalu, in Pritchard v. Dow Agro Sciences, 430 F. App’x 102, 104 (3d Cir.2011). The Zoloft MDL Court further defended its ruling by pointing out that it had not created a legal standard requiring statistical significance, but rather had made a factual finding that epidemiologist, such as the challenged witness, Dr. Anick Bérard, would use some measure of statistical significance in reaching conclusions in her discipline of epidemiology. 2015 WL 314149, at *2[4].

On the plaintiffs’ motion for reconsideration, the Zoloft Court revisited the Matrixx case, properly distinguishing the case as a securities fraud case about materiality of non-disclosed information, not about causation. 2015 WL 314149, at *4. Although the MDL Court could and should have identified the Matrixx language as clearly obiter dicta, it did confidently distinguish the Supreme Court holding about pleading materiality from its own task of gatekeeping expert witness testimony on causation in a products liability case:

“Because the facts and procedural posture of the Zoloft MDL are so dissimilar from those presented in Matrixx, this Court reviewed but did not rely upon Matrixx in reaching its decision regarding Dr. Bérard. However, even accepting the PSC’s interpretation of Matrixx, the Court’s Opinion is consistent with that ruling, as the Court reviewed Dr. Bérard’s methodology as a whole, and did not apply a bright-line rule requiring statistically significant findings.”

Id. at *4.

In mounting their challenge to the MDL Court’s earlier ruling, the Zoloft plaintiffs asserted that the Court had failed to credit Dr. Bérard’s reliance upon what Dr. Bérard called the “Rothman approach.” This approach, attribution to Professor Kenneth Rothman had received some attention in the Bendectin litigation in the Third Circuit, where plaintiffs sought to be excused from their failure to show statistically significant associations when claiming causation between maternal use of Bendectin and infant birth defects. DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941 (3d Cir. 1990). The Zoloft MDL Court pointed out that the Circuit, in DeLuca, had never affirmatively endorsed Professor Rothman’s “approach,” but had reversed and remanded the Bendectin case to the district court for a hearing under Rule 702:

“by directing such an overall evaluation, however, we do not mean to reject at this point Merrell Dow’s contention that a showing of a .05 level of statistical significance should be a threshold requirement for any statistical analysis concluding that Bendectin is a teratogen regardless of the presence of other indicial of reliability. That contention will need to be addressed on remand. The root issue it poses is what risk of what type of error the judicial system is willing to tolerate. This is not an easy issue to resolve and one possible resolution is a conclusion that the system should not tolerate any expert opinion rooted in statistical analysis where the results of the underlying studies are not significant at a .05 level.”

2015 WL 314149, at *4 (quoting from DeLuca, 911 F.2d at 955). After remand, the district court excluded the DeLuca plaintiffs’ expert witnesses, and granted summary judgment, based upon the dubious methods employed by plaintiffs’ expert witnesses in cherry picking data, recalculating risk ratios in published studies, and ignoring bias and confounding in studies. The Third Circuit affirmed the judgment for Merrell Dow. DeLuca v. Merrell Dow Pharma., Inc., 791 F. Supp. 1042 (3d Cir. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993).

In the Zoloft MDL, the plaintiffs not only offered an erroneous interpretation of the Third Circuit’s precedents in DeLuca, they also failed to show that the “Rothman” approach had become generally accepted in over two decades since DeLuca. 2015 WL 314149, at *4. Indeed, the hearing record was quite muddled about what the “Rothman” approach involved, other than glib, vague suggestions that the approach would have countenanced Dr. Bérard’s selective, over-reaching analysis of the extant epidemiologic studies. The plaintiffs did not call Rothman as an expert witness; nor did they offer any of Rothman’s publications as exhibits at the Zoloft hearing. Although Professor Rothman has criticized the overemphasis upon p-values and significance testing, he has never suggested that researchers and scientists should ignore random error in interpreting research data. Nevertheless, plaintiffs attempted to invoke some vague notion of a Rothman approach that would ignore confidence intervals, attained significance probability, multiplicity, bias, and confounding. Ultimately, the MDL Court would have none of it. The Court held that the Rothman Approach (whatever that is), as applied by Dr. Bérard, did not satisfy Rule 702.

The testimony at the Rule 702 hearing on the so-called “Rothman approach” had been sketchy at best. Dr. Bérard protested, perhaps too much, when asked about her having ignored p-values:

“I’m not the only one saying that. It’s really the evolution of the thinking of the importance of statistical significance. One of my professors and also a friend of mine at Harvard, Ken Rothman, actually wrote on it – wrote on the topic. And in his book at the end he says obviously what I just said, validity should not be confused with precision, but the third bullet point, it’s saying that the lack of statistical significance does not invalidate results because sometimes you are in the context of rare events, few cases, few exposed cases, small sample size, exactly – you know even if you start with hundreds of thousands of pregnancies because you are looking at rare events and if you want to stratify by exposure category, well your stratum becomes smaller and smaller and your precision decreases. I’m not the only one saying that. Ken Rothman says it as well, so I’m not different from the others. And if you look at many of the studies published nowadays, they also discuss that as well.”

Notes of Testimony of Dr. Anick Bérard, at 76:21- 77:14 (April 9, 2014). See also Notes of Testimony of Dr. Anick Bérard, at 211 (April 11, 2014) (discussing non-statistically significant findings as a “trend,” and asserting that the lack of a significant finding does not mean that there is “no effect”). Bérard’s invocation of Rothman here is accurate but unhelpful. Rothman and Bérard are not alone in insisting that confidence intervals provide a measure of precision of an estimate, and that we should be careful not to interpret the lack of significance to mean no effect. But the lack of significance cannot be used to interpret data to show an effect.

At the Rule 702 hearing, the PSC tried to bolster Dr. Bérard’s supposed reliance upon the “Rothman approach” in cross-examining Pfizer’s expert witness, Dr. Stephen Kimmel:

“Q. You know who Dr. Rothman is, the epidemiologist?
A. Yes.
Q. You actually took a course from Dr. Rothman, didn’t you?
A. I did when I was a student way back.
Q. He is a well-known epidemiologist, isn’t he?
A. Yes, he is.
Q. He has published this book, Modern Epidemiology. Do you have a copy of this?
A. I do.
Q. Do you – Have you ever read it?
A. I read his earlier edition. I have not read the most recent edition.
Q. There’s two other authors, Sander Greenland and Tim Lash. Do you know either one of them?
A. I know Sander. I don’t know Tim.
Q. Dr. Rothman has some – he has written about confidence intervals and statistical significance for some time, hasn’t he?
A. He has.
Q. Do you agree with him that statistical significance is a not matter of validity. It’s a matter of precision?
A. It’s a matter of – well, confidence intervals are matters of precision. P-values are not.
Q. Okay. I want to put up a table and see if you are in agreement with Dr. Rothman. This is the third edition of Modern Epidemiology. And he has – and ignore my brother’s handwriting. But there is an hypothesized rate ratio under 10-3. It says: p-value function from which one can find all confidence limits for a hypothetical study with a rate ratio estimate of 3.1 Do you see that there?
A. Yes. I don’t see the top of the figure, not that it matters.
Q. I want to make sure. The way I understand this, he is giving us a hypothesis that we have a relative risk of 3.1 and it [presumably a 95% confidence interval] crosses 1, meaning it’s not statistically significant. Is that fair?
A. Well, if you are using a value of .05, yes. And again, if this is a single test and there’s a lot of things that go behind it. But, yes, so this is a total hypothetical.
Q. Yes.
A. I’ sorry. He’s saying here is a hypothetical based on math. And so here is – this is what we would propose.
Q. Yes, I want to highlight what he says about this figure and get your thoughts on it. He says:
The message of figure 10-3 is that the example data are more compatible with a moderate to strong association than with no association, assuming the statistical model used to construct the function is correct.
A. Yes.
Q. Would you agree with that statement?
A. Assuming the statistical model is correct. And the problem is, this is a hypothetical.
Q. Sure. So let’s just assume. So what this means to sort of put some meat on the bone, this means that although we cross 1 and therefore are statistically
significant [sic, non-significant], he says the more likely truth here is that there is a moderate to strong effect rather than no effect?
A. Well, you know he has hypothesized this. This is not used in common methods practice in pharmacoepi. Dr. Rothman has lots of ideas but it’s not part of our standard scientific method.

Notes of Testimony of Dr. Stephen Kimmel, at 126:2 to 128:20.

Nothing very concrete about the “Rothman approach” is put before the MDL Court, either through Dr. Bérard or Dr. Kimmel. There are, however, other instructive aspects to the plaintiff’s counsel’s examination. First, the referenced portion of the text, Modern Epidemiology, is a discussion of p-value functions, not of p-values or of confidence intervals per se. Modern Epidemiology at 158-59 (3d ed. 2008). Dr. Bérard never discussed p-value functions in her report or in her testimony, and Dr. Kimmel testified, without contradiction, that such p-value functions are “not used in common methods practice.” Second, the plaintiff’s counsel never marked and offered the Rothman text as an exhibit for the MDL Court to consider. Third, the cross-examiner first asked about the implication for a hypothetical association, and then, when he wanted to “put some meat on the bone” changed the word used in Rothman’s text, “association,” to “effect.” The word “effect” does not appear in Rothman’s text at the referenced discussion about p-value functions. Fortunately, the MDL Court was not poisoned by the “meat on the bone.”

The Pit and the Pendulum

Another document glibly referenced but not provided to the MDL Court was the publication of Sir Austin Bradford Hill’s presidential address to the Royal Society of Medicine on causation. The MDL Court acknowledged that the PSC had argued that the emphasis upon statistical significance was contrary to Hill’s work and teaching. 2015 WL 314149, at *5. In the Court’s words:

“the PSC argues that the Court’s finding regarding the importance of statistical significance in the field of epidemiology is inconsistent with the work of Bradford Hill. The PSC points to a 1965 address by Sir Austin Bradford Hill, which it has not previously presented to the Court, except in opening statements of the Daubert hearings.20 The PSC failed to put forth evidence establishing that Bradford Hill’s statement that ‛I wonder whether the pendulum has not swung too far [in requiring statistical significance before drawing conclusions]’ has, in the decades since that 1965 address, altered the importance of statistical significance to scientists in the field of epidemiology.”

Id. This failure, identified by the Court, is hardly surprising. The snippet of a quotation from Hill would not sustain the plaintiffs’ sweeping generalization. The quoted language in context may help to explain why Hill’s paper was not provided:

“I wonder whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse the glitter of the t table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contracted. The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers.’ The writer, the editor and the reader are unmoved. The magic formulae are there.”

Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 299 (1965).

In the Zoloft cases, no expert witness was prepared to state that the disparity was “grotesquely obvious,” or “negligible.” And Bradford Hill’s larger point was that bias and confounding often dwarf considerations of random error, and that there are many instances in which significance testing is unavailing or unhelpful. And in some studies, with large “effect sizes,” statistical significance testing may be beside the point.

Hill’s presidential address to the Royal Society of Medicine commemorated his successes in epidemiology, and we need only turn to Hill’s own work to see how prevalent was his use of measurements of significance probability. See, e.g., Richard Doll & Austin Bradford Hill, “Smoking and Carcinoma of the Lung: Preliminary Report,” Brit. Med. J. 740 (Sept. 30, 1950); Medical Research Council, “Streptomycin Treatment of Pulmonary Tuberculosis,” Brit. Med. J. 769 (Oct. 30, 1948).

Considering the misdirection on Rothman and on Hill, the Zoloft MDL Court did an admirable job in unraveling the Matrixx trap set by counsel. The Court insisted upon parsing the Bradford Hill factors[5], over Pfizer’s objection, despite the plaintiffs’ failure to show “an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance,” which Bradford Hill insisted was the prerequisite for the exploration of the nine factors he set out in his classic paper. Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295, 295 (1965). Given the outcome, the Court’s questionable indulgence of plaintiffs’ position was ultimately harmless.


[1] See alsoThe Matrixx – A Comedy of Errors,” and “Matrixx Unloaded,” (Mar. 29, 2011), “The Matrixx Oversold,” and “De-Zincing the Matrixx.”

[2] SeeSiracusano Dicta Infects Daubert Decisions” (Sept. 22, 2012).

[3] See, e.g., In re Chantix (Varenicline) Prods. Liab. Litig., 2012 U.S. Dist. LEXIS 130144, at *22 (N.D. Ala. 2012); Cheek v. Wyeth Pharm. Inc., 2012 U.S. Dist. LEXIS 123485 (E.D. Pa. Aug. 30, 2012); In re Celexa & Lexapro Prods. Liab. Litig.,  ___ F.3d ___, 2013 WL 791780 (E.D. Mo. 2013).

[4] The Court’s reasoning on this point begged the question whether an ordinary clinician, ignorant of the standards, requirements, and niceties of statistical reasoning and inference, would be allowed to testify, unconstrained by any principled epidemiologic reasoning about random or systematic error. It is hard to imagine that Rule 702 would countenance such an end-run around the requirements of sound science.

[5] Adhering to Bradford Hill’s own admonition might have saved the Court the confusion of describing statistical significance as a measure of strength of association. 2015 WL 314149, at *2.

The Lie Detector and Wonder Woman – Quirks and Quacks of Legal History

January 27th, 2015

From 1923, until the United States Supreme Court decided the Daubert case in 1993, Frye was cited as “controlling authority” on questions of the admissibility of scientific opinion testimony and test results. The decision is infuriatingly cryptic and unhelpful as to background or context of the specific case, as well as how it might be applied to future controversies. Of the 669 words, these are typically cited as the guiding “rule” with respect to expert witness opinion testimony:

“Just when a scientific principle or discovery crosses the line between the experimental and demonstrable stages is difficult to define. Somewhere in this twilight zone the evidential force of the principle must be recognized, and while the courts will go a long way in admitting expert testimony deduced from a well recognized scientific principle or discovery, the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs.”

Frye v. United States, 293 F. 1013, 1014 (D.C. Cir. 1923).

As most scholars of evidence realize, the back story of the Frye case is rich and bizarre. The expert witness involved, William Marston, was a lawyer and scientist, who had made advances in a systolic blood pressure cuff to be used as a “lie detector.” Marston was also an advocate of free love and, with his wife and his mistress, the inventor of Wonder Woman and her lasso of truth.

Jill Lepore, a professor of history in Harvard University, has written an historical account of Marston and his colleagues. Jill Lepore, The Secret History of Wonder Woman (N.Y. 2014). More recently, Lepore has written an important law review on the historical and legal record of the Frye case, which is concealed in the terse 669 words of the Court of Appeals’ opinion. Jill Lepore, “On Evidence: Proving Frye as a Matter of Law, Science, and History,” 124 Yale L.J. 1092 (2015).

Lepore’s history is an important gloss on the Frye case, but her paper points to a larger, more prevalent, chronic problem in the law, which especially afflicts judicial decisions of scientific or technical issues. As an historian, Lepore is troubled, as we all should be, by the censoring, selecting, suppressing, and distorting of facts that go into judicial decisions. From cases and their holdings, lawyers are taught to infer rules that guide their conduct at the bar, and their clients’ conduct and expectations, but everyone requires fair access to the evidence to determine what facts are material to decision.

As Professor Lepore puts it:

“Marston is missing from Frye because the law of evidence, case law, the case method, and the conventions of legal scholarship — together, and relentlessly — hide facts.”

Id. at 1097. Generalizing from Marston and the Frye case, Lepore notes that:

“Case law is like that, too, except that it doesn’t only fail to notice details; it conceals them.”

Id. at 1099.

Lepore documents that Marston’s psychological research was rife with cherry picking and data dredging. Id. at 1113-14. Despite his degree magna cum laude in philosophy from Harvard College, his L.L.B from Harvard Law School (with no particular distinction), and his Ph.D. from Harvard University, Marston was not a rigorous scientist. In exploring the historical and legal record, not recounted in the Frye decision, Lepore’s history provides a wonderful early example, of what has become a familiar phenomenon of modern litigation: an expert witness who seeks to achieve acceptance for a dubious opinion or device in the courtroom rather than in the court of scientific opinion. Id. at 1122. The trial judge in Frye’s murder case, Justice McCoy, was an astute judge, and quite modest in his ability to evaluate the validity of Marston’s opinions, but he had more than sufficient perspicacity to discern that Marston’s investigation was “wildly unscientific,” with no control groups. Id. at 1135. The trial record of defense counsel’s proffer, and Justice McCoy’s rulings and comments from the bench, reproduced in Lepore’s article, anticipate and predict much of the scholarship surrounding both Frye and Daubert cases.

Lepore complains that the important historical record, including Marston’s correspondence with Professor Wigmore, the criminal fraud charges against Marston, and the correspondence of Frye’s lawyers, lies “filed, undigitized” in various archives. Id. at 1150. Although Professor Lepore tirelessly cites to internet URL sources when available, she could have easily made the primary historical materials available for all, using readily available internet technology. Lepore’s main thesis should encourage lawyers and law scholars to look beyond appellate decisions as the data for legal analysis.

The Rhetoric of Playing Dumb on Statistical Significance – Further Comments on Oreskes

January 17th, 2015

As a matter of policy, I leave the comment field turned off on this blog. I don’t have the time or patience to moderate discussions, but that is not to say that I don’t value feedback. Many readers have written, with compliments, concurrences, criticisms, and corrections. Some correspondents have given me valuable suggestions and materials. I believe I can say that aside from a few scurrilous emails, the feedback generally has been constructive, and welcomed.

My last post was on Naomi Oreskes’ opinion piece in the Sunday New York Times[1]. Professor Deborah Mayo asked me for permission to re-post the substance of this post, and to link to the original[2]. Mayo’s blog does allow for comments, and much to my surprise, the posts drew a great deal of attention, links, comment, and twittering. The number and intensity of the comments, as well as the other blog posts and tweets, seemed out of proportion to the point I was trying to make about misinterpreting confidence intervals and other statistical concepts. I suspect that some climate skeptics received my criticisms of Oreskes with a degree of schadenfreude, and that some who criticized me did so because they fear any challenge to Oreskes as a climate-change advocate. So be it. As I made clear in my post, I was not seeking to engage Oreskes on climate change or her judgments on that issue. What I saw in Oreskes’ article was the same rhetorical move made in the courtroom, and in scientific publications, in which plaintiffs environmentalists attempt to claim a scientific imprimatur for their conclusions without adhering to the rigor required for scientific judgments[3].

Some of the comments about Professor Oreskes caused me to take a look at her recent book, Naomi Oreskes & Erik M. Conway, Merchants of Doubt: How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming (N.Y. 2010). Interestingly, much of the substance of Oreskes’ newspaper article comes directly from this book. In the context of reporting on the dispute over the EPA’s meta-analysis of studies on passive smoking and lung cancer, Oreskes addressed the 95 percent issue:

“There’s nothing magic about 95 percent. It could be 80 percent. It could be 51 percent. In Vegas if you play a game with 51 percent odds in your favor, you’ll still come out ahead if you play long enough. The 95 percent confidence level is a social convention, a value judgment. And the value it reflects is one that says that the worst mistake a scientist can make is to fool herself: to think an effect is real when it is not. Statisticians call this a type I error. You can think of it as being gullible, naive, or having undue faith in your own ideas.89 To avoid it, scientists place the burden of proof on the person claiming a cause and effect. But there’s another kind of error-type 2-where you miss effects that are really there. You can think of that as being excessively skeptical or overly cautious. Conventional statistics is set up to be skeptical and avoid type I errors. The 95 percent confidence standard means that there is only 1 chance in 20 that you believe something that isn’t true. That is a very high bar. It reflects a scientific worldview in which skepticism is a virtue, credulity is not.90 As one Web site puts it, ‘A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error’.91 In fact, some statisticians claim that type 2 errors aren’t really errors at all,  just missed opportunities.92

Id. at 156-57 (emphasis added). Oreskes’ statement of the confidence interval, from her book, advances more ambiguity by not specifying what the “something” you don’t believe to be true. Of course, if it is the assumed parameter, then she has made the same error as she did in the Times. Oreskes’ further discussion of the EPA environmental tobacco smoke meta-analysis issue makes her meaning clearer, and her interpretation of statistical significance, less defensible:

“Even if 90 percent is less stringent than 95 percent, it still means that there is a 9 in 10 chance that the observed results did not occur by chance. Think of it this way. If you were nine-tenths sure about a crossword puzzle answer, wouldn’t you write it in?94

Id.  Throughout her discussion, Oreskes fails to acknowledge that the p-value assumes the correctness of the null hypothesis in order to assess the strength of the specific data as evidence against the null. As I have pointed out elsewhere, this misinterpretation of significance testing is a rhetorical strategy to evade significance testing, as well as to obscure the role of bias and confounding in accounting for data that differs from an expected value.

Oreskes also continues to maintain that a failure to reject the null is playing “dumb” and placing:

“the burden of proof on the victim, rather than, for example, the manufacturer of a harmful product-and we may fail to protect some people who are really getting hurt.”

Id. So again, the same petitio principii as we saw in the Times. Victimhood is exactly what remains to be established. Oreskes cannot assume it, and then criticize time-tested methods that fail to deliver a confirmatory judgment.

There are endnotes in her book, but the authors fail to cite any serious statistics text. The only reference of dubious relevance is another University of Michigan book, Stephen T. Ziliak & Deidre N. McCloskey, The Cult of Statistical Significance (2008). Enough said[4].

With a little digging, I learned that Oreskes and Conway are science fiction writers, and perhaps we should judge them by literary rather than scientific standards. See Naomi Oreskes & Erik M. Conway, “The Collapse of Western Civilization: A View from the Future,” 142 Dædalus 41 (2013). I do not imply any pejorative judgment of Oreskes for advancing her apocalyptic vision of the future of Earth’s environment as a work of fiction. Her literary work is a worthy thought experiment that has the potential to lead us to accept her precautionary judgments; and at least her publication, in Dædalus, is clearly labeled science fiction.

Oreskes’ future fantasy is, not surprisingly, exactly what Oreskes, the historian of science, now predicts in terms of catastrophic environmental change. Looking back from the future, the science fiction authors attempt to explore the historical origins of the catastrophe, only to discover that it is the fault of everyone who disagreed with Naomi Oreskes in the early 21st century. Heavy blame is laid at the feet of the ancestor scientists (Oreskes’ contemporaries) who insisted upon scientific and statistical standards for inferring conclusions from observational data. Implicit in the science fiction tale is the welcome acknowledgment that science should make accurate predictions.

In Oreskes’ science fiction, these scientists of yesteryear, today’s adversaries of climate-change advocates, were “almost childlike,” in their felt-need to adopt “strict” standards, and their adherence to severe tests derived from their ancestors’ religious asceticism. In other words, significance testing is a form of self-flagellation. Lest you think, I exaggerate, consider the actual words of Oreskes and Conway:

“In an almost childlike attempt to demarcate their practices from those of older explanatory traditions, scientists felt it necessary to prove to themselves and the world how strict they were in their intellectual standards. Thus, they placed the burden of proof on novel claims, including those about climate. Some scientists in the early twenty-first century, for example, had recognized that hurricanes were intensifying, but they backed down from this conclusion under pressure from their scientific colleagues. Much of the argument surrounded the concept of statistical significance. Given what we now know about the dominance of nonlinear systems and the distribution of stochastic processes, the then-dominant notion of a 95 percent confidence limit is hard to fathom. Yet overwhelming evidence suggests that twentieth-century scientists believed that a claim could be accepted only if, by the standards of Fisherian statistics, the possibility that an observed event could have happened by chance was less than 1 in 20. Many phenomena whose causal mechanisms were physically, chemically, or biologically linked to warmer temperatures were dismissed as “unproven” because they did not adhere to this standard of demonstration.

Historians have long argued about why this standard was accepted, given that it had no substantive mathematical basis. We have come to understand the 95 percent confidence limit as a social convention rooted in scientists’ desire to demonstrate their disciplinary severity. Just as religious orders of prior centuries had demonstrated moral rigor through extreme practices of asceticism in dress, lodging, behavior, and food–in essence, practices of physical self-denial–so, too, did natural scientists of the twentieth century attempt to demonstrate their intellectual rigor through intellectual self-denial.14 This practice led scientists to demand an excessively stringent standard for accepting claims of any kind, even those involving imminent threats.”

142 Dædalus at 44.

The science fiction piece in Dædalus has now morphed into a short book, which is billed within as a “haunting, provocative work of science-based fiction.” Naomi Oreskes & Erik M. Conway, The Collapse of Western Civilization: A View from the Future (N.Y. 2014). Under the cover of fiction, Oreskes and Conway provide their idiosyncratic, fictional definition of statistical significance, in a “Lexicon of Archaic Terms,” at the back of the book:

statistical significance  The archaic concept that an observed phenomenon could only be accepted as true if the odds of it happening by chance were very small, typically taken to be no more than 1 in 20.”

Id. at 61-62. Of course, in writing fiction, you can make up anything you like. Caveat lector.


 

[1] SeePlaying Dumb on Statistical Significance” (Jan. 4, 2015).

[2] SeeSignificance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)” (Jan. 4, 2015).

[3] SeeRhetorical Strategy in Characterizing Scientific Burdens of Proof” (Nov. 15, 2014).

[4] SeeThe Will to Ummph” (Jan. 10, 2012).

Rhetorical Strategy in Characterizing Scientific Burdens of Proof

November 15th, 2014

The recent opinion piece by Kevin Elliott and David Resnik exemplifies a rhetorical strategy that idealizes and elevates a burden of proof in science, and then declares it is different from legal and regulatory burdens of proof. Kevin C. Elliott and David B. Resnik, “Science, Policy, and the Transparency of Values,” 122 Envt’l Health Persp. 647 (2014) [Elliott & Resnik]. What is astonishing about this strategy is the lack of support for the claim that “science” imposes such a high burden of proof that we can safely ignore it when making “practical” legal or regulatory decisions. Here is how the authors state their claim:

“Very high standards of evidence are typically expected in order to infer causal relationships or to approve the marketing of new drugs. In other social contexts, such as tort law and chemical regulation, weaker standards of evidence are sometimes acceptable to protect the public (Cranor 2008).”

Id.[1] Remarkably, the authors cite no statute, no case law, and no legal treatise for the proposition that the tort law standard for causation is somehow lower than for a scientific claim of causality. Similarly, the authors cite no support for their claim that regulatory pronouncements are judged under a lower burden. One only need consider the burden a sponsor faces in establishing medication efficacy and safety in a New Drug Application before the Food and Drug Administration.  Of course, when agencies engage in assessing causal claims regarding safety, they often act under regulations and guidances that lessen the burden of proof from what we would be required in a tort action.[2]

And most important, Elliott and Resnik fail to cite to any work of scientists for the claim that scientists require a greater burden of proof before accepting a causal claim. When these authors’ claims of differential burdens of proof were challenged by a scientist, Dr. David Schwartz, in a letter to the editors, the authors insisted that they were correct, again citing to Carl Cranor, a non-lawyer, non-scientist:

“we caution against equating the standards of evidence expected in tort law with those expected in more traditional scientific contexts. The tort system requires only a preponderance of evidence (> 50% likelihood) to win a case; this is much weaker evidence than scientists typically demand when presenting or publishing results, and confusion about these differing standards has led to significant legal controversies (Cranor 2006).”

Reply to Dr. Schwartz. The only thing the authors added to the discussion was to cite to the same work by Carl Cranor[3], but change the date of the book.

Whence comes the assertion that science has a heavier burden of proof? Elliott and Resnik cite Cranor for their remarkable proposition, and so where did Cranor find support for the proposition at issue here? In his 1993 book, Cranor suggests that we “can think of type I and II error rates as “standards of proof,” which begs the question whether they are appropriately used to assess significance or posterior probabilities[4]. Cranor goes so far in his 1993 as to describe the usual level of alpha as the “95%” rule, and that regulatory agencies require something akin to proof “beyond a reasonable doubt,” when they require two “statistically significant” studies[5]. Thus Cranor’s opinion has its origins in his commission of the transposition fallacy[6].

Cranor has persisted in his fallacious analysis in his later books. In his 2006 book, he erroneously equates the 95% coefficient of statistical confidence with 95% certainty of knowledge[7]. Later in the text, he asserts that agency regulations are written when supported by “beyond a reasonable doubt.[8]

To be fair, it is possible to find regulators stating something close to what Cranor asserts, but only when they themselves are committing the transposition fallacy:

“Statistical significance is a mathematical determination of the confidence in the outcome of a test. The usual criterion for establishing statistical significance is the p-value (probability value). A statistically significant difference in results is generally indicated by p < 0.05, meaning there is less than a 5% probability that the toxic effects observed were due to chance and were not caused by the chemical. Another way of looking at it is that there is a 95% probability that the effect is real, i.e., the effect seen was the result of the chemical exposure.”

U.S. Dep’t of Labor, Guidance for Hazard Determination for Compliance with the OSHA Hazard Communication Standard (29 CFR § 1910.1200) Section V (July 6, 2007).

And it is similarly possible to find policy wonks expressing similar views. In 1993, the Carnegie Commission published a report in which it tried to explain away junk science as simply the discrepancy in burdens of proof between law and science, but its reasoning clearly points to the Commission’s commission of the transposition fallacy:

“The reality is that courts often decide cases not on the scientific merits, but on concepts such as burden of proof that operate differently in the legal and scientific realms. Scientists may misperceive these decisions as based on a misunderstanding of the science, when in actuality the decision may simply result from applying a different norm, one that, for the judiciary, is appropriate.  Much, for instance, has been written about ‘junk science’ in the courtroom. But judicial decisions that appear to be based on ‘bad’ science may actually reflect the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences.”

The Carnegie Commission on Science, Technology, and Government, Report on Science and Technology in Judicial Decision Making 28 (1993)[9].

Resnik and Cranor’s rhetoric is a commonplace in the courtroom. Here is how the rhetorical strategy plays out in courtroom. Plaintiffs’ counsel elicits concessions from defense expert witnesses that they are using the “norms” and standards of science in presenting their opinions. Counsel then argue to the finder of fact that the defense experts are wonderful, but irrelevant because the fact finder must decide the case on a lower standard. This stratagem can be found supported by the writings of plaintiffs’ counsel and their expert witnesses[10]. The stratagem also shows up in the writings of law professors who are critical of the law’s embrace of scientific scruples in the courtroom[11].

The cacophony of error, from advocates and commentators, have led the courts into frequent error on the subject. Thus, Judge Pauline Newman, who sits on the United States Court of Appeals for the Federal Circuit, and who was a member of the Committee on the Development of the Third Edition of the Reference Manual on Scientific Evidence, wrote in one of her appellate opinions[12]:

“Scientists as well as judges must understand: ‘the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences’.”

Reaching back even further into the judiciary’s wrestling with the issue of the difference between legal and scientific standards of proof, we have one of the clearest and clearly incorrect statements of the matter[13]:

“Petitioners demand sole reliance on scientific facts, on evidence that reputable scientific techniques certify as certain. Typically, a scientist will not so certify evidence unless the probability of error, by standard statistical measurement, is less than 5%. That is, scientific fact is at least 95% certain.  Such certainty has never characterized the judicial or the administrative process. It may be that the ‘beyond a reasonable doubt’ standard of criminal law demands 95% certainty.  Cf. McGill v. United States, 121 U.S.App. D.C. 179, 185 n.6, 348 F.2d 791, 797 n.6 (1965). But the standard of ordinary civil litigation, a preponderance of the evidence, demands only 51% certainty. A jury may weigh conflicting evidence and certify as adjudicative (although not scientific) fact that which it believes is more likely than not. ***”

The 95% certainty appears to derive from 95% confidence intervals, although “confidence” is a technical term in statistics, and it most certainly does not mean the probability of the alternative hypothesis under consideration.  Similarly, the probability that is less than 5% is not the probability that the null hypothesis is correct. The United States Court of Appeals for the District of Columbia thus fell for the rhetorical gambit in accepting the strawman that scientific certainty is 95%, whereas civil and administrative law certainty is a smidgeon above 50%.

We should not be too surprised that courts have erroneously described burdens of proof in the realm of science. Even within legal contexts, judges have a very difficult time articulating exactly how different verbal formulations of the burden of proof translate into probability statements. In one of his published decisions, Judge Jack Weinstein reported an informal survey of judges of the Eastern District of New York, on what they believed were the correct quantizations of legal burdens of proof. The results confirm that judges, who must deal with burdens of proof as lawyers and then as “umpires” on the bench, have no idea of how to translate verbal formulations into mathematical quantities: Fatico

U.S. v. Fatico, 458 F.Supp. 388 (E.D.N.Y. 1978). Thus one judge believed that “clear, unequivocal and convincing” required a higher level of proof (90%) than “beyond a reasonable doubt,” and no judge placed “beyond a reasonable doubt” above 95%. A majority of the judges polled placed the criminal standard below 90%.

In running down Elliott, Resnik, and Cranor’s assertions about burdens of proof, all I could find was the commonplace error involved in moving from 95% confidence to 95% certainty. Otherwise, I found scientists declaring that the burden of proof should rest with the scientist who is making the novel causal claim. Carl Sagan famously declaimed, “extraordinary claims require extraordinary evidence[14],” but he appears never to have succumbed to the temptation to provide a quantification of the posterior probability that would cinch the claim.

If anyone has any evidence leading to support for Resnik’s claim, other than the transposition fallacy or the confusion between certainty and coefficient of statistical confidence, please share.


 

[1] The authors citation is to Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice (NY 2008). Professor Cranor teaches philosophy at one of the University of California campuses. He is neither a lawyer nor a scientist, but he does participate with some frequency as a consultant, and as an expert witness, in lawsuits, on behalf of claimants.

[2] See, e.g., In re Agent Orange Product Liab. Litig., 597 F. Supp. 740, 781 (E.D.N.Y. 1984) (Weinstein, J.) (“The distinction between avoidance of risk through regulation and compensation for injuries after the fact is a fundamental one.”), aff’d 818 F.2d 145 (2d Cir. 1987) (approving district court’s analysis), cert. denied sub nom. Pinkney v. Dow Chemical Co., 487 U.S. 1234 (1988).

[3] Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice (NY 2006).

[4] Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (Oxford 1993) (One can think of α, β (the chances of type I and type II errors, respectively and 1- β as measures of the “risk of error” or “standards of proof.”) See also id. at 44, 47, 55, 72-76.

[5] Id. (squaring 0.05 to arrive at “the chances of two such rare events occurring” as 0.0025).

[6] Michael D. Green, “Science Is to Law as the Burden of Proof is to Significance Testing: Book Review of Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law,” 37 Jurimetrics J. 205 (1997) (taking Cranor to task for confusing significance and posterior (burden of proof) probabilities). At least one other reviewer was not as discerning as Professor Green and fell for Cranor’s fallacious analysis. Steven R. Weller, “Book Review: Regulating Toxic Substances: A Philosophy of Science and Law,” 6 Harv. J. L. & Tech. 435, 436, 437-38 (1993) (“only when the statistical evidence gathered from studies shows that it is more than ninety-five percent likely that a test substance causes cancer will the substance be characterized scientifically as carcinogenic … to determine legal causality, the plaintiff need only establish that the probability with which it is true that the substance in question causes cancer is at least fifty percent, rather than the ninety-five percent to prove scientific causality”).

[7] Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice 100 (2006) (incorrectly asserting, without further support, that “[t]he practice of setting α =.05 I call the “95% rule,” for researchers want to be 95% certain that when knowledge is gained [a study shows new results] and the null hypothesis is rejected, it is correctly rejected.”).

[8] Id. at 266.

[9] There were some scientists on the Commission’s Task Force, but most of the members were lawyers.

[10] Jan Beyea & Daniel Berger, “Scientific misconceptions among Daubert gatekeepers: the need for reform of expert review procedures,” 64 Law & Contemporary Problems 327, 328 (2001) (“In fact, Daubert, as interpreted by ‛logician’ judges, can amount to a super-Frye test requiring universal acceptance of the reasoning in an expert’s testimony. It also can, in effect, raise the burden of proof in science-dominated cases from the acceptable “more likely than not” standard to the nearly impossible burden of ‛beyond a reasonable doubt’.”).

[11] Lucinda M. Finley, “Guarding the Gate to the Courthouse:  How Trial Judges Are Using Their Evidentiary Screening Role to Remake Tort Causation Rules,” 336 DePaul L. Rev. 335, 348 n. 49 (1999) (“Courts also require that the risk ratio in a study be ‘statistically significant,’ which is a statistical measurement of the likelihood that any detected association has occurred by chance, or is due to the exposure. Tests of statistical significance are intended to guard against what are called ‘Type I’ errors, or falsely ascribing a relationship when there in fact is not one (a false positive).” Finley erroneously ignores the conditioning of the significance probability on the null hypothesis, and she suggests that statistical significance is sufficient for ascribing causality); Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 42 n. 30, 61 (2007) (“Another way of explaining this is that it describes the probability that the procedure produced the observed effect by chance.”) (“Statistical significance is a statement about the frequency with which a particular finding is likely to arise by chance.″).

[12] Hodges v. Secretary Dep’t Health & Human Services, 9 F.3d 958, 967 (Fed. Cir. 1993) (Newman, J., dissenting) (citing and quoting from the Report of the Carnegie Commission on Science, Technology, and Government, Science and Technology in Judicial Decision Making 28 (1993).

[13] Ethyl Corp. v. EPA, 541 F.2d 1, 28 n.58 (D.C. Cir.), cert. denied, 426 U.S. 941 (1976).

[14] Carl Sagan, Broca’s Brain: Reflections on the Romance of Science 93 (1979).

 THE STANDARD OF APPELLATE REVIEW FOR RULE 702 DECISIONS

November 12th, 2014

Back in the day, some Circuits of the United States Court of Appeal embraced an asymmetric standard of review of district court decisions concerning the admissibility of expert witness opinion evidence. If the trial court’s decision was to exclude an expert witness, and that exclusion resulted in summary judgment, then the appellate court would take a “hard look” at the trial court’s decision. If the trial court admitted the expert witness’s opinions, and the case proceeded to trial, with opponent of the challenged expert witness losing the verdict, then the appellate court would take a not-so “hard look” the trial court’s decision to admit the opinion. In re Paoli RR Yard PCB Litig., 35 F.3d 717, 750 (3d Cir.1994) (Becker, J.), cert. denied, 115 S.Ct.1253 (1995).

In Kumho Tire, the 11th Circuit followed this asymmetric approach, only to have the Supreme Court reverse and render. Unlike the appellate procedure followed in Daubert, the high Court took the extra step of applying the symmetrical standard of review, presumably for the didactic purpose of showing the 11th Circuit how to engage in appellate review. Carmichael v. Kumho Tire Co., 131 F.3d 1433 (11th Cir. 1997), rev’d sub nom. Kumho Tire Co. v. Carmichael, 526 U.S. 137, 158-59 (1999).

If anything is clear from the Kumho Tire decision, courts do not have discretion to apply an asymmetric standard to their evaluation of a challenge, under Federal Rule of Evidence 702, to a proffered expert witness opinion. Justice Stephen Breyer, in his opinion for the Court, in Kumho Tire, went on to articulate the requirement that trial courts must inquire whether an expert witness ‘‘employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.’’ Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999). Again, trial courts do not have the discretion to abandon this inquiry.

The “same intellectual rigor” test may have some ambiguities that make application difficult. For instance, identifying the “relevant” field or discipline may be contested. Physicians traditionally have not been trained in statistical analyses, yet they produce, and rely extensively upon, clinical research, the proper conduct and interpretation of which requires expertise in study design and data analysis. Is the relevant field biostatistics or internal medicine? Given that the validity and reliability of the relied upon studies come from biostatistics, courts need to acknowledge that the rigor test requires identification of the “appropriate” field — the field that produces the criteria or standards of validity and interpretation.

Justice Breyer did grant that trial courts must have some latitude in determining how to conduct their gatekeeping inquiries. Some cases may call for full-blown hearings and post-hearing proposed findings of fact and conclusions of law; some cases may be easily decided upon the moving papers. Justice Breyer’s grant of “latitude,” however, wanders off target:

“The trial court must have the same kind of latitude in deciding how to test an expert’s reliability, and to decide whether or when special briefing or other proceedings are needed to investigate reliability, as it enjoys when it decides whether that expert’s relevant testimony is reliable. Our opinion in Joiner makes clear that a court of appeals is to apply an abuse-of-discretion standard when it ‛review[s] a trial court’s decision to admit or exclude expert testimony’. 522 U. S. at 138-139. That standard applies as much to the trial court’s decisions about how to determine reliability as to its ultimate conclusion. Otherwise, the trial judge would lack the discretionary authority needed both to avoid unnecessary ‛reliability’ proceedings in ordinary cases where the reliability of an expert’s methods is properly taken for granted, and to require appropriate proceedings in the less usual or more complex cases where cause for questioning the expert’s reliability arises. Indeed, the Rules seek to avoid ‛unjustifiable expense and delay’ as part of their search for ‛truth’ and the ‛jus[t] determin[ation]’ of proceedings. Fed. Rule Evid. 102. Thus, whether Daubert ’s specific factors are, or are not, reasonable measures of reliability in a particular case is a matter that the law grants the trial judge broad latitude to determine. See Joiner, supra, at 143. And the Eleventh Circuit erred insofar as it held to the contrary.”

Kumho, 526 U.S. at 152-53.

Now the segue from discretion to fashion the procedural mechanism for gatekeeping review to discretion to fashion the substantive criteria or standards for determining “intellectual rigor in the relevant field” represents a rather abrupt shift. The leap from discretion to fashion procedure to discretion to fashion substantive criteria of validity has no basis in prior law, in linguistics, or in science. For instance, Justice Breyer would be hard pressed to uphold a trial court’s refusal to consider bias and confounding in assessing whether epidemiologic studies established causality in a given case, notwithstanding the careless language quoted above.

The troubling nature of Justice Breyer’s language did not go unnoticed at the time of the Kumho Tire case. Indeed, three of the Justices in Kumho Tire concurred to clarify:

“I join the opinion of the Court, which makes clear that the discretion it endorses—trial-court discretion in choosing the manner of testing expert 1reliability—is not discretion to abandon the gatekeeping function. I think it worth adding that it is not discretion to perform the function inadequately.”

Kumho Tire Co. v. Carmichael, 526 U.S. 137, 158-59 (1999) (Scalia, J., concurring, with O’Connor, J., and Thomas, J.)

Of course, this language from Kumho Tire really cannot be treated as binding after the statute interpreted, Rule 702, was modified in 2000. The judges of the inferior federal courts have struggled with Rule 702, sometimes more to evade its reach than to perform gatekeeping in an intelligent way. Quotations of passages from cases decided before the statute was amended and revised should be treated with skepticism.

Recently, the Sixth Circuit quoted Justice Breyer’s language about latitude from Kumho Tire, in the Circuit’s decision involving GE Healthcare’s radiographic contrast medium, Omniscan. Decker v. GE Healthcare Inc., 2014 U.S. App. LEXIS 20049, at *29 (6th Cir. Oct. 20, 2014). Although the Decker case is problematic in many ways, the defendant did not challenge general causation between gadolinium and nephrogenic systemic fibrosis, a painful, progressive connective tissue disease, which afflicted the plaintiff. It is unclear exactly what sort of latitude in applying the statute, the Sixth Circuit was hoping to excuse.

Expert Witness Mining – Antic Proposals for Reform

November 4th, 2014

Law Reviews and Altered States of Reality

In 2008, Justice Breyer observed wryly that “there is evidence that law review articles have left terra firma to soar into outer space”; and Judge Posner has criticized law review articles for the “silly titles, the many opaque passages, the antic proposals, the rude polemics, [and] the myriad pretentious citations.” In 2010, Justice Scalia, who was a law-review-producing law professor for the University of Virginia for several years, responded to a lawyer’s oral argument, in McDonald v. City of Chicago, by suggesting that the argument had no support in Supreme Court precedent, but the unsupported argument would make the lawyer the “the darling of the professoriate.” At the June 2011 Fourth Circuit Judicial Conference, Chief Justice Roberts opined that law reviews are generally not “particularly helpful for practitioners and judges.”  In his words:

“Pick up a copy of any law review that you see and the first article is likely to be, you know, the influence of Immanuel Kant on evidentiary approaches in 18th-century Bulgaria, or something, which I’m sure was of great interest to the academic that wrote it, but isn’t of much help to the bar.”

See Debra Cassens Weiss, “Law Prof Responds After Chief Justice Roberts Disses Legal ScholarshipAm. Bar Ass’n J. (July 07, 2011). Lawyers would think the Justices view law review scholarship as a useless but generally harmless activity. Sometimes, however, law review articles can actually be harmful.

Selection Effects in the Retention and Presentation of Expert Witnesses

The complaints about law review scholarship are obviously based upon extremes and travesties. Interestingly, Judge Posner himself has been no slacker when it comes to producing law review articles with “antic proposals.” See, e.g., Richard A. Posner, “An Economic Approach to the Law of Evidence,” 51 Stan. L. Rev. 1477, 1541–42 (1999). In the tradition of non-traditional, rationalist proposals that ignore experience and make up something completely untested, Judge Richard Posner has advocated rule changes that would require lawyers

“to disclose the name of all the experts whom they approached as possible witnesses before settling on the one testifying. This would alert the jury to the problem of ‘witness shopping’.”

Posner, 51 Stan. L. Rev. at 1541. The point of Judge Posner’s radical reform is to alert triers of fact to whether the expert witness testifying is the first, or the umpteenth expert witness interviewed before a suitable opinion had been “procured,” so that the fact finder can draw the“ reasonable inference” that the case must be weaker than presented if the party went through so many expert witnesses before coming up with one who would testify in the case. If one party disclosed but one expert witness, the one that actually testified, and the other party disclosed X such witnesses (where X >1), then the fact finder could find in favor of the first party upon the basis of the so-called reasonable inference.

Posner’s proposal is at best a proxy for accuracy and validity in expert witness opinion testimony, and one for which Posner presents no evidence to support his hoped-for improvement in juridical accuracy. Not only does Judge Posner present no evidence that his proposed reform and suggested inference would be in the least bit reasonable and probative of the truth, he fails to address the obvious incentives that would be created by his proposal. Fearing the prejudicial inference from having consulted with “too many” expert witnesses, lawyers, operating under the Posner Rule, would have strong incentives to go to the expert witness “one-stop-shopping” mall, where they know they can obtain expert witnesses guaranteed to align themselves with the needed litigation positions and claims. The Posner Rule would also give a strong advantage to lawyers more skilled in vetting and selecting expert witnesses, to the detriment of less experienced lawyers. Of course, lawyers who are willing to go shopping at the meretricious mall or to employ a “cleaner” who brokers the selection without footprints might escape the bite of the Posner adverse inference.

Posner’s proposed rule ignores what is at the heart of identifying and selecting expert witnesses to testify. Obviously lawyers must identify potential witnesses with suitable expertise to address the issues raised by the litigation. Database searches, such as PubMed and Google Scholar searches for bio-medical experts, can go a long way towards identifying candidates, but interviews are important as well. Posner would chill lawyers’ effective representation by placing an adverse inference upon their diligence in any contact with the person other than the “one” who will be anointed to be the party’s designated testifier.

Meetings and interviews with prospective expert witnesses to ascertain whether the witness candidate has sufficient time and interest in fulfill the litigation assignment. Expertise in the area is hardly a guarantee that the candidate will be interested in answering the specific questions that are contested in the litigation. The lawyers must also ascertain whether the witness candidate has the stamina, patience, and aptitude for the litigation context. Not all real experts do, and the consequences of engaging an expert who does not have the qualities to make a good expert witness can be disastrous. Witness candidates must also be screened for their communication skills, their appearance, and even basic hygiene. The most brilliant expert who mumbles, or who is unkempt, is useless in litigation.

Lawyers must evaluate witness candidates for conflicts of interest, many of which are unknowable until there is a face-to-face meeting. Does the witness candidate have a significant other or child who works for the litigation industry (plaintiffs’ bar) or for the defendant industry under assault in the litigation at hand? Either way, the candidate may be compromised. Was the candidate mentored by an expert witness on the other side? Is the candidate on an editorial board with the adversary’s witnesses? Is the candidate close personal friends of the adversaries or their witnesses, such that he will be less than enthusiastic in showing the infirmities of the other side’s positions? Any of these questions could lead to answers that practically disqualify a witness candidate from consideration. Proceeding without such vetting could be catastrophic for the client and counsel. Burdening the vetting process with the threat of an adverse inference is deeply unfair to diligent counsel trying to represent and serve their clients.

And there are yet additional considerations that require exploration with any witness candidate. Expert witnesses are not equally able to deal with adverse authority in the form of a noted scientist who has taken a stand on the litigation issue, or a superficially appearing authoritative author who has published an adverse opinion. As well trained as they might be, some real experts are “sheep,” who are most comfortable following the herd, and not independent thinkers. Not all experts are willing or able to read studies as critically as needed for the litigation situation, which can sometimes be more demanding than the scientific arena. Lawyers charged with retaining expert witnesses must assess their clients’ positions and determine how well their expert witnesses will perform under all the circumstances of the case.

Professor Christopher Robertson proposes an even more radical reform of the law of expert witness by removing the selection and control of expert witnesses from parties and their counsel, completely. Robertson would somehow create a pool of expert witnesses on the issues in each case, and assign them to parties in a double-blinded randomized fashion. Christopher Tarver Robertson, “Blind Expertise,” 85 N.Y. Univ. L. Rev. 174, 211 (2010). Aside from depriving litigants of autonomy and control over their cases, this approach has even greater potential for generating false results. How do the expert witness come to be retained for this process? Any two expert witnesses may very well come to an incorrect analysis precisely because they do not have the benefit of each other’s report to develop the full range of data to be considered. What if the expert assigned to plaintiff concludes that there is no case, but the expert assigned to the defendant concludes that the plaintiff’s case is meritorious? Normally, plaintiffs’ expert witnesses must file their reports in advance of the defense witnesses, who then have the opportunity to rebut but also the benefit of all the data included. Simultaneous reports risk major omissions of data to be considered on both sides. The adversarial cauldron works to ensure completeness in what data and studies are considered.

Now comes Jonah Gelbach to attempt a probabilistic, theoretical defense of reforms in the Posner-Robertson mold. Jonah B. Gelbach, “Expert Mining and Required Disclosure,” 81 U. Chicago L. Rev. 131 (2014). Professor Gelbach is a well-trained economist, and a recently minted lawyer (Yale 2013), who is now an Associate Professor at the University of Pennsylvania Law School. Gelbach’s experience with the practice of law is limited to working as a law-school intern at David Rosen & Associates, in New Haven, Connecticut, before joining the Penn faculty. His proposals may need to be taken with a 100 grains of aspirin.

Although Gelbach disagrees with particulars of the Posner-Robertson proposals, Gelbach joins with them to opine that “[t]o the extent that additional fully disclosed expert testimony increases the fact finder’s information, we can expect a beneficial increase in accuracy.” Gelbach at 133. Gelbach’s dictum, however, is an ipse dixit, and he offers only a limited hypothetical case in which full disclosure of data should be required to solve the problem. And even in his hypothetical case, the disclosure of the identities of the testers is unnecessary to correct the error that Gelbach predicts. Gelbach’s call for the disclosure of consulting expert witnesses introduces only a collateral issue that has nothing to do with the accuracy of the scientific reasoning.

Gelbach analogizes “witness shopping” to data dredging and multiple testing, with a known inflation in the rate of false positive outcomes. If a party directs multiple to conduct single outcome measurements or tests, then that party can recreate the results of multiple testing without having to disclosure the number of independent tests. Gelbach’s argument is at its strongest for a simplistic model of a simple measurement, with errors normally distributed, with accuracy of the measurement tied to the outcome of the case. Gelbach at 136. Gelbach analogizes expert witness mining with data mining, and goes so far as to provide a calculation of false positive rates from multiple testing.

The sort of multiple testing Gelbach condemns is even more obvious when something other than random error is involved. Consider the need of litigants to have chest radiograph interpreted for the presence or absence of a pneumoconiosis in occupational dust disease litigation. Not only is there an intra-observer variability, there are potential or known subjective biases in radiograph interpretations. Gelbach need not worry about multiple testing because the need for economic efficiency already encourages many lawyers to employ radiologists who are must biased in favor of their clients’ positions. The bigger problem would be to encourage lawyers to obtain an honest second opinion, which might make them less strident about their litigation positions when discussing possible settlement.

Gelbach appears to believe that mandatory disclosure of the number of expert witnesses hired as well as the contents of the written and oral reports issued by the party’s nontestifying expert witnesses is needed to abate the potential harm from “expert mining.” By introducing the probabilistic modeling of Type I and Type II errors, however, Gelbach elevates proofiness over clear thinking about the issue. The simple solution to Gelbach’s soil measurement hypothetical is to require disclosure of all testing data, regardless whether conducted by expert witnesses designated as testifying or as consulting. All are agents of the party for purposes of creating data in the form of the hypothesized soil measurement. Indeed, Gelbach’s hypothetical envisions a technical laboratory that conducts such measurements, and the lab might not even be associated with a person designated to serve as an expert witness on the litigation issues.

Gelbach’s soil-measurement case is thus, for the most part, a straw-person case. In the vast majority of cases, multiple expert witness interviews leading up to selection and retention is, however, not at all like multiple testing, either in its ability to generate deliberate false positive or false negative opinions. The evidence remains what it is, and the parameter unchanged, whatever the qualitative judgments of the witness candidates. In most litigation contexts, the data upon which the expert witnesses will rely comes from published studies, and not from a single measurement under either side’s control and ability to resample many times through the agency of multiple expert witnesses. The Rules need to help the triers of fact discern the truth, not irrelevant proxies for the truth. If the triers of fact are incompetent to adjudge the actual evidence, then we may need to find triers who are competent.

The extension of the soil hypothetical to all of expert witness opinion testimony is unwarranted. Accuracy and validity of expert opinion is not “independent and identically distributed.” Truth and accuracy in scientific judgment as applied to litigation scientific questions are not random variables with known distributions.

A party may have to comb through dozens of potential expert witnesses before arriving at an expert witness with an appropriate, accurate answer to the litigation issue. When confronted with a pamphlet entitled “100 Authors against Einstein,” Albert Einstein quipped “if I were wrong, one would have been enough.”  See Remigio Russo, 18 Mathematical Problems in Elasticity 125 (1996) (quoting Einstein). Legal counsel should not have their clients’ cause compromised because they had the misfortune of consulting the “100 Authors” before arriving at Einstein’s door. The Posner-Robertson-Gelbach proposals all suffer the same flaw: they defer unduly to conformism and ignore the truth, validity, and accuracy of procured opinions.

Disputes in science are resolved with data, from high-quality, reproducible experimental or observational studies, not with appeals to the number of speakers. The number of expert witness candidates who were interviewed or who offered preliminary opinions is irrelevant to the task assigned to the finder of fact in a case involving scientific evidence. The final, proffered opinion of the testifying expert witness is only as good as the evidence and analysis upon which it rests, which under the current rules, should be fully disclosed.

Transparency, Confusion, and Obscurantism

October 31st, 2014

In NIEHS Transparency? We Can See Right Through You (July 10, 2014), I chastised authors Kevin C. Elliott and David B. Resnik for their confusing and confused arguments about standards of proof, the definition of risk, and conflicts of interest (COIs). See Kevin C. Elliott and David B. Resnik, “Science, Policy, and the Transparency of Values,” 122 Envt’l Health Persp. 647 (2014) [Elliott & Resnik]. In their focus on environmentalism and environmental policy, Elliott and Resnik seem intent upon substituting various presumptions, leaps of faith, and unproven extrapolations for actual evidence and valid inference, in the hope of improving the environment and reducing risk to life. But to get to their goal, Elliott and Resnik engage in various equivocations and ambiguities in their use of “risk,” and they compound the muddle by introducing a sliding scale of “standards of evidence,” for legal, regulatory, and scientific conclusions.

Dr. David H. Schwartz is a scientist, who received his doctoral degree in Neuroscience from Princeton University, and his postdoctoral training in Neuropharmacology and Neurophysiology at the Center for Molecular and Behavioral Neuroscience, in Rutgers University. Dr. Schwartz has since gone to found one of the leading scientific consulting firms, Innovative Science Solutions (ISS), which supports both regulatory and litigation claims and defenses, as may scientifically appropriate. Given his experience, Dr. Schwartz is well positioned to address the standards of scientific evidentiary conclusions across regulatory, litigation, and scientific communities.

In this month’s issue of Environmental Health Perspectives (EHP), Dr. David Schwartz adds to the criticism of Elliott and Resnik’s tendentious editorial. David H. Schwartz, “Policy and the Transparency of Values in Science,” 122 Envt’l Health Persp. A291 (2014). Schwartz points out that “[a]lthough … different venues or contexts require different standards of evidence, it is important to emphasize that the actual scientific evidence remains constant.” Id.

Dr. Schwartz points out transparency is needed in how standards and evidence are represented in scientific and legal discourse, and he takes Elliott and Resnik to task for arguing, from ignorance, that litigation burdens are different from scientific standards. At times some writers misrepresent the nature of their evidence, or its weakness, and when challenged, attempt to excuse the laxness in standards by adverting to the regulatory or litigation contexts in which they are speaking. In some regulatory contexts, the burdens of proof are deliberately reduced, or shifted to the regulated industry. In litigation, the standard or burden of proof is rarely different from the scientific enterprise itself. As the United States Supreme Court made clear, trial courts must inquire whether an expert witness ‘‘employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.’’ Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999). Expert witnesses who fail to exercise the same intellectual rigor in the courtroom as in the laboratory, are eminently disposable or excludable from the legal process.

Schwartz also points out, as I had in my blog post, that “[w]hen using science to inform policy, transparency is critical. However, this transparency should include not only financial ties to industry but also ties to advocacy organizations and other strongly held points of view.”

In their Reply to Dr. Schwartz, Elliott and Resnik concede the importance of non-financial conflicts of interest, but they dig in on the supposed lower standard for scientific claims:

“we caution against equating the standards of evidence expected in tort law with those expected in more traditional scientific contexts. The tort system requires only a preponderance of evidence (> 50% likelihood) to win a case; this is much weaker evidence than scientists typically demand when presenting or publishing results, and confusion about these differing standards has led to significant legal controversies (Cranor 2006).”

Rather than citing any pertinent or persuasive legal authority, Elliott and Resnik cite an expert witness, Carl Cranor, neither a lawyer nor a scientist, who has worked steadfastly for the litigation industry (the plaintiffs’ bar) on various matters. The “caution” of Elliott and Resnik is directly contradicted by the Supreme Court’s pronouncement in Kumho Tire, and is fueled by a ignoratio elenchi that is based upon a confusion between the probability of repeated sampling with confidence intervals (usually 95%) and the posterior probability of a claim: namely, the probability of the claim given the admissible evidence. As the Reference Manual for Scientific Evidence makes clear, these are very different probabilities, which Cranor and others have consistently confused. Elliott and Resnik ought to know better.

Frank Advocacy on Welding Health “Effects”

September 1st, 2014

Arthur L. Frank is a professor and the chair of Drexel University School of Public Health’s program in Environmental and Occupational Health. Frank testifies fairly extensively for plaintiffs in asbestos litigation. See, e.g., Frank Affidavit. He also has testified for plaintiffs in manganese fume litigation brought by welders, although he has no specialty training in movement disorder neurology.

Although Arthur Frank has testified for plaintiffs in asbestos cases, he does not appear to comply with professional association disclosure of conflicts, when he presents on asbestos issues. See, e.g.,More Hypocrisy Over Conflicts of Interest” (Dec. 4, 2010). The same disregard for conflict disclosures seems to hold for his work on welding health outcomes.

Last year, in September 2013, Frank presented a poster at the 11th Inhaled Particles XI (IPXI) conference, organized by the British Occupational Hygiene Society (BOHS). Frank’s poster was entitled,Health Effects of Welding Fumes,” although the poster reported only a cross-sectional study without controls from Qingdao City, China. Frank provided no conflict-of-interest disclosure in the poster. Arthur Frank, Huanqiang Zhang, and Chunsheng Xu, “Health Effects of Welding Fumes,” Inhaled Particle XI Conference, Nottingham, U.K. (Sept. 2013). His current C.V., available online, reports this as an abstract from the Inhaled Articles conference. One does not need an epidemiologic study to see how scientists could choke on this article if inhaled.

Frank’s study was based upon an anonymous questionnaire to 505 steel welders, at state-owned, foreign-funded, and privately owned companies, in Qingdao City, China. No industrial hygiene exposure measurements or use of personal respiratory protection is reported. No physical examinations are reported. No statistical tests are reported. Frank reported a “symptom” of unsteady gait/difficulty walking of 1%, but he and his Chinese colleagues did not report whether this 1% had musculo-skeletal or neurological problems. The former would be fairly common among tradesmen such as welders whose work often puts them at risk of traumatic injury.

According to Frank, a “striking” finding was an 18% prevalence hand tremor among those working at least 15 years, with lower prevalence reported for less than 5 years (4%), at 5 years (3%), and at 10 years (5%). Frank offered no explanation of how these prevalence rates were ascertained in a cross-sectional questionnaire based study; nor did Frank offer any qualification as to whether these hand tremors were rest or action tremor, bilateral or unilateral, or of any particular kind. Notwithstanding the severe limitations of this “study,” Frank offers a conclusion that “[m]anganese from inhaled particles from welding fumes cause serious outcomes in welders. Welding fumes also cause other health effects. Workplace hygiene correlates with health outcome.”

Shortly after the Inhaled Particles conference, a colleague reported that Frank had testified in an asbestos case that he had submitted a “manganese in welding” manuscript to the Annals of Industrial Hygiene, which is the journal of the (BOHS). To date, no article by Frank has appeared in this, or any, journal.

Perhaps the only interesting aspect of this little cross-sectional study, based upon self-reported symptoms, is that workers employed by the communist state report a higher rate of symptoms than those employed by privately owned companies. More evidence that worker illness, if any there should be, is not necessarily the result of the “profit motive” of private corporations. The abstract also shows that at some scientific conferences, anything goes with respect to conflicts-of-interest disclosures and shameless advocacy.

Peer Review, PubPeer, PubChase, and Rule 702 – Candles in the Ear

August 28th, 2014

In deciding the Daubert case, the Supreme Court identified several factors to assess whether “the reasoning or methodology underlying the testimony is scientifically valid and of whether that reasoning or methodology properly can be applied to the facts in issue.” One of those factors was whether the proffered opinion had been “peer reviewed” and published. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593-94 (1993). The Court explained the publication factor:

“Another pertinent consideration is whether the theory or technique has been subjected to peer review and publication. Publication (which is but one element of peer review) is not a sine qua non of admissibility; it does not necessarily correlate with reliability, and in some instances well-grounded but innovative theories will not have been published. Some propositions, moreover, are too particular, too new, or of too limited interest to be published. But submission to the scrutiny of the scientific community is a component of ‘good science,’ in part because it increases the likelihood that substantive flaws in methodology will be detected. The fact of publication (or lack thereof) in a peer reviewed journal thus will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”

Daubert, 509 U.S. at 593-94 (internal citations omitted). See, e.g., Lust v. Merrell Dow Pharms., Inc., 89 F. 3d 594, 597 (9th Cir. 1996) (affirming exclusion of Dr. Alan Done, plaintiffs’ expert witness in Chlomid birth defects case, in part because of the lack of peer review and publication of his litigation-driven opinions); Hall v. Baxter Healthcare Corp., 947 F. Supp. 1387, 1406 (1996)  (noting that “the lack of peer review for [epidemiologist] Dr. Swan’s theories weighs heavily against the admissibility of Dr. Swan’s testimony”).

Case law since Daubert has made clear that peer review is neither necessary nor sufficient for the admissibility of an opinion. United States v. Mikos, 539 F.3d 706, 711 (7th Cir. 2008) (noting that the absence of peer-reviewed studies on subject of bullet grooving did not render opinion, based upon FBI database, inadmissible); In re Zoloft Prods. Liab. Litig. MDL No. 2342; 12-md-2342,  2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (excluding proffered testimony of epidemiologist Anick Bérard for arbitrarily selecting some point estimates and ignoring others in published studies).

As Susan Haack has noted, “peer review” has taken on mythic proportions in the adjudication of expert witness opinion admissibility.  Susan Haack, “Peer Review and Publication: Lessons for Lawyers,” 36 Stetson L. Rev. 789 (2007), republished in Susan Haack, Evidence Matters: Science, Proof, and Truth in the Law 156 (2014). Peer review, at best, is a weak proxy for the study validity, which is what is really needed in judicial proceedings. Proxies avoid the labor of independent, original thought, and so they are much favored by many judges.

In the past, some litigants oversold peer review as a touchstone of reliable, admissible expert witness testimony only to find that some very shoddy opinions show up in ostensibly peer-reviewed journals. SeeMisplaced Reliance On Peer Review to Separate Valid Science From Nonsense” (Aug. 14, 2011). Scientists often claim that science is “self-correcting,” but in some areas of research, there are few severe tests and little critical review, and mostly glib confirmations from acolytes.

Letters to the editor are sometimes held out as a remedy to peer-review screw ups, but such letters, which are not themselves peer reviewed, are subject to the whims of imperious editors who might wish to silence the views of those who would be critical of their judgment in publishing the article under discussion. Most journals have space only for a few letters, and unpopular but salient points of view can go unreported. Many scientists will not write letters to the editors, even when the published article is terribly wrong in its methods, data analyses, conclusions, or discussion.  Letters to the editor are often frowned upon in academic circles as not advancing affirmative research and scholarship agenda.

Letters to the editor often must be sent within a short time window of initial publication, often too short for busy academics to analyze a paper carefully and comment.  Furthermore, letters  and are often limited to a few hundred words, which length is often inadequate to develop a careful critique or exposition of the issues in the paper.  Moreover, such letters suffer from an additional procedural problem:  authors are permitted a response, and the letter writers are not permitted a reply. Authors thus get the last word, which they can often use to deflect or diffuse important criticisms.  The authors’ response can be sufficiently self-serving and misleading, with immunity from further criticism, that many would-be correspondents abandon the project altogether. See, e.g., PubPeer – “Example case showing why letters to the editor can be a waste of time” (Oct. 8, 2013).

Websites and blogs provide for dynamic content, with the potential for critical reviews that can be identified by search engines. See, e.g., Paul S. Brookes, “Our broken academic journal corrections system,” PSBLAB: Cardiac Mitochondrial Research in the Lab (Jan. 14, 2014). Mostly, the internet holds untapped potential for analysis, discussion, and debate on published studies.  To be sure, some journals provide “comment fields,” on their websites, with an opportunity for open discussion.  Often, full critiques must be developed and presented elsewhere. See, e.g., Androgen Study Group, “Letter to JAMA Asking for Retraction of Misleading Article on Testosterone Therapy” (Mar. 25, 2014).

PubPeer

Kate Yandell, in TheScientist, reports on the creation of PubPeer a few years ago, as a forum for post-publication review and discussion published scientific papers. Kate Yandell, “Concerns Raised Online Linger” (Aug. 25, 2014).  Billing itself as an “online journal club,” PubPeer has pointed out potentially serious problems, some of which have led to retractions and corrections. Another internet site of interest is PubChase, which monitors discussion of particular articles, as well as generating email alerts and recommendations for related articles.

One journal editor has taken notice and given notice that he will not pay attention to post-publication peer review.  Eric J. Murphy, the editor in chief of Lipids, posting a comment at PubPeer, illustrates that there will be a good deal of resistance to post-publication open peer review, out of the control of journal editors:

“As an Editor-in-Chief of a society journal, I have never examined PubPeer nor will I do so. First, there is the crowd or group mentality that may over emphasize some point in an irrational manner.  Just as using the marble theory of officiating is bad, one should never base a decision on the quantity of negative or positive comments. Second, if the concerned individual sent an e-mail or letter to me, then I would be duty bound to examine the issue.  It is not my duty to monitor PubPeer or any other such site, but rather to respond to queries sent to me.  So, with regards to Hugh’s point, I don’t support that position at all.

Mistakes happen, although frankly we try to limit these mistakes and do take steps to prevent publishing papers with FFP, it does happen.  Also, honest mistakes happen in science all the time, so[me] of these result in an erratum, while others go unnoticed by editors and reviewers.  In such a case, someone who does notice should contact the editor to put them on notice regarding the issue so that it may be resolved.  Resolution does not necessarily mean correction, but rather the editor taking a close look at the situation, discussing the situation with the original authors, and then reaching a decision.  Most of the time a correction will be made, but not always.”

Murphy’s comments are remarkable.  PubPeer provides a forum for post-publication comment, but it hardly requires editors, investigators, and consumers of scientific studies to evaluate published works by “nose counts” of favorable and unfavorable comments.  This is not, and never has been, a democratic enterprise.  Somehow, we might expect Murphy and others to evaluate the comments, on the merits, not on their prevalence.  Murphy’s declaration that he is duty-bound to investigate and evaluate letters or emails sent to him about published articles is encouraging, but the editors’ ability to ratify publication, in the face of a private communication, without comment to the scientific community, strips the community of making a principled decision on its own.  Murphy’s way, which seems largely the way of contemporary scientific publishing, ignores the important social dimension of scientific debate and resolution of issues.  Leaving control of the discussion in the hands of the editors who approved and published studies may be asking too much of editors. Nemo iudex in causa sua.

PubPeer has already tested the limits of free speech. Kate Yandell, “PubPeer Threatened with Legal Action” (Aug. 19, 2014). A scientist whose works were receiving unfavorable attention on PubPeer threatened a lawsuit.  Let’s hope that scientists can learn to be sufficiently thick skinned that there can be open discourse of the merits of their research, their data, and their conclusions.

Pritchard v. Dow Agro – Gatekeeping Exemplified

August 25th, 2014

Robert T. Pritchard was diagnosed with Non-Hodgkin’s Lymphoma (NHL) in August 2005; by fall 2005, his cancer was in remission. Mr. Pritchard had been a pesticide applicator, and so, of course, he and his wife sued the deepest pockets around, including Dow Agro Sciences, the manufacturer of Dursban. Pritchard v. Dow Agro Sciences, 705 F.Supp. 2d 471 (W.D.Pa. 2010).

The principal active ingredient of Dursban is chlorpyrifos, along with some solvents, such as xylene, cumene, and ethyltoluene. Id. at 474.  Dursban was licensed for household insecticide use until 2000, when the EPA phased out certain residential applications.  The EPA’s concern, however, was not carcinogenicity:  the EPA categorizes chlorpyrifos as “Group E,” non-carcinogenetic in humans. Id. at 474-75.

According to the American Cancer Society (ACS), the cause or causes of NHL cases are unknown.  Over 60,000 new cases are diagnosed annually, in people from all walks of life, occupations, and lifestyles. The ACS identifies some risk factors, such as age, gender, race, and ethnicity, but the ACS emphasizes that chemical exposures are not proven risk factors or causes of NHL.  See Pritchard, 705 F.Supp. 2d at 474.

The litigation industry does not need scientific conclusions of causal connections; their business is manufacturing certainty in courtrooms. Or at least, the appearance of certainty. The Pritchards found their way to the litigation industry in Pittsburgh, Pennsylvania, in the form of Goldberg, Persky & White, P.C. The Goldberg Persky firm sued Dow Agro, and then put the Pritchards in touch with Dr. Bennet Omalu, to serve as their expert witness.  A lawsuit ensued.

Alas, the Pritchards’ lawsuit ran into a wall, or at least a gate, in the form of Federal Rule of Evidence 702. In the capable hands of Judge Nora Barry Fischer, Rule 702 became an effective barrier against weak and poorly considered expert witness opinion testimony.

Dr. Omalu, no stranger to lost causes, was the medical examiner of San Joaquin County, California, at the time of his engagement in the Pritchard case. After careful consideration of the Pritchards’ claims, Omalu prepared a four page report, with a single citation, to Harrison’s Principles of Internal Medicine.  Id. at 477 & n.6.  This research, however, sufficed for Omalu to conclude that Dursban caused Mr. Pritchard to develop NHL, as well as a host of ailments he had never even sued Dow Agro for, including “neuropathy, fatigue, bipolar disorder, tremors, difficulty concentrating and liver disorder.” Id. at 478. Dr. Omalu did not cite or reference any studies, in his report, to support his opinion that Dursban caused Mr. Pritchard’s ailments.  Id. at 480.

After counsel objected to Omalu’s report, plaintiffs’ counsel supplemented the report with some published articles, including the “Lee” study.  See Won Jin Lee, Aaron Blair, Jane A. Hoppin, Jay H. Lubin, Jennifer A. Rusiecki, Dale P. Sandler, Mustafa Dosemeci, and Michael C. R. Alavanja, “Cancer Incidence Among Pesticide Applicators Exposed to Chlorpyrifos in the Agricultural Health Study,” 96 J. Nat’l Cancer Inst. 1781 (2004) [cited as Lee].  At his deposition, and in opposition to defendants’ 702 motion, Omalu became more forthcoming with actual data and argument.  According to Omalu, the Lee study “the 2004 Lee Study strongly supports a conclusion that high-level exposure to chlorpyrifos is associated with an increased risk of NHL.’’ Id. at 480.

This opinion put forward by Omalu bordered on scientific malpractice.  No; it was malpractice.  The Lee study looked at many different cancer end points, without adjustment for multiple comparisons.  The lack of adjustment means at the very least that any interpretation of p-values or confidence intervals would have to modified to acknowledge the higher rate of random error.  Now for NHL, the overall relative risk (RR) for chlorpyrifos exposure was 1.03, with a 95% confidence interval, 0.62 to 1.70.  Lee at 1783.  In other words, the study that Omalu claimed supported his opinion was about as null a study as can be, with reasonably tight confidence interval that made a doubling of the risk rather unlikely given the sample RR.

If the multiple endpoint testing were not sufficient to dissuade a scientist, intent on supporting the Pritchards’ claims, then the exposure subgroup analysis would have scared any prudent scientist away from supporting the plaintiffs’ claims.  The Lee study authors provided two different exposure-response analyses, one with lifetime exposure and the other with an intensity-weighted exposure, both in quartiles.  Neither analysis revealed an exposure-response trend.  For the lifetime exposure-response trend, the Lee study reported an NHL RR of 1.01, for the highest quartile of chloripyrifos exposure. For the intensity-weighted analysis, for the highest quartile, the authors reported RR = 1.61, with a 95% confidence interval, 0.74 to 3.53).

Although the defense and the district court did not call out Omalu on his fantasy statistical inference, the district judge certainly appreciated that Omalu had no statistically significant associations between chloripyrifos and NHL, to support his opinion. Given the weakness of relying upon a single epidemiologic study (and torturing the data therein), the district court believed that a showing of statistical significance was important to give some credibility to Omalu’s claims.  705 F.Supp. 2d at 486 (citing General Elec. Co. v. Joiner, 522 U.S. 136, 144-46 (1997);  Soldo v. Sandoz Pharm. Corp., 244 F.Supp. 2d 434, 449-50 (W.D. Pa. 2003)).

Figure 3 adapted from Lee

Figure 3 adapted from Lee

What to do when there is really no evidence supporting a claim?  Make up stuff.  Here is how the trial court describes Omalu’s declaration opposing exclusion:

 “Dr. Omalu interprets and recalculates the findings in the 2004 Lee Study, finding that ‘an 80% confidence interval for the highly-exposed applicators in the 2004 Lee Study spans a relative risk range for NHL from slightly above 1.0 to slightly above 2.5.’ Dr. Omalu concludes that ‘this means that there is a 90% probability that the relative risk within the population studied is greater than 1.0’.”

705 F.Supp. 2d at 481 (internal citations omitted); see also id. at 488. The calculations and the rationale for an 80% confidence interval were not provided, but plaintiffs’ counsel assured Judge Fischer at oral argument that the calculation was done using high school math. Id. at 481 n.12. Judge Fischer seemed unimpressed, especially given that there was no record of the calculation.  Id. at 481, 488.

The larger offense, however, was that Omalu’s interpretation of the 80% confidence interval as a probability statement of the true relative risk’s exceeding 1.0, was bogus. Dr. Omalu further displayed his lack of statistical competence when he attempted to defend his posterior probability derived from his 80% confidence interval by referring to a power calculation of a different disease in the Lee study:

“He [Omalu] further declares that ‘‘the authors of the 2004 Lee Study themselves endorse the probative value of a finding of elevated risk with less than a 95% confidence level when they point out that ‘this analysis had a 90% statistical power to detect a 1.5–fold increase in lung cancer incidence’.”

Id. at 488 (court’s quoting of Omalu’s quoting from the Lee study). To quote Wolfgang Pauli, Omalu is so far off that he is “not even wrong.” Lee and colleagues were offering a pre-study power calculation, which they used to justify their looking at the cohort for lung cancer, not NHL, outcomes.  Lee at 1787. The power calculation does not apply to the data observed for lung cancer; and the calculation has absolutely nothing to do with NHL. The power calculation certainly has nothing to do with Omalu’s misguided attempt to offer a calculation of a posterior probability for NHL based upon a subgroup confidence interval.

Given that there were epidemiologic studies available, Judge Fischer noted that expert witnesses were obligated to factor such studies into their opinions. See 705 F.Supp. 2d at 483 (citing Soldo, 244 F.Supp. 2d at 532).  Omalu sins against Rule 702 included his failure to consider any studies other than the Lee study, regardless of how unsupportive the Lee study was of his opinion.  The defense experts pointed to several studies that found lower NHL rates among exposed workers than among controls, and Omalu completely failed to consider and to explain his opinion in the face of the contradictory evidence.  See 705 F.Supp. 2d at 485 (citing Perry v. Novartis Pharm. Corp. 564 F.Supp. 2d 452, 465 (E.D. Pa. 2008)). In other words, Omalu was shown to have been a cherry picker. Id. at 489.

In addition to the abridged epidemiology, Omalu relied upon an analogy between the ethyl-toluene and other solvents that contained benzene rings and benzene itself to argue that these chemicals, supposedly like benzene, cause NHL.  Id. at 487. The analogy was never supported by any citations to published studies, and, of course, the analogy is seriously flawed. Many chemicals, including chemicals made and used by the human body, have benzene rings, without the slightest propensity to cause NHL.  Indeed, the evidence that benzene itself causes NHL is weak and inconsistent.  See, e.g., Knight v. Kirby Inland Marine Inc., 482 F.3d 347 (2007) (affirming the exclusion of Dr. B.S. Levy in a case involving benzene exposure and NHL).

Looking at all the evidence, Judge Fischer found Omalu’s general causation opinions unreliable.  Relying upon a single, statistically non-significant epidemiologic study (Lee), while ignoring contrary studies, was not sound science.  It was not even science; it was courtroom rhetoric.

Omalu’s approach to specific causation, the identification of what caused Mr. Pritchard’s NHL, was equally spurious. Omalu purportedly conducted a “differential diagnosis” or a “differential etiology,” but he never examined Mr. Pritchard; nor did he conduct a thorough evaluation of Mr. Pritchard’s medical records. 705 F.Supp. 2d at 491. Judge Fischer found that Omalu had not conducted a thorough differential diagnosis, and that he had made no attempt to rule out idiopathic or unknown causes of NHL, despite the general absence of known causes of NHL. Id. at 492. The one study identified by Omalu reported a non-statistically significant 60% increase in NHL risk, for a subgroup in one of two different exposure-response analyses.  Although Judge Fischer treated the relative risk less than two as a non-dispositive factor in her decision, she recognized that

“The threshold for concluding that an agent was more likely than not the cause of an individual’s disease is a relative risk greater than 2.0… . When the relative risk reaches 2.0, the agent is responsible for an equal number of cases of disease as all other background causes. Thus, a relative risk of 2.0 … implies a 50% likelihood that an exposed individual’s disease was caused by the agent. A relative risk greater than 2.0 would permit an inference that an individual plaintiff’s disease was more likely than not caused by the implicated agent.”

Id. at 485-86 (quoting from Reference Manual on Scientific Evidence at 384 (2d ed. 2000)).

Left with nowhere to run, plaintiffs’ counsel swung for the bleachers by arguing that the federal court, sitting in diversity, was required to apply Pennsylvania law of evidence because the standards of Rule 702 constitute “substantive,” not procedural law. The argument, which had been previously rejected within the Third Circuit, was as legally persuasive as Omalu’s scientific opinions.  Judge Fischer excluded Omalu’s proffered opinions and granted summary judgment to the defendants. The Third Circuit affirmed in a per curiam decision. 430 Fed. Appx. 102, 2011 WL 2160456 (3d Cir. 2011).

Practical Evaluation of Scientific Claims

The evaluative process that took place in the Pritchard case missed some important details and some howlers committed by Dr. Omalu, but it was more than good enough for government work. The gatekeeping decision in Pritchard was nonetheless the target of criticism in a recent book.

Kristin Shrader-Frechette (S-F) is a professor of science who wants to teach us how to expose bad science. S-F has published, or will soon publish, a book that suggests that philosophy of science can help us expose “bad science.”  See Kristin Shrader-Frechette, Tainted: How Philosophy of Science Can Expose Bad Science (Oxford U.P. 2014)[cited below at Tainted; selections available on Google books]. S-F’s claim is intriguing, as is her move away from the demarcation problem to the difficult business of evaluation and synthesis of scientific claims.

In her introduction, S-F tells us that her book shows “how practical philosophy of science” can counteract biased studies done to promote special interests and PROFITS.  Tainted at 8. Refreshingly, S-F identifies special-interest science, done for profit, as including “individuals, industries, environmentalists, labor unions, or universities.” Id. The remainder of the book, however, appears to be a jeremiad against industry, with a blind eye towards the litigation industry (plaintiffs’ bar) and environmental zealots.

The book promises to address “public concerns” in practical, jargon-free prose. Id. at 9-10. Some of the aims of the book are to provide support for “rejecting demands for only human evidence to support hypotheses about human biology (chapter 3), avoiding using statistical-significance tests with observational data (chapter 12), and challenging use of pure-science default rules for scientific uncertainty when one is doing welfare-affecting science (chapter 14).”

Id. at 10. Hmmm.  Avoiding statistical significance tests for observational data?!?  If avoided, what does S-F hope to use to assess random error?

And then S-F refers to plaintiffs’ hired expert witness (from the Milward case), Carl Cranor, as providing “groundbreaking evaluations of causal inferences [that] have helped to improve courtroom verdicts about legal liability that otherwise put victims at risk.” Id. at 7. Whether someone is a “victim” and has been “at risk” turns on assessing causality. Cranor is not a scientist, and his philosophy of science turns of “weight of the evidence” (WOE), a subjective, speculative approach that is deaf, dumb, and blind to scientific validity.

There are other “teasers,” in the introduction to Tainted.  S-F advertises that her Chapter 5 will teach us that “[c]ontrary to popular belief, animal and not human data often provide superior evidence for human-biological hypotheses.”  Tainted at 11. Chapter 6 will show that“[c]ontrary to many physicists’ claims, there is no threshold for harm from exposure to ionizing radiation.” Id.  S-F tells us that her Chapter 7 will criticize “a common but questionable way of discovering hypotheses in epidemiology and medicine—looking at the magnitude of some effect in order to discover causes. The chapter shows instead that the likelihood, not the magnitude, of an effect is the better key to causal discovery.” Id. at 13. Discovering hypotheses — what is that about? You might have thought that hypotheses were framed from observations and then tested.

Which brings us to the trailer for Chapter 8, in which S-F promises to show that “[c]ontrary to standard statistical and medical practice, statistical-significance tests are not causally necessary to show medical and legal evidence of some effect.” Tainted at 11. Again, the teaser raises lots of questions such as what could S-F possibly mean when she says statistical tests are not causally necessary to show an effect.  Later in the introduction, S-F says that her chapter on statistics “evaluates the well-known statistical-significance rule for discovering hypotheses and shows that because scientists routinely misuse this rule, they can miss discovering important causal hypotheses. Id. at 13. Discovering causal hypotheses is not what courts and regulators must worry about; their task is to establish such hypotheses with sufficient, valid evidence.

Paging through the book reveals that a rhetoric that is thick and unremitting, with little philosophy of science or meaningful advice on how to evaluate scientific studies.  The statistics chapter calls out, and lo, it features a discussion of the Pritchard case. See Tainted, Chapter 8, “Why Statistics Is Slippery: Easy Algorithms Fail in Biology.”

The chapter opens with an account of German scientist Fritz Haber’s development of organophosphate pesticides, and the Nazis use of related compounds as chemical weapons.  Tainted at 99. Then, in a fevered non-sequitur and rhetorical flourish, S-F states, with righteous indignation, that although the Nazi researchers “clearly understood the causal-neurotoxic effects of organophosphate pesticides and nerve gas,” chemical companies today “claim that the causal-carcinogenic effects of these pesticides are controversial.” Is S-F saying that a chemical that is neurotoxic must be carcinogenic for every kind of human cancer?  So it seems.

Consider the Pritchard case.  Really, the Pritchard case?  Yup; S-F holds up the Pritchard case as her exemplar of what is wrong with civil adjudication of scientific claims.  Despite the promise of jargon-free language, S-F launches into a discussion of how the judges in Pritchard assumed that statistical significance was necessary “to hypothesize causal harm.”  Tainted at 100. In this vein, S-F tells us that she will show that:

“the statistical-significance rule is not a legitimate requirement for discovering causal hypotheses.”

Id. Again, the reader is left to puzzle why statistical significance is discussed in the context of hypothesis discovery, whatever that may be, as opposed to hypothesis testing or confirmation. And whatever it may be, we are warned that “unless the [statistical significance] rule is rejected as necessary for hypothesis-discovery, it will likely lead to false causal claims, questionable scientific theories, and massive harm to innocent victims like Robert Pritchard.”

Id. S-F is decidedly not adverting to Mr. Pritichard’s victimization by the litigation industry and the likes of Dr. Omalu, although she should. S-F not only believes that the judges in Pritchard bungled their gatekeeping wrong, she knows that Dr. Omalu was correct, and the defense experts wrong, and that Pritchard was a victim of Dursban and of questionable scientific theories that were used to embarrass Omalu and his opinions.

S-F promised to teach her readers how to evaluate scientific claims and detect “tainted” science, but all she delivers here is an ipse dixit.  There is no discussion of the actual measurements, extent of random error, or threats to validity, for studies cited either by the plaintiffs or the defendants in Pritchard.  To be sure, S-F cites the Lee study in her endnotes, but she never provides any meaningful discussion of that study or any other that has any bearing on chlorpyrifos and NHL.  S-F also cited two review articles, the first of which provides no support for her ipse dixit:

“Although mutagenicity and chronic animal bioassays for carcinogenicity of chlorpyrifos were largely negative, a recent epidemiological study of pesticide applicators reported a significant exposure response trend between chlorpyrifos use and lung and rectal cancer. However, the positive association was based on small numbers of cases, i.e., for rectal cancer an excess of less than 10 cases in the 2 highest exposure groups. The lack of precision due to the small number of observations and uncertainty about actual levels of exposure warrants caution in concluding that the observed statistical association is consistent with a causal association. This association would need to be observed in more than one study before concluding that the association between lung or rectal cancer and chlorpyrifos was consistent with a causal relationship.

There is no evidence that chlorpyrifos is hepatotoxic, nephrotoxic, or immunotoxic at doses less than those that cause frank cholinesterase poisoning.”

David L. Eaton, Robert B. Daroff, Herman Autrup, James Bridges, Patricia Buffler, Lucio G. Costa, Joseph Coyle, Guy McKhann, William C. Mobley, Lynn Nadel, Diether Neubert, Rolf Schulte-Hermann, and Peter S. Spencer, “Review of the Toxicology of Chlorpyrifos With an Emphasis on Human Exposure and Neurodevelopment,” 38 Critical Reviews in Toxicology 1, 5-6(2008).

The second cited review article was written by clinical ecology zealot[1], William J. Rea. William J. Rea, “Pesticides,” 6 Journal of Nutritional and Environmental Medicine 55 (1996). Rea’s article does not appear in Pubmed.

Shrader-Frechette’s Criticisms of Statistical Significance Testing

What is the statistical significance against which S-F rails? She offers several definitions, none of which is correct or consistent with the others.

“The statistical-significance level p is defined as the probability of the observed data, given that the null hypothesis is true.8

Tainted at 101 (citing D. H. Johnson, “What Hypothesis Tests Are Not,” 16 Behavioral Ecology 325 (2004). Well not quite; attained significance probability is the probability of data observed or those more extreme, given the null hypothesis.  A Tainted definition.

Later in Chapter 8, S-F discusses significance probability in a way that overtly commits the transposition fallacy, not a good thing to do in a book that sets out to teach how to evaluate scientific evidence:

“However, typically scientists view statistical significance as a measure of how confidently one might reject the null hypothesis. Traditionally they have used a 0.05 statistical-significance level, p < or = 0.05, and have viewed the probability of a false-positive (incorrectly rejecting a true null hypothesis), or type-1, error as 5 percent. Thus they assume that some finding is statistically significant and provides grounds for rejecting the null if it has at least a 95-percent probability of not being due to chance.

Tainted at 101. Not only does the last sentence ignore the extent of error due to bias or confounding, it erroneously assigns a posterior probability that is the complement of the significance probability.  This error is not an isolated occurrence; here is another example:

“Thus, when scientists used the rule to examine the effectiveness of St. John’s Wort in relieving depression,14 or when they employed it to examine the efficacy of flutamide to treat prostate cancer,15 they concluded the treatments were ineffective because they were not statistically significant at the 0.05 level. Only at p < or = 0.14 were the results statistically significant. They had an 86-percent chance of not being due to chance.16

Tainted at 101-02 (citing papers by Shelton (endnote 14)[2], by Eisenberger (endnote 15) [3], and Rothman’s text (endnote 16)[4]). Although Ken Rothman has criticized the use of statistical significance tests, his book surely does not interpret a p-value of 0.14 as an 86% chance that the results were not due to chance.

Although S-F previous stated that statistical significance is interpreted as the probability that the null is true, she actually goes on to correct the mistake, sort of:

“Requiring the statistical-significance rule for hypothesis-development also is arbitrary in presupposing a nonsensical distinction between a significant finding if p = 0.049, but a nonsignificant finding if p = 0. 051.26 Besides, even when one uses a 90-percent (p < or = 0.10), an 85-percent (p < or = 0.15), or some other confidence level, it still may not include the null point. If not, these other p values also show the data are consistent with an effect. Statistical-significance proponents thus forget that both confidence levels and p values are measures of consistency between the data and the null hypothesis, not measures of the probability that the null is true. When results do not satisfy the rule, this means merely that the null cannot be rejected, not that the null is true.”

Tainted at 103.

S-F’s repeats some criticisms of significance testing, most of which involve their own misunderstandings of the concept.  It hardly suffices to argue that evaluating the magnitude of random error is worthless because it does not measure the extent of bias and confounding.  The flaw lies in those who would interpret the p-value as the sole measure of error involved in a measurement.

S-F takes the criticisms of significance probability to be sufficient to justify an alternative approach: evaluating causal hypotheses “on a preponderance of evidence,47 whether effects are more likely than not.”[5] Here citations, however, do not support the notion that an overall assessment of the causal hypothesis is a true alternative of statistical testing, but rather only a later step in the causal assessment, which presupposes the previous elimination of random variability in the observed associations.

S-F compounds her confusion by claiming that this purported alternative is superior to significance testing or any evaluation of random variability, and by noting that juries in civil cases must decide causal claims on the preponderance of the evidence, not on attained significance probabilities:

“In welfare-affecting areas of science, a preponderance-of-evidence rule often is better than a statistical-significance rule because it could take account of evidence based on underlying mechanisms and theoretical support, even if evidence did not satisfy statistical significance. After all, even in US civil law, juries need not be 95 percent certain of a verdict, but only sure that a verdict is more likely than not. Another reason for requiring the preponderance-of-evidence rule, for welfare-related hypothesis development, is that statistical data often are difficult or expensive to obtain, for example, because of large sample-size requirements. Such difficulties limit statistical-significance applicability. ”

Tainted at 105-06. S-F’s assertion that juries need not have 95% certainty in their verdict is either a misunderstanding or a misrepresentation of the meaning of a confidence interval, and a conflation of two very kinds of probability or certainty.  S-F invites a reading that commits the transposition fallacy by confusing the probability involved in a confidence interval with that involved in a posterior probability.  S-F’s claim that sample size requirements often limit the ability to use statistical significance evaluations is obviously highly contingent upon the facts of case, but in civil cases, such as Pritchard, this limitation is rarely at play.  Of course, if the sample size is too small to evaluate the role of chance, then a scientist should probably declare the evidence too fragile to support a causal conclusion.

S-F also postulates that that a posterior probability rather than a significance probability approach would “better counteract conflicts of interest that sometimes cause scientists to pay inadequate attention to public-welfare consequences of their work.” Tainted at 106. This claim is a remarkable assertion, which is not supported by any empirical evidence.  The varieties of evidence that go into an overall assessment of a causal hypothesis are often quantitatively incommensurate.  The so-called preponderance-of-the-evidence described by S-F is often little more than a subjective overall assessment of weight of the evidence.  The approving citations to the work of Carl Cranor support interpreting S-F to endorse this subjective, anything-goes approach to weight of the evidence.  As for WOE eliminating inadequate attention to “public welfare,” S-F’s citations actually suggest the opposite. S-F’s citations to the 1961 reviews by Wynder and by Little illustrate how subjective narrative reviews can be, with diametrically opposed results.  Rather than curbing conflicts of interest, these subjective, narrative reviews illustrate how contrary results may be obtained by the failure to pre-specify criteria of validity, and inclusion and exclusion of admissible evidence. Still, S-F asserts that “up to 80 percent of welfare-related statistical studies have false-negative or type-II errors, failing to reject a false null.” Tainted at 106. The support for this assertion is a citation to a review article by David Resnik. See David Resnik, “Statistics, Ethics, and Research: An Agenda for Education and Reform,” 8 Accountability in Research 163, 183 (2000). Resnik’s paper is a review article, not an empirical study, but at the page cited by S-F, Resnik in turn cites to well-known papers that present actual data:

“There is also evidence that many of the errors and biases in research are related to the misuses of statistics. For example, Williams et al. (1997) found that 80% of articles surveyed that used t-tests contained at least one test with a type II error. Freiman et al. (1978)  * * *  However, empirical research on statistical errors in science is scarce, and more work needs to be done in this area.”

Id. The papers cited by Resnik, Williams (1997)[6] and Freiman (1978)[7] did identify previously published studies that over-interpreted statistically non-significant results, but the identified type-II errors were potential errors, not ascertained errors, because the authors made no claim that every non-statistically significant result actually represented a missed true association. In other words, S-F is not entitled to say that these empirical reviews actually identified failures to reject fall null hypotheses. Furthermore, the empirical analyses in the studies cited by Resnik, who was in turn cited by S-F, did not look at correlations between alleged conflicts of interest and statistical errors. The cited research calls for greater attention to proper interpretation of statistical tests, not for their abandonment.

In the end, at least in the chapter on statistics, S-F fails to deliver much if anything on her promise to show how to evaluate science from a philosophic perspective.  Her discussion of the Pritchard case is not an analysis; it is a harangue. There are certainly more readable, accessible, scholarly, and accurate treatments of the scientific and statistical issues in this book.  See, e.g., Michael B. Bracken, Risk, Chance, and Causation: Investigating the Origins and Treatment of Disease (2013).


[1] Not to be confused with the deceased federal judge by the same name, William J. Rea. William J. Rea, 1 Chemical Sensitivity – Principles and Mechanisms (1992); 2 Chemical Sensitivity – Sources of Total Body Load (1994),  3 Chemical Sensitivity – Clinical Manifestation of Pollutant Overload (1996), 4 Chemical Sensitivity – Tools of Diagnosis and Methods of Treatment (1998).

[2] R. C. Shelton, M. B. Keller, et al., “Effectiveness of St. John’s Wort in Major Depression,” 285 Journal of the American Medical Association 1978 (2001).

[3] M. A. Eisenberger, B. A. Blumenstein, et al., “Bilateral Orchiectomy With or Without Flutamide for Metastic [sic] Prostate Cancer,” 339 New England Journal of Medicine 1036 (1998).

[4] Kenneth J. Rothman, Epidemiology 123–127 (NY 2002).

[5] Endnote 47 references the following papers: E. Hammond, “Cause and Effect,” in E. Wynder, ed., The Biologic Effects of Tobacco 193–194 (Boston 1955); E. L. Wynder, “An Appraisal of the Smoking-Lung-Cancer Issue,”264  New England Journal of Medicine 1235 (1961); see C. Little, “Some Phases of the Problem of Smoking and Lung Cancer,” 264 New England Journal of Medicine 1241 (1961); J. R. Stutzman, C. A. Luongo, and S. A McLuckey, “Covalent and Non-Covalent Binding in the Ion/Ion Charge Inversion of Peptide Cations with Benzene-Disulfonic Acid Anions,” 47 Journal of Mass Spectrometry 669 (2012). Although the paper on ionic charges of peptide cations is unfamiliar, the other papers do not eschew traditional statistical significance testing techniques. By the time these early (1961) reviews were written, the association that was reported between smoking and lung cancer was clearly accepted as not likely explained by chance.  Discussion focused upon bias and potential confounding in the available studies, and the lack of animal evidence for the causal claim.

[6] J. L. Williams, C. A. Hathaway, K. L. Kloster, and B. H. Layne, “Low power, type II errors, and other statistical problems in recent cardiovascular research,” 42 Am. J. Physiology Heart & Circulation Physiology H487 (1997).

[7] Jennie A. Freiman, Thomas C. Chalmers, Harry Smith and Roy R. Kuebler, “The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 ‛negative’ trials,” 299 New Engl. J. Med. 690 (1978).