TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Reference Manual on Scientific Evidence – 3rd Edition is Past Its Expiry

October 17th, 2021

INTRODUCTION

The new, third edition of the Reference Manual on Scientific Evidence was released to the public in September 2011, as a joint production of the National Academies of Science, and the Federal Judicial Center. Within a year of its publication, I wrote that the Manual needed attention on several key issues. Now that there is a committee working on the fourth edition, I am reprising the critique, slightly modified, in the hope that it may make a difference for the fourth edition.

The Development Committee for the third edition included Co-Chairs, Professor Jerome Kassirer, of Tufts University School of Medicine, and the Hon. Gladys Kessler, who sits on the District Court for the District of Columbia.  The members of the Development Committee included:

  • Ming W. Chin, Associate Justice, The Supreme Court of California
  • Pauline Newman, Judge, Court of Appeals for the Federal Circuit
  • Kathleen O’Malley, Judge, Court of Appeals for the Federal Circuit (formerly a district judge on the Northern District of Ohio)
  • Jed S. Rakoff, Judge, Southern District of New York
  • Channing Robertson, Professor of Engineering, Stanford University
  • Joseph V. Rodricks, Principal, Environ
  • Allen Wilcox, Senior Investigator, Institute of Environmental Health Sciences
  • Sandy L. Zabell, Professor of Statistics and Mathematics, Weinberg College of Arts and Sciences, Northwestern University

Joe S. Cecil, Project Director, Program on Scientific and Technical Evidence, in the Federal Judicial Center’s Division of Research, who shepherded the first two editions, served as consultant to the Committee.

With over 1,000 pages, there was much to digest in the third edition of the Reference Manual on Scientific Evidence (RMSE 3d).  Much of what is covered was solid information on the individual scientific and technical disciplines covered.  Although the information is easily available from other sources, there is some value in collecting the material in a single volume for the convenience of judges and lawyers.  Of course, given that this information is provided to judges from an ostensibly neutral, credible source, lawyers will naturally focus on what is doubtful or controversial in the RMSE. To date, there have been only a few reviews and acknowledgments of the new edition.[1]

Like previous editions, the substantive scientific areas were covered in discrete chapters, written by subject matter specialists, often along with a lawyer who addresses the legal implications and judicial treatment of that subject matter.  From my perspective, the chapters on statistics, epidemiology, and toxicology were the most important in my practice and in teaching, and I have focused on issues raised by these chapters.

The strengths of the chapter on statistical evidence, updated from the second edition, remained, as did some of the strengths and flaws of the chapter on epidemiology.  In addition, there was a good deal of overlap among the chapters on statistics, epidemiology, and medical testimony.  This overlap was at first blush troubling because the RMSE has the potential to confuse and obscure issues by having multiple authors address them inconsistently.  This is an area where reviewers of the upcoming edition should pay close attention.

I. Reference Manual’s Disregard of Study Validity in Favor of the “Whole Tsumish”

There was a deep discordance among the chapters in the third Reference Manual as to how judges should approach scientific gatekeeping issues. The third edition vacillated between encouraging judges to look at scientific validity, and discouraging them from any meaningful analysis by emphasizing inaccurate proxies for validity, such as conflicts of interest.[2]

The Third Edition featured an updated version of the late Professor Margaret Berger’s chapter from the second edition, “The Admissibility of Expert Testimony.”[3]  Berger’s chapter criticized “atomization,” a process she describes pejoratively as a “slicing-and-dicing” approach.[4]  Drawing on the publications of Daubert-critic Susan Haack, Berger rejected the notion that courts should examine the reliability of each study independently.[5]  Berger contended that the “proper” scientific method, as evidenced by works of the International Agency for Research on Cancer, the Institute of Medicine, the National Institute of Health, the National Research Council, and the National Institute for Environmental Health Sciences, “is to consider all the relevant available scientific evidence, taken as a whole, to determine which conclusion or hypothesis regarding a causal claim is best supported by the body of evidence.”[6]

Berger’s contention, however, was profoundly misleading.  Of course, scientists undertaking a systematic review should identify all the relevant studies, but some of the “relevant” studies may well be insufficiently reliable (because of internal or external validity issues) to answer the research question at hand. All the cited agencies, and other research organizations and researchers, exclude studies that are fundamentally flawed, whether as a result of bias, confounding, erroneous data analyses, or related problems.  Berger cited no support for her remarkable suggestion that scientists do not make “reliability” judgments about available studies when assessing the “totality of the evidence.”

Professor Berger, who had a distinguished career as a law professor and evidence scholar, died in November 2010.  She was no friend of Daubert,[7] but remarkably her antipathy had outlived her.  Berger’s critical discussion of “atomization” cited the notorious decision in Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11, 26 (1st Cir. 2011), which was decided four months after her passing.[8]

Professor Berger’s contention about the need to avoid assessments of individual studies in favor of the whole “tsumish” must also be rejected because Federal Rule of Evidence 703 requires that each study considered by an expert witness “qualify” for reasonable reliance by virtue of the study’s containing facts or data that are “of a type reasonably relied upon by experts in the particular field forming opinions or inferences upon the subject.”  One of the deeply troubling aspects of the Milward decision is that it reversed the trial court’s sensible decision to exclude a toxicologist, Dr. Martyn Smith, who outran his headlights on issues having to do with a field in which he was clearly inexperienced – epidemiology.

Scientific studies, and especially epidemiologic studies, involve multiple levels of hearsay.  A typical epidemiologic study may contain hearsay leaps from patient to clinician, to laboratory technicians, to specialists interpreting test results, back to the clinician for a diagnosis, to a nosologist for disease coding, to a national or hospital database, to a researcher querying the database, to a statistician analyzing the data, to a manuscript that details data, analyses, and results, to editors and peer reviewers, back to study authors, and on to publication.  Those leaps do not mean that the final results are untrustworthy, only that the study itself is not likely admissible in evidence.

The inadmissibility of scientific studies is not problematic because Rule 703 permits testifying expert witnesses to formulate opinions based upon facts and data, which are not independently admissible in evidence. The distinction between relied upon and admissible studies is codified in the Federal Rules of Evidence, and in virtually every state’s evidence law.

Referring to studies, without qualification, as admissible in themselves is usually wrong as a matter of evidence law.  The error has the potential to encourage carelessness in gatekeeping expert witnesses’ opinions for their reliance upon inadmissible studies.  The error is doubly wrong if this approach to expert witness gatekeeping is taken as license to permit expert witnesses to rely upon any marginally relevant study of their choosing.  It is therefore disconcerting that the RMSE 3d failed to make the appropriate distinction between admissibility of studies and admissibility of expert witness opinion that has reasonably relied upon appropriate studies.

Consider the following statement from the chapter on epidemiology:

“An epidemiologic study that is sufficiently rigorous to justify a conclusion that it is scientifically valid should be admissible, as it tends to make an issue in dispute more or less likely.”[9]

Curiously, the advice from the authors of the epidemiology chapter, by speaking to a single study’s validity, was at odds with Professor Berger’s caution against slicing and dicing. The authors of the epidemiology chapter seemed to be stressing that scientifically valid studies should be admissible.  Their footnote emphasized and confused the point:

See DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 958 (3d Cir. 1990); cf. Kehm v. Procter & Gamble Co., 580 F. Supp. 890, 902 (N.D. Iowa 1982) (“These [epidemiologic] studies were highly probative on the issue of causation—they all concluded that an association between tampon use and menstrually related TSS [toxic shock syndrome] cases exists.”), aff’d, 724 F.2d 613 (8th Cir. 1984). Hearsay concerns may limit the independent admissibility of the study, but the study could be relied on by an expert in forming an opinion and may be admissible pursuant to Fed. R. Evid. 703 as part of the underlying facts or data relied on by the expert. In Ellis v. International Playtex, Inc., 745 F.2d 292, 303 (4th Cir. 1984), the court concluded that certain epidemiologic studies were admissible despite criticism of the methodology used in the studies. The court held that the claims of bias went to the studies’ weight rather than their admissibility. Cf. Christophersen v. Allied-Signal Corp., 939 F.2d 1106, 1109 (5th Cir. 1991) (“As a general rule, questions relating to the bases and sources of an expert’s opinion affect the weight to be assigned that opinion rather than its admissibility. . . .”).”[10]

This footnote, however, that studies relied upon by an expert in forming an opinion may be admissible pursuant to Rule 703, was unsupported by and contrary to Rule 703 and the overwhelming weight of case law interpreting and applying the rule.[11] The citation to a pre-Daubert decision, Christophersen, was doubtful as a legal argument, and managed to engender much confusion

Furthermore, Kehm and Ellis, the cases cited in this footnote by the authors of the epidemiology chapter, both involved “factual findings” in public investigative or evaluative reports, which were independently admissible under Federal Rule of Evidence 803(8)(C). See Ellis, 745 F.2d at 299-303; Kehm, 724 F.2d at 617-18.  As such, the cases hardly support the chapter’s suggestion that Rule 703 is a rule of admissibility for epidemiologic studies.

Here the RMSE 3d, in one sentence, confused Rule 703 with an exception to the rule against hearsay, which would prevent the statistically based epidemiologic studies from being received in evidence.  The point is reasonably clear, however, that the studies “may be offered” in testimony to explain an expert witness’s opinion. Under Rule 705, that offer may also be refused. The offer, however, is to “explain,” not to have the studies admitted in evidence.  The RMSE 3d was certainly not alone in advancing this notion that studies are themselves admissible.  Other well-respected evidence scholars have lapsed into this error.[12]

Evidence scholars should not conflate admissibility of the epidemiologic (or other) studies with the ability of an expert witness to advert to a study to explain his or her opinion.  The testifying expert witness really should not be allowed to become a conduit for off-hand comments and opinions in the introduction or discussion section of relied upon articles, and the wholesale admission of such hearsay opinions undermines the trial court’s control over opinion evidence.  Rule 703 authorizes reasonable reliance upon “facts and data,” not every opinion that creeps into the published literature.

II. Toxicology for Judges

The toxicology chapter, “Reference Guide on Toxicology,” in RMSE 3d was written by Professor Bernard D. Goldstein, of the University of Pittsburgh Graduate School of Public Health, and Mary Sue Henifin, a partner in the Princeton, New Jersey office of Buchanan Ingersoll, P.C.

  1. Conflicts of Interest

At the question and answer session of the Reference Manual’s public release ceremony, in September 2011, one gentleman rose to note that some of the authors were lawyers with big firm affiliations, which he supposed must mean that they represent mostly defendants.  Based upon his premise, he asked what the review committee had done to ensure that conflicts of interest did not skew or distort the discussions in the affected chapters.  Dr. Kassirer and Judge Kessler responded by pointing out that the chapters were peer reviewed by outside reviewers, and reviewed by members of the supervising review committee.  The questioner seemed reassured, but now that I have looked at the toxicology chapter, I am not so sure.

The questioner’s premise that a member of a large firm will represent mostly defendants and thus have a pro-defense bias was probably a common perception among unsophisticated lay observers.  For instance, some large firms represent insurance companies intent upon denying coverage to product manufacturers.  These counsel for insurance companies often take the plaintiffs’ side of the underlying disputed issue in order to claim an exclusion to the contract of insurance, under a claim that the harm was “expected or intended.”  Similarly, the common perception ignores the reality of lawyers’ true conflict:  although gatekeeping helps the defense lawyers’ clients, it takes away legal work from firms that represent defendants in the litigations that are pretermitted by effective judicial gatekeeping.  Erosion of gatekeeping concepts, however, inures to the benefit of plaintiffs, their counsel, as well as the expert witnesses engaged on behalf of plaintiffs in litigation.

The questioner’s supposition in the case of the toxicology chapter, however, is doubly flawed.  If he had known more about the authors, he would probably not have asked his question.  First, the lawyer author, Ms. Henifin, despite her large firm affiliation, has taken some aggressive positions contrary to the interests of manufacturers.[13]  As for the scientist author of the toxicology chapter, Professor Goldstein, the casual reader of the chapter may want to know that he has testified in any number of toxic tort cases, almost invariably on the plaintiffs’ side.  Unlike the defense lawyer, who loses business revenue, when courts shut down unreliable claims, plaintiffs’ testifying or consulting expert witnesses stand to gain by minimalist expert witness opinion gatekeeping.  Given the economic asymmetries, the reader must thus want to know that Professor Goldstein was excluded as an expert witness in some high-profile toxic tort cases.[14]  There do not appear to be any disclosures of Professor Goldstein’s (or any other scientist author’s) conflicts of interests in RMSE 3d.  Having pointed out this conflict, I would note that financial conflicts of interest are nothing really compared with ideological conflicts of interest, which often propel scientists into service as expert witnesses to advance their political agenda.

  1. Hormesis

One way that ideological conflicts might be revealed is to look for imbalances in the presentation of toxicologic concepts.  Most lawyers who litigate cases that involve exposure-response issues are familiar with the “linear no threshold” (LNT) concept that is used frequently in regulatory risk assessments, and which has metastasized to toxic tort litigation, where LNT often has no proper place.

LNT is a dubious assumption because it claims to “know” the dose response at very low exposure levels in the absence of data.  There is a thin plausibility for LNT for genotoxic chemicals claimed to be carcinogens, but even that plausibility evaporates when one realizes that there are DNA defense and repair mechanisms to genotoxicity, which must first be saturated, overwhelmed, or inhibited, before there can be a carcinogenic response. The upshot is that low exposures that do not swamp DNA repair and tumor suppression proteins will not cause cancer.

Hormesis is today an accepted concept that describes a dose-response relationship that shows a benefit at low doses, but harm at high doses. The toxicology chapter in the Reference Manual has several references to LNT but none to hormesis.  That font of all knowledge, Wikipedia reports that hormesis is controversial, but so is LNT.  This is the sort of imbalance that may well reflect an ideological bias.

One of the leading textbooks on toxicology describes hormesis[15]:

“There is considerable evidence to suggest that some non-nutritional toxic substances may also impart beneficial or stimulatory effects at low doses but that, at higher doses, they produce adverse effects. This concept of “hormesis” was first described for radiation effects but may also pertain to most chemical responses.”

Similarly, the Encyclopedia of Toxicology describes hormesis as an important phenomenon in toxicologic science[16]:

“This type of dose–response relationship is observed in a phenomenon known as hormesis, with one explanation being that exposure to small amounts of a material can actually confer resistance to the agent before frank toxicity begins to appear following exposures to larger amounts.  However, analysis of the available mechanistic studies indicates that there is no single hormetic mechanism. In fact, there are numerous ways for biological systems to show hormetic-like biphasic dose–response relationship. Hormetic dose–response has emerged in recent years as a dose–response phenomenon of great interest in toxicology and risk assessment.”

One might think that hormesis would also be of great interest to federal judges, but they will not learn about it from reading the Reference Manual.

Hormesis research has come into its own.  The International Dose-Response Society, which “focus[es] on the dose-response in the low-dose zone,” publishes a journal, Dose-Response, and a newsletter, BELLE:  Biological Effects of Low Level Exposure.  In 2009, two leading researchers in the area of hormesis published a collection of important papers:  Mark P. Mattson and Edward J. Calabrese, eds., Hormesis: A Revolution in Biology, Toxicology and Medicine (2009).

A check in PubMed shows that LNT has more “hits” than “hormesis” or “hermetic,” but still the latter phrases exceed 1,267 references, hardly insubstantial.  In actuality, there are many more hermetic relationships identified in the scientific literature, which often fails to identify the relationship by the term hormesis or hermetic.[17]

The Reference Manual’s omission of hormesis was regrettable.  Its inclusion of references to LNT but not to hormesis suggests a biased treatment of the subject.

  1. Questionable Substantive Opinions

Readers and litigants would fondly hope that the toxicology chapter would not put forward partisan substantive positions on issues that are currently the subject of active litigation.  Fervently, we would hope that any substantive position advanced would at least be well documented.

For at least one issue, the toxicology chapter disappointed significantly.  Table 1 in the chapter presents a “Sample of Selected Toxicological End Points and Examples of Agents of Concern in Humans.” No documentation or citations are provided for this table.  Most of the exposure agent/disease outcome relationships in the table are well accepted, but curiously at least one agent-disease pair, which is the subject of current litigation, is wildly off the mark:

“Parkinson’s disease and manganese[18]

If the chapter’s authors had looked, they would have found that Parkinson’s disease is almost universally accepted to have no known cause, at least outside court rooms.  They would also have found that the issue has been addressed carefully and the claimed relationship or “concern” has been rejected by the leading researchers in the field (who have no litigation ties).[19]  Table 1 suggests a certain lack of objectivity, and its inclusion of a highly controversial relationship, manganese-Parkinson’s disease, suggests a good deal of partisanship.

  1. When All You Have Is a Hammer, Everything Looks Like a Nail

The substantive area author, Professor Goldstein, is not a physician; nor is he an epidemiologist.  His professional focus on animal and cell research appeared to color and bias the opinions offered in this chapter:[20]

“In qualitative extrapolation, one can usually rely on the fact that a compound causing an effect in one mammalian species will cause it in another species. This is a basic principle of toxicology and pharmacology.  If a heavy metal, such as mercury, causes kidney toxicity in laboratory animals, it is highly likely to do so at some dose in humans.”

Such extrapolations may make sense in regulatory contexts, where precauationary judgments are of interest, but they hardly can be said to be generally accepted in controversies in scientific communities, or in civil actions over actual causation.  There are too many counterexamples to cite, but consider crystalline silica, silicon dioxide.  Silica causes something resembling lung cancer in rats, but not in mice, guinea pigs, or hamsters.  It hardly makes sense to ask juries to decide whether the plaintiff is more like a rat than a mouse.

For a sober second opinion to the toxicology chapter, one may consider the views of some well-known authors:

“Whereas the concordance was high between cancer-causing agents initially discovered in humans and positive results in animal studies (Tomatis et al., 1989; Wilbourn et al., 1984), the same could not be said for the reverse relationship: carcinogenic effects in animals frequently lacked concordance with overall patterns in human cancer incidence (Pastoor and Stevens, 2005).”[21]

III. New Reference Manual’s Uneven Treatment of Causation and of Conflicts of Interest

The third edition of the Reference Manual on Scientific Evidence (RMSE) appeared to get off to a good start in the Preface by Judge Kessler and Dr. Kassirer, when they noted that the Supreme Court mandated federal courts to:

“examine the scientific basis of expert testimony to ensure that it meets the same rigorous standard employed by scientific researchers and practitioners outside the courtroom.”

RMSE at xiii.  The preface faltered, however, on two key issues, causation and conflicts of interest, which are taken up as an introduction to the third edition.

  1. Causation

The authors reported in somewhat squishy terms that causal assessments are judgments:

“Fundamentally, the task is an inferential process of weighing evidence and using judgment to conclude whether or not an effect is the result of some stimulus. Judgment is required even when using sophisticated statistical methods. Such methods can provide powerful evidence of associations between variables, but they cannot prove that a causal relationship exists. Theories of causation (evolution, for example) lose their designation as theories only if the scientific community has rejected alternative theories and accepted the causal relationship as fact. Elements that are often considered in helping to establish a causal relationship include predisposing factors, proximity of a stimulus to its putative outcome, the strength of the stimulus, and the strength of the events in a causal chain.”[22]

The authors left the inferential process as a matter of “weighing evidence,” but without saying anything about how the scientific community does its “weighing.” Language about “proving” causation is also unclear because “proof” in scientific parlance connotes a demonstration, which we typically find in logic or in mathematics. Proving empirical propositions suggests a bar set so high such that the courts must inevitably acquiesce in a very low threshold of evidence.  The question, of course, is how low can and will judges go to admit evidence.

The authors thus introduced hand waving and excuses for why evidence can be weighed differently in court proceedings from the world of science:

“Unfortunately, judges may be in a less favorable position than scientists to make causal assessments. Scientists may delay their decision while they or others gather more data. Judges, on the other hand, must rule on causation based on existing information. Concepts of causation familiar to scientists (no matter what stripe) may not resonate with judges who are asked to rule on general causation (i.e., is a particular stimulus known to produce a particular reaction) or specific causation (i.e., did a particular stimulus cause a particular consequence in a specific instance). In the final analysis, a judge does not have the option of suspending judgment until more information is available, but must decide after considering the best available science.”[23]

But the “best available science” may be pretty crummy, and the temptation to turn desperation into evidence (“well, it’s the best we have now”) is often severe.  The authors of the Preface thus remarkable signalled that “inconclusive” is not a judgment open to judges charged with expert witness gatekeeping.  If the authors truly meant to suggest that judges should go with whatever is dished out as “the best available science,” then they have overlooked the obvious:  Rule 702 opens the door to “scientific, technical, or other specialized knowledge,” not to hunches, suggestive but inconclusive evidence, and wishful thinking about how the science may turn out when further along.  Courts have a choice to exclude expert witness opinion testimony that is based upon incomplete or inconclusive evidence. The authors went fairly far afield to suggest, erroneously, that the incomplete and the inconclusive are good enough and should be admitted.

  1. Conflicts of Interest

Surprisingly, given the scope of the scientific areas covered in the RMSE, the authors discussed conflicts of interest (COI) at some length.  Conflicts of interest are a fact of life in all endeavors, and it is understandable counsel judges and juries to try to identify, assess, and control them.  COIs, however, are weak proxies for unreliability.  The emphasis given here was, however, undue because federal judges were enticed into thinking that they can discern unreliability from COI, when they should be focused on the data, inferences, and analyses.

What becomes fairly clear is that the authors of the Preface set out to use COI as a basis for giving litigation plaintiffs a pass, and for holding back studies sponsored by corporate defendants.

“Conflict of interest manifests as bias, and given the high stakes and adversarial nature of many courtroom proceedings, bias can have a major influence on evidence, testimony, and decisionmaking. Conflicts of interest take many forms and can be based on religious, social, political, or other personal convictions. The biases that these convictions can induce may range from serious to extreme, but these intrinsic influences and the biases they can induce are difficult to identify. Even individuals with such prejudices may not appreciate that they have them, nor may they realize that their interpretations of scientific issues may be biased by them. Because of these limitations, we consider here only financial conflicts of interest; such conflicts are discoverable. Nonetheless, even though financial conflicts can be identified, having such a conflict, even one involving huge sums of money, does not necessarily mean that a given individual will be biased. Having a financial relationship with a commercial entity produces a conflict of interest, but it does not inevitably evoke bias. In science, financial conflict of interest is often accompanied by disclosure of the relationship, leaving to the public the decision whether the interpretation might be tainted. Needless to say, such an assessment may be difficult. The problem is compounded in scientific publications by obscure ways in which the conflicts are reported and by a lack of disclosure of dollar amounts.

Judges and juries, however, must consider financial conflicts of interest when assessing scientific testimony. The threshold for pursuing the possibility of bias must be low. In some instances, judges have been frustrated in identifying expert witnesses who are free of conflict of interest because entire fields of science seem to be co-opted by payments from industry. Judges must also be aware that the research methods of studies funded specifically for purposes of litigation could favor one of the parties. Though awareness of such financial conflicts in itself is not necessarily predictive of bias, such information should be sought and evaluated as part of the deliberations.”[24]

All in all, rather misleading advice.  Financial conflicts are not the only conflicts that can be “discovered.”  Often expert witnesses will have political and organizational alignments, which will show deep-seated ideological alignments with the party for which they are testifying.  For instance, in one silicosis case, an expert witness in the field of history of medicine testified, at an examination before trial, that his father suffered from a silica-related disease.  This witness’s alignment with Marxist historians and his identification with radical labor movements made his non-financial conflicts obvious, although these COI would not necessarily have been apparent from his scholarly publications alone.

How low will the bar be set for discovering COI?  If testifying expert witnesses are relying upon textbooks, articles, essays, will federal courts open the authors/hearsay declarants up to searching discovery of their finances? What really is at stake here is that the issues of accuracy, precision, and reliability are lost in the ad hominem project of discovery COIs.

Also misleading was the suggestion that “entire fields of science seem to be co-opted by payments from industry.”  Do the authors mean to exclude the plaintiffs’ lawyer lawsuit industry, which has become one of the largest rent-seeking organizations, and one of the most politically powerful groups in this country?  In litigations in which I have been involved, I have certainly seen plaintiffs’ counsel, or their proxies – labor unions, federal agencies, or “victim support groups” provide substantial funding for studies.  The Preface authors themselves show an untoward bias by their pointing out industry payments without giving balanced attention to other interested parties’ funding of scientific studies.

The attention to COI was also surprising given that one of the key chapters, for toxic tort practitioners, was written by Dr. Bernard D. Goldstein, who has testified in toxic tort cases, mostly (but not exclusively) for plaintiffs.[25]  In one such case, Makofsky, Dr. Goldstein’s participation was particularly revealing because he was forced to explain why he was willing to opine that benzene caused acute lymphocytic leukemia, despite the plethora of published studies finding no statistically significant relationship.  Dr. Goldstein resorted to the inaccurate notion that scientific “proof” of causation requires 95 percent certainty, whereas he imposed only a 51 percent certainty for his medico-legal testimonial adventures.[26] Dr. Goldstein also attempted to justify the discrepancy from the published literature by adverting to the lower standards used by federal regulatory agencies and treating physicians.  

These explanations were particularly concerning because they reflect basic errors in statistics and in causal reasoning.  The 95 percent derives from the use of the coefficient of confidence in confidence intervals, but the probability involved there is not the probability of the association’s being correct, and it has nothing to do with the probability in the belief that an association is real or is causal.  (Thankfully the RMSE chapter on statistics got this right, but my fear is that judges will skip over the more demanding chapter on statistics and place undue weight on the toxicology chapter.)  The reference to federal agencies (OSHA, EPA, etc.) and to treating physicians was meant, no doubt, to invoke precautionary principle concepts as a justification for some vague, ill-defined, lower standard of causal assessment.  These references were really covert invitations to shift the burden of proof.

The Preface authors might well have taken their own counsel and conducted a more searching assessment of COI among authors of Reference Manual.  Better yet, the authors might have focused the judiciary on the data and the analysis.

  1. Reference Manual on Scientific Evidence (3d edition) on Statistical Significance

How does the new Reference Manual on Scientific Evidence treat statistical significance?  Inconsistently and at times incoherently.

  1. Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raised the question what role statistical significance should play in evaluating a study’s support for causal conclusions[27]:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology. Certainly the failure to rule out causation is not probative of causation. How can that tautology support the claim that inconclusive studies “therefore” have some probative value? Berger’s argument seems obviously invalid, or perhaps text that badly needed a posthumous editor.  And what epidemiologic studies are conclusive?  Are the studies individually or collectively conclusive?  Berger introduced a tantalizing concept, which was not spelled out anywhere in the Manual.

Berger’s chapter raised other, serious problems. If the relied-upon studies are not statistically significant, how should we understand the testifying expert witness to have ruled out random variability as an explanation for the disparity observed in the study or studies?  Berger did not answer these important questions, but her rhetoric elsewhere suggested that trial courts should not look too hard at the statistical support (or its lack) for what expert witness testimony is proffered.

Berger’s citations in support were curiously inaccurate.  Footnote 62 cites the Cook case:

“62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why the court excluded expert’s testimony, even though his epidemiological study did not produce statistically significant results).”

Berger’s citation was disturbingly incomplete.[28] The expert witness, Dr. Clapp, in Cook did rely upon his own study, which did not obtain a statistically significant result, but the trial court admitted the expert witness’s testimony; the court denied the Rule 702 challenge to Clapp, and permitted him to testify about a statistically non-significant ecological study. Given that the judgment of the district court was reversed

Footnote 63 is no better:

“63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert evidence proffered in multidistricted product liability case).”

With respect to the concept of statistical significance, the Viagra case centered around the motion to exclude plaintiffs’ expert witness, Gerald McGwin, who relied upon three studies, none of which obtained a statistically significant result in its primary analysis.  The Viagra court’s review was hardly extensive; the court did not report, discuss, or consider the appropriate point estimates in most of the studies, the confidence intervals around those point estimates, or any aspect of systematic error in the three studies.  At best, the court’s review was perfunctory.  When the defendant brought to light the lack of data integrity in McGwin’s own study, the Viagra MDL court reversed itself, and granted the motion to exclude McGwin’s testimony.[29]  Berger’s chapter omitted the cautionary tale of McGwin’s serious, pervasive errors, and how they led to his ultimate exclusion. Berger’s characterization of the review was incorrect, and her failure to cite the subsequent procedural history, misleading.

  1. Chapter on Statistics

The Third Edition’s chapter on statistics was relatively free of value judgments about significance probability, and, therefore, an improvement over Berger’s introduction.  The authors carefully described significance probability and p-values, and explain[30]:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

Although the chapter conflated the positions often taken to be Fisher’s interpretation of p-values and Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment was unfortunately fairly standard in introductory textbooks.  The authors may have felt that presenting multiple interpretations of p-values was asking too much of judges and lawyers, but the oversimplification invited a false sense of certainty about the inferences that can be drawn from statistical significance.

Kaye and Freedman, however, did offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome[31]:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113

This important qualification to statistical significance was omitted from the overlapping discussion in the chapter on epidemiology, where it was very much needed.

  1. Chapter on Multiple Regression

The chapter on regression did not add much to the earlier and later discussions.  The author asked rhetorically what is the appropriate level of statistical significance, and answers:

“In most scientific work, the level of statistical significance required to reject the null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47

Daniel Rubinfeld, “Reference Guide on Multiple Regression,” in RMSE3d 303, 320.

  1. Chapter on Epidemiology

The chapter on epidemiology[32] mostly muddled the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for ‘significance’ is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

The suggestion that a statistically significant study has results unlikely due to chance, without reminding the reader that the finding is predicated on the assumption that there is no association, and that the probability distribution was correct, and came close to crossing the line in committing the transposition fallacy so nicely described and warned against in the statistics chapter. The problem was that “results” is ambiguous as between the data as extreme or more so than what was observed, and the point estimate of the mean or proportion in the sample, and the assumptions that lead to a p-value were not disclosed.

The suggestion that alpha is “arbitrary,” was “somewhat” correct, but this truncated discussion was distinctly unhelpful to judges who are likely to take “arbitrary“ to mean “I will get reversed.”  The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary.  Some young adults, age 17.8 years old, may be better educated, better engaged in politics, better informed about current events, than 35 year olds, but the law must set a cut off.  Two year olds are demonstrably unfit, and 82 year olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or admitting an opinion based upon a study, in which the disparity observed had a p = 0.15.

The epidemiology chapter correctly called out judicial decisions that confuse “effect size” with statistical significance[33]:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant.); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d 621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y. 1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds, 52 F.3d 1124 (2d Cir. 1995).”

Actually this confusion is not understandable at all.  The distinction has been the subject of teaching since the first edition of the Reference Manual, and two of the cited cases post-date the second edition.  The Southern District of New York asbestos case, of course, predated the first Manual.  To be sure, courts have on occasion badly misunderstood significance probability and significance testing.   The authors of the epidemiology chapter could well have added In re Viagra, to the list of courts that confused effect size with statistical significance.[34]

The epidemiology chapter appropriately chastised courts for confusing significance probability with the probability that the null hypothesis, or its complement, is correct[35]:

“A common error made by lawyers, judges, and academics is to equate the level of alpha with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of a preponderance of the evidence (i.e., greater than 50%).  See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D. Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while expressing his understanding that the legal standard merely required more probable than not). But see Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of this reference guide).

Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s disease was caused by exposure to the agent; the magnitude of the association bears on that question. See infra Section VII. Second, significance testing only bears on whether the observed magnitude of association arose  as a result of random chance, not on whether the null hypothesis is true. Third, using stringent significance testing to avoid false-positive error comes at a complementary cost of inducing false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.”

The footnotes went on to explain further the difference between alpha probability and burden of proof probability, but somewhat misleadingly asserted that “significance testing only bears on whether the observed magnitude of association arose as a result of random chance, not on whether the null hypothesis is true.”[36]  The significance probability does not address the probability that the observed statistic is the result of random chance; rather it describes the probability of observing at least as large a departure from the expected value if the null hypothesis is true.  Of course, if this cumulative probability is sufficiently low, then the null hypothesis is rejected, and this would seem to bear upon whether the null hypothesis is true.  Kaye and Freedman’s chapter on statistics did much better at describing p-values and avoiding the transposition fallacy.

When they stayed on message, the authors of the epidemiology chapter were certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct.  What these authors omitted, however, was a clear statement that the many courts and counsel who have misstated this fact do not create any worthwhile precedent, persuasive or binding.

The epidemiology chapter ultimately failed to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87  Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88

Id. at 578-79.  Mostly true, but again rather unhelpful to judges and lawyers.  Some of the controversy, to be sure, was instigated by statisticians and epidemiologists who would elevate Bayesian methods, and eliminate the use of significance probability and testing altogether. As for those scientists who still work within the dominant frequentist statistical paradigm, the chapter authors divided the world up into “strict” testers and those critical of “strict” testing.  Where, however, is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all?  I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis.  Meta-analysis in such a situation might not just ameliorate concerns about random error, it might eliminate them.  But what of those scientists critical of strict testing?  This certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in the early going in In re Viagra Products Liab. Litig.[37]  The epidemiology chapter’s reference to confidence intervals was correct in part; they permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of magnitude of association, as well as the point estimate of the association obtained from and conditioned on the sample.  Confidence intervals, however, do not eliminate the need to interpret the extent of random error; rather they provide a more direct assessment and measurement of the standard error.

V. Power in the Reference Manual for Scientific Evidence

The Third Edition treated statistical power in three of its chapters, those on statistics, epidemiology, and medical testimony.  Unfortunately, the treatments were not always consistent.

The chapter on statistics has been consistently among the most frequently ignored content of the three editions of the Reference Manual.  The third edition offered a good introduction to basic concepts of sampling, random variability, significance testing, and confidence intervals.[38]  Kaye and Freedman provided an acceptable non-technical definition of statistical power[39]:

“More precisely, power is the probability of rejecting the null hypothesis when the alternative hypothesis … is right. Typically, this probability will depend on the values of unknown parameters, as well as the preset significance level α. The power can be computed for any value of α and any choice of parameters satisfying the alternative hypothesis. Frequentist hypothesis testing keeps the risk of a false positive to a specified level (such as α = 5%) and then tries to maximize power. Statisticians usually denote power by the Greek letter beta (β). However, some authors use β to denote the probability of accepting the null hypothesis when the alternative hypothesis is true; this usage is fairly standard in epidemiology. Accepting the null hypothesis when the alternative holds true is a false negative (also called a Type II error, a missed signal, or a false acceptance of the null hypothesis).”

The definition was not, however, without problems.  First, it introduced a nomenclature issue likely to be confusing for judges and lawyers. Kaye and Freeman used β to denote statistical power, but they acknowledge that epidemiologists use β to denote the probability of a Type II error.  And indeed, both the chapters on epidemiology and medical testimony used β to reference Type II error rate, and thus denote power as the complement of β, or (1- β).[40]

Second, the reason for introducing the confusion about β was doubtful.  Kaye and Freeman suggested that statisticians usually denote power by β, but they offered no citations.  A quick review (not necessarily complete or even a random sample) suggests that many modern statistics texts denote power as (1- β).[41]   At the end of the day, there really was no reason for the conflicting nomenclature and the likely confusion it would engenders.  Indeed, the duplicative handling of statistical power, and other concepts, suggested that it is time to eliminate the repetitive discussions, in favor of one, clear, thorough discussion in the statistics chapter.

Third, Kaye and Freeman problematically refer to β as the probability of accepting the null hypothesis when elsewhere they more carefully instructed that a non-significant finding results in not rejecting the null hypothesis as opposed to accepting the null.  Id. at 253.[42]

Fourth, Kaye and Freeman’s discussion of power, unlike most of their chapter, offered advice that is controversial and unclear:

“On the other hand, when studies have a good chance of detecting a meaningful association, failure to obtain significance can be persuasive evidence that there is nothing much to be found.”[43]

Note that the authors left open what a legal or clinically meaningful association is, and thus offered no real guidance to judges on how to evaluate power after data are collected and analyzed.  As Professor Sander Greenland has argued, in legal contexts, this reliance upon observed power (as opposed to power as a guide in determining appropriate sample size in the planning stages of a study) was arbitrary and “unsalvageable as an analytic tool.”[44]

The chapter on epidemiology offered similar controversial advice on the use of power[45]:

“When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent’s toxicity or is essentially inconclusive with regard to toxicity.93 The concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.94  The power of a study is the probability of finding a statistically significant association of a given magnitude (if it exists) in light of the sample sizes used in the study. The power of a study depends on several factors: the sample size; the level of alpha (or statistical significance) specified; the background incidence of disease; and the specified relative risk that the researcher would like to detect.95  Power curves can be constructed that show the likelihood of finding any given relative risk in light of these factors. Often, power curves are used in the design of a study to determine what size the study populations should be.96

Although the authors correctly emphasized the need to specify an alternative hypothesis, their discussion and advice were empty of how that alternative should be selected in legal contexts.  The suggestion that power curves can be constructed was, of course, true, but irrelevant unless courts know where on the power curve they should be looking.  The authors were also correct that power is used to determine adequate sample size under specified conditions; but again, the use of power curves in this setting is today rather uncommon.  Investigators select a level of power corresponding to an acceptable Type II error rate, and an alternative hypothesis that would be clinically meaningful for their research, in order to determine their sample size. Translating clinical into legal meaningfulness is not always straightforward.

In a footnote, the authors of the epidemiology chapter noted that Professor Rothman has been “one of the leaders in advocating the use of confidence intervals and rejecting strict significance testing.”[46] What the chapter failed, however, to mention is that Rothman has also been outspoken in rejecting post-hoc power calculations that the epidemiology chapter seemed to invite:

“Standard statistical advice states that when the data indicate a lack of significance, it is important to consider the power of the study to detect as significant a specific alternative hypothesis. The power of a test, however, is only an indirect indicator of precision, and it requires an assumption about the magnitude of the effect. In planning a study, it is reasonable to make conjectures about the magnitude of an effect to compute study-size requirements or power. In analyzing data, however, it is always preferable to use the information in the data about the effect to estimate it directly, rather than to speculate about it with study-size or power calculations (Smith and Bates, 1992; Goodman and Berlin, 1994; Hoening and Heisey, 2001). Confidence limits and (even more so) P-value functions convey much more of the essential information by indicating the range of values that are reasonably compatible with the observations (albeit at a somewhat arbitrary alpha level), assuming the statistical model is correct. They can also show that the data do not contain the information necessary for reassurance about an absence of effect.”[47]

The selective, incomplete scholarship of the epidemiology chapter on the issue of statistical power was not only unfortunate, but it distorted the authors’ evaluation of the sparse case law on the issue of power.  For instance, they noted:

“Even when a study or body of studies tends to exonerate an agent, that does not establish that the agent is absolutely safe. See Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767 (N.D. Ohio 2010). Epidemiology is not able to provide such evidence.”[48]

Here the authors, Green, Freedman, and Gordis, shifted the burden to the defendant and then go to an even further extreme of making the burden of proof one of absolute certainty in the product’s safety.  This is not, and never has been, a legal standard. The cases they cited amplified the error. In Cooley, for instance, the defense expert would have opined that welding fume exposure did not cause parkinsonism or Parkinson’s disease.  Although the expert witness had not conducted a meta-analysis, he had reviewed the confidence intervals around the point estimates of the available studies.  Many of the point estimates were at or below 1.0, and in some cases, the upper bound of the confidence interval excluded 1.0.  The trial court expressed its concern that the expert witness had inferred “evidence of absence” from “absence of evidence.”  Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767, 773 (N.D. Ohio 2010).  This concern, however, was misguided given that many studies had tested the claimed association, and that virtually every case-control and cohort study had found risk ratios at or below 1.0, or very close to 1.0.  What the court in Cooley, and the authors of the epidemiology chapter in the third edition have lost sight of, is that when the hypothesis is repeatedly tested, with failure to reject the null hypothesis, and with point estimates at or very close to 1.0, and with narrow confidence intervals, then the claimed association is probably incorrect.[49]

The Cooley court’s comments might have had some validity when applied to a single study, but not to the impressive body of exculpatory epidemiologic evidence that pertained to welding fume and Parkinson’s disease.  Shortly after the Cooley case was decided, a published meta-analysis of welding fume or manganese exposure demonstrated a reduced level of risk for Parkinson’s disease among persons occupationally exposed to welding fumes or manganese.[50]

VI. The Treatment of Meta-Analysis in the Third Edition

Meta-analysis is a statistical procedure for aggregating data and statistics from individual studies into a single summary statistical estimate of the population measurement of interest.  The first meta-analysis is typically attributed to Karl Pearson, circa 1904, who sought a method to overcome the limitations of small sample size and low statistical power.  Statistical methods for meta-analysis in epidemiology and the social sciences, however, did not mature until the 1970s.  Even then, the biomedical scientific community remained skeptical of, if not out rightly hostile to, meta-analysis until relatively recently.

The hostility to meta-analysis, especially in the context of observational epidemiologic studies, was colorfully expressed by two capable epidemiologists, Samuel Shapiro and Alvan Feinstein, as late as the 1990s:

“Meta-analysis begins with scientific studies….  [D]ata from these studies are then run through computer models of bewildering complexity which produce results of implausible precision.”

* * * *

“I propose that the meta-analysis of published non-experimental data should be abandoned.”[51]

The professional skepticism about meta-analysis was reflected in some of the early judicial assessments of meta-analysis in court cases.  In the 1980s and early 1990s, some trial judges erroneously dismissed meta-analysis as a flawed statistical procedure that claimed to make something out of nothing.[52]

In In re Paoli Railroad Yard PCB Litigation, Judge Robert Kelly excluded plaintiffs’ expert witness Dr. William Nicholson and his testimony based upon his unpublished meta-analysis of health outcomes among PCB-exposed workers.  Judge Kelly found that the meta-analysis was a novel technique, and that Nicholson’s meta-analysis was not peer reviewed.  Furthermore, the meta-analysis assessed health outcomes not experienced by any of the plaintiffs before the trial court.[53]

The Court of Appeals for the Third Circuit reversed the exclusion of Dr. Nicholson’s testimony, and remanded for reconsideration with instructions.[54]  The Circuit noted that meta-analysis was not novel, and that the lack of peer-review was not an automatic disqualification.  Acknowledging that a meta-analysis could be performed poorly using invalid methods, the appellate court directed the trial court to evaluate the validity of Dr. Nicholson’s work on his meta-analysis. On remand, however, it seems that plaintiffs chose – wisely – not to proceed with Nicholson’s meta-analysis.[55]

In one of many squirmishes over colorectal cancer claims in asbestos litigation, Judge Sweet in the Southern District of New York was unimpressed by efforts to aggregate data across studies.  Judge Sweet declared that:

“no matter how many studies yield a positive but statistically insignificant SMR for colorectal cancer, the results remain statistically insignificant. Just as adding a series of zeros together yields yet another zero as the product, adding a series of positive but statistically insignificant SMRs together does not produce a statistically significant pattern.”[56]

The plaintiffs’ expert witness who had offered the unreliable testimony, Dr. Steven Markowitz, like Nicholson, another foot soldier in Dr. Irving Selikoff’s litigation machine, did not offer a formal meta-analysis to justify his assessment that multiple non-significant studies, taken together, rule out chance as a likely explanation for an aggregate finding of an increased risk.

Judge Sweet was quite justified in rejecting this back of the envelope, non-quantitative meta-analysis.  His suggestion, however, that multiple non-significant studies could never collectively serve to rule out chance as an explanation for an overall increased rate of disease in the exposed groups is completely wrong.  Judge Sweet would have better focused on the validity issues in key studies, the presence of bias and confounding, and the completeness of the proffered meta-analysis.  The Second Circuit reversed the entry of summary judgment, and remanded the colorectal cancer claim for trial.[57]  Over a decade later, with even more accumulated studies and data, the Institute of Medicine found the evidence for asbestos plaintiffs’ colorectal cancer claims to be scientifically insufficient.[58]

Courts continue to go astray with an erroneous belief that multiple studies, all without statistically significant results, cannot yield a statistically significant summary estimate of increased risk.  See, e.g., Baker v. Chevron USA, Inc., 2010 WL 99272, *14-15 (S.D.Ohio 2010) (addressing a meta-analysis by Dr. Infante on multiple myeloma outcomes in studies of benzene-exposed workers).  There were many sound objections to Infante’s meta-analysis, but the suggestion that multiple studies without statistical significance could not yield a summary estimate of risk with statistical significance was not one of them.

In the last two decades, meta-analysis has emerged as an important technique for addressing random variation in studies, as well as some of the limitations of frequentist statistical methods.  In 1980s, articles reporting meta-analyses were rare to non-existent.  In 2009, there were over 2,300 articles with “meta-analysis” in their title, or in their keywords, indexed in the PubMed database of the National Library of Medicine.[59]

The techniques for aggregating data have been studied, refined, and employed extensively in thousands of methods and application papers in the last decade. Consensus guideline papers have been published for meta-analyses of clinical trials as well as observational studies.[60]  Meta-analyses, of observational studies and of randomized clinical trials, routinely are relied upon by expert witnesses in pharmaceutical and so-called toxic tort litigation.[61]

The second edition of the Reference Manual on Scientific Evidence gave very little attention to meta-analysis; the third edition did not add very much to the discussion.  The time has come for the next edition to address meta-analyses, and criteria for their validity or invalidity.

  1. Statistics Chapter

The statistics chapter of the third edition gave scant attention to meta-analysis.  The chapter noted, in a footnote, that there are formal procedures for aggregating data across studies, and that the power of the aggregated data will exceed the power of the individual, included studies.  The footnote then cautioned that meta-analytic procedures “have their own weakness,”[62] without detailing what that weakness is. The time has come to spell out the weaknesses so that trial judges can evaluate opinion testimony based upon meta-analyses.

The glossary at the end of the statistics chapter offers a definition of meta-analysis:

“meta-analysis. Attempts to combine information from all studies on a certain topic. For example, in the epidemiological context, a meta-analysis may attempt to provide a summary odds ratio and confidence interval for the effect of a certain exposure on a certain disease.”[63]

This definition was inaccurate in ways that could yield serious mischief.  Virtually all meta-analyses are, or should be, built upon a systematic review that sets out to collect all available studies on a research issue of interest.  It is a rare meta-analysis, however, that includes “all” studies in its quantitative analysis.  The meta-analytic process involves a pre-specification of inclusionary and exclusionary criteria for the quantitative analysis of the summary estimate of risk.  Those criteria may limit the quantitative analysis to randomized trials, or to analytical epidemiologic studies.  Furthermore, meta-analyses frequently and appropriately have pre-specified exclusionary criteria that relate to study design or quality.

On a more technical note, the offered definition suggests that the summary estimate of risk will be an odds ratio, which may or may not be true.  Meta-analyses of risk ratios may yield summary estimates of risk in terms of relative risk or hazard ratios, or even of risk differences.  The meta-analysis may combine data of means rather than proportions as well.

  1. Epidemiology Chapter

The chapter on epidemiology delved into meta-analysis in greater detail than the statistics chapter, and offered apparently inconsistent advice.  The overall gist of the chapter, however, can perhaps best be summarized by the definition offered in this chapter’s glossary:

“meta-analysis. A technique used to combine the results of several studies to enhance the precision of the estimate of the effect size and reduce the plausibility that the association found is due to random sampling error.  Meta-analysis is best suited to pooling results from randomly controlled experimental studies, but if carefully performed, it also may be useful for observational studies.”[64]

It is now time to tell trial judges what “careful” means in the context of conducting and reporting and relying upon meta-analyses.

The epidemiology chapter appropriately noted that meta-analysis can help address concerns over random error in small studies.[65]  Having told us that properly conducted meta-analyses of observational studies can be helpful, the chapter then proceeded to hedge considerably[66]:

“Meta-analysis is most appropriate when used in pooling randomized experimental trials, because the studies included in the meta-analysis share the most significant methodological characteristics, in particular, use of randomized assignment of subjects to different exposure groups. However, often one is confronted with nonrandomized observational studies of the effects of possible toxic substances or agents. A method for summarizing such studies is greatly needed, but when meta-analysis is applied to observational studies – either case-control or cohort – it becomes more controversial.174 The reason for this is that often methodological differences among studies are much more pronounced than they are in randomized trials. Hence, the justification for pooling the results and deriving a single estimate of risk, for example, is problematic.175

The stated objection to pooling results for observational studies was certainly correct, but many research topics have sufficient studies available to allow for appropriate selectivity in framing inclusionary and exclusionary criteria to address the objection.  The chapter went on to credit the critics of meta-analyses of observational studies.  As they did in the second edition of the Reference Manual, the authors in the third edition repeated their cites to, and quotes from, early papers by John Bailar, who was then critical of such meta-analyses:

“Much has been written about meta-analysis recently and some experts consider the problems of meta-analysis to outweigh the benefits at the present time. For example, John Bailar has observed:

‘[P]roblems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with the method. For the present . . . I still prefer the thoughtful, old-fashioned review of the literature by a knowledgeable expert who explains and defends the judgments that are presented. We have not yet reached a stage where these judgments can be passed on, even in part, to a formalized process such as meta-analysis.’

John Bailar, “Assessing Assessments,” 277 Science 528, 529 (1997).”[67]

Bailar’s subjective preference for “old-fashioned” reviews, which often cherry picked the included studies is, well, “old fashioned.”  More to the point, it is questionable science, and a distinctly minority viewpoint in the light of substantial improvements in the conduct and reporting of systematic reviews and meta-analyses of observational studies.  Bailar may be correct that some meta-analyses should have never left the protocol stage, but the third edition of the Reference Manual failed to provide the judiciary with the tools to appreciate the distinction between good and bad meta-analyses.

This categorical rejection, cited with apparent approval, is amplified by a recitation of some real or apparent problems with meta-analyses of observational studies.  What is missing is a discussion of how many of these problems can be and are dealt with in contemporary practice[68]:

“A number of problems and issues arise in meta-analysis. Should only published papers be included in the meta-analysis, or should any available studies be used, even if they have not been peer reviewed? Can the results of the meta-analysis itself be reproduced by other analysts? When there are several meta-analyses of a given relationship, why do the results of different meta-analyses often disagree? The appeal of a meta-analysis is that it generates a single estimate of risk (along with an associated confidence interval), but this strength can also be a weakness, and may lead to a false sense of security regarding the certainty of the estimate. A key issue is the matter of heterogeneity of results among the studies being summarized.  If there is more variance among study results than one would expect by chance, this creates further uncertainty about the summary measure from the meta-analysis. Such differences can arise from variations in study quality, or in study populations or in study designs. Such differences in results make it harder to trust a single estimate of effect; the reasons for such differences need at least to be acknowledged and, if possible, explained.176 People often tend to have an inordinate belief in the validity of the findings when a single number is attached to them, and many of the difficulties that may arise in conducting a meta-analysis, especially of observational studies such as epidemiologic ones, may consequently be overlooked.177

The epidemiology chapter authors were entitled to their opinion, but their discussion left the judiciary uninformed about current practice, and best practices, in epidemiology.  A categorical rejection of meta-analyses of observational studies is at odds with the chapter’s own claim that such meta-analyses can be helpful if properly performed. What was needed, and is missing, is a meaningful discussion to help the judiciary determine whether a meta-analysis of observational studies was properly performed.

  1. Medical Testimony Chapter

The chapter on medical testimony is the third pass at meta-analysis in the third edition of the Reference Manual.  The second edition’s chapter on medical testimony ignored meta-analysis completely; the new edition addresses meta-analysis in the context of the hierarchy of study designs[69]:

“Other circumstances that set the stage for an intense focus on medical evidence included

(1) the development of medical research, including randomized controlled trials and other observational study designs;

(2) the growth of diagnostic and therapeutic interventions;141

(3) interest in understanding medical decision making and how physicians reason;142 and

(4) the acceptance of meta-analysis as a method to combine data from multiple randomized trials.143

This language from the medical testimony chapter curiously omitted observational studies, but the footnote reference (note 143) then inconsistently discussed two meta-analyses of observational, rather than experimental, studies.[70]  The chapter then provided even further confusion by giving a more detailed listing of the hierarchy of medical evidence in the form of different study designs[71]:

3. Hierarchy of medical evidence

With the explosion of available medical evidence, increased emphasis has been placed on assembling, evaluating, and interpreting medical research evidence.  A fundamental principle of evidence-based medicine (see also Section IV.C.5, infra) is that the strength of medical evidence supporting a therapy or strategy is hierarchical.  When ordered from strongest to weakest, systematic review of randomized trials (meta-analysis) is at the top, followed by single randomized trials, systematic reviews of observational studies, single observational studies, physiological studies, and unsystematic clinical observations.150 An analysis of the frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized controlled trials in the medical evidence hierarchy.151 Although they are at the bottom of the evidence hierarchy, unsystematic clinical observations or case reports may be the first signals of adverse events or associations that are later confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless, subsequent studies may not confirm initial reports (e.g., the putative association between coffee consumption and pancreatic cancer).154

This discussion further muddied the water by using a parenthetical to suggest that meta-analyses of randomized clinical trials are equivalent to systematic reviews of such studies — “systematic review of randomized trials (meta-analysis).” Of course, systematic reviews are not meta-analyses, although they are usually a necessary precondition for conducting a proper meta-analysis.  The relationship between the procedures for a systematic review and a meta-analysis are in need of clarification, but the judiciary will not find it in the third edition of the Reference Manual.

CONCLUSION

The idea of the Reference Manual was important to support trial judge’s efforts to engage in gatekeeping in unfamiliar subject matter areas. In its third incarnation, the Manual has become a standard starting place for discussion, but on several crucial issues, the third edition was unclear, imprecise, contradictory, or muddled. The organizational committee and authors for the fourth edition have a fair amount of work on their hands. There is clearly room for improvement in the Fourth Edition.


[1] Adam Dutkiewicz, “Book Review: Reference Manual on Scientific Evidence, Third Edition,” 28 Thomas M. Cooley L. Rev. 343 (2011); John A. Budny, “Book Review: Reference Manual on Scientific Evidence, Third Edition,” 31 Internat’l J. Toxicol. 95 (2012); James F. Rogers, Jim Shelson, and Jessalyn H. Zeigler, “Changes in the Reference Manual on Scientific Evidence (Third Edition),” Internat’l Ass’n Def. Csl. Drug, Device & Biotech. Comm. Newsltr. (June 2012).  See Schachtman “New Reference Manual’s Uneven Treatment of Conflicts of Interest.” (Oct. 12, 2011).

[2] The Manual did not do quite so well in addressing its own conflicts of interest.  See, e.g., infra at notes 7, 20.

[3] RSME 3d 11 (2011).

[4] Id. at 19.

[5] Id. at 20 & n. 51 (citing Susan Haack, “An Epistemologist in the Bramble-Bush: At the Supreme Court with Mr. Joiner,” 26 J. Health Pol. Pol’y & L. 217–37 (1999).

[6] Id. at 19-20 & n.52.

[7] Professor Berger filed an amicus brief on behalf of plaintiffs, in Rider v. Sandoz Pharms. Corp., 295 F.3d 1194 (11th Cir. 2002).

[8] Id. at 20 n.51. (The editors noted misleadingly that the published chapter was Berger’s last revision, with “a few edits to respond to suggestions by reviewers.”). I have written elsewhere of the ethical cloud hanging over this Milward decision. SeeCarl Cranor’s Inference to the Best Explanation” (Feb. 12, 2021); “From here to CERT-ainty” (June 28, 2018); “The Council for Education & Research on Toxics” (July 9, 2013) (CERT amicus brief filed without any disclosure of conflict of interest). See also NAS, “Carl Cranor’s Conflicted Jeremiad Against Daubert” (Sept. 23, 2018).

[9] RMSE 3d at 610 (internal citations omitted).

[10] RMSE 3d at 610 n.184 (emphasis, in bold, added).

[11] Interestingly, the authors of this chapter seem to abandon their suggestion that studies relied upon “might qualify for the learned treatise exception to the hearsay rule, Fed. R. Evid. 803(18), or possibly the catchall exceptions, Fed. R. Evid. 803(24) & 804(5),” which was part of their argument in the Second Edition.  RMSE 2d at 335 (2000).  See also RMSE 3d at 214 (discussing statistical studies as generally “admissible,” but acknowledging that admissibility may be no more than permission to explain the basis for an expert’s opinion, which is hardly admissibility at all).

[12] David L. Faigman, et al., Modern Scientific Evidence:  The Law and Science of Expert Testimony v.1, § 23:1,at 206 (2009) (“Well conducted studies are uniformly admitted.”).

[13] See Richard M. Lynch and Mary S. Henifin, “Causation in Occupational Disease: Balancing Epidemiology, Law and Manufacturer Conduct,” 9 Risk: Health, Safety & Environment 259, 269 (1998) (conflating distinct causal and liability concepts, and arguing that legal and scientific causal criteria should be abrogated when manufacturing defendant has breached a duty of care).

[14]  See, e.g., Parker v. Mobil Oil Corp., 7 N.Y.3d 434, 857 N.E.2d 1114, 824 N.Y.S.2d 584 (2006) (dismissing leukemia (AML) claim based upon claimed low-level benzene exposure from gasoline), aff’g 16 A.D.3d 648 (App. Div. 2d Dep’t 2005).  No; you will not find the Parker case cited in the Manual‘s chapter on toxicology. (Parker is, however, cited in the chapter on exposure science even though it is a state court case.).

[15] Curtis D. Klaassen, Casarett & Doull’s Toxicology: The Basic Science of Poisons 23 (7th ed. 2008) (internal citations omitted).

[16] Philip Wexler, Bethesda, et al., eds., 2 Encyclopedia of Toxicology 96 (2005).

[17] See Edward J. Calabrese and Robyn B. Blain, “The hormesis database: The occurrence of hormetic dose responses in the toxicological literature,” 61 Regulatory Toxicology and Pharmacology 73 (2011) (reviewing about 9,000 dose-response relationships for hormesis, to create a database of various aspects of hormesis).  See also Edward J. Calabrese and Robyn B. Blain, “The occurrence of hormetic dose responses in the toxicological literature, the hormesis database: An overview,” 202 Toxicol. & Applied Pharmacol. 289 (2005) (earlier effort to establish hormesis database).

[18] Reference Manual at 653

[19] See e.g., Karin Wirdefeldt, Hans-Olaf Adami, Philip Cole, Dimitrios Trichopoulos, and Jack Mandel, “Epidemiology and etiology of Parkinson’s disease: a review of the evidence.  26 European J. Epidemiol. S1, S20-21 (2011); Tomas R. Guilarte, “Manganese and Parkinson’s Disease: A Critical Review and New Findings,” 118 Environ Health Perspect. 1071, 1078 (2010) (“The available evidence from human and non­human primate studies using behavioral, neuroimaging, neurochemical, and neuropathological end points provides strong sup­port to the hypothesis that, although excess levels of [manganese] accumulation in the brain results in an atypical form of parkinsonism, this clini­cal outcome is not associated with the degen­eration of nigrostriatal dopaminergic neurons as is the case in PD [Parkinson’s disease].”)

[20] RMSE3ed at 646.

[21] Hans-Olov Adami, Sir Colin L. Berry, Charles B. Breckenridge, Lewis L. Smith, James A. Swenberg, Dimitrios Trichopoulos, Noel S. Weiss, and Timothy P. Pastoor, “Toxicology and Epidemiology: Improving the Science with a Framework for Combining Toxicological and Epidemiological Evidence to Establish Causal Inference,” 122 Toxciological Sciences 223, 224 (2011).

[22] RMSE3d at xiv.

[23] RMSE3d at xiv.

[24] RMSE3d at xiv-xv.

[25] See, e.g., Parker v. Mobil Oil Corp., 7 N.Y.3d 434, 857 N.E.2d 1114, 824 N.Y.S.2d 584 (2006); Exxon Corp. v. Makofski, 116 SW 3d 176 (Tex. Ct. App. 2003).

[26] Goldstein here and elsewhere has confused significance probability with the posterior probability required by courts and scientists.

[27] Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

[28] Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1122 (D. Colo. 2006), rev’d and remanded on other grounds, 618 F.3d 1127 (10th Cir. 2010), cert. denied, ___ U.S. ___ (May 24, 2012).

[29] In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009). 

[31] Id. at 256 -57.

[32] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573.

[33] Id. at 573n.68.

[34] See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

[35] RSME3d at 577 n81.

[36] Id.

[37] 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

[38] David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” in RMSE3ed 209 (2011).

[39] Id. at 254 n.106

[40] See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3ed 549, 582, 626 ; John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, Abogado, “Reference Guide on Medical Testimony,” in RMSE3ed 687, 724.  This confusion in nomenclature is regrettable, given the difficulty many lawyers and judges seem have in following discussions of statistical concepts.

[41] See, e.g., Richard D. De Veaux, Paul F. Velleman, and David E. Bock, Intro Stats 545-48 (3d ed. 2012); Rand R. Wilcox, Fundamentals of Modern Statistical Methods 65 (2d ed. 2010).

[42] See also Daniel Rubinfeld, “Reference Guide on Multiple Regression,“ in RMSE3d 303, 321 (describing a p-value > 5% as leading to failing to reject the null hypothesis).

[43] RMSE3d at 254.

[44] See Sander Greenland, “Nonsignificance Plus High Power Does Not Imply Support Over the Alternative,” 22 Ann. Epidemiol. 364, 364 (2012).

[45] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” RMSE3ed 549, 582.

[46] RMSE3d at 579 n.88.

[47] Kenneth Rothman, Sander Greenland, and Timothy Lash, Modern Epidemiology 160 (3d ed. 2008).  See also Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986) (“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”).

[48] RMSE3d at 582 n.93; id. at 582 n.94 (“Thus, in Smith v. Wyeth-Ayerst Labs. Co., 278 F.Supp. 2d 684, 693 (W.D.N.C. 2003), and Cooley v. Lincoln Electric Co., 693 F. Supp. 2d 767, 773 (N.D. Ohio 2010), the courts recognized that the power of a study was critical to assessing whether the failure of the study to find a statistically significant association was exonerative of the agent or inconclusive.”).

[49] See, e.g., Anthony J. Swerdlow, Maria Feychting, Adele C. Green, Leeka Kheifets, David A. Savitz, International Commission for Non-Ionizing Radiation Protection Standing Committee on Epidemiology, “Mobile Phones, Brain Tumors, and the Interphone Study: Where Are We Now?” 119 Envt’l Health Persp. 1534, 1534 (2011) (“Although there remains some uncertainty, the trend in the accumulating evidence is increasingly against the hypothesis that mobile phone use can cause brain tumors in adults.”).

[50] James Mortimer, Amy Borenstein, and Lorene Nelson, “Associations of welding and manganese exposure with Parkinson disease: Review and meta-analysis,” 79 Neurology 1174 (2012).

[51] Samuel Shapiro, “Meta-analysis/Smeta-analysis,” 140 Am. J. Epidem. 771, 777 (1994).  See also Alvan Feinstein, “Meta-Analysis: Statistical Alchemy for the 21st Century,” 48 J. Clin. Epidem. 71 (1995).

[52] Allen v. Int’l Bus. Mach. Corp., No. 94-264-LON, 1997 U.S. Dist. LEXIS 8016, at *71–*74 (suggesting that meta-analysis of observational studies was controversial among epidemiologists).

[53] 706 F. Supp. 358, 373 (E.D. Pa. 1988).

[54] In re Paoli R.R. Yard PCB Litig., 916 F.2d 829, 856-57 (3d Cir. 1990), cert. denied, 499 U.S. 961 (1991); Hines v. Consol. Rail Corp., 926 F.2d 262, 273 (3d Cir. 1991).

[55] SeeThe Shmeta-Analysis in Paoli,” (July 11, 2019).

[56] In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1042 (S.D.N.Y. 1993).

[57] 52 F.3d 1124 (2d Cir. 1995).

[58] Institute of Medicine, Asbestos: Selected Cancers (Wash. D.C. 2006).

[59] See Michael O. Finkelstein and Bruce Levin, “Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia CasesJurimetrics J. (2011).

[60] See Donna Stroup, et al., “Meta-analysis of Observational Studies in Epidemiology: A Proposal for Reporting,” 283 J. Am. Med. Ass’n 2008 (2000) (MOOSE statement); David Moher, Deborah Cook, Susan Eastwood, Ingram Olkin, Drummond Rennie, and Donna Stroup, “Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement,” 354 Lancet 1896 (1999).  See also Jesse Berlin & Carin Kim, “The Use of Meta-Analysis in Pharmacoepidemiology,” in Brian Strom, ed., Pharmacoepidemiology 681, 683–84 (4th ed. 2005); Zachary Gerbarg & Ralph Horwitz, “Resolving Conflicting Clinical Trials: Guidelines for Meta-Analysis,” 41 J. Clin. Epidemiol. 503 (1988).

[61] See Finkelstein & Levin, supra at note 59. See also In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1174, 1184 (N.D. Cal. 2007) (holding that reliance upon “[a] meta-analysis of all available published and unpublished randomized clinical trials” was reasonable and appropriate, and criticizing the expert witnesses who urged the complete rejection of meta-analysis of observational studies).

[62] RMSE 3d at 254 n.107.

[63] Id. at 289.

[64] Reference Guide on Epidemiology, RSME3d at 624.  See also id. at 581 n. 89 (“Meta-analysis is better suited to combining results from randomly controlled experimental studies, but if carefully performed it may also be helpful for observational studies, such as those in the epidemiologic field.”).

[65] Id. at 579; see also id. at 607 n. 171.

[66] Id. at 607.

[67] Id. at 607 n.177.

[68] Id. at 608.

[69] RMSE 3d at 722-23.

[70] Id. at 723 n.143 (“143. … Video Software Dealers Ass’n v. Schwarzenegger, 556 F.3d 950, 963 (9th Cir. 2009) (analyzing a meta-analysis of studies on video games and adolescent behavior); Kennecott Greens Creek Min. Co. v. Mine Safety & Health Admin., 476 F.3d 946, 953 (D.C. Cir. 2007) (reviewing the Mine Safety and Health Administration’s reliance on epidemiological studies and two meta-analyses).”).

[71] Id. at 723-24.

People Get Ready – There’s a Reference Manual a Comin’

July 16th, 2021

Science is the key …

Back in February, I wrote about a National Academies’ workshop that featured some outstanding members of the scientific and statistical world, and which gave participants to identify new potential subjects for inclusion in a proposed fourth edition of the Reference Manual on Scientific Evidence.[1] Funding for that new edition is now secured, and the National Academies has published a précis of the February workshop. National Academies of Sciences, Engineering, and Medicine, Emerging Areas of Science, Engineering, and Medicine for the Courts: Proceedings of a Workshop – in Brief (Washington, DC 2021). The Rapporteurs for these proceedings provide a helpful overview for this meeting, which was not generally covered in the legal media.[2]

The goal of the workshop, which was supported by a planning committee, the Committee on Science, Technology, and Law, the National Academies, the Federal Judicial Center, and the National Science Foundation, was, of course, to identify chapters for a new, fourth edition, of Reference Manual on Scientific Evidence. The workshop was co-chaired by Dr. Thomas D. Albright, of the Salk Institute for Biological Studies, and the Hon. Kathleen McDonald O’Malley, Judge on the U.S. Court of Appeals for the Federal Circuit.

The Rapporteurs duly noted Judge O’Malley’s Workshop comments that she hoped that the reconsideration of the Reference Manual can help close the gap between science and the law. It is thus encouraging that the Rapporteurs focused a large part of their summary on the presentation of Professor Xiao-Li Meng[3] on selection bias, which “can come from cherry picking data, which alters the strength of the evidence.” Meng identified the

“7 S’(ins)” of selection bias:

(1) selection of target/hypothesis (e.g., subgroup analysis);

(2) selection of data (e.g., deleting ‘outliers’ or using only ‘complete cases’);

(3) selection of methodologies (e.g., choosing tests to pass the goodness-of-fit); (4) selective due diligence and debugging (e.g., triple checking only when the outcome seems undesirable);

(5) selection of publications (e.g., only when p-value <0.05);

(6) selections in reporting/summary (e.g., suppressing caveats); and

(7) selections in understanding and interpretation (e.g., our preference for deterministic, ‘common sense’ interpretation).”

Meng also addressed the problem of analyzing subgroup findings after not finding an association in the full sample, dubious algorithms, selection bias in publishing “splashy” and nominally “statistically significant” results, and media bias and incompetence in disseminating study results. Meng discussed how these biases could affect the accuracy of research findings, and how these biases obviously affect the accuracy, validity, and reliability of research findings that are relied upon by expert witnesses in court cases.

The Rapporteurs’ emphasis on Professor Meng’s presentation was noteworthy because the current edition of the Reference Manual is generally lacking in a serious exploration of systematic bias and confounding. To be sure, the concepts are superficially addressed in the Manual’s chapter on epidemiology, but in a way that has allowed many district judges to shrug off serious questions of invalidity with the shibboleth that such questions “to to the weight, not the admissibility,” of challenged expert witness opinion testimony. Perhaps the pending revision to Rule 702 will help improve fidelity to the spirit and text of Rule 702.

Questions of bias and noise have come to receive more attention in the professional statistical and epidemiologic literature. In 2009, Professor Timothy Lash published an important book-length treatment of quantitative bias analysis.[4] Last year, statistician David Hand published a comprehensive, but readily understandable, book on “Dark Data,” and the ways statistical and scientific interference are derailed.[5] One of the presenters at the February workshop, nobel laureate, Daniel Kahneman, published a book on “noise,” just a few weeks ago.[6]

David Hand’s book, Dark Data, (Chapter 10) sets out a useful taxonomy of the ways that data can be subverted by what the consumers of data do not know. The taxonomy would provide a useful organizational map for a new chapter of the Reference Manual:

A Taxonomy of Dark Data

Type 1: Data We Know Are Missing

Type 2: Data We Don’t Know Are Missing

Type 3: Choosing Just Some Cases

Type 4: Self- Selection

Type 5: Missing What Matters

Type 7: Changes with Time

Type 8: Definitions of Data

Type 9: Summaries of Data

Type 11: Feedback and Gaming

Type 12: Information Asymmetry

Type 13: Intentionally Darkened Data

Type 14: Fabricated and Synthetic Data

Type 15: Extrapolating beyond Your Data

Providing guidance not only on “how we know,” but also on how we go astray, patho-epistemology, would be helpful for judges and lawyers. Hand’s book really just a beginning to helping gatekeepers appreciate how superficially plausible health-effects claims are invalidated by the data relied upon by proffered expert witnesses.

* * * * * * * * * * * *

“There ain’t no room for the hopeless sinner
Who would hurt all mankind, just to save his own, believe me now
Have pity on those whose chances grow thinner”


[1]Reference Manual on Scientific Evidence v4.0” (Feb. 28, 2021).

[2] Steven Kendall, Joe S. Cecil, Jason A. Cantone, Meghan Dunn, and Aaron Wolf.

[3] Prof. Meng is the Whipple V. N. Jones Professor of Statistics, in Harvard University. (“Seeking simplicity in statistics, complexity in wine, and everything else in fortune cookies.”)

[4] Timothy L. Lash, Matthew P. Fox, and Aliza K. Fink, Applying Quantitative Bias Analysis to Epidemiologic Data (2009).

[5] David J. Hand, Dark data : why what you don’t know matters (2020).

[6] Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein, Noise: A Flaw in Human Judgment (2021).

Reference Manual on Scientific Evidence v4.0

February 28th, 2021

The need for revisions to the third edition of the Reference Manual on Scientific Evidence (RMSE) has been apparent since its publication in 2011. A decade has passed, and the federal agencies involved in the third edition, the Federal Judicial Center (FJC) and the National Academies of Science Engineering and Medicine (NASEM), are assembling staff to prepare the long-needed revisions.

The first sign of life for this new edition came back on November 24, 2020, when the NASEM held a short, closed door virtual meeting to discuss planning for a fourth edition.[1] The meeting was billed by the NASEM as “the first meeting of the Committee on Emerging Areas of Science, Engineering, and Medicine for the Courts: Identifying Chapters for a Fourth Edition of The Reference Manual on Scientific Evidence.” The Committee members heard from John S. Cooke (FJC Director), and Alan Tomkins and Reggie Sheehan, both of the National Science Foundation (NSF). The stated purpose of the meeting was to review the third edition of the RMSE to identify “identify areas of science, technology, and medicine that may be candidates for new or updated chapters in a proposed new (fourth) edition of the manual.” The only public pronouncement from the first meeting was that the committee would sponsor a workshop on the topic of new chapters for the RMSE, in early 2021.

The Committee’s second meeting took place a week later, again in closed session.[2] The stated purpose of the Committee’s second meeting was to review the third edition of the RMSE, and to discuss candidate areas for inclusion as new and updated chapters for a fourth edition.

Last week saw the Committee’s third, public meeting. The meeting spanned two days (Feb. 24 and 25, 2021), and was open to the public. The meeting was sponsored by NASEM, FJC, along with the NSF, and was co-chaired by Thomas D. Albright, Professor and Conrad T. Prebys Chair at the Salk Institute for Biological Studies, and the Hon. Kathleen McDonald O’Malley, who sits on the United States Court of Appeals for the Federal Circuit. Identified members of the committee include:

Steven M. Bellovin, professor in the Computer Science department at Columbia University;

Karen Kafadar, Departmental Chair and Commonwealth Professor of Statistics at the University of Virginia, and former president of the American Statistical Association;

Andrew Maynard, professor, and director of the Risk Innovation Lab at the School for the Future of Innovation in Society, at Arizona State University;

Venkatachalam Ramaswamy, Director of the Geophysical Fluid Dynamics Laboratory of the National Oceanic and Atmospheric Administration (NOAA) Office of Oceanic and Atmospheric Research (OAR), studying climate modeling and climate change;

Thomas Schroeder, Chief Judge for the U.S. District Court for the Middle District of North Carolina;

David S. Tatel, United States Court of Appeals for the District of Columbia Circuit; and

Steven R. Kendall, Staff Officer

The meeting comprised five panel presentations, made up of remarkably accomplished and talented speakers. Each panel’s presentations were followed by discussion among the panelists, and the committee members. Some panels answered questions submitted from the public audience. Judge O’Malley opened the meeting with introductory remarks about the purpose and scope of the RMSE, and of the inquiry into additional possible chapters.

  1. Challenges in Evaluating Scientific Evidence in Court

The first panel consisted entirely of judges, who held forth on their approaches to judicial gatekeeping of expert witnesses, and their approach to scientific and technical issues. Chief Judge Schroeder moderated the presentations of panelists:

Barbara Parker Hervey, Texas Court of Criminal Appeals;

Patti B. Saris, Chief Judge of the United States District Court for the District of Massachusetts,  member of President’s Council of Advisors on Science and Technology (PCAST);

Leonard P. Stark, U.S. District Court for the District of Delaware; and

Sarah S. Vance, Judge (former Chief Judge) of the U.S. District Court for the Eastern District of Louisiana, chair of the Judicial Panel on Multidistrict Litigation.

  1. Emerging Issues in the Climate and Environmental Sciences

Paul Hanle, of the Environmental Law Institute moderated presenters:

Joellen L. Russell, the Thomas R. Brown Distinguished Chair of Integrative Science and Professor at the University of Arizona in the Department of Geosciences;

Veerabhadran Ramanathan, Edward A. Frieman Endowed Presidential Chair in Climate Sustainability at the Scripps Institution of Oceanography at the University of California, San Diego;

Benjamin D. Santer, atmospheric scientist at Lawrence Livermore National Laboratory; and

Donald J. Wuebbles, the Harry E. Preble Professor of Atmospheric Science at the University of Illinois.

  1. Emerging Issues in Computer Science and Information Technology

Josh Goldfoot, Principal Deputy Chief, Computer Crime & Intellectual Property Section, at U.S. Department of Justice, moderated panelists:

Jeremy J. Epstein, Deputy Division Director of Computer and Information Science and Engineering (CISE) and Computer and Network Systems (CNS) at the National Science Foundation;

Russ Housley, founder of Vigil Security, LLC;

Subbarao Kambhampati, professor of computer science at Arizona State University; and

Alice Xiang, Senior Research Scientist at Sony AI.

  1. Emerging Issues in the Biological Sciences

Panel four was moderated by Professor Ellen Wright Clayton, the Craig-Weaver Professor of Pediatrics, and Professor of Law and of Health Policy at Vanderbilt Law School, at Vanderbilt University. Her panelists were:

Dana Carroll, distinguished professor in the Department of Biochemistry at the University of Utah School of Medicine;

Yaniv Erlich, Chief Executive Officer of Eleven Therapeutics, Chief Science Officer of MyHeritage;

Steven E. Hyman, director of the Stanley Center for Psychiatric Research at Broad Institute of MIT and Harvard; and

Philip Sabes, Professor Emeritus in Physiology at the University of California, San Francisco (UCSF).

  1. Emerging areas in Psychology, Data, and Statistical Sciences

Gary Marchant, Lincoln Professor of Emerging Technologies, Law and Ethics, at Arizona State University’s Sandra Day O’Connor College of Law, moderated panelists:

Xiao-Li Meng, the Whipple V. N. Jones Professor of Statistics, Harvard University, and the Founding Editor-in-Chief of Harvard Data Science Review;

Rebecca Doerge, Glen de Vries Dean of the Mellon College of Science at Carnegie Mellon University, member of the Dietrich College of Humanities and Social Sciences’ Department of Statistics and Data Science, and of the Mellon College of Science’s Department of Biological Sciences;

Daniel Kahneman, Professor of Psychology and Public Affairs Emeritus at the Princeton School of Public and International Affairs, the Eugene Higgins Professor of Psychology Emeritus at Princeton University, and a fellow of the Center for Rationality at the Hebrew University in Jerusalem; and

Goodwin Liu, Associate Justice of the California Supreme Court.

The Proceedings of this two day meeting were recorded and will be published. The website materials are unclear whether the verbatim remarks will be included, but regardless, the proceedings should warrant careful reading.

Judge O’Malley, in her introductory remarks, emphasized that the RMSE must be a neutral, disinterested source of information for federal judges, an aspirational judgment from which there can be no dissent. More controversial will be Her Honor’s assessment that epidemiologic studies can “take forever,” and other judges’ suggestion that plaintiffs lack financial resources to put forward credible, reliable expert witnesses. Judge Vance corrected the course of the discussion by pointing out that MDL plaintiffs were not disadvantaged, but no one pointed out that plaintiffs’ counsel were among the wealthiest individuals in the United States, and that they have been known to sponsor epidemiologic and other studies that wind up as evidence in court.

Panel One was perhaps the most discomforting experience, as it involved revelations about how sausage is made in the gatekeeping process. The panel was remarkable for including a state court judge from Texas, Judge Barbara Parker Hervey, of the Texas Court of Criminal Appeals. Judge Hervey remarked that [in her experience] if we judges “can’t understand it, we won’t read it.” Her dictum raises interesting issues. No doubt, in some instances, the judicial failure of comprehension is the fault of the lawyers. What happens when the judges “can’t understand it”? Do they ask for further briefing? Or do they ask for a hearing with viva voce testimony from expert witnesses? The point was not followed up.

Leonard P. Stark’s insights were interesting in that his docket in the District of Delaware is flooded with patent and Hatch-Waxman Act litigation. Judge Stark’s extensive educational training is in politics and political science. The docket volume Judge Stark described, however, raised issues about how much attention he could give to any one case.

When the panel was asked how they dealt with scientific issues, Judge Saris discussed her presiding over In re Neurontin, which was a “big challenge for me to understand,” with no randomized trials or objective assessments by the litigants.[3] Judge Vance discussed her experience of presiding in a low-level benzene exposure case, in which plaintiff claimed that his acute myelogenous leukemia was caused by gasoline.[4]

Perhaps the key difference in approach to Rule 702 emerged when the judges were asked whether they read the underlying studies. Judge Saris did not answer directly, but stated she reads the reports. Judge Vance, on the other hand, noted that she reads the relied upon studies. In her gasoline-leukemia case, she read the relied-upon epidemiologic studies, which she described as a “hodge podge,” and which were misrepresented by the expert witnesses and counsel. She emphasized the distortions of the adversarial system and the need to moderate its excesses by validating what exactly the expert witnesses had relied upon.

This division in judicial approach was seen again when Professor Karen Kafadar asked how the judges dealt with peer review. Judge Saris seemed to suggest that the peer-reviewed published article was prima facie reliable. Others disagreed and noted that peer reviewed articles can have findings that are overstated, and wrong. One speaker noted that Jerome Kassirer had downplayed the significance of, and the validation provided by, peer review, in the RMSE (3rd ed 2011).

Curiously, there was no discussion of Rule 703, either in Judge O’Malley’s opening remarks on the RMSE, or in the first panel discussion. When someone from the audience submitted a question about the role of Rule 703 in the gatekeeping process, the moderator did not read it.

Panel Two. The climate change panel was a tour de force of the case for anthropogenic climate change. To some, the presentations may have seemed like a reprise of The Day After Tomorrow. Indeed, the science was presented so confidently, if not stridently, that one of the committee members asked whether there could be any reasonable disagreement. The panelists responded essentially by pointing out that there could be no good faith opposition. The panelists were much less convincing on the issue of attributability. None of the speakers addressed the appropriateness vel non of climate change litigation, when the federal and state governments encouraged, licensed, and regulated the exploitation and use of fossil fuel reserves.

Panel Four. Dr. Clayton’s panel was fascinating and likely to lead to new chapters. Professor Hyman presented on heritability, a subject that did not receive much attention in the RMSE third edition. With the advent of genetic claims of susceptibility and defenses of mutation-induced disease, courts will likely need some good advice on navigating the science. Dana Carroll presented on human genome editing (CRISPR). Philip Sabes presented on brain-computer interfaces, which have progressed well beyond the level of sci-fi thrillers, such as The Brain That Wouldn’t Die (“Jan in the Pan”).

In addition to the therapeutic applications, Sabes discussed some of potential forensic uses, such as lie detectors, pain quantification, and the like. Yaniv Erlich, of MyHeritage, discussed advances in forensic genetic genealogy, which have made a dramatic entrance to the common imagination through the apprehension of Joseph James DeAngelo, the Golden State killer. The technique of triangulating DNA matches from consumer DNA databases has other applications, of course, such as identifying lost heirs, and resolving paternity issues.

Panel Five. Professor Marchant’s panel may well have identified some of the most salient needs for the next edition of the RMSE. Nobel Laureate Daniel Kahneman presented some of the highlights from his forthcoming book about “noise” in human judgment.[5] Kahneman’s expansion upon his previous thinking about the sources of error in human – and scientific – judgment are a much needed addition to the RMSE. Along the same lines, Professor Xiao Li Meng, presented on selection bias, and how it pervades scientific work, and detracts from the strength of evidence in the form of:

  1. cherry picking
  2. subgroup analyses
  3. unprincipled handling of outliers
  4. selection in methodologies (different tests)
  5. selection in due diligence (check only when you don’t like results)
  6. publication bias that results from publishing only impressive or statistically significant results
  7. selection in reporting, not reporting limitations all analyses
  8. selection in understanding

Professor Meng’s insights are sorely lacking in the third edition of the RMSE, and among judicial gatekeepers generally.  All too often, undue selectivity in methodologies and in relied-upon data is treated by judges as an issue that “goes to the weight, not the admissibility” of expert witness opinion testimony. In actuality, the selection biases, and other systematic and cognitive biases, are as important as, if not more important than, random error assessments. Indeed a close look at the RMSE third edition reveals a close embrace of the amorphous, anything-goes “weight of the evidence” approach in the epidemiology chapter.  That chapter marginalizes meta-analyses and fails to mention systematic review techiniques altogether. The chapter on clinical medicine, however, takes a divergent approach, emphasizing the hierarchy of evidence inherent in different study types, and the need for principled and systematic reviews of the available evidence.[6]

The Committee co-chairs and panel moderators did a wonderful job to identify important new trends in genetics, data science, error assessment, and computer science, and they should be congratulated for their efforts. Judge O’Malley is certainly correct in saying that the RMSE must be a neutral source of information on statistical and scientific methodologies, and it needs to be revised and updated to address errors and omissions in the previous editions. The legal community should look for, and study, the published proceedings when they become available.

——————————————————————————————————

[1]  SeeEmerging Areas of Science, Engineering, and Medicine for the Courts: Identifying Chapters for a Fourth Edition of The Reference Manual on Scientific Evidence – Committee Meeting” (Nov. 24, 2020).

[2]  SeeEmerging Areas of Science, Engineering, and Medicine for the Courts: Identifying Chapters for a Fourth Edition of The Reference Manual on Scientific Evidence – Committee Meeting 2 (Virtual)” (Dec. 1, 2020).

[3]  In re Neurontin Marketing, Sales Practices & Prods. Liab. Litig., 612 F. Supp. 2d 116 (D. Mass. 2009) (Saris, J.).

[4]  Burst v. Shell Oil Co., 104 F.Supp.3d 773 (E.D.La. 2015) (Vance, J.), aff’d, ___ Fed. App’x ___, 2016 WL 2989261 (5th Cir. May 23, 2016), cert. denied, 137 S.Ct. 312 (2016). SeeThe One Percent Non-solution – Infante Fuels His Own Exclusion in Gasoline Leukemia Case” (June 25, 2015).

[5]  Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein, Noise: A Flaw in Human Judgment (anticipated May 2021).

[6]  See John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, “Reference Guide on Medical Testimony,” Reference Manual on Scientific Evidence 723-24 (3ed ed. 2011) (discussing hierarchy of medical evidence, with systematic reviews at the apex).

On Praising Judicial Decisions – In re Viagra

February 8th, 2021

We live in strange times. A virulent form of tribal stupidity gave us Trumpism, a personality cult in which it impossible to function in the Republican party and criticize der Führe. Even a diehard right-winger such as Liz Cheney, who dared to criticize Trump is censured, for nothing more than being disloyal to a cretin who fomented an insurrection that resulted in the murder of a Capital police officer and the deaths of several other people.[1]

Unfortunately, a similar, even if less extreme, tribal chauvinism affects legal commentary, from both sides of the courtroom. When Judge Richard Seeborg issued an opinion, early in 2020), in the melanoma – phosphodiesterase type 5 inhibitor (PDE5i) litigation,[2] I praised the decision for not shirking the gatekeeping responsibility even when the causal claim was based upon multiple, consistent statistically significant observational studies that showed an association between PDE5i medications and melanoma.[3] Although many of the plaintiffs’ relied-upon studies reported statistically significant associations between PDE5i use and melanoma occurrence, they also found similar size associations with non-melanoma skin cancers. Because skin carcinomas were not part of the hypothesized causal mechanism, the study findings strongly suggested a common, unmeasured confounding variable such as skin damage from ultraviolet light. The plaintiffs’ expert witnesses’ failure to account for confounding was fatal under Rule 702, and Judge Seeborg’s recognition of this defect, and his willingness to go beyond multiple, consistent, statistically significant associations was what made the decision important.

There were, however, problems and even a blatant error in the decision that required attention. Although the error was harmless in that its correction would not have required, or even suggested, a different result, Judge Seeborg, like many other judges and lawyers, tripped up over the proper interpretation of a confidence interval:

“When reviewing the results of a study it is important to consider the confidence interval, which, in simple terms, is the ‘margin of error’. For example, a given study could calculate a relative risk of 1.4 (a 40 percent increased risk of adverse events), but show a 95 percent ‘confidence interval’ of .8 to 1.9. That confidence interval means there is 95 percent chance that the true value—the actual relative risk—is between .8 and 1.9.”[4]

This statement about the true value is simply wrong. The provenance of this error is old, but the mistake was unfortunately amplified in the Third Edition of the Reference Manual on Scientific Evidence,[5] in its chapter on epidemiology.[6] The chapter, which is often cited, twice misstates the meaning of a confidence interval:

“A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”[7]

and

“A confidence interval is a range of possible values calculated from the results of a study. If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population. Thus, the width of the interval reflects random error.”[8]

The 95% confidence interval does represent random error, 1.96 standard errors above and below the point estimate from the sample date. The confidence interval is not the range of possible values, which could well be anything, but the range of reasonable compatible estimates with this one, particular study sample statistic.[9] Intervals have lower and upper bounds, which are themselves random variables, with approximately normal (or some other specified) distributions. The essence of the interval is that no value within the interval would be rejected as a null hypothesis based upon the data collected for the particular sample. Although the chapter on statistics in the Reference Manual accurately describes confidence intervals, judges and many lawyers are misled by the misstatements in the epidemiology chapter.[10]

Given the misdirection created by the Federal Judicial Center’s manual, Judge Seeborg’s erroneous definition of a confidence interval is understandable, but it should be noted in the context of praising the important gatekeeping decision in In re Viagra. Certainly our litigation tribalism should not “allow us to believe” impossible things.[11] The time to revise the Reference Manual is long overdue.

_____________________________________________________________________

[1]  John Ruwitch, “Wyoming GOP Censures Liz Cheney For Voting To Impeach Trump,” Nat’l Pub. Radio (Feb. 6, 2021).

[2]  In re Viagra (Sildenafil Citrate) and Cialis (Tadalafil) Prods. Liab. Litig., 424 F. Supp. 3d 781 (N.D. Cal. 2020) [Viagra].

[3]  SeeJudicial Gatekeeping Cures Claims That Viagra Can Cause Melonoma” (Jan. 24, 2020).

[4]  Id. at 787.

[5]  Federal Judicial Center, Reference Manual on Scientific Evidence (3rd ed. 2011).

[6]  Michael D. Green, D. Michal Freedman, & Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, Reference Manual on Scientific Evidence 549 (3rd ed. 2011).

[7]  Id. at 573.

[8]  Id. at 580.

[9] Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers 171, 173-74 (3rd ed. 2015). See also Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman, “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” 31 Eur. J. Epidem. 337 (2016).

[10]  See, e.g., Derek C. Smith, Jeremy S. Goldkind, and William R. Andrichik, “Statistically Significant Association: Preventing the Misuse of the Bradford Hill Criteria to Prove Causation in Toxic Tort Cases,” 86 Defense Counsel J. 1 (2020) (mischaracterizing the meaning of confidence intervals based upon the epidemiology chapter in the Reference Manual).

[11]  See, e.g., James Beck, “Tort Pandemic Countermeasures? The Ten Best Prescription Drug/Medical Device Decisions of 2020,” Drug and Device Law Blog (Dec. 30, 2020) (suggesting that Judge Seeborg’s decision represented the rejection of plausibility and a single “association” as insufficient); Steven Boranian, “General Causation Experts Excluded In Viagra/Cialis MDL,” (Jan. 23, 2020).

Science Bench Book for Judges

July 13th, 2019

On July 1st of this year, the National Judicial College and the Justice Speakers Institute, LLC released an online publication of the Science Bench Book for Judges [Bench Book]. The Bench Book sets out to cover much of the substantive material already covered by the Federal Judicial Center’s Reference Manual:

Acknowledgments

Table of Contents

  1. Introduction: Why This Bench Book?
  2. What is Science?
  3. Scientific Evidence
  4. Introduction to Research Terminology and Concepts
  5. Pre-Trial Civil
  6. Pre-trial Criminal
  7. Trial
  8. Juvenile Court
  9. The Expert Witness
  10. Evidence-Based Sentencing
  11. Post Sentencing Supervision
  12. Civil Post Trial Proceedings
  13. Conclusion: Judges—The Gatekeepers of Scientific Evidence

Appendix 1 – Frye/Daubert—State-by-State

Appendix 2 – Sample Orders for Criminal Discovery

Appendix 3 – Biographies

The Bench Book gives some good advice in very general terms about the need to consider study validity,[1] and to approach scientific evidence with care and “healthy skepticism.”[2] When the Bench Book attempts to instruct on what it represents the scientific method of hypothesis testing, the good advice unravels:

“A scientific hypothesis simply cannot be proved. Statisticians attempt to solve this dilemma by adopting an alternate [sic] hypothesis – the null hypothesis. The null hypothesis is the opposite of the scientific hypothesis. It assumes that the scientific hypothesis is not true. The researcher conducts a statistical analysis of the study data to see if the null hypothesis can be rejected. If the null hypothesis is found to be untrue, the data support the scientific hypothesis as true.”[3]

Even in experimental settings, a statistical analysis of the data do not lead to a conclusion that the null hypothesis is untrue, as opposed to not reasonably compatible with the study’s data. In observational studies, the statistical analysis must acknowledge whether and to what extent the study has excluded bias and confounding. When the Bench Book turns to speak of statistical significance, more trouble ensues:

“The goal of an experiment, or observational study, is to achieve results that are statistically significant; that is, not occurring by chance.”[4]

In the world of result-oriented science, and scientific advocacy, it is perhaps true that scientists seek to achieve statistically significant results. Still, it seems crass to come right out and say so, as opposed to saying that the scientists are querying the data to see whether they are compatible with the null hypothesis. This first pass at statistical significance is only mildly astray compared with the Bench Book’s more serious attempts to define statistical significance and confidence intervals:

4.10 Statistical Significance

The research field agrees that study outcomes must demonstrate they are not the result of random chance. Leaving room for an error of .05, the study must achieve a 95% level of confidence that the results were the product of the study. This is denoted as p ≤ 05. (or .01 or .1).”[5]

and

“The confidence interval is also a way to gauge the reliability of an estimate. The confidence interval predicts the parameters within which a sample value will fall. It looks at the distance from the mean a value will fall, and is measured by using standard deviations. For example, if all values fall within 2 standard deviations from the mean, about 95% of the values will be within that range.”[6]

Of course, the interval speaks to the precision of the estimate, not its reliability, but that is a small point. These definitions are virtually guaranteed to confuse judges into conflating statistical significance and the coefficient of confidence with the legal burden of proof probability.

The Bench Book runs into problems in interpreting legal decisions, which would seem softer grist for the judicial mill. The authors present dictum from the Daubert decision as though it were a holding:[7]

“As noted in Daubert, ‘[t]he focus, of course, must be solely on principles and methodology, not on the conclusions they generate’.”

The authors fail to mention that this dictum was abandoned in Joiner, and that it is specifically rejected by statute, in the 2000 revision to the Federal Rule of Evidence 702.

Early in the Bench Book, it authors present a subsection entitled “The Myth of Scientific Objectivity,” which they might have borrowed from Feyerabend or Derrida. The heading appears misleading because the text contradicts it:

“Scientists often develop emotional attachments to their work—it can be difficult to abandon an idea. Regardless of bias, the strongest intellectual argument, based on accepted scientific hypotheses, will always prevail, but the road to that conclusion may be fraught with scholarly cul-de-sacs.”[8]

In a similar vein, the authors misleadingly tell readers that “the forefront of science is rarely encountered in court,” and so “much of the science mentioned there shall be considered established….”[9] Of course, the reality is that many causal claims presented in court have already been rejected or held to be indeterminate by the scientific community. And just when readers may think themselves safe from the goblins of nihilism, the authors launch into a theory of naïve probabilism that science is just placing subjective probabilities upon data, based upon preconceived biases and beliefs:

“All of these biases and beliefs play into the process of weighing data, a critical aspect of science. Placing weight on a result is the process of assigning a probability to an outcome. Everything in the universe can be expressed in probabilities.”[10]

So help the expert witness who honestly (and correctly) testifies that the causal claim or its rejection cannot be expressed as a probability statement!

Although I have not read all of the Bench Book closely, there appears to be no meaningful discussion of Rule 703, or of the need to access underlying data to ensure that the proffered scientific opinion under scrutiny has used appropriate methodologies at every step in its development. Even a 412 text cannot address every issue, but this one does little to help the judicial reader find more in-depth help on statistical and scientific methodological issues that arise in occupational and environmental disease claims, and in pharmaceutical products litigation.

The organizations involved in this Bench Book appear to be honest brokers of remedial education for judges. The writing of this Bench Book was funded by the State Justice Institute (SJI) Which is a creation of federal legislation enacted with the laudatory goal of improving the quality of judging in state courts.[11] Despite its provenance in federal legislation, the SJI is a a private, nonprofit corporation, governed by 11 directors appointed by the President, and confirmed by the Senate. A majority of the directors (six) are state court judges, one state court administrator, and four members of the public (no more than two from any one political party). The function of the SJI is to award grants to improve judging in state courts.

The National Judicial College (NJC) originated in the early 1960s, from the efforts of the American Bar Association, American Judicature Society and the Institute of Judicial Administration, to provide education for judges. In 1977, the NJC became a Nevada not-for-profit (501)(c)(3) educational corporation, which its campus at the University of Nevada, Reno, where judges could go for training and recreational activities.

The Justice Speakers Institute appears to be a for-profit company that provides educational resources for judge. A Press Release touts the Bench Book and follow-on webinars. Caveat emptor.

The rationale for this Bench Book is open to question. Unlike the Reference Manual for Scientific Evidence, which was co-produced by the Federal Judicial Center and the National Academies of Science, the Bench Book’s authors are lawyers and judges, without any subject-matter expertise. Unlike the Reference Manual, the Bench Book’s chapters have no scientist or statistician authors, and it shows. Remarkably, the Bench Book does not appear to cite to the Reference Manual or the Manual on Complex Litigation, at any point in its discussion of the federal law of expert witnesses or of scientific or statistical method. Perhaps taxpayers would have been spared substantial expense if state judges were simply encouraged to read the Reference Manual.


[1]  Bench Book at 190.

[2]  Bench Book at 174 (“Given the large amount of statistical information contained in expert reports, as well as in the daily lives of the general society, the ability to be a competent consumer of scientific reports is challenging. Effective critical review of scientific information requires vigilance, and some healthy skepticism.”).

[3]  Bench Book at 137; see also id. at 162.

[4]  Bench Book at 148.

[5]  Bench Book at 160.

[6]  Bench Book at 152.

[7]  Bench Book at 233, quoting Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 595 (1993).

[8]  Bench Book at 10.

[9]  Id. at 10.

[10]  Id. at 10.

[11] See State Justice Institute Act of 1984 (42 U.S.C. ch. 113, 42 U.S.C. § 10701 et seq.).

N.J. Supreme Court Uproots Weeds in Garden State’s Law of Expert Witnesses

August 8th, 2018

The United States Supreme Court’s decision in Daubert is now over 25 years old. The idea of judicial gatekeeping of expert witness opinion testimony is even older in New Jersey state courts. The New Jersey Supreme Court articulated a reliability standard before the Daubert case was even argued in Washington, D.C. See Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991). Articulating a standard, however, is something very different from following a standard, and in many New Jersey trial courts, until very recently, the standard was pretty much anything goes.

One counter-example to the general rule of dog-eat-dog in New Jersey was Judge Nelson Johnson’s careful review and analysis of the proffered causation opinions in cases in which plaintiffs claimed that their use of the anti-acne medication isotretinoin (Accutane) caused Crohn’s disease. Judge Johnson, who sits in the Law Division of the New Jersey Superior Court for Atlantic County held a lengthy hearing, and reviewed the expert witnesses’ reliance materials.1 Judge Johnson found that the plaintiffs’ expert witnesses had employed undue selectivity in choosing what to rely upon. Perhaps even more concerning, Judge Johnson found that these witnesses had refused to rely upon reasonably well-conducted epidemiologic studies, while embracing unpublished, incomplete, and poorly conducted studies and anecdotal evidence. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div., Atlantic Cty. Feb. 20, 2015). In response, Judge Johnson politely but firmly closed the gate to conclusion-driven duplicitous expert witness causation opinions in over 2,000 personal injury cases. “Johnson of Accutane – Keeping the Gate in the Garden State” (Mar. 28, 2015).

Aside from resolving over 2,000 pending cases, Judge Johnson’s judgment was of intense interest to all who are involved in pharmaceutical and other products liability litigation. Judge Johnson had conducted a pretrial hearing, sometimes called a Kemp hearing in New Jersey, after the New Jersey Supreme Court’s opinion in Kemp v. The State of New Jersey, 174 N.J. 412 (2002). At the hearing and in his opinion that excluded plaintiffs’ expert witnesses’ causation opinions, Judge Johnson demonstrated a remarkable aptitude for analyzing data and inferences in the gatekeeping process.

When the courtroom din quieted, the trial court ruled that the proffered testimony of Dr., Arthur Kornbluth and Dr. David Madigan did not meet the liberal New Jersey test for admissibility. In re Accutane, No. 271(MCL), 2015 WL 753674, 2015 BL 59277 (N.J.Super. Law Div. Atlantic Cty. Feb. 20, 2015). And in closing the gate, Judge Johnson protected the judicial process from several bogus and misleading “lines of evidence,” which have become standard ploys to mislead juries in courthouses where the gatekeepers are asleep. Recognizing that not all evidence is on the same analytical plane, Judge Johnson gave case reports short shrift.

[u]nsystematic clinical observations or case reports and adverse event reports are at the bottom of the evidence hierarchy.”

Id. at *16. Adverse event reports, largely driven by the very litigation in his courtroom, received little credit and were labeled as “not evidentiary in a court of law.” Id. at 14 (quoting FDA’s description of FAERS).

Judge Johnson recognized that there was a wide range of identified “risk factors” for irritable bowel syndrome, such as prior appendectomy, breast-feeding as an infant, stress, Vitamin D deficiency, tobacco or alcohol use, refined sugars, dietary animal fat, fast food. In re Accutane, 2015 WL 753674, at *9. The court also noted that there were four medications generally acknowledged to be potential risk factors for inflammatory bowel disease: aspirin, nonsteroidal anti-inflammatory medications (NSAIDs), oral contraceptives, and antibiotics. Understandably, Judge Johnson was concerned that the plaintiffs’ expert witnesses preferred studies unadjusted for potential confounding co-variables and studies that had involved “cherry picking the subjects.” Id. at *18.

Judge Johnson had found that both sides in the isotretinoin cases conceded the relative unimportance of animal studies, but the plaintiffs’ expert witnesses nonetheless invoked the animal studies in the face of the artificial absence of epidemiologic studies that had been created by their cherry-picking strategies. Id.

Plaintiffs’ expert witnesses had reprised a common claimants’ strategy; namely, they claimed that all the epidemiology studies lacked statistical power. Their arguments often ignored that statistical power calculations depend upon statistical significance, a concept to which many plaintiffs’ counsel have virulent antibodies, as well as an arbitrarily selected alternative hypothesis of association size. Furthermore, the plaintiffs’ arguments ignored the actual point estimates, most of which were favorable to the defense, and the observed confidence intervals, most of which were reasonably narrow.

The defense responded to the bogus statistical arguments by presenting an extremely capable clinical and statistical expert witness, Dr. Stephen Goodman, to present a meta-analysis of the available epidemiologic evidence.

Meta-analysis has become an important facet of pharmaceutical and other products liability litigation[1]. Fortunately for Judge Johnson, he had before him an extremely capable expert witness, Dr. Stephen Goodman, to explain meta-analysis generally, and two meta-analyses he had performed on isotretinoin and irritable bowel outcomes.

Dr. Goodman explained that the plaintiffs’ witnesses’ failure to perform a meta-analysis was telling when meta-analysis can obviate the plaintiffs’ hyperbolic statistical complaints:

the strength of the meta-analysis is that no one feature, no one study, is determinant. You don’t throw out evidence except when you absolutely have to.”

In re Accutane, 2015 WL 753674, at *8.

Judge Johnson’s judicial handiwork received non-deferential appellate review from a three-judge panel of the Appellate Division, which reversed the exclusion of Kornbluth and Madigan. In re Accutane Litig., 451 N.J. Super. 153, 165 A.3d 832 (App. Div. 2017). The New Jersey Supreme Court granted the isotretinoin defendants’ petition for appellate review, and the issues were joined over the appropriate standard of appellate review for expert witness opinion exclusions, and the appropriateness of Judge Johnson’s exclusions of Kornbluth and Madigan. A bevy of amici curiae joined in the fray.2

Last week, the New Jersey Supreme Court issued a unanimous opinion, which reversed the Appellate Division’s holding that Judge Johnson had “mistakenly exercised” discretion. Applying its own precedents from Rubanick, Landrigan, and Kemp, and the established abuse-of-discretion standard, the Court concluded that the trial court’s ruling to exclude Kornbluth and Madigan was “unassailable.” In re Accutane Litig., ___ N.J. ___, 2018 WL 3636867 (2018), Slip op. at 79.3

The high court graciously acknowledged that defendants and amici had “good reason” to seek clarification of New Jersey law. Slip op. at 67. In abandoning abuse-of-discretion as its standard of review, the Appellate Division had relied upon a criminal case that involved the application of the Frye standard, which is applied as a matter of law. Id. at 70-71. The high court also appeared to welcome the opportunity to grant review and reverse the intermediate court reinforce “the rigor expected of the trial court” in its gatekeeping role. Id. at 67. The Supreme Court, however, did not articulate a new standard; rather it demonstrated at length that Judge Johnson had appropriately applied the legal standards that had been previously announced in New Jersey Supreme Court cases.4

In attempting to defend the Appellate Division’s decision, plaintiffs sought to characterize New Jersey law as somehow different from, and more “liberal” than, the United States Supreme Court’s decision in Daubert. The New Jersey Supreme Court acknowledged that it had never formally adopted the dicta from Daubert about factors that could be considered in gatekeeping, slip op. at 10, but the Court went on to note what disinterested observers had long understood, that the so-called Daubert factors simply flowed from a requirement of sound methodology, and that there was “little distinction” and “not much light” between the Landrigan and Rubanick principles and the Daubert case or its progeny. Id at 10, 80.

Curiously, the New Jersey Supreme Court announced that the Daubert factors should be incorporated into the New Jersey Rules 702 and 703 and their case law, but it stopped short of declaring New Jersey a “Daubert” jurisdiction. Slip op. at 82. In part, the Court’s hesitance followed from New Jersey’s bifurcation of expert witness standards for civil and criminal cases, with the Frye standard still controlling in the criminal docket. At another level, it makes no sense to describe any jurisdiction as a “Daubert” state because the relevant aspects of the Daubert decision were dicta, and the Daubert decision and its progeny were superseded by the revision of the controlling statute in 2000.5

There were other remarkable aspects of the Supreme Court’s Accutane decision. For instance, the Court put its weight behind the common-sense and accurate interpretation of Sir Austin Bradford Hill’s famous articulation of factors for causal judgment, which requires that sampling error, bias, and confounding be eliminated before assessing whether the observed association is strong, consistent, plausible, and the like. Slip op. at 20 (citing the Reference Manual at 597-99), 78.

The Supreme Court relied extensively on the National Academies’ Reference Manual on Scientific Evidence.6 That reliance is certainly preferable to judicial speculations and fabulations of scientific method. The reliance is also positive, considering that the Court did not look only at the problematic epidemiology chapter, but adverted also to the chapters on statistical evidence and on clinical medicine.

The Supreme Court recognized that the Appellate Division had essentially sanctioned an anything goes abandonment of gatekeeping, an approach that has been all-too-common in some of New Jersey’s lower courts. Contrary to the previously prevailing New Jersey zeitgeist, the Court instructed that gatekeeping must be “rigorous” to “prevent[] the jury’s exposure to unsound science through the compelling voice of an expert.” Slip op. at 68-9.

Not all evidence is equal. “[C]ase reports are at the bottom of the evidence hierarchy.” Slip op. at 73. Extrapolation from non-human animal studies is fraught with external validity problems, and such studies “far less probative in the face of a substantial body of epidemiologic evidence.” Id. at 74 (internal quotations omitted).

Perhaps most chilling for the lawsuit industry will be the Supreme Court’s strident denunciation of expert witnesses’ selectivity in choosing lesser evidence in the face of a large body of epidemiologic evidence, id. at 77, and their unprincipled cherry picking among the extant epidemiologic publications. Like the trial court, the Supreme Court found that the plaintiffs’ expert witnesses’ inconsistent use of methodological criteria and their selective reliance upon studies (disregarding eight of the nine epidemiologic studies) that favored their task masters was the antithesis of sound methodology. Id. at 73, citing with approval, In re Lipitor, ___ F.3d ___ (4th Cir. 2018) (slip op. at 16) (“Result-driven analysis, or cherry-picking, undermines principles of the scientific method and is a quintessential example of applying methodologies (valid or otherwise) in an unreliable fashion.”).

An essential feature of the Supreme Court’s decision is that it was not willing to engage in the common reductionism that has “all epidemiologic studies are flawed,” and which thus privileges cherry picking. Not all disagreements between expert witnesses can be framed as differences in interpretation. In re Accutane will likely stand as a bulwark against flawed expert witness opinion testimony in the Garden State for a long time.


1 Judge Nelson Johnson is also the author of Boardwalk Empire: The Birth, High Times, and Corruption of Atlantic City (2010), a spell-binding historical novel about political and personal corruption.

2 In support of the defendants’ positions, amicus briefs were filed by the New Jersey Business & Industry Association, Commerce and Industry Association of New Jersey, and New Jersey Chamber of Commerce; by law professors Kenneth S. Broun, Daniel J. Capra, Joanne A. Epps, David L. Faigman, Laird Kirkpatrick, Michael M. Martin, Liesa Richter, and Stephen A. Saltzburg; by medical associations the American Medical Association, Medical Society of New Jersey, American Academy of Dermatology, Society for Investigative Dermatology, American Acne and Rosacea Society, and Dermatological Society of New Jersey, by the Defense Research Institute; by the Pharmaceutical Research and Manufacturers of America; and by New Jersey Civil Justice Institute. In support of the plaintiffs’ position and the intermediate appellate court’s determination, amicus briefs were filed by political action committee the New Jersey Association for Justice; by the Ironbound Community Corporation; and by plaintiffs’ lawyer Allan Kanner.

3 Nothing in the intervening scientific record called question upon Judge Johnson’s trial court judgment. See, e.g., I.A. Vallerand, R.T. Lewinson, M.S. Farris, C.D. Sibley, M.L. Ramien, A.G.M. Bulloch, and S.B. Patten, “Efficacy and adverse events of oral isotretinoin for acne: a systematic review,” 178 Brit. J. Dermatol. 76 (2018).

4 Slip op. at 9, 14-15, citing Landrigan v. Celotex Corp., 127 N.J. 404, 414 (1992); Rubanick v. Witco Chem. Corp., 125 N.J. 421, 447 (1991) (“We initially took that step to allow the parties in toxic tort civil matters to present novel scientific evidence of causation if, after the trial court engages in rigorous gatekeeping when reviewing for reliability, the proponent persuades the court of the soundness of the expert’s reasoning.”).

5 The Court did acknowledge that Federal Rule of Evidence 702 had been amended in 2000, to reflect the Supreme Court’s decision in Daubert, Joiner, and Kumho Tire, but the Court did not deal with the inconsistencies between the present rule and the 1993 Daubert case. Slip op. at 64, citing Calhoun v. Yamaha Motor Corp., U.S.A., 350 F.3d 316, 320-21, 320 n.8 (3d Cir. 2003).

6 See Accutane slip op. at 12-18, 24, 73-74, 77-78. With respect to meta-analysis, the Reference Manual’s epidemiology chapter is still stuck in the 1980s and the prevalent resistance to poorly conducted, often meaningless meta-analyses. SeeThe Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011) (The Reference Manual fails to come to grips with the prevalence and importance of meta-analysis in litigation, and fails to provide meaningful guidance to trial judges).

Scientific Evidence in Canadian Courts

February 20th, 2018

A couple of years ago, Deborah Mayo called my attention to the Canadian version of the Reference Manual on Scientific Evidence.1 In the course of discussion of mistaken definitions and uses of p-values, confidence intervals, and significance testing, Sander Greenland pointed to some dubious pronouncements in the Science Manual for Canadian Judges [Manual].

Unlike the United States federal court Reference Manual, which is published through a joint effort of the National Academies of Science, Engineering, and Medicine, the Canadian version, is the product of the Canadian National Judicial Institute (NJI, or the Institut National de la Magistrature, if you live in Quebec), which claims to be an independent, not-for-profit group, that is committed to educating Canadian judges. In addition to the Manual, the Institute publishes Model Jury Instructions and a guide, Problem Solving in Canada’s Courtrooms: A Guide to Therapeutic Justice (2d ed.), as well as conducting educational courses.

The NJI’s website describes the Instute’s Manual as follows:

Without the proper tools, the justice system can be vulnerable to unreliable expert scientific evidence.

         * * *

The goal of the Science Manual is to provide judges with tools to better understand expert evidence and to assess the validity of purportedly scientific evidence presented to them. …”

The Chief Justice of Canada, Hon. Beverley M. McLachlin, contributed an introduction to the Manual, which was notable for its frank admission that:

[w]ithout the proper tools, the justice system is vulnerable to unreliable expert scientific evidence.

****

Within the increasingly science-rich culture of the courtroom, the judiciary needs to discern ‘good’ science from ‘bad’ science, in order to assess expert evidence effectively and establish a proper threshold for admissibility. Judicial education in science, the scientific method, and technology is essential to ensure that judges are capable of dealing with scientific evidence, and to counterbalance the discomfort of jurists confronted with this specific subject matter.”

Manual at 14. These are laudable goals, indeed, but did the National Judicial Institute live up to its stated goals, or did it leave Canadian judges vulnerable to the Institute’s own “bad science”?

In his comments on Deborah Mayo’s blog, Greenland noted some rather cavalier statements in Chapter two that suggest that the conventional alpha of 5% corresponds to a “scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it.” And he, pointed elsewhere where the chapter seems to suggest that the coefficient of confidence that corresponds to an alpha of 5% “constitutes a rather high standard of proof,” thus confusing and conflating probability of random error with posterior probabilities. Greenland is absolutely correct that the Manual does a rather miserable job of educating Canadian judges if our standard for its work product is accuracy and truth.

Some of the most egregious errors are within what is perhaps the most important chapter of the Manual, Chapter 2, “Science and the Scientific Method.” The chapter has two authors, a scientist, Scott Findlay, and a lawyer, Nathalie Chalifour. Findlay is an Associate Professor, in the Department of Biology, of the University of Ottawa. Nathalie Chalifour is an Associate Professor on the Faculty of Law, also in the University of Ottawa. Together, they produced some dubious pronouncements, such as:

Weight of the Evidence (WOE)

First, the concept of weight of evidence in science is similar in many respects to its legal counterpart. In both settings, the outcome of a weight-of-evidence assessment by the trier of fact is a binary decision.”

Manual at 40. Findlay and Chalifour cite no support for their characterization of WOE in science. Most attempts to invoke WOE are woefully vague and amorphous, with no meaningful guidance or content.2  Sixty-five pages later, if any one is noticing, the authors let us in a dirty, little secret:

at present, there exists no established prescriptive methodology for weight of evidence assessment in science.”

Manual at 105. The authors omit, however, that there are prescriptive methods for inferring causation in science; you just will not see them in discussions of weight of the evidence. The authors then compound the semantic and conceptual problems by stating that “in a civil proceeding, if the evidence adduced by the plaintiff is weightier than that brought forth by the defendant, a judge is obliged to find in favour of the plaintiff.” Manual at 41. This is a remarkable suggestion, which implies that if the plaintiff adduces the crummiest crumb of evidence, a mere peppercorn on the scales of justice, but the defendant has none to offer, that the plaintiff must win. The plaintiff wins notwithstanding that no reasonable person could believe that the plaintiff’s claims are more likely than not true. Even if there were the law of Canada, it is certainly not how scientists think about establishing the truth of empirical propositions.

Confusion of Hypothesis Testing with “Beyond a Reasonable Doubt”

The authors’ next assault comes in conflating significance probability with the probability connected with the burden of proof, a posterior probability. Legal proceedings have a defined burden of proof, with criminal cases requiring the state to prove guilt “beyond a reasonable doubt.” Findlay and Chalifour’s discussion then runs off the rails by likening hypothesis testing, with an alpha of 5% or its complement, 95%, as a coefficient of confidence, to a “very high” burden of proof:

In statistical hypothesis-testing – one of the tools commonly employed by scientists – the predisposition is that there is a particular hypothesis (the null hypothesis) that is assumed to be true unless sufficient evidence is adduced to overturn it. But in statistical hypothesis-testing, the standard of proof has traditionally been set very high such that, in general, scientists will only (provisionally) reject the null hypothesis if they are at least 95% sure it is false. Third, in both scientific and legal proceedings, the setting of the predisposition and the associated standard of proof are purely normative decisions, based ultimately on the perceived consequences of an error in inference.”

Manual at 41. This is, as Greenland and many others have pointed out, a totally bogus conception of hypothesis testing, and an utterly false description of the probabilities involved.

Later in the chapter, Findlay and Chalifour flirt with the truth, but then lapse into an unrecognizable parody of it:

Inferential statistics adopt the frequentist view of probability whereby a proposition is either true or false, and the task at hand is to estimate the probability of getting results as discrepant or more discrepant than those observed, given the null hypothesis. Thus, in statistical hypothesis testing, the usual inferred conclusion is either that the null is true (or rather, that we have insufficient evidence to reject it) or it is false (in which case we reject it). 16 The decision to reject or not is based on the value of p if the estimated value of p is below some threshold value a, we reject the null; otherwise we accept it.”

Manual at 74. OK; so far so good, but here comes the train wreck:

By convention (and by convention only), scientists tend to set α = 0.05; this corresponds to the collective – and, one assumes, consensual – scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it. It is partly because of this that scientists have the reputation of being a notoriously conservative lot, given that a 95% threshold constitutes a rather high standard of proof.”

Manual at 75. Uggh; so we are back to significance probability’s being a posterior probability. As if to atone for their sins, in the very next paragraph, the authors then remind the judicial readers that:

As noted above, p is the probability of obtaining results at least as discrepant as those observed if the null is true. This is not the same as the probability of the null hypothesis being true, given the results.”

Manual at 75. True, true, and completely at odds with what the authors have stated previously. And to add to the reader’s now fully justified conclusion, the authors describe the standard for rejecting the null hypothesis as “very high indeed.” Manual at 102, 109. Any reader who is following the discussion might wonder how and why there is such a problem of replication and reproducibility in contemporary science.

Conflating Bayesianism with Frequentist Modes of Inference

We have seen how Findlay and Chalifour conflate significance and posterior probabilities, some of the time. In a section of their chapter that deals explicitly with probability, the authors tell us that before any study is conducted the prior probability of the truth of the tested hypothesis is 50%, sans evidence. This an astonishing creation of certainty out nothingness, and perhaps it explains the authors’ implied claim that the crummiest morsel of evidence on one side is sufficient to compel a verdict, if the other side has no morsels at all. Here is how the authors put their claim to the Canadian judges:

Before each study is conducted (that is, a priori), the hypothesis is as likely to be true as it is to be false. Once the results are in, we can ask: How likely is it now that the hypothesis is true? In the first study, the low a priori inferential strength of the study design means that this probability will not be much different from the a priori value of 0.5 because any result will be rather equivocal owing to limitations in the experimental design.”

Manual at 64. This implied Bayesian slant, with 50% priors, in the world of science would lead anyone to believe “as many as six impossible things before breakfast,” and many more throughout the day.

Lest you think that the Manual is all rubbish, there are occasional gems of advice to the Canadian judges. The authors admonish the judges to

be wary of individual ‘statistically significant’ results that are mined from comparatively large numbers of trials or experiments, as the results may be ‘cherry picked’ from a larger set of experiments or studies that yielded mostly negative results. The court might ask the expert how many other trials or experiments testing the same hypothesis he or she is aware of, and to describe the outcome of those studies.”

Manual at 87. Good advice, but at odds with the authors’ characterization of statistical significance as establishing the rejection of the null hypothesis well-nigh beyond a reasonable doubt.

When Greenland first called attention to this Manual, I reached to some people who had been involved in its peer review. One reviewer told me that it was a “living document,” and would likely be revised after he had the chance to call the NJI’s attention to the errors. But two years later, the errors remain, and so we have to infer that the authors meant to say all the contradictory and false statements that are still present in the downloadable version of the Manual.


2 SeeWOE-fully Inadequate Methodology – An Ipse Dixit By Another Name” (May 1, 2012); “Weight of the Evidence in Science and in Law” (July 29, 2017); see also David E. Bernstein, “The Misbegotten Judicial Resistance to the Daubert Revolution,” 89 Notre Dame L. Rev. 27 (2013).

High, Low and Right-Sided Colonics – Ridding the Courts of Junk Science

July 16th, 2016

Not surprisingly, many of Selikoff’s litigation- and regulatory-driven opinions have not fared well, such as the notions that asbestos causes gastrointestinal cancers and that all asbestos minerals have equal potential and strength to cause mesothelioma.  Forty years after Selikoff testified in litigation that occupational asbestos exposure caused an insulator’s colorectal cancer, the Institute of Medicine reviewed the extant evidence and announced that the evidence was  “suggestive but not sufficient to infer a causal relationship between asbestos exposure and pharyngeal, stomach, and colorectal cancers.” Jonathan Samet, et al., eds., Institute of Medicine Review of Asbestos: Selected Cancers (2006).[1] The Institute of Medicine’s monograph has fostered a more circumspect approach in some of the federal agencies.  The National Cancer Institute’s website now proclaims that the evidence is insufficient to permit a conclusion that asbestos causes non-pulmonary cancers of gastrointestinal tract and throat.[2]

As discussed elsewhere, Selikoff testified as early as 1966 that asbestos causes colorectal cancer, in advance of any meaningful evidence to support such an opinion, and then he, and his protégées, worked hard to lace the scientific literature with their pronouncements on the subject, without disclosing their financial, political, and positional conflicts of interest.[3]

With plaintiffs’ firm’s (Lanier) zealous pursuit of bias information from the University of Idaho, in the LoGuidice case, what are we to make of Selikoff’s and his minions’ dubious ethics of failed disclosure. Do Selikoff and Mount Sinai receive a pass because their asbestos research predated the discovery of ethics? The “Lobby” (as the late Douglas Liddell called Selikoff and his associates)[4] has seriously distorted truth-finding in any number of litigations, but nowhere are the Lobby’s distortions more at work than in lawsuits for claimed asbestos injuries. Here the conflicts of interests truly have had a deleterious effect on the quality of civil justice. As we saw with the Selikoff exceptionalism displayed by the New York Supreme Court in reviewing third-party subpoenas,[5] some courts seem bent on ignoring evidence-based analyses in favor of Mount Sinai faith-based initiatives.

Current Asbestos Litigation Claims Involving Colorectal Cancer

Although Selikoff has passed from the litigation scene, his trainees and followers have lined up at the courthouse door to propagate his opinions. Even before the IOM’s 2006 monograph, more sophisticated epidemiologists consistently rejected the Selikoff conclusion on asbestos and colon cancer, which grew out of Selikoff’s litigation activities.[6] And yet, the minions keep coming.

In the pre-Daubert era, defendants lacked an evidentiary challenge to the Selikoff’s opinion that asbestos caused colorectal cancer. Instead of contesting the legal validity or sufficiency of the plaintiffs’ general causation claims, defendants often focused on the unreliability of the causal attribution for the specific claimant’s disease. These early cases are often misunderstood to be challenges to expert witnesses’ opinions about whether asbestos causes colorectal cancer; they were not.[7]

Of course, after the IOM’s 2006 monograph, active expert witness gatekeeping should eliminate asbestos gastrointestinal cancer claims, but sadly they persist. Perhaps, courts simply considered the issue “grandfathered” in from the era in which judicial scrutiny of expert witness opinion testimony was restricted. Perhaps, defense counsel are failing to frame and support their challenges properly.  Perhaps both.

Arthur Frank Jumps the Gate

Although ostensibly a “Frye” state, Pennsylvania judges have, when moved by the occasion, to apply a fairly thorough analysis of proffered expert witness opinion.[8] On occasion, Pennsylvania judges have excluded unreliably or invalidly supported causation opinions, under the Pennsylvania version of the Frye standard. A recent case, however, tried before a Workman’s Compensation Judge (WCJ), and appealed to the Commonwealth Court, shows how inconsistent the application of the standard can be, especially when Selikoff’s legacy views are at issue.

Michael Piatetsky, an architect, died of colorectal cancer. Before his death, he and his wife filed a worker’s compensation claim, in which they alleged that his disease was caused by his workplace exposure to asbestos. Garrison Architects v. Workers’ Comp. Appeal Bd. (Piatetsky), No. 1095 C.D. 2015, Pa. Cmwlth. Ct., 2016 Pa. Commw. Unpub. LEXIS 72 (Jan. 22, 2016) [cited as Piatetsky]. Mr. Piatetsky was an architect, almost certainly knowledgeable about asbestos hazards generally.  Despite his knowledge, Piatetsky eschewed personal protective equipment even when working at dusty work sites well marked with warnings. Although he had engaged in culpable conduct, the employer in worker compensation proceedings does not have ordinary negligence defenses, such as contributory negligence or assumption of risk.

In litigating the Piatetsky’s claim, the employer dragged its feet and failed to name an expert witness.  Eventually, after many requests for continuances, the Workers’ Compensation Judge barred the employer from presenting an expert witness. With the record closed, and without an expert witness, the Judge understandably ruled in favor of the claimant.

The employer, sans expert witness, had to confront claimant’s expert witness, Arthur L. Frank, a minion of Selikoff and a frequent testifier in asbestos and many other litigations. Frank, of course, opined that asbestos causes colon cancer and that it caused Mr. Piatetsky’s cancer. Mr. Piatetsky’s colon cancer originated on the right side of his colon. Dr. Frank thus emphasized that asbestos causes colon cancer in all locations, but especially on the right side in view of one study’s having concluded “that colon cancer caused by asbestos is more likely to begin on the right side.” Piatetsky at *6.

On appeal, the employer sought relief on several issues, but the only one of interest here is the employer’s argument “that Claimant’s medical expert based his opinion on flimsy medical studies.” Piatetsky at *10. The employer’s appeal seemed to go off the rails with the insistence that the Claimant’s medical opinion was invalid because Dr. Frank relied upon studies not involving architects. Piatetsky at *14. The Commonwealth Court was able to point to testimony, although probably exaggerated, which suggested that Mr. Piatetsky had been heavily exposed, at least at times, and thus his exposure was similar to that in the studies cited by Frank.

With respect to Frank’s right-sided (non-sinister) opinion, the Commonwealth Court framed the employer’s issue as a contention that Dr. Frank’s opinion on the asbestos-relatedness of right-sided colon cancer was “not universally accepted.” But universal acceptance has never been the test or standard for the rejection or acceptance of expert witness opinion testimony in any state.  Either the employer badly framed its appeal, or the appellate court badly misstated the employer’s ground for relief. In any event, the Commonwealth Court never addressed the relevant legal standard in its discussion.

The Claimant argued that the hearing Judge had found that Frank’s opinion was based on “numerous studies.” Piatetsky at *15. None of these studies is cited to permit the public to assess the argument and the Court’s acceptance of it. The appellate court made inappropriately short work of this appellate issue by confusing general and specific causation, and invoking Mr. Piatetsky’s age, his lack of family history of colon cancer, Frank’s review of medical records, testimony, and work records, as warranting Frank’s causal inference. None of these factors is relevant to general causation, and none is probative of the specific causation claim.  Many if not most colon cancers have no identifiable risk factor, and Dr. Frank had no way to rule out baseline risk, even if there were an increased risk from asbestos exposure. Piatetsky at *16. With no defense expert witness, the employer certainly had a difficult appellate journey. It is hard for the reader of the Commonwealth Court’s opinion to determine whether the case was poorly defended, poorly briefed on appeal, or poorly described by the appellate judges.

In any event, the right-sided ruse of Arthur Frank went unreprimanded.  Intellectual due process might have led the appellate court to cite the article at issue, but it failed to do so.  It is interesting and curious to see how the appellate court gave a detailed recitation of the controverted facts of asbestos exposure, while how glib the court was when describing the scientific issues and evidence.  Nonetheless, the article referenced vaguely, which went uncited by the appellate court, was no doubt the paper:  K. Jakobsson, M. Albin & L. Hagmar, “Asbestos, cement, and cancer in the right part of the colon,” 51 Occup. & Envt’l Med. 95 (1994).

These authors 24 observed versus 9.63 expected right-sided colon cancers, and they concluded that there was an increased rate of right-sided colon cancer in the asbestos cement plant workers.  Notably the authors’ reference population had a curiously low rate of right-sided colon cancer.  For left-sided colon cancer, the authors 9.3 expected cases but observed only 5 cases in the asbestos-cement cohort.  Contrary to Frank’s suggestion, the authors did not conclude that right-sided colon cancers had been caused by asbestos; indeed, the authors never reached any conclusion whether asbestos causes colorectal  cancer under any circumstances.  In their discussion, these authors noted that “[d]espite numerous epidemiological and experimental studies, there is no consensus concerning exposure to asbestos and risks of gastrointestinal cancer.” Jakobsson at 99; see also Dorsett D. Smith, “Does Asbestos Cause Additional Malignancies Other than Lung Cancer,” chap. 11, in Dorsett D. Smith, The Health Effects of Asbestos: An Evidence-based Approach 143, 154 (2015). Even this casual description of the Jakobsson study will awake the learned reader to the multiple comparisons that went on in this cohort study, with outcomes reported for left, right, rectum, and multiple sites, without any adjustment to the level of significance.  Risk of right-sided colon cancer was not a pre-specified outcome of the study, and the results of subsequent studies have never corroborated this small cohort study.

A sane understanding of subgroup analyses is important to judicial gatekeeping. SeeSub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test” (May 17, 2011).  The chapter on statistics in the Reference Manual for Scientific Evidence (3d ed. 2011) has some prudent caveats for multiple comparisons and testing, but neither the chapter on epidemiology, nor the chapter on clinical medicine[9], provides any sense of the dangers of over-interpreting subgroup analyses.

Some commentators have argued that we must not dissuade scientists from doing subgroup analysis, but the issue is not whether they should be done, but how they should be interpreted.[10] Certainly many authors have called for caution in how subgroup analyses are interpreted[11], but apparently Expert Witness Arthur Frank, did not receive the memo, before testifying in the Piatetsky case, and the Commonwealth Court did not before deciding this case.


[1] As good as the IOM process can be on occasion, even its reviews are sometimes less than thorough. The asbestos monograph gave no consideration to alcohol in the causation of laryngeal cancer, and no consideration to smoking in its analysis of asbestos and colorectal cancer. See, e.g., Peter S. Liang, Ting-Yi Chen & Edward Giovannucci, “Cigarette smoking and colorectal cancer incidence and mortality: Systematic review and meta-analysis,” 124 Internat’l J. Cancer 2406, 2410 (2009) (“Our results indicate that both past and current smokers have an increased risk of [colorectal cancer] incidence and mortality. Significantly increased risk was found for current smokers in terms of mortality (RR 5 1.40), former smokers in terms of incidence (RR 5 1.25)”); Lindsay M. Hannan, Eric J. Jacobs and Michael J. Thun, “The Association between Cigarette Smoking and Risk of Colorectal Cancer in a Large Prospective Cohort from the United States,” 18 Cancer Epidemiol., Biomarkers & Prevention 3362 (2009).

[2] National Cancer Institute, “Asbestos Exposure and Cancer Risk” (last visited July 10, 2016) (“In addition to lung cancer and mesothelioma, some studies have suggested an association between asbestos exposure and gastrointestinal and colorectal cancers, as well as an elevated risk for cancers of the throat, kidney, esophagus, and gallbladder (3, 4). However, the evidence is inconclusive.”).

[3] Compare “Health Hazard Progress Notes: Compensation Advance Made in New York State,” 16(5) Asbestos Worker 13 (May 1966) (thanking Selikoff for testifying in a colon cancer case) with, Irving J. Selikoff, “Epidemiology of gastrointestinal cancer,” 9 Envt’l Health Persp. 299 (1974) (arguing for his causal conclusion between asbestos and all gastrointestinal cancers, with no acknowledgment of his role in litigation or his funding from the asbestos insulators’ union).

[4] F.D.K. Liddell, “Magic, Menace, Myth and Malice,” 41 Ann. Occup. Hyg. 3, 3 (1997); see alsoThe Lobby Lives – Lobbyists Attack IARC for Conducting Scientific Research” (Feb. 19, 2013).

[5]

SeeThe LoGiudice Inquisitiorial Subpoena & Its Antecedents in N.Y. Law” (July 14, 2016).

[6] See, e.g., Richard Doll & Julian Peto, Asbestos: Effects on health of exposure to asbestos 8 (1985) (“In particular, there are no grounds for believing that gastrointestinal cancers in general are peculiarly likely to be caused by asbestos exposure.”).

[7] See Landrigan v. The Celotex Corporation, Revisited” (June 4, 2013); Landrigan v. The Celotex Corp., 127 N.J. 404, 605 A.2d 1079 (1992); Caterinicchio v. Pittsburgh Corning Corp., 127 NJ. 428, 605 A.2d 1092 (1992). In both Landrigan and Caterinicchio, there had been no challenge to the reliability or validity of the plaintiffs’ expert witnesses’ general causation opinions. Instead, the trial courts entered judgments, assuming arguendo that asbestos can cause colorectal cancer (a dubious proposition), on the ground that the low relative risk cited by plaintiffs’ expert witnesses (about 1.5) was factually insufficient to support a verdict for plaintiffs on specific causation.  Indeed, the relative risk suggested that the odds were about 2 to 1 in defendants’ favor that the plaintiffs’ colorectal cancers were not caused by asbestos.

[8] See, e.g., Porter v. Smithkline Beecham Corp., Sept. Term 2007, No. 03275. 2016 WL 614572 (Phila. Cty. Com. Pleas, Oct. 5, 2015); “Demonstration of Frye Gatekeeping in Pennsylvania Birth Defects Case” (Oct. 6, 2015).

[9] John B. Wong, Lawrence O. Gostin & Oscar A. Cabrera, “Reference Guide on Medical Testimony,” in Reference Manual for Scientific Evidence 687 (3d ed. 2011).

[10] See, e.g., Phillip I. Good & James W. Hardin, Common Errors in Statistics (and How to Avoid Them) 13 (2003) (proclaiming a scientists’ Bill of Rights under which they should be allowed to conduct subgroup analyses); Ralph I. Horwitz, Burton H. Singer, Robert W. Makuch, Catherine M. Viscoli, “Clinical versus statistical considerations in the design and analysis of clinical research,” 51 J. Clin. Epidemiol. 305 (1998) (arguing for the value of subgroup analyses). In United States v. Harkonen, the federal government prosecuted a scientist for fraud in sending a telecopy that described a clinical trial as “demonstrating” a benefit in a subgroup of a secondary trial outcome.  Remarkably, in the Harkonen case, the author, and criminal defendant, was describing a result in a pre-specified outcome, in a plausible but post-hoc subgroup, which result accorded with prior clinical trials and experimental evidence. United States v. Harkonen (D. Calif. 2009); United States v. Harkonen (D. Calif. 2010) (post-trial motions), aff’d, 510 F. App’x 633 (9th Cir. 2013) (unpublished), cert. denied, 134 S. Ct. 824, ___ U.S. ___ (2014); Brief by Scientists And Academics as Amici Curiae In Support Of Petitioner, On Petition For Writ Of Certiorari in the Supreme Court of the United States, W. Scott Harkonen v. United States, No. 13-180 (filed Sept. 4, 2013).

[11] SeeSub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test” (May 17, 2011) (collecting commentary); see also Lemuel A. Moyé, Statistical Reasoning in Medicine:  The Intuitive P-Value Primer 206, 225 (2d ed. 2006) (noting that subgroup analyses are often misleading: “Fishing expeditions for significance commonly catch only the junk of sampling error”); Victor M. Montori, Roman Jaeschke, Holger J. Schünemann, Mohit Bhandari, Jan L Brozek, P. J. Devereaux & Gordon H Guyatt, “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004) (“Beware subgroup analysis”); Susan F. Assmann, Stuart J. Pocock, Laura E. Enos, Linda E. Kasten, “Subgroup analysis and other (mis)uses) of baseline data in clinical trials,” 355 Lancet 1064 (2000); George Davey Smith & Mathias Egger, “Commentary: Incommunicable knowledge? Interpreting and applying the results of clinical trials and meta-analyses,” 51 J. Clin. Epidemiol. 289 (1998) (arguing against post-hoc hypothesis testing); Douglas G. Altman, “Statistical reviewing for medical journals,” 17 Stat. Med. 2662 (1998); Douglas G. Altman, “Commentary:  Within trial variation – A false trail?” 51 J. Clin. Epidemiol. 301 (1998) (noting that observed associations are expected to vary across subgroup because of random variability); Christopher Bulpitt, “Subgroup Analysis,” 2 Lancet: 31 (1988).

Judicial Control of the Rate of Error in Expert Witness Testimony

May 28th, 2015

In Daubert, the Supreme Court set out several criteria or factors for evaluating the “reliability” of expert witness opinion testimony. The third factor in the Court’s enumeration was whether the trial court had considered “the known or potential rate of error” in assessing the scientific reliability of the proffered expert witness’s opinion. Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 593 (1993). The Court, speaking through Justice Blackmun, failed to provide much guidance on the nature of the errors subject to gatekeeping, on how to quantify the errors, and on to know how much error was too much. Rather than provide a taxonomy of error, the Court lumped “accuracy, validity, and reliability” together with a grand pronouncement that these measures were distinguished by no more than a “hen’s kick.” Id. at 590 n.9 (1993) (citing and quoting James E. Starrs, “Frye v. United States Restructured and Revitalized: A Proposal to Amend Federal Evidence Rule 702,” 26 Jurimetrics J. 249, 256 (1986)).

The Supreme Court’s failure to elucidate its “rate of error” factor has caused a great deal of mischief in the lower courts. In practice, trial courts have rejected engineering opinions on stated grounds of their lacking an error rate as a way of noting that the opinions were bereft of experimental and empirical evidential support[1]. For polygraph evidence, courts have used the error rate factor to obscure their policy prejudices against polygraphs, and to exclude test data even when the error rate is known, and rather low compared to what passes for expert witness opinion testimony in many other fields[2]. In the context of forensic evidence, the courts have rebuffed objections to random-match probabilities that would require that such probabilities be modified by the probability of laboratory or other error[3].

When it comes to epidemiologic and other studies that require statistical analyses, lawyers on both sides of the “v” frequently misunderstand p-values or confidence intervals to provide complete measures of error, and ignore the larger errors that result from bias, confounding, study validity (internal and external), inappropriate data synthesis, and the like[4]. Not surprisingly, parties fallaciously argue that the Daubert criterion of “rate of error” is satisfied by expert witness’s reliance upon studies that in turn use conventional 95% confidence intervals and measures of statistical significance in p-values below 0.05[5].

The lawyers who embrace confidence intervals and p-values as their sole measure of error rate fail to recognize that confidence intervals and p-values are means of assessing only one kind of error: random sampling error. Given the carelessness of the Supreme Court’s use of technical terms in Daubert, and its failure to engage in the actual evidence at issue in the case, it is difficult to know whether the Court intended to suggest that random error was the error rate it had in mind[6]. The statistics chapter in the Reference Manual on Scientific Evidence helpfully points out that the inferences that can be drawn from data turn on p-values and confidence intervals, as well as on study design, data quality, and the presence or absence of systematic errors, such as bias or confounding.  Reference Manual on Scientific Evidence at 240 (3d 2011) [Manual]. Random errors are reflected in the size of p-values or the width of confidence intervals, but these measures of random sampling error ignore systematic errors such as confounding and study biases. Id. at 249 & n.96.

The Manual’s chapter on epidemiology takes an even stronger stance: the p-value for a given study does not provide a rate of error or even a probability of error for an epidemiologic study:

“Epidemiology, however, unlike some other methodologies—fingerprint identification, for example—does not permit an assessment of its accuracy by testing with a known reference standard. A p-value provides information only about the plausibility of random error given the study result, but the true relationship between agent and outcome remains unknown. Moreover, a p-value provides no information about whether other sources of error – bias and confounding – exist and, if so, their magnitude. In short, for epidemiology, there is no way to determine a rate of error.”

Manual at 575. This stance seems not entirely justified given that there are Bayesian approaches that would produce credibility intervals accounting for sampling and systematic biases. To be sure, such approaches have their own problems and they have received little to no attention in courtroom proceedings to date.

The authors of the Manual’s epidemiology chapter, who are usually forgiving of judicial error in interpreting epidemiologic studies, point to one United States Court of Appeals case that fallaciously interpreted confidence intervals magically to quantify bias and confounding in a Bendectin birth defects case. Id. at 575 n. 96[7]. The Manual could have gone further to point out that, in the context of multiple studies, of different designs and analyses, cognitive biases involved in evaluating, assessing, and synthesizing the studies are also ignored by statistical measures such as p-values and confidence intervals. Although the Manual notes that assessing the role of chance in producing a particular set of sample data is “often viewed as essential when making inferences from data,” the Manual never suggests that random sampling error is the only kind of error that must be assessed when interpreting data. The Daubert criterion would appear to encompass all varieties or error, not just random error.

The Manual’s suggestion that epidemiology does not permit an assessment of the accuracy of epidemiologic findings misrepresents the capabilities of modern epidemiologic methods. Courts can, and do, invoke gatekeeping approaches to weed out confounded study findings. SeeSorting Out Confounded Research – Required by Rule 702” (June 10, 2012). The “reverse Cornfield inequality” was an important analysis that helped establish the causal connection between tobacco smoke and lung cancer[8]. Olav Axelson studied and quantified the role of smoking as a confounder in epidemiologic analyses of other putative lung carcinogens.[9] Quantitative methods for identifying confounders have been widely deployed[10].

A recent study in birth defects epidemiology demonstrates the power of sibling cohorts in addressing the problem of residual confounding from observational population studies with limited information about confounding variables. Researchers looking at various birth defect outcomes among offspring of women who used certain antidepressants in early pregnancy generally found no associations in pooled data from Iceland, Norway, Sweden, Finland, and Denmark. A putative association between maternal antidepressant use and a specific kind of cardiac defect (right ventricular outflow tract obstruction or RVOTO) did appear in the overall analysis, but was reversed when the analysis was limited to the sibling subcohort. The study found an apparent association between RVOTO defects and first trimester maternal exposure to selective serotonin reuptake inhibitors, with an adjusted odds ratio of 1.48 (95% C.I., 1.15, 1.89). In the adjusted analysis for siblings, the study found an OR of 0.56 (95% C.I., 0.21, 1.49) in an adjusted sibling analysis[11]. This study and many others show how creative analyses can elucidate and quantify the direction and magnitude of confounding effects in observational epidemiology.

Systematic bias has also begun to succumb to more quantitative approaches. A recent guidance paper by well-known authors encourages the use of quantitative bias analysis to provide estimates of uncertainty due to systematic errors[12].

Although the courts have failed to articulate the nature and consequences of erroneous inference, some authors would reduce all of Rule 702 (and perhaps 704, 403 as well) to a requirement that proffered expert witnesses “account” for the known and potential errors in their opinions:

“If an expert can account for the measurement error, the random error, and the systematic error in his evidence, then he ought to be permitted to testify. On the other hand, if he should fail to account for any one or more of these three types of error, then his testimony ought not be admitted.”

Mark Haug & Emily Baird, “Finding the Error in Daubert,” 62 Hastings L.J. 737, 739 (2011).

Like most antic proposals to revise Rule 702, this reform vision shuts out the full range of Rule 702’s remedial scope. Scientists certainly try to identify potential sources of error, but they are not necessarily very good at it. See Richard Horton, “Offline: What is medicine’s 5 sigma?” 385 Lancet 1380 (2015) (“much of the scientific literature, perhaps half, may simply be untrue”). And as Holmes pointed out[13], certitude is not certainty, and expert witnesses are not likely to be good judges of their own inferential errors[14]. Courts continue to say and do wildly inconsistent things in the course of gatekeeping. Compare In re Zoloft (Setraline Hydrochloride) Products, 26 F. Supp. 3d 449, 452 (E.D. Pa. 2014) (excluding expert witness) (“The experts must use good grounds to reach their conclusions, but not necessarily the best grounds or unflawed methods.”), with Gutierrez v. Johnson & Johnson, 2006 WL 3246605, at *2 (D.N.J. November 6, 2006) (denying motions to exclude expert witnesses) (“The Daubert inquiry was designed to shield the fact finder from flawed evidence.”).


[1] See, e.g., Rabozzi v. Bombardier, Inc., No. 5:03-CV-1397 (NAM/DEP), 2007 U.S. Dist. LEXIS 21724, at *7, *8, *20 (N.D.N.Y. Mar. 27, 2007) (excluding testimony from civil engineer about boat design, in part because witness failed to provide rate of error); Sorto-Romero v. Delta Int’l Mach. Corp., No. 05-CV-5172 (SJF) (AKT), 2007 U.S. Dist. LEXIS 71588, at *22–23 (E.D.N.Y. Sept. 24, 2007) (excluding engineering opinion that defective wood-carving tool caused injury because of lack of error rate); Phillips v. Raymond Corp., 364 F. Supp. 2d 730, 732–33 (N.D. Ill. 2005) (excluding biomechanics expert witness who had not reliably tested his claims in a way to produce an accurate rate of error); Roane v. Greenwich Swim Comm., 330 F. Supp. 2d 306, 309, 319 (S.D.N.Y. 2004) (excluding mechanical engineer, in part because witness failed to provide rate of error); Nook v. Long Island R.R., 190 F. Supp. 2d 639, 641–42 (S.D.N.Y. 2002) (excluding industrial hygienist’s opinion in part because witness was unable to provide a known rate of error).

[2] See, e.g., United States v. Microtek Int’l Dev. Sys. Div., Inc., No. 99-298-KI, 2000 U.S. Dist. LEXIS 2771, at *2, *10–13, *15 (D. Or. Mar. 10, 2000) (excluding polygraph data based upon showing that claimed error rate came from highly controlled situations, and that “real world” situations led to much higher error (10%) false positive error rates); Meyers v. Arcudi, 947 F. Supp. 581 (D. Conn. 1996) (excluding polygraph in civil action).

[3] See, e.g., United States v. Ewell, 252 F. Supp. 2d 104, 113–14 (D.N.J. 2003) (rejecting defendant’s objection to government’s failure to quantify laboratory error rate); United States v. Shea, 957 F. Supp. 331, 334–45 (D.N.H. 1997) (rejecting objection to government witness’s providing separate match and error probability rates).

[4] For a typical judicial misstatement, see In re Zoloft Products, 26 F. Supp.3d 449, 454 (E.D. Pa. 2014) (“A 95% confidence interval means that there is a 95% chance that the ‘‘true’’ ratio value falls within the confidence interval range.”).

[5] From my experience, this fallacious argument is advanced by both plaintiffs’ and defendants’ counsel and expert witnesses. See also Mark Haug & Emily Baird, “Finding the Error in Daubert,” 62 Hastings L.J. 737, 751 & n.72 (2011).

[6] See David L. Faigman, et al. eds., Modern Scientific Evidence: The Law and Science of Expert Testimony § 6:36, at 359 (2007–08) (“it is easy to mistake the p-value for the probability that there is no difference”)

[7] Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989), modified, 884 F.2d 166 (5th Cir. 1989), cert. denied, 494 U.S. 1046 (1990). As with any error of this sort, there is always the question whether the judges were entrapped by the parties or their expert witnesses, or whether the judges came up with the fallacy on their own.

[8] See Joel B Greenhouse, “Commentary: Cornfield, Epidemiology and Causality,” 38 Internat’l J. Epidem. 1199 (2009).

[9] Olav Axelson & Kyle Steenland, “Indirect methods of assessing the effects of tobacco use in occupational studies,” 13 Am. J. Indus. Med. 105 (1988); Olav Axelson, “Confounding from smoking in occupational epidemiology,” 46 Brit. J. Indus. Med. 505 (1989); Olav Axelson, “Aspects on confounding in occupational health epidemiology,” 4 Scand. J. Work Envt’l Health 85 (1978).

[10] See, e.g., David Kriebel, Ariana Zeka1, Ellen A Eisen, and David H. Wegman, “Quantitative evaluation of the effects of uncontrolled confounding by alcohol and tobacco in occupational cancer studies,” 33 Internat’l J. Epidem. 1040 (2004).

[11] Kari Furu, Helle Kieler, Bengt Haglund, Anders Engeland, Randi Selmer, Olof Stephansson, Unnur Anna Valdimarsdottir, Helga Zoega, Miia Artama, Mika Gissler, Heli Malm, and Mette Nørgaard, “Selective serotonin reuptake inhibitors and ventafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design,” 350 Brit. Med. J. 1798 (2015).

[12] Timothy L.. Lash, Matthew P. Fox, Richard F. MacLehose, George Maldonado, Lawrence C. McCandless, and Sander Greenland, “Good practices for quantitative bias analysis,” 43 Internat’l J. Epidem. 1969 (2014).

[13] Oliver Wendell Holmes, Jr., Collected Legal Papers at 311 (1920) (“Certitude is not the test of certainty. We have been cock-sure of many things that were not so.”).

[14] See, e.g., Daniel Kahneman & Amos Tversky, “Judgment under Uncertainty:  Heuristics and Biases,” 185 Science 1124 (1974).

ALI Reporters Are Snookered by Racette Fallacy

April 27th, 2015

In the Reference Manual on Scientific Evidence, the authors of the epidemiology chapter advance instances of acceleration of onset of disease as an example of a situation in which reliance upon doubling of risk will not provide a reliable probability of causation calculation[1]. In a previous post, I suggested that the authors’ assertion may be unfounded. SeeReference Manual on Scientific Evidence on Relative Risk Greater Than Two For Specific Causation Inference” (April 25, 2014). Several epidemiologic methods would permit the calculation of relative risk within specific time windows from first exposure.

The American Law Institute (ALI) Reporters, for the Restatement of Torts, make similar claims.[2] First, the Reporters, citing the Manual’s second edition, repeat the Manual’s claim that:

 “Epidemiologists, however, do not seek to understand causation at the individual level and do not use incidence rates in group to studies to determine the cause of an individual’s disease.”

American Law Institute, Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28(a) cmt. c(4) & rptrs. notes (2010) [Comment c(4)]. In making this claim, the Reporters ignore an extensive body of epidemiologic studies on genetic associations and on biomarkers, which do address causation implicitly or explicitly, on an individual level.

The Reporters also repeat the Manual’s doubtful claim that acceleration of onset of disease prevents an assessment of attributable risk, although they acknowledge that an average earlier age of onset would form the basis of damages calculations rather than calculations for damages for an injury that would not have occurred but for the tortious exposure. Comment c(4). The Reporters go a step further than the Manual, however, and provide an example of the acceleration-of-onset studies that they have in mind:

“For studies whose results suggest acceleration, see Brad A. Racette, Welding-Related Parkinsonism: Clinical Features, Treatments, and Pathophysiology,” 56 Neurology 8, 12 (2001) (stating that authors “believe that welding acts as an accelerant to cause [Parkinson’s Disease]… .”

The citation to Racette’s 2001 paper[3] is curious, interesting, disturbing, and perhaps revealing. In this 2001 paper, Racette misrepresented the type of study he claimed to have done, and the inferences he drew from his case series are invalid. Any one experienced in the field of epidemiology would have dismissed this study, its conclusions, and its suggested relation between welding and parkinsonism.

Dr. Brad A. Racette teaches and practices neurology at Washington University in St. Louis, across the river from a hotbed of mass tort litigation, Madison County, Illinois. In the 1990s, Racette received referrals from plaintiffs’ attorneys to evaluate their clients in litigation over exposure to welding fumes. Plaintiffs were claiming that their occupational exposures caused them to develop manganism, a distinctive parkinsonism that differs from Parkinson’s disease [PD], but has signs and symptoms that might be confused with PD by unsophisticated physicians unfamiliar with both manganism and PD.

After the publication of his 2001 paper, Racette became the darling of felon Dicky Scruggs and other plaintiffs’ lawyers. The litigation industrialists invited Racette and his team down to Alabama and Mississippi, to conduct screenings of welding tradesmen, recruited by Scruggs and his team, for potential lawsuits for PD and parkinsonism. The result was a paper that helped Scruggs propel a litigation assault against the welding industry.[4]

Racette’s 2001 paper was accompanied by a press release, as have many of his papers, in which he was quoted as stating that “[m]anganism is a very different disease” from PD. Gila Reckess, “Welding, Parkinson’s link suspected” (Feb. 9, 2001)[5].

Racette’s 2001 paper provoked a strongly worded letter that called Racette and his colleagues out for misrepresenting the nature of their work:

“The authors describe their work as a case–control study. Racette et al. ascertained welders with parkinsonism and compared their concurrent clinical features to those of subjects with PD. This is more consistent with a cross-sectional design, as the disease state and factors of interest were ascertained simultaneously. Cross-sectional studies are descriptive and therefore cannot be used to infer causation.”

*****

“The data reported by Racette et al. do not necessarily support any inference about welding as a risk factor in PD. A cohort study would be the best way to evaluate the role of welding in PD.”

Bernard Ravina, Andrew Siderowf, John Farrar, Howard Hurtig, “Welding-related parkinsonism: Clinical features, treatment, and pathophysiology,” 57 Neurology 936, 936 (2001).

As we will see, Dr. Ravina and his colleagues were charitable to suggest that the study was more compatible with a cross-sectional study. Racette had set out to determine “whether welding-related parkinsonism differs from idiopathic PD.” He claimed that he had “performed a case-control study,” with a case group of welders and two control groups. His inferences drawn from his “data” are, however, fallacious because he employed an invalid study design.

In reality, Racette’s paper was nothing more than a chart review, a case series of 15 “welders” in the context of a movement disorder clinic. After his clinical and radiographic evaluation, Racette found that these 15 cases were clinically indistinguishable from PD, and thus unlike manganism. Racette did not reveal whether any of these 15 welders had been referred by plaintiffs’ counsel; nor did he suggest that these welding tradesmen made up a disproportionate number of his patient base in St. Louis, Missouri.

Racette compared his selected 15 career welders with PD to his general movement disorders clinic patient population, for comparison. From the patient population, Racette deployed two “control” groups, one matched for age and sex with the 15 welders, and the other group not matched. The America Law Institute reporters are indeed correct that Racette suggested that the average age of onset for these 15 welders was lower than that for his non-welder patients, but their uncritical embrace overlooked the fact that Racette’s suggestion does not support his claimed inference that in welders, therefore, “welding exposure acts as an accelerant to cause PD.”

Racette’s claimed inference is remarkable because he did not perform an analytical epidemiologic study that was capable of generating causal inferences. His paper incongruously presents odds ratios, although the controls have PD, the disease of interest, which invalidates any analytical inference from his case series. Given the referral and selection biases inherent in tertiary-care specialty practices, this paper can provide no reliable inferences about associations or differences in ages of onset. Even within the confines of a case series misrepresented to be a case-control study, Racette acknowledged that “[s]ubsequent comparisons of the welders with age-matched controls showed no significant differences.”

NOT A CASE-CONTROL STUDY

That Racette wrongly identified his paper as a case-control study is beyond debate. How the journal Neurology accepted the paper for publication is a mystery. The acceptance of the inference by the ALI Reporter, lawyers and judges, is regrettable.

Structurally, Racette’s paper could never quality as a case-control study, or any other analytical epidemiologic study. Here is how a leading textbook on case-control studies defines a case-control study:

“In a case-control study, individuals with a particular condition or disease (the cases) are selected for comparison with a series of individuals in whom the condition or disease is absent (the controls).”

James J. Schlesselman, Case-control Studies. Design, Conduct, Analysis at 14 (N.Y. 1982)[6].

Every patient in Racette’s paper, welders and non-welders, have the outcome of interest, PD. There is no epidemiologic study design that corresponds to what Racette did, and there is no way to draw any useful inference from Racette’s comparisons. Racette’s paper violates the key principle for a proper case-control study; namely, all subjects must be selected independently of the study exposure that is under investigation. Schlesselman stressed that that identifying an eligible case or control must not depend upon that person’s exposure status for any factor under consideration. Id. Racette’s 2001 paper deliberately violated this basic principle.

Racette’s study design, with only cases with the outcome of interest appearing in the analysis, recklessly obscures the underlying association between the exposure (welding) and age in the population. We would, of course, expect self-identified welders to be younger than the average Parkinson’s disease patient because welding is physical work that requires good health. An equally fallacious study could be cobbled together to “show” that the age-of-onset of Parkinson’s disease for sitcom actors (such as Michael J. Fox) is lower than the age-of-onset of Parkinson’s disease for Popes (such as John Paul II). Sitcom actors are generally younger as a group than Popes. Comparing age of onset between disparate groups that have different age distributions generates a biased comparison and an erroneous inference.

The invalidity and fallaciousness of Racette’s approach to studying the age-of-onset issue of PD in welders, and his uncritical inferences, have been extensively commented upon in the general epidemiologic literature. For instance, in studies that compared the age at death for left-handed versus right-handed person, studies reported an observed nine-year earlier death for left handers, leading to (unfounded) speculation that earlier mortality resulted from birth and life stressors and accidents for left handers, living in a world designed to accommodate right-handed person[7]. The inference has been shown to be fallacious and the result of social pressure in the early twentieth century to push left handers to use their right hands, a prejudicial practice that abated over the decades of the last century. Left handers born later in the century were less likely to be “switched,” as opposed to those persons born earlier and now dying, who were less likely to be classified as left-handed, as a result of a birth-cohort effect[8]. When proper prospective cohort studies were conducted, valid data showed that left-handers and right-handers have equivalent mortality rates[9].

Epidemiologist Ken Rothman addressed the fallacy of Racette’s paper at some length in one of his books:

“Suppose we study two groups of people and look at the average age at death among those who die. In group A, the average age of death is 4 years; in group B, it is 28 years. Can we say that being a member of group A is riskier than being a member of group B? We cannot… . Suppose that group A comprises nursery school students and group B comprises military commandos. It would be no surprise that the average age at death of people who are currently military commandos is 28 years or that the average age of people who are currently nursery students is 4 years. …

In a study of factory workers, an investigator inferred that the factory work was dangerous because the average age of onset of a particular kind of cancer was lower in these workers than among the general population. But just as for the nursery school students and military commandos, if these workers were young, the cancers that occurred among them would have to be occurring in young people. Furthermore, the age of onset of a disease does not take into account what proportion of people get the disease.

These examples reflect the fallacy of comparing the average age at which death or disease strikes rather than comparing the risk of death between groups of the same age.”

Kenneth J. Rothman, “Introduction to Epidemiologic Thinking,” in Epidemiology: An Introduction at 5-6 (N.Y. 2002).

And here is how another author of Modern Epidemiology[10] addressed the Racette fallacy in a different context involving PD:

“Valid studies of age-at-onset require no underlying association between the risk factor and aging or birth cohort in the source population. They must also consider whether a sufficient induction time has passed for the risk factor to have an effect. When these criteria and others cannot be satisfied, age-specific or standardized risks or rates, or a population-based case-control design, must be used to study the association between the risk factor and outcome. These designs allow the investigator to disaggregate the relation between aging and the prevalence of the risk factor, using familiar methods to control confounding in the design or analysis. When prior knowledge strongly suggests that the prevalence of the risk factor changes with age in the source population, case-only studies may support a relation between the risk factor and age-at-onset, regardless of whether the inference is justified.”

Jemma B. Wilk & Timothy L. Lash, “Risk factor studies of age-at-onset in a sample ascertained for Parkinson disease affected sibling pairs: a cautionary tale,” 4 Emerging Themes in Epidemiology 1 (2007) (internal citations omitted) (emphasis added).

A properly designed epidemiologic study would have avoided Racette’s fallacy. A relevant cohort study would have enrolled welders in the study at the outset of their careers, and would have continued to follow them even if they changed occupations. A case-control study would have enrolled cases with PD and controls without PD (or more broadly, parkinsonism), with cases and controls selected independently of their exposure to welding fumes. Either method would have determined the rate of PD in both groups, absolutely or relatively. Racette’s paper, which completely lacked non-PD cases, could not have possibly accomplished his stated objectives, and it did not support his claims.

Racette’s questionable work provoked a mass tort litigation and ultimately federal Multi-District Litigation 1535.[11] Ultimately, analytical epidemiologic studies consistently showed no association between welding and PD. A meta-analysis published in 2012 ended the debate[12] as a practical matter, and MDL 1535 is no more. How strange that the ALI reporters chose the Racette work as an example of their claims about acceleration of onset!


[1] Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, Reference Manual on Scientific Evidence 549, 614 (Wash., DC 3d ed. 2011).

[2] Michael D. Green was an ALI Reporter, and of course, an author of the chapter in the Reference Manual.

[3] Brad A. Racette, L. McGee-Minnich, S. M. Moerlein, J. W. Mink, T. O. Videen, and Joel S. Perlmutter, “Welding-related parkinsonism: clinical features, treatment, and pathophysiology,” 56 Neurology 8 (2001).

[4] See Brad A. Racette, S.D. Tabbal, D. Jennings, L. Good, Joel S. Perlmutter, and Brad Evanoff, “Prevalence of parkinsonism and relationship to exposure in a large sample of Alabama welders,” 64 Neurology 230 (2005); Brad A. Racette, et al., “A rapid method for mass screening for parkinsonism,” 27 Neurotoxicology 357 (2006) (duplicate publication of the earlier, 2005, paper).

[5] Previously available at <http://record.wustl.edu/archive/2001/02-09-01/articles/welding.html>, last visited on June 27, 2005.

[6] See also Brian MacMahon & Dimitrios Trichopoulos, Epidemiology. Principles and Methods at 229 (2ed 1996) (“A case-control study is an inquiry in which groups of individuals are selected based on whether they do (the cases) or do not (the controls) have the disease of which the etiology is to be studied.”); Jennifer L. Kelsey, W.D. Thompson, A.S. Evans, Methods in Observational Epidemiology at 148 (N.Y. 1986) (“In a case-control study, persons with a given disease (the cases) and persons without the disease (the controls) are selected … .”).

[7] See, e.g., Diane F. Halpern & Stanley Coren, “Do right-handers live longer?” 333 Nature 213 (1988); Diane F. Halpern & Stanley Coren, “Handedness and life span,” 324 New Engl. J. Med. 998 (1991).

[8] Kenneth J. Rothman, “Left-handedness and life expectancy,” 325 New Engl. J. Med. 1041 (1991) (pointing out that by comparing age of onset method, nursery education would be found more dangerous than paratrooper training, given that the age at death of pres-schoolers wo died would be much lower than that of paratroopers who died); see also Martin Bland & Doug Altman, “Do the left-handed die young?” Significance 166 (Dec. 2005).

[9] See Philip A. Wolf, Ralph B. D’Agostino, Janet L. Cobb, “Left-handedness and life expectancy,” 325 New Engl. J. Med. 1042 (1991); Marcel E. Salive, Jack M. Guralnik & Robert J. Glynn, “Left-handedness and mortality,” 83 Am. J. Public Health 265 (1993); Olga Basso, Jørn Olsen, Niels Holm, Axel Skytthe, James W. Vaupel, and Kaare Christensen, “Handedness and mortality: A follow-up study of Danish twins born between 1900 and 1910,” 11 Epidemiology 576 (2000). See also Martin Wolkewitz, Arthur Allignol, Martin Schumacher, and Jan Beyersmann, “Two Pitfalls in Survival Analyses of Time-Dependent Exposure: A Case Study in a Cohort of Oscar Nominees,” 64 Am. Statistician 205 (2010); Michael F. Picco, Steven Goodman, James Reed, and Theodore M. Bayless, “Methodologic pitfalls in the determination of genetic anticipation: the case of Crohn’s disease,” 134 Ann. Intern. Med. 1124 (2001).

[10] Kenneth J. Rothman, Sander Greenland, Timothy L. Lash, eds., Modern Epidemiology (3d ed. 2008).

[11] Dicky Scruggs served on the Plaintiffs’ Steering Committee until his conviction on criminal charges.

[12] James Mortimer, Amy Borenstein, and Lorene Nelson, “Associations of welding and manganese exposure with Parkinson disease: Review and meta-analysis,” 79 Neurology 1174 (2012).