Professor Sanders’ Paean to Milward

Deconstructing the Deconstruction of Deconstruction

Some scholars have suggested that the most searching scrutiny of scientific research takes place in the courtroom.  Barry Nace’s discovery of the “mosaic method” notwithstanding, lawyers rarely contribute new findings, which I suppose supports Professor Sanders’ characterization of the process as “deconstructive.”  The scrutiny of courtroom science is encouraged by the large quantity of poor-quality opinions on issues that lawyers and their clients must address if they wish to prevail.  As philosopher Harry Frankfurt described the situation:

“Bullshit is unavoidable whenever circumstances require someone to talk without knowing what he is talking about.  Thus the production of bullshit is stimulated whenever a person’s obligations or opportunities to speak about some topic exceed his knowledge of the facts that are relevant to that topic.”

Harry Frankfurt, On Bullshit 63 (Princeton Univ. Press 2005).

This unfortunate situation would seem to be especially true for advocacy science that involves scientists who are intent upon influencing public policy questions, regulation, and litigation outcomes.  Some of the most contentious issues, and tendentious studies, take place within the realm of occupational, environmental, and related disciplines. Sadly, many occupational and environmental medical practitioners seem particularly prone to publish in journals with low standards and poor peer review.  Indeed, the scientists and clinicians who work in some areas make up an insular community, in which the members are the peer reviewers and editors of each other’s work.  The net result is that any presumption of reliability for peer-reviewed biomedical research is untenable.

The silicone gel-breast implant litigation provides an interesting case study of the phenomenon.  Contrary to glib post-hoc assessments that there was “no” scientific evidence offered by plaintiffs, the fact is that there was a great deal.  Most of what was offered was published in peer-reviewed journals; some was submitted by scientists who had some credibility and standing within their scientific, academic communities:  Gershwin, Kossovsky, Lappe, Shanklin, Garrido, et al.  Lawyers, armed with subpoenas, interrogatories, and deposition notices, were able to accomplish what peer reviewers could not.  What Professor Sanders and others call “deconstruction” was none other than a scientific demonstration of study invalidity, seriously misleading data collection and analysis, and even fraud.  See Hon. Jack B. Weinstein, “Preliminary Reflections on Administration of Complex Litigation,” 2009 Cardozo L. Rev. de novo 1, 14 (describing plaintiffs’ expert witnesses in silicone litigation as “charlatans” and the litigation as largely based upon fraud).

Some scientific publications are motivated almost exclusively by the goal of influencing regulatory or political action.  Consider the infamous meta-analysis by Nissen and Wolski, of clinical trials and heart attack among patients taking Avandia.  Steven Nissen & Kathy Wolski, “Effect of Rosiglitazone on the Risk of Myocardial Infarction and Death from Cardiovascular Causes,” 356 New Engl. J. Med. 2457 (2007). The New England Journal of Medicine rushed the meta-analysis into print in order to pressure the FDA to step up its regulation of post-marketing surveillance of licensed medications.  Later, better-conducted meta-analyses showed how fragile Nissen’s findings were.  See, e.g., George A. Diamond, MD, et al., “Uncertain Effects of Rosiglitazone on the Risk for Myocardial Infarction and Cardiovascular Death,” 147 Ann. Intern. Med. 578 (2007); Tian, et al., “Exact and efficient inference procedure for meta-analysis and its application to the analysis of independent 2 × 2 tables with all available data but without artificial continuity correction,” 10 Biostatistics 275 (2008).  Lawyers should not be shy about pointing out political motivations of badly conducted scientific research, regardless of authorship or where published.

On the other hand, lawyers on both sides of litigation are prone to attack on personal bias and potential conflicts of interest because these attacks are more easily made, and better understood by judges and jurors.  Perhaps it is these “deconstructions” that Professor Sanders finds overblown, in which case, I would agree.  Precisely because jurors have difficulty distinguishing between allegations of funding bias and validity flaws that render studies nugatory, and because inquiries into validity require more time, care, analysis, attention, and scientific and statistical learning, pretrial gatekeeping of expert witnesses is an essential part of achieving substantial justice in litigation of scientific issues.  This is a message that is obscured by the recent cheerleading for the Milward decision at the litigation industry’s symposium on the case.

Deconstructing Professor Sanders’ Deconstruction of the Deconstruction in Milward

A few comments are in order about Professor Sanders’ handling of the facts of Milward itself.

The case arose from a claim of occupational exposure to benzene and an outcome known as APL (acute promyelocytic leukemia), which makes up about 10% of AML (acute myeloid leukemia).  Sanders argues, without any support, that APL is too rare for epidemiology to be definitive.  Sanders at 164.  Here Sanders asserts what Martyn Smith opined, and ignores the data that contradicted Smith.  At least one of the epidemiologic studies cited by Smith was quite large, and was able to discern small, statistically significant associations when present.  See, e.g., Nat’l Investigative Group for the Survey of Leukemia & Aplastic Anemia, “Countrywide Analysis of Risk Factors for Leukemia and Aplastic Anemia,” 14 Acta Academiae Medicinae Sinicae (1992).  This study found a crude odds ratio of 1.42 for benzene exposure and APL (M3).  The study had adequate power to detect a statistically significant odds ratio of 1.54 between benzene and M2a.  Of course, even if one study’s “power” were low, there are other, aggregative strategies available, such as meta-analysis (a minimal pooling sketch follows below).  This was not a credibility issue for the jury; Smith’s opinion turned on incorrect and fallacious analyses that did not deserve “jury time.”
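For readers who want to see what such an aggregative strategy looks like, here is a minimal sketch, in Python, of fixed-effect (inverse-variance) pooling of odds ratios.  The study results below are invented placeholders, not the published benzene data; only the mechanics of pooling are the point.

```python
from math import exp, log, sqrt
from scipy.stats import norm

def pool_fixed_effect(studies, alpha=0.05):
    """Fixed-effect (inverse-variance) pooling of odds ratios.
    Each study is a tuple: (OR, lower 95% CL, upper 95% CL)."""
    z = norm.ppf(1 - alpha / 2)
    weights, weighted_logs = [], []
    for or_hat, lo, hi in studies:
        se = (log(hi) - log(lo)) / (2 * z)  # back out the SE from the CI
        w = 1 / se ** 2                     # inverse-variance weight
        weights.append(w)
        weighted_logs.append(w * log(or_hat))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return (exp(pooled_log),
            exp(pooled_log - z * pooled_se),
            exp(pooled_log + z * pooled_se))

# Invented study results, for illustration only.
studies = [(1.4, 0.9, 2.2), (1.1, 0.7, 1.7), (1.6, 0.8, 3.2)]
pooled, lo, hi = pool_fixed_effect(studies)
print(f"pooled OR = {pooled:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```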

The problem, according to Sanders, is one of “power.”  In a lengthy footnote, Sanders explains what “power” is, and why he believes it is a problem:

“The problem is one of power. Tests of statistical significance are designed to guard against one type of error, commonly called Type I Error. This error occurs when one declares a causal relationship to exist when in fact there is no relationship, … . A second type of error, commonly called Type II Error, occurs when one declares a causal relationship does not exist when in fact it does. The “power” of a study measures its ability to avoid a Type II Error. Power is a function of a study’s sample size, the size of the effect one wishes to detect, and the significance level used to guard against Type I Error. Because power is a function of, among other things, the significance level used to guard against Type I errors, all things being equal, minimizing the probability of one type of error can be done only by increasing the probability of making the other.  Formulae exist to calculate the power of case-control and cohort studies from 2 x 2 contingency table data.

Because the power of any test is reduced as the incidence of an effect decreases, Type II threats to causal conclusions are particularly relevant with respect to rare events. Plaintiffs make a fair criticism of randomized trials or epidemiological cohort studies when they note that sometimes the studies have insufficient power to detect rare events. In this situation, case-control studies are particularly valuable because of their relatively greater power. In most toxic tort contexts, the defendant would prefer to minimize Type I Error while the plaintiffs would prefer to minimize Type II Error. Ideally, what we would prefer are studies that minimize the probability of both types of errors. Given the importance of power in assessing epidemiological evidence, surprisingly few appellate opinions discuss this issue. But see DeLuca v. Merrell Dow Pharm., Inc., 911 F.2d 941, 948 (3d Cir. 1990), which contains a good discussion of epidemiological evidence. The opinion discusses the two types of error and suggests that courts should be concerned about both. Id. Unfortunately, neither the district court opinion nor the court of appeals opinion in Milward discusses power.”

Sanders at 164 n.115 (internal citations omitted).

Sanders is one of the few law professors who almost manage to describe statistical power correctly.  Calculating and evaluating power requires pre-specification of alpha (our maximum tolerated Type I error), sample size, and an alternative hypothesis that we would want to be able to identify at a statistically significant level.  This much is set out in the footnote quoted above.
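To make the pre-specification concrete, here is a minimal sketch, in Python, of a prospective power calculation for a case-control study, using the normal approximation to the log odds ratio.  The group sizes, control-group exposure prevalence, and target odds ratio are hypothetical, chosen only for illustration.

```python
from math import log, sqrt
from scipy.stats import norm

def case_control_power(n_cases, n_controls, p0, target_or, alpha=0.05):
    """Approximate power of a case-control study to detect `target_or`
    at two-sided significance level `alpha`, given exposure prevalence
    `p0` among controls (normal approximation to the log odds ratio)."""
    # Exposure prevalence among cases implied by the target odds ratio.
    odds1 = target_or * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    # Expected cell counts of the 2 x 2 table under the alternative.
    a, b = n_cases * p1, n_cases * (1 - p1)        # exposed / unexposed cases
    c, d = n_controls * p0, n_controls * (1 - p0)  # exposed / unexposed controls
    se = sqrt(1/a + 1/b + 1/c + 1/d)               # SE of ln(OR)
    z_crit = norm.ppf(1 - alpha / 2)               # two-sided critical value
    return norm.cdf(abs(log(target_or)) / se - z_crit)

# Hypothetical design: 500 cases, 2,000 controls, 10% exposure among controls.
print(f"power to detect OR = 1.5: {case_control_power(500, 2000, 0.10, 1.5):.2f}")
```

Note that alpha, the sample sizes, and the alternative (the target odds ratio) all must be specified before any power figure can be computed; none of them falls out of the data.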

Sample size, however, is just one factor in a study’s variance; the variance is not completely determined by sample size alone.  More important, Sanders’ invocation of power to evaluate the exonerative quality of a completed study has been largely rejected in the world of epidemiology.  His note that “[f]ormulae exist to calculate the power of case-control and cohort studies from 2 x 2 contingency table data” is largely irrelevant because power calculations are mostly confined to sample-size determinations before a study is conducted.  After the data are collected, studies are evaluated by their point estimates and their corresponding confidence intervals.  See, e.g., Vandenbroucke, et al., “Strengthening the reporting of observational studies in epidemiology (STROBE):  Explanation and elaboration,” 18 Epidemiology 805, 815 (2007) (Section 10, sample size) (“Do not bother readers with post hoc justifications for study size or retrospective power calculations. From the point of view of the reader, confidence intervals indicate the statistical precision that was ultimately obtained.”) (emphasis added).  See also “Power in the Courts — Part Two” (Jan. 21, 2011).
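The post-data alternative that STROBE recommends is easy to illustrate.  Here is a minimal sketch, again with invented counts rather than any actual study data, of a crude odds ratio and its Woolf (log-based) 95% confidence interval:

```python
from math import exp, log, sqrt
from scipy.stats import norm

def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Crude odds ratio and Woolf (log-based) confidence interval from a
    2 x 2 table: a/b = exposed/unexposed cases, c/d = exposed/unexposed
    controls."""
    or_hat = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)
    z = norm.ppf(1 - alpha / 2)
    return or_hat, exp(log(or_hat) - z * se), exp(log(or_hat) + z * se)

# Invented counts, for illustration only.
or_hat, lo, hi = odds_ratio_ci(30, 70, 220, 730)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

On these invented counts, the interval runs from below 1.0 to above 2.0, and that interval tells a reader far more about the study’s statistical precision than any retrospective power figure could.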

Type II error is important in the evaluation of evidence, but it requires a commitment to a specific alternative hypothesis.  That alternative can always be set closer and closer to the null hypothesis of no association in order to conclude, as some plaintiffs’ counsel would want, that all studies lack power (except of course the ones that turn out to support their claims).  Sanders’ discussion of statistical power ultimately falters because claiming a lack of power without specifying the size of the alternative hypothesis is unprincipled and meaningless.
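The point can be made numerically.  Continuing the hypothetical power sketch above (this snippet assumes the case_control_power() function defined there, with the same invented design of 500 cases, 2,000 controls, and 10% exposure among controls):

```python
# Power collapses as the alternative hypothesis is moved toward the null.
for target_or in (2.0, 1.5, 1.2, 1.05):
    power = case_control_power(500, 2000, 0.10, target_or)
    print(f"alternative OR = {target_or:<4}  power = {power:.2f}")
```

On these invented inputs, power runs from near certainty at an alternative odds ratio of 2.0 down to roughly the alpha level at 1.05.  Hence the mischief: leave the alternative unspecified, and any study can be pronounced “underpowered.”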

Sanders tells us that cohorts will have less power than case-control studies, but again the devil is in the details.  Case-control studies are of course relatively more efficient in studying rare diseases, but the statistical precision of their odds ratios will be given by the corresponding confidence intervals.

What is missing from Sanders’ scholarship is a simple statement of what the point estimates and their confidence intervals are.  Plaintiffs in Milward argued that epidemiology was well-nigh unable to detect increased risks of APL, but then they embraced epidemiology once Smith had manipulated and re-arranged the data in published studies.

The Yuck Factor

One of the looming problems in expert witness gatekeeping is judicial discomfort, and inability, in recounting the parties’ contentions, the studies’ data, and the witnesses’ testimony.  In a red car/blue car case, judges are perfectly comfortable giving detailed narratives of the undisputed facts, and the conditions that give rise to discounting or excluding evidence or testimony.  In science cases, not so much.

Which brings us to the data manipulation conducted by Martyn Smith in the Milward case.  Martyn Smith is not an epidemiologist, and he has little or no experience or expertise in conducting and analyzing epidemiologic studies.  The law of expert witnesses makes challenges to an expert’s qualifications very difficult; generally courts presume that expert witnesses are competent to testify about general scientific and statistical matters.  Often the presumption is incorrect.

In Milward, Smith claimed, on the one hand, that he did not need epidemiology to reach his conclusion, but on the other hand that “suggestive” findings supported his opinion.  On the third hand, he seemed to care enough about the epidemiologic evidence to engage in fairly extensive reanalysis of published studies.  As the district court noted, Smith made “unduly favorable assumptions in reinterpreting the studies, such as that cases reported as AML could have been cases of APL.”  Milward v. Acuity Specialty Products Group, Inc., 664 F. Supp. 2d 137, 149 (D. Mass. 2009), rev’d, 639 F.3d 11, 19 (1st Cir. 2011), cert. denied sub nom. U.S. Steel Corp. v. Milward, 132 S. Ct. 1002 (2012).  Put less charitably, Smith made up data to suit his hypothesis.

The details of Smith’s manipulations go well beyond cherry picking.  Smith assumed, without evidence, that AML cases were APL cases.  Smith arbitrarily chose and rearranged data to create desirable results.  See Deposition Testimony of Dr. David Garabrant at 22-53, in Milward (Feb. 18, 2009).  In some studies, Smith discarded APL cases from the unexposed group, with the consequence of increasing the apparent association; he miscalculated odds ratios; and he presented odds ratios without p-values or confidence intervals.  The district court certainly was entitled to conclude that Smith had sufficiently deviated from scientific standards of care as to make his testimony inadmissible.
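To see why discarding cases from the unexposed group is no mere judgment call, consider a toy 2 x 2 table.  The counts below are invented, not Smith’s, but the arithmetic is perfectly general: removing unexposed cases mechanically inflates the odds ratio.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Crude odds ratio from a 2 x 2 table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

# Invented counts, for illustration only.
before = odds_ratio(20, 50, 100, 350)  # full table: OR = 1.40
after = odds_ratio(20, 35, 100, 350)   # 15 unexposed cases discarded: OR = 2.00
print(f"OR before = {before:.2f}; OR after discarding unexposed cases = {after:.2f}")
```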

Regrettably, the district court did not provide many details of Smith’s reanalyses of studies and their data.  The failure to document Smith’s deviations facilitated the Circuit’s easy generalization that the fallacious reasoning and methodology were somehow inventions of the district court.

The appellate court gave no deference to the district court’s assessment, and by judicial fiat turned methodological missteps into credibility issues for the jury.  The Circuit declared that the analytical gap was of the district court’s making, which seemed plausible enough if one read only the appellate decision.  If one reads the actual testimony, the Yuck Factor becomes palpable.

WOE Unto Bradford Hill

Professor Sanders accepts the appellate court’s opinion at face value for its suggestion that:

“Dr. Smith’s opinion was based on a ‘weight of the evidence’ methodology in which he followed the guidelines articulated by world-renowned epidemiologist Sir Arthur Bradford Hill in his seminal methodological article on inferences of causality.”

Sanders at 170 n.140 (quoting Milward, 639 F.3d at 17).

Sanders (and the First Circuit) is unclear whether WOE consists of following the guidelines articulated by Sir Arthur (perhaps Sir Austin Bradford Hill’s less distinguished brother?), or merely includes the guidelines within a larger process.  Not only was there no Sir Arthur, but Sir Austin’s guidelines are distinctly different from WOE in that they pre-specify the considerations to be applied.  Nowhere does the appellate court give any meaningful consideration to whether there was an exposure-response gradient shown, or whether the epidemiologic studies consistently showed an association between benzene and APL.  Had the Circuit given any consideration to the specifics of the guidelines, it would likely have concluded that the district court had engaged in fairly careful, accurate gatekeeping, well within its discretion.  (If the standard were de novo review rather than “abuse of discretion,” the Circuit would have had to confront the significant analytical gaps and manipulations in Smith’s testimony.)  Furthermore, it is time to acknowledge that Bradford Hill’s “guidelines” are taken from a speech given by Sir Austin almost 50 years ago; they hardly represent a comprehensive, state-of-the-art set of guidelines for causal analysis in epidemiology today.

So there you have it.  WOE means the Bradford Hill guidelines, except that the individual guidelines need not be considered.  And although Bradford Hill’s guidelines were offered to evaluate a body of epidemiologic studies, WOE teaches us that we do not need epidemiologic studies, especially if they do not help to establish a plaintiff’s claim.  Sanders at 168 & n.133 (citing Milward at 22-24).

What is WOE?

If WOE were not really the Bradford Hill guidelines, then what might it be?  Attempting to draw a working definition of WOE from the Milward appellate decision, Sanders tells us that WOE requires looking at all the relevant evidence.  Sanders at 169.  Not much guidance there.  Elsewhere he tells us that WOE is “reasoning to the best explanation,” without explicating what such reasoning entails.  Sanders at 169 & n.136 (quoting Milward at 23, “The hallmark of the weight of the evidence approach is reasoning to the best explanation.”).  This hardly tells us anything about what method Smith and his colleagues were using.

Sanders then tells us that WOE means the whole “tsumish.” (My word; not his.)  Not only should expert witnesses rely upon all the relevant evidence, but they should eschew an atomistic approach that looks (too hard) at individual studies.  Of course, there may be value in looking at the entire evidentiary display.  Indeed, a holistic view may be needed to show the absence of causation.  In many litigations, plaintiffs’ counsel are adept at filling up the courtroom with “bricks,” which do not fit together to form the wall they claim.  In the silicone gel breast implant litigation, plaintiffs’ counsel were able to pick out factoids from studies to create sufficient confusion and doubt that there might be a causal connection between silicone and autoimmune disease.  A careful, systematic analysis, which looked at the big picture, demonstrated that these contentions were bogus.  Committee on the Safety of Silicone Breast Implants, Institute of Medicine, Safety of Silicone Breast Implants (Wash. D.C. 1999) (reviewing studies, many of which were commissioned by litigation defendants, and which collectively showed lack of association between silicone and autoimmune diseases).  Sometimes, however, taking in the view of the entire evidentiary display may obscure what makes up the display.  A piece by El Anatsui may look like a beautiful tapestry, but a closer look will reveal it is just a bunch of bottle caps wired together.

Contrary to Professor Sanders’ assertions, nothing in the Milward appellate opinion explains why studies should be viewed only as a group, or why this view will necessarily show something greater than the parts. Sanders at 170.  Although Sanders correctly discerns that the Circuit elevated WOE from “perspective” to a methodology, there is precious little content to the methodology, especially if it permits witnesses to engage in all sorts of data shenanigans or arbitrary weighting of evidence.  The quaint notion that there is always a best explanation obscures the reality that in science, and especially in science that is likely to be contested in a courtroom, the best explanation will often be “we don’t know.”

Sanders eventually comes around to admit that WOE is perplexingly vague as to how the weighing should be done.  Id. at 170.  He also admits that the holistic view is not always helpful.  Id. at 170 & n.139 (the sum is greater than its parts but only when the combination enhances the supportiveness of the parts, and the collective support for the conclusion at issue, etc.).  These concessions should give courts serious pause before they adopt a dissent from a Supreme Court case that has been repeatedly rejected by courts, commentators, and ultimately by Congress in revising Rule 702.

WOE is Akin to Differential Diagnosis

The Milward opinion seems like a bottomless reserve of misunderstandings.  Professor Sanders barely flinches at the court’s statement that “The use of judgment in the weight of the evidence methodology is similar to that in differential diagnosis.”  Milward at 18.  See Sanders at 171.  Differential “diagnosis” requires previous demonstration of general causation, and proceeds by iterative disjunctive syllogism.  Sanders, and the First Circuit, somehow missed that this syllogistic reasoning is completely unrelated to the abductive inferences that may play a role in reaching conclusions about general causation.  Sanders revealingly tells us that “[e]xperts using a weight of the evidence methodology should be given the same wide latitude as is given those employing the differential diagnosis method.”  Sanders at 172 & n.147.  This counsel appears to be an invitation to speculate.  If the “wide latitude” to which Sanders refers means the approach of a minority of courts that allow expert witnesses to rule in differentials by speculation, and then to reach specific causation without ruling out idiopathic cases, then Sanders’ approach is advocacy for epistemic nihilism.
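The logical structure at issue is simple enough to state formally.  Here is a minimal rendering of the disjunctive syllogism behind differential etiology, with hypothetical candidate causes A, B, and C, each previously established as capable of causing the condition:

```latex
% Differential etiology as disjunctive syllogism; A, B, C are
% hypothetical candidate causes, each a previously established
% general cause of the condition.
\[
  \underbrace{(A \lor B \lor C)}_{\text{``ruling in'' presupposes general causation}}
  \;\land\; \neg B \;\land\; \neg C
  \;\;\vdash\;\; A
\]
```

The disjunction, the “ruling in” step, is available only because general causation has already been shown for each candidate; abductive reasoning about general causation supplies no such disjunction, which is why the analogy fails.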

The Corpuscular Approach

Professor Sanders seems to endorse the argument of Milward, as well as Justice Stevens’ dissent in Joiner, that scientists do not assess research by looking at the validity (vel non) of individual studies, and therefore courts should not permit this approach.  Sanders at 173 & n.15.  Neither Justice Stevens nor Professor Sanders presents any evidence for the predicate assertion, which a brief tour of IARC’s less political working group reports would show to be incorrect.

The rationale for Sanders’ (and Milward’s) reduction of science to WOE becomes clear when Sanders asserts that “[p]erhaps all or nearly all critiques of an expert employing a weight of the evidence methodology should go to weight, not admissibility.”  Id. at 173 & n.155.  To be fair, Sanders notes that the Milward court carved out a “solid-body” of exonerative epidemiology exception to WOE.  Id. at 173-74.  This exception, however, does nothing other than place a substantial burden upon the opponent of expert witness opinion to show that the opinion is demonstrably incorrect.  The proponent gets a free pass as long as there is no “solid body” of such evidence that shows he is affirmatively wrong.  Discerning readers will observe that this maneuver simply shifts the burden of admissibility to the opponent, and eschews the focus on methodology for a renewed emphasis upon general acceptance of conclusions.  Id.

Sanders also notes that other courts have seen through the emptiness of WOE and rejected its application in specific cases.  Id. at 174 & nn.163-64 (citing Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 601-02 (D.N.J. 2002), aff’d, 68 F. App’x 356 (3d Cir. 2003), where the trial court rejected Dr. Ozonoff’s attempt to deploy WOE without explaining or justifying the mixing and matching of disparate kinds of studies with disparate results).  Sanders’ analysis of Milward seems, however, designed to skim the surface of the case in an effort to validate the First Circuit’s superficial approach.