TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Love is Blind but What About Judicial Gatekeeping of Expert Witnesses? – Viagra Part I

July 7th, 2012

The Viagra litigation over claimed vision loss vividly illustrates the difficulties that trial judges have in understanding and applying the concept of statistical significance.  In this MDL, plaintiffs sued for a specific form of vision loss, non-arteritic ischemic optic neuropathy (NAION), which they claimed was caused by their use of defendant’s medication, Viagra.  In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071 (D. Minn. 2008).  Plaintiffs’ key expert witness, Gerald McGwin, considered three epidemiologic studies; none found a statistically significant elevation of risk of NAION after Viagra use.  Id. at 1076.  The defense filed a Rule 702 motion to exclude McGwin’s testimony, based in part upon the lack of statistical significance of the risk ratios he relied upon for his causal opinion.  The trial court held that this lack did not render McGwin’s testimony unreliable and inadmissible.  Id. at 1090.

One of the three studies considered by McGwin was his own published paper.  G. McGwin, Jr., M. Vaphiades, T. Hall, C. Owsley, ‘‘Non-arteritic anterior ischaemic optic neuropathy and the treatment of erectile dysfunction,’’ 90 Br. J. Ophthalmol. 154 (2006)[“McGwin 2006”].    The MDL court noted that McGwin had stated that his paper reported an odds ratio (OR) of 1.75, with a 95% confidence interval (CI), 0.48 to 6.30.  Id. at 1080.  The study also presented multiple subgroup analyses of men who had reported Viagra use after a history of heart attack (OR = 10.7) or hypertension (OR = 6.9), but the MDL court did not provide p-values or confidence intervals for the subgroup analysis results.

Curiously, Judge Magnuson eschewed the guidance of the Reference Manual on Scientific Evidence in dealing with statistical estimates of means or proportions based upon samples.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.  The new third edition again conveys the same basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The point of the RMSE‘s guidance is, of course, that the standard error, or the confidence interval (C.I.) based upon a specified number of standard errors, is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.
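To make the admonition concrete, the familiar large-sample (Woolf) formulas give the standard error and confidence interval for an odds ratio computed from a 2 x 2 table with cell counts a, b, c, and d (the notation here is generic, not data from any of the studies discussed):

```latex
\mathrm{OR} = \frac{a\,d}{b\,c}, \qquad
\mathrm{SE}\bigl(\ln \mathrm{OR}\bigr) = \sqrt{\tfrac{1}{a} + \tfrac{1}{b} + \tfrac{1}{c} + \tfrac{1}{d}}, \qquad
95\%\ \mathrm{CI} = \exp\bigl(\ln \mathrm{OR} \pm 1.96 \cdot \mathrm{SE}\bigr)
```

Small cell counts make the standard error, and hence the interval, large; that is why sparse studies produce point estimates that are, standing alone, nearly uninformative.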

The statistical data on which McGwin based his opinion were readily available from McGwin 2006:

“Overall, males with NAION were no more likely to report a history of Viagra … use compared to similarly aged controls (odd ratio (OR) 1.75, 95% confidence interval (CI) 0.48 to 6.30.  However, for those with a history of myocardial infarction, a statistically significant association was observed (OR 10.7, 95% CI 1.3 to 95.8). A similar association was observed for those with a history of hypertension though it lacked statistical significance (OR 6.9, 95% CI 0.8 to 63.6).”

McGwin 2006, at 154.  Following the RMSE‘s guidance would have assisted the MDL court in its gatekeeping responsibility in several distinct ways.  First, the court would have focused on how wide the 95% confidence intervals were.  The width of the intervals pointed to statistical imprecision and instability in the point estimates urged by McGwin.  Second, the MDL court would have confronted the extent to which there were multiple ad hoc subgroup analyses in McGwin’s paper.  See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002) (“It is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns.”).  Third, the court would have confronted the extent to which the study’s validity was undermined by several potent biases.  Statistical significance was the least of the problems faced by McGwin 2006.
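A back-of-the-envelope calculation shows how sparse data generate intervals of roughly the width McGwin reported.  The cell counts below are invented for illustration only, chosen merely to reproduce a point estimate and interval of about the size at issue; they are not the counts from McGwin 2006:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Woolf (log) 95% confidence interval for the odds ratio of a 2 x 2 table.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# A handful of exposed cases is enough to stretch the interval across
# nearly an order of magnitude.
print(odds_ratio_ci(4, 30, 7, 92))  # roughly OR 1.75, 95% CI 0.48 to 6.4
```

The width of such an interval, not the bare point estimate, is what tells the reader how little the study can support.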

The second study considered and relied upon by McGwin was referred to as Margo & French.  McGwin cited this paper for an “elevated OR of 1.10,” id. at 1081, but again, had the court engaged with the actual evidence, it would have found that McGwin had cherry picked the data he chose to emphasize.  The Margo & French study was a retrospective cohort study using the National Veterans Health Administration’s pharmacy and clinical databases.  C. Margo & D. French, ‘‘Ischemic optic neuropathy in male veterans prescribed phosphodiesterase-5 inhibitors,’’ 143 Am. J. Ophthalmol. 538 (2007).  There were two outcomes ascertained:  NAION and “possible” NAION.  The relative risk of NAION among men prescribed a PDE-5 inhibitor (the class to which Viagra belongs) was 1.02 (95% confidence interval [CI]: 0.92 to 1.12).  In other words, the Margo & French paper had very high statistical precision, and it reported essentially no increased risk at all.  Judge Magnuson cited uncritically McGwin’s endorsement of a risk ratio that included ‘‘possible’’ NAION cases, which could not bode well for a gatekeeping process that is supposed to protect against speculative evidence and conclusions.

McGwin’s citation of Margo & French for the proposition that men who had taken the PDE-5 inhibitors had a 10% increased risk was wrong on several counts.  First, he relied upon an outcome measure that included ‘‘possible’’ cases of NAION.  Second, he completely ignored the sampling error that is captured in the confidence interval.  The MDL court failed to note or acknowledge the p-value or confidence interval for any result in Margo & French. The consideration of random error was not an optional exercise for the expert witness or the court; nor was ignoring it a methodological choice that simply went to the ‘‘disagreement among experts.’’

The Viagra MDL court not only lost its way by ignoring the guidance of the RMSE, it appeared to confuse the magnitude of the associations with the concept of statistical significance.  In the midst of the discussion of statistical significance, the court digressed to address the notion that the small relative risk in Margo & French might mean that no plaintiff could show specific causation, and then in the same paragraph returned to state that ‘‘persuasive authority’’ supported the notion that the lack of statistical significance did not detract from the reliability of a study.  Id. at 1081 (citing In re Phenylpropanolamine (PPA) Prods. Liab. Litig., MDL No. 1407, 289 F.Supp.2d 1230, 1241 (W.D.Wash. 2003)).  The magnitude of the observed odds ratio is a concept independent of whether an odds ratio at least as extreme would have occurred by chance if there really were no elevation.

Citing one case, at odds with a great many others, however, did not create an epistemic warrant for ignoring the lack of statistical significance.  The entire notion of citing caselaw for the meaning and importance of statistical significance for drawing inferences is wrongheaded.  Even more to the point, the lack of statistical significance in the key study in the PPA litigation did not detract from the reliability of the study, although other features of that study certainly did.  The lack of statistical significance in the PPA study did, however, detract from the reliability of the inference from the study’s estimate of ‘‘effect size’’ to a conclusion of causal association. Indeed, nowhere in the key PPA study did its authors draw a causal conclusion with respect to PPA ingestion and hemorrhagic stroke.  See Walter Kernan, Catherine Viscoli, Lawrence Brass, Joseph Broderick, Thomas Brott, Edward Feldmann, Lewis Morgenstern, Janet Lee Wilterdink, and Ralph Horwitz, ‘‘Phenylpropanolamine and the Risk of Hemorrhagic Stroke,’’ 343 New England J. Med. 1826 (2000).

The MDL court did attempt to distinguish the Eighth Circuit’s decision in Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001), cited by the defense:

‘‘[I]n Glastetter … expert evidence was excluded because ‘rechallenge and dechallenge data’ presented statistically insignificant results and because the data involved conditions ‘quite distinct’ from the conditions at issue in the case. Here, epidemiologic data is at issue and the studies’ conditions are not distinct from the conditions present in the case. The Court does not find Glastetter to be controlling.’’

Id. at 1081 (internal citations omitted; emphasis in original).  This reading of Glastetter, however, misses important features of that case and the Parlodel litigation more generally.  First, the Eighth Circuit commented not only upon the rechallenge-dechallenge data, which involved arterial spasms, but upon an epidemiologic study of stroke, from which Ms. Glastetter suffered.  The Glastetter court did not review the epidemiologic evidence itself, but cited to another court, which did discuss and criticize the study for various ‘‘statistical and conceptual flaws.’’  See Glastetter, 252 F.3d at 992 (citing Siharath v. Sandoz Pharms.Corp., 131 F.Supp. 2d 1347, 1356-59 (N.D.Ga.2001)).  Glastetter was binding authority, and not so easily dismissed and distinguished.

The Viagra MDL court ultimately placed its holding upon the facts that:

‘‘the McGwin et al. and Margo et al. studies were peer-reviewed, published, contain known rates of error, and result from generally accepted epidemiologic research.’’

In re Viagra, 572 F. Supp. 2d at 1081 (citations omitted).  This holding was a judicial ipse dixit substituting for the expert witness’s ipse dixit.  There were no known rates of error for the systematic errors in the McGwin study, and the ‘‘known’’ rates of error for random error in McGwin 2006  were intolerably high.  The MDL court never considered any of the error rates, systematic or random, for the Margo & French study.  The court appeared to have abdicated its gatekeeping responsibility by delegating it to unknown peer reviewers, who never considered whether the studies at issue in isolation or together could support a causal health claim.

With respect to the last of the three studies considered, the Gorkin study, McGwin opined that it was  too small, and the data were not suited to assessing temporal relationship.  Id.  The court did not appear inclined to go beyond McGwin’s ipse dixit.  The Gorkin study was hardly small, in that it was based upon more than 35,000 patient-years of observation in epidemiologic studies and clinical trials, and provided an estimate of incidence for NAION among users of Viagra that was not statistically different from the general U.S. population.  See L. Gorkin, K. Hvidsten, R. Sobel, and R. Siegel, ‘‘Sildenafil citrate use and the incidence of nonarteritic anterior ischemic optic neuropathy,’’ 60 Internat’l J. Clin. Pract. 500, 500 (2006).

Judge Magnuson did proceed, in his 2008 opinion, to exclude all the other expert witnesses put forward by the plaintiffs.  McGwin survived the defendant’s Rule 702 challenge, largely because the court refused to consider the substantial random variability in the point estimates from the studies relied upon by McGwin. There was no consideration of the magnitude of random error, or for that matter, of the systematic error in McGwin’s study.  The MDL court found that the studies upon which McGwin relied had a known and presumably acceptable ‘‘rate of error.’’  In fact, the court did not consider the random or sampling error in any of the three cited studies; it failed to consider the multiple testing and interaction; and it failed to consider the actual and potential biases in the McGwin study.

Some legal commentators have argued that statistical significance should not be a litmus test.  David Faigman, Michael Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence: The Law and Science of Expert Testimony § 23:13, at 241 (‘‘Statistical significance should not be a litmus test. However, there are many situations where the lack of significance combined with other aspects of the research should be enough to exclude an expert’s testimony.’’)  While I agree that significance probability should not be evaluated in a mechanical fashion, without consideration of study validity, multiple testing, bias, confounding, and the like, hand waving about litmus tests does not excuse courts or commentators who totally ignore random variability in studies based upon population sampling.  The dataset in the Viagra litigation was not a close call.

Maryland Puts the Brakes on Each and Every Asbestos Exposure

July 3rd, 2012

Last week, the Maryland Court of Special Appeals reversed a plaintiffs’ verdict in Dixon v. Ford Motor Company, 2012 WL 2483315 (Md. App. June 29, 2012).  Jane Dixon died of pleural mesothelioma.  The plaintiffs, her survivors, claimed that her last illness and death were caused by her household improvement projects, which involved exposure to spackling/joint compound, and by her husband’s work with car parts and brake linings, which involved “take home” exposure on his clothes.  Id. at *1.

All the expert witnesses appeared to agree that mesothelioma is a “dose-response disease,” meaning that the more the exposure, the greater the likelihood that a person exposed will develop the disease. Id. at *2.  Plaintiffs’ expert witness, Dr. Laura Welch, testified that “every exposure to asbestos is a substantial contributing cause and so brake exposure would be a substantial cause even if [Mrs. Dixon] had other exposures.” On cross-examination, Dr. Welch elaborated upon her opinion to explain that any “discrete” exposure would be a contributing factor. Id.

Welch, of course, criticized the entire body of epidemiology of car mechanics and brake repairmen, which generally finds no increased risk of mesothelioma above overall population rates.  With respect to the take-home exposure, Welch had to acknowledge that there were no epidemiologic studies that investigated the risk of wives of brake mechanics.  Welch argued that the studies of car mechanics did not involve exposure to brake shoes as would have been experienced by brake repairmen, but her argument only served to make her attribution based upon take-home exposure to brake linings seem more preposterous.  Id. at *3.  The court recognized that Dr. Welch’s opinion may have been trivially true, but still unhelpful.  Each discrete exposure, even one as attenuated as a take-home exposure from having repaired a single brake shoe, may have “contributed,” but that opinion did not help the jury assess whether the contribution was substantial.

The court sidestepped the issue of fiber type, and threshold, and homed in on the agreement that mesothelioma risk showed a dose-response relationship with asbestos exposure.  (There is a sense that the court confused the dose-response concept with the absence of a threshold.)  The court credited hyperbolic risk assessment figures from the United States Environmental Protection Agency, which suggested that even ambient air exposure to asbestos leads to an increase in mesothelioma risk, but then realized that such claims made the legal need to characterize the risk from the defendant’s product all the more important before the jury could reasonably have concluded that any particular exposure experienced by Ms. Dixon was “a substantial contributing factor.”  Id. at *5.

Having recognized that the best the plaintiffs could offer was a claim of increased risk, and perhaps crude quantification of the relative risks resulting from each product’s exposure, the court could not escape the conclusion that Dr. Welch’s empty recitation that “every exposure” is substantial was nothing more than an unscientific and empty assertion.  Welch’s claim was either tautologically true or empirical nonsense.  The court also recognized that risk substituting for causation opened the door to essentially probabilistic evidence:

“If risk is our measure of causation, and substantiality is a threshold for risk, then it follows—as intimated above—that ‘substantiality’ is essentially a burden of proof. Moreover, we can explicitly derive the probability of causation from the statistical measure known as ‘relative risk’ … .  For reasons we need not explore in detail, it is not prudent to set a singular minimum ‘relative risk’ value as a legal standard.12 But even if there were some legal threshold, Dr. Welch provided no information that could help the finder of fact to decide whether the elevated risk in this case was ‘substantial’.”

Id. at *7.  The court’s discussion here of “the elevated risk” seems wrong unless we understand it to mean the elevated risk attributable to the particular defendant’s product, in the context of an overall exposure that we accept as having been sufficient to cause the decedent’s mesothelioma.  Despite the lack of any quantification of relative risks in the case, overall or from particular products, and the court’s own admonition against setting a minimum relative risk as a legal standard, the court proceeded to discuss relative risks at length.  For instance, the court criticized Judge Kozinski’s opinion in Daubert, upon remand from the Supreme Court, for not going far enough:

“In other words, the Daubert court held that a plaintiff’s risk of injury must have at least doubled in order to hold that the defendant’s action was ‘more likely than not’ the actual cause of the plaintiff’s injury. The problem with this holding is that relative risk does not behave like a ‘binary’ hypothesis that can be deemed ‘true’ or ‘false’ with some degree of confidence; instead, the uncertainty inherent in any statistical measure means that relative risk does not resolve to a certain probability of specific causation. In order for a study of relative risk to truly fulfill the preponderance standard, it would have to result in 100% confidence that the relative risk exceeds two, which is a statistical impossibility. In short, the Daubert approach to relative risk fails to account for the twin statistical uncertainty inherent in any scientific estimation of causation.”

Id. at *7 n.12 (citing Daubert v. Merrell Dow Pharms., Inc., 43 F.3d 1311, 1320-21 (9th Cir.1995) (holding that a preponderance standard requires causation to be shown by probabilistic evidence of relative risk greater than two) (opinion on remand from Daubert v. Merrell Dow Pharms., 509 U.S. 579 (1993))).  The statistical impossibility derives from the asymptotic nature of the normal distribution, but the court failed to explain why a relative risk of two must be excluded as statistically implausible based upon the sample statistic.  After all, a relative risk greater than two, with a lower bound of a 95% confidence interval above one, based upon an unbiased sampling, suggests that our best evidence is that the population parameter is greater than two, as well.  The court, however, insisted upon stating the relative-risk-greater-than-two rule with a vengeance:

“All of this is not to say, however, that any and all attempts to establish a burden of proof of causation using relative risk will fail. Decisions can be – and in science or medicine are – premised on the lower limit of the relative risk ratio at a requisite confidence level. The point of this minor discussion is that one cannot apply the usual, singular ‘preponderance’ burden to the probability of causation when the only estimate of that probability is statistical relative risk. Instead, a statistical burden of proof of causation must consist of two interdependent parts: a requisite confidence of some minimum relative risk. As we explain in the body of our discussion, the flaws in Dr. Welch’s testimony mean we need not explore this issue any further.44”

Id. (emphasis in original).
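The arithmetic the court was circling is, under the usual simplifying assumptions (an unbiased, unconfounded risk ratio, with the excess risk spread evenly across the exposed group), straightforward: the probability that a given exposed case is attributable to the exposure is the attributable fraction among the exposed, which exceeds one-half exactly when the relative risk exceeds two:

```latex
P(\text{causation} \mid \text{exposed case}) \;\approx\; \frac{\mathrm{RR} - 1}{\mathrm{RR}}
\;>\; \tfrac{1}{2}
\quad\Longleftrightarrow\quad
\mathrm{RR} \;>\; 2
```

The court’s point about sampling uncertainty is a separate question: whether the estimate of RR is itself precise enough to support the inference.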

And despite having declared the improvidence of addressing the relative risk issue, and then the lack of necessity for addressing the issue given Dr. Welch’s flawed testimony, the court nevertheless tackled the issue once more, a couple of pages later:

“It would be folly to require an expert to testify with absolute certainty that a plaintiff was exposed to a specific dose or suffered a specific risk. Dose and risk fall on a spectrum and are not ‘true or false’. As such, any scientific estimate of those values must be expressed as one or more possible intervals and, for each interval, a corresponding confidence that the true value is within that interval.”

Id. at *9 (emphasis in original; internal citations omitted).  The court captured the frequentist concept of the confidence interval as being defined operationally by repeated samplings and their random variability, but the “confidence” of a confidence interval means that the specified coefficient represents the percentage of all such intervals that would include the “true” value, not the probability that a particular interval, calculated from a given sample, contains the true value.  The true value is either in or not in the interval generated from a single sample risk statistic.  Again, it is unclear why the court was weighing in on this aspect of probabilistic evidence when plaintiffs’ expert witness, Welch, offered no quantitation of the overall risk or of the risk attributable to a specific product exposure.
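For readers who want to see the distinction in operation, a minimal simulation (with arbitrary, invented parameters) illustrates the coverage interpretation: about 95 percent of intervals constructed this way contain the true value, but any single interval either does or does not contain it.

```python
import random
import statistics

def coverage_simulation(true_mean=10.0, sigma=2.0, n=50, trials=10_000, z=1.96):
    """Draw repeated samples, build a 95% CI from each, and count how often it covers the true mean."""
    covered = 0
    for _ in range(trials):
        sample = [random.gauss(true_mean, sigma) for _ in range(n)]
        m = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if m - z * se <= true_mean <= m + z * se:
            covered += 1
    return covered / trials

print(coverage_simulation())  # close to 0.95
```

The 95 percent attaches to the procedure over repetitions, not to the one interval a litigant happens to put before the court.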

The court indulged the plaintiffs’ no-threshold fantasy but recognized that the risks of low-level asbestos exposure were low, and likely below a doubling of risk, an issue that the court stressed it wanted to avoid.  The court cited one study that suggested a risk (odds) ratio of 1.1 for exposures less than 0.5 fiber/ml-years.  See id. at *5 (citing Y. Iwatsubo et al., “Pleural mesothelioma: dose-response relation at low levels of asbestos exposure in a French population-based case-control study,” 148 Am. J. Epidemiol. 133 (1998) (estimating an odds ratio of 1.1 for exposures less than 0.5 fibers/ml-years)).  But the court, which tried to be precise elsewhere, appears to have lost its way in citing Iwatsubo here.  After all, how can a single odds ratio of 1.1 describe all exposures from 0 all the way up to 0.5 f/ml-years?  How can a single odds ratio describe all exposures in this range, regardless of fiber type, when chrysotile asbestos carries little to no risk for mesothelioma, and certainly carries orders of magnitude less risk than amphibole fibers such as amosite and crocidolite?  And if a low-level exposure has a risk ratio of 1.1, how can plaintiffs’ hired expert witness, Welch, even make the attribution of Dixon’s mesothelioma to the entirety of her exposure, let alone the speculative take-home chrysotile exposure involved from Ford’s brake linings?  Obviously, had the court posed these questions, it would have realized that “it is not possible” to permit Welch’s testimony at all.

The court further lost its way in addressing the exculpatory epidemiology put forward by the defense expert witnesses:

“Furthermore, the leading epidemiological report cited by Ford and its amici that specifically studied ‘brake mechanics’, P.A. Hessel et al., ‘Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study’, 24 Risk Analysis 547 (2004), does not at all dispel the notion that this population faced an increased risk of mesothelioma due to their industrial asbestos exposure. … When calculated at the 95% confidence level, Hessel et al. estimated that the odds ratio of mesothelioma could have been as low as 0.01 or as high as 4.71, implying a nearly quintupled risk of mesothelioma among the population of brake mechanics. 24 Risk Analysis at 550–51.”

Id. at *8.  Again, the court is fixated on the confidence interval, to the exclusion of the estimated magnitude of the association!  This time, after earlier shouting that it was the lower bound of the interval that matters scientifically, the court emphasizes the upper bound.  The court here has strayed far from the actual data, and any plausible interpretation of them:

“The odds ratio (OR) for employment in brake installation or repair was 0.71 (95% CI: 0.30-1.60) when controlled for insulation or shipbuilding. When a history of employment in any of the eight occupations with potential asbestos exposure was controlled, the OR was 0.82 (95% CI: 0.36-1.80). ORs did not increase with increasing duration of brake work. Exclusion of those with any of the eight exposures resulted in an OR of 0.62 (95% CI: 0.01-4.71) for occupational brake work.”

P.A. Hessel et al., “Mesothelioma Among Brake Mechanics: An Expanded Analysis of a Case-control Study,” 24 Risk Analysis 547, 547 (2004).  All of Dr. Hessel’s estimates of effect sizes were below 1.0, and he found no trend for duration of brake work.  Cherry picking out the upper bound of a single subgroup analysis for emphasis was unwarranted, and hardly did justice to the facts or the science.

Dr. Welch’s conclusion that the exposure and risk in this case were “substantial” simply was not a scientific conclusion, and without it her testimony did not provide information for the jury to use in reaching its conclusion as to substantial factor causation. Id. at *7.  The court noted that Welch, and the plaintiffs, may have lacked scientific data to provide estimates of Dixon’s exposure to asbestos or relative risk of mesothelioma, but ignorance or uncertainty was hardly the basis to warrant an expert witness’s belief that the relevant exposures and risks are “substantial.” Id. at *10.  The court was well justified in being discomforted by the conclusory, unscientific opinion rendered by Laura Welch.

In the final puzzle of the Dixon case, the court vacated the judgment, and remanded for a new trial, “either without her opinion on substantiality or else with some quantitative testimony that will help the jury fulfill its charge.”  Id. at *10.  The court thus seemed to imply that an expert witness need not utter the magic word, “substantial,” for the case to be submitted to the jury against a brake defendant in a take-home exposure case.  Given the state of the record, the court should have simply reversed and rendered judgment for Ford.

Meta-Meta-Analysis — The Gadolinium MDL — More Than Ix’se Dixit

June 8th, 2012

There is a tendency, for better or worse, for legal bloggers to be partisan cheerleaders over litigation outcomes.  I admit that most often I am dismayed by judicial failures or refusals to exclude dubious plaintiffs’ expert witnesses’ opinion testimony, and I have been known to criticize such decisions.  Indeed, I wouldn’t mind seeing courts exclude dubious defendants’ expert witnesses.  I have written approvingly about cases in which judges have courageously engaged with difficult scientific issues, seen through the smoke screen, and properly assessed the validity of the opinions expressed.  The Gadolinium MDL (No. 1909) Daubert motions and decision offer a fascinating case study of a challenge to an expert witness’s meta-analysis, an effective defense of the meta-analysis, and a judicial decision to admit the testimony, based upon the meta-analysis.  In re Gadolinium-Based Contrast Agents Prods. Liab. Litig., 2010 WL 1796334 (N.D. Ohio May 4, 2010) [hereafter Gadolinium], reconsideration denied, 2010 WL 5173568 (June 18, 2010).

Plaintiffs proffered general causation opinions, linking gadolinium contrast media and Nephrogenic Systemic Fibrosis (“NSF”), through a nephrologist, Joachim H. Ix, M.D., with training in epidemiology.  Dr. Ix’s opinions were based in large part upon a meta-analysis he conducted on data in published observational studies.  Judge Dan Aaron Polster, the MDL judge, itemized the defendant’s challenges to Dr. Ix’s proposed testimony:

“The previously-used procedures GEHC takes issue with are:

(1) the failure to consult with experts about which studies to include;

(2) the failure to independently verify which studies to select for the meta-analysis;

(3) using retrospective and non-randomized studies;

(4) relying on studies with wide confidence intervals; and

(5) using a “more likely than not” standard for causation that would not pass scientific scrutiny.”

Gadolinium at *23.  Judge Polster confidently dispatched these challenges.  Dr. Ix, as a nephrologist, had subject-matter expertise with which to develop inclusionary and exclusionary criteria on his own.  The defendant never articulated what, if any, studies were inappropriately included or excluded.  The complaint that Dr. Ix had used retrospective and non-randomized studies also rang hollow in the absence of any showing that there were randomized clinical trials with pertinent data at hand.  Once a serious concern of nephrotoxicity arose, clinical trials were unethical, and the defendant never explained why observational studies were somehow inappropriate for inclusion in a meta-analysis.

Relying upon studies with wide confidence intervals can be problematic, but that is one of the reasons to conduct a meta-analysis, assuming the model assumptions for the meta-analysis can be verified.  The plaintiffs effectively relied upon a published meta-analysis, which pre-dated their expert witness’s litigation effort, in which the authors used less conservative inclusionary criteria, and reported a statistically significant summary estimate of risk, with an even wider confidence interval.  R. Agarwal, et al., “Gadolinium-based contrast agents and nephrogenic systemic fibrosis: a systematic review and meta-analysis,” 24 Nephrol. Dialysis & Transplantation 856 (2009).  As the plaintiffs noted in their opposition to the challenge to Dr. Ix:

“Furthermore, while GEHC criticizes Dr. Ix’s CI from his meta-analysis as being “wide” at (5.18864 and 25.326) it fails to share with the court that the peer-reviewed Agarwal meta-analysis, reported a wider CI of (10.27–69.44)… .”

Plaintiff’s Opposition to GE Healthcare’s Motion to Exclude the Opinion Testimony of Joachim Ix at 28 (Mar. 12, 2010)[hereafter Opposition].

Wider confidence intervals certainly suggest greater levels of random error, but Dr. Ix’s intervals suggested statistical significance, and he had carefully considered statistical heterogeneity.  Opposition at 19. (Heterogeneity was never advanced by the defense as an attack on Dr. Ix’s meta-analysis).  Remarkably, the defendant never advanced a sensitivity analysis to suggest or to show that reasonable changes to the evidentiary dataset could result in loss of statistical significance, as might be expected from the large intervals.  Rather, the defendant relied upon the fact that Dr. Ix had published other meta-analyses in which the confidence interval was much narrower, and then claimed that he had “required” these narrower confidence intervals for his professional, published research.  Memorandum of Law of GE Healthcare’s Motion to Exclude Certain Testimony of Plaintiffs’ Generic Expert, Joachim H. Ix, MD, MAS, In re Gadolinium MDL No. 1909, Case: 1:08-gd-50000-DAP  Doc #: 668   (Filed Feb. 12, 2010)[hereafter Challenge].  There never was, however, a showing that narrower intervals were required for publication, and the existence of the published Agarwal meta-analysis contradicted the suggestion.
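The opinion does not describe the mechanics of Dr. Ix’s pooling.  The sketch below is only the generic inverse-variance (fixed-effect) machinery, with invented study-level inputs, offered to show why combining several imprecise studies yields a narrower interval than any one of them alone:

```python
import math

def fixed_effect_pool(log_ors, ses, z=1.96):
    """Inverse-variance (fixed-effect) pooling of study-level log odds ratios."""
    weights = [1 / se ** 2 for se in ses]
    pooled_log = sum(w * lor for w, lor in zip(weights, log_ors)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    lower = math.exp(pooled_log - z * pooled_se)
    upper = math.exp(pooled_log + z * pooled_se)
    return math.exp(pooled_log), lower, upper

# Hypothetical log odds ratios and standard errors; not the studies in the MDL.
print(fixed_effect_pool([2.3, 2.8, 2.0], [0.9, 0.7, 1.1]))
```

Whether a fixed-effect or random-effects model is appropriate turns on heterogeneity, which is why the absence of any heterogeneity challenge by the defense mattered.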

Interestingly, the defense did not call attention to Dr. Ix’s providing an incorrect definition of the confidence interval!  Here is how Dr. Ix described the confidence interval, in language quoted by plaintiffs in their Opposition:

“The horizontal lines display the “95% confidence interval” around this estimate. This 95% confidence interval reflects the range of odds ratios that would be observed 95 times if the study was repeated 100 times, thus the narrower these confidence intervals, the more precise the estimate.”

Opposition at 20.  The confidence interval does not provide a probability distribution for the parameter of interest; rather, the procedure that generates such intervals covers the “true value” of the parameter with the stated frequency over repeated samples.

Finally, the defendant never showed any basis for suggesting that a scientific opinion on causation requires something more than a “more likely than not” basis.

Judge Polster also addressed some more serious challenges:

“Defendants contend that Dr. Ix’s testimony should also be excluded because the methodology he utilized for his generic expert report, along with varying from his normal practice, was unreliable. Specifically, Defendants assert that:

(1) Dr. Ix could not identify a source he relied upon to conduct his meta-analysis;

(2) Dr. Ix imputed data into the study;

(3) Dr. Ix failed to consider studies not reporting an association between GBCAs and NSF; and

(4) Dr. Ix ignored confounding factors.”

Gadolinium at *24.

IMPUTATION

The first point, above – the alleged failure to identify a source for conducting the meta-analysis – rings fairly hollow, and Judge Polster easily deflected it.  The second point raised a more interesting challenge.  In the words of defense counsel:

“However, in arriving at this estimate, Dr. Ix imputed, i.e., added, data into four of the five studies.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  Specifically, Dr. Ix added a single case of NSF without antecedent GBCA exposure to the patient data in the underlying studies.

* * *

During his deposition, Dr. Ix could not provide any authority for his decision to impute the additional data into his litigation meta-analysis.  (See Sept. 22 Ix Dep. Tr. (Ex. 20), at 149:10-151:4.)  When pressed for any authority supporting his decision, Dr. Ix quipped that ‘this may be a good question to ask a Ph.D level biostatistician about whether there are methods to [calculate an odds ratio] without imputing a case [of NSF without antecedent GBCA exposure]’.”

Challenge at 12-13.

The deposition reference suggests that the examiner had scored a debating point by catching Dr. Ix unprepared, but by the time the parties briefed the challenge, the plaintiffs had the issue well in hand, citing A. W. F. Edwards, “The Measure of Association in a 2 × 2 Table,” 126 J. Royal Stat. Soc. Series A 109 (1963); R.L. Plackett, “The Continuity Correction in 2 x 2 Tables,” 51 Biometrika 327 (1964).  Opposition at 36 (describing the process of imputation in the event of zero counts in the cells of a 2 x 2 table for odds ratios).  There are qualms to be stated about imputation, but the defense failed to make them.  As a result, the challenge overall lost momentum and credibility.  As the trial court stated the matter:

“Next, there is no dispute that Dr. Ix imputed data into his meta-analysis. However, as Defendants acknowledge, there are valid scientific reasons to impute data into a study. Here, Dr. Ix had a valid basis for imputing data. As explained by Plaintiffs, Dr. Ix’s imputed data is an acceptable technique for avoiding the calculation of an infinite odds ratio that does not accurately measure association.7 Moreover, Dr. Ix chose the most conservative of the widely accepted approaches for imputing data.8 Therefore, Dr. Ix’s decision to impute data does not call into question the reliability of his meta-analysis.”

Gadolinium at *24.
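The opinion describes Dr. Ix as having imputed a single case; the textbook version of the same idea, reflected in the continuity-correction literature the plaintiffs cited, adds a small constant to every cell of a 2 x 2 table containing a zero so that a finite odds ratio can be computed.  A minimal sketch, not Dr. Ix’s actual procedure:

```python
def odds_ratio_with_correction(a, b, c, d, k=0.5):
    """Odds ratio for a 2 x 2 table, adding k to every cell when any cell is zero."""
    if 0 in (a, b, c, d):
        a, b, c, d = a + k, b + k, c + k, d + k
    return (a * d) / (b * c)

# With a zero cell the raw odds ratio is undefined or infinite;
# the corrected table yields a finite estimate.
print(odds_ratio_with_correction(12, 3, 0, 40))
```

The correction buys a computable estimate at the price of some arbitrariness, which is the sort of qualm the defense might have pressed but did not.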

FAILURE TO CONSIDER NULL STUDIES

The defense’s challenge included a claim that Dr. Ix had arbitrarily excluded studies in which there was no reported incidence of NSF.  The defense brief unfortunately does not describe the studies excluded, and what, if any, effect their inclusion in the meta-analysis would have had.  This was, after all, the crucial issue.  The abstract nature of the defense claim left the matter ripe for misrepresentation by the plaintiffs:

“GEHC continues to misunderstand the role of a meta-analysis and the need for studies that included patients both that did or did not receive GBCAs and reported on the incidence of NSF, despite Dr. Ix’s clear elucidation during his deposition. (Ix Depo. TR [Exh.1] at 97-98).  Meta-analyses such as performed by Dr. Ix and Dr. Agarwal search for whether or not there is a statistically valid association between exposure and disease event. In order to ascertain the relationship between the exposure and event one must have an event to evaluate. In other words, if you have a study in which the exposed group consists of 10,000 people that are exposed to GBCAs and none develop NSF, compared to a non-exposed group of 10,000 who were not exposed to GBCAs and did not develop NSF, the study provides no information about the association between GBCAs and NSF or the relative risk of developing NSF.”

Opposition at 37-38 (emphasis in original).  What is fascinating about this particular challenge, and the plaintiffs’ response, is the methodological hypocrisy exhibited.  In essence, the plaintiffs argued that imputation was appropriate in a case-control study, in which one cell contained a zero, but they would ignore entirely the data from a cohort study with zero events.  To be sure, case-control studies are more efficient than cohort studies for identifying and assessing risk ratios for rare outcomes.  Nevertheless, the plaintiffs could easily have been hoisted with their own hypothetical petard.  No one in 10,000 gadolinium-exposed patients developed NSF; and no one in a control group did either.  The hypothetical study suggests that the rate of NSF is low and not different in the exposed and in the unexposed patients.  The risk ratio could be obtained by imputing an integer for the cells containing zero, and a confidence interval calculated.  The risk ratio, of course, would be 1.0.
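Applying the same sort of zero-event correction to the plaintiffs’ own hypothetical (10,000 exposed and 10,000 unexposed, with no NSF cases in either arm) produces exactly the null risk ratio described above.  A short illustration, with the correction chosen for simplicity rather than fidelity to any party’s method:

```python
def risk_ratio_with_correction(cases_exp, n_exp, cases_unexp, n_unexp, k=0.5):
    """Risk ratio with a simple zero-event correction applied to both arms."""
    if cases_exp == 0 or cases_unexp == 0:
        cases_exp, cases_unexp = cases_exp + k, cases_unexp + k
        n_exp, n_unexp = n_exp + k, n_unexp + k
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)

# The plaintiffs' hypothetical: large, informative, and null.
print(risk_ratio_with_correction(0, 10_000, 0, 10_000))  # 1.0
```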

Unfortunately, the defense did not make this argument; nor did it explore where the meta-analysis might have come out had Dr. Ix applied a more even-handed methodology.  The gap allowed the trial court to brush the challenge aside:

“The failure to consider studies not reporting an association between GBCAs and NSF also does not render Dr. Ix’s meta-analysis unreliable. The purpose of Dr. Ix’s meta-analysis was to study the strength of the association between an exposure (receiving GBCA) and an outcome (development of NSF). In order to properly do this, Dr. Ix necessarily needed to examine studies where the exposed group developed NSF.”

Gadolinium at *24.  Judge Polster, with no help from the defense brief, missed the irony of Dr. Ix’s willingness to impute data in the case-control 2 x 2 contingency tables, but not in the relative risk tables.

CONFOUNDING

Defendants complained that Dr. Ix had ignored the possibility that confounding factors had contributed to the development of NSF.  Challenge at 13.  Defendants went so far as to charge Dr. Ix with misleading the court by failing to consider other possible causative exposures or conditions.  Id.

Defendants never identified the existence, source, or likely magnitude of any confounding factors.  As a result, the plaintiffs’ argument, based in the Reference Manual, that confounding was an unlikely explanation for a very large risk ratio was enthusiastically embraced by the trial court, virtually verbatim from the plaintiffs’ Opposition (at 14):

“Finally, the Court rejects Defendants’ argument that Dr. Ix failed to consider confounding factors. Plaintiffs argued and Defendants did not dispute that, applying the Bradford Hill criteria, Dr. Ix calculated a pooled odds ratio of 11.46 for the five studies examined, which is higher than the 10 to 1 odds ratio of smoking and lung cancer that the Reference Manual on Scientific Evidence deemed to be “so high that it is extremely difficult to imagine any bias or confounding factor that may account for it.” Id. at 376.  Thus, from Dr. Ix’s perspective, the odds ratio was so high that a confounding factor was improbable. Additionally, in his deposition, Dr. Ix acknowledged that the cofactors that have been suggested are difficult to confirm and therefore he did not try to specifically quantify them. (Doc # : 772-20, at 27.) This acknowledgement of cofactors is essentially equivalent to the Agarwal article’s representation that “[t]here may have been unmeasured variables in the studies confounding the relationship between GBCAs and NSF,” cited by Defendants as a representative model for properly considering confounding factors. (See Doc # : 772, at 4-5.)”

Gadolinium at *24.

The real problem is that the defendant’s challenge pointed only to possible, unidentified causal agents.  The smoking/lung cancer analogy, provided by the Reference Manual, was inapposite.  Smoking is indeed a large risk factor for lung cancer, with relative risks over 20.  Although there are other human lung carcinogens, none is consistently in the same order of magnitude (not even asbestos), and as a result, confounding can generally be excluded as an explanation for the large risk ratios seen in smoking studies.  It would be easy to imagine that there are confounders for NSF, especially given that the condition has only relatively recently been identified, and that they might be of the same or greater magnitude as the association suggested for the gadolinium contrast media.  The defense, however, failed to identify confounders that actually threatened the validity of any of the individual studies, or of the meta-analysis.
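One way to make the “too large to be explained by confounding” intuition precise, although neither side appears to have framed it this way, is Cornfield’s classic condition: for an unmeasured confounder alone to account for an observed risk ratio, the confounder must be more prevalent among the exposed than among the unexposed by at least that same ratio (and comparably associated with the disease):

```latex
\frac{P(\text{confounder} \mid \text{exposed})}{P(\text{confounder} \mid \text{unexposed})}
\;\ge\; \mathrm{RR}_{\text{observed}}
```

For an odds ratio above 11, that is a demanding showing, but it is a showing the defense could have tried to meet with identified candidate confounders rather than abstract possibilities.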

CONCLUSION

The defense hinted at the general unreliability of meta-analysis, with references to the Reference Manual on Scientific Evidence at 381 (2d ed. 2000)(noting problems with meta-analysis), and other, relatively dated papers.  See, e.g., John Bailar, “Assessing Assessments,” 277 Science 529 (1997)(arguing that “problems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with [meta-analysis].”).  The Reference Manual’s language, carried over into the third edition, is out of date, and represents a failing of the new edition.  See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011).

The plaintiffs came forward with some descriptive statistics of the prevalence of meta-analysis in contemporary biomedical literature.  The defendants gave mostly argument; there is a dearth of citation to defense expert witnesses, affidavits, consensus papers on meta-analysis, textbooks, papers by leading authors, and the like.  The defense challenge suffered from being diffuse and unfocused; it lost persuasiveness by including weak, collateral issues such as claiming that Dr. Ix was opining “only” on a “more likely than not” basis, and that he had not consulted with other experts, and that he had failed to use randomized trial data.  The defense was quick to attack perceived deficiencies, but it did not illustrate how or why the alleged deficiencies threatened the validity of Dr. Ix’s meta-analysis.  Indeed, even when the defense made strong points, such as the exclusion of zero-event cohort studies, it failed to document that such studies existed, and that their inclusion might have made a difference.

 

On the Importance of Showing Relative Risks Greater Than Two – Haack’s Arguments

May 23rd, 2012

Professor Susan Haack has set out, repeatedly, to criticize the judicial requirement of relative risks greater than two to support findings that exposure to a substance, process, or medication was a specific cause of a plaintiff’s injury.  If for no other reason than the frequency with which Haack has published on this same issue, her views are worth examining more closely.

Haack’s argument, typically, proceeds along the lines that requiring a relative risk greater than two (RR > 2) is improper because a RR > 2 is neither necessary nor sufficient for finding specific causation.  See, e.g., Susan Haack, “Warrant, Causation, and the Atomism of Evidence Law,” 5 Episteme 253, 261 (2008)[hereafter “Warrant“];  “Proving Causation: The Holism of Warrant and the Atomism of Daubert,” 4 J. Health & Biomedical Law 273, 304 (2008)[hereafter “Proving Causation“].

Unlike the more sophisticated reasons offered by Professor Sander Greenland, Professor Haack’s reasoning betrays a misunderstanding of both the law and the science.

Haack:  RR > 2 Not Sufficient

Haack argues that RR > 2 is not sufficient for two reasons:

“Epidemiological evidence of a doubling of risk is not sufficient for specific causation: first, because if the study showing a doubling of risk is poorly-designed or poorly-executed, we would have only a low epistemological likelihood of a greater than 50% statistical probability; and second, because even a well-designed and well-conducted study might also show that those subjects who develop D [some claimed causally related disease] when exposed to S [some substance] have some characteristic in common – older patients rather than younger, perhaps, or women rather than men, or the sedentary rather than the active – and our plaintiff might be an elderly, sedentary female.”

Proving Causation at 304 (emphasis added).

The first argument is largely irrelevant to the legal context in which the RR > 2 rationale arises.  Typically, plaintiffs assert general and specific causation on the basis of a complex evidentiary display.  This display includes evidence of an epidemiologic association, but the magnitude of the association is weak, with RR > 1, but < 2.  Thus the defendants challenge the attributability of the injury in the plaintiff’s individual case.  The overall evidentiary display may or may not support general causation, but even if general causation were conceded, specific causation would remain as an independent factual issue.  Haack’s first “argument” is that the RR > 2 argument is insufficient because the study with RR > 2 may lack internal validity on grounds that it was poorly designed, poorly conducted, or poorly analyzed.  True, true, but immaterial.  On motions for summary judgment or directed verdict, the trial court would resolve any factual issues about disputed validity in favor of the non-moving party.  The defense may have better studies that show the RR = 1, but these would not factor into the decision to grant or refuse the motion.  (If the defense can show that the plaintiffs’ studies with RR > 2 are fatally flawed, then the plaintiffs might be relegated to their studies with lower risk.)

Haack’s second reason appears to go to external validity.  She suggests that a study at issue may be in a population that shares key risk factors with the plaintiff.  Why this similarity would suggest that RR > 2 is not sufficient is quite mysterious.  External validity would support the applicability of the study, with its RR > 2, not militate against its sufficiency.  If the “characteristic in common” is the basis for an interaction with the exposure to S, then we would expect that to be shown by the data in the study; it would not, and should not, be a matter of conjecture or speculation.

Haack:  RR > 2 Not Necessary

Similarly, Haack argues that RR > 2 is not necessary for two reasons:

“And epidemiological evidence of a doubling of risk is not necessary for specific causation, either: first, because studies that fail to show a doubling of risk may be flawed – for example, by failing to take account of the period of pregnancy in which subjects are exposed to S, or by failing to take account of the fact that subjects are included who may have been exposed to S in cold medication or sleep-aids; 99 and second, because even a good epidemiological study indicating to a high degree of epistemic likelihood that there is a doubling of risk may also indicate that those subjects who develop D have some characteristic (such as being over 50 or sedentary or subject to allergies or whatever) that this plaintiff lacks.100”

Proving Causation at 304 (emphasis added).

Again, Haack’s reasoning is nothing other than an invitation to speculate.  Sure, studies with RR < 2 may be flawed, but the existence of flaws in the studies is hardly a warrant for concluding that the true RR > 2.  The evidence is the thing; and she is quick to point out elsewhere:  absence of evidence is not evidence of absence.  And so a flawed study is not particularly probative of anything; it cannot be made into affirmative evidence of the opposite result by the existence of a flaw.  Haack seems to be suggesting that the studies at issue, with RR < 2, may be biased low by misclassification or other systematic bias.  Again, true, true, and immaterial.  An epidemiologic study may suffer bias (or not), but if it does, the usual path is to conduct the study again without the previous bias.  Sometimes the data may be re-analyzed, and the march of progress is in the direction of having underlying data accessible to permit some degree of re-analysis.  In any event, cases with RR < 2, or RR = 2, are not transformed into cases of RR > 2, solely by hand waving or speculation over the existence of potential bias.  The existence and direction of the bias remain something that must be shown by competent evidence.

As for the second argument, again, Haack invokes external invalidity as a possible reason that a RR > 2 does not necessarily require a finding for plaintiff.  The plaintiff may be sufficiently different from study participants such that the RR > 2 is not relevant.  This argument hardly undermines a requirement for a RR > 2, based upon a relevant study.

These arguments are repeated virtually verbatim in Warrant, where Haack asserts for the same reasons that a RR > 2 is neither necessary nor sufficient for showing specific causation.  Warrant at 261.

In an unpublished paper, which Haack has presented several times over the last few years, she has criticized the RR > 2 argument as an example of flawed “probabilism” in the law.  Susan Haack, “Risky Business:  Statistical Proof of Individual Causation,” in Jordi Ferrer Beltrán, ed., Casuación y atribución de responsibilidad (Madrid: Marcial Pons, forthcoming)[hereafter Risky Business]; Presentation at the Hastings Law School (Jan. 20, 2012); Presentation at University of Girona (May 24, 2011).

While there is some merit to Haack’s criticisms of probabilism, they miss the important point, which is that sometimes probabilistic inference is all there is.  Haack cites the New Jersey Supreme Court’s decision in Landrigan as supporting her notion that “other evidence,” presumably particularistic, plaintiff-specific evidence, plus a RR < 2 will suffice:

“The following year (1992), in Landrigan, the Supreme Court of New Jersey briskly observed that ‘a relative risk of 2.0 is not so much a password to a finding of causation as one piece of evidence among others’.”

Risky Business at 22 (citing and quoting Landrigan v. Celotex Corp., 127 N.J. 404, 419, 605 A.2d 1079 (1992)).

Haack, however, advances a common, but mistaken, reading of Landrigan, where the Court blurred the distinction between sufficiency and admissibility of expert witness opinion on specific causation.  Landrigan, and another case, Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992), were both tried to juries, about the same time, in different counties in New Jersey.  (My former partner Terri Keeley tried Landrigan; I tried Caterinicchio.)  There was no motion to exclude expert witness testimony in either case; nor was there a motion for summary judgment ever lodged pre-trial.  Both cases involved motions for directed verdict, in which the defense invited the trial courts to accept the plaintiffs’ expert witnesses’ opinions, arguendo, and to focus on the inference of specific causation, which was based upon the assertion that both Mr. Landrigan and Mr. Caterinicchio had an increased risk of colorectal cancer as a result of their occupational asbestos exposure.  Admissibility was never in issue.

There were no valid biomarkers, no “fingerprints” of causation; no evidence of either plaintiff’s individual, special vulnerability.  The plaintiffs had put in their cases and rested; the trial courts were required to assume that the facts were as presented by the plaintiffs.  All the plaintiffs had offered, however, of any possible relevance, was a relative risk statistic. The trial courts in both cases granted the directed verdicts, and separate panels of the New Jersey Appellate Division affirmed.  Riding roughshod over the evidence, the New Jersey Supreme Court granted certification in both cases, and reversed and remanded for new trials.

Haack does an admirable job of echoing the speculation advanced by plaintiffs on appeal, in both Landrigan and Caterinicchio.  She speculates that the plaintiffs may have had greater than average exposure, or that they were somehow more vulnerable than the average exposed person in the relevant studies.

To paraphrase a Rumsfeldian bon mot:  The litigants must go to trial with the evidence that they have.

Both cases were remanded for new trials.  What is often not reported or discussed in connection with these two cases is that plaintiffs’ counsel dismissed Landrigan before proceeding with a new trial.  Caterinicchio was indeed retried — to a defense verdict.

Haack Attack on Legal Probabilism

May 6th, 2012

Last year, Professor Susan Haack presented a lecture on “legal probabilism,” at a conference on Standards of Proof and Scientific Evidence, held at the University of Girona, in Spain.  The lecture can be viewed on-line, and a manuscript of Haack’s paper is available, as well.  Susan Haack, “Legal Probabilism:  An Epistemological Dissent” (2011)(cited here as “Haack”).  Professor Haack has franked her paper as a draft, with an admonition “do not cite without permission,” an imperative that has no moral or legal force.  Her imperative certainly has no epistemic warrant.  We will ignore it.

As I have noted previously, here and there, Professor Haack is a Professor of philosophy and of law, at the University of Miami, Florida.  She has written widely on the philosophy of science, in the spirit of Peirce’s pragmatism.  Despite her frequent untutored judgments about legal matters, much of what she has written is a useful corrective to formalistic writings on “the scientific method,” and is worthy of study by lawyers interested in the intersection of science and the law.

The video of Professor Haack’s presentation is worth watching to get an idea of how ad hominem her style is.  I won’t repeat her aspersions and pejorative comments here.  They are not in her paper, and I will take her paper, which she posted online, as the expression of her mature thinking.

Invoking Lord Russell and Richard von Mises, Haack criticizes the reduction of epistemology to a calculus of probability.  Russell, for instance, cautioned against confusing the credibility of a claim with the probability that the claim is true:

“[I]t is clear that some things are almost certain, while others are matters of hazardous conjecture. For a rational man, there is a scale of doubtfulness, from simple logical and arithmetical propositions and perceptive judgments, at one end, to such questions as what language the Myceneans spoke or “what song the Sirens sang” at the other … , [T]he rational man, who attaches to each proposition the right degree of credibility, will be guided by the mathematical theory of probability when it is applicable. … The concept ‘degree of credibility’, however, is applicable much more widely than that of mathematical probability.”

Bertrand Russell, Human Knowledge, Its Scope and Limits 381 (N.Y. 1948)(quoted in Haack, supra, at 1).   Haack argues that ordinary language is beguiling.  We use “probably” to hedge our commitment to the truth of a prediction or a proposition of fact.  We insert the adverb “probably” to recognize that our statement might turn out false, although we have no idea of how likely, and no way of quantifying the probability of error.  Thus,

“[w]e commonly use the language of probability or likelihood when we talk about the credibility or warrant of a claim-about how likely is it, given this evidence, that the claim is true, or, unconditionally, about how probable the claim is.”

Haack at 14.

Epistemology is the “thing,” and psychology, not.  Haack admits that legal language is inconsistent:  sometimes the law appears to embrace psychological states of mind as relevant criteria for decisions; sometimes the law is expressly looking at epistemic warrant for the truth of a claim.  Flipping the philosophical bird to Derrida and Feyerabend, Haack argues that trials are searches for the truth, and that our notions of substantial justice require replacement of psychological standards of proof, to the extent that they are merely subjective and non-epistemic, with a clear theory of epistemic warrant.  Haack at 6 (citing Tehan v. United States, 383 U.S. 406, 416 (1966)(“the purpose of a trial is to determine the truth”)); id. at 7 (citing In re Winship, 397 U.S. 358, 368, 370 (1970) (Harlan, J., concurring)(the standard of proof is meant to “instruct the factfinder concerning the degree of confidence our society thinks he should have in the correctness of factual conclusions for a particular type of adjudication”)).

Haack points out that there are instances where evidence seems to matter more than subjective state of mind, although the law sometimes equivocates.  She cautions us that “we shouldn’t simply assume, just because the word ‘probable’ or ‘probability’ occurs in legal contexts, that we are dealing with mathematical, rather than epistemological, probabilities.”  Haack at 16 (citing and quoting Thomas Starkie, et al., A Practical Treatise of the Law of Evidence and Digest of Proofs in Civil and Criminal Proceedings vol. I, 579 (Philadelphia 1842)(“That … moral probabilities … could ever be represented by numbers … and thus be subject to numerical analysis … cannot but be regarded as visionary and chimerical.”)).  Thus the criminal standard, “beyond a reasonable doubt,” seems to be about a state of mind, but it is described, at least some of the time, as being about the quality and strength of the evidence needed to attain such a state of mind.  The standards of “preponderance of the evidence” and “clear and convincing evidence,” on the other hand, appear to be directly related to the strength of the evidentiary display offered by the party with the burden of proof.

An example that Haack might have used, but did not, is the requirement that an expert witness express an opinion to a “reasonable degree of medical or scientific certainty.”  The law is not particularly concerned about the psychological state of certainty possessed by the witness:  the witness may be a dogmatist with absolute certainty but no epistemic warrant; and that simply will not do.

Of course, the preponderance standard is alternatively expressed as the burden to show that the disputed fact is “more likely than not” correct, and that brings us back to explicit probabilisms in the law.  Haack’s argument would be bolstered by acknowledging the work of Professor Kahneman, who makes the interesting point, in several places, that experts, or for that matter anyone making decisions, are not necessarily expert at determining their own level of certainty.  Can someone really say that he believes a set of claims has been shown to be 50.1% true, and have an intelligent discussion with another person, who adamantly believes that the claims have been shown to be only 49.9% true?  Do they resolve their differences by splitting the difference?  Unless we are dealing with an explicit set of frequencies or proportions, the language of probability is metaphorical.

Haack appropriates the term warrant for her epistemological theory, but the usage seems much older, and not novel with Haack.  In any event, Haack sets out her theory of “warrants”:

“(i) How supportive the evidence is; analogue: how well a crossword entry fits with the clue and intersecting completed entries. Evidence may be supportive (positive, favorable), undermining (negative, unfavorable), or neutral (irrelevant) with respect to some conclusion.

(ii) How secure the reasons are, independent of the claim in question; analogue:  how reasonable the completed intersecting entries are, independent of the entry in question. The better the independent security of positive reasons, the more warranted the conclusion, but the better the independent security of negative reasons, the less warranted the conclusion.

(iii) How comprehensive the evidence is, i.e., how much of the relevant evidence it includes; analogue: how much of the crossword has been completed. More comprehensive evidence gives more warrant to a conclusion than less comprehensive evidence does iff the additional evidence is at least as favorable as the rest.”

Haack at 18 (internal citation omitted).  According to Haack, the calculus of probabilities does not help in computing degrees of epistemic warrant.  Id. at 20. Her reasons are noteworthy:

  • “since quality of evidence has several distinct dimensions (supportiveness, independent security, comprehensiveness), and there is no way to rank relative success and failure across these different factors, there is no guarantee even of a linear ordering of degrees of warrant;
  • while the probability of p and the probability of not-p must add up to 1, when there is no evidence, or only very weak evidence, either way, neither p nor not-p may be warranted to any degree; and
  • while the probability of p and q (for independent p and q) is the product of the two, and hence, unless both are 1, less than the probability of either, the warrant of a conjunction may be higher than the warrant of its components”

Id. at 20-21.  The third bullet appears to have been a misfire.  If we were to use Bayes’ theorem, the two pieces of evidence would require sequential adjustments to our posterior odds or probability; we would not multiply the two probabilities directly.
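To see the difference concretely, here is a minimal sketch in Python, with purely hypothetical numbers, of sequential Bayesian updating: each independent item of evidence adjusts the odds on the hypothesis through its likelihood ratio, and the warrant after both items exceeds the warrant after either item alone, even though naive multiplication of the probabilities would always shrink.

# A minimal sketch (hypothetical numbers) of sequential Bayesian updating:
# each independent item of evidence adjusts the odds on the hypothesis through
# its likelihood ratio; we do not multiply the probabilities of the claim itself.

def update_odds(prior_odds, likelihood_ratio):
    # Bayes' rule in odds form: posterior odds = prior odds x likelihood ratio
    return prior_odds * likelihood_ratio

prior_probability = 0.5                        # hypothetical starting credence
odds = prior_probability / (1 - prior_probability)

# Two independent items of evidence, each assumed to be four times more
# probable if the hypothesis is true than if it is false (likelihood ratio 4).
for likelihood_ratio in (4.0, 4.0):
    odds = update_odds(odds, likelihood_ratio)

posterior = odds / (1 + odds)
print(round(posterior, 3))    # 0.941 -- higher than after either item alone (0.8)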

Haack’s attack on legal probabilism blinds her to the reality that sometimes all there is in a legal case is probabilistic evidence.  For instance, in the litigation over claims that asbestos causes colorectal cancer, plaintiffs had only a relative risk statistic to support their desired inference that asbestos had caused their colorectal cancers.  There was no other evidence.  (On general causation, the animal studies failed to find colorectal cancer from asbestos ingestion, and the “weight of evidence” was against an association in any event.)  Nonetheless, Haack cites one case as a triumph of her anti-probabilistic viewpoint:

“Here I am deliberately echoing the words of the Supreme Court of New Jersey in Landrigan, rejecting the idea that epidemiological evidence of a doubling of risk is sufficient to establish specific causation in a toxic-tort case: ‘a relative risk of 2.0 is not so much a password to a finding of causation as one piece of evidence among many’.114 This gets the key epistemological point right.”

Landrigan v. Celotex Corp., 127 N.J. 405, 419, 605 A.2d 1079 (1992).  Well, not really.  Had Haack read the Landrigan decision, including the lower courts’ opinions, she would be aware that there were no other pieces of evidence.  There were no biomarkers, no “fingerprints” of causation; no evidence of Mr. Landrigan’s individual, special vulnerability.  The case went up to the New Jersey Supreme Court, along with a companion case, as a result of directed verdicts.  Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992). The plaintiffs had put in their cases and rested; the trial courts were required to assume that the facts were as presented by the plaintiffs.  All the plaintiffs had offered, however, of any possible relevance, was a relative risk statistic.

Haack’s fervent anti-probabilism obscures the utility of probability concepts, especially when probabilities are all we have.   In another jarring example, Haack seems to equate any use of Bayes’ theorem, or any legal analysis that invokes an assessment of probability, with misguided “legal probabilism.”  For instance, Haack writes:

“Mr. Raymond Easton was arrested for a robbery on the basis of a DNA “cold hit”; statistically, the probability was very low that the match between Mr. Easton’s DNA (on file after an arrest for domestic violence) and DNA found at the crime scene was random. But Mr. Easton, who suffered from Parkinson’s disease, was too weak to dress himself or walk more than a few yards-let alone to drive to the crime scene, or to commit the crime.”

Haack at 37 (internal citation omitted).  Bayes’ Theorem, with its requirement that a base rate, or prior probability, be included in the analysis, provides the answer to Haack’s misguided complaint about DNA cold hits.
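To make the point concrete, here is a minimal back-of-the-envelope sketch in Python, with entirely hypothetical numbers, of how the base rate does the work in a cold-hit case: a one-in-a-million random-match probability, combined with a prior that reflects a database trawl rather than independent incriminating evidence, yields a posterior probability of source attribution well short of certainty.

# All numbers hypothetical. Bayes' theorem applied to a DNA "cold hit":
# a tiny random-match probability does not, by itself, make guilt nearly
# certain once the base rate (prior probability) is taken into account.

random_match_prob = 1e-6          # P(match | not the source), hypothetical
pool_size = 500_000               # hypothetical pool of plausible sources searched

prior = 1 / pool_size             # prior that a given person in the pool is the source

# P(source | match) = P(match | source) * P(source) /
#   [P(match | source) * P(source) + P(match | not source) * P(not source)]
posterior = (1.0 * prior) / (1.0 * prior + random_match_prob * (1 - prior))
print(round(posterior, 3))        # about 0.667 -- far from proof beyond a reasonable doubt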

 

Judge Posner’s Digression on Regression

April 6th, 2012

Cases that deal with linear regression are not particularly exciting except to a small band of “quant” lawyers who see such things “differently.”  Judge Posner, the author of several books, including Economic Analysis of Law (8th ed. 2011), is a judge who sees things differently as well.

In a case decided late last year, Judge Posner took the occasion to chide the district court and the parties’ legal counsel for failing to assess critically a regression analysis offered by an expert witness on the quantum of damages in a contract case.  ATA Airlines Inc. (ATA), a subcontractor of Federal Express Corporation, sued FedEx for breaching an alleged contract to include ATA in a lucrative U.S. military deal.

Remarkably, the claim of contract liability was a non-starter; the panel of the Seventh Circuit reversed the judgment entered in favor of the plaintiff.  There never was a contract, and so the case should never have gone to trial.  ATA Airlines, Inc. v. Federal Exp. Corp., 665 F.3d 882, 888-89 (7th Cir. 2011).

End of Story?

In a diversity case, based upon state law, with no liability, you would think that the panel would, and perhaps should, stop once it reached the conclusion that there was no contract upon which to predicate liability.  Anything more would be, of course, pure obiter dictum, but Judge Posner could not resist the teaching moment, for the trial judge below, the parties, their counsel, and the bar:

“But we do not want to ignore the jury’s award of damages, which presents important questions that have been fully briefed and are bound to arise in future cases.”

Id. at 889. That award of damages was based upon plaintiff’s expert witness’s regression analysis.  Judge Posner was perhaps generous in suggesting that the damages issue, as it involved a regression analysis, had been fully briefed.  Neither party addressed the regression with the level of scrutiny given by Judge Posner and his colleagues, Judges Wood and Easterbrook.

The Federal Express defense lawyers were not totally asleep at the wheel; they did object on Rule 702 grounds to the regression analysis offered by plaintiff’s witness, Lawrence D. Morriss, a forensic accountant.

“There were, as we’re about to see, grave questions concerning the reliability of Morriss’s application of regression analysis to the facts. Yet in deciding that the analysis was admissible, all the district judge said was that FedEx’s objections ‘that there is no objective test performed, and that [Morriss] used a subjective test, and [gave] no explanation why he didn’t consider objective criteria’, presented issues to be explored on cross-examination at trial, and that ‘regression analysis is accepted, so this is not “junk science.” [Morriss] appears to have applied it. Although defendants disagree, he has applied it and come up with a result, which apparently is acceptable in some areas under some models. Simple regression analysis is an accepted model.”

Id. (quoting District Judge Richard L. Young).

Apparently it is not enough for trial judges within the Seventh Circuit to wave their hands and proclaim that objections go to weight not admissibility; nor is it sufficient to say that a generally accepted technique was involved in formulating an opinion without exploring whether the technique was employed properly and reliably.  Judge Posner’s rebuke was short on subtlety and tact in describing the district judge’s response to FedEx’s Rule 702 objections:

“This cursory, and none too clear, response to FedEx’s objections to Morriss’s regression analysis did not discharge the duty of a district judge to evaluate in advance of trial a challenge to the admissibility of an expert’s proposed testimony. The evaluation of such a challenge may not be easy; the ‘principles and methods’ used by expert witnesses will often be difficult for a judge to understand. But difficult is not impossible. The judge can require the lawyer who wants to offer the expert’s testimony to explain to the judge in plain English what the basis and logic of the proposed testimony are, and the judge can likewise require the opposing counsel to explain his objections in plain English.”

Id. The lawyers, including Federal Express’s lawyers, also came in for admonishment:

“This might not have worked in the present case; neither party’s lawyers, judging from the trial transcript and the transcript of the Rule 702 hearing and the briefs and oral argument in this court, understand regression analysis; or if they do understand it they are unable to communicate their understanding in plain English.”

Id.

The court and counsel are not without resources, as Judge Posner pointed out.  The trial court can appoint its own expert to assist in evaluating the parties’ expert witnesses’ opinions.  Alternatively, the trial judge could roll up his sleeves and read the chapter on regression analysis in the Reference Manual on Scientific Evidence (3d ed. 2011). Id. at 889-890.  Judge Posner’s opinion makes clear that had the trial court taken any of these steps, Morriss’s regression analysis would not have survived the Rule 702 challenge.

Morriss’s analysis was, to be sure, a rather peculiar regression of costs on revenues.  Inexplicably, ATA’s witness made cost the dependent variable, with revenue the independent variable.  Common sense would have told the judge that revenue (gained or lost) should have been the dependent term in the analysis.  ATA’s expert witness attempted to justify this peculiar regression by claiming that the more plausible variables that make up costs (personnel, labor, fuel, equipment) were not available.  Judge Posner would have none of this incredible excuse-mongering:

“In any event, a plaintiff’s failure to maintain adequate records is not a justification for an irrational damages theory.”

Id. at 893.

Judge Posner proceeded to dissect Morriss’s regression in detail, both in terms of its design and implementation.  Interestingly, FedEx had a damages expert witness, who was not called at trial.  Judge Posner correctly observed that defendants frequently do not call their damages witnesses at trial lest the jury infer that they are less than sincere in their protestations about no liability.  The FedEx damages expert, however, had calculated a 95 percent confidence interval for Morriss’s prediction for ATA’s costs in a year after the alleged breach of contract.  (It is unclear whether the interval calculated was truly a confidence interval, or a prediction interval, which would have been wider.)  In any event, the interval included costs at the high end, which would have resulted in net losses, rather than net profits, as Morriss had opined.  “All else aside, the confidence interval is so wide that there can be no reasonable confidence in the jury’s damages award.”  Id. at 896.
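For readers curious about the difference, here is a minimal sketch in Python, using fabricated data, of why a prediction interval for a single new observation is always wider than the confidence interval for the mean response at the same point; nothing in it purports to reproduce Morriss’s actual model.

# Fabricated data; simple least-squares fit of "costs" on "revenues" only to
# illustrate the difference between a confidence interval for the mean response
# and a prediction interval for a single new observation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)                            # hypothetical revenues
y = 5.0 + 0.8 * x + rng.normal(0.0, 1.0, x.size)    # hypothetical costs

n = x.size
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))           # residual standard error
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 12.0                                           # new revenue figure
y0 = intercept + slope * x0
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)      # for the mean response
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)  # for a new observation

print("95% CI for mean response:", (y0 - t_crit * se_mean, y0 + t_crit * se_mean))
print("95% prediction interval:", (y0 - t_crit * se_pred, y0 + t_crit * se_pred))  # wider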

After summarizing the weirdness of Morriss’s regression analysis, Judge Posner delivered his coup de grâce:

“This is not nitpicking. Morriss’s regression had as many bloody wounds as Julius Caesar when he was stabbed 23 times by the Roman Senators led by Brutus. We have gone on at such length about the deficiencies of the regression analysis in order to remind district judges that, painful as it may be, it is their responsibility to screen expert testimony, however technical; we have suggested aids to the discharge of that responsibility. The responsibility is especially great in a jury trial, since jurors on average have an even lower comfort level with technical evidence than judges. The examination and cross-examination of Morriss were perfunctory and must have struck most, maybe all, of the jurors as gibberish. It became apparent at the oral argument of the appeal that even ATA’s lawyer did not understand Morriss’s analysis; he could not answer our questions about it but could only refer us to Morriss’s testimony. And like ATA’s lawyer, FedEx’s lawyer, both at the trial and in his appellate briefs and at argument, could only parrot his expert.

***

If a party’s lawyer cannot understand the testimony of the party’s own expert, the testimony should be withheld from the jury. Evidence unintelligible to the trier or triers of fact has no place in a trial. See Fed.R.Evid. 403, 702.”

Id. at 896.  Ouch!  Even being the victor can be a joyless occasion before Judge Posner.  For those who are interested in such things, the appellate briefs of the parties can be found online, both for ATA and for FedEx.

It is interesting to compare Judge Posner’s close scrutiny and analysis of the plaintiff’s expert witness’s regression with how the United States Supreme Court treated a challenge to the use of multiple regression in a race discrimination case in the mid-1980s.  In Bazemore v. Friday, 478 U.S. 385 (1986), the defendant criticized the plaintiffs’ regression on grounds that it omitted variables for major factors in any fair, sensible model of salary.  The Fourth Circuit had treated the omissions as fatal, but the Supreme Court excused the omissions by shifting the burden of producing a sensible, reliable regression model to the defense:

“The Court of Appeals erred in stating that petitioners’ regression analyses were ‘unacceptable as evidence of discrimination’, because they did not include ‘all measurable variables thought to have an effect on salary level’. The court’s view of the evidentiary value of the regression analysis was plainly incorrect. While the omission of variables from a regression analysis may render the analysis less probative than it otherwise might be, it can hardly be said, absent some other infirmity, that an analysis which accounts for the major factors ‘must be considered unacceptable as evidence of discrimination’. Ibid. Normally, failure to include variables will affect the analysis’ probativeness, not its admissibility.”

Id. at 400.  The Court, buried in a footnote, made an abstract concession that “there may, of course, be some regressions so incomplete as to be inadmissible as irrelevant; but such was clearly not the case here.” Id. at 400 n.15.  When the Court decided Bazemore, the federal courts were still enthralled with their libertine approach to expert witness evidence.  It is unclear whether a straightforward analysis of the plaintiffs’ regression analyses in Bazemore under current Rule 702, without the incendiary claims of racism, would have permitted a more dispassionate analysis of the proffered evidence.

Confidence in Intervals and Diffidence in the Courts

March 4th, 2012

Next year, the Supreme Court’s Daubert decision will turn 20.  The decision, in interpreting Federal Rule of Evidence 702, dramatically changed the landscape of expert witness testimony.  Still, there are many who would turn the clock back and disable the gatekeeping function.  In past posts, I have identified scholars, such as Erica Beecher-Monas and the late Margaret Berger, who tried to eviscerate judicial gatekeeping.  Recently, a student note argued for the complete abandonment of all judicial control of expert witness testimony.  See Note, “Admitting Doubt: A New Standard for Scientific Evidence,” 123 Harv. L. Rev. 2021 (2010)(arguing that courts should admit all relevant evidence).

One advantage that comes from requiring trial courts to serve as gatekeepers is that the expert witnesses’ reasoning is approved or disapproved in an open, transparent, and rational way.  Trial courts subject themselves to public scrutiny in a way that jury decision making does not permit.  The critics of Daubert often engage in a cynical attempt to remove all controls over expert witnesses in order to empower juries to act on their populist passions and prejudices.  When courts misinterpret statistical and scientific evidence, there is some hope of changing subsequent decisions by pointing out their errors.  Jury errors, on the other hand, unless they involve determinations of issues for which there was “no evidence,” are immune to institutional criticism or correction.

Despite my whining, not all courts butcher statistical concepts.  There are many astute judges out there who see error and call it error.  Take, for instance, the trial judge who was confronted with this typical argument:

“While Giles admits that a p-value of .15 is three times higher than what scientists generally consider statistically significant—that is, a p-value of .05 or lower—she maintains that this ‘‘represents 85% certainty, which meets any conceivable concept of preponderance of the evidence.’’ (Doc. 103 at 16).”

Giles v. Wyeth, Inc., 500 F.Supp. 2d 1048, 1056-57 (S.D.Ill. 2007), aff’d, 556 F.3d 596 (7th Cir. 2009).  Despite having case law cited to it (such as In re Ephedra), the trial court looked to the Reference Manual on Scientific Evidence, a resource that seems to be ignored by many federal judges, and rejected the bogus argument.  Unfortunately, the lawyers who made the bogus argument still are licensed, and at large, to incite the same error in other cases.

This business perhaps would be amenable to an empirical analysis.  An enterprising sociologist of the law could conduct some survey research on the science and math training of the federal judiciary, on whether federal judges have read chapters of the Reference Manual before deciding cases involving statistics or science, and on whether federal judges have expressed the need for further education.  This survey evidence could be capped by an analysis of the prevalence of certain kinds of basic errors, such as the transpositional fallacy committed by so many judges (but decisively rejected in the Giles case).  Perhaps such an empirical analysis would advance our understanding of whether we need specialty science courts.

One of the reasons that the Reference Manual on Scientific Evidence is worthy of so much critical attention is that the volume has the imprimatur of the Federal Judicial Center, and now the National Academies of Science.  Putting aside the idiosyncratic chapter by the late Professor Berger, the Manual clearly presents guidance on many important issues.  To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges.

Unfortunately, the Manual has competition from lesser authors whose work obscures, misleads, and confuses important issues.  Consider an article by two would-be expert witnesses, who testify for plaintiffs, and confidently misstate the meaning of a confidence interval:

“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”

Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 Am. J. L. & Med. 189, 210 (2004).  This misstatement was then cited and quoted with obvious approval by Professor Beecher-Monas, in her text on scientific evidence.  Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 60-61 n. 17 (2007).   Beecher-Monas goes on, however, to argue that confidence interval coefficients are not the same as burdens of proof, but then implies that scientific standards of proof are different from the legal preponderance of the evidence.  She provides no citation or support for the higher burden of scientific proof:

“Some commentators have attributed the causation conundrum in the courts to the differing burdens of proof in science and law.28 In law, the civil standard of ‘more probable than not’ is often characterized as a probability greater than 50 percent.29 In science, on the other hand, the most widely used standard is a 95 percent confidence interval (corresponding to a 5 percent level of significance, or p-level).30 Both sound like probabilistic assessment. As a result, the argument goes, civil judges should not exclude scientific testimony that fails scientific validity standards because the civil legal standards are much lower. The transliteration of the ‘more probable than not’ standard of civil factfinding into a quantitative threshold of statistical evidence is misconceived. The legal and scientific standards are fundamentally different. They have different goals and different measures.  Therefore, one cannot justifiably argue that evidence failing to meet the scientific standards nonetheless should be admissible because the scientific standards are too high for preponderance determinations.”

Id. at 65.  This seems to be on the right track, although Beecher-Monas does not state clearly whether she subscribes to the notion that the burdens of proof in science and law differ.  The argument then takes a wrong turn:

“Equating confidence intervals with burdens of persuasion is simply incoherent. The goal of the scientific standard – the 95 percent confidence interval – is to avoid claiming an effect when there is none (i.e., a false positive).31”

Id. at 66.  But this is plainly wrong:  confidence intervals are not burdens of persuasion, legal or scientific.  Beecher-Monas is not, however, content to leave this alone:

“Scientists using a 95 percent confidence interval are making a prediction about the results being due to something other than chance.”

Id. at 66 (emphasis added).  Other than chance?  Well, this implies causality, as well as bias and confounding, but the confidence interval, like the p-value, addresses only random or sampling error.  Beecher-Monas’s error is neither random nor scientific.  Indeed, she perpetuates the same error committed by the Fifth Circuit in a frequently cited Bendectin case, which interpreted the confidence interval as resolving questions of the role of matters “other than chance,” such as bias and confounding.  Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 311-12 (5th Cir. 1989)(“Fortunately, we do not have to resolve any of the above questions [as to bias and confounding], since the studies presented to us incorporate the possibility of these factors by the use of a confidence interval.”)(emphasis in original).  See, e.g., David H. Kaye, David E. Bernstein, and Jennifer L. Mnookin, The New Wigmore – A Treatise on Evidence:  Expert Evidence § 12.6.4, at 546 (2d ed. 2011); Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 86-87 (2009)(criticizing the overinterpretation of confidence intervals by the Brock court).

Clapp, Ozonoff, and Beecher-Monas are not alone in offering bad advice to judges who must help resolve statistical issues.  Déirdre Dwyer, a prominent scholar of expert evidence in the United Kingdom, manages to bundle up the transpositional fallacy and a misstatement of the meaning of the confidence interval into one succinct exposition:

“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”

Déirdre Dwyer, The Judicial Assessment of Expert Evidence 154-55 (Cambridge Univ. Press 2008).

Of course, Clapp, Ozonoff, Beecher-Monas, and Dwyer build upon a long tradition of academics’ giving errant advice to judges on this very issue.  See, e.g., Christopher B. Mueller, “Daubert Asks the Right Questions:  Now Appellate Courts Should Help Find the Right Answers,” 33 Seton Hall L. Rev. 987, 997 (2003)(describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 Judicature 12, 19 (2003)(“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”) (incorrectly citing the first edition of Rothman & Greenland, Modern Epidemiology 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New Reference Manual on Scientific Evidence,” 74 N.C.L.Rev. 1183, 1212 n.172 (1996)(“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

Who has prevailed?  The statistically correct authors of the statistics chapter of the Reference Manual on Scientific Evidence, or the errant commentators?  It would be good to have some empirical evidence to help evaluate the judiciary’s competence. Here are some cases, many drawn from the Manual‘s discussions, arranged chronologically, before and after the first appearance of the Manual:

Before First Edition of the Reference Manual on Scientific Evidence:

DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 948 (3d Cir. 1990)(“A 95% confidence interval is constructed with enough width so that one can be confident that it is only 5% likely that the relative risk attained would have occurred if the true parameter, i.e., the actual unknown relationship between the two studied variables, were outside the confidence interval.   If a 95% confidence interval thus contains ‘1’, or the null hypothesis, then a researcher cannot say that the results are ‘statistically significant’, that is, that the null hypothesis has been disproved at a .05 level of significance.”)(internal citations omitted)(citing in part, D. Barnes & J. Conley, Statistical Evidence in Litigation § 3.15, at 107 (1986), as defining a CI as “a limit above or below or a range around the sample mean, beyond which the true population is unlikely to fall”).

United States ex rel. Free v. Peters, 806 F. Supp. 705, 713 n.6 (N.D. Ill. 1992) (“A 99% confidence interval, for instance, is an indication that if we repeated our measurement 100 times under identical conditions, 99 times out of 100 the point estimate derived from the repeated experimentation will fall within the initial interval estimate … .”), rev’d in part, 12 F.3d 700 (7th Cir. 1993)

DeLuca v. Merrell Dow Pharms., Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992)(“A 95% confidence interval means that there is a 95% probability that the ‘true’ relative risk falls within the interval”), aff’d, 6 F.3d 778 (3d Cir. 1993)

Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1353-54 & n.1 (6th Cir. 1992)(describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”)

Hilao v. Estate of Marcos, 103 F.3d 767, 787 (9th Cir. 1996)(Rymer, J., dissenting and concurring in part).

After the first publication of the Reference Manual on Scientific Evidence:

American Library Ass’n v. United States, 201 F.Supp. 2d 401, 439 & n.11 (E.D.Pa. 2002), rev’d on other grounds, 539 U.S. 194 (2003)

SmithKline Beecham Corp. v. Apotex Corp., 247 F.Supp.2d 1011, 1037-38 (N.D. Ill. 2003)(“the probability that the true value was between 3 percent and 7 percent, that is, within two standard deviations of the mean estimate, would be 95 percent”)(also confusing attained significance probability with posterior probability: “This need not be a fatal concession, since 95 percent (i.e., a 5 percent probability that the sign of the coefficient being tested would be observed in the test even if the true value of the sign was zero) is an  arbitrary measure of statistical significance.  This is especially so when the burden of persuasion on an issue is the undemanding ‘preponderance’ standard, which  requires a confidence of only a mite over 50 percent. So recomputing Niemczyk’s estimates as significant only at the 80 or 85 percent level need not be thought to invalidate his findings.”), aff’d on other grounds, 403 F.3d 1331 (Fed. Cir. 2005)

In re Silicone Gel Breast Implants Prods. Liab. Litig., 318 F.Supp.2d 879, 897 (C.D. Cal. 2004) (interpreting a relative risk of 1.99, in a subgroup of women who had had polyurethane foam covered breast implants, with a 95% CI that ran from 0.5 to 8.0, to mean that “95 out of 100 a study of that type would yield a relative risk somewhere between on 0.5 and 8.0.  This huge margin of error associated with the PUF-specific data (ranging from a potential finding that implants make a woman 50% less likely to develop breast cancer to a potential finding that they make her 800% more likely to develop breast cancer) render those findings meaningless for purposes of proving or disproving general causation in a court of law.”)(emphasis in original)

Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc., 482 F.Supp. 2d 478, 495 (D.N.J. 2007)(“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”)(apparently relying upon a party’s expert witness’s report), aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd., 344 Fed.Appx. 595 (Fed. Cir. 2009)

Eli Lilly & Co. v. Teva Pharms, USA, 2008 WL 2410420, *24 (S.D.Ind. 2008)(stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”)

Benavidez v. City of Irving, 638 F.Supp. 2d 709, 720 (N.D. Tex. 2009)(interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”)

Estate of George v. Vermont League of Cities and Towns, 993 A.2d 367, 378 n.12 (Vt. 2010)(erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”)

Correct Statements

There is no reason for any of these courts to have struggled so with the concept of statistical significance or of the confidence interval.  These concepts are well elucidated in the Reference Manual on Scientific Evidence (RMSE):

“To begin with, ‘confidence’ is a term of art. The confidence level indicates the percentage of the time that intervals from repeated samples would cover the true value. The confidence level does not express the chance that repeated estimates would fall into the confidence interval.91

* * *

According to the frequentist theory of statistics, probability statements cannot be made about population characteristics: Probability statements apply to the behavior of samples. That is why the different term ‘confidence’ is used.”

RMSE 3d at 247 (2011).

Even before the Manual, many capable authors had tried to reach the judiciary to help judges learn and apply statistical concepts more confidently.  Professors Michael Finkelstein and Bruce Levin, of Columbia University’s Law School and Mailman School of Public Health, respectively, have worked hard to educate lawyers and judges in the important concepts of statistical analyses:

“It is the confidence limits PL and PU that are random variables based on the sample data. Thus, a confidence interval (PL, PU ) is a random interval, which may or may not contain the population parameter P. The term ‘confidence’ derives from the fundamental property that, whatever the true value of P, the 95% confidence interval will contain P within its limits 95% of the time, or with 95% probability. This statement is made only with reference to the general property of confidence intervals and not to a probabilistic evaluation of its truth in any particular instance with realized values of PL and PU. “

Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers at 169-70 (2d ed. 2001)
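A small simulation, offered only as an illustration in Python, shows the coverage property that Finkelstein and Levin (and the Reference Manual) describe: across repeated samples, roughly 95% of the realized intervals cover the true parameter, even though no probability statement attaches to any single realized interval.

# Illustration of the coverage property of confidence intervals: the true
# proportion is known to the simulation, and we count how often the realized
# 95% intervals (Wald intervals here) cover it across repeated samples.
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.30                  # population parameter, known only to the simulation
n, trials = 200, 10_000
covered = 0
for _ in range(trials):
    sample = rng.binomial(1, true_p, n)
    p_hat = sample.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    covered += (lo <= true_p <= hi)
print(covered / trials)        # approximately 0.95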

Courts have no doubt confused, to some extent, the operational definition of a confidence interval with the role of the sample point estimate as an estimator of the population parameter.  In some instances, the sample statistic may be the best estimate of the population parameter, but that estimate may be rather crummy because of the sampling error involved.  See, e.g., Kenneth J. Rothman, Sander Greenland, Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008) (“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible. * * *  A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”); Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)(“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”); Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)(“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval … .”).

Admittedly, I have given an impressionistic account, and I have used anecdotal methods, to explore the question whether the courts have improved in their statistical assessments in the 20 years since the Supreme Court decided Daubert.  Many decisions go unreported, and perhaps many errors are cut off from the bench in the course of testimony or argument.  I personally doubt that judges exercise greater care in their comments from the bench than they do in published opinions.  Still, the quality of care exercised by the courts would be a worthy area of investigation by the Federal Judicial Center, or perhaps by other sociologists of the law.

Scientific illiteracy among the judiciary

February 29th, 2012

Ken Feinberg, speaking at a symposium on mass torts, asks what legal challenges mass torts confront in the federal courts.  The answer seems obvious.

Pharmaceutical cases that warrant federal court multi-district litigation (MDL) treatment typically involve complex scientific and statistical issues.  The public deserves having MDL cases assigned to judges who have special experience and competence to preside in cases in which these complex issues predominate.  There appears to be no procedural device to ensure that the judges selected in the MDL process have the necessary experience and competence, and a good deal of evidence to suggest that the MDL judges are not up to the task at hand.

In the aftermath of the Supreme Court’s decision in Daubert, the Federal Judicial Center assumed responsibility for producing science and statistics tutorials to help judges grapple with technical issues in their cases.  The Center has produced videotaped lectures as well as the Reference Manual on Scientific Evidence, now in its third edition.  Despite the Center’s best efforts, many federal judges have shown themselves to be incorrigible.  It is time to revive the discussions and debates about implementing a “science court.”

The following three federal MDLs all involved pharmaceutical products, well-respected federal judges, and a fundamental error in statistical inference.

Avandia

Avandia is a prescription oral anti-diabetic medication licensed by GlaxoSmithKline (GSK).  Concerns over Avandia’s association with excess heart attack risk resulted in regulatory revisions of its availability, as well as thousands of lawsuits.  In a decision that affected virtually all of those several thousand claims, aggregated for pretrial handling in a federal MDL, a federal judge, in ruling on a Rule 702 motion, described a clinical trial with a risk ratio greater than 1.0, and a p-value of 0.08, as follows:

“The DREAM and ADOPT studies were designed to study the impact of Avandia on prediabetics and newly diagnosed diabetics. Even in these relatively low-risk groups, there was a trend towards an adverse outcome for Avandia users (e.g., in DREAM, the p-value was .08, which means that there is a 92% likelihood that the difference between the two groups was not the result of mere chance).FN72”

In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576, *12 (E.D. Pa. 2011)(Rufe, J.).  This is a remarkable error by a trial judge given the responsibility for pre-trial handling of so many cases.  There are many things you can argue about a p-value of 0.08, but Judge Rufe’s interpretation is not an argument; it is error.  That such an error, explicitly warned against in the Reference Manual on Scientific Evidence, could be made by an MDL judge, over 15 years since the first publication of the Manual, highlights the seriousness and the extent of the illiteracy problem.
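For readers who want to see what a p-value of 0.08 does mean, here is a minimal sketch in Python, with entirely hypothetical trial numbers: in a world in which the null hypothesis is true by construction, chance alone produces a result at least as extreme as the one observed roughly 8% of the time.  Because the calculation assumes the null hypothesis, it cannot tell us the probability that the null hypothesis, or “mere chance,” explains the actual study result.

# Hypothetical numbers. We simulate many two-arm trials in which the null
# hypothesis is true by construction (both arms share the same event risk) and
# count how often chance alone yields a z-statistic at least as extreme as an
# observed value whose two-sided p-value is about 0.08.
import numpy as np

rng = np.random.default_rng(2)
n_per_arm, base_risk = 2000, 0.02      # hypothetical trial size and event risk
observed_z = 1.75                      # two-sided p of roughly 0.08

trials, extreme = 20_000, 0
for _ in range(trials):
    a = rng.binomial(n_per_arm, base_risk)     # events on drug; null is true
    b = rng.binomial(n_per_arm, base_risk)     # events on placebo; null is true
    p_pool = (a + b) / (2 * n_per_arm)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    z = (a - b) / (n_per_arm * se)
    extreme += (abs(z) >= observed_z)
print(extreme / trials)                # roughly 0.08, by construction of the null world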

What possible basis could the Avandia MDL court have to support this clearly erroneous interpretation of crucial studies in the litigation?  Footnote 72 in Judge Rufe’s opinion references a report by plaintiffs’ expert witness, Allan D. Sniderman, M.D, “a cardiologist, medical researcher, and professor at McGill University.” Id. at *10.  The trial court goes on to note that:

“GSK does not challenge Dr. Sniderman’s qualifications as a cardiologist, but does challenge his ability to analyze and draw conclusions from epidemiological research, since he is not an epidemiologist. GSK’s briefs do not elaborate on this challenge, and in any event the Court finds it unconvincing given Dr. Sniderman’s credentials as a researcher and published author, as well as clinician, and his ability to analyze the epidemiological research, as demonstrated in his report.”

Id.

What more evidence could the Avandia MDL trial court possibly have needed to show that Sniderman was incompetent to give statistical and epidemiologic testimony?  Fundamentally at odds with the Manual on an uncontroversial point, Sniderman had given the court a baseless, incorrect interpretation of a p-value.  Everything else he might have to say on the subject was likely suspect.  If, as the court suggested, GSK did not elaborate upon its challenge with specific examples, then shame on GSK.  The trial court, however, could have readily determined that Sniderman was speaking nonsense by reading the chapter on statistics in the Reference Manual on Scientific Evidence.  For all my complaints about gaps in the Manual’s coverage, the text on this issue is clear and concise.  It really is not too much to expect an MDL trial judge to be conversant with the basic concepts of scientific and statistical evidence set out in the Manual, which was prepared to help federal judges.

Phenylpropanolamine (PPA) Litigation

Litigation over phenylpropanolamine was aggregated, within the federal system, before Judge Barbara Rothstein.  Judge Rothstein is not only a respected federal trial judge; she was also the director of the Federal Judicial Center, which produces the Reference Manual on Scientific Evidence.  Her involvement in overseeing the preparation of the third edition of the Manual, however, did not keep Judge Rothstein from badly misunderstanding and misstating the meaning of a p-value in the PPA litigation.  See In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F.Supp. 2d 1230, 1236 n.1 (W.D. Wash. 2003)(“P-values measure the probability that the reported association was due to chance… .”).  Tellingly, Judge Rothstein denied, in large part, the defendants’ Rule 702 challenges.  Juries, however, overwhelmingly rejected the claims that PPA had caused the plaintiffs’ strokes.

Ephedra Litigation

Judge Rakoff, of the Southern District of New York, notoriously committed the transposition fallacy in the Ephedra litigation:

“Generally accepted scientific convention treats a result as statistically significant if the P-value is not greater than .05. The expression ‘P=.05’ means that there is one chance in twenty that a result showing increased risk was caused by a sampling error—i.e., that the randomly selected sample accidentally turned out to be so unrepresentative that it falsely indicates an elevated risk.”

In re Ephedra Prods. Liab. Litig., 393 F.Supp. 2d 181, 191 (S.D.N.Y. 2005).

Judge Rakoff then fallaciously argued that requiring a significance probability of less than 5% increased the “more likely than not” burden of proof upon a civil litigant.  Id. at 188, 193.  See Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 65 (2009).

Judge Rakoff may well have had help in confusing the probability used to characterize the plaintiff’s burden of proof with the probability of attained significance.  At least one of the defense expert witnesses in the Ephedra cases gave an erroneous definition of “statistically significant association,” which may have invited the judicial error:

“A statistically significant association is an association between exposure and disease that meets rigorous mathematical criteria demonstrating that the finding is unlikely to be the result of chance.”

Report of John Concato, MD, MS, MPH, at 7, ¶29 (Sept. 13, 2004).  Dr. Concato’s error was picked up and repeated in the defense briefing of its motion to preclude:

“The likelihood that an observed association could occur by chance alone is evaluated using tests for statistical significance.”

Memorandum of Law in Support of Motion by Ephedra Defendants to Exclude Expert Opinions of Charles Buncher, [et alia] …That Ephedra Causes Hemorrhagic Stroke, Ischemic Stroke, Seizure, Myocardial Infarction, Sudden Cardiac Death, and Heat-Related Illnesses at 9 (Dec. 3, 2004).

Judge Rakoff’s insistence that requiring “statistical significance” at the customary 5% level would change the plaintiffs’ burden of proof, and would require greater certitude from epidemiologists than from other expert witnesses who opine in less “rigorous” fields of learning, is wrong as a matter of fact.  His Honor’s comparison, moreover, ignores the Supreme Court’s observation that the point of Rule 702 is:

‘‘to make certain that an expert, whether basing testimony upon professional studies or personal experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.’’

Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999).

Judge Rakoff not only ignored the conditional nature of significance probability, but he also overinterpreted the role of significance testing in arriving at a conclusion of causality.  Statistical significance may answer the question of the strength of the evidence for ruling out chance as the producer of the data observed, on the assumption that there is no increased risk, but it does not alone answer the question whether the study result shows an increased risk.  Bias and confounding must be considered, along with the other Bradford Hill factors.

Even if the p-value could be turned into a posterior probability of the null hypothesis, there would be many other probabilities that would necessarily diminish the overall probability of the causal claim.  Some of the other factors (which could be expressed as objective or subjective probabilities) include:

  • accuracy of the data reporting
  • data collection
  • data categorization
  • data cleaning
  • data handling
  • data analysis
  • internal validity of the study
  • external validity of the study
  • credibility of study participants
  • credibility of study researchers
  • credibility of the study authors
  • accuracy of the study authors’ expression of their research
  • accuracy of the editing process
  • accuracy of the testifying expert witness’s interpretation
  • credibility of the testifying expert witness
  • other available studies, and their respective data and analysis factors
  • all the other Bradford Hill factors

If these largely independent factors each had a probability or accuracy of 95%, the conjunction of their probabilities would fall well below the feather weight on top of 50% that the preponderance standard requires.  In sum, Judge Rakoff’s argument rests upon confusing significance probability with the posterior probability of the null hypothesis; requiring statistical significance does not subvert the usual standards of proof in civil cases.  See also Sander Greenland, “Null Misinterpretation in Statistical Testing and Its Impact on Health Risk Assessment,” 53 Preventive Medicine 225 (2011).
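As a quick back-of-the-envelope check of that conjunction point, treating the seventeen bulleted factors as roughly independent and each, hypothetically, as 95% probable or accurate:

# Seventeen roughly independent factors, each assumed to be 95% probable or
# accurate; their conjunction falls well below the civil standard of just over 50%.
print(round(0.95 ** 17, 3))    # about 0.418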

WHENCE COMES THIS ERROR?

As a matter of intellectual history, I wonder where this error entered the judicial system.  As a general matter, there was not much judicial discussion of statistical evidence before the 1970s.  The earliest manifestation of the transpositional fallacy in connection with scientific and statistical evidence appears in an opinion of the United States Court of Appeals for the District of Columbia Circuit.  Ethyl Corp. v. EPA, 541 F.2d 1, 28 n.58 (D.C. Cir.), cert. denied, 426 U.S. 941 (1976).  The Circuit’s language is worth looking at carefully:

“Petitioners demand sole reliance on scientific facts, on evidence that reputable scientific techniques certify as certain.

Typically, a scientist will not so certify evidence unless the probability of error, by standard statistical measurement, is less than 5%. That is, scientific fact is at least 95% certain.  Such certainty has never characterized the judicial or the administrative process. It may be that the ‘beyond a reasonable doubt’ standard of criminal law demands 95% certainty.  Cf. McGill v. United States, 121 U.S.App.D.C. 179, 185 n.6, 348 F.2d 791, 797 n.6 (1965). But the standard of ordinary civil litigation, a preponderance of the evidence, demands only 51% certainty. A jury may weigh conflicting evidence and certify as adjudicative (although not scientific) fact that which it believes is more likely than not. ***”

Id.  The 95% certainty appears to derive from 95% confidence intervals, although “confidence” is a term of art in statistics, and it most certainly does not mean the probability of the alternative hypothesis under consideration.  Similarly, the error that is less than 5% is not the probability that the hypothesis of no difference between observations and expectations is wrong; it is the probability of observing data at least as extreme as the data actually observed, on the assumption that the observed equals the expected.  The District of Columbia Circuit thus created a strawman:  scientific certainty is 95%, whereas civil and administrative law certainty is 51%.  This is rubbish, which confuses the frequentist probability from hypothesis testing with the subjective probability for belief in a fact.
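The confusion can be stated compactly in conventional notation (a sketch, not the court’s own formulation):

\[
p \;=\; \Pr(\text{data at least as extreme as those observed} \mid H_0) \;\neq\; \Pr(H_0 \mid \text{data}),
\]

and, a fortiori, 1 − p is neither the probability that the alternative hypothesis is true nor the “certainty” of any scientific fact.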

The transpositional fallacy has a good pedigree, but that does not make it correct.  Only a lawyer would suggest that a mistake once made was somehow binding upon future litigants.  The following collection of citations and references illustrate how widespread the fundamental misunderstanding of statistical inference is, in the courts, in the academy, and at the bar.  If courts cannot deliver fair, accurate adjudication of scientific facts, then it is time to reform the system.


Courts

U.S. Supreme Court

Vasquez v. Hillery, 474 U.S. 254, 259 n.3 (1986) (“the District Court . . . accepted . . . a probability of 2 in 1,000 that the phenomenon was attributable to chance”)

U.S. Court of Appeals

First Circuit

Fudge v. Providence Fire Dep’t, 766 F.2d 650, 658 (1st Cir. 1985) (“Widely accepted statistical techniques have been developed to determine the likelihood an observed disparity resulted from mere chance.”)

Second Circuit

Nat’l Abortion Fed. v. Ashcroft, 330 F. Supp. 2d 436 (S.D.N.Y. 2004), aff’d in part, 437 F.3d 278 (2d Cir. 2006), vacated, 224 Fed. App’x 88 (2d Cir. 2007) (reporting an expert witness’s interpretation of a p-value of 0.30 to mean that there was a 30% probability that the study results were due to chance alone)

Smith v. Xerox Corp., 196 F.3d 358, 366 (2d Cir. 1999) (“If an obtained result varies from the expected result by two standard deviations, there is only about a .05 probability that the variance is due to chance.”)

Waisome v. Port Auth., 948 F.2d 1370, 1376 (2d Cir. 1991) (“about one chance in 20 that the explanation for a deviation could be random”)

Ottaviani v. State Univ. of New York at New Paltz, 875 F.2d 365, 372 n.7 (2d Cir. 1989)

Murphy v. General Elec. Co., 245 F. Supp. 2d 459, 467 (N.D.N.Y. 2003) (“less than a 5% probability that age was related to termination by chance”)

Third Circuit

United States v. State of Delaware, 2004 WL 609331, *10 n.27 (D. Del. 2004) (“there is a 5% (or 1 in 20) chance that the relationship observed is purely random”)

Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 605 n.26 (D.N.J. 2002) (“only 5% probability that an observed association is due to chance”)

Fifth Circuit

EEOC v. Olson’s Dairy Queens, Inc., 989 F.2d 165, 167 (5th Cir. 1993) (“Dr. Straszheim concluded that the likelihood that [the] observed hiring patterns resulted from truly race-neutral hiring practices was less than one chance in ten thousand.”)

Capaci v. Katz & Besthoff, Inc., 711 F.2d 647, 652 (5th Cir. 1983) (“the highest probability of unbiased hiring was 5.367 × 10-20”), cert. denied, 466 U.S. 927 (1984)

Rivera v. City of Wichita Falls, 665 F.2d 531, 545 n.22 (5th Cir. 1982)(” A variation of two standard deviations would indicate that the probability of the observed outcome occurring purely by chance would be approximately five out of 100; that is, it could be said with a 95% certainty that the outcome was not merely a fluke. Sullivan, Zimmer & Richards, supra n.9 at 74.”)

Vuyanich v. Republic Nat’l Bank, 505 F. Supp. 224, 272 (N.D.Tex. 1980) (“the chances are less than one in 20 that the true coefficient is actually zero”), judgment vacated, 723 F.2d 1195 (5th Cir. 1984).


Seventh Circuit

Adams v. Ameritech Services, Inc., 231 F.3d 414, 424, 427 (7th Cir. 2000) (“it is extremely unlikely (that is, there is less than a 5% probability) that the disparity is due to chance.”)

Sheehan v. Daily Racing Form, Inc., 104 F.3d 940, 941 (7th Cir. 1997) (“An affidavit by a statistician . . . states that the probability that the retentions . . . are uncorrelated with age is less than 5 percent.”)

Eighth Circuit

Craik v. Minnesota State Univ. Bd., 731 F.2d 465, 476 n.13 (8th Cir. 1984) (“Statistical significance is a measure of the probability that an observed disparity is not due to chance. Baldus & Cole, Statistical Proof of Discrimination § 9.02, at 290 (1980). A finding that a disparity is statistically significant at the 0.05 or 0.01 level means that there is a 5 per cent. or 1 per cent. probability, respectively, that the disparity is due to chance.”)

Ninth Circuit

Good v. Fluor Daniel Corp., 222 F.Supp. 2d 1236, 1241 n.9 (E.D. Wash. 2002)(describing “statistical tools to calculate the probability that the difference seen is caused by random variation”)

D.C. Circuit

National Lime Ass’n v. EPA, 627 F.2d 416, 453 (D.C. Cir. 1980)

Federal Circuit

Hodges v. Secretary Dep’t Health & Human Services, 9 F.3d 958, 967 (Fed. Cir. 1993) (Newman, J., dissenting) (“Scientists as well as judges must understand: ‘the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences’.”)(citing and quoting from the Report of the Carnegie Commission on Science, Technology, and Government, Science and Technology in Judicial Decision Making 28 (1993)).


Regulatory Guidance

OSHA’s Guidance for Compliance with Hazard Communication Act:

“Statistical significance is a mathematical determination of the confidence in the outcome of a test. The usual criterion for establishing statistical significance is the p-value (probability value). A statistically significant difference in results is generally indicated by p < 0.05, meaning there is less than a 5% probability that the toxic effects observed were due to chance and were not caused by the chemical. Another way of looking at it is that there is a 95% probability that the effect is real, i.e., the effect seen was the result of the chemical exposure.”

U.S. Dep’t of Labor, Guidance for Hazard Determination for Compliance with the OSHA Hazard Communication Standard (29 CFR § 1910.1200) Section V (July 6, 2007).


Academic Commentators

Lucinda M. Finley, “Guarding the Gate to the Courthouse:  How Trial Judges Are Using Their Evidentiary Screening Role to Remake Tort Causation Rules,” 49 DePaul L. Rev. 335, 348 n.49 (1999):

“Courts also require that the risk ratio in a study be ‘statistically significant,’ which is a statistical measurement of the likelihood that any detected association has occurred by chance, or is due to the exposure. Tests of statistical significance are intended to guard against what are called ‘Type I’ errors, or falsely ascribing a relationship when there in fact is not one (a false positive).  See SANDERS, supra note 5, at 51. The discipline of epidemiology is inherently conservative in making causal ascriptions, and regards Type I errors as more serious than Type II errors, or falsely assuming no association when in fact there is one (false negative). Thus, epidemiology conventionally requires a 95% level of statistical significance, i.e. that in statistical terms it is 95% likely that the association is due to exposure, rather than to chance. See id. at 50-52; Thompson, supra note 3, at 256-58. Despite courts’ use of statistical significance as an evidentiary screening device, this measurement has nothing to do with causation. It is most reflective of a study’s sample size, the relative rarity of the disease being studied, and the variance in study populations. Thompson, supra note 3, at 256.”

 

Erica Beecher-Monas, Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process 42 n. 30 (2007):

 “‘By rejecting a hypothesis only when the test is statistically significant, we have placed an upper bound, .05, on the chance of rejecting a true hypothesis’. Fienberg et al., p. 22. Another way of explaining this is that it describes the probability that the procedure produced the observed effect by chance.”

Professor Fienberg stated the matter correctly, but Beecher-Monas goes on to restate the matter in her own words, erroneously.  Later, she repeats her incorrect interpretation:

“Statistical significance is a statement about the frequency with which a particular finding is likely to arise by chance.19”

Id. at 61 (citing a paper by Sander Greenland, who correctly stated the definition).
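The error is the transposition of a conditional probability.  The p-value is computed on the assumption that the null hypothesis is true; it therefore cannot be the probability that the null hypothesis is true, or that chance alone produced the observed association.  A minimal formal statement of the distinction, in my own notation (not Fienberg’s or Greenland’s):

```latex
% Minimal statement of what a significance probability is, and is not.
% Assumes the amsmath package for \text within display math.
\[
  p \;=\; \Pr\bigl(\text{data at least as extreme as those observed} \mid H_0 \text{ is true}\bigr)
  \;\neq\; \Pr\bigl(H_0 \text{ is true} \mid \text{data}\bigr)
\]
```

Statements of the form “there is a 95% probability that the effect is real” transpose the conditional, and that transposition is the recurring error in many of the judicial and scholarly passages collected in this post.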

Mark G. Haug, “Minimizing Uncertainty in Scientific Evidence,” in Cynthia H. Cwik & Helen E. Witt, eds., Scientific Evidence Review:  Current Issues at the Crossroads of Science, Technology, and the Law – Monograph No. 7, at 87 (2006)

Carl F. Cranor, Regulating Toxic Substances: A Philosophy of Science and the Law at 33-34 (Oxford 1993) (one can think of α, β (the chances of Type I and Type II errors, respectively), and 1 − β as measures of the “risk of error” or “standards of proof”). See also id. at 44, 47, 55, 72-76.

Arnold Barnett, “An Underestimated Threat to Multiple Regression Analyses Used in Job Discrimination Cases,” 5 Indus. Rel. L.J. 156, 168 (1982) (“The most common rule is that evidence is compelling if and only if the probability the pattern obtained would have arisen by chance alone does not exceed five percent.”)

David W. Barnes, Statistics as Proof: Fundamentals of Quantitative Evidence 162 (1983)(“Briefly, however, the findings of statistical significance at the P < .05, P < .04, and P < .02 levels indicate that the court can be 95%, 96%, and 98% certain, respectively, that the null hypotheses involved in the specific tests carried out … should be rejected.”)

Wayne Roth-Nelson & Kathey Verdeal, “Risk Evidence in Toxic Torts,” 2 Envt’l Lawyer 405, 415-16 (1996) (confusing burden of proof with the standard for hypothesis testing, and apparently endorsing the erroneous views given by Judge Newman, dissenting in Hodges). Caveat: Roth-Nelson is now a “forensic” toxicologist, who testifies in civil and criminal trials.

Steven R. Weller, “Book Review: Regulating Toxic Substances: A Philosophy of Science and Law,” 6 Harv. J. L. & Tech. 435, 436, 437-38 (1993) (“only when the statistical evidence gathered from studies shows that it is more than ninety-five percent likely that a test substance causes cancer will the substance be characterized scientifically as carcinogenic … to determine legal causality, the plaintiff need only establish that the probability with which it is true that the substance in question causes cancer is at least fifty percent, rather than the ninety-five percent to prove scientific causality”).

The Carnegie Commission on Science, Technology, and Government, Report on Science and Technology in Judicial Decision Making 28 (1993) (“The reality is that courts often decide cases not on the scientific merits, but on concepts such as burden of proof that operate differently in the legal and scientific realms. Scientists may misperceive these decisions as based on a misunderstanding of the science, when in actuality the decision may simply result from applying a different norm, one that, for the judiciary, is appropriate.  Much, for instance, has been written about ‘junk science’ in the courtroom. But judicial decisions that appear to be based on ‘bad’ science may actually reflect the reality that the law requires a burden of proof, or confidence level, other than the 95 percent confidence level that is often used by scientists to reject the possibility that chance alone accounted for observed differences.”).


Plaintiffs’ Counsel

Steven Rotman, “Don’t Know Much About Epidemiology?” Trial (Sept. 2007) (Author’s question answered in the affirmative:  “P values.  These measure the probability that a reported association between a drug and condition was due to chance.  A P-value of 0.05, which is generally considered the standard for statistical significance, means there is a 5 percent probability that the association was due to chance.”)

Defense Counsel

Bruce R. Parker & Anthony F. Vittoria, “Debunking Junk Science: Techniques for Effective Use of Biostatistics,” 65 Defense Csl. J. 35, 44 (2002) (“a P value of .01 means the researcher can be 99 percent sure that the result was not due to chance”).

Meta-Analysis of Observational Studies in Non-Pharmaceutical Litigations

February 26th, 2012

Yesterday, I posted on several pharmaceutical litigations that have involved meta-analytic studies.  Meta-analytic studies have also figured prominently in non-pharmaceutical product liability litigation, as well as in litigation over videogames, criminal recidivism, and eyewitness testimony.  Some, but not all, of the cases in these other areas of litigation are collected below.  In some cases, the reliability or validity of the meta-analyses was challenged; in others, the court fleetingly referred to meta-analyses relied upon by the parties.  Some of the courts’ treatments of meta-analysis are woefully inadequate or erroneous.  The failure of the Reference Manual on Scientific Evidence to update its treatment of meta-analysis is telling.  See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011).
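For readers who have not worked with the technique, the core arithmetic of most meta-analyses is straightforward inverse-variance pooling of study results.  Below is a minimal sketch of a fixed-effect meta-analysis of odds ratios; the three study results are hypothetical and are not drawn from any case listed in this post.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis of odds ratios.
# The study results below are hypothetical, for illustration only.
import math

# Each tuple: (odds ratio, lower bound of 95% CI, upper bound of 95% CI)
studies = [(1.2, 0.8, 1.8), (1.5, 0.9, 2.5), (1.1, 0.7, 1.7)]

weights, weighted_log_ors = [], []
for odds_ratio, ci_low, ci_high in studies:
    log_or = math.log(odds_ratio)
    # Standard error of the log odds ratio, recovered from the width of the 95% CI
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    weight = 1.0 / se ** 2          # inverse-variance weight
    weights.append(weight)
    weighted_log_ors.append(weight * log_or)

pooled_log_or = sum(weighted_log_ors) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"Pooled OR = {math.exp(pooled_log_or):.2f} "
      f"(95% CI {math.exp(pooled_log_or - 1.96 * pooled_se):.2f} "
      f"to {math.exp(pooled_log_or + 1.96 * pooled_se):.2f})")
```

The disputes in the cases below rarely concern this arithmetic; they concern which studies may properly be combined, whether the studies are too heterogeneous to pool, and whether the summary estimate simply inherits the biases of its component studies.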

 

Abortion (Breast Cancer)

Christ’s Bride Ministries, Inc. v. Southeastern Pennsylvania Transportation Authority, 937 F.Supp. 425 (E.D. Pa. 1996), rev’d, 148 F.3d 242 (3d Cir. 1998)

Asbestos

In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1042 (S.D.N.Y. 1993)(“adding a series of positive but statistically insignificant SMRs [standardized mortality ratios] together does not produce a statistically significant pattern”), rev’d, 52 F.3d 1124 (2d Cir. 1995).

In Re Asbestos Litigation, Texas Multi District Litigation Cause No. 2004-03964 (June 30, 2005)(Davidson, J.)(“The Defendants’ response was presented by Dr. Timothy Lash.  I found him to be highly qualified and equally credible.  He largely relied on the report submitted to the Environmental Protection Agency by Berman and Crump (“B&C”).  He found the meta-analysis contained in B&C credible and scientifically based.  B&C has not been published or formally accepted by the EPA, but it does perform a valuable study of the field.  If the question before me was whether B&C is more credible than the Plaintiffs’ studies taken together, my decision might well be different.”)

Jones v. Owens-Corning Fiberglas, 288 N.J. Super. 258, 672 A.2d 230 (1996)

Berger v. Amchem Prods., 818 N.Y.S.2d 754 (2006)

Grenier v. General Motors Corp., 2009 WL 1034487 (Del.Super. 2009)

Benzene

Knight v. Kirby Inland Marine, Inc., 363 F. Supp. 2d 859 (N.D. Miss. 2005) (precluding proffered opinion that benzene caused bladder cancer and lymphoma; noting, without elaboration or explanation, that meta-analyses are “of limited value in combining the results of epidemiologic studies based on observation”), aff’d, 482 F.3d 347 (5th Cir. 2007)

Baker v. Chevron USA, Inc., 680 F.Supp. 2d 865 (S.D. Ohio 2010)

Diesel Exhaust Exposure

King v. Burlington Northern Santa Fe Ry. Co., 277 Neb. 203, 762 N.W.2d 24 (2009)

Kennecott Greens Creek Mining Co. v. Mine Safety & Health Admin., 476 F.3d 946 (D.C. Cir. 2007)

Eyewitness Testimony

State of New Jersey v. Henderson, 208 N.J. 208, 27 A.3d 872 (2011)

Valle v. Scribner, 2010 WL 4671466 (C.D. Calif. 2010)

People v. Banks, 16 Misc.3d 929, 842 N.Y.S.2d 313 (2007)

Lead

Palmer v. Asarco Inc., 510 F.Supp.2d 519 (N.D. Okla. 2007)

PCBs

In re Paoli R.R. Yard PCB Litigation, 916 F.2d 829, 856-57 (3d Cir.1990) (‘‘There is some evidence that half the time you shouldn’t believe meta-analysis, but that does not mean that meta-analyses are necessarily in error. It means that they are, at times, used in circumstances in which they should not be.’’) (internal quotation marks and citations omitted), cert. denied, 499 U.S. 961 (1991)

Repetitive Stress

Allen v. International Business Machines Corp., 1997 U.S. Dist. LEXIS 8016 (D. Del. 1997)

Tobacco

Flue-Cured Tobacco Cooperative Stabilization Corp. v. United States Envt’l Protection Agency, 4 F.Supp.2d 435 (M.D.N.C. 1998), vacated, 313 F.3d 852 (4th Cir. 2002)

Tocolytics – Medical Malpractice

Hurd v. Yaeger, 2009 WL 2516874 (M.D. Pa. 2009)

Toluene

Black v. Rhone-Poulenc, Inc., 19 F.Supp.2d 592 (S.D.W.Va. 1998)

Video Games (Violent Behavior)

Brown v. Entertainment Merchants Ass’n, ___ U.S. ___, 131 S.Ct. 2729 (2011)

Entertainment Software Ass’n v. Blagojevich, 404 F.Supp.2d 1051 (N.D. Ill. 2005)

Entertainment Software Ass’n v. Hatch, 443 F.Supp.2d 1065 (D. Minn. 2006)

Video Software Dealers Ass’n v. Schwarzenegger, 556 F.3d 950 (9th Cir. 2009)

Vinyl Chloride

Taylor v. Airco, 494 F. Supp. 2d 21 (D. Mass. 2007)(permitting opinion testimony that vinyl chloride caused intrahepatic cholangiocarcinoma, without commenting upon the reasonableness of reliance upon the meta-analysis cited)

Welding

Cooley v. Lincoln Electric Co., 693 F.Supp.2d 767 (N.D. Ohio 2010)

Meta-Analysis in Pharmaceutical Cases

February 25th, 2012

The Third Edition of the Reference Manual on Scientific Evidence attempts to cover a lot of ground in giving the federal judiciary guidance on scientific, medical, statistical, and engineering issues.  It has some successes, and some failures.  One of the major problems in coverage in the new Manual is its inconsistent, sparse, and at points outdated treatment of meta-analysis.  See “The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence” (Nov. 14, 2011).

As I have pointed out elsewhere, the gaps and problems in the Manual’s coverage are not “harmless error,” given that some courts have struggled to deal with methodological and evaluative issues in connection with specific meta-analyses.  See “Learning to Embrace Flawed Evidence – The Avandia MDL’s Daubert Opinion” (Jan. 10, 2011).

Perhaps the reluctance to treat meta-analysis more substantively comes from a perception that the technique for analyzing multiple studies does not come up frequently in litigation.  If so, let me help dispel the notion.  I have collected a partial list of drug and medical device cases that have confronted meta-analysis in one form or another.  In some cases, such as the Avandia MDL, a meta-analysis was a key, or the key, piece of evidence.  In other cases, meta-analysis may have been treated more peripherally.  Still, there are over 20 pharmaceutical cases in the last two decades that have dealt with the statistical techniques involved in meta-analysis.  In another post, I will collect the non-pharmaceutical cases as well.

 

Aredia – Zometa

Deutsch v. Novartis Pharm. Corp., 768 F. Supp. 2d 420 (E.D.N.Y. 2011)

 

Avandia

In re Avandia Marketing, Sales Practices and Product Liability Litigation, 2011 WL 13576, at *12 (E.D. Pa. 2011)

Avon Pension Fund v. GlaxoSmithKline PLC, 343 Fed.Appx. 671 (2d Cir. 2009)

 

Baycol

In re Baycol Prods. Litig., 532 F.Supp. 2d 1029 (D. Minn. 2007)

 

Bendectin

Daubert v. Merrell Dow Pharm., 43 F.3d 1311 (9th Cir. 1995) (on remand from Supreme Court)

DePyper v. Navarro, 1995 WL 788828 (Mich.Cir.Ct. 1995)

 

Benzodiazepine

Vinitski v. Adler, 69 Pa. D. & C.4th 78, 2004 WL 2579288 (Phila. Cty. Ct. Common Pleas 2004)

 

Celebrex – Bextra

In re Bextra & Celebrex Marketing Sales Practices & Prod. Liab. Litig., 524 F.Supp.2d 1166 (N.D. Cal. 2007)


E5 (anti-endotoxin monoclonal antibody for gram-negative sepsis)

Warshaw v. Xoma Corp., 74 F.3d 955 (9th Cir. 1996)

 

Excedrin vs. Tylenol

McNeil-P.C.C., Inc. v. Bristol-Myers Squibb Co., 938 F.2d 1544 (2d Cir. 1991)

 

Fenfluramine, Phentermine

In re Diet Drugs Prod. Liab. Litig., 2000 WL 1222042 (E.D.Pa. 2000)

 

Fosamax

In re Fosamax Prods. Liab. Litig., 645 F.Supp.2d 164 (S.D.N.Y. 2009)

 

Gadolinium

In re Gadolinium-Based Contrast Agents Prod. Liab. Litig., 2010 WL 1796334 (N.D. Ohio 2010)

 

Neurontin

In re Neurontin Marketing, Sales Practices, and Products Liab. Litig., 612 F.Supp.2d 116 (D. Mass. 2009)

 

Paxil (SSRI)

Tucker v. Smithkline Beecham Corp., 2010 U.S. Dist. LEXIS 30791 (S.D.Ind. 2010)

 

Prozac (SSRI)

Rimberg v. Eli Lilly & Co., 2009 WL 2208570 (D.N.M. 2009)

 

Seroquel

In re Seroquel Products Liab. Litig., 2009 WL 3806434, at *5 (M.D. Fla. 2009)

 

Silicone – Breast Implants

Allison v. McGhan Med. Corp., 184 F.3d 1300, 1315 n.12 (11th Cir. 1999) (noting, in passing, that the district court had found a meta-analysis (the “Kayler study”) unreliable “because it was a re-analysis of other studies that had found no statistical correlation between silicone implants and disease”)

Thimerosal – Vaccine

Salmond v. Sec’y Dep’t of Health & Human Services, 1999 WL 778528 (Fed.Cl. 1999)

Hennessey v. Sec’y Dep’t Health & Human Services, 2009 WL 1709053 (Fed.Cl. 2009)

 

Trasylol

In re Trasylol Prods. Liab. Litig., 2010 WL 1489793 (S.D. Fla. 2010)

 

Vioxx

Merck & Co., Inc. v. Ernst, 296 S.W.3d 81 (Tex. Ct. App. 2009)
Merck & Co., Inc. v. Garza, 347 S.W.3d 256 (Tex. 2011)

 

X-Ray Contrast Media (Nephrotoxicity of Visipaque versus Omnipaque)

Bracco Diagnostics, Inc. v. Amersham Health, Inc., 627 F.Supp.2d 384 (D.N.J. 2009)

Zestril

E.R. Squibb & Sons, Inc. v. Stuart Pharms., 1990 U.S. Dist. LEXIS 15788 (D.N.J. 1990) (Zestril versus Squibb’s competing product, Capoten)

 

Zoloft (SSRI)

Miller v. Pfizer, Inc., 356 F.3d 1326 (10th Cir. 2004)

 

Zymar

Senju Pharmaceutical Co. Ltd. v. Apotex Inc., 2011 WL 6396792 (D.Del. 2011)

 

Zyprexa

In re Zyprexa Products Liab. Litig., 489 F.Supp.2d 230 (E.D.N.Y. 2007) (Weinstein, J.)