TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

When There Is No Risk in Risk Factor

February 20th, 2012

Some of the terminology of statistics and epidemiology is not only confusing but also misleading.  Consider the terms “effect size,” “random effects,” and “fixed effect,” which are all used to describe associations even when they are known to be non-causal.  Biostatisticians and epidemiologists know that the terms refer to putative or potential effects, but the sloppy, shorthand nomenclature can mislead others.

Although “risk” has a fairly precise meaning in scientific parlance, the usage of “risk factor” is fuzzy, loose, and imprecise.  Journalists and plaintiffs’ lawyers use “risk factor” much as they use another frequently abused term in their vocabulary:  “link.”  Both “risk factor” and “link” sound as though they are “causes,” or at least as though they have something to do with causation.  The reality is usually otherwise.

The business of exactly what “risk factor” means is puzzling and disturbing.  The phrase seems to have gained currency because it is squishy and without a definite meaning.  Like journalists’ use of “link,” the use of “risk factor” protects the speaker against contradiction while appearing to imply a scientifically valid conclusion.  Plaintiffs’ counsel and witnesses love to throw the phrase around precisely because of its ambiguity.  In journal articles, authors sometimes refer to any exposure inquired about in a case-control study as a “risk factor,” regardless of the study result.  So a risk factor can be merely an “exposure of interest,” or a possible cause, or a known cause.

The author’s meaning in using the phrase “risk factor” can often be discerned from context.  When an article reports a case-control study that finds an association between an outcome and exposure to some chemical, the discussion section will likely describe that chemical as a risk factor.  The context makes clear that the chemical was found to be associated with the outcome, and that chance was excluded as a likely explanation because the odds ratio was statistically significant.  The context is equally clear that the authors did not conclude that the chemical was a cause of the outcome:  they did not rule out bias or confounding, they did not undertake any appropriate analysis for reaching a causal conclusion, and their single study would not have justified a causal conclusion in any event.

Sometimes authors qualify “risk factor” with an adjective to give more specific meaning to their usage.  Some of the adjectives used in connection with the phrase include:

– putative, possible, potential, established, well-established, known, certain, causal, and causative

The use of the adjective highlights the absence of a precise meaning for “risk factor,” standing alone.  Adjectives such as “established” or “known” imply earlier similar findings, which are corroborated by the study at hand.  Unless “causal” is used to modify “risk factor,” however, there is no reason to interpret the unqualified phrase to imply a cause.

Here is how the phrase “risk factor” is described in some noteworthy texts and treatises.

Legal Treatises

Professor David Faigman and colleagues, with some understatement, note that the term “risk factor” is “rather loosely used”:

“Risk Factor An aspect of personal behavior or life-style, an environmental exposure, or an inborn or inherited characteristic, which on the basis of epidemiologic evidence is known to be associated with health-related condition(s) considered important to prevent. The term risk factor is rather loosely used, with any of the following meanings:

1. An attribute or exposure that is associated with an increased probability of a specified outcome, such as the occurrence of a disease. Not necessarily a causal factor.

2. An attribute or exposure that increases the probability of occurrence of disease or other specified outcome.

3. A determinant that can be modified by intervention, thereby reducing the probability of occurrence of disease or other specified outcomes.”

David L. Faigman, Michael J. Saks, Joseph Sanders, and Edward Cheng, Modern Scientific Evidence:  The Law and Science of Expert Testimony 301, vol. 1 (2010)(emphasis added).

The Reference Manual on Scientific Evidence (2011) (RMSE3d) does not offer much in the way of meaningful guidance here.  The chapter on statistics in the third edition provides a somewhat circular and unhelpful definition.  Here is the entry in that chapter’s glossary:

risk factor. See independent variable.

RMSE3d at 295.  If the glossary defined “independent variable” as simply a quantifiable variable being examined for some potential relationship with the outcome, or dependent, variable, the RMSE would have avoided error.  Instead, the chapter’s glossary, as well as its text, defines independent variables as “causes,” which begs the question:  why do a study to determine whether the “independent variable” is even a candidate for a causal factor?  Here is how the statistics chapter’s glossary defines independent variable:

“Independent variables (also called explanatory variables, predictors, or risk factors) represent the causes and potential confounders in a statistical study of causation; the dependent variable represents the effect. ***. “

RMSE3d at 288.  This is surely circular.  Studies of causation are using independent variables that represent causes?  There would be no reason to do the study if we already knew that the independent variables were causes.

The text of the RMSE chapter on statistics propagates the same confusion:

“When investigating a cause-and-effect relationship, the variable that represents the effect is called the dependent variable, because it depends on the causes.  The variables that represent the causes are called independent variables. With a study of smoking and lung cancer, the independent variable would be smoking (e.g., number of cigarettes per day), and the dependent variable would mark the presence or absence of lung cancer. Dependent variables also are called outcome variables or response variables. Synonyms for independent variables are risk factors, predictors, and explanatory variables.”

RMSE3d at 219.  In the text, the identification of causes with risk factors is explicit.  Independent variables are the causes, and a synonym for an independent variable is “risk factor.”  The chapter could have avoided this error simply by the judicious use of “putative” or “candidate” in front of “causes.”

The chapter on epidemiology exercises more care by using “potential” to modify and qualify the risk factors that are considered in a study:

“In contrast to clinical studies in which potential risk factors can be controlled, epidemiologic investigations generally focus on individuals living in the community, for whom characteristics other than the one of interest, such as diet, exercise, exposure to other environmental agents, and genetic background, may distort a study’s results.”

RMSE3d at 556 (emphasis added).

 

Scientific Texts

Turning our attention to texts on epidemiology written for professionals rather than judges, we find that the term “risk factor” is sometimes used with a careful awareness of its ambiguity.

Herbert I. Weisberg is a statistician whose firm, Correlation Research Inc., specializes in the application of statistics to legal issues.  Weisberg recently published an interesting book on bias and causation, which is recommended reading for lawyers who litigate claimed health effects.  Weisberg’s book defines “risk factor” as merely an exposure of interest in a study that is looking for associations with a harmful outcome.  He insightfully notes that authors use the phrase “risk factor” and similar phrases to avoid causal language:

“We will often refer to this factor of interest as a risk factor, although the outcome event is not necessarily something undesirable.”

Herbert I. Weisberg, Bias and Causation:  Models and Judgment for Valid Comparisons 27 (2010).

“Causation is discussed elliptically if at all; statisticians typically employ circumlocutions such as ‘independent risk factor’ or ‘explanatory variable’ to avoid causal language.”

Id. at 35.

“Risk factor: The risk factor is the exposure of interest in an epidemiological study and often has the connotation that the outcome event is harmful or in some way undesirable.”

Id. at 317.   This last entry illustrates a balanced, fair definition that does not conflate risk factor with causation.

*******************

Lemuel A. Moyé is an epidemiologist who testified in pharmaceutical litigation, mostly for plaintiffs.  His text, Statistical Reasoning in Medicine:  The Intuitive P-Value Primer, is in places a helpful source of guidance on key concepts.  Moyé puts no stock in something’s being a risk factor unless studies show a causal relationship, established through a proper analysis.  Accordingly, he uses “risk factor” to signify simply an exposure of interest:

4.2.1 Association versus Causation

An associative relationship between a risk factor and a disease is one in which the two appear in the same patient through mere coincidence. The occurrence of the risk factor does not engender the appearance of the disease.

Causal relationships on the other hand are much stronger. A relationship is causal if the presence of the risk factor in an individual generates the disease. The causative risk factor excites the production of the disease. This causal relationship is tight, containing an embedded directionality in the relationship, i.e., (1) the disease is absence in the patient, (2) the risk factor is introduced, and (3) the risk factor’s presence produces the disease.

The declaration that a relationship is causal has a deeper meaning then the mere statement that a risk factor and disease are associated. This deeper meaning and its implications for healthcare require that the demonstration of a causal relationship rise to a higher standard than just the casual observation of the risk factor and disease’s joint occurrence.

Often limited by logistics and the constraints imposed by ethical research, the epidemiologist commonly cannot carry out experiments that identify the true nature of the risk factor–disease relationship. They have therefore become experts in observational studies. Through skillful use of observational research methods and logical thought, epidemiologists assess the strength of the links between risk factors and disease.”

Lemuel A. Moyé, Statistical Reasoning in Medicine:  The Intuitive P-Value Primer 92 (2d ed. 2006).

***************************

In A Dictionary of Epidemiology, which is put out by the International Epidemiological Association, a range of meanings is acknowledged, although the range is weighted toward causality:

“RISK FACTOR (Syn: risk indicator)

1. An aspect of personal behavior or lifestyle, an environmental exposure, or an inborn or inherited characteristic that, on the basis of scientific evidence, is known to be associated with meaningful health-related condition(s). In the twentieth century multiple cause era, synonymous with determinant acting at the individual level.

2. An attribute or exposure that is associated with an increased probability of a specified outcome, such as the occurrence of a disease. Not necessarily a causal factor: it may be a risk marker.

3. A determinant that can be modified by intervention, thereby reducing the probability of occurrence of disease or other outcomes. It may be referred to as a modifiable risk factor, and logically must be a cause of the disease.

The term risk factor became popular after its frequent use by T. R. Dawber and others in papers from the Framingham study.346 The pursuit of risk factors has motivated the search for causes of chronic disease over the past half-century. Ambiguities in risk and in risk-related concepts, uncertainties inherent to the concept, and different legitimate meanings across cultures (even if within the same society) must be kept in mind in order to prevent medicalization of life and iatrogenesis.124–128,136,142,240

Miquel Porta, Sander Greenland, John M. Last, eds., A Dictionary of Epidemiology 218-19 (5th ed. 2008).  We might add that the uncertainties inherent in risk concepts should be kept in mind to prevent overcompensation for outcomes not shown to be caused by alleged tortogens.

***************

One introductory text uses “risk factor” as a term to describe the independent variable, while acknowledging that the variable does not become a risk factor until after the study shows an association between the factor and the outcome of interest:

“A case-control study is one in which the investigator seeks to establish an association between the presence of a characteristic (a risk factor).”

Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology: A Primer for Health and Biomedical Professionals 104 (3d ed. 2004).  See also id. at 198 (“Here, also, epidemiology plays a central role in identifying risk factors, such as smoking for lung cancer”).  Although it should be clear that much more must happen in order to show that a risk factor is causally associated with an outcome, such as lung cancer, it would be helpful to spell this out.  Some texts simply characterize risk factors as associations, not necessarily causal in nature.  Another basic text provides:

“Analytical studies examine an association, i.e. the relationship between a risk factor and a disease in detail and conduct a statistical test of the corresponding hypothesis … .”

Wolfgang Ahrens & Iris Pigeot, eds., Handbook of Epidemiology 18 (2005).  See also id. at 111 (Table describing the reasoning in a case-control study:    “Increased prevalence of risk factor among diseased may indicate a causal relationship.”)(emphasis added).

These texts, both legal and scientific, indicate a wide range of usage and ambiguity for “risk factor.”  There is a tremendous potential for the unscrupulous expert witness, or the uneducated lawyer, to take advantage of this linguistic latitude.  Courts and counsel must be sensitive to the ambiguity and imprecision in usages of “risk factor,” and the mischief that may result.  The Reference Manual on Scientific Evidence needs to sharpen and update its coverage of this and other statistical and epidemiologic issues.

Interstitial Doubts About the Matrixx

February 6th, 2012

Statistics professors are excited that the United States Supreme Court issued an opinion that ostensibly addressed statistical significance.  One such example of the excitement is an article, in press, by Joseph B. Kadane, Professor in the Department of Statistics at Carnegie Mellon University, in Pittsburgh, Pennsylvania.  See Joseph B. Kadane, “Matrixx v. Siracusano: what do courts mean by ‘statistical significance’?” 11[x] Law, Probability and Risk 1 (2011).

Professor Kadane makes the sensible point that the allegations of adverse events did not admit of an analysis that would imply statistical significance or its absence.  Id. at 5.  See Schachtman, “The Matrixx – A Comedy of Errors” (April 6, 2011);  David Kaye, “Trapped in the Matrixx: The U.S. Supreme Court and the Need for Statistical Significance,” BNA Product Safety and Liability Reporter 1007 (Sept. 12, 2011).  Unfortunately, the excitement has obscured Professor Kadane’s interpretation of the Court’s holding, and has led him astray in assessing the importance of the case.

In the opening paragraph of his paper, Professor Kadane quotes from the Supreme Court’s opinion that “the premise that statistical significance is the only reliable indication of causation … is flawed,” Matrixx Initiatives, Inc. v. Siracusano, ___ U.S. ___, 131 S.Ct. 1309 (2011).  The quote is accurate, but Professor Kadane proceeds to claim that this quote represents the holding of the Court. Kadane, supra at 1. The Court held no such thing.

Matrixx was a securities fraud class action, brought by investors who claimed that the company misled them when it spoke to the market about the strong growth prospects of its product, the Zicam cold remedy, while possessing information that raised concerns about the product’s economic viability and its FDA license.  The only causation the plaintiffs had to show was an economic loss caused by management’s intentional withholding of “material” information that should have been disclosed under all the facts and circumstances.  The plaintiffs did not have to prove that the medication causes the harm alleged in personal injury actions.  Indeed, it might turn out to be indisputable that the medication does not cause the alleged harm, yet earlier, suggestive studies could provoke regulatory intervention and even a regulatory decision to withdraw the product from the market.  Investors obviously could be hurt under this scenario as much as, if not more than, if the medication caused the harms alleged by personal-injury plaintiffs.

Kadane’s assessment goes awry in suggesting that the Supreme Court issued a holding about facts that were neither proven nor necessary for it to reach its decision.  Courts can, and do, comment, note, and opine about many unnecessary facts or allegations in reaching a holding, but these statements are obiter dicta if they are not necessary to the disposition of the case.  Because medical causation was not required for the Supreme Court to reach its decision, its presence or absence was not, and could not be, part of the Court’s holding.

Kadane makes a similar erroneous statement that the lower appellate courts, which earlier had addressed “statistical significance,” properly or improperly understood, found that “statistical significance in the strict sense [was] neither necessary … nor sufficient … to require action to remove a drug from the market.”  Id. at 6.  The earlier appellate decisions addressed securities fraud, however, not regulatory action of withdrawal of a product.  Kadane’s statement mistakes what was at issue, and what was decided, in all the cases discussed.

Kadane seems at least implicitly to recognize that medical causation is not at issue when he states that “the FDA does not require proof of causation but rather reasonable evidence of an association before a warning is issued.”  Id. at 7 (internal citation omitted).  All that had to have happened for the investors to have been harmed by the Company’s misleading statements was for Matrixx Initiatives to boast about future sales, and to claim that there were no health issues that would lead to regulatory intervention, when they had information raising doubts about their claim of no health issues. See FDA Regulations, 21 U.S.C. § 355(d), (e)(requiring drug sponsor to show adequate testing, labeling, safety, and efficacy); see also 21 C.F.R. § 201.57(e) (requiring warnings in labeling “as there is reasonable evidence of an association of a serious hazard with a drug; a causal relationship need not have been proved.”); 21 C.F.R. § 803.3 (adverse event reports address events possibly related to the drug or the device); 21 C.F.R. § 803.16 (adverse event report is not an admission of causation).

Kadane’s analysis of the case goes further astray when he suggests that the facts were strong enough for the case to have survived summary judgment.  Id. at 9.  The Matrixx case was a decision on the adequacy of the pleadings, not on the adequacy of the facts proven.  Elsewhere, Kadane acknowledges the difference between a challenge to the pleadings and the legal sufficiency of the facts, id. at 7 & n.8, but he asserts, without explanation, that the difference is “technical” and does not matter.  Not true.  The motion to dismiss is made upon receipt of the plaintiffs’ complaint, but the motion for summary judgment is typically made at the close of discovery, on the eve of trial.  The allegations can be conclusory, and they need have only plausible support in other alleged facts to survive a motion to dismiss.  To survive a motion for summary judgment, which comes much later in the natural course of any litigated case, the plaintiff must have evidence of all material facts, as well as expert witness opinion that survives judicial scrutiny for scientific validity under Rule 702.

Kadane appears to try to support the conflation of dismissals on the pleadings and summary judgments by offering a definition of summary judgment that is not quite accurate, and potentially misleading:  “The idea behind summary judgment is that, even if every fact alleged by the opposing party were found to be true, the case would still fail for legal reasons.” Id. at 2.  The problem is that at the summary judgment stage, as opposed to the pleading stage, the party with the burden of proof cannot rest upon his allegations, but must come forward with facts, not allegations, to support every essential element of his case.  A plaintiff in a personal injury action (not a securities fraud case), for example, may easily survive a motion to dismiss by alleging medical causal connection, but at the summary judgment stage that plaintiff must serve a report of an appropriately qualified expert witness, who in turn has presented a supporting opinion, reliably grounded in science, to survive both evidentiary challenges and a dispositive motion.

Kadane concludes that the Matrixx decision’s “fact-based consideration” is consistent with a “Bayesian decision-theoretic approach that models how to make rational decisions under uncertainty.”  Id. at 9.  I am 99.99999% certain that Justice Sotomayor would not have a clue about what Professor Kadane was saying.  Although statistical significance may have played no role in the Court’s holding, and in Kadane’s Bayesian decision-theoretic approach, I am 100% certain that the irrelevance of statistical significance to the Court’s and Prof. Kadane’s approaches is purely coincidental.

Federal Rule of Evidence 702 Requires Perscrutations — Samaan v. St. Joseph Hospital (2012)

February 4th, 2012

After the dubious decision in Milward, the First Circuit would seem an unlikely forum for perscrutations of expert witness opinion testimony.  Milward v. Acuity Specialty Products Group, Inc., 639 F.3d 11 (1st Cir. 2011), cert. denied, ___ U.S. ___ (2012).  See “Milward — Unhinging the Courthouse Door to Dubious Scientific Evidence” (Sept. 2, 2011).  Late last month, however, a panel of the United States Court of Appeals for the First Circuit held that Rule 702 required perscrutation of expert witness opinion, and then proceeded to perscrutate perspicaciously, in Samaan v. St. Joseph Hospital, 2012 WL 34262 (1st Cir. 2012).

The plaintiff, Mr. Samaan, suffered an ischemic stroke, for which he was treated by the defendant hospital and physician.  The plaintiff claimed that the defendants’ treatment deviated from the standard of care by failing to administer intravenous tissue plasminogen activator (t-PA).  Id. at *1.  The plaintiff’s only causation expert witness, Dr. Ravi Tikoo, opined that the defendants’ failure to administer t-PA caused the plaintiff’s neurological injury.  Id. at *2.   Dr. Tikoo’s opinions, as well as those of the defense expert witness, were based in large part upon data from a study done by one of the National Institutes of Health:  The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group, “Tissue Plasminogen Activator for Acute Ischemic Stroke,” 333 New Engl. J. Med. 1581 (1995).

Both the District Court and the Court of Appeals noted that the problem with Dr. Tikoo’s opinions lay not in the unreliability of the data, or in the generally accepted view that t-PA can, under certain circumstances, mitigate the sequelae of ischemic stroke; rather the problem lay in the analytical gap between those data and Dr. Tikoo’s conclusion that the failure to administer t-PA caused Mr. Samaan’s stroke-related injuries.

The district court held that Dr. Tikoo’s opinion failed to satisfy the requirements of Rule 702. Id. at *8 – *9.  Dr. Tikoo examined odds ratios from the NINDS study, and others, and concluded that a patient’s chances of improved outcome after stroke increased 50% with t-PA, and thus Mr. Samaan’s healthcare providers’ failure to provide t-PA had caused his poor post-stroke outcome.  Id. at *9.  The appellate court similarly rejected the inference from an increased odds ratio to specific causation:

“Dr. Tikoo’s first analysis depended upon odds ratios drawn from the literature. These odds ratios are, as the term implies, ratios of the odds of an adverse outcome, which reflect the relative likelihood of a particular result.FN5 * * * Dr. Tikoo opined that the plaintiff more likely than not would have recovered had he received the drug.”

Id. at *10.

The Court correctly identified the expert witness’s mistake in inferring specific causation from an odds ratio of about 1.5, without any additional information.  The Court characterized the testimonial flaw as one of “lack of fit,” but it was equally an unreliable inference from epidemiologic data to a conclusion about specific causation.

While the Court should be applauded for rejecting the incorrect inference about specific causation, we might wish that it had been more careful about important details.  The Court misinterpreted an odds ratio to be a relative risk.  The NINDS study reported risk ratio results both as an odds ratio and as a relative risk.  The Court’s sloppiness should be avoided; the two statistics are different, especially when the outcome of interest is not particularly rare.
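For readers who want to see the difference in the arithmetic, here is a minimal sketch in Python, using the court’s banana hypothetical (15 of 100 exposed and 10 of 100 unexposed with the outcome) rather than the actual NINDS data:

```python
# Court's banana hypothetical: 15 of 100 banana eaters and 10 of 100
# non-eaters develop the outcome.
exposed_cases, exposed_total = 15, 100
unexposed_cases, unexposed_total = 10, 100

risk_exposed = exposed_cases / exposed_total            # 0.15
risk_unexposed = unexposed_cases / unexposed_total      # 0.10
relative_risk = risk_exposed / risk_unexposed           # 1.50

odds_exposed = exposed_cases / (exposed_total - exposed_cases)          # 15/85
odds_unexposed = unexposed_cases / (unexposed_total - unexposed_cases)  # 10/90
odds_ratio = odds_exposed / odds_unexposed              # ~1.59

print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.2f}")
```

The “1.5” in the court’s example is a risk ratio; the corresponding odds ratio is closer to 1.6, and the divergence between the two grows as the outcome becomes more common.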

Still, the odds ratio is interesting and important as an approximation to the relative risk, and neither measure of risk can substitute for causation, especially when the magnitude of the risk is small, and less than two-fold.  The First Circuit recognized and focused on this gap between risk and causal attribution in an individual’s case:

“[Dr. Tikoo’s] reasoning is structurally unsound and leaves a wide analytical gap between the results produced through the use of odds ratios and the conclusions drawn by the witness. When a person’s chances of a better outcome are 50% greater with treatment (relative to the chances of those who were not treated), that is not the same as a person having a greater than 50% chance of experiencing the better outcome with treatment. The latter meets the required standard for causation; the former does not.  To illustrate, suppose that studies have shown that 10 out of a group of 100 people who do not eat bananas will die of cancer, as compared to 15 out of a group of 100 who do eat bananas. The banana-eating group would have an odds ratio of 1.5 or a 50% greater chance of getting cancer than those who eschew bananas. But this is a far cry from showing that a person who eats bananas is more likely than not to get cancer.

Even if we were to look only at the fifteen persons in the banana-eating group who did get cancer, it would not be likely that any particular person in that cohort got it from the consumption of bananas. Correlation is not causation, and a substantial number of persons with cancer within the banana-eating group would in all probability have contracted the disease whether or not they ate bananas.FN6

We think that this example exposes the analytical gap between Dr. Tikoo’s methods and his conclusions.  Although he could present figures ranging higher than 50%, those figures were not responsive to the question of causation. Let us take the “stroke scale” figure from the NINDS study as an example. This scale measures the neurological deficits in different parts of the nervous system. Twenty percent of patients who experienced a stroke and were not treated with t-PA had a favorable outcome according to this scale, whereas that figure escalated to 31% when t-PA was administered.

Although this means that the patients treated with t-PA had over a 50% better chance of recovery than they otherwise would have had, 69% of those patients experienced the adverse outcome (stroke-related injury) anyway.FN7  The short of it is that while the odds ratio analysis shows that a t-PA patient may have a better chance of recovering than he otherwise would have had without t-PA, such an analysis does not show that a person has a better than even chance of avoiding injury if the drug is administered. The odds ratio, therefore, does not show that the failure to give t-PA was more likely than not a substantial factor in causing the plaintiff’s injuries. The unavoidable conclusion from the studies deemed authoritative by Dr. Tikoo is that only a small number of patients overall (and only a small fraction of those who would otherwise have experienced stroke-related injuries) experience improvement when t-PA is administered.”

Id. at *11 and n.6 (citing Milward).

The court in Samaan thus suggested, but did not state explicitly, that the study would have to have shown better than a 100% increase in the rate of recovery for attributability to have exceeded 50%.  The Court’s timidity is regrettable. Yes, Dr. Tikoo’s confusing the percentage increase in risk with the percentage of attributability was quite knuckleheaded.  I doubt that many would want to subject themselves to Dr. Tikoo’s quality of care, at least not his statistical care.  The First Circuit, however, stopped short of stating what magnitude of increase in risk would permit an inference of specific causation for Mr. Samaan’s post-stroke sequelae.
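The arithmetic the Court left implicit is straightforward.  Under the standard, admittedly simplified, assumption that the fraction of cases attributable to the exposure is (RR − 1)/RR, a quick sketch shows why a doubling of risk is the threshold:

```python
def attributable_fraction(relative_risk: float) -> float:
    """Fraction of exposed cases attributable to the exposure: (RR - 1) / RR."""
    return (relative_risk - 1.0) / relative_risk

for rr in (1.5, 2.0, 3.0):
    print(f"RR = {rr}: attributable fraction = {attributable_fraction(rr):.0%}")

# RR = 1.5 -> 33%; RR = 2.0 -> 50%; RR = 3.0 -> 67%.
# Only when the relative risk exceeds 2 does the attributable fraction,
# and hence the probability of specific causation on this simple model,
# exceed the "more likely than not" threshold of 50%.
```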

The Circuit noted that expert witnesses may present epidemiologic statistics in a variety of forms:

“to indicate causation. Either absolute or relative calculations may suffice in particular circumstances to achieve the causation standard. See, e.g., Smith v. Bubak, 643 F.3d 1137, 1141–42 (8th Cir.2011) (rejecting relative benefit testimony and suggesting in dictum that absolute benefit “is the measure of a drug’s overall effectiveness”); Young v. Mem’l Hermann Hosp. Sys., 573 F.3d 233, 236 (5th Cir.2009) (holding that Texas law requires a doubling of the relative risk of an adverse outcome to prove causation), cert. denied, ___ U.S. ___, 130 S.Ct. 1512, 176 L.Ed.2d 111 (2010).”

 Id. at *11.

Although the citation to Texas law with its requirement of a doubling of a relative risk is welcome and encouraging, the Court seems to have gone out of its way to muddle its holding.  First, the Young case involved t-PA and a claimed deviation from the standard of care in a stroke case, and was exactly on point.  The Fifth Circuit’s reliance upon Texas substantive law left unclear to what extent the same holding would have been required by Federal Rule of Evidence 702.

Second, the First Circuit, with its banana hypothetical, appeared to confuse an odds ratio with a relative risk.  The odds ratio is different from a relative risk, and typically an odds ratio will be higher than the corresponding relative risk, unless the outcome is rare.  See Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers at 37 (2d ed. 2001). In studies of medication efficacy, however, the benefit will not be particularly rare, and the rare disease assumption cannot be made.

Third, risk is not causation, regardless of magnitude.  If the magnitude of risk is used to infer specific causation, then what is the basis for the inference, and how large must the risk be?  In what way can epidemiologic statistics be used “to indicate” specific causation?  The opinion tells us that Dr. Tikoo’s reliance upon an odds ratio of 1.5 was unhelpful, but why?  The Court, which spoke so clearly and well in identifying the fallacious reasoning of Dr. Tikoo, faltered in identifying what use of risk statistics would permit an inference of specific causation in this case, where general causation was never in doubt.

The Fifth Circuit’s decision in Young, supra, invoked the greater than doubling of risk required by Texas law.  This requirement is nothing more than a logical, common-sense recognition that risk is not causation, and that small risks alone cannot support an inference of specific causation.  Requiring a relative risk greater than two makes practical sense despite the apoplectic objections of Professor Sander Greenland.  See “Relative Risks and Individual Causal Attribution Using Risk Size” (Mar. 18, 2011).

Importantly, the First Circuit panel in Samaan did not engage in the hand-waving arguments that were advanced in Milward, and stuck to clear, transparent rational inferences.  In footnote 6, the Samaan Court cited its earlier decision in Milward, but only with double negatives, and for the relevancy of odds ratios to the question of general causation:

“This is not to say that the odds ratio may not help to prove causation in some instances.  See, e.g., Milward v. Acuity Specialty Prods. Group, Inc., 639 F.3d 11, 13–14, 23–25 (1st Cir.2011) (reversing exclusion of expert prepared to testify as to general rather than specific causation using in part the odds ratio).”

Id. at n.6.

The Samaan Court went on to suggest that inferring specific causation from the magnitude of risk was “theoretically possible”:

Indeed, it is theoretically possible that a particular odds ratio calculation might show a better-than-even chance of a particular outcome. Here, however, the odds ratios relied on by Dr. Tikoo have no such probative force.

Id. (emphasis added).  But why and how? The implication of the Court’s dictum is that when the risk ratio is small, less than or equal to two, the ratio cannot be taken to have supported the showing of “better than even chance.” In Milward, one of the key studies relied upon by plaintiff’s expert witness reported an increased risk of only 40%.  Although Milward presented primarily a challenge on general causation, the Samaan decision suggests that the low-dose benzene exposure plaintiffs are doomed, not by benzene, but by the perscrutation required by Rule 702.

Ethics and Statistics

January 21st, 2012

Chance magazine has started a new feature, the “Ethics and Statistics” column, which is likely to be of interest to lawyers and to statisticians who work on litigation issues.  The column is edited by Andrew Gelman.  Judging from Gelman’s first column, I think that the column may well become a valuable forum for important scientific and legal issues arising from studies used in public policy formulation, and in reaching conclusions that are the bases for scientific expert witnesses’ testimony in court.

Andrew Gelman is a professor of statistics and political science at Columbia University.  He is also the director of the University’s Applied Statistics Center.   Gelman’s inaugural column touches on an issue of great importance to legal counsel who litigate claims involving scientific studies:  access to the underlying data in the studies that are the bases for expert witness opinions.  See Andrew Gelman, “Open Data and Open Methods,” 24 Chance 51 (2011).

Gelman acknowledges that conflicts are not only driven by monetary gain; they can be potently raised by positions or causes espoused by the writer:

“An ethics problem arises when you are considering an action that

(a) benefits you or some cause you support,

(b) hurts or reduces benefits to others, and

(c) violates some rule.”

Id. at 51a.

Positional conflicts among scientists whose studies touch upon policy issues give rise to “the ethical imperative to share data.”  Id. at 51c.  Naming names, Professor Gelman relates an incident in which he wrote to an  EPA scientist, Carl Blackman, who had presented a study on the supposed health effects of EMF radiation.   Skeptical of how Blackman had analyzed data, Gelman wrote to Blackman to request his data to carry out additional, alternative statistical analyses.  Blackman answered that he did not think these other analyses were needed, and he declined to share his data.

This sort of refusal is all too common, and typical of the arrogance of scientists who do not want others to be able to take a hard look at how they arrived at their conclusions.  Gelman reminds us that:

“Refusing to share your data is improper… .”

* * * *

“[S]haring data is central to scientific ethics.  If you really believe your results, you should want your data out in the open. If, on the other hand, you have a sneaking suspicion that maybe there’s something there you don’t want to see, and then you keep your raw data hidden, it’s a problem.”

* * * *

“Especially for high-stakes policy questions (such as the risks of electric power lines), transparency is important, and we support initiatives for automatically making data public upon publication of results so researchers can share data without it being a burden.”

Id. at 53.

To be sure, there are some problems with sharing data, but none that is insuperable, and none that should be an excuse for withholding data.  The logistical, ethical, and practical problems of data sharing should now be anticipated long before publication and the requests for data sharing arrive.

Indeed, the National Institutes of Health requires data-sharing plans to be part of a protocol for a federally funded study.  See Final NIH Statement on Sharing Research Data (Feb. 26, 2003). Unfortunately, the NIH’s implementation and enforcement of its data-sharing policy is as spotty as a Damien Hirst painting.  See “Seeing Spots,” The New Yorker (Jan. 23, 2012).

The Will to Ummph

January 10th, 2012

It has become très chic to criticize and dismiss the concept of statistical significance.

The new Reference Manual on Scientific Evidence contains a sly reference to, and endorsement of, a book by the two would-be statistics experts who submitted an amicus brief to the Supreme Court in Matrixx Initiatives v. Siracusano:

“For a hypercritical assessment of statistical significance testing that nevertheless identifies much inappropriate overreliance on it, see Stephen T. Ziliak & Deirdre N. McCloskey, The Cult of Statistical Significance (2008).”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology” 549, 579, in Federal Judicial Center and National Research Council, Reference Manual on Scientific Evidence (3d ed. 2011).

The Reference Manual authors are, in fact, hypo-critical of the rhetoric of Ziliak and McCloskey.  I have previously written at some length about these authors’, and brief writers’, submission to the Supreme Court, and their subsequent harrumph in Significance.  See “The Matrixx Oversold”; “Matrixx Unloaded”; “The Matrixx – A Comedy of Errors”; “Matrixx Galvanized – More Errors, More Comedy About Statistics”; and “Ziliak Gives Legal Advice — Puts His Posterior On the Line” (June 2, 2011).

To date, I have not addressed Ziliak and McCloskey’s book-length treatment of statistical significance in The Cult of Statistical Significance, with its demonization of Sir Ronald Fisher, its beatification of William Gosset, and its call for a measure of “ummph.”  Thankfully, Professor Deborah Mayo, a professor of statistics and philosophy, has delivered the coup de grâce to Ziliak & McCloskey in her interesting and timely blog, Error Statistics Philosophy.   See, e.g., Part 2 Prionvac: The Will to Understand Power (October 3, 2011); and Part 3: Prionvac: How the Reformers Should Have Done Their Job (October 4, 2011).

Mayo’s “will to understand power” is a nice play on Nietzsche, and a rebuke of Ziliak and McCloskey’s strident call  for a measure of “ummph.”  Most of their argument is beside the point for the current practice of epidemiology, which insists upon reporting a measure of “effect” size, as well as statistical precision in a confidence interval.  The Reference Manual‘s citation to Ziliak & McCloskey’s book thus badly misses the point and the errors of the book’s criticisms of significance.

Mayo’s posts should remove any sense of need or desire to obtain and read Ziliak & McCloskey’s book.  You can safely wait until Cult shows up on the discount rack, or in the recycling pile.

A Rule of Completeness for Statistical Evidence

December 23rd, 2011

Witnesses swear to tell the “whole” truth, but lawyers are allowed to deal in half truths.  Given this qualification on lawyers’ obligation of truthfulness, the law prudently modifies the rules of admissibility for writings to permit an adverse party to require that written statements not be yanked out of context.  Waiting days, if not weeks, in a trial to restore the context is an inadequate remedy for these “half truths.”  If a party introduces all or part of a writing or recorded statement, an adverse party may “require the introduction, at that time, of any other part — or any other writing or recorded statement — that in fairness ought to be considered at the same time.”  Fed. R. Evid. 106 (Remainder of or Related Writings or Recorded Statements).  See also Fed. R. Civ. P. 32(a)(4) (rule of completeness for depositions).

This “rule of completeness” has its roots in the common law and in the tradition of narrative testimony.  The Advisory Committee note to Rule 106 comments that the rule is limited to “writings and recorded statements and does not apply to conversations.”  The Rule and the note ignore the possibility that the problematic incompleteness might take the form of mathematical or statistical evidence.

Confidence Intervals

Consider sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.

The new third edition dilutes this clear prescription, but still conveys the basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The evidentiary point is that the standard error, or the confidence interval (C.I.), is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.

Of course, the 95 percent confidence interval is the estimate (the risk ratio, the point estimate) plus or minus 1.96 standard errors.  By analogy to Rule 106, lawyers should insist that the confidence interval, or some similar expression of the size of the standard error, be provided at the time that the examiner asks about, or the witness gives, the sample estimate.  There are any number of consensus position papers, as well as guidelines for authors of papers, which specify that risk ratios should be accompanied by confidence intervals.  Courts should heed those recommendations, and require parties to present the complete statistical idea – estimate and random error – at one time.
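By way of illustration, and with numbers invented solely for the example, here is how the computation runs for a risk ratio estimated on the log scale:

```python
import math

# Invented result: risk ratio of 1.4, standard error of 0.18 on the log scale.
rr = 1.4
se_log_rr = 0.18

log_rr = math.log(rr)
lower = math.exp(log_rr - 1.96 * se_log_rr)
upper = math.exp(log_rr + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
# RR = 1.40, 95% CI (0.98, 1.99).  Reporting "1.40" alone, or the upper bound
# alone, strips the estimate of the random error that gives it meaning.
```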

One disreputable lawyer trick is to present incomplete confidence intervals.  Plaintiffs’ counsel, for instance, may inquire into the upper bound of a confidence interval, and attempt to silence witnesses when they respond with both the lower and upper bounds.  “Just answer the question, and stop volunteering information not asked.”  Indeed, some unscrupulous lawyers have been known to cut off witnesses from providing the information about both bounds of the interval, on the claim that the witness was being “unresponsive.”  Judges who are impatient with technical statistical testimony may even admonish witnesses who are trying to make sure that they present the “whole truth.”  Here again, the completeness rule should protect the integrity of the fact finding by allowing, and requiring, that the full information be presented at once, in context.

Although I have seen courts permit the partial, incomplete presentation of statistical evidence, I have yet to see a court acknowledge the harm from failing to apply Rule 106 to quantitative, statistical evidence.  One court, however, did address the inherent error of permitting a party to emphasize the extreme values within a confidence interval as “consistent” with the data sample.  Marder v. G.D. Searle & Co., 630 F.Supp. 1087 (D.Md. 1986), aff’d mem. on other grounds sub nom. Wheelahan v. G.D.Searle & Co., 814 F.2d 655 (4th Cir. 1987)(per curiam).

In Marder, the plaintiff claimed pelvic inflammatory disease from an IUD.  The jury was deadlocked on causation, and the trial court decided to grant the defendant’s motion for directed verdict, on grounds that the relative risk involved was less than two. Id. at 1092 (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”).

The plaintiff sought to resist entry of judgment by arguing that although the relative risk was less than two, the court should consider the upper bound of the confidence interval, which ranged from 0.9 to 4.0.  Id.  In other words, the plaintiff argued that she was entitled to have the jury consider and determine that the actual value was as high as 4.0.

The court, fairly decisively, rejected this attempt to isolate the upper bound of the confidence interval:

“The upper range of the confidence intervals signify the outer realm of possibilities, and plaintiffs cannot reasonably rely on these numbers as evidence of the probability of a greater than two fold risk.  Their argument reaches new heights of speculation and has no scientific basis.”

The Marder court could have gone further by pointing out that the confidence interval does not provide a probability for any value within the interval.

Multiple Testing

In some situations, completeness may require more than the presentation of the size of the random error, or the width of the confidence interval.  When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing.  The fact of multiple testing will inflate the false-positive error rate.

Here is the relevant language from Kaye and Freedman’s chapter on statistics, in the Reference Manual (3d ed.):

4. How many tests have been done?

Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)^10 = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available… . In these situations, courts should not be overly impressed with claims that estimates are significant. …”

RMSE 3d ed. at 256-57.
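The Manual’s coin example is easy to check.  Here is a quick simulation sketch, assuming (my assumption, not the Manual’s) that “a few thousand tosses” means 3,000:

```python
import random

def has_run_of_heads(n_tosses: int, run_length: int = 10) -> bool:
    """True if a sequence of fair-coin tosses contains a run of heads this long."""
    streak = 0
    for _ in range(n_tosses):
        if random.random() < 0.5:       # heads
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0
    return False

trials = 2_000
hits = sum(has_run_of_heads(3_000) for _ in range(trials))

print(f"P(10 heads in the first 10 tosses) = {1 / 2**10:.5f}")       # ~0.00098
print(f"P(a run of 10 heads somewhere in 3,000 tosses) ~ {hits / trials:.2f}")
# The second probability comes out around 0.75: the same pattern that is
# striking in 10 tosses is unremarkable over a few thousand.
```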

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis.  When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires the presentation of the information about the number of tests and the distorting effect of multiple testing on preserving a pre-specified Type I error rate.  Even a nominally statistically significant finding must be understood in the full context of the study.

Many texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves.  Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.”
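The adjustment that readers are expected to make for themselves is not mysterious, but neither is it intuitive for laypersons.  A minimal sketch, assuming for illustration 20 independent comparisons each tested at the 0.05 level:

```python
# Chance of at least one "significant" result from k independent comparisons,
# each tested at the nominal 0.05 level, and the Bonferroni-corrected threshold.
alpha = 0.05
k = 20   # an arbitrary, illustrative number of comparisons

family_wise_error_rate = 1 - (1 - alpha) ** k     # ~0.64
bonferroni_threshold = alpha / k                  # 0.0025

print(f"Chance of at least one false positive across {k} tests: "
      f"{family_wise_error_rate:.0%}")
print(f"Bonferroni-corrected per-comparison threshold: {bonferroni_threshold}")
```

A nominal “p < 0.05” finding from one of twenty looks at the data carries far less weight than the same finding from a single pre-specified comparison.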

Epidemiology, Risk, and Causation – Report of Workshops

November 15th, 2011

This month’s issue of Preventive Medicine includes a series of papers arising from last year’s workshops on “Epidemiology, Risk, and Causation,” at Cambridge University. The workshops were organized by philosopher Alex Broadbent, a member of the Department of History and Philosophy of Science at Cambridge University.  The workshops were financially sponsored by the Foundation for Genomics and Population Health (PHG), a not-for-profit British organization.

Although Broadbent’s workshops were intended for philosophers of science, statisticians, and epidemiologists, lawyers involved in health effects litigation will find the papers of interest as well.  The themes of the workshops included:

  • the nature of epidemiologic causation,
  • the competing claims of observational and experimental research for establishing causation,
  • the role of explanation and prediction in assessing causality,
  • the role of moral values in causal judgments, and
  • the role of statistical and epistemic uncertainty in causal judgments

See Alex Broadbent, ed., “Special Section: Epidemiology, Risk, and Causation,” 53 Preventive Medicine 213-356 (October-November 2011).  Preventive Medicine is published by Elsevier Inc., so you know that the articles are not free.  Still you may want to read these at your local library to determine what may be useful in challenging and defending causal judgments in the courtroom.  One of the interlocutors, Sander Greenland, is of particular interest because he shows up as an expert witness with some regularity.

Here are the individual papers published in this special issue:

Alfredo Morabia, Michael C. Costanza, Philosophy and epidemiology

Alex Broadbent, Conceptual and methodological issues in epidemiology: An overview

Alfredo Morabia, Until the lab takes it away from epidemiology

Nancy Cartwright, Predicting what will happen when we act. What counts for warrant?

Sander Greenland, Null misinterpretation in statistical testing and its impact on health risk assessment

Daniel M. Hausman, How can irregular causal generalizations guide practice

Mark Parascandola, Causes, risks, and probabilities: Probabilistic concepts of causation in chronic disease epidemiology

John Worrall, Causality in medicine: Getting back to the Hill top

Olaf M. Dekkers, On causation in therapeutic research: Observational studies, randomised experiments and instrumental variable analysis

Alexander Bird, The epistemological function of Hill’s criteria

Michael Joffe, The gap between evidence discovery and actual causal relationships

Stephen John, Why the prevention paradox is a paradox, and why we should solve it: A philosophical view

Jonathan Wolff, How should governments respond to the social determinants of health?

Alex Broadbent, What could possibly go wrong? — A heuristic for predicting population health outcomes of interventions

The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence

November 14th, 2011

Meta-analysis is a statistical procedure for aggregating data and statistics from individual studies into a single summary statistical estimate of the population measurement of interest.  The first meta-analysis is typically attributed to Karl Pearson, circa 1904, who sought a method to overcome the limitations of small sample size and low statistical power.  Statistical methods for meta-analysis, however, did not mature until the 1970s.  Even then, the biomedical scientific community remained skeptical of, if not outright hostile to, meta-analysis until relatively recently.
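The core arithmetic of the simplest approach, a fixed-effect, inverse-variance weighted average of the study results, can be sketched in a few lines of Python; the study results below are invented for illustration only:

```python
import math

# Invented study results: each tuple is (log risk ratio, standard error).
studies = [(math.log(1.3), 0.25), (math.log(1.6), 0.30), (math.log(1.2), 0.20)]

# Fixed-effect, inverse-variance weights: w_i = 1 / se_i**2.
weights = [1 / se ** 2 for _, se in studies]
pooled_log_rr = sum(w * lr for (lr, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

pooled_rr = math.exp(pooled_log_rr)
ci_lower = math.exp(pooled_log_rr - 1.96 * pooled_se)
ci_upper = math.exp(pooled_log_rr + 1.96 * pooled_se)
print(f"Pooled RR = {pooled_rr:.2f}, 95% CI ({ci_lower:.2f}, {ci_upper:.2f})")
```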

The hostility to meta-analysis, especially in the context of observational epidemiologic studies, was colorfully expressed by Samuel Shapiro and Alvan Feinstein, as late as the 1990s:

“Meta-analysis begins with scientific studies….  [D]ata from these studies are then run through computer models of bewildering complexity which produce results of implausible precision.”

* * * *

“I propose that the meta-analysis of published non-experimental data should be abandoned.”

Samuel Shapiro, “Meta-analysis/Smeta-analysis,” 140 Am. J. Epidem. 771, 777 (1994).  See also Alvan Feinstein, “Meta-Analysis: Statistical Alchemy for the 21st Century,” 48 J. Clin. Epidem. 71 (1995).

The professional skepticism about meta-analysis was reflected in some of the early judicial assessments of meta-analysis in court cases.  In the 1980s and early 1990s, some trial judges erroneously dismissed meta-analysis as a flawed statistical procedure that claimed to make something out of nothing. Allen v. Int’l Bus. Mach. Corp., No. 94-264-LON, 1997 U.S. Dist. LEXIS 8016, at *71–*74 (suggesting that meta-analysis of observational studies was controversial among epidemiologists).

In In re Paoli Railroad Yard PCB Litigation, Judge Robert Kelly excluded plaintiffs’ expert witness Dr. William Nicholson and his testimony based upon his unpublished meta-analysis of health outcomes among PCB-exposed workers.  Judge Kelly found that the meta-analysis was a novel technique, and that Nicholson’s meta-analysis was not peer reviewed.  Furthermore, the meta-analysis assessed health outcomes not experienced by any of the plaintiffs before the trial court.  706 F. Supp. 358, 373 (E.D. Pa. 1988).

The Court of Appeals for the Third Circuit reversed the exclusion of Dr. Nicholson’s testimony, and remanded for reconsideration with instructions.  In re Paoli R.R. Yard PCB Litig., 916 F.2d 829, 856-57 (3d Cir. 1990), cert. denied, 499 U.S. 961 (1991); Hines v. Consol. Rail Corp., 926 F.2d 262, 273 (3d Cir. 1991).  The Circuit noted that meta-analysis was not novel, and that the lack of peer-review was not an automatic disqualification.  Acknowledging that a meta-analysis could be performed poorly using invalid methods, the appellate court directed the trial court to evaluate the validity of Dr. Nicholson’s work on his meta-analysis.

In one of many skirmishes over colorectal cancer claims in asbestos litigation, Judge Sweet in the Southern District of New York was unimpressed by efforts to aggregate data across studies.  Judge Sweet declared that “no matter how many studies yield a positive but statistically insignificant SMR for colorectal cancer, the results remain statistically insignificant. Just as adding a series of zeros together yields yet another zero as the product, adding a series of positive but statistically insignificant SMRs together does not produce a statistically significant pattern.”  In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1042 (S.D.N.Y. 1993).  The plaintiffs’ expert witness who had offered the unreliable testimony, Dr. Steven Markowitz, like Nicholson another foot soldier in Dr. Irving Selikoff’s litigation machine, did not offer a formal meta-analysis to justify his assessment that multiple non-significant studies, taken together, rule out chance as a likely explanation for an aggregate finding of an increased risk.

Judge Sweet was quite justified in rejecting this back-of-the-envelope, non-quantitative meta-analysis.  His suggestion, however, that multiple non-significant studies could never collectively serve to rule out chance as an explanation for an overall increased rate of disease in the exposed groups is wrong.  Judge Sweet would have done better to focus on the validity issues in the key studies, the presence of bias and confounding, and the completeness of the proffered meta-analysis.  The Second Circuit reversed the entry of summary judgment, and remanded the colorectal cancer claim for trial.  52 F.3d 1124 (2d Cir. 1995).  Over a decade later, with even more accumulated studies and data, the Institute of Medicine found the evidence for asbestos plaintiffs’ colorectal cancer claims to be scientifically insufficient.  Institute of Medicine, Asbestos: Selected Cancers (Wash. D.C. 2006).

Courts continue to go astray with the erroneous belief that multiple studies, none statistically significant on its own, cannot yield a statistically significant summary estimate of increased risk.  See, e.g., Baker v. Chevron USA, Inc., 2010 WL 99272, *14-15 (S.D. Ohio 2010) (addressing a meta-analysis by Dr. Infante on multiple myeloma outcomes in studies of benzene-exposed workers).  There were many sound objections to Infante’s meta-analysis, but the suggestion that multiple studies without statistical significance could not yield a summary estimate of risk with statistical significance was not one of them.
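
The arithmetic behind this point is easy to demonstrate.  Below is a minimal sketch in Python, using entirely hypothetical numbers (not data from the asbestos or benzene studies at issue in these cases), of a plain fixed-effect, inverse-variance combination of log relative risks.  Each of the five invented studies has a 95% confidence interval that crosses 1.0; the pooled estimate does not:

import math

# Hypothetical (log relative risk, standard error) pairs for five studies.
# Each study's own 95% confidence interval crosses 1.0 on the risk-ratio
# scale (i.e., crosses 0 on the log scale), so no single study is
# "statistically significant."
studies = [(0.20, 0.15), (0.25, 0.18), (0.15, 0.12), (0.30, 0.20), (0.18, 0.14)]

for log_rr, se in studies:
    lo, hi = log_rr - 1.96 * se, log_rr + 1.96 * se
    print(f"single study: RR = {math.exp(log_rr):.2f}, "
          f"95% CI ({math.exp(lo):.2f}, {math.exp(hi):.2f})")

# Fixed-effect pooling: weight each study by the inverse of its variance.
weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * lr for (lr, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled:       RR = {math.exp(pooled):.2f}, "
      f"95% CI ({math.exp(lo):.2f}, {math.exp(hi):.2f})")

With these invented inputs, the pooled relative risk comes out to roughly 1.2 with a confidence interval of about (1.07, 1.39).  The pooled interval narrows because the summary estimate draws on all of the underlying data at once; that is precisely why a series of individually non-significant studies can, in the aggregate, rule out chance as a likely explanation for an elevated risk.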

In the last two decades, meta-analysis has emerged as an important technique for addressing random variation in studies, as well as some of the limitations of frequentist statistical methods.  In the 1980s, articles reporting meta-analyses were rare to non-existent.  In 2009, there were over 2,300 articles with “meta-analysis” in their title, or in their keywords, indexed in the PubMed database of the National Library of Medicine.  See Michael O. Finkelstein and Bruce Levin, “Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases” (2011) (forthcoming in Jurimetrics).

The techniques for aggregating data have been studied, refined, and employed extensively in thousands of methods and application papers in the last decade. Consensus guideline papers have been published for meta-analyses of clinical trials as well as observational studies.  See Donna Stroup, et al., “Meta-analysis of Observational Studies in Epidemiology: A Proposal for Reporting,” 283 J. Am. Med. Ass’n 2008 (2000) (MOOSE statement); David Moher, Deborah Cook, Susan Eastwood, Ingram Olkin, Drummond Rennie, and Donna Stroup, “Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement,” 354 Lancet 1896 (1999).  See also Jesse Berlin & Carin Kim, “The Use of Meta-Analysis in Pharmacoepidemiology,” in Brian Strom, ed., Pharmacoepidemiology 681, 683–84 (4th ed. 2005); Zachary Gerbarg & Ralph Horwitz, “Resolving Conflicting Clinical Trials: Guidelines for Meta-Analysis,” 41 J. Clin. Epidemiol. 503 (1988).

Meta-analyses, of observational studies and of randomized clinical trials, routinely are relied upon by expert witnesses in pharmaceutical and so-called toxic tort litigation.  Id.  See also In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1174, 1184 (N.D. Cal. 2007) (holding that reliance upon “[a] meta-analysis of all available published and unpublished randomized clinical trials” was reasonable and appropriate, and criticizing the expert witnesses who urged the complete rejection of meta-analysis of observational studies).

The second edition of the Reference Manual on Scientific Evidence gave very little attention to meta-analysis.  Against this historical backdrop, it is interesting to see what guidance the new third edition provides to the federal judiciary on this important topic.

STATISTICS CHAPTER

The statistics chapter of the third edition continues to give scant attention to meta-analysis.  The chapter notes, in a footnote, that there are formal procedures for aggregating data across studies, and that the power of the aggregated data will exceed the power of the individual, included studies.  The footnote then cautions that meta-analytic procedures “have their own weakness,” without detailing what that one weakness is.  RMSE 3d at 254 n. 107.

The glossary at the end of the statistics chapter offers a definition of meta-analysis:

“meta-analysis. Attempts to combine information from all studies on a certain topic. For example, in the epidemiological context, a meta-analysis may attempt to provide a summary odds ratio and confidence interval for the effect of a certain exposure on a certain disease.”

Id. at 289.

This definition is inaccurate in ways that could yield serious mischief.  Virtually all meta-analyses are built upon a systematic review that sets out to collect all available studies on a research issue of interest.  It is a rare meta-analysis, however, that includes “all” studies in its quantitative analysis.  The meta-analytic process involves a pre-specification of inclusionary and exclusionary criteria for the quantitative analysis of the summary estimate of risk.  Those criteria may limit the quantitative analysis to randomized trials, or to analytical epidemiologic studies.  Furthermore, meta-analyses frequently and appropriately have pre-specified exclusionary criteria that relate to study design or quality.

On a more technical note, the offered definition suggests that the summary estimate of risk will be an odds ratio, which may or may not be true.  Meta-analyses may yield summary estimates of risk in terms of relative risks, hazard ratios, or even risk differences.  A meta-analysis may also combine data on means rather than proportions.

EPIDEMIOLOGY CHAPTER

The chapter on epidemiology delves into meta-analysis in greater detail than the statistics chapter, and offers apparently inconsistent advice.  The overall gist of the chapter, however, can perhaps best be summarized by the definition offered in this chapter’s glossary:

“meta-analysis. A technique used to combine the results of several studies to enhance the precision of the estimate of the effect size and reduce the plausibility that the association found is due to random sampling error.  Meta-analysis is best suited to pooling results from randomly controlled experimental studies, but if carefully performed, it also may be useful for observational studies.”

Reference Guide on Epidemiology, RMSE 3d at 624.  See also id. at 581 n. 89 (“Meta-analysis is better suited to combining results from randomly controlled experimental studies, but if carefully performed it may also be helpful for observational studies, such as those in the epidemiologic field.”).  The epidemiology chapter appropriately notes that meta-analysis can help address concerns over random error in small studies.  Id. at 579; see also id. at 607 n. 171.

Having told us that properly conducted meta-analyses of observational studies can be helpful, the chapter hedges considerably:

“Meta-analysis is most appropriate when used in pooling randomized experimental trials, because the studies included in the meta-analysis share the most significant methodological characteristics, in particular, use of randomized assignment of subjects to different exposure groups. However, often one is confronted with nonrandomized observational studies of the effects of possible toxic substances or agents. A method for summarizing such studies is greatly needed, but when meta-analysis is applied to observational studies – either case-control or cohort – it becomes more controversial.174 The reason for this is that often methodological differences among studies are much more pronounced than they are in randomized trials. Hence, the justification for pooling the results and deriving a single estimate of risk, for example, is problematic.175

Id. at 607.  The stated objection to pooling results from observational studies is certainly correct, but many research topics have sufficient studies available to allow for appropriate selectivity in framing inclusionary and exclusionary criteria to address the objection.  The chapter goes on to credit the critics of meta-analyses of observational studies.  As they did in the second edition of the RMSE, the authors repeat their citations to, and quotations from, early papers by John Bailar, who was then critical of such meta-analyses:

“Much has been written about meta-analysis recently and some experts consider the problems of meta-analysis to outweigh the benefits at the present time. For example, John Bailar has observed:

‘[P]roblems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with the method. For the present . . . I still prefer the thoughtful, old-fashioned review of the literature by a knowledgeable expert who explains and defends the judgments that are presented. We have not yet reached a stage where these judgments can be passed on, even in part, to a formalized process such as meta-analysis.’

John Bailar, “Assessing Assessments,” 277 Science 528, 529 (1997).”

Id. at 607 n.177.  Bailar’s subjective preference for “old-fashioned” reviews, which often cherry-picked the included studies, is, well, old fashioned.  More to the point, it is questionable science, and a distinctly minority viewpoint in light of substantial improvements in the conduct and reporting of meta-analyses of observational studies.  Bailar may be correct that some meta-analyses should never have left the protocol stage, but the RMSE 3d fails to provide the judiciary with the tools to appreciate the distinction between good and bad meta-analyses.

This categorical rejection, cited with apparent approval, is amplified by a recitation of some real or apparent problems with meta-analyses of observational studies.  What is missing is a discussion of how many of these problems can be and are dealt with in contemporary practice:

“A number of problems and issues arise in meta-analysis. Should only published papers be included in the meta-analysis, or should any available studies be used, even if they have not been peer reviewed? Can the results of the meta-analysis itself be reproduced by other analysts? When there are several meta-analyses of a given relationship, why do the results of different meta-analyses often disagree? The appeal of a meta-analysis is that it generates a single estimate of risk (along with an associated confidence interval), but this strength can also be a weakness, and may lead to a false sense of security regarding the certainty of the estimate. A key issue is the matter of heterogeneity of results among the studies being summarized.  If there is more variance among study results than one would expect by chance, this creates further uncertainty about the summary measure from the meta-analysis. Such differences can arise from variations in study quality, or in study populations or in study designs. Such differences in results make it harder to trust a single estimate of effect; the reasons for such differences need at least to be acknowledged and, if possible, explained.176 People often tend to have an inordinate belief in the validity of the findings when a single number is attached to them, and many of the difficulties that may arise in conducting a meta-analysis, especially of observational studies such as epidemiologic ones, may consequently be overlooked.177

Id. at 608.  The authors are entitled to their opinion, but their discussion leaves the judiciary uninformed about current practice, and best practices, in epidemiology.  A categorical rejection of meta-analyses of observational studies is at odds with the chapter’s own claim that such meta-analyses can be helpful if properly performed.  What was needed, and is missing, is a meaningful discussion to help the judiciary determine whether a meta-analysis of observational studies was properly performed.
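
Heterogeneity, at least, is something analysts measure rather than merely invoke.  The following sketch, again with made-up numbers, computes Cochran’s Q and the I-squared statistic, the conventional tools for asking whether study results vary more than sampling error alone would predict:

import math

# Hypothetical (log risk ratio, standard error) pairs for six observational studies.
studies = [(0.10, 0.12), (0.45, 0.15), (-0.05, 0.10), (0.60, 0.20),
           (0.05, 0.11), (0.50, 0.18)]

weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * lr for (lr, _), w in zip(studies, weights)) / sum(weights)

# Cochran's Q: weighted sum of squared deviations from the pooled estimate.
q = sum(w * (lr - pooled) ** 2 for (lr, _), w in zip(studies, weights))
df = len(studies) - 1  # degrees of freedom

# I-squared: the share of total variation attributable to between-study
# heterogeneity rather than within-study sampling error.
i_squared = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.1f} on {df} degrees of freedom; I-squared = {i_squared:.0f}%")

When Q greatly exceeds its degrees of freedom and I-squared is large (roughly 70% with these invented inputs), careful analysts investigate the sources of the heterogeneity, such as study design, population, or exposure assessment, rather than simply report one pooled number.  The existence of such diagnostic tools is exactly the sort of current practice the chapter leaves undiscussed.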

MEDICAL TESTIMONY CHAPTER

The chapter on medical testimony is the third pass at meta-analysis in RMSE 3d.   The second edition’s chapter on medical testimony ignored meta-analysis completely; the new edition addresses meta-analysis in the context of the hierarchy of study designs:

“Other circumstances that set the stage for an intense focus on medical evidence included

(1) the development of medical research, including randomized controlled trials and other observational study designs;

(2) the growth of diagnostic and therapeutic interventions;141

(3) interest in understanding medical decision making and how physicians reason;142 and

(4) the acceptance of meta-analysis as a method to combine data from multiple randomized trials.143

RMSE 3d at 722-23.

The chapter curiously omits observational studies, but the footnote reference (note 143) then inconsistently discusses two meta-analyses of observational, rather than experimental, studies:

“143. Video Software Dealers Ass’n v. Schwarzenegger, 556 F.3d 950, 963 (9th Cir. 2009) (analyzing a meta-analysis of studies on video games and adolescent behavior); Kennecott Greens Creek Min. Co. v. Mine Safety & Health Admin., 476 F.3d 946, 953 (D.C. Cir. 2007) (reviewing the Mine Safety and Health Administration’s reliance on epidemiological studies and two meta-analyses).”

Id. at 723 n.143.

The medical testimony chapter then creates further confusion by giving a more detailed listing of the hierarchy of medical evidence in the form of different study designs:

3. Hierarchy of medical evidence

With the explosion of available medical evidence, increased emphasis has been placed on assembling, evaluating, and interpreting medical research evidence.  A fundamental principle of evidence-based medicine (see also Section IV.C.5, infra) is that the strength of medical evidence supporting a therapy or strategy is hierarchical.  When ordered from strongest to weakest, systematic review of randomized trials (meta-analysis) is at the top, followed by single randomized trials, systematic reviews of observational studies, single observational studies, physiological studies, and unsystematic clinical observations.150 An analysis of the frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized controlled trials in the medical evidence hierarchy.151 Although they are at the bottom of the evidence hierarchy, unsystematic clinical observations or case reports may be the first signals of adverse events or associations that are later confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless, subsequent studies may not confirm initial reports (e.g., the putative association between coffee consumption and pancreatic cancer).154

Id. at 723-24.  This discussion further muddies the water by using a parenthetical to suggest that meta-analyses of randomized clinical trials are equivalent to systematic reviews of such studies: “systematic review of randomized trials (meta-analysis).”  Of course, systematic reviews are not meta-analyses, although they are a necessary precondition for conducting a meta-analysis.  The relationship between the procedures for a systematic review and a meta-analysis is in need of clarification, but the judiciary will not find it in the new Reference Manual.

OSHA’s HazCom Standard — Statistical and Scientific Nonsense

November 13th, 2011

Almost 28 years ago, the United States Department of Labor (Occupational Safety and Health Administration or OSHA) promulgated the Hazard Communication Standard.  29 C.F.R. § 1910.1200 (November 1983; effective date November 25, 1985) (HazCom standard).  Initially, the HazCom standard applied to importers and manufacturers of chemicals.  Starting one year later, on November 25, 1986, the standard covered manufacturing employers under OSHA jurisdiction, defining their duties to protect and inform employees.

The HazCom standard applies to all chemical manufacturers and distributors and to

“any chemical which is known to be present in the workplace in such a manner that employees may be exposed under normal conditions of use or in a foreseeable emergency.”

29 C.F.R. § 1910.1200(b)(1), and (b)(2).  The standard requires manufacturers and distributors of hazardous chemicals to inform not only their own employees of the dangers posed by the chemicals, but downstream employers and employees as well.  The standard implements this duty to warn downstream employers’ employees by requiring that containers of hazardous chemicals leaving the workplace be labeled with “appropriate hazard warnings.”  See Martin v. American Cyanamid Co., 5 F.3d 140, 141-42 (6th Cir. 1993) (reviewing agency’s interpretation of the standard).

The HazCom standard attempts to provide some definition of the health hazards for which warnings are required:

“For health hazards, evidence which is statistically significant and which is based on at least one positive study conducted in accordance with established scientific principles is considered to be sufficient to establish a hazardous effect if the results of the study meet the definitions of health hazards in this section.”

29 C.F.R. § 1910.1200(d)(2).

This regulatory language is troubling.  What does “statistically significant” mean?  The concept remains important in health effects research, but several writers have subjected significance testing specifically, and frequentist statistics generally, to criticism.  See, e.g., Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Ann Arbor 2008) (an example of one of the more fringe, and not particularly cogent, criticisms of frequentist statistics).  And what are the “established scientific principles” that would allow a single “positive study” to “establish” a hazardous “effect”?

The HazCom standard is important not only for purposes of regulatory compliance, but for its potential implications for products liability law, as well.  With its importance in mind, what can be said about the definition of health hazard, provided in 29 C.F.R. § 1910.1200(d)(2)?

Perhaps a good place to start is with the guidance provided by OSHA on compliance with the HazCom standard.  To be sure, like most agency guidance statements, this one is prefaced with caveats and cautions:

“This guidance is not a standard or regulation, and it creates no new legal obligations. It is advisory in nature, informational in content, and is intended to assist employers in providing a safe and healthful workplace. Pursuant to the Occupational Safety and Health Act, employers must comply with safety and health standards promulgated by OSHA or by a state with an OSHA-approved state plan. In addition, pursuant to Section 5(a)(1), the General Duty Clause of the Act, employers must provide their employees with a workplace free from recognized hazards likely to cause death or serious physical harm. Employers can be cited for violating the General Duty Clause if there is a recognized hazard and they do not take reasonable steps to prevent or abate the hazard. However, failure to implement any specific recommendations in this guidance is not, in itself, a violation of the General Duty Clause. Citations can only be based on standards, regulations, and the General Duty Clause.”

U.S. Dep’t of Labor, Guidance for Hazard Determination for Compliance with the OSHA Hazard Communication Standard (29 CFR § 1910.1200) (July 6, 2007).

Section II of the Guidance describes how manufacturers may assess whether their chemicals are “hazardous.”  A health hazard is defined as a chemical

“for which there is statistically significant evidence based on at least one study conducted in accordance with established scientific principles that acute or chronic health effects may occur in exposed employees.”

A fair-minded person might object that this is no guidance at all.  “Statistically significant” is not defined in the regulations.  “Study” is not defined.  The guidance specifies that the study or studies must be conducted in accordance with “established scientific principles,” but must the interpretation or judgment of causality be made similarly in accordance with such principles?  One would hope so, but the Guidance does not really specify.  The use of “may” seems to inject a level of conjecture or speculation into the hazard assessment.

Section V of the Guidance addresses data analysis, and here the agency attempts to provide some meaning to statistical significance and other terms in the regulation, but in doing so, the Guidance offers incoherent, incredible advice.

The Guidance notes that the regulation specifies one “positive study,” which presumably is a study that is some evidence in favor of an “effect.”  Because we are dealing with chemical exposures in occupational settings, the studies at issue will be, at best, observational studies.  Randomized clinical trials are out.  The one study (at least) at issue must be sufficient to establish a hazardous effect if that effect is considered a “health hazard” within the meaning of the regulations.  This is problematic on many levels.  What sort of study are we discussing?  An experimental study in planaria worms, a case study of a single human, an ecological study, or an analytical epidemiologic (case-control or cohort) study?  Whatever the study is, it would be a most remarkable study if it alone were “sufficient” to “establish” an “effect.”

A reasonable manufacturer or disinterested administrator surely would interpret the sufficiency requirement to mean that the entire evidentiary display must be considered rather than whether one study, taken in isolation, ripped from its scientific context, should be used to suggest a duty to warn.  The Guidance, and the regulations, however, never address the real-world complexity of hazard assessment.

Section V of the Guidance offers a failed attempt to illuminate the meaning of statistical significance:

“Statistical significance is a mathematical determination of the confidence in the outcome of a test. The usual criterion for establishing statistical significance is the p-value (probability value). A statistically significant difference in results is generally indicated by p < 0.05, meaning there is less than a 5% probability that the toxic effects observed were due to chance and were not caused by the chemical. Another way of looking at it is that there is a 95% probability that the effect is real, i.e., the effect seen was the result of the chemical exposure.”

Few statisticians or scientists would regard the proffered definition as acceptable.  The Guidance’s statement that a p-value is equivalent to the probability of the “toxic effect” occurring by chance is unacceptable for several reasons.

First, it is a notoriously incorrect, fallacious statement of the meaning of a p-value:

“Since p is calculated by assuming the null hypothesis is correct (that there is no difference [between observed and expected] in the full population), the p-value cannot give the chance that this hypothesis is true.  The p-value merely gives the chance of getting evidence against the null hypothesis as strong or stronger than the evidence at hand — assuming that the null hypothesis … is correct.”

David H. Kaye, David E. Bernstein, and Jennifer L. Mnookin, The New Wigmore: Expert Evidence § 12.8.2, at 559 (2d ed. 2010) (discussing the transpositional fallacy).
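
A small, back-of-the-envelope calculation shows why the transposition matters.  The sketch below uses avowedly hypothetical inputs (a 10% prevalence of true effects among tested hypotheses, a 5% significance level, and 80% power) to compute the probability that a “statistically significant” finding reflects a real effect:

# Hypothetical parameters, for illustration only.
prior_true = 0.10   # share of tested hypotheses that are genuinely true
alpha = 0.05        # Type I error rate (false-positive rate)
power = 0.80        # probability of detecting a true effect

significant_true = prior_true * power          # true effects flagged as significant
significant_false = (1 - prior_true) * alpha   # true nulls flagged by chance
prob_real = significant_true / (significant_true + significant_false)

print(f"P(effect is real | p < 0.05) = {prob_real:.2f}")

Under these assumptions, barely two-thirds of “significant” findings are real, nowhere near the 95% that the Guidance’s transposed reading of the p-value would suggest.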

Second, even if we could ignore the statistical solecism, the Guidance’s use of a mechanical test for statistical significance is troubling.  The p-value is not necessarily an appropriate protection against Type I error, or a “false alarm” that there is an association between the exposure and outcome of interest.  Multiple testing and other aspects of a study may inflate the number of false alarms to the point that a study with a low p-value, even one much lower than 5%, will not rule out the likely role of chance as an explanation for the study’s result.
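
The multiple-testing point can be made with equally simple arithmetic: with k independent comparisons each tested at the 5% level, the chance of at least one false alarm is 1 - 0.95^k, and it grows quickly:

# Family-wise false-alarm rate for k independent tests at the 5% level.
for k in (1, 5, 10, 20):
    familywise = 1 - (1 - 0.05) ** k
    print(f"{k:2d} independent tests -> P(at least one false positive) = {familywise:.2f}")
# Approximately: 1 test -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64.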

Third, the Guidance’s suggestion that “statistical significance” boils down to a conclusion that the “effect is real” may be its greatest offense against scientific and statistical methodology.  Section V of the Guidance emphasizes that the HazCom standard states that

“evidence that is statistically significant and which is based on at least one positive study conducted in accordance with established scientific principles is considered to be sufficient to establish a hazardous effect if the results of the study meet the [HCS] definitions of health hazards.”

This is nothing more than semantic fiat and legerdemain.

Statistical significance may, in some circumstances, permit an inference that the divergence from the expected was not likely due to chance, but it cannot, in the context of observational studies, allow for a conclusion that the divergence resulted from a cause-effect relationship between the exposure and the outcome.  Statistical significance cannot rule out systematic bias or confounding in the study; nor can it help us reconcile inconsistencies across studies.  The study may have identified an association, which must be assessed for its causal or non-causal nature, in the context of all relevant evidence.  See Austin Bradford Hill, “The Environment and Disease: Association or Causation?” 58 Proc. Royal Soc’y Med. 295 (1965).
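
A short simulation, built on an entirely hypothetical data-generating process, illustrates why statistical significance cannot carry the weight the Guidance assigns to it.  Here a confounder (say, smoking) raises both the probability of exposure and the probability of disease; the exposure itself has no effect on the disease, yet the crude relative risk is elevated and its confidence interval comfortably excludes 1.0:

import math
import random

random.seed(1)
n = 20_000
exposed_cases = exposed_total = unexposed_cases = unexposed_total = 0

for _ in range(n):
    smoker = random.random() < 0.30                       # hypothetical confounder
    exposed = random.random() < (0.60 if smoker else 0.20)
    # Disease risk depends only on the confounder, never on the exposure.
    disease = random.random() < (0.10 if smoker else 0.02)
    if exposed:
        exposed_total += 1
        exposed_cases += disease
    else:
        unexposed_total += 1
        unexposed_cases += disease

rr = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)
se = math.sqrt(1 / exposed_cases - 1 / exposed_total
               + 1 / unexposed_cases - 1 / unexposed_total)
lo, hi = math.exp(math.log(rr) - 1.96 * se), math.exp(math.log(rr) + 1.96 * se)
print(f"crude RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f}), "
      "even though exposure has no causal effect")

Only an analysis that measures and adjusts for the confounder, or a consideration of the full evidentiary context, would reveal that this “statistically significant” association is not causal.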

The OSHA Guidance is really no guidance at all.  Ensuring worker health and safety by requiring employers to provide industrial hygiene protections for workers is an exceedingly important task, but this aspect of the HazCom standard is incoherent and incompetent. Workers and employers are in the dark, and product suppliers are vulnerable to arbitrary and capricious enforcement.

A New Day – A New Edition of the Reference Manual of Scientific Evidence

September 28th, 2011

It’s an event that happens about once every ten years – a new edition of the Reference Manual on Scientific Evidence.  This sort of thing gets science-geek-nerd lawyers excited.  Today, the National Academies released the new, third edition of the Manual.  The work has grown from the second edition into a hefty volume, now over 1,000 pages.  Shorter than War and Peace, but easier to read than Gravity’s Rainbow.  Paperback volumes are available for $71.96, but a PDF file is available for free.

Unlike the first two editions, which were products of the Federal Judicial Center, the new edition was produced under the supervision of an ad hoc committee – the Committee on the Development of the Third Edition of the Reference Manual on Scientific Evidence – under the auspices of the National Academies’ Committee on Science, Technology and the Law.  See Media Release from the National Academies.

The release of the Third Edition of the Manual was accompanied by a ceremony today at the National Academies’ Keck Center in Washington, DC.  Dr. Cicerone, President of the National Academies, and Judge Barbara Rothstein, director of the Federal Judicial Center, gave inspirational talks on the rightful, triumphant, aspirational role of science in the courtroom, and the Manual’s role in ensuring that science gets its day in the adversarial push and shove of litigation.

Co-chairs Judge Kessler and Dr. Kassirer introduced the other members of the ad hoc committee, and the substantive work that makes up the Third Edition.

Other members of the ad hoc committee were:

  • Hon. Ming Chin, Associate Justice of the Supreme Court of California.
  • Hon. Pauline Newman, Judge, U.S. Court of Appeals for the Federal Circuit.
  • Hon. Kathleen O’Malley, Judge, U.S. Court of Appeals for the Federal Circuit.
  • Hon. Jed S. Rakoff, Judge, U.S. District Court for the Southern District of New York.
  • Channing R. Robertson, Ph.D., School of Engineering, Stanford University.
  • Joseph V. Rodricks, Ph.D., Environ Corp.
  • Allen J. Wilcox, Ph.D., Senior Investigator in the Epidemiology Branch of the NIEHS.
  • Sandy L. Zabell, Professor of Statistics and Mathematics, Northwestern University.

The Third Edition has some notable new chapters on “Forensic Identification Expertise” (Paul Giannelli, Edward Imwinkelried, and Joseph Peterson), on “Neuroscience” (Henry Greely and Anthony Wagner), on “Mental Health Evidence” (Paul Appelbaum), and on “Exposure Science” (Joseph Rodricks).

Other chapters that were present in the Second Edition are substantially revised and updated, including the chapters on “Statistics” (David Kaye and the late David Freedman), on “DNA Identification Evidence” (David Kaye and George Sensabaugh), on “Epidemiology” (Michael Green, Michal Freedman, and Leon Gordis), and on “Engineering” (Channing R. Robertson, John E. Moalli, and David L. Black).

The chapter on “Medical Testimony” (John Wong, Lawrence Gostin, and Oscar Cabrera) is also substantially revised and expanded, with new authors, a welcome change.

A substantial portion of the new edition incorporates chapters from the second edition with little or no change:

Justice Stephen Breyer’s “Introduction,” the late Professor Margaret Berger’s essay on “The Admissibility of Expert Testimony,” David Goodstein’s primer on “How Science Works,” as well as the chapters on “Multiple Regression” (Daniel L. Rubinfeld), on “Survey Research” (Shari Diamond), and on “Toxicology” (Bernard Goldstein and Mary Sue Henifin).

Also new in this edition is the support of private organizations, the Starr Foundation and the Carnegie Corporation of New York.

Judge Kessler explained how the Supreme Court’s decision in Daubert introduced “rigorous standards” for scientific evidence in federal courtrooms.  Expert witnesses must follow the reasoning, principles, and procedures of modern science.  Lovely sentiments, and wonderful if heeded.

The public release ceremony ended with questions from the audience, both live and from the webcast.

To provide some comic relief to the serious encomia to science in the law, Dr. Tee Guidotti rose to give a deadpan imitation of Emily Litella.  Dr. Guidotti started by lecturing the gathering on the limitations of frequentist statistics and the need to accommodate Bayesian statistical analyses in medical testimony.  Dr. Kassirer politely interrupted Dr. Guidotti to point out that Bayesian analyses were covered in some detail in the chapters on statistics and on medical testimony.  Without missing a beat, Dr. Guidotti shifted blame to the index, which he claimed failed to identify these discussions for him.  Judge Kessler delivered the coup de grace by pointing out that the discussions of Bayesian statistics were amply referenced in the index.  Oops; this was a hot court.  A chorus of sotto voce “never minds” rippled through the audience.

One question not asked was whether there are mandatory minimum sentences for judges who fail to bother to read the relevant sections before opining on science issues.

The Third Edition of the Manual appears to be a substantial work.  Lawyers will ignore it at their peril.  Trial judges, of course, have immunity, except to the extent appellate judges are paying attention.  Like its predecessors, the new edition likely has a few quibbles, quirks, and quarrels lurking in its 1,000-plus pages, and I am sure that the scrutiny of the bar and the academy will find them.  Let’s hope that it takes fewer than a decade to see them corrected in a Fourth Edition.