TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

A Rule of Completeness for Statistical Evidence

December 23rd, 2011

Witnesses swear to tell the “whole” truth, but lawyers are allowed to deal in half truths.  Given this qualification on lawyers’ obligation of truthfulness, the law of evidence prudently adjusts its rules of admissibility for writings to permit an adverse party to ensure that written statements are not yanked out of context.  Waiting days, if not weeks, in a trial to restore the context is an inadequate remedy for these “half truths.”  If a party introduces all or part of a writing or recorded statement, an adverse party may “require the introduction, at that time, of any other part — or any other writing or recorded statement — that in fairness ought to be considered at the same time.”  Fed. R. Evid. 106 (Remainder of or Related Writings or Recorded Statements).  See also Fed. R. Civ. P. 32(a)(4) (rule of completeness for depositions).

This “rule of completeness” has its roots in the common law, and in the tradition of narrative testimony.  The Advisory Committee note to Rule 106 comments that the rule is limited to “writings and recorded statements and does not apply to conversations.”  The Rule and the note ignore the possibility that the problematic incompleteness might take the form of mathematical or statistical evidence.

Confidence Intervals

Consider sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.

The new third edition dilutes this clear prescription, but still conveys the basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The evidentiary point is that the standard error, or the confidence interval (C.I.), is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.

Of course, the 95 percent confidence interval is the estimate (the risk ratio, the point estimate) plus or minus 1.96 standard errors.  By analogy to Rule 106, lawyers should insist that the confidence interval, or some similar expression of the size of the standard error, be provided at the time that the examiner asks about, or the witness gives, the sample estimate.  There are any number of consensus position papers, as well as guidelines for authors of papers, which specify that risk ratios should be accompanied by confidence intervals.  Courts should heed those recommendations, and require parties to present the complete statistical idea – estimate and random error – at one time.
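
To make the point concrete, here is a minimal sketch, in Python with purely illustrative numbers (not drawn from any study discussed here), of how a 95 percent confidence interval for a risk ratio is computed; for ratio measures the interval is ordinarily calculated on the logarithmic scale:

```python
import math

# Illustrative numbers only -- not taken from any study discussed here.
rr = 1.5          # point estimate of the risk ratio
se_log_rr = 0.25  # standard error of ln(RR), as typical software reports it

lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
# Reporting "RR = 1.5" alone, without the interval (here roughly 0.9 to 2.5),
# strips out the very information about random error that the text describes.
```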

One disreputable lawyer trick is to present incomplete confidence intervals.  Plaintiffs’ counsel, for instance, may inquire into the upper bound of a confidence interval, and attempt to silence witnesses when they respond with both the lower and upper bounds.  “Just answer the question, and stop volunteering information not asked.”  Indeed, some unscrupulous lawyers have been known to cut off witnesses from providing the information about both bounds of the interval, on the claim that the witness was being “unresponsive.”  Judges who are impatient with technical statistical testimony may even admonish witnesses who are trying to make sure that they present the “whole truth.”  Here again, the completeness rule should protect the integrity of the fact finding by allowing, and requiring, that the full information be presented at once, in context.

Although I have seen courts permit the partial, incomplete presentation of statistical evidence, I have yet to see a court acknowledge the harm from failing to apply Rule 106 to quantitative, statistical evidence.  One court, however, did address the inherent error of permitting a party to emphasize the extreme values within a confidence interval as “consistent” with the data sample.  Marder v. G.D. Searle & Co., 630 F. Supp. 1087 (D. Md. 1986), aff’d mem. on other grounds sub nom. Wheelahan v. G.D. Searle & Co., 814 F.2d 655 (4th Cir. 1987) (per curiam).

In Marder, the plaintiff claimed pelvic inflammatory disease from an IUD.  The jury was deadlocked on causation, and the trial court decided to grant the defendant’s motion for a directed verdict, on the ground that the relative risk involved was less than two. Id. at 1092. (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”)
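
The arithmetic behind the court’s two-fold benchmark is worth setting out.  On the contestable assumptions that the relative risk reflects a causal effect, is free of bias and confounding, and applies to the individual plaintiff, the probability of specific causation is the attributable fraction among the exposed:

```latex
\[
  P(\text{causation}) \;=\; \frac{RR - 1}{RR},
  \qquad\text{so}\qquad
  P(\text{causation}) > \tfrac{1}{2} \;\iff\; RR > 2 .
\]
```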

The plaintiff sought to resist entry of judgment by arguing that although the relative risk was less than two, the court should consider the upper bound of the confidence interval, which ranged from 0.9 to 4.0.  Id.  In other words, the plaintiff argued that she was entitled to have the jury consider and determine that the true value was as high as 4.0.

The court, fairly decisively, rejected this attempt to isolate the upper bound of the confidence interval:

“The upper range of the confidence intervals signify the outer realm of possibilities, and plaintiffs cannot reasonably rely on these numbers as evidence of the probability of a greater than two fold risk.  Their argument reaches new heights of speculation and has no scientific basis.”

The Marder court could have gone further by pointing out that the confidence interval does not provide a probability for any value within the interval.

Multiple Testing

In some situations, completeness may require more than the presentation of the size of the random error, or the width of the confidence interval.  When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing.  The fact of multiple testing will inflate the false-positive error rate.
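
A back-of-the-envelope calculation shows how quickly the problem grows.  Assuming independent tests, each run at the conventional 0.05 level, the chance of at least one false positive is 1 − 0.95^k for k tests; a short Python sketch:

```python
# Family-wise false-positive rate for k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests: P(at least one false positive) = {family_wise:.2f}")
# With 20 independent tests, the chance of at least one spurious
# "significant" result is already about 64 percent.
```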

Here is the relevant language from Kaye and Freedman’s chapter on statistics, in the Reference Manual (3d ed.):

4. How many tests have been done?

Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)¹⁰ = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available… . In these situations, courts should not be overly impressed with claims that estimates are significant. …”

RMSE 3d ed. at 256-57.
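
The coin example in the quoted passage is easy to verify.  The sketch below is illustrative only; the 5,000-toss figure simply stands in for “a few thousand,” and the estimate comes from simulation rather than an exact formula:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def has_run_of_heads(n_tosses, run_length=10):
    """True if a sequence of fair-coin tosses contains a run of heads of the given length."""
    streak = 0
    for _ in range(n_tosses):
        if random.random() < 0.5:  # heads
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0
    return False

print(f"P(10 heads in the first 10 tosses) = {1 / 2**10:.5f}")  # about 0.001

trials = 2000
hits = sum(has_run_of_heads(5000) for _ in range(trials))
print(f"Estimated P(some run of 10 heads in 5,000 tosses) = {hits / trials:.2f}")
# The run that would be "strong evidence" if pre-specified becomes
# highly likely once the search is long enough.
```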

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis.  When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires presentation of the number of tests performed and of the distorting effect that multiple testing has on the pre-specified Type I error rate.  Even a nominally statistically significant finding must be understood in the full context of the study.

Many texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves.  Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.”
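
For readers who want to see what “making the adjustment for themselves” might look like, here is a minimal sketch of the simplest such correction, the Bonferroni adjustment, applied to hypothetical p-values invented for illustration:

```python
# Bonferroni correction: divide the nominal Type I error rate by the
# number of comparisons actually performed.
nominal_alpha = 0.05
p_values = [0.04, 0.012, 0.20, 0.03, 0.008]     # five hypothetical comparisons
adjusted_alpha = nominal_alpha / len(p_values)  # 0.01

for p in p_values:
    nominal = "significant" if p < nominal_alpha else "not significant"
    adjusted = "significant" if p < adjusted_alpha else "not significant"
    print(f"p = {p:.3f}: nominally {nominal}; after Bonferroni, {adjusted}")
# Four of the five results are nominally "significant," but only one
# survives the adjustment for multiple comparisons.
```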

The Integrity of Facts in Judicial Decisions

December 21st, 2011

One of the usual tasks of an appellate judge’s law clerk is to read the record – the entire record.  In my clerking experience, the law clerk who had the assignment for a case in which the judge was writing an opinion was responsible for knowing every detail of the record.  The judge believed that fidelity to the factual record was an absolute.

Not so for other appellate judges.  See, e.g., Jacoby, “Judicial Opinions as ‘Minefields of Misinformation’: Antecedents, Consequences and Remedies,” University Public Law and Legal Theory Working Papers, Paper 35 (N.Y. 2006).

Some important cases turn on facts misunderstood or misrepresented by appellate courts.  A few days ago, Kyle Graham blogged about a startling discovery in the Summers v. Tice case, which is covered in every first-year torts class.  Kyle Graham, “Summers v. Tice: The Rest of the Story” (Dec. 1, 2011).

Summers v. Tice, 33 Cal.2d 80, 199 P.2d 1 (1948), is a leading California tort law case that shifted the burden of proof on causation to the two defendants.  The rationale for shifting the burden was the gross negligence of both defendants, and the plaintiff’s faultless inability to identify which of the two defendants, Simonson or Tice, was responsible for shooting the plaintiff with a shotgun in their ill-fated quail hunt.

Professor Graham did something unusual:  he actually read the record of the bench trial.  It turns out that the facts were different from, and much more interesting than, those presented by the California Supreme Court.  Simonson admitted shooting Summers, and implicated Tice.  Tice denied shooting.  The trial judge resolved the credibility issue against Tice, although it seems to have been a close question.

More important, Tice testified that his gun was loaded with No. 6 shot, whereas Simonson had used No. 7.5 shot.  Summers admitted that the pellets had been given to him after his medical treatment, but he could not find them at the time of trial.  Had he kept the pellets, Summers would have been able to distinguish between the gunfeasors.

Spoliation anyone?  Missing evidence?  Adverse inference?

Even if the trial judge was unimpressed with Tice’s denial of having discharged his shotgun, Tice’s lack of credibility could not turn into affirmative evidence that he had used number 7.5 shot, as had Simonson.  This was a contested issue, on which the plaintiff could have adduced evidence.  The plaintiff’s failure to do so was the result of his own post-accident carelessness (or worse) in not keeping important evidence.  Tice’s testimony on the size of the shot in his gun was undisputed, even if the trial court thought that he was not a credible witness.

Thus, on the real facts, the shifting of the burden of proof, on the rationale that the plaintiff was without fault for his inability to produce evidence against Simonson or Tice, was quite unjustified.  The plaintiff was culpable for the failure of proof, and there was no affirmative evidence that the two potential causative agents were indistinguishable. The defendants were not in a better position than the plaintiff to identify who had been the cause of plaintiff’s wounds.

The trial court’s credibility assessment of Tice, for having denied a role in the shooting, did not turn the absence of evidence into affirmative evidence that both defendants used the same size pellets in their shotguns.  What makes for a great law school professor’s hypothetical was the result of an obviously fallacious inference, and a factual fabrication, born of sloppy judicial decision making.

We can see a similar scenario play out in the New Jersey decisions that reversed directed verdicts in asbestos colorectal cancer cases.  Landrigan v. Celotex Corp., 127 N.J. 404, 605 A.2d 1079 (1992); Caterinicchio v. Pittsburgh Corning Corp., 127 N.J. 428, 605 A.2d 1092 (1992). In both cases, the trial courts directed verdicts, assuming arguendo that asbestos can cause colorectal cancer (a dubious proposition), on the ground that the low relative risk cited by plaintiffs’ expert witnesses (about 1.5) was factually insufficient to support a verdict for plaintiffs on specific causation.  Indeed, the relative risk suggested that the odds were about 2 to 1 in defendants’ favor that the plaintiffs’ colorectal cancers were not caused by asbestos.
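
On the same reasoning as the two-fold benchmark discussed above, and with the same caveats about bias, confounding, and applicability to the individual plaintiff, a relative risk of 1.5 works out as follows:

```latex
\[
  P(\text{causation}) = \frac{1.5 - 1}{1.5} = \frac{1}{3},
  \qquad
  \text{odds against causation} = \frac{2/3}{1/3} = 2\!:\!1 .
\]
```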

The intermediate appellate courts affirmed the directed verdicts, but the New Jersey Supreme Court reversed and remanded both judgments on curious grounds.  According to the Court, there were other probative factors that the juries could have used to make out specific causation:

“Dr. Wagoner did not rely exclusively on epidemiological studies in addressing that issue.   In addition to relying on such studies, he, like Dr. Sokolowski, reviewed specific evidence about decedent’s medical and occupational histories.   Both witnesses also excluded certain known risk factors for colon cancer, such as excessive alcohol consumption, a high-fat diet, and a positive family history.   From statistical population studies to the conclusion of causation in an individual, however, is a broad leap, particularly for a witness whose training, unlike that of a physician, is oriented toward the study of groups and not of individuals.   Nonetheless, proof of causation in toxic-tort cases depends largely on inferences derived from statistics about groups.”

Landrigan, 127 N.J. at 422.  The NJ Supreme Court held that the plaintiffs’ failure to show a relative risk in excess of 2.0 was not fatal to their cases, when there was other evidence that the jury could consider, in addition to the relative risks.

Well, actually, there was no expert witness support for the assertion.  Completely absent from the evidentiary displays in both the Landrigan and Caterinicchio cases was any evidence, apart from plaintiffs’ expert witnesses’ hand waving, that a higher relative risk existed among the subcohort of asbestos insulators who had had heavier exposure or who had concomitant pulmonary disease.  There was no evidence that those exposed workers who lacked “excessive alcohol consumption, a high-fat diet, and a positive family history” had any increased risk.  Indeed, the Selikoff study relied upon extensively by plaintiffs’ expert witnesses failed to make any adjustment for the noted risk factors, as well as for the greater prevalence of smoking histories among the insulators than among the unexposed comparator population.  The Court turned the absence of evidence into the factual predicate for its holding that defendants were not entitled to judgment.

Now that’s judicial activism.

Silica Science – Junk Science is Not Limited to The Courts

December 12th, 2011

“Clowns to the left of me; Jokers to the right; here I am, stuck in the middle with you.”


Back in October, David Michaels, the head of OSHA, testified at a House oversight hearing, “Workplace Safety: Ensuring a Responsible Regulatory Environment.” The Congressmen were inquiring into OSHA’s enforcement and regulatory initiatives on several fronts, including silica exposures.

This is the same David Michaels who used to be a hired expert witness for plaintiffs in toxic tort cases. See “David Michaels’ Public Relations Problem” (Dec. 2, 2011).

Not surprisingly, when the questioning turned to silica, Michaels played the cancer card:  crystalline silica is a “known” human carcinogen.

Congressman Larry Bucshon (R-IN), a surgeon when he is not holding forth in Congress, found the talk of cancer to be provocative.  Bucshon scolded Michaels:

“I don’t like it when people use buzz words that try to get people’s attention, and cancer is one of those.”

* * * * *

“…I’m a thoracic surgeon, so I want to focus a little bit on what you said earlier as it relates to silica dust. I’m curious about your comment about silica-dust related lung cancer, because I’ve been a thoracic surgeon for 15 years and I’ve done a lot of lung cancer surgery, and I haven’t seen one patient that’s got it from silica dust.”

A fascinating exchange for several reasons.

First, we could expect Michaels to play the cancer card, just as he has in his role as plaintiffs’ expert witness.  As we will see, his cancer evidence is not far fetched, although it is also not particularly convincing.

Second, the junk science from Congressman Bucshon is distressing.  As a physician, he should know that his experience in surgery has no relevance at all to the question whether crystalline silica can cause lung cancer.

Back in 1996, a working group of the World Health Organization’s International Agency for Research on Cancer (IARC) voted to reclassify crystalline silica, the most ubiquitous mineral on the face of Planet Earth, as a known human carcinogen.  Michaels recited this “evidence,” but he failed to mention that the evidence was conflicting, as were the votes of the working group. The response of the scientific community to the IARC pronouncement was highly critical.  See Patrick A. Hessel, John F. Gamble, J. Bernard L. Gee, Graham Gibbs, Francis H.Y. Green, W. Keith C. Morgan, and Brooke T. Mossman, “Silica, Silicosis, and Lung Cancer: A Response to a Recent Working Group Report,” 42 J. Occup. Envt’l Med. 704 (2000).

The vote of the working group was very close; indeed, the swing of a single vote would have changed the outcome. One of the working group members later wrote:

“Some equally expert panel of scientists presented with the same information on another occasion could of course have reached a different verdict. The evidence was conflicting and difficult to assess and such judgments are essentially subjective.”

Corbett McDonald & Nicola Cherry, “Crystalline Silica and Lung Cancer:  The Problem of Conflicting Evidence,” 8 Indoor Built Environment 121, 121 (1999).  Remarkably, this panel member explained his decision to vote for reclassification as follows:

“The basic problem was that the evidence for carcinogenicity was conflicting – generally absent in situations of high and widespread exposure and strong only in a few rather special occupations.  The advice by the IARC to consider hazard rather than risk did much to resolve the difficulty.”

Id. at 125.  I suspect that the evidence for a difference in meaning between “hazard” and “risk” is even more tenuous and conflicting than the evidence in favor of carcinogenicity.

IARC classifications, however, take on a life of their own.  They are an invitation to stop thinking, and to stop analyzing the evidence.  Federal bureaucrats and staff scientists love them for exactly this reason:  they can hide behind the authority of the WHO without having to work on reviewing the evidence, or updating their judgment when new studies come out.

It should not be surprising, therefore, that the National Institutes of Health’s National Toxicology Program (NTP), working off the WHO decision, recognized crystalline silica as a human carcinogen.  Other agencies and medical groups followed in lockstep.

What you will not hear from Michaels or his followers is that when the National Institute for Occupational Safety and Health conducted the largest mortality study on the issue, it found a decreased lung cancer risk among men who actually had sufficient silica exposure to develop silicosis. See Geoffrey Calvert, et al., “Occupational silica exposure and risk of various diseases:  an analysis using death certificates from 27 states of the United States,” 60 Occup. Envt’l Med. 122 (2003).  Cf. “Congressman tells OSHA chief not to use ‘buzz’ words like cancer” (Oct. 10, 2011).

To give the devil his due, at least Michaels had “some” evidence to support his pronouncement, even if the evidence was incomplete and contradicted by other important evidence.  Congressman Bucshon’s recitation of his experience as a surgeon was completely off the mark.  His staffers obviously failed him in their research, and Bucshon’s reliance upon his own anecdotal experience was quite inappropriate to rebut the dubious judgment of the OSHA Administrator.

Some people might describe the exchange between Bucshon and Michaels as resembling two monkeys playing chess.  I think of it as exemplifying the scientific illiteracy in all three branches of our government.

Scientific American(s) and the other 99%

December 7th, 2011

If you have an interest in the history of science, especially as it plays out in the so-called state-of-the-art defense in products liability litigation, you may find the following offer helpful.  For the remainder of the month, Scientific American, which is now published by Nature, is making its archived issues, 1845-1909, available free of charge.

There is little more fascinating than reading what people were thinking, saying, and writing at times past.  Most of what we think we know about the past is filtered by historians rather than obtained by accessing primary sources.  The Scientific American archive is a useful corrective, especially in the contentious area of health-effects litigation.

The archive yields some interesting historical insights.  In 1871, 140 years ago, Scientific American ran an article on the ill-health effects of smoking.  “To smoke or not to smoke,” Scientific American 375 (Dec. 9, 1871).  Here are some highlights:

“M. Beau notices eight cases of angina pectoris caused by the use of tobacco.

Professor Lizars records several cases of cancer of the tongue and lips caused by the use of the pipe. The writer has known one such instance, and never wishes to see another example of such terrible suffering resulting from a worse than useless habit.”

These pronouncements might not pass muster under today’s evidence-based medicine, but they were astute observations in need of testing, in 1871.

Not all the medical observations and claims were equally prescient.  Our forebears were not immune from the idiocies and enthusiasms of medical quackery.  Cancer remedies seemed to be a particular focus of much unenlightened attention:

“Col. Ussery, of the parish of De Soto, informs the Editor of the Caddo Gazette that he fully tested a remedy for this troublesome disease, recommended to him by a Spanish woman, a native of the country. The remedy is this:  Take an egg and break it, then pour out the white, retaining the yolk in the shell, put in salt and mix with the yolk as long as it will receive it, stir them together until the salve is formed, put a portion of this on a piece of sticking plaster and apply it to the cancer about twice a day. He has made the experiment twice in his own family with complete success.”

“Remedy for Cancer,” Scientific American 298 (June 12, 1847).

Or this forerunner of the clinical trial:

“The Tuscaloosa Observer says it has seen it stated, more than once, that the common cranberry was efficacious in the cure of cancer, but have never, until very recently, been an eye-witness to the fact. Mr. Middleton Belk, residing within four or five miles of this city, who was afflicted with a cancer on the nose for the last eight years, was induced to try cranberries applied as a poultice; and to his great joy and satisfaction, has experienced a perfect and radical cure. We mention this fact at the instance of Mr. Belk, who is desirous that others suffering under the same affliction, may avail themselves of this simple, but valuable remedy.”

“Cranberries a Cure for Cancer,” 3 Scientific American 408 (Sept. 9, 1848).  Another article, three years later, touted mineral naptha as a cancer cure.  “Mineral Naptha,” 6 Scientific American 243 (April 19, 1851).

The pages of Scientific American document the rise of asbestos use and the growing awareness of asbestos’ great utility to help control and prevent fire and burns.  For instance, in 1876, the magazine described the utility of asbestos in roofing materials and in pipecovering.  “The Industrial Uses of Asbestos,” Scientific American 258 (April 22, 1876).

A few years later, an article described the widespread use of asbestos in industrial applications, both in Europe and in the United States:

“For some time past Toope’s covering for steam surfaces has been in use in England, giving great satisfaction and receiving the indorsement of many prominent English engineers.  The business of manufacturing and selling it is conducted there by a limited company located in London.
In this country Mr. Charles Toope, manufacturing agent, having an office and works at 353 East 78th street, New York City, is making and introducing the covering.  The covering is readily applied, requires no previous preparation, and when in place is permanent, being incapable of injury by jarring or pounding.”

“Felt and Asbestos Covering for Steam Surfaces,” Scientific American 357 (December 4, 1880). [353 East 78th is right around the corner from me.  I doubt that many of the residents of this mid-rise apartment building know that an asbestos factory once graced their property.]  See also “The Prevention of Fires in Theaters,” 35 Scientific American 401 (Dec. 23, 1876); “Insulated Coverings for Pipes, Boilers, Etc.,” 59 Scientific American 355, 355 (Dec. 8, 1888).

Federal Rules Get a Makeover

December 2nd, 2011

Bellbottoms are out; cuffs are in.  Robert Frost is out; Philip Levine is in.

So too with the Federal Rules.

The Federal Rules of Evidence have been “restyled”; the new, restyled rules went into effect yesterday.

A PDF of the new rules is available at several places on the web, including the Federal Evidence Review website, which also has links to the legislative history and guiding principles for this restyling.  The Legal Information Institute (LII) at Cornell Law School helpfully has posted ebooks, as ePub or mobi files, of the restyled Federal Rules of Civil Procedure, Criminal Procedure, and Evidence.

The legislative history of the restyled Evidence Rules 101-1103 makes clear that the changes were designed to make the rules simpler, more readable, and more understandable, without changing their substantive meaning.  Was this effort worth the time and money?

The rules on expert witness opinion testimony are my particular interest.

Rule 703. Bases of an Expert’s Opinion Testimony

An expert may base an opinion on facts or data in the case that the expert has been made aware of or personally observed. If experts in the particular field would reasonably rely on those kinds of facts or data in forming an opinion on the subject, they need not be admissible for the opinion to be admitted. But if the facts or data would otherwise be inadmissible, the proponent of the opinion may disclose them to the jury only if their probative value in helping the jury evaluate the opinion substantially outweighs their prejudicial effect.

(Legislative History: Pub. L. 93-595, Jan. 2, 1975; Mar. 2, 1987, eff. Oct. 1, 1987; Apr. 17, 2000, eff. Dec. 1, 2000; Apr. 26, 2011, eff. Dec. 1, 2011.)

The rule specifies what happens “[i]f experts in the particular field would reasonably rely on those kinds of facts or data in forming an opinion on the subject,” but what happens “if not”?  The common reading interpolates “only” before “if,” but Rule 703 before and after restyling misses this drafting point.

So too does Rule 702:

Rule 702. Testimony by Expert Witnesses

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if:

(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;

(b) the testimony is based on sufficient facts or data;

(c) the testimony is the product of reliable principles and methods; and

(d) the expert has reliably applied the principles and methods to the facts of the case.

(Legislative History: Pub. L. 93-595, Jan. 2, 1975; Apr. 17, 2000, eff. Dec. 1, 2000; Apr. 26, 2011, eff. Dec. 1, 2011.)

And if not?

The enumeration of (a) through (d) in Rule 702, however, is an improvement for reading and comprehension, especially with the conjunction connecting the last member of the series.

I suppose at age 36, everyone is entitled to a makeover.

Epidemiology, Risk, and Causation – Report of Workshops

November 15th, 2011

This month’s issue of Preventive Medicine includes a series of papers arising from last year’s workshops on “Epidemiology, Risk, and Causation,” at Cambridge University. The workshops were organized by philosopher Alex Broadbent, a member of the University’s Department of History and Philosophy of Science.  The workshops were financially sponsored by the Foundation for Genomics and Population Health (PHG), a not-for-profit British organization.

Broadbent’s workshops were intended for philosophers of science, statisticians, and epidemiologists, but lawyers involved in health-effects litigation will find the papers of interest as well.  The themes of the workshops included:

  • the nature of epidemiologic causation,
  • the competing claims of observational and experimental research for establishing causation,
  • the role of explanation and prediction in assessing causality,
  • the role of moral values in causal judgments, and
  • the role of statistical and epistemic uncertainty in causal judgments

See Alex Broadbent, ed., “Special Section: Epidemiology, Risk, and Causation,” 53 Preventive Medicine 213-356 (October-November 2011).  Preventive Medicine is published by Elsevier Inc., so you know that the articles are not free.  Still you may want to read these at your local library to determine what may be useful in challenging and defending causal judgments in the courtroom.  One of the interlocutors, Sander Greenland, is of particular interest because he shows up as an expert witness with some regularity.

Here are the individual papers published in this special issue:

Alfredo Morabia, Michael C. Costanza, Philosophy and epidemiology

Alex Broadbent, Conceptual and methodological issues in epidemiology: An overview

Alfredo Morabia, Until the lab takes it away from epidemiology

Nancy Cartwright, Predicting what will happen when we act. What counts for warrant?

Sander Greenland, Null misinterpretation in statistical testing and its impact on health risk assessment

Daniel M. Hausman, How can irregular causal generalizations guide practice

Mark Parascandola, Causes, risks, and probabilities: Probabilistic concepts of causation in chronic disease epidemiology

John Worrall, Causality in medicine: Getting back to the Hill top

Olaf M. Dekkers, On causation in therapeutic research: Observational studies, randomised experiments and instrumental variable analysis

Alexander Bird, The epistemological function of Hill’s criteria

Michael Joffe, The gap between evidence discovery and actual causal relationships

Stephen John, Why the prevention paradox is a paradox, and why we should solve it: A philosophical view

Jonathan Wolff, How should governments respond to the social determinants of health?

Alex Broadbent, What could possibly go wrong? — A heuristic for predicting population health outcomes of interventions

The Treatment of Meta-Analysis in the Third Edition of the Reference Manual on Scientific Evidence

November 14th, 2011

Meta-analysis is a statistical procedure for aggregating data and statistics from individual studies into a single summary statistical estimate of the population measurement of interest.  The first meta-analysis is typically attributed to Karl Pearson, circa 1904, who sought a method to overcome the limitations of small sample size and low statistical power.  Statistical methods for meta-analysis, however, did not mature until the 1970s.  Even then, the biomedical scientific community remained skeptical of, if not outright hostile to, meta-analysis until relatively recently.

The hostility to meta-analysis, especially in the context of observational epidemiologic studies, was colorfully expressed by Samuel Shapiro and Alvan Feinstein, as late as the 1990s:

“Meta-analysis begins with scientific studies….  [D]ata from these studies are then run through computer models of bewildering complexity which produce results of implausible precision.”

* * * *

“I propose that the meta-analysis of published non-experimental data should be abandoned.”

Samuel Shapiro, “Meta-analysis/Smeta-analysis,” 140 Am. J. Epidem. 771, 777 (1994).  See also Alvan Feinstein, “Meta-Analysis: Statistical Alchemy for the 21st Century,” 48 J. Clin. Epidem. 71 (1995).

The professional skepticism about meta-analysis was reflected in some of the early judicial assessments of meta-analysis in court cases.  In the 1980s and 1990s, some trial judges erroneously dismissed meta-analysis as a flawed statistical procedure that claimed to make something out of nothing. Allen v. Int’l Bus. Mach. Corp., No. 94-264-LON, 1997 U.S. Dist. LEXIS 8016, at *71–*74 (suggesting that meta-analysis of observational studies was controversial among epidemiologists).

In In re Paoli Railroad Yard PCB Litigation, Judge Robert Kelly excluded plaintiffs’ expert witness Dr. William Nicholson and his testimony based upon his unpublished meta-analysis of health outcomes among PCB-exposed workers.  Judge Kelly found that the meta-analysis was a novel technique, and that Nicholson’s meta-analysis was not peer reviewed.  Furthermore, the meta-analysis assessed health outcomes not experienced by any of the plaintiffs before the trial court.  706 F. Supp. 358, 373 (E.D. Pa. 1988).

The Court of Appeals for the Third Circuit reversed the exclusion of Dr. Nicholson’s testimony, and remanded for reconsideration with instructions.  In re Paoli R.R. Yard PCB Litig., 916 F.2d 829, 856-57 (3d Cir. 1990), cert. denied, 499 U.S. 961 (1991); Hines v. Consol. Rail Corp., 926 F.2d 262, 273 (3d Cir. 1991).  The Circuit noted that meta-analysis was not novel, and that the lack of peer-review was not an automatic disqualification.  Acknowledging that a meta-analysis could be performed poorly using invalid methods, the appellate court directed the trial court to evaluate the validity of Dr. Nicholson’s work on his meta-analysis.

In one of many skirmishes over colorectal cancer claims in asbestos litigation, Judge Sweet in the Southern District of New York was unimpressed by efforts to aggregate data across studies.  Judge Sweet declared that “no matter how many studies yield a positive but statistically insignificant SMR for colorectal cancer, the results remain statistically insignificant. Just as adding a series of zeros together yields yet another zero as the product, adding a series of positive but statistically insignificant SMRs together does not produce a statistically significant pattern.”  In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1042 (S.D.N.Y. 1993).  The plaintiffs’ expert witness who had offered the unreliable testimony, Dr. Steven Markowitz, like Nicholson, another foot soldier in Dr. Irving Selikoff’s litigation machine, did not offer a formal meta-analysis to justify his assessment that multiple non-significant studies, taken together, rule out chance as a likely explanation for an aggregate finding of an increased risk.

Judge Sweet was quite justified in rejecting this back-of-the-envelope, non-quantitative meta-analysis.  His suggestion, however, that multiple non-significant studies could never collectively serve to rule out chance as an explanation for an overall increased rate of disease in the exposed groups is wrong.  Judge Sweet would have done better to focus on the validity issues in key studies, the presence of bias and confounding, and the completeness of the proffered meta-analysis.  The Second Circuit reversed the entry of summary judgment, and remanded the colorectal cancer claim for trial.  52 F.3d 1124 (2d Cir. 1995).  Over a decade later, with even more accumulated studies and data, the Institute of Medicine found the evidence for asbestos plaintiffs’ colorectal cancer claims to be scientifically insufficient.  Institute of Medicine, Asbestos: Selected Cancers (Wash. D.C. 2006).

Courts continue to go astray with an erroneous belief that multiple studies, all without statistically significant results, cannot yield a statistically significant summary estimate of increased risk.  See, e.g., Baker v. Chevron USA, Inc., 2010 WL 99272, *14-15 (S.D.Ohio 2010) (addressing a meta-analysis by Dr. Infante on multiple myeloma outcomes in studies of benzene-exposed workers).  There were many sound objections to Infante’s meta-analysis, but the suggestion that multiple studies without statistical significance could not yield a summary estimate of risk with statistical significance was not one of them.
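
To see why the “adding zeros” intuition fails, consider a minimal fixed-effect, inverse-variance meta-analysis, sketched below in Python with invented numbers (not drawn from any study or case discussed here).  Each study, standing alone, is statistically non-significant, yet the pooled estimate is not:

```python
import math

# (risk ratio, standard error of ln(RR)) for three hypothetical studies
studies = [(1.4, 0.22), (1.3, 0.20), (1.5, 0.25)]

weights = [1 / se ** 2 for _, se in studies]   # inverse-variance weights
pooled_log = sum(w * math.log(rr) for (rr, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

for rr, se in studies:
    lo, hi = math.exp(math.log(rr) - 1.96 * se), math.exp(math.log(rr) + 1.96 * se)
    print(f"study:  RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")   # each interval includes 1.0

lo, hi = math.exp(pooled_log - 1.96 * pooled_se), math.exp(pooled_log + 1.96 * pooled_se)
print(f"pooled: RR {math.exp(pooled_log):.2f} (95% CI {lo:.2f}-{hi:.2f})")
# The pooled interval excludes 1.0 even though no individual study's does:
# combining studies reduces random error, which is the point of meta-analysis.
```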

In the last two decades, meta-analysis has emerged as an important technique for addressing random variation in studies, as well as some of the limitations of frequentist statistical methods.  In the 1980s, articles reporting meta-analyses were rare to non-existent.  In 2009, there were over 2,300 articles with “meta-analysis” in their title, or in their keywords, indexed in the PubMed database of the National Library of Medicine.  See Michael O. Finkelstein and Bruce Levin, “Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases” (2011) (forthcoming in Jurimetrics).

The techniques for aggregating data have been studied, refined, and employed extensively in thousands of methods and application papers in the last decade. Consensus guideline papers have been published for meta-analyses of clinical trials as well as observational studies.  See Donna Stroup, et al., “Meta-analysis of Observational Studies in Epidemiology: A Proposal for Reporting,” 283 J. Am. Med. Ass’n 2008 (2000) (MOOSE statement); David Moher, Deborah Cook, Susan Eastwood, Ingram Olkin, Drummond Rennie, and Donna Stroup, “Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement,” 354 Lancet 1896 (1999).  See also Jesse Berlin & Carin Kim, “The Use of Meta-Analysis in Pharmacoepidemiology,” in Brian Strom, ed., Pharmacoepidemiology 681, 683–84 (4th ed. 2005); Zachary Gerbarg & Ralph Horwitz, “Resolving Conflicting Clinical Trials: Guidelines for Meta-Analysis,” 41 J. Clin. Epidemiol. 503 (1988).

Meta-analyses, of observational studies and of randomized clinical trials, routinely are relied upon by expert witnesses in pharmaceutical and so-called toxic tort litigation. Id. See also In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1174, 1184 (N.D. Cal. 2007) (holding that reliance upon “[a] meta-analysis of all available published and unpublished randomized clinical trials” was reasonable and appropriate, and criticizing the expert witnesses who urged the complete rejection of meta-analysis of observational studies).

The second edition of the Reference Manual on Scientific Evidence gave very little attention to meta-analysis.  With this historical backdrop, it is interesting to see what the new third edition provides for guidance to the federal judiciary on this important topic.

STATISTICS CHAPTER

The statistics chapter of the third edition continues to give scant attention to meta-analysis.  The chapter notes, in a footnote, that there are formal procedures for aggregating data across studies, and that the power of the aggregated data will exceed the power of the individual, included studies.  The footnote then cautions that meta-analytic procedures “have their own weakness,” without detailing what that one weakness is.  RMSE 3d at 254 n. 107.

The glossary at the end of the statistics chapter offers a definition of meta-analysis:

“meta-analysis. Attempts to combine information from all studies on a certain topic. For example, in the epidemiological context, a meta-analysis may attempt to provide a summary odds ratio and confidence interval for the effect of a certain exposure on a certain disease.”

Id. at 289.

This definition is inaccurate in ways that could yield serious mischief.  Virtually all meta-analyses are built upon a systematic review that sets out to collect all available studies on a research issue of interest.  It is a rare meta-analysis, however, that includes “all” studies in its quantitative analysis.  The meta-analytic process involves a pre-specification of inclusionary and exclusionary criteria for the quantitative analysis of the summary estimate of risk.  Those criteria may limit the quantitative analysis to randomized trials, or to analytical epidemiologic studies.  Furthermore, meta-analyses frequently and appropriately have pre-specified exclusionary criteria that relate to study design or quality.

On a more technical note, the offered definition suggests that the summary estimate of risk will be an odds ratio, which may or may not be true.  Meta-analyses of risk ratios may yield summary estimates of risk in terms of relative risks or hazard ratios, or even of risk differences.  The meta-analysis may combine data on means rather than proportions as well.

EPIDEMIOLOGY CHAPTER

The chapter on epidemiology delves into meta-analysis in greater detail than the statistics chapter, and offers apparently inconsistent advice.  The overall gist of the chapter, however, can perhaps best be summarized by the definition offered in this chapter’s glossary:

“meta-analysis. A technique used to combine the results of several studies to enhance the precision of the estimate of the effect size and reduce the plausibility that the association found is due to random sampling error.  Meta-analysis is best suited to pooling results from randomly controlled experimental studies, but if carefully performed, it also may be useful for observational studies.”

Reference Guide on Epidemiology, RMSE 3d at 624.  See also id. at 581 n. 89 (“Meta-analysis is better suited to combining results from randomly controlled experimental studies, but if carefully performed it may also be helpful for observational studies, such as those in the epidemiologic field.”).  The epidemiology chapter appropriately notes that meta-analysis can help address concerns over random error in small studies.  Id. at 579; see also id. at 607 n. 171.

Having told us that properly conducted meta-analyses of observational studies can be helpful, the chapter hedges considerably:

“Meta-analysis is most appropriate when used in pooling randomized experimental trials, because the studies included in the meta-analysis share the most significant methodological characteristics, in particular, use of randomized assignment of subjects to different exposure groups. However, often one is confronted with nonrandomized observational studies of the effects of possible toxic substances or agents. A method for summarizing such studies is greatly needed, but when meta-analysis is applied to observational studies – either case-control or cohort – it becomes more controversial.174 The reason for this is that often methodological differences among studies are much more pronounced than they are in randomized trials. Hence, the justification for pooling the results and deriving a single estimate of risk, for example, is problematic.175

Id. at 607.  The stated objection to pooling results for observational studies is certainly correct, but many research topics have sufficient studies available to allow for appropriate selectivity in framing inclusionary and exclusionary criteria to address the objection.  The chapter goes on to credit the critics of meta-analyses of observational studies.  As they did in the second edition of the RMSE, the authors repeat their cites to, and quotes from, early papers by John Bailar, who was then critical of such meta-analyses:

“Much has been written about meta-analysis recently and some experts consider the problems of meta-analysis to outweigh the benefits at the present time. For example, John Bailar has observed:

‘[P]roblems have been so frequent and so deep, and overstatements of the strength of conclusions so extreme, that one might well conclude there is something seriously and fundamentally wrong with the method. For the present . . . I still prefer the thoughtful, old-fashioned review of the literature by a knowledgeable expert who explains and defends the judgments that are presented. We have not yet reached a stage where these judgments can be passed on, even in part, to a formalized process such as meta-analysis.’

John Bailar, “Assessing Assessments,” 277 Science 528, 529 (1997).”

Id. at 607 n.177.  Bailar’s subjective preference for “old-fashioned” reviews, which often cherry-picked the included studies, is, well, “old fashioned.”  More to the point, it is questionable science, and a distinctly minority viewpoint in the light of substantial improvements in the conduct and reporting of meta-analyses of observational studies.  Bailar may be correct that some meta-analyses should have never left the protocol stage, but the RMSE 3d fails to provide the judiciary with the tools to appreciate the distinction between good and bad meta-analyses.

This categorical rejection, cited with apparent approval, is amplified by a recitation of some real or apparent problems with meta-analyses of observational studies.  What is missing is a discussion of how many of these problems can be and are dealt with in contemporary practice:

“A number of problems and issues arise in meta-analysis. Should only published papers be included in the meta-analysis, or should any available studies be used, even if they have not been peer reviewed? Can the results of the meta-analysis itself be reproduced by other analysts? When there are several meta-analyses of a given relationship, why do the results of different meta-analyses often disagree? The appeal of a meta-analysis is that it generates a single estimate of risk (along with an associated confidence interval), but this strength can also be a weakness, and may lead to a false sense of security regarding the certainty of the estimate. A key issue is the matter of heterogeneity of results among the studies being summarized.  If there is more variance among study results than one would expect by chance, this creates further uncertainty about the summary measure from the meta-analysis. Such differences can arise from variations in study quality, or in study populations or in study designs. Such differences in results make it harder to trust a single estimate of effect; the reasons for such differences need at least to be acknowledged and, if possible, explained.176 People often tend to have an inordinate belief in the validity of the findings when a single number is attached to them, and many of the difficulties that may arise in conducting a meta-analysis, especially of observational studies such as epidemiologic ones, may consequently be overlooked.177

Id. at 608.  The authors are entitled to their opinion, but their discussion leaves the judiciary uninformed about current practice, and best practices, in epidemiology.  A categorical rejection of meta-analyses of observational studies is at odds with the chapter’s own claim that such meta-analyses can be helpful if properly performed.  What was needed, and is missing, is a meaningful discussion to help the judiciary determine whether a meta-analysis of observational studies was properly performed.
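
The heterogeneity concern the chapter raises is not merely rhetorical; it can be quantified.  Here is a minimal sketch, again with invented numbers, of Cochran’s Q and the I² statistic that meta-analysts routinely report:

```python
import math

# (risk ratio, standard error of ln(RR)) for three hypothetical studies
# chosen to disagree with one another more than chance would explain.
studies = [(1.4, 0.22), (0.8, 0.20), (2.1, 0.25)]

weights = [1 / se ** 2 for _, se in studies]
pooled_log = sum(w * math.log(rr) for (rr, _), w in zip(studies, weights)) / sum(weights)

q = sum(w * (math.log(rr) - pooled_log) ** 2 for (rr, _), w in zip(studies, weights))
df = len(studies) - 1
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Cochran's Q = {q:.1f} on {df} degrees of freedom; I-squared = {i_squared:.0f}%")
# A large I-squared signals that the studies disagree more than random
# sampling error would explain, and that a single pooled number may mislead.
```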

MEDICAL TESTIMONY CHAPTER

The chapter on medical testimony is the third pass at meta-analysis in RMSE 3d.   The second edition’s chapter on medical testimony ignored meta-analysis completely; the new edition addresses meta-analysis in the context of the hierarchy of study designs:

“Other circumstances that set the stage for an intense focus on medical evidence included

(1) the development of medical research, including randomized controlled trials and other observational study designs;

(2) the growth of diagnostic and therapeutic interventions;141

(3) interest in understanding medical decision making and how physicians reason;142 and

(4) the acceptance of meta-analysis as a method to combine data from multiple randomized trials.143

RMSE 3d at 722-23.

The chapter curiously omits observational studies, but the footnote reference (note 143) then inconsistently cites two cases involving meta-analyses of observational, rather than experimental, studies:

“143. Video Software Dealers Ass’n v. Schwarzenegger, 556 F.3d 950, 963 (9th Cir. 2009) (analyzing a meta-analysis of studies on video games and adolescent behavior); Kennecott Greens Creek Min. Co. v. Mine Safety & Health Admin., 476 F.3d 946, 953 (D.C. Cir. 2007) (reviewing the Mine Safety and Health Administration’s reliance on epidemiological studies and two meta-analyses).”

Id. at 723 n.143.

The medical testimony chapter then adds to the confusion by giving a more detailed listing of the hierarchy of medical evidence in the form of different study designs:

3. Hierarchy of medical evidence

With the explosion of available medical evidence, increased emphasis has been placed on assembling, evaluating, and interpreting medical research evidence.  A fundamental principle of evidence-based medicine (see also Section IV.C.5, infra) is that the strength of medical evidence supporting a therapy or strategy is hierarchical.  When ordered from strongest to weakest, systematic review of randomized trials (meta-analysis) is at the top, followed by single randomized trials, systematic reviews of observational studies, single observational studies, physiological studies, and unsystematic clinical observations.150 An analysis of the frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized controlled trials in the medical evidence hierarchy.151 Although they are at the bottom of the evidence hierarchy, unsystematic clinical observations or case reports may be the first signals of adverse events or associations that are later confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless, subsequent studies may not confirm initial reports (e.g., the putative association between coffee consumption and pancreatic cancer).154

Id. at 723-24.  This discussion further muddies the water by using a parenthetical to suggest that meta-analyses of randomized clinical trials are equivalent to systematic reviews of such studies — “systematic review of randomized trials (meta-analysis).” Of course, systematic reviews are not meta-analyses, although they are a necessary precondition for conducting a meta-analysis.  The relationship between the procedures for a systematic review and a meta-analysis is in need of clarification, but the judiciary will not find it in the new Reference Manual.

Lording the Data – Scientific Fraud

November 10th, 2011

Last week, the New York Times published a news story about psychologist Diederik Stapel, of the Netherlands.  Tilburg University accused him of having committed research fraud in several dozen published papers, including one in Science, the official journal of the AAAS.  See Benedict Carey, “Fraud Case Seen as a Red Flag for Psychology Research: Noted Dutch Psychologist, Stapel, Accused of Research Fraud,” New York Times (Nov. 2, 2011).  The Times expressed surprise over the suggestion that psychology is plagued by fraud and sloppy research.  The real surprise is that there are not more stories in the lay media about the poor quality of scientific research.  The readers of Retraction Watch, and of the Office of Research Integrity’s blog, will recognize how commonplace Stapel’s kind of fraud is.

Stapel’s fraud has wide-ranging implications for the doctoral students, whose dissertations he supervised, and for colleagues, with whom he collaborated.  Stapel apologized and expressed his regret, but his conduct leaves a large body of his work, and that of others, under a cloud of suspicion.

Lording the Data

The University committee reported that Stapel had escaped detection for a long time because he was “lord of the data,” refusing to disclose or share it.

“Outright fraud may be rare, these experts say, but they contend that Dr. Stapel took advantage of a system that allows researchers to operate in near secrecy and massage data to find what they want to find, without much fear of being challenged.”

Benedict Carey, “Fraud Case,” New York Times (Nov. 2, 2011).  Data sharing is preached but rarely practiced.

In a recent publication, Dr. Wicherts and his colleagues at the University of Amsterdam reported that two-thirds of their sample of Dutch research psychologists refused to share their data, in contravention of the established ethical rules of the discipline. Remarkably, many of the refuseniks had explicit contractual obligations with their publishing journals to provide data.  Jelte Wicherts, Marjan Bakker, Dylan Molenaar, “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results,” PLoS ONE 6(11): e26828 (Nov. 2, 2011).

Scientific fraud seems no more common among scientists with industry ties, which are so often the subject of ad hominem conflict-of-interest claims, than among scientists without such ties.  Instead, fraudfeasors such as Stapel or Hwang Woo-suk are more often simply egotistical, narcissistic, self-aggrandizing, self-promoting, or delusional.  In the United States, litigation occasionally has brought out charlatans, but it has also resulted in high-quality studies that have provided strong evidence for or against litigation claims.  Compare Hon. Jack B. Weinstein, “Preliminary Reflections on Administration of Complex Litigation” 2009 Cardozo L. Rev. de novo 1, 14 (2009) (describing plaintiffs’ expert witnesses in silicone litigation as “charlatans” and the litigation as largely based upon fraud) with Committee on the Safety of Silicone Breast Implants, Institute of Medicine, Safety of Silicone Breast Implants (Wash. D.C. 1999) (reviewing studies, many of which were commissioned by litigation defendants, and which collectively showed a lack of association between silicone and autoimmune diseases).

The relation between litigation and research is one that has typically been approached by self-righteous voices, such as David Michaels and David Egilman, and others who have their own deep conflicts of interest.  What is clear is that all litigants, as well as the public, would benefit from enforcing data-sharing requirements.  See “Litigation and Research” (April 15, 2007) (science should not be built upon blind trust of scientists: “Nullius in verba.”).

The Times article emphasized Wicherts’ research about lack of data sharing, and suggested that data sharing could improve the quality of scientific publications.  The time may have come, however, for sterner measures of civil and criminal penalties for scientists who abuse and waste governmental funding, or who aid and abet fraudulent litigation.

New-Age Levellers – Flattening Hierarchy of Evidence

October 30th, 2011

The Levelers were political dissidents in England, in the middle of the 17th century.  Among their causes, Levelers advanced popular sovereignty, equal protection of the law, and religious tolerance.

The political agenda of the Levellers sounds quite noble to 21st-century Americans, but their ideals have no place in the world of science:  not all opinions or scientific studies are created equal; not all opinions are worthy of being taken seriously in scientific discourse or in courtroom presentations of science; and not all opinions should be tolerated, especially when they claim causal conclusions based upon shoddy or inadequate evidence.

In some litigations, legal counsel set out to obscure the important quantitative and qualitative distinctions among scientific studies.  Sometimes, lawyers find cooperative expert witnesses, willing to engage in hand waving about “the weight of the evidence,” where the weights are assigned post hoc, in a highly biased fashion.  No study (that favors the claim) left behind.  This is not science, and it is not how science operates, even though some expert witnesses, such as Professor Cranor in the Milward case, have been able to pass off their views as representative of scientific practice.

A sound appreciation of how scientists evaluate studies, and of why not all studies are equal, is essential to any educated evaluation of scientific controversies.  Litigants who face high-quality studies with results inconsistent with their litigation claims may well resort to “leveling” of studies.  This leveling may be advanced out of ignorance, but more likely it is an attempt to snooker courts into treating exploratory, preliminary, and hypothesis-generating studies as equal in value to, or even greater than, hypothesis-testing studies.

Some of the leveling tactics that have become commonplace in litigation include asserting that:

  • All expert witnesses are the same;
  • All expert witnesses conduct the same analysis;
  • All expert witnesses read articles, interpret them, and offer opinions;
  • All expert witnesses are inherently biased;
  • All expert witnesses select the articles to read and interpret in line with their biases;
  • All epidemiologic studies are the same;
  • All studies are flawed; and
  • All opinions are, in the final analysis, subjective.

This leveling strategy can be seen in Professor Margaret Berger’s introduction to the Reference Manual on Scientific Evidence (RMSE 3d), where she supported an ill-defined “weight-of-the-evidence” approach to causal judgments.  See “Late Professor Berger’s Introduction to the Reference Manual on Scientific Evidence” (Oct. 23, 2011).

Other chapters in the RMSE 3d are at odds with Berger’s introduction.  The epidemiology chapter does not explicitly address the hierarchy of studies, but it does describe cross-sectional, ecological, and secular trend studies as less able to support causal conclusions.  Cross-sectional studies are described as “rarely useful in identifying toxic agents,” RMSE 3d at 556, and as “used infrequently when the exposure of interest is an environmental toxic agent,” RMSE 3d at 561.  Cross-sectional studies are described as hypothesis-generating as opposed to hypothesis-testing, although not in those specific terms.  Id. (describing cross-sectional studies as providing valuable leads for future research).  Ecological studies are described as useful for identifying associations, but not helpful in determining whether such associations are causal; and ecological studies are identified as a fertile source of error in the form of the “ecological fallacy.”  Id. at 561–62.

The epidemiology chapter perhaps weakens its helpful description of the limited role of ecological studies by citing, with apparent approval, a district court that blinked at its gatekeeping responsibility to ensure that testifying expert witnesses did, in fact, rely upon “sufficient facts or data,” as well as upon studies that are “of a type reasonably relied upon by experts in the particular field in forming opinions or inferences upon the subject.” Rule 703. RMSE 3d at 561 n.34 (citing Cook v. Rockwell International Corp., 580 F. Supp. 2d 1071, 1095–96 (D. Colo. 2006), where the district court acknowledged the severe limitations of ecological studies in supporting causal inferences, but opined that the limitations went to the weight of the study). Of course, the insubstantial weight of an ecological study is precisely what may result in the study’s failure to support a causal claim.

The ray of clarity in the epidemiology chapter about the hierarchical nature of studies is muddled by an attempt to level epidemiology and toxicology.  The chapter suggests that there is no hierarchy of disciplines (as opposed to studies within a discipline).  RMSE 3d at 564 & n.48 (citing and quoting a symposium paper for the proposition that “[t]here should be no hierarchy [among different types of scientific methods to determine cancer causation]. Epidemiology, animal, tissue culture and molecular pathology should be seen as integrating evidences in the determination of human carcinogenicity.” Michele Carbone et al., “Modern Criteria to Establish Human Cancer Etiology,” 64 Cancer Res. 5518, 5522 (2004)).  Carbone, of course, is best known for his advocacy of a viral cause (SV40) of human mesothelioma, a claim unsupported, and indeed contradicted, by epidemiologic studies.  His statement does not support the chapter’s leveling of epidemiology and toxicology, and Carbone is, in any event, an unlikely source to cite.

The epidemiology chapter undermines its own description of the role of study design in evaluating causality by pejoratively asserting that most epidemiologic studies are “flawed”:

“It is important to emphasize that all studies have ‘flaws’ in the sense of limitations that add uncertainty about the proper interpretation of the results.9 Some flaws are inevitable given the limits of technology, resources, the ability and willingness of persons to participate in a study, and ethical constraints. In evaluating epidemiologic evidence, the key questions, then, are the extent to which a study’s limitations compromise its findings and permit inferences about causation.”

RMSE 3d at 553.  This statement is actually a significant improvement over the second edition, where the authors of the epidemiology chapter asserted, without qualification:

“It is important to emphasize that most studies have flaws.”

RMSE 2d 337.  The “flaws” language from the earlier chapter was used on occasion by courts that were set on ignoring competing interpretations of epidemiologic studies.  Since all or most studies are flawed, why bother figuring out what is valid and reliable?  Just let the jury sort it out.  This is not an aid to gatekeeping, but rather a prescription for allowing the gatekeeper to call in sick.

The current epidemiology chapter essentially backtracks from the harsh connotations of its use of the term “flaws” by equating the term with “limitations.”  Flaws and limitations, however, are quite different from one another.  What is left out of the third edition’s description is the sense that some studies are so flawed that they must be disregarded altogether.  There may also be limitations in studies, especially observational studies, which is why the party with the burden of proof should generally not be allowed to proceed with only one or two epidemiologic studies.  Rule 702, after all, requires that an expert opinion be based upon “sufficient facts or data.”

The RMSE 3d chapter on medical evidence is a refreshing break from the leveling approach seen elsewhere.  Here, at least, the chapter authors devote several pages to explaining the role of study design in assessing an etiological issue:

3. Hierarchy of medical evidence

With the explosion of available medical evidence, increased emphasis has been placed on assembling, evaluating, and interpreting medical research evidence.  A fundamental principle of evidence-based medicine (see also Section IV.C.5, infra) is that the strength of medical evidence supporting a therapy or strategy is hierarchical.

When ordered from strongest to weakest, systematic review of randomized trials (meta-analysis) is at the top, followed by single randomized trials, systematic reviews of observational studies, single observational studies, physiological studies, and unsystematic clinical observations.150 An analysis of the frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized controlled trials in the medical evidence hierarchy.151 Although they are at the bottom of the evidence hierarchy, unsystematic clinical observations or case reports may be the first signals of adverse events or associations that are later confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless, subsequent studies may not confirm initial reports (e.g., the putative association between coffee consumption and pancreatic cancer).154

John B. Wong, Lawrence O. Gostin, and Oscar A. Cabrera, “Reference Guide on Medical Testimony,” RMSE 3d 687, 723–24 (2011).  The third edition’s chapter is a significant improvement over the second edition’s chapter on medical testimony, which did not mention the hierarchy of evidence.  Mary Sue Henifin, Howard M. Kipen, and Susan R. Poulter, “Reference Guide on Medical Testimony,” RMSE 2d 440 (2000).  Indeed, the only time the word “hierarchy” appeared in the entire second edition was in connection with the hierarchy of the federal judiciary.
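
The ordering quoted above can be made concrete with a simple ranking.  The short Python sketch below is only an illustration and is not drawn from the Manual; the study names and the sort_by_strength helper are hypothetical, and the ranking merely restates the chapter’s list from strongest to weakest design:

    # Illustrative only: ranks hypothetical studies by the design hierarchy
    # quoted above from the RMSE 3d medical testimony chapter (strongest first).

    EVIDENCE_HIERARCHY = [
        "systematic review of randomized trials (meta-analysis)",
        "single randomized trial",
        "systematic review of observational studies",
        "single observational study",
        "physiological study",
        "unsystematic clinical observation / case report",
    ]

    RANK = {design: i for i, design in enumerate(EVIDENCE_HIERARCHY)}

    def sort_by_strength(studies):
        """Order (citation, design) pairs from strongest to weakest design."""
        return sorted(studies, key=lambda study: RANK[study[1]])

    if __name__ == "__main__":
        proffered = [
            ("Case report A", "unsystematic clinical observation / case report"),
            ("Cohort study B", "single observational study"),
            ("Meta-analysis C", "systematic review of randomized trials (meta-analysis)"),
        ]
        for citation, design in sort_by_strength(proffered):
            print(citation, "-", design)

A court or counsel confronting a pile of proffered studies could, at a minimum, ask where each study falls in such an ordering before weighing its results.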

The tension, contradictions, and differing emphases among the various chapters of the RMSE 3d point to an important “flaw” in the new edition.  The chapters appear to have been written largely in isolation, and without much regard for what the other chapters contain.  The chapters overlap, and indeed contradict one another on key points.  Witness Berger’s rejection of the hierarchy of evidence, the epidemiology chapter’s inconsistent presentation of the concept without mentioning it by name, and the medical testimony chapter’s embrace and explicit presentation of the hierarchical nature of medical study evidence.  Fortunately, the laissez-faire editorial approach allowed the disagreement to remain, without censoring any position, but the federal judiciary is not aided by the contradiction and tension among the approaches.

Given the importance of the concept, even the medical testimony chapter in RMSE 3d may seem too little, too late to be helpful to the judiciary.  There are book-length treatments of systematic reviews and “evidence-based medicine”; the three pages in Wong’s chapter barely scratch the surface of this important topic of how evidence is categorized, evaluated, and synthesized in making judgments of causality.

There are many textbooks and articles available to judges and lawyers on how to assess medical studies.  Recently, John Cherrie posted on his blog, OH-world, about a series of 17 articles in the journal Aerzteblatt International on the proper evaluation of medical and epidemiologic studies.

These papers, overall, make the point that not all studies are equal, and that not all evidentiary displays are adequate to support conclusions of causal association.  The papers are available without charge from the journal’s website:

01. Critical Appraisal of Scientific Articles

02. Study Design in Medical Research

03. Types of Study in Medical Research

04. Confidence Interval or P-Value?

05. Requirements and Assessment of Laboratory Tests: Inpatient Admission Screening

06. Systematic Literature Reviews and Meta-Analyses

07. The Specification of Statistical Measures and Their Presentation in Tables and Graphs

08. Avoiding Bias in Observational Studies

09. Interpreting Results in 2×2 Tables

10. Judging a Plethora of p-Values: How to Contend With the Problem of Multiple Testing

11. Data Analysis of Epidemiological Studies

12. Choosing statistical tests

13. Sample size calculation in clinical trials

14. Linear regression analysis

15. Survival analysis

16. Concordance analysis

17. Randomized controlled trials

This year, the Journal of Clinical Epidemiology began publishing a series of papers, known by the acronym GRADE, which aim to provide guidance on how studies are categorized and assessed for their evidential quality in supporting treatments and interventions.  The GRADE project is led by Gordon Guyatt, who is known for having coined the term “evidence-based medicine,” and who has written widely on the subject.  Guyatt, along with colleagues including Peter Tugwell (who was one of the court-appointed expert witnesses in MDL 926), has described the GRADE project:

“The ‘Grades of Recommendation, Assessment, Development, and Evaluation’ (GRADE) approach provides guidance for rating quality of evidence and grading strength of recommendations in health care. It has important implications for those summarizing evidence for systematic reviews, health technology assessment, and clinical practice guidelines. GRADE provides a systematic and transparent framework for clarifying questions, determining the outcomes of interest, summarizing the evidence that addresses a question, and moving from the evidence to a recommendation or decision. Wide dissemination and use of the GRADE approach, with endorsement from more than 50 organizations worldwide, many highly influential (http://www.gradeworkinggroup.org/), attests to the importance of this work. This article introduces a 20-part series providing guidance for the use of GRADE methodology that will appear in the Journal of Clinical Epidemiology.”

Gordon Guyatt, Andrew D. Oxman, Holger Schünemann, Peter Tugwell, Andre Knottnerus, “GRADE guidelines – new series of articles in Journal of Clinical Epidemiology,” 64 J. Clin. Epidem. 380 (2011).  See also Gordon Guyatt, Andrew Oxman, et al., for the GRADE Working Group, “Rating quality of evidence and strength of recommendations GRADE: an emerging consensus on rating quality of evidence and strength of recommendations,” 336 Brit. Med. J. 924 (2008).  [pdf]

Of the 20 papers planned, nine have been published to date in the Journal of Clinical Epidemiology:

01 Intro – GRADE evidence profiles & summary of findings tables

02 Framing question & deciding on important outcomes

03 Rating quality of evidence

04 Rating quality of evidence – study limitations (risk of bias)

05 Rating the quality of evidence—publication bias

06 Rating quality of evidence – imprecision

07 Rating quality of evidence – inconsistency

08 Rating quality of evidence – indirectness

09 Rating up quality of evidence

The GRADE guidance papers focus on the efficacy of treatments and interventions, but in doing so they evaluate “effects,” and they are thus applicable to the etiologic issues of alleged harm that find their way into court.  The papers build upon other grading systems advanced previously by the Oxford Centre for Evidence-Based Medicine, the U.S. Preventive Services Task Force (supported by the Agency for Healthcare Research and Quality, AHRQ), the Cochrane Collaboration, as well as many individual professional organizations.
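
The general logic of the GRADE approach, as reflected in the guideline titles listed above, can be sketched schematically:  evidence from randomized trials starts at a high rating, observational evidence starts low, and the rating moves down for risk of bias, publication bias, imprecision, inconsistency, and indirectness, or up for the factors treated in the ninth guideline.  The following Python sketch is only an illustration under those assumptions; the numeric scoring and the grade_quality function are mine, not the GRADE Working Group’s algorithm:

    # Schematic sketch of the rating-up / rating-down logic described above.
    # The scoring and thresholds are illustrative assumptions.

    LEVELS = ["very low", "low", "moderate", "high"]

    def grade_quality(randomized, downgrades, upgrades=0):
        """Return a rough quality level for a body of evidence.

        randomized -- True if the evidence comes from randomized trials
        downgrades -- number of serious concerns (risk of bias, publication
                      bias, imprecision, inconsistency, indirectness)
        upgrades   -- number of rating-up factors (e.g., large magnitude of effect)
        """
        start = LEVELS.index("high") if randomized else LEVELS.index("low")
        score = max(0, min(len(LEVELS) - 1, start - downgrades + upgrades))
        return LEVELS[score]

    if __name__ == "__main__":
        # Observational evidence with one serious concern about imprecision:
        print(grade_quality(randomized=False, downgrades=1))  # "very low"
        # Randomized trials downgraded for risk of bias and inconsistency:
        print(grade_quality(randomized=True, downgrades=2))   # "low"

However the details are implemented, the point is the same one the guidance papers make:  the strength of a body of evidence is not a matter of counting studies, but of grading them.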

GRADE has had some success in harmonizing disparate grading systems, and in forging a consensus among organizations that had been using their own systems, such as the World Health Organization, the American College of Physicians, the American Thoracic Society, the Cochrane Collaboration, the American College of Chest Physicians, the British Medical Journal, and Kaiser Permanente.

There are many other important efforts to provide consensus support for improving the quality of the design, conduct, and reporting of published studies, as well as the interpretation of those studies once published.  Although the RMSE 3d does a good job of introducing its readers to the basics of study design, it could have done considerably more to help judges become discerning critics of scientific studies and of conclusions based upon individual or multiple studies.

New Reference Manual on Scientific Evidence Short Shrifts Rule 703

October 16th, 2011

In “RULE OF EVIDENCE 703 — Problem Child of Article VII” (Sept. 19, 2011), I wrote about how Federal Rule of Evidence 703 is generally ignored and misunderstood in current federal practice.  The Supreme Court, in deciding Daubert, shifted the focus to Rule 702 as the primary tool to deploy in admitting, as well as limiting and excluding, expert witness opinion testimony.  The Court’s decision, however, did not erase the need for an additional, independent rule to control the quality of the inadmissible materials upon which expert witnesses rely.  Indeed, Rule 702, as amended in 2000, incorporated much of the learning of the Daubert decision, and then some, but it does not address the starting place of any scientific opinion:  the data, the analyses (usually statistical) of the data, and the reasonableness of relying upon those data and analyses.  Instead, Rule 702 asks whether the proffered testimony is based upon:

  1. sufficient facts or data,
  2. the product of reliable principles and methods, and
  3. a reliable application of principles and methods to the facts of the case

Noticeably absent from Rule 702, in its current form, is any directive to determine whether the proffered expert witness opinion is based upon facts or data of the sort upon which experts in the pertinent field would reasonably rely.  Furthermore, Daubert did not address the wholesale importation and disclosure of untrustworthy hearsay opinions through Rule 703.  See Problem Child (discussing the courts’ failure to appreciate the structure of peer-reviewed articles, and the need to discount the discussion and introduction sections of such articles as often containing speculative opinions and comments).  See also Luciana B. Sollaci & Mauricio G. Pereira, “The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey,” 92 J. Med. Libr. Ass’n 364 (2004); Montori et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Br. Med. J. 1093, 1093 (2004) (advising readers on how to avoid being misled by published literature, and counseling readers to “Read only the Methods and Results sections; bypass the Discussion section.”) (emphasis added).

Given this background, it is disappointing but not surprising that the new Reference Manual on Scientific Evidence severely slights Rule 703.  Using either a word search in the PDF version or the index at the end of the book tells the story:  there are five references to Rule 703 in the entire RMSE!  The statistics chapter has an appropriate but fleeting reference:

“Or the study might rest on data of the type not reasonably relied on by statisticians or substantive experts and hence run afoul of Federal Rule of Evidence 703. Often, however, the battle over statistical evidence concerns weight or sufficiency rather than admissibility.”

RMSE 3d at 214. At least this chapter acknowledges, however briefly, the potential problem that Rule 703 poses for expert witnesses.  The chapter on survey research similarly discusses how the data collected in a survey may “run afoul” of Rule 703.  RMSE 3d at 361, 363-364.

The chapter on epidemiology takes a different approach by interpreting Rule 703 as a rule of admissibility of evidence:

“An epidemiologic study that is sufficiently rigorous to justify a conclusion that it is scientifically valid should be admissible,184 as it tends to make an issue in dispute more or less likely.”185

Id. at 610.  This view is mistaken.  Sufficient rigor in an epidemiologic study is certainly needed for reliance by an expert witness, but such rigor does not make the study itself admissible; the rigor simply permits the expert witness to rely upon a study that is typically several layers of inadmissible hearsay.  See Reference Manual on Scientific Evidence v3.0 – Disregarding Study Validity in Favor of the “Whole Gamish” (Oct. 14, 2011) (discussing the argument put forward by the epidemiology chapter for considering Rule 703 as an exception to the rule against hearsay).

While the treatment of Rule 703 in the epidemiology chapter is troubling, the introductory chapter on the admissibility of expert witness opinion testimony by the late Professor Margaret Berger really sets the tone and approach for the entire volume.  See Berger, “The Admissibility of Expert Testimony,” RMSE 3d 11 (2011).  Professor Berger never mentions Rule 703 at all!  Gone and forgotten.  The omission is not, however, an oversight.  Rule 703, with its requirement that each study relied upon qualify as “reasonably relied upon,” as measured by the practice of experts in the appropriate discipline, is the refutation of Berger’s argument that somehow a pile of weak, flawed studies, taken together, can yield a scientifically reliable conclusion.  See “Whole Gamish” (Oct. 14, 2011).

Rule 703 is not merely an invitation to trial judges; it is a requirement to look at the discrete studies relied upon to determine whether the building blocks are sound.  Only then can the methods and procedures of science begin to analyze the entire evidentiary display to yield reliable scientific opinions and conclusions.