A Rule of Completeness for Statistical Evidence

Witnesses swear to tell the “whole” truth, but lawyers are allowed to deal in half truths.  Given this qualification on lawyers’ obligation of truthfulness, the law prudently modifies the rules of admissibility for writings to permit an adverse party to insist that written statements not be yanked out of context.  Waiting days, if not weeks, in a trial to restore the context is an inadequate remedy for these “half truths.”  If a party introduces all or part of a writing or recorded statement, an adverse party may “require the introduction, at that time, of any other part — or any other writing or recorded statement — that in fairness ought to be considered at the same time.”  Fed. R. Evid. 106 (Remainder of or Related Writings or Recorded Statements).  See also Fed. R. Civ. P. 32(a)(6) (rule of completeness for depositions).

This “rule of completeness” has its roots in the common law, and in the tradition of narrative testimony.  The Advisory Committee note to Rule 106 comments that the rule is limited to “writings and recorded statements and does not apply to conversations.”  The Rule and the note ignore the possibility that the problematic incompleteness might take the form of mathematical or statistical evidence.

Confidence Intervals

Consider sampling estimates of means or proportions.  The Reference Manual on Scientific Evidence (2d ed. 2000) urges that:

“[w]henever possible, an estimate should be accompanied by its standard error.”

RMSE 2d ed. at 117-18.

The new third edition dilutes this clear prescription, but still conveys the basic message:

“What is the standard error? The confidence interval?

An estimate based on a sample is likely to be off the mark, at least by a small amount, because of random error. The standard error gives the likely magnitude of this random error, with smaller standard errors indicating better estimates.”

RMSE 3d ed. at 243.

The evidentiary point is that the standard error, or the confidence interval (C.I.), is an important component of the sample statistic, without which the sample estimate is virtually meaningless.  Just as a narrative statement should not be truncated, a statistical or numerical expression should not be unduly abridged.

Of course, the 95 percent confidence interval is the point estimate (for example, a risk ratio) plus or minus 1.96 standard errors.  By analogy to Rule 106, lawyers should insist that the confidence interval, or some similar expression of the size of the standard error, be provided at the time that the examiner asks about, or the witness gives, the sample estimate.  Any number of consensus position papers, as well as guidelines for authors of scientific papers, specify that risk ratios should be accompanied by confidence intervals.  Courts should heed those recommendations, and require parties to present the complete statistical idea – estimate and random error – at one time.
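To make the arithmetic concrete, here is a minimal sketch in Python of the usual large-sample calculation: the point estimate plus or minus 1.96 standard errors.  The sample proportion and sample size are hypothetical, chosen only for illustration.

```python
# A minimal sketch of the usual large-sample 95% confidence interval:
# point estimate plus or minus 1.96 standard errors.

def ci_95(estimate, standard_error):
    """Return the (lower, upper) bounds of an approximate 95% confidence interval."""
    margin = 1.96 * standard_error
    return estimate - margin, estimate + margin

# Hypothetical numbers, for illustration only: a sample proportion of 0.30
# estimated from n = 200 observations.
p_hat, n = 0.30, 200
se = (p_hat * (1 - p_hat) / n) ** 0.5          # standard error of a sample proportion
lower, upper = ci_95(p_hat, se)
print(f"estimate = {p_hat}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Reporting the interval (here roughly 0.24 to 0.36) alongside the estimate conveys the complete statistical idea: estimate and random error together.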

One disreputable lawyer trick is to present incomplete confidence intervals.  Plaintiffs’ counsel, for instance, may inquire into the upper bound of a confidence interval, and attempt to silence witnesses when they respond with both the lower and upper bounds.  “Just answer the question, and stop volunteering information not asked.”  Indeed, some unscrupulous lawyers have been known to cut off witnesses who try to give both bounds of the interval, on the claim that the witness was being “unresponsive.”  Judges who are impatient with technical statistical testimony may even admonish witnesses who are trying to make sure that they present the “whole truth.”  Here again, the completeness rule should protect the integrity of fact-finding by allowing, and requiring, that the full information be presented at once, in context.

Although I have seen courts permit the partial, incomplete presentation of statistical evidence, I have yet to see a court acknowledge the harm from failing to apply Rule 106 to quantitative, statistical evidence.  One court, however, did address the inherent error of permitting a party to emphasize the extreme values within a confidence interval as “consistent” with the data sample.  Marder v. G.D. Searle & Co., 630 F. Supp. 1087 (D. Md. 1986), aff’d mem. on other grounds sub nom. Wheelahan v. G.D. Searle & Co., 814 F.2d 655 (4th Cir. 1987) (per curiam).

In Marder, the plaintiff claimed that she developed pelvic inflammatory disease from an IUD.  The jury deadlocked on causation, and the trial court granted the defendant’s motion for directed verdict, on the ground that the relative risk involved was less than two. Id. at 1092 (“In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof—a showing of causation by the preponderance of the evidence or, in other words, a probability of greater than 50%.”).
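The arithmetic behind the two-fold benchmark can be set out in a few lines.  On the usual (and much-debated) simplifying assumptions, the probability that the exposure caused a particular plaintiff’s disease is the attributable fraction, (RR − 1)/RR, which exceeds 50 percent only when the relative risk exceeds two.  A minimal sketch, using hypothetical relative risks:

```python
# A minimal sketch of the attributable-fraction arithmetic behind the
# "relative risk greater than two" benchmark. The relative risks are
# hypothetical, for illustration only.

def probability_of_causation(relative_risk):
    """Attributable fraction among the exposed: (RR - 1) / RR."""
    return (relative_risk - 1.0) / relative_risk

for rr in (1.5, 2.0, 3.0, 4.0):
    print(f"RR = {rr}: attributable fraction = {probability_of_causation(rr):.0%}")
```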

The plaintiff sought to resist entry of judgment by arguing that, although the relative risk was less than two, the court should consider the upper bound of the confidence interval, which ranged from 0.9 to 4.0.  Id.  In other words, the plaintiff argued that she was entitled to have the jury consider the upper bound and find that the true relative risk was 4.0.

The court, fairly decisively, rejected this attempt to isolate the upper bound of the confidence interval:

“The upper range of the confidence intervals signify the outer realm of possibilities, and plaintiffs cannot reasonably rely on these numbers as evidence of the probability of a greater than two fold risk.  Their argument reaches new heights of speculation and has no scientific basis.”

The Marder court could have gone further by pointing out that the confidence interval does not assign a probability to any particular value within the interval.

Multiple Testing

In some situations, completeness may require more than the presentation of the size of the random error, or the width of the confidence interval.  When the sample estimate arises from a study involving multiple testing, presenting the estimate with its confidence interval or p-value can be highly misleading if the p-value is used for hypothesis testing, because multiple testing inflates the false-positive error rate.

Here is the relevant language from Kaye and Freedman’s chapter on statistics, in the Reference Manual (3d ed.):

4. How many tests have been done?

Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)^10 = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance. Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases. However, no general solution is available… . In these situations, courts should not be overly impressed with claims that estimates are significant. …

RMSE 3d ed. at 256-57.
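The Manual’s coin-tossing point is easy to check by simulation.  The sketch below, using numbers of tosses and trials chosen only for illustration, contrasts the probability of ten heads in the first ten tosses with the probability of a run of ten heads appearing somewhere in a few thousand tosses of a fair coin:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def has_run_of_heads(n_tosses, run_length=10):
    """True if a sequence of fair-coin tosses contains a run of `run_length` heads."""
    streak = 0
    for _ in range(n_tosses):
        if random.random() < 0.5:      # heads
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0
    return False

# Ten heads in the FIRST ten tosses: (1/2)^10 = 1/1024
print("P(10 heads in first 10 tosses) =", 0.5 ** 10)

# Estimated probability of a run of ten heads SOMEWHERE in 5,000 tosses
trials = 2000
hits = sum(has_run_of_heads(5000) for _ in range(trials))
print("Estimated P(run of 10 heads in 5,000 tosses) =", hits / trials)
```

The second probability comes out at roughly 0.9, which is the Manual’s point: a pattern that would be remarkable in a single pre-specified test is nearly guaranteed to turn up somewhere if the search is repeated often enough.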

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate or, worse yet, as a posterior probability for the null hypothesis.  When the sample statistic has a p-value below 0.05 in the context of multiple testing, completeness requires presenting information about the number of tests conducted and about the distorting effect of multiple testing on the pre-specified Type I error rate.  Even a nominally statistically significant finding must be understood in the full context of the study.
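The inflation is easy to quantify under simple assumptions.  If each of k independent tests uses a nominal 0.05 threshold and every null hypothesis is in fact true, the chance of at least one false-positive “significant” finding grows quickly with k; a minimal sketch:

```python
# A minimal sketch of how multiple testing inflates the false-positive rate.
# Assumes independent tests, each using a nominal 0.05 significance threshold,
# with every null hypothesis true.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    family_wise_rate = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests: P(at least one false positive) = {family_wise_rate:.2f}")
```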

Many texts and journals recommend that the Type I error rate not be modified in the published paper, so long as readers can observe the number of comparisons that were made and make the adjustment for themselves.  Most jurors and judges, however, are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implications, are further examples of how the rule of completeness may require appropriate qualifications and explanations to be presented at the same time as the claim of “statistical significance.”
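To illustrate what that adjustment looks like, here is a minimal sketch of the Bonferroni correction, one simple (and conservative) way to account for multiple comparisons; the p-values and the number of tests are hypothetical.

```python
# A minimal sketch of the Bonferroni correction, one simple and conservative
# way to adjust for multiple comparisons. The p-values and test count are
# hypothetical, for illustration only.

def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Compare the nominal p-value against alpha divided by the number of tests."""
    return p_value < alpha / n_tests

n_tests = 20
for p in (0.001, 0.01, 0.04):
    nominal = p < 0.05
    adjusted = bonferroni_significant(p, n_tests)
    print(f"p = {p}: nominally significant = {nominal}, "
          f"significant after Bonferroni across {n_tests} tests = {adjusted}")
```

A result with a nominal p-value of 0.04 looks “significant” in isolation, but against a threshold of 0.05/20 = 0.0025 it does not survive the correction, which is precisely the qualification the completeness rule should require to accompany the claim of significance.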