TORTINI

For your delectation and delight, desultory dicta on the law of delicts.

Misplaced Reliance On Peer Review to Separate Valid Science From Nonsense

August 14th, 2011

A recent editorial in the Annals of Occupational Hygiene is a poignant reminder of how oversold peer review is in the context of judicial gatekeeping of expert witnesses.  Editor Trevor Ogden offers some cautionary suggestions:

“1. Papers that have been published after proper peer review are more likely to be generally right than ones that have not.

2. However, a single study is very unlikely to take everything into account, and peer review is a very fallible process, and it is very unwise to rely on just one paper.

3. The question should be asked, has any published correspondence dealt with these papers, and what do other papers that cite them say about them?

4. Correspondence will legitimately give a point of view and not consider alternative explanations in the way a paper should, so peer review does not necessarily validate the views expressed.”

Trevor Ogden, “Lawyers Beware! The Scientific Process, Peer Review, and the Use of Papers in Evidence,” 55 Ann. Occup. Hyg. 689, 691 (2011).

Ogden’s conclusions, however, are misleading.  For instance, he suggests that peer-reviewed papers are better than non-peer-reviewed papers, but by how much?  What is the empirical evidence for Ogden’s assertion?  In his editorial, Ogden gives an anecdote of a scientific report submitted to a political body, and comments that this report would not have survived peer review.  But an anecdote is not a datum.  Worse still, a paper rejected by the peer reviewers at Ogden’s journal will eventually show up in another publication.  Courts make little distinction among journals for purposes of rating the value of peer review.

Of course it is unwise, and perhaps scientifically unsound, as Ogden points out, to rely upon just one paper, but the legal process permits it.  Worse yet, litigants, whether plaintiff or defendant, are often allowed to pick out isolated findings from a variety of studies, and throw them together as if that were science. “[O]n fait la science avec des faits comme une maison avec des pierres; mais une accumulation de faits n’est pas plus une science qu’un tas de pierres n’est une maison.” (“Science is built with facts as a house is built with stones; but an accumulation of facts is no more a science than a heap of stones is a house.”) Henri Poincaré, La Science et l’Hypothèse (1905) (chapter 9, Les Hypothèses en Physique).

As for letters to the editor, sure, courts and litigants should pay attention to them, but as Ogden notes, these writings are themselves not peer reviewed, or not peer reviewed with very much analytical rigor.  The editing of letters raises additional concerns of imperious editors who silence some points of view to the benefit of others. Most journals have space only for a few letters, and unpopular but salient points of view can go unreported. Furthermore, many scientists will not write letters to the editors, even when the published article is terribly wrong in its methods, data analyses, conclusions, or discussion, because in most journals the authors typically have the last word in the form of a reply, which often is self-serving and misleading, with immunity from further criticism.

Ogden details the limitations of peer review, but he misses the significance of how these limitations play out in the legal arena.

Limitations and Failures of Peer Review

For instance, Ogden acknowledges that peer review fails to remove important errors from published articles. Here he does provide empirical evidence.  S. Schroter, N. Black, S. Evans, et al., “What errors do peer reviewers detect, and does training improve their ability to detect them?” 101 J. Royal Soc’y Med. 507 (2008) (describing an experiment in which manuscripts were seeded with known statistical errors (9 major and 5 minor) and sent to 600 reviewers; each reviewer missed, on average, more than 6 of the 9 major errors).  Ogden tells us that the empirical evidence suggests that “peer review is a coarse and fallible filter.”

This is hardly a ringing endorsement.

Surveys of the medical literature have found the prevalence of statistical errors ranges from 30% to 90% of papers.  See, e.g., Douglas Altman, “Statistics in medical journals: developments in the 1980s,” 10 Stat. Med. 1897 (1991); Stuart J. Pocock, M.D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials. A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987); S.M. Gore, I.G. Jones, E.C. Rytter, “Misuse of statistical methods: critical assessment of articles in the BMJ from January to March 1976,” 1 Brit. Med. J. 85 (1977).

Without citing any empirical evidence, Ogden notes that peer review is not well designed to detect fraud, especially when the data are presented to look plausible.  Despite the lack of empirical evidence, the continuing saga of fraudulent publications coming to light supports Ogden’s evaluation. Peer reviewers rarely have access to underlying data.  In the silicone gel breast implant litigation, for instance, plaintiffs relied upon a collection of studies that looked very plausible from their peer-reviewed publications.  Only after the defense discovered misrepresentations and spoliation of data did the patent unreliability and invalidity of the studies become clear to reviewing courts.  The rate of retractions of published scientific articles appears to have increased, although the secular trend may have resulted from increased surveillance and scrutiny of the published literature for fraud.  Daniel S. Levine, “Fraud and Errors Fuel Research Journal Retractions,” (August 10, 2011); Murat Cokol, Fatih Ozbay, and Raul Rodriguez-Esteban, “Retraction rates are on the rise,” 9 European Molecular Biol. Reports 2 (2008);  Orac, “Scientific fraud and journal article retractions” (Aug. 12, 2011).

The fact is that peer review is not very good in detecting fraud or error in scientific work.  Ultimately, the scientific community must judge the value of the work, but in some niche areas, only “the acolytes” are paying attention.  These acolytes cite to one another, applaud each other’s work, and often serve as peer reviewers of the work in the field because editors see them as the most knowledgeable investigators in the narrow field. This phenomenon seems especially prevalent in occupational and environmental medicine.  See Cordelia Fine, “Biased But Brilliant,” New York Times (July 30, 2011) (describing confirmation bias and irrational loyalty of scientists to their hobby-horse hypotheses).

Peer review and correspondence to the editors are not the end of the story.  Discussion and debate may continue in the scientific community, but the pace of this debate may be glacial.  In areas of research where litigation or public policy does not fuel further research to address aberrant findings or to reconcile discordant results, science may take decades to ferret out the error. Litigation cannot proceed at this deliberative speed.  Furthermore, post-publication review is hardly a cure-all for the defects of peer review; post-publication commentary can be, and often is, spotty and inconsistent.  David Schriger and Douglas Altman, “Inadequate post-publication review of medical research:  A sign of an unhealthy research environment in clinical medicine,” 341 Brit. Med. J. 356 (2010) (identifying reasons for the absence of post-publication peer review).

The Evolution of Peer Review as a Criterion for Judicial Gatekeeping of Expert Witness Opinion

The story of how peer review came to be held in such high esteem in legal circles is sad, but deserves to be told.  In the Bendectin litigation, the medication sponsor, Merrell-Richardson, was confronted with the testimony of an epidemiologist, Shanna Swan, who propounded her own, unpublished re-analyses of published epidemiologic studies that had failed to find an association between Bendectin use and birth defects.  Merrell challenged Swan’s unpublished, non-peer-reviewed re-analyses as not “generally accepted” under the Frye test.  The lack of peer review seemed like good evidence of the novelty of Swan’s re-analyses, as well as their lack of general acceptance.

In the briefings, the Supreme Court received radically different views of peer review in the Daubert case.  One group of amici modestly explained that “peer review referees and editors limit their assessment of submitted articles to such matters as style, plausibility, and defensibility; they do not duplicate experiments from scratch or plow through reams of computer-generated data in order to guarantee accuracy or veracity or certainty.” Brief for Amici Curiae Daryl E. Chubin, et al. at 10, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993).  See also Daryl E. Chubin & Edward J. Hackett, Peerless Science: Peer Review and U.S. Science Policy (1990).

Other amici, such as the New England Journal of Medicine, Journal of the American Medical Association, and Annals of Internal Medicine proposed that peer-reviewed publication should be the principal criterion for admitting scientific opinion testimony.  Brief for Amici Curiae New England Journal of Medicine, Journal of the American Medical Association, and Annals of Internal Medicine in Support of Respondent, Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993). But see Arnold S. Relman & Marcia Angell, “How Good Is Peer Review?” 321 New Eng. J. Med. 827, 828 (1989) (“peer review is not and cannot be an objective scientific process, nor can it be relied on to guarantee the validity or honesty of scientific research”).

Justice Blackmun, speaking for the majority in Daubert, steered a moderate course:

“Another pertinent consideration is whether the theory or technique has been subjected to peer review and publication. Publication (which is but one element of peer review) is not a sine qua non of admissibility; it does not necessarily correlate with reliability, see S. Jasanoff, The Fifth Branch: Science Advisors as Policymakers 61-76 (1990), and in some instances well-grounded but innovative theories will not have been published, see Horrobin, “The Philosophical Basis of Peer Review and the Suppression of Innovation,” 263 JAMA 1438 (1990). Some propositions, moreover, are too particular, too new, or of too limited interest to be published. But submission to the scrutiny of the scientific community is a component of “good science,” in part because it increases the likelihood that substantive flaws in methodology will be detected. See J. Ziman, Reliable Knowledge: An Exploration of the Grounds for Belief in Science 130-133 (1978); Relman & Angell, “How Good Is Peer Review?” 321 New Eng. J. Med. 827 (1989). The fact of publication (or lack thereof) in a peer reviewed journal thus will be a relevant, though not dispositive, consideration in assessing the scientific validity of a particular technique or methodology on which an opinion is premised.”

Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593-94, 590 n.9 (1993).

This lukewarm endorsement from Justice Blackmun, in Daubert, sent a mixed message to lower federal courts, which tended to make peer review into somewhat of a mechanical test in their gatekeeping decisions.  Many federal judges (and state court judges in states that followed the Daubert precedent) were too busy, too indolent, or too lacking in analytical acumen, to look past the fact of publication and peer review.  These judges avoided the labor of independent thought by taking the fact of peer-reviewed publication as dispositive of the validity of the science in the paper.  Some commentators encouraged this low level of scrutiny and mechanical test, by suggesting that peer review could be taken as an indication of good science.  See, e.g., Margaret A. Berger, “The Supreme Court’s Trilogy on the Admissibility of Expert Testimony,” in Federal Judicial Center, Reference Manual on Scientific Evidence 9, 17 (2d ed. 2000) (describing Daubert as endorsing peer review as one of the “indicators of good science”) (hereafter cited as Reference Manual).  Elevating peer review to an indicator of good science, however, obscures its lack of epistemic warrant, misrepresents how the scientific community actually regards it, and enables judges to fall back into their pre-Daubert mindset of finding quick, easy, and invalid proxies for scientific reliability.

In a similar vein, other commentators spoke in superlatives about peer review, and thus managed to mislead judges and decision makers into regarding anything published as valid scientific data, data interpretation, and data analysis. For instance, Professor David Goodstein, writing in the Reference Manual, advises the federal judiciary that peer review is the test that separates valid science from rubbish:

“In the competition among ideas, the institution of peer review plays a central role. Scientific articles submitted for publication and proposals for funding are often sent to anonymous experts in the field, in other words, peers of the author, for review. Peer review works superbly to separate valid science from nonsense, or, in Kuhnian terms, to ensure that the current paradigm has been respected.11 It works less well as a means of choosing between competing valid ideas, in part because the peer doing the reviewing is often a competitor for the same resources (pages in prestigious journals, funds from government agencies) being sought by the authors. It works very poorly in catching cheating or fraud, because all scientists are socialized to believe that even their bitterest competitor is rigorously honest in the reporting of scientific results, making it easy to fool a referee with purposeful dishonesty if one wants to. Despite all of this, peer review is one of the sacred pillars of the scientific edifice.”

David Goodstein, “How Science Works,” Reference Manual 67, at 74-75, 82 (emphasis added).

Criticisms of Reliance Upon Peer Review as a Proxy for Reliability and Validity

Other commentators have put forward a more balanced and realistic, if not jaundiced, view of peer review. Professor Susan Haack, a philosopher of science at the University of Miami, who writes frequently about epistemic claims of expert witnesses and judicial approaches to gatekeeping, described the disconnect in meaning of peer review to scientists and to lawyers:

“For example, though peer-reviewed publication is now standard practice at scientific and medical journals, I doubt that many working scientists imagine that the fact that a work has been accepted for publication after peer review is any guarantee that it is good stuff, or that its not having been published necessarily undermines its value.92 The legal system, however, has come to invest considerable epistemic confidence in peer-reviewed publication93 — perhaps for no better reason than that the law reviews are not peer-reviewed!”

Susan Haack, “Irreconcilable Differences?  The Troubled Marriage of Science and Law,” 72 Law & Contemporary Problems 1, 19 (2009).  Haack’s assessment of the motivation of actors in the legal system is, for a philosopher, curiously ad hominem, and her shameless dig at law reviews is ironic, considering that she publishes extensively in them.  Still, her assessment that peer review is not any guarantee of an article’s being “good stuff” is one of her more coherent contributions to this discussion.

The absence of peer review hardly supports the inference that a study or an evaluation of studies is not reliable, unless of course we also know that the authors have failed after repeated attempts to find a publisher.  In today’s world of vanity presses, a researcher would be hard pressed to be unable to find a journal in which to publish a paper.  As Drummond Rennie, a former editor of the Journal of the American Medical Association (the same journal, acting as an amicus curiae to the Supreme Court, which oversold peer review), has remarked:

“There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print.”

Drummond Rennie, “Guarding the Guardians: A Conference on Editorial Peer Review,” 256 J. Am. Med. Ass’n 2391 (1986); D. Rennie, A. Flanagin, R. Smith, and J. Smith, “Fifth International Congress on Peer Review and Biomedical Publication: Call for Research,” 289 J. Am. Med. Ass’n 1438 (2003).

Other editors at leading medical journals seem to agree with Rennie.  Richard Horton, an editor of The Lancet, rejects the Goodstein view (from the Reference Manual) of peer review as the “sacred pillar of the scientific edifice”:

“The mistake, of course, is to have thought that peer review was any more than a crude means of discovering the acceptability — not the validity — of a new finding. Editors and scientists alike insist on the pivotal importance of peer review. We portray peer review to the public as a quasi-sacred process that helps to make science our most objective truth teller. But we know that the system of peer review is biased, unjust, unaccountable, incomplete, easily fixed, often insulting, usually ignorant, occasionally foolish, and frequently wrong.”

Richard Horton, “Genetically modified food: consternation, confusion, and crack-up,” 172 Med. J. Australia 148 (2000).

In the prestigious 2010 Sense About Science lecture, Fiona Godlee, the editor of the British Medical Journal, characterized peer review as deficient in at least seven ways:

  • Slow
  • Expensive
  • Biased
  • Unaccountable
  • Stifles innovation
  • Bad at detecting error
  • Hopeless at detecting fraud

Godlee, “It’s time to stand up for science once more” (June 21, 2010).

Important research often goes unpublished, and never sees the light of day.  Anti-industry zealots are fond of pointing fingers at the pharmaceutical industry, although many firms, such as GlaxoSmithKline, have adopted a practice of posting study results on a website.  The anti-industry zealots overlook how many apparently neutral investigators suppress research results that do not fit in with their pet theories.  One of my favorite examples is the failure of the late Dr. Irving Selikoff to publish his study of Johns-Manville factory workers:  William J. Nicholson, Ph.D. and Irving J. Selikoff, M.D., “Mortality experience of asbestos factory workers; effect of differing intensities of asbestos exposure,” Unpublished Manuscript.  This study investigated cancer and other mortality at a factory in New Jersey, where crocidolite was used in the manufacture of insulation products.  Selikoff and Nicholson apparently had no desire to publish a paper that would undermine their unfounded claim that crocidolite asbestos was not used by American workers.  But this desire does not necessarily mean that Nicholson and Selikoff’s unpublished paper was of any lesser quality than their study of North American insulators, the results of which they published, and republished, with abandon.

Examples of Failed Peer Review from the Litigation Front

Phenylpropanolamine and Stroke

Then there are many examples from the litigation arena of studies that passed peer review at the most demanding journals, but which did not hold up under the more intense scrutiny of review by experts in the cauldron of litigation.

In In re Phenylpropanolamine Products Liability Litigation, Judge Rothstein conducted hearings and entertained extensive briefings on the reliability of plaintiffs’ expert witnesses’ opinions, which were based largely upon one epidemiologic study, known as the “Yale Hemorrhagic Stroke Project” (HSP).  The project was undertaken by the manufacturers, which created a Scientific Advisory Group to oversee the study protocol.  The study was submitted as a report to the FDA, which reviewed the study and convened an advisory committee to review the study further.  “The prestigious NEJM published the HSP results, further substantiating that the research bears the indicia of good science.” In re Phenylpropanolamine Prod. Liab. Litig., 289 F. Supp. 2d 1230, 1239 (W.D. Wash. 2003) (citing Daubert II for the proposition that peer review shows the research meets the minimal criteria for good science).  There were thus many layers of peer review for the HSP study.

The HSP study was subjected to much greater analysis in litigation.  Peer review, even in the New England Journal of Medicine, did not and could not carry this weight. The defendants fought to obtain the underlying data of the HSP, and that underlying data unraveled the HSP paper.  Despite the plaintiffs’ initial enthusiasm for a litigation built on the back of a peer-reviewed paper in one of the leading clinical journals of internal medicine, the litigation resulted in a string of notable defense verdicts.  After one of the early defense verdicts, plaintiffs challenged the defendant’s reliance upon underlying data that went behind the peer-reviewed publication.  The trial court rejected the request for a new trial, and spoke to the value of challenging the superficial imprimatur of peer review of the key study relied upon by plaintiffs in the PPA litigation:

“I mean, you could almost say that there was some unethical activity with that Yale Study.  It’s real close.  I mean, I — I am very, very concerned at the integrity of those researchers.”

“Yale gets — Yale gets a big black eye on this.”

O’Neill v. Novartis AG, California Superior Court, Los Angeles Cty., Transcript of Oral Argument on Post-Trial Motions, at 46-47 (March 18, 2004) (Hon. Anthony J. Mohr).

Viagra and Ophthalmic Events

The litigation over ophthalmic adverse events after the use of Viagra provides another example of challenging peer review.  In re Viagra Products Liab. Litig., 658 F. Supp. 2d 936, 945 (D. Minn. 2009).  In this litigation, the court, after viewing litigation discovery materials, recognized that the authors of a key paper failed to use the methodologies that were described in their published paper.  The court gave the sober assessment that “[p]eer review and publication mean little if a study is not based on accurate underlying data.” Id.

MMR Vaccine and Autism

Plaintiffs’ expert witness in the MMR vaccine/autism litigation, Andrew Wakefield, published a paper in The Lancet, in which he purported to find an association between the measles-mumps-rubella vaccine and autism.  A.J. Wakefield, et al., “Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children,” 351 Lancet 637 (1998).  This published paper, in a well-regarded journal, opened a decade-long controversy, with litigation, over the safety of the MMR vaccine.  The study was plagued, however, not only by the failure to disclose payments from plaintiffs’ attorneys and by ethical lapses in failing to obtain ethics board approvals, but by substantially misleading reports of data and data analyses.  In 2010, Wakefield was sanctioned by the UK General Medical Council’s Fitness to Practise Panel, and the Lancet, over a decade after initial publication, “fully retract[ed] this paper from the published record.”  Editors of the Lancet, “Retraction—Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children,” 375 Lancet 445 (2010).

Accutane and Suicide

In the New Jersey litigation over claimed health effects of Accutane, one of the plaintiffs’ expert witnesses was the author of a key paper that “linked” Accutane to depression.  Palazzolo v. Hoffman La Roche, Inc., 2010 WL 363834 (N.J. App. Div.).  Discovery revealed that the author, James Bremner, did not follow the methodology described in the paper.  Furthermore, Bremner could not document the data used in the paper’s analysis, and conceded that the statistical analyses were incorrect.  The New Jersey Appellate Division held that expert opinion relying upon Bremner’s study was properly excluded because the study was not soundly and reliably generated.  Id. at *5.

Silicone and Connective Tissue Disease

It is heartening that the scientific and medical communities decisively renounced the pathological science that underlay the silicone gel breast implant litigation.  The fact remains, however, that plaintiffs relied upon a large body of published papers, each more invalid than the next, to support their claims.  For many years, judges around the country blinked and let expert witnesses offer their causation opinions, in large part based upon papers by Smalley, Shanklin, Lappe, Kossovsky, Gershwin, Garrido, and others.  Peer review did little to stop the enthusiasm of editors for this “sexy” topic until a panel of court-appointed expert witnesses and the Institute of Medicine put an end to the judicial gullibility.

Concluding Comments

One district court distinguished between pre-publication peer review and the important peer review that takes place after publication, as other researchers quietly go about replicating or reproducing a study’s findings, or attempting to build on them.  “[J]ust because an article is published in a prestigious journal, or any journal at all, does not mean per se that it is scientifically valid.”  Pick v. Amer. Med. Sys., 958 F. Supp. 1151, 1178 n.19 (E.D. La. 1997), aff’d, 198 F.3d 241 (5th Cir. 1999).  With hindsight, we can say that Merrell Richardson’s strategy of emphasizing peer review has had some unfortunate, unintended consequences.  The Supreme Court made peer review a factor for assessing reliable science, and lower courts have elevated it into a criterion of validity.  The upshot is that many courts will not go beyond the statements in a peer-reviewed paper to determine whether they are based upon sufficient facts and data, or whether they follow from sound inferences drawn from those facts and data.  These courts violate the letter and spirit of Rule 702 of the Federal Rules of Evidence.

Bad and Good Statistical Advice from the New England Journal of Medicine

July 2nd, 2011

Many people consider The New England Journal of Medicine (NEJM) a prestigious journal.  It is certainly widely read.  Judging from its “impact factor,” we know the journal is frequently cited.  So when the NEJM weighs in on an issue that involves the intersection of law and science, I pay attention.

Unfortunately, this week’s issue contains an editorial “Perspective” piece that is filled with incoherent, inconsistent, and incorrect assertions, both on the law and the science.  Mark A. Pfeffer and Marianne Bowler, “Access to Safety Data – Stockholders versus Prescribers,” 364 New Engl. J. Med. ___ (2011).

Dr. Mark Pfeffer and the Hon. Marianne Bowler used the recent United States Supreme Court decision in Matrixx Initiatives, Inc. v. Siracusano, __ U.S. __, 131 S. Ct. 1309 (2011), to advance views not supported by the law or the science.  Remarkably, Dr. Pfeffer is the Victor J. Dzau Professor of Medicine at the Harvard Medical School.  He is a physician who also holds a Ph.D. in physiology and biophysics.  Ms. Bowler is both a lawyer and a federal judge.  Between the two, they should have provided better, more accurate, and more consistent advice.

1. The Authors Erroneously Characterize Statistical Significance in Inappropriate Bayesian Terms

The article begins with a relatively straightforward characterization of various legal burdens of proof.  The authors then try to collapse one of those burdens of proof, “beyond a reasonable doubt,” which has no accepted quantitative meaning, into the significance probability used to reject a pre-specified null hypothesis in scientific studies:

“To reject the null hypothesis (that a result occurred by chance) and deem an intervention effective in a clinical trial, the level of proof analogous to law’s ‘beyond a reasonable doubt’ standard would require an extremely stringent alpha level to permit researchers to claim a statistically significant effect, with the offsetting risk that a truly effective intervention would sometimes be deemed ineffective.  Instead, most randomized clinical trials are designed to achieve a lower level of evidence that in legal jargon might be called ‘clear and convincing’, making conclusions drawn from it highly probable or reasonably certain.”

Now this is both scientific and legal nonsense.  It is distressing that a federal judge characterizes the burden of proof that she must apply, or direct juries to apply, as “legal jargon.”  More important, these authors, scientist and judge, give questionable quantitative meanings to burdens of proof, and they misstate the meaning of statistical significance.  When judges or juries must determine guilt “beyond a reasonable doubt,” they are assessing the prosecution’s claim that the defendant is guilty, given the evidence at trial.  This posterior probability can be represented as:

Probability (Guilt | Evidence Adduced)

This is what is known as a posterior probability, and it is fundamentally different from a significance probability.

The significance probability is the transposed conditional of the posterior probability used to assess guilt in a criminal trial, or contentions in a civil trial.  As law professor David Kaye and his statistician coauthor, the late David Freedman, described the p-value and significance probability:

“The p-value is the probability of getting data as extreme as, or more extreme than, the actual data, given that the null hypothesis is true:

p = Probability (extreme data | null hypothesis in model)

* * *

Conversely, large p-values indicate that the data are compatible with the null hypothesis: the observed difference is easy to explain by chance. In this context, small p-values argue for the plaintiffs, while large p-values argue for the defense.131 Since p is calculated by assuming that the null hypothesis is correct (no real difference in pass rates), the p-value cannot give the chance that this hypothesis is true. The p-value merely gives the chance of getting evidence against the null hypothesis as strong or stronger than the evidence at hand—assuming the null hypothesis to be correct. No matter how many samples are obtained, the null hypothesis is either always right or always wrong. Chance affects the data, not the hypothesis. With the frequency interpretation of chance, there is no meaningful way to assign a numerical probability to the null hypothesis.132”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, Reference Manual on Scientific Evidence 122 (2d ed. 2000).  Kaye and Freedman explained over a decade ago, for the benefit of federal judges:

“As noted above, it is easy to mistake the p-value for the probability that there is no difference. Likewise, if results are significant at the .05 level, it is tempting to conclude that the null hypothesis has only a 5% chance of being correct.142

This temptation should be resisted. From the frequentist perspective, statistical hypotheses are either true or false; probabilities govern the samples, not the models and hypotheses. The significance level tells us what is likely to happen when the null hypothesis is correct; it cannot tell us the probability that the hypothesis is true. Significance comes no closer to expressing the probability that the null hypothesis is true than does the underlying p-value.143”

Id. at 124-25.

As we can see, our scientist from the Harvard Medical School and our federal judge have committed the transpositional fallacy by likening “beyond a reasonable doubt” to the alpha used to test for a statistically significant outcome in a clinical trial.  They are not the same; nor are they analogous.

This fallacy has been repeatedly described.  Not only has the Reference Manual on Scientific Evidence (which is written specifically for federal judges) described the fallacy in detail, but legal and scientific writers have urged care to avoid this basic mistake in probabilistic reasoning.  Here is a recent admonition from one of the leading writers on the use (and misuse) of statistics in legal proceedings:

“Some commentators, however, would go much further; they argue that [5%] is an arbitrary statistical convention and since preponderance of the evidence means 51% probability, lawyers should not use 5% as the level of statistical significance but 49% – thus rejecting the null hypothesis when there is up to a 49% chance that it is true. In their view, to use a 5% standard of significance would impermissibly raise the preponderance of evidence standard in civil trials. Of course the 5% figure is arbitrary (although widely accepted in statistics) but the argument is fallacious. It assumes that 5% (or 49% for that matter) is the probability that the null hypothesis is true. The 5% level of significance is not that, but the probability of the sample evidence if the null hypothesis were true. This is a very different matter. As I pointed out in Chapter 1, the probability of the sample given the null hypothesis is not generally the same as the probability of the null hypothesis given the sample. To relate the level of significance to the probability of the null hypothesis would require an application of Bayes’s theorem and the assumption of a prior probability distribution. However, the courts have usually accepted the statistical standard, although with some justifiable reservations when the P-value is only slightly above the 5% cutoff.”

Michael O. Finkelstein, Basic Concepts of Probability and Statistics in the Law 54 (N.Y. 2009) (emphasis added).
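
To see why the transposition matters, consider a minimal numerical sketch.  The numbers below are invented solely for illustration (an assumed prior probability that the null hypothesis is true, an assumed power, and a p-value treated loosely as the likelihood of the evidence under the null); the point is Finkelstein’s: moving from P(evidence | null) to P(null | evidence) requires Bayes’s theorem and a prior, and the two probabilities can be far apart.

```python
# Hypothetical illustration of the transposition fallacy.  All numbers are
# invented; the p-value is treated loosely as the likelihood of the observed
# evidence under the null hypothesis, a common simplification in such examples.

prior_null = 0.90   # assumed prior probability that the null hypothesis is true
p_value = 0.04      # P(evidence this extreme | null true); "significant" at 0.05
power = 0.80        # assumed P(evidence this extreme | alternative true)

# Bayes's theorem: P(null | evidence)
posterior_null = (p_value * prior_null) / (
    p_value * prior_null + power * (1 - prior_null)
)

print(f"P(evidence | null) = {p_value:.2f}")
print(f"P(null | evidence) = {posterior_null:.2f}")   # roughly 0.31
# A "statistically significant" p-value of 0.04 coexists, under these
# assumptions, with a posterior probability of about 31% that the null
# hypothesis is true.  The two conditional probabilities answer different
# questions, which is why equating alpha with a burden of proof misfires.
```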

2.  The Authors, Having Mischaracterized Burden-of-Proof and Significance Probabilities, Incorrectly Assess the Meaning of the Supreme Court’s Decision in Matrixx Initiatives.

I have written a good bit about the Court’s decision in Matrixx Initiatives, most recently with David Venderbush, for the Washington Legal Foundation.  See Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” W.L.F. Legal Backgrounder (June 17, 2011).

I was thus startled to see the claim of a federal judge that the Supreme Court, in Matrixx, had “applied the ‘fair preponderance of the evidence’ standard of proof used for civil matters.”  Matrixx was a case about the sufficiency of the pleadings, and thus there really could have been no such application of a burden of proof to an evidentiary display.  The very claim is incoherent, and at odds with the Supreme Court’s holding.

The NEJM authors went on to detail how the defendant in Matrixx had persuaded the trial court that the evidence against its product, Zicam, did not reach statistical significance, and therefore the evidence should not be considered “material.”  As I have pointed out before, Matrixx focused on adverse event reports, as raw numbers of reported events, which did not, and could not, be analyzed for statistical significance.  The very essence of Matrixx’s argument was nonsense, which perhaps explains the company’s nine-nothing loss in the Supreme Court.  The authors of the opinion piece in the NEJM, however, missed that it is not the evidence of adverse event reports, with or without a statistical analysis, that is material.  What was at issue was whether the company’s failure to disclose this information, along with a good deal more, rendered misleading the company’s very aggressive, optimistic projections of future sales and profits.

The NEJM authors proceed to tell us, correctly, that adverse events do not prove causality, but then they tell us, incorrectly, that the Matrixx case shows that “such a high level of proof did not have to be achieved.”  While the authors are correct about the insufficiency of adverse event reports for causal assessments, they miss the legal significance of there being no burden of proof at play in Matrixx; it was a case on the pleadings.  The issue was the sufficiency of those pleadings, and what the Supreme Court made clear was that in the context of a product subject to FDA regulation, causation was never the test for materiality because the FDA could withdraw the product on a showing far less than scientific causation of harm.  So the plaintiffs could allege less than causation, and still have pleaded a sufficient case of securities fraud.  The Supreme Court did not, and could not, address the issue that the NEJM authors discuss.  The authors’ assessment that the Matrixx case freed legal causation of any requirement of statistical significance is a tortured reading of obiter dictum, not the holding of the case.  This editorializing is troubling.

The NEJM authors similarly hold forth on what clinicians consider material, and they announce that “[c]linicians are well aware that to be considered material, information regarding drug safety does not have to reach the same level of certainty that we demand for demonstrating efficacy.”  This is true, but clinicians are ethically bound to err on the side of safety:  Primum non nocere (first, do no harm). See, e.g., Tamraz v. Lincoln Elec. Co., 620 F.3d 665, 673 (6th Cir. 2010) (noting that treating physicians have more training in diagnosis than in etiologic assessments), cert. denied, ___ U.S.____ (2011).  Again, the authors’ statements have nothing to do with the Matrixx case, or with the standards for legal or scientific causation.

3.  The Authors, Inconsistently with Their Characterization of Various Probabilities, Proceed Correctly To Describe Statistical Significance Testing for Adverse Outcomes in Trials.

Having incorrectly described “beyond a reasonable doubt” as akin to p < 0.05, the NEJM authors then correctly point out that standard statistical testing cannot be used for “evaluating unplanned and uncommon adverse events.”  The authors also note that the flood of data in the assessment of causation of adverse events is filled with “biologic noise.”  Physicians and regulators may take the noise signals and claim that they hear a concert.  This is exactly why we should not confuse precautionary judgments with scientific assessments of causation.

Ninth Circuit Affirms Rule 702 Exclusion of Dr David Egilman in Diacetyl Case

June 20th, 2011

On June 17, 2011, the United States Court of Appeals for the Ninth Circuit affirmed a district judge’s decision to exclude Dr David S. Egilman from testifying in a consumer-exposure diacetyl case.  Newkirk v. Conagra Foods Inc. (9th Cir. 2011).

Plaintiff claimed to have developed bronchiolitis obliterans from having popped and eaten a Homeric quantity of microwavable popcorn.  The case was thus a key test of “consumer” diacetyl exposure.  Another case, also involving Egilman, just finished a Daubert hearing in Colorado last week.

To get the full “flavor” of this diacetyl case, you may have to read the district court’s opinion, which excluded Egilman and other witnesses, and entered summary judgment for the defense. Newkirk v. Conagra Foods, Inc., No. CV-08-273, 2010 WL 2680184 (E.D. Wash. July 2, 2010).

Plaintiff appealed, and so did Egilman.  (See attached Egilman Motion Appeal Diacetyl Exclusion 2011 and Egilman Declaration Newkirk Diacetyl Appeal 2011.)  In what some may consider scurrilous pleading, Egilman attacked the district judge for having excluded him from testifying.  If Egilman’s challenge to the trial judge was not bizarre enough, Egilman also claimed a right to intervene in the appeal by advancing the claim that the Rule 702 exclusion hurt his livelihood.  The following language is from paragraph 11 of Dr. Egilman’s declaration in support of his motion:

“The Daubert ruling eliminates my ability to testify in this case and in others. I will lose the opportunity to bill for services in this case and in others (although I generally donate most fees related to courtroom testimony to charitable organizations, the lack of opportunity to do so is an injury to me). Based on my experience, it is virtually certain that some lawyers will choose not to attempt to retain me as a result of this ruling. Some lawyers will be dissuaded from retaining my services because the ruling is replete with unsubstantiated pejorative attacks on my qualifications as a scientist and expert. The judge’s rejection of my opinion is primarily an ad hominem attack and not based on an actual analysis of what I said – in an effort to deflect the ad hominem nature of the attack the judge creates ‘straw man’ arguments and then knocks the straw men down, without ever addressing the substance of my positions.”

Egilman Declaration at Paragraph 11.

Egilman tempers his opinion about the prejudice he will suffer in front of judges in future cases.  Only judges who have not seen him before would likely be persuaded by Judge Peterson’s decision in Newkirk.  Those judges who have heard him testify before would, no doubt, see him for the brilliant crusading avenger that he is:

“This will generally not occur in cases heard before Judges where I have already appeared as a witness. For example a New York state trial judge has praised plaintiffs’ molecular-biology and public-health expert Dr. David Egilman as follows: ‘Dr. Egilman is a brilliant fellow and I always enjoy seeing him and I enjoy listening to his testimony . . . . He is brilliant, he really is.’ [Lopez v. Ford Motor Co., et al. (120954/2000; In re New York City Asbestos Litigation, Index No. 40000/88).]”

Egilman Declaration at p. 9 n. 2.

It does not appear as though Egilman’s attempt to intervene helped plaintiff before the Ninth Circuit, which may not have thought that he was as brilliant as the unidentified trial judge in Lopez.

The Newkirk case is interesting for several reasons.

First, the Circuit correctly saw that general causation must be shown before the plaintiff can invoke a differential etiology analysis.

Second, the Circuit saw that it is not sufficient that the substance in question can cause the outcome claimed; the substance must do so at the levels of exposure that were experienced by the plaintiff.  In Newkirk, even by consuming massive quantities of microwave popcorn, plaintiff had not shown exposure levels to diacetyl equivalent to the exposures among factory workers at risk for bronchiolitis obliterans.  The affirmance of the district court is a strong statement that exposure matters in the context of the current understanding of diacetyl causation.

Third, the Circuit was not intimidated or persuaded by the tactics of Dr David Egilman, expert witness.

Fourth, having dealt with the issues deftly, the Ninth Circuit issued a judgment from which there will be no appeal.

WLF Legal Backgrounder on Matrixx Initiatives

June 20th, 2011

In Matrixx Initiatives, Inc. v. Siracusano, ___ U.S. ___, ___ , 2011 WL 977060 (Mar. 22, 2011), the Supreme Court addressed a securities fraud case against an over-the-counter pharmaceutical company for speaking to the market about its rosy financial projections, but failing to provide information received about the hazards of the product.

Much or most of the holding of the case is an unexceptional application of settled principles of securities fraud litigation in the context of claims against a pharmaceutical company with products liability cases pending.  The defendant company, however, attempted to import Rule 702 principles of scientific evidence into a motion to dismiss on the pleadings, with much confusion resulting among the litigants, the amici, and the Court.  The Supreme Court ruled unanimously to affirm the reinstatement of the complaint against the defendant.

I have written about this case previously: “The Matrixx – A Comedy of Errors,” and “Matrixx Unloaded,” and “The Matrixx Oversold,” and “De-Zincing the Matrixx.”

Now, with the collaboration of David Venderbush from Alston & Bird LLP, we have collected our thoughts to share in the form of a Washington Legal Foundation Legal Backgrounder, which is available for download at the WLF’s website.  Schachtman & Venderbush, “Matrixx Unbounded: High Court’s Ruling Needlessly Complicates Scientific Evidence Principles,” 26 (14) Legal Backgrounder (June 17, 2011).

National Academies Press Publications Are Now Free

June 3rd, 2011

Publications of the National Research Council, as well as those of its constitutive organizations, the National Academy of Science, the Institute of Medicine, and the National Academy of Engineering, are often important resources for lawyers who litigate scientific and technical issues.  Right or wrong, these publications become forces in their own right in the courtroom, where they command serious attention from trial and appellate judges.

According to the National Academies Press’s website, all electronic versions of its books, in portable document format (pdf), will be available at its website, for free:

“As of June 2, 2011, all PDF versions of books published by the National Academies Press (NAP) will be downloadable to anyone free of charge.

That’s more than 4,000 books plus future reports produced by NAP – publisher for the National Academy of Sciences, National Academy of Engineering, Institute of Medicine, and National Research Council.”

Important works on forensic evidence, asbestos, dioxin, beryllium, research ethics, and data sharing published by the NAP, for the IOM or NRC, are now available for free.  The NAP previously charged upwards of $40 or $50 for some of these books.

This summer, the NRC’s Committee on Science, Technology and Law will release the Third Edition of the Reference Manual on Scientific Evidence, previously prepared by the Federal Judicial Center.  See http://sites.nationalacademies.org/PGA/stl/development_manual/index.htm

Statistical Power in the Academy

June 1st, 2011

Previously I have written about the concept of statistical power and how it is used and abused in the courts.  See here and here.

Statistical power was discussed in both the statistics and the epidemiology chapters of the Second Edition of The Reference Manual on Scientific Evidence. In my earlier posts, I pointed out that the chapter on epidemiology provided some misleading, outdated guidance on the use of power.  See Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in Federal Judicial Center, The Reference Manual on Scientific Evidence 333, 362-63 (2d ed. 2000) (recommending use of power curves to assess whether failure to achieve statistical significance is exonerative of the exposure in question).  This chapter suggests that “[t]he concept of power can be helpful in evaluating whether a study’s outcome is exonerative or inconclusive.” Id.; see also David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in Federal Judicial Center, Reference Manual on Scientific Evidence 83, 125-26 (2d ed. 2000).

The fact of the matter is that power curves are rarely, if ever, used in contemporary epidemiology, and post-hoc power calculations have been discouraged and severely criticized for a long time. After the data are collected, the appropriate method to evaluate the “resolving power” of a study is to examine the confidence interval around the study’s estimate of risk.  Confidence intervals allow a concerned reader to evaluate what can reasonably be ruled out (on the basis of random variation only) by the data in a given study. Post-hoc power calculations fail to provide meaningful information because they ignore the results actually obtained and require a specified alternative hypothesis.
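
For readers who want to see what the post-data alternative looks like, here is a minimal sketch of the standard large-sample (Woolf) confidence interval for an odds ratio, computed from an invented 2×2 table; the counts are hypothetical, chosen only to show how the interval conveys what the data can and cannot rule out.

```python
import math

# Hypothetical 2x2 table (invented counts, for illustration only):
#                exposed   unexposed
#   cases            20          10
#   controls         80          90
a, b, c, d = 20, 10, 80, 90

odds_ratio = (a * d) / (b * c)                    # 2.25

# Woolf (large-sample) 95% confidence interval on the log-odds scale
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
z = 1.96
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
# Roughly: OR = 2.25, 95% CI (0.99, 5.09).  The interval shows directly which
# risk estimates the data leave open and which they reasonably rule out, using
# the results actually obtained; no alternative hypothesis need be assumed.
```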

Twenty-five years ago, the use of post-hoc power was thoughtfully put in the dust bin of statistical techniques in the leading clinical medical journal:

“Although power is a useful concept for initially planning the size of a medical study, it is less relevant for interpreting studies at the end.  This is because power takes no account of the actual results obtained.”

***

“[I]n general, confidence intervals are more appropriate than power figures for interpreting results.”

Richard Simon, “Confidence intervals for reporting results of clinical trials,” 105 Ann. Intern. Med. 429, 433 (1986) (internal citation omitted).

An accompanying editorial by Ken Rothman reinforced the guidance given by Simon:

“[Simon] rightly dismisses calculations of power as a weak substitute for confidence intervals, because power calculations address only the qualitative issue of statistical significance and do not take account of the results already in hand.”

Kenneth J. Rothman, “Significance Questing,” 105 Ann. Intern. Med. 445, 446 (1986).

These two papers must be added to the 20 consensus statements, textbooks, and articles I previously cited.  See Schachtman, Power in the Courts, Part Two (2011).

The danger of the Reference Manual’s misleading advice is illustrated in a recent law review article by Professor Gold, of the Rutgers Law School, who asks “[w]hat if, as is frequently the case, such study is possible but of limited statistical power?”  Steve C. Gold, “The ‘Reshapement’ of the False Negative Asymmetry in Toxic Tort Causation,” 37 William Mitchell L. Rev. 101, 117 (2011) (available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1797826).

Never mind for the moment that Professor Gold offers no empirical evidence to support his assertion that studies of limited statistical power are “frequently” used in litigation.  Gold critically points to Dunn v. Sandoz Pharmaceuticals Corp., 275 F. Supp. 2d 672, 677–81, 684 (M.D.N.C. 2003), a Parlodel case in which the plaintiff relied upon a single case-control study that found an elevated odds ratio (8.4), which was not statistically significant.  Gold at 117.  Gold complains that “a study’s limited statistical power, rather than the absence of a genuine association, may lead to statistically insignificant results that courts treat as disproof of causation, particularly in situations without the large study samples that result from mass exposures.” Id.  Gold goes on to applaud two cases for emphasizing consideration of post-hoc power.  Id. at 117 & nn. 80-81 (citing Smith v. Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 692-93 (W.D.N.C. 2003) (“[T]he concept of power is key because it’s helpful in evaluating whether the study’s outcome . . . is exonerative or inconclusive.”), and Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767, 774 (N.D. Ohio 2010) (prohibiting expert witness from opining that epidemiologic studies are evidence of no association unless the witness “has performed a methodologically reliable analysis of the studies’ statistical power to support that conclusion”)).

What of Professor Gold’s suggestion that power should be considered in evaluating studies that do not have statistically significant outcomes of interest?  See id. at 117. Not only is Gold’s endorsement at odds with sound scientific and statistical advice, but his approach reveals a potential hypocrisy when considered in the light of his criticisms of significance testing.  Post-hoc power tests ignore the results obtained, including the variance of the actual study results, and they are calculated based upon a predetermined arbitrary measure of Type I error (alpha) that is the focus of so much of Gold’s discomfort with statistical evidence.  Of course, power calculations also are made on the basis of arbitrarily selected alternative hypotheses, but this level of arbitrariness seems not to disturb Gold so much.
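
The arbitrariness is easy to demonstrate with a sketch.  Using an invented standard error and a simple normal-approximation power formula (not any particular study’s data), the “post-hoc power” of the very same study swings from low to high depending solely on which alternative hypothesis the analyst chooses to posit:

```python
from math import log
from scipy.stats import norm

# Invented example: a study whose estimated relative risk has a standard error
# of 0.30 on the log scale.  The data never change, but the reported "power"
# depends entirely on the alternative hypothesis chosen after the fact.
se_log_rr = 0.30
z_crit = norm.ppf(0.975)          # two-sided test at alpha = 0.05

def approx_power(assumed_rr):
    """Normal-approximation power to detect assumed_rr at alpha = 0.05."""
    z_alt = abs(log(assumed_rr)) / se_log_rr
    return 1 - norm.cdf(z_crit - z_alt)

for rr in (1.5, 2.0, 3.0):
    print(f"assumed RR = {rr:.1f}  ->  power ~ {approx_power(rr):.2f}")
# The same study is "underpowered" (about 0.27) against RR = 1.5 and
# "well powered" (about 0.96) against RR = 3.0; the choice of alternative,
# not the data, drives the answer, which is why the confidence interval
# is the better post-data summary.
```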

Where does the Third Edition of the Reference Manual on Scientific Evidence come out on this issue?  The Third Edition is not yet published, but Professor David Kaye has posted his chapter on statistics on the internet.  David H. Kaye & David A. Freedman, “Reference Guide on Statistics,” chapter 5.  http://www.personal.psu.edu/dhk3/pubs/11-FJC-Ch5-Stat.pdf (David Freedman died in 2008, after the chapter was submitted to the National Academy of Sciences for review; only Professor Kaye responded to the Academy’s reviews).

The chapter essentially continues the Second Edition’s advice:

“When a study with low power fails to show a significant effect, the results may therefore be more fairly described as inconclusive than negative. The proof is weak because power is low. On the other hand, when studies have a good chance of detecting a meaningful association, failure to obtain significance can be persuasive evidence that there is nothing much to be found.”

Chapter 5, at 44-46 (citations and footnotes omitted).

The chapter’s advice is not, of course, limited to epidemiologic studies, where a risk ratio or a risk difference is typically reported with an appropriate confidence interval.  As a generalization about all statistical tests, some of which do not report a measure of “effect size” or the variability of the sample statistic, the chapter’s advice is fine.  But, as we can see from Professor Gold’s discussion and case review, the advice runs into trouble when measured against the methodological standards for evaluating an epidemiologic study’s results when confidence intervals are available.  Gold’s assessment of the cases is considerably skewed by his failure to recognize the inappropriateness of post-hoc power assessments of epidemiologic studies.

Sub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test

May 17th, 2011

Both aggregation and disaggregation of outcomes pose difficult problems for statistical analysis, and for epidemiology.  If outcomes are bundled into a single composite outcome, there has to be some basis for the bundling to make sense.  Even so, a composite outcome, such as all cardiovascular disease events, could easily hide an association in a component outcome.  For instance, studies of a drug under scrutiny may show no increased risk for all cardiovascular events, but closer inspection may show an increased risk for heart attacks while also showing a decreased risk for strokes.
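
A toy calculation with invented event counts shows how easily the masking can occur; the heart-attack excess and the stroke deficit nearly cancel in the composite:

```python
# Invented numbers, for illustration only: 1,000 patients per arm.
n_drug = n_placebo = 1000

mi_drug, mi_placebo = 30, 20            # heart attacks
stroke_drug, stroke_placebo = 12, 20    # strokes

def risk_ratio(events_drug, events_placebo):
    return (events_drug / n_drug) / (events_placebo / n_placebo)

print("RR, heart attack:", risk_ratio(mi_drug, mi_placebo))          # 1.50
print("RR, stroke:      ", risk_ratio(stroke_drug, stroke_placebo))  # 0.60
print("RR, composite:   ", risk_ratio(mi_drug + stroke_drug,
                                      mi_placebo + stroke_placebo))  # 1.05
# The composite "all cardiovascular events" ratio of about 1.05 looks null,
# even though the component outcomes move in opposite directions.
```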

The opposite problem arises when studies report multiple subgroups.  The opportunity for post hoc data mining runs rampant, and the existence of multiple subgroups means that the usual level of statistical significance becomes ineffective for ruling out chance as an explanation for an increased or decreased risk in a subgroup.  This problem is well known and extensively explored in the epidemiology literature, but it receives no attention in the Federal Judicial Center’s current Reference Manual on Scientific Evidence.  I hope that the authors of the Third Edition, which is due out in a few months, give some attention to the problem of subgroup analysis in epidemiology.  This seems to be an area where judges need a good deal of assistance, and where the Reference Manual lets them down.

Litigation tends to be a fertile field for data dredging, or the Texas Sharpshooter’s approach to epidemiology. (The Texas Sharpshooter shoots first and draws the target later.) When studies look at many outcomes, or many subgroups, chance alone will lead to results that have p-values less than the usual level for statistical significance (p < 0.05).  Accepting a result as “significant” when there is a multiplicity of testing or comparisons resulting from subgroup analyses is a form of “data torturing.” Mills, “Data Torturing,” 329 New Engl. J. Med. 1196, 1196 (1993) (“If you torture the data long enough, they will confess.”).

The multiple testing or comparison issue arises in both cohort and case-control studies.  Cohort studies can look at cancer morbidity or mortality in 20 different organs, with multiple histological subtypes for each cancer.  There are hundreds of diseases, by World Health Organization disease codes, that can be possible outcomes in a cohort study.  The odds are very good that several disease outcomes will be significantly elevated or decreased by chance alone.  Similarly, in a case-control study, participants with the outcome of interest can be questioned about hundreds of lifestyle and exposure variables.  Again, the finding of a statistically significant “risk factor” is not very compelling under these circumstances.
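
The arithmetic behind this point is simple.  Assuming independent tests at the conventional alpha of 0.05, the chance of at least one false-positive “significant” subgroup grows quickly with the number of comparisons (this is the source of the 40% figure that Stewart and Parmar give in the passage quoted below), and a Bonferroni-style adjustment shows how small the per-comparison threshold must become to hold the overall Type I error rate at 5%.  The sketch below uses generic numbers, not data from any particular study.

```python
# Family-wise false-positive risk for k independent subgroup tests at the
# conventional alpha = 0.05, and the Bonferroni-adjusted per-test threshold
# that would keep the overall Type I error rate at 5%.  Generic numbers only.
alpha = 0.05

for k in (1, 5, 10, 20, 100):
    prob_at_least_one = 1 - (1 - alpha) ** k
    bonferroni_alpha = alpha / k
    print(f"{k:3d} tests: P(at least one false positive) = {prob_at_least_one:.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")

# With 10 subgroup analyses the chance of at least one spurious "significant"
# result is about 40%; with 100 it is a near certainty (about 99%).
```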

The problem of subgroup analyses is exacerbated by defense counsel’s emphasis on statistical significance as a “bright-line” test.  When subgroup analyses yield a statistically significant result, at the usual p < 0.05, which they will often do by chance alone, plaintiffs’ counsel have a “gotcha” moment.  Having built up the importance of statistical significance, defense counsel are hard pressed to dismiss the “significant” finding, even though the study design makes it highly questionable, if not downright meaningless.

Although the Reference Manual ignores this recurrent problem, several authors have issued stern warnings about it.  For instance, Lisa Bero, who writes frequently on issues of science and the law, admonishes:

“Specifying subgroup analysis after data collection for the review has already begun can be a ‘fishing expedition’ or “data dredging” for statistically significant results and is not appropriate.”

L. Bero, “Evaluating Systematic Reviews and Meta-Analyses,” J. L. & Policy 569, 576 (2006).

Egger and Davey Smith, two well-respected authors who write about methodological issues in epidemiology, warn:

“Similarly, unplanned data-driven subgroup analyses are likely to produce spurious results.”

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

Stewart and Parmar explain the genesis of the problem and the result of diluting the protection that statistical significance usually provides against Type I errors:

“In general, the results of these subgroup analyses can be very misleading owing to the very high probability that any observed difference is due solely to chance. For example, if 10 subgroup analyses are carried out, there is a 40% chance of finding at least one significant false-positive effect (5% significance level).  Further, when the results of subgroup analyses are reported, often only those that have yielded a significant result are presented, without noting that many other analyses have been performed.”

Stewart and Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996)

“Such data dredging must be avoided and subgroup analyses should be limited to those that are specified a priori in the trial protocol.”

Id. at 272.

“Readers and reviewers should be aware that subgroup analyses, exploratory or otherwise, are likely to be particularly unreliable in situations where no overall effect of treatment has been observed.  In this case, if one subgroup exhibits a particularly positive effect of treatment, then another subgroup has to have a counteracting negative effect.”

* * *

“Consequently, perhaps the most sensible advice to readers and reviewers is to be very skeptical about the results of subgroup analyses.”

Id.  See also Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000) (“Analysis of subgroup results in a clinical trial is surprisingly unreliable, even in a large trial.  This is the result of a combination of reduced statistical power, increased variance and the play of chance.  Reliance on such analyses is more likely to be erroneous, and hence harmful, than application of the overall proportional (or relative) result in the whole trial to the estimate of absolute risk in that subgroup.  Plausible explanations can usually be found for effects that are, in reality, simply due to the play of chance.  When clinicians believe such subgroup analyses, there is a real danger of harm to the individual patient.”)

These warnings and admonitions are important caveats to statistical significance.  In emphasizing statistical significance when evaluating statistical evidence, defense lawyers are sometimes unwittingly hoist with their own petard, in the form of studies with results that meet the usual p-value threshold of less than 5%.  Courts see these defense lawyers as engaged in special pleading when counsel argue that study multiplicity requires changing the p-value threshold to preserve the desired rate of Type I error, but that is exactly what must be done.
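What “changing the p-value threshold” looks like in practice can be sketched with the familiar Bonferroni and Šidák corrections (a simplified illustration assuming ten subgroup analyses; other adjustment methods exist):

```python
# Sketch of two common ways to keep the family-wise Type I error rate
# near 5% when k subgroup analyses are performed.  The value of k is
# an assumption for illustration only.
alpha_family = 0.05
k = 10  # assumed number of subgroup analyses

bonferroni_alpha = alpha_family / k                  # 0.005
sidak_alpha = 1 - (1 - alpha_family) ** (1 / k)      # about 0.0051

print(f"Bonferroni per-test threshold: {bonferroni_alpha:.4f}")
print(f"Sidak per-test threshold:      {sidak_alpha:.4f}")

# A subgroup result with p = 0.03 clears the customary 0.05 line, but
# not the adjusted thresholds, which is the point counsel must be
# prepared to make once the 0.05 line has been oversold.
```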

A few years ago, the New England Journal of Medicine published an article that detailed the problem and promulgated guidelines for avoiding the worst abuses.  R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).  Wang and colleagues provide some important insights for how subgroup analyses can lead to increased rates of Type I errors, and they provide guidelines for authors on appropriate descriptions of subgroup analyses:

“However, subgroup analyses also introduce analytic challenges and can lead to overstated and misleading results.”

Id. at 2189a.

“When multiple subgroup analyses are performed, the probability of a false positive finding can be substantial.”

Id. at 2190a.

“There are several methods for addressing multiplicity that are based on the use of more stringent criteria for statistical significance than the customary P < 0.05.”

Id. at 2190b.

“A pre-specified subgroup analysis is one that is planned and documented before any examination of the data, preferably in the study protocol.”

Id. at 2190b.

“Post hoc analyses refer to those in which the hypotheses being tested are not specified before any examination of the data. Such analyses are of particular concern because it is often unclear how many were undertaken and whether some were motivated by inspection of the data. However, both pre-specified and post hoc subgroup analyses are subject to inflated false positive rates arising from multiple testing. Investigators should avoid the tendency to pre-specify many subgroup analyses in the mistaken belief that these analyses are free of the multiplicity problem.”

Id. at 2190b.

“When properly planned, reported, and interpreted, subgroup analyses can provide valuable information.”

Id. at 2193b.

Although Wang and colleagues take their primary aim at the abuse of subgroup analyses in randomized clinical trials, they make clear that the abuse is equally present in observational studies:

“In other settings, including observational studies, we encourage complete and thorough reporting of the subgroup analyses in the spirit of the guidelines listed.”

Id. at 2193b.

Wang and colleagues provide some very specific guidelines for reporting subgroup analyses.  These guidelines should help courts make sober assessments of results from subgroup analyses.

Recently, another guideline initiative in the field of observational epidemiology, STROBE, provided similar guidance to authors and journals for reporting subgroup analyses:

“[M]any debate the use and value of analyses restricted to subgroups of the study population. Subgroup analyses are nevertheless often done. Readers need to know which subgroup analyses were planned in advance, and which arose while analyzing the data. Also, it is important to explain what methods were used to examine whether effects or associations differed across groups … .”

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805, 817 (2007).

“There is debate about the dangers associated with subgroup analyses, and multiplicity of analyses in general.  In our opinion, there is too great a tendency to look for evidence of subgroup-specific associations, or effect-measure modification, when overall results appear to suggest little or no effect. On the other hand, there is value in exploring whether an overall association appears consistent across several, preferably pre-specified subgroups, especially when a study is large enough to have sufficient data in each subgroup. A second area of debate is about interesting subgroups that arose during the data analysis. They might be important findings, but might also arise by chance. Some argue that it is neither possible nor necessary to inform the reader about all subgroup analyses done as future analyses of other data will tell to what extent the early exciting findings stand the test of time. We advise authors to report which analyses were planned, and which were not … . This will allow readers to judge the implications of multiplicity, taking into account the study’s position on the continuum from discovery to verification or refutation.”

Id. at 826-27.

Bibliography

E. Akl, M. Briel, J.J. You, et al., “LOST to follow-up Information in Trials (LOST-IT): a protocol on the potential impact,” 10 Trials 40 (2009).

Susan Assmann, Stuart Pocock, Laura Enos & Linda Kasten, “Subgroup analysis and other (mis)uses of baseline data in clinical trials,” 355 Lancet 1064 (2000).

M. Bhandari, P.J. Devereaux, P. Li, et al., “Misuse of baseline comparison tests and subgroup analyses in surgical trials,” 447 Clin. Orthoped. Relat. Res. 247 (2006).

S. T. Brookes, E. Whitely, M. Egger, et al., “Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test,” 57 J. Clin. Epid. 229 (2004).

A-W Chan, A. Hrobjartsson, K.J. Jorgensen, et al., “Discrepancies in sample size calculations and data analyses reported in randomised trials: comparison of publications with protocols,” 337 Brit. Med. J. a2299 (2008).

L. Cui, H.M. Hung, S.J. Wang, et al., “Issues related to subgroup analysis in clinical trials,” 12 J. Biopharm. Stat. 347 (2002).

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

J. Fletcher, “Subgroup analyses: how to avoid being misled,” 335 Brit. Med. J. 96 (2007).

Nick Freemantle, “Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?” 322 Brit. Med. J. 989 (2001).

G. Guyatt, P.C. Wyer, J. Ioannidis, “When to Believe a Subgroup Analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 571-83 (2008).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Inconsistent trial assessments by the National Institute for Health and Clinical Excellence and IQWiG: standards for the performance and interpretation of subgroup analyses are needed,” 63 J. Clin. Epidem. 1298 (2010).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Standards for subgroup analyses are needed? We couldn’t agree more,”  64 J. Clin. Epidem. 451 (2011).

R. Hatala, S. Keitz, P. Wyer, et al., “Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results,” 172 Can. Med. Ass’n J. 661 (2005).

A.V. Hernandez, E.W. Steyerberg, G.S. Taylor, et al., “Subgroup analysis and covariate adjustment in randomized clinical trials of traumatic brain injury: a systematic review,” 57 Neurosurgery 1244 (2005).

A.V. Hernandez, E. Boersma, G.D. Murray, et al., “Subgroup analyses in therapeutic cardiovascular clinical trials: are most of them misleading?” 151 Am. Heart J. 257 (2006).

K. Hirji & M. Fagerland, “Outcome based subgroup analysis: a neglected concern,” 10 Trials 33 (2009).

Stephen W. Lagakos, “The Challenge of Subgroup Analyses — Reporting without Distorting,” 354 New Engl. J. Med. 1667 (2006).

C.M. Martin, G. Guyatt, V. M. Montori, “The sirens are singing: the perils of trusting trials stopped early and subgroup analyses,” 33 Crit. Care Med. 1870 (2005).

D. Moher, K. Schulz, D. Altman, et al., “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” 357 Lancet 1191 (2001).

V.M. Montori, R. Jaeschke, H.J. Schunemann, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004).

A.D. Oxman & G.H. Guyatt, “A consumer’s guide to subgroup analyses,” 116 Ann. Intern. Med. 78 (1992).

A. Oxman, G. Guyatt, L. Green, et al., “When to believe a subgroup analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 553-65 (2008).

S. Pocock, M. D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials:  A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987).

S. Pocock, S. Assmann, L. Enos, et al., “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems,” 21 Statistics in Medicine 2917 (2002).

Peter Rothwell, “Subgroup analysis in randomised controlled trials:  importance, indications, and interpretation,” 365 Lancet 176 (2005).

Kenneth Schulz & David Grimes, “Multiplicity in randomised trials II: subgroup and interim analyses,” 365 Lancet 1657 (2005).

Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000).

Reuel Stallones, “The Use and Abuse of Subgroup Analysis in Epidemiological Research,” 16 Prev. Med. 183 (1987).

Stewart & Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).

Xin Sun, Matthias Briel, Jason Busse, Elie A. Akl, John J .You, Filip Mejza, Malgorzata Bala, Natalia Diaz-Granados, Dirk Bassler, Dominik Mertz, Sadeesh K Srinathan, Per Olav Vandvik, German Malaga, Mohamed Alshurafa, Philipp Dahm, Pablo Alonso-Coello, Diane M Heels-Ansdell, Neera Bhatnagar, Bradley C. Johnston, Li Wang, Stephen D. Walter, Douglas G. Altman, and Gordon Guyatt, “Subgroup Analysis of Trials Is Rarely Easy (SATIRE): a study protocol for a systematic review to characterize the analysis, reporting, and claim of subgroup effects in randomized trials,” 10 Trials 1010 (2009).

A. Trevor & G. Sheldon, “Criteria for the Implementation of Research Evidence in Policy and Practice,” in A. Haines, ed., Getting Research Findings Into Practice 11 (2d ed. 2008).

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805–835 (2007).

Erik von Elm & Matthias Egger, “The scandal of poor epidemiological research: Reporting guidelines are needed for observational epidemiology,” 329 Brit. Med. J. 868 (2004).

R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).

S. Yusuf, J. Wittes, J. Probstfield, et al., “Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials,” 266 J. Am. Med. Ass’n 93 (1991).

De-Zincing the Matrixx

April 12th, 2011

Although the plaintiffs in Matrixx Initiatives, Inc. v. Siracusano generally were more accurate in defining statistical significance than the defendant, or than the so-called “statistical expert” amici (Ziliak and McCloskey), the plaintiffs’ brief went off the rails when it turned to discussing the requirements for proving causation.  Of course, the admissibility and sufficiency of evidence to show causation were not at issue in the case, but the plaintiffs got pulled down the rabbit hole dug by the defendant, in its bid to establish a legal bright-line rule about pleading.

Differential Diagnosis

In an effort to persuade the Court that statistical significance is not required, the plaintiffs/respondents threw science and legal principles to the wind.  They contended that statistical significance is not at all necessary to causal determinations because

“[c]ourts have recognized that a physician’s differential diagnosis (which identifies a likely cause of certain symptoms after ruling out other possibilities) can be reliable evidence of causation.”

Respondents’ Brief at 49.   Perhaps this is simply the Respondents’ naiveté, but it seems to suggest scienter to deceive. Differential diagnosis is not about etiology; it is about diagnosis, which rarely incorporates an assessment of etiology.  Even if the differentials were etiologies and not diagnoses, the putative causes in the differential must already be shown, independently, to be capable of causing the outcome in question. See, e.g., Tamraz v. Lincoln Electric Co., 620 F.3d 665 (6th Cir. 2010).  A physician cannot rule in an etiology in a specific person simply by positing it among the differentials, without independent, reliable evidence that the ruled in “specific cause” can cause the outcome in question, under the circumstances of the plaintiff’s exposure.  Furthermore, differential diagnosis or etiology is nothing more than a process of elimination to select a specific cause; it has nothing to do with statistical significance because it has nothing to do with general causation.

This error in the Respondents’ brief about differential diagnosis unfortunately finds its way into Justice Sotomayor’s opinion.

Daubert Denial and the Recrudescence of Ferebee

In their zeal, the Respondents go further than advancing a confusion between general and specific causation, and an erroneous view of what must be shown before a putative cause can be inserted in a set of differential (specific) causes.  They cite one of the most discredited cases in 20th century American law of expert witnesses:

Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C. Cir. 1984) (“products liability law does not preclude recovery until a ‘statistically significant’ number of people have been injured”).

Respondents’ Brief at 50.  This is not a personal, subjective opinion about this 1984 pre-Daubert decision.  Ferebee was wrongly decided when announced, and it was soon abandoned by the very court that issued the opinion.  It has been a derelict on the sea of evidence law for over a quarter of a century.  Citing to Ferebee, without acknowledging its clearly overruled status, raises an interesting issue about candor to the Court, and the responsibilities of counsel in trash picking in the dustbin of expert witness law.

Along with its apparent rejection of statistical significance, Ferebee is known for articulating an “anything goes” philosophy toward the admissibility and sufficiency of expert witness testimony:

“Judges, both trial and appellate, have no special competence to resolve the complex and refractory causal issues raised by the attempt to link low-level exposure to toxic chemicals with human disease.  On questions such as these, which stand at the frontier of current medical and epidemiological inquiry, if experts are willing to testify that such a link exists, it is for the jury to decide whether to credit such testimony.”

Ferebee v. Chevron Chemical Co., 736 F.2d 1529, 1534 (D.C. Cir.), cert. denied, 469 U.S. 1062 (1984).  Within a few years, the nihilism of Ferebee was severely limited by the court that decided the case:

“The question whether Bendectin causes limb reduction defects is scientific in nature, and it is to the scientific community that the law must look for the answer.  For this reason, expert witnesses are indispensable in a case such as this.  But that is not to say that the court’s hands are inexorably tied, or that it must accept uncritically any sort of opinion espoused by an expert merely because his credentials render him qualified to testify… . Whether an expert’s opinion has an adequate basis and whether without it an evidentiary burden has been met, are matters of law for the court to decide.”

Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 829 (D.C. Cir. 1988).

Of course, several important decisions intervened between Ferebee and Richardson.  In 1986, the Fifth Circuit expressed a clear message to trial judges that it would no longer continue to tolerate the anything-goes approach to expert witness opinions:

“We adhere to the deferential standard for review of decisions regarding the admission of testimony by experts.  Nevertheless, we … caution that the standard leaves appellate judges with a considerable task.  We will turn to that task with a sharp eye, particularly in those instances, hopefully few, where the record makes it evident that the decision to receive expert testimony was simply tossed off to the jury under a ‘let it all in’ philosophy.  Our message to our able trial colleagues:  it is time to take hold of expert testimony in federal trials.”

In re Air Crash Disaster, 795 F.2d 1230, 1234 (5th Cir. 1986) (emphasis added).

In the same intervening period between Ferebee and Richardson, Judge Jack Weinstein, a respected evidence scholar and well-known liberal judge, announced:

“The expert is assumed, if he meets the test of Rule 702, to have the skill to properly evaluate the hearsay, giving it probative force appropriate to the circumstances.  Nevertheless, the court may not abdicate its independent responsibilities to decide if the bases meet minimum standards of reliability as a condition of admissibility.  See Fed. Rule Ev. 104(a).  If the underlying data are so lacking in probative force and reliability that no reasonable expert could base an opinion on them, an opinion which rests entirely upon them must be excluded.”

In re “Agent Orange” Prod. Liab. Litig., 611 F. Supp. 1223, 1245 (E.D.N.Y. 1985)(excluding plaintiffs’ expert witnesses), aff’d, 818 F.2d 187 (2d Cir. 1987), cert. denied, 487 U.S. 1234 (1988).

The notion that technical decisions had to be evidence based, not opinion based, emerged elsewhere as well. For example, in the context of applying statistics, the federal courts pronounced that the ipse dixit of parties and witnesses did not count for much:

“When a litigant seeks to prove his point exclusively through the use of statistics, he is borrowing the principles of another discipline, mathematics, and applying these principles to the law. In borrowing from another discipline, a litigant cannot be selective in which principles are applied. He must employ a standard mathematical analysis. Any other requirement defies logic to the point of being unjust. Statisticians do not simply look at two statistics, such as the actual and expected percentage of blacks on a grand jury, and make a subjective conclusion that the statistics are significantly different. Rather, statisticians compare figures through an objective process known as hypothesis testing.”

Moultrie v. Martin, 690 F.2d 1078, 1082 (4th Cir. 1982) (citations omitted).

Of course, several years after the District of Columbia Circuit decided Ferebee, the Supreme Court decided Daubert in 1993, followed by decisions in Joiner, Kumho Tire, and Weisgram.  In 2000, Congress approved a new Rule of Evidence 702, which incorporated the learning and experience in judicial gatekeeping from a wide range of cases and principles.

Do the Respondents have a defense to having cited an overruled, superseded, discredited precedent in the highest federal Court?  Perhaps they would argue that they are in pari delicto with courts (Daubert-Deniers), which remarkably have ignored the status of Ferebee, and cited it.  See, e.g., Betz v. Pneumo Abex LLC, 998 A.2d 962, 981 (Pa. Super. 2010); McCarrell v. Hoffman-La Roche, Inc., 2009 WL 614484, *23 (N.J.Super.A.D. 2009).  See also Rubanick v. Witco Chemical Corp., 125 N.J. 421, 438-39 (1991)(quoting Ferebee before it was overruled by the Supreme Court, but after it was disregarded by the D.C. Circuit in Richardson).

Matrixx Galvanized – More Errors, More Comedy About Statistics

April 9th, 2011

Matrixx Initiatives is a rich case – rich in irony, comedy, tragedy, and error.  It is well worth further exploration, especially in terms of how this 9-0 decision was reached, what it means, and how it should be applied.

It pains me that the Respondents (plaintiffs) generally did a better job in explaining significance testing than did the Petitioner (defendant).

At least some of the Respondents’ definitional efforts are unexceptional.  For instance:

“Researchers use the term ‘statistical significance’ to characterize a result from a test that satisfies a particular kind of test designed to show that the result is unlikely to have occurred by random chance.  See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 83, 122 (Fed. Judicial Ctr., 2d ed. 2000) (“Reference Manual”).”

Brief for Respondents at 38–39 (Nov. 5, 2010).

“The purpose of significance testing in this context is to assess whether two events (here, taking Zicam and developing anosmia) occur together often enough to make it sufficiently implausible that no actual underlying relationship exists between them.”

Id. at 39.   These definitions seem acceptable as far as they go, as long as we realize that the relationship that remains, when chance is excluded, may not be causal, and indeed, it may well be a false-positive relationship that results from bias or confounding.

Rather than giving one good, clear definition, the Respondents felt obligated to repeat and restate their definitions, and thus wandered into error:

“To test for significance, the researcher typically develops a ‘null hypothesis’ – e.g., that there is no relationship between using intranasal Zicam and the onset of burning pain and subsequent anosmia. The researcher then selects a threshold (the ‘significance level’) that reflects an acceptably low probability of rejecting a true null hypothesis – e.g., of concluding that a relationship between Zicam and anosmia exists based on observations that in fact reflect random chance.”

Id. at 39.  Perhaps the Respondents were using the “cooking frogs” approach.  As the practical wisdom has it, dropping a frog into boiling water risks having the frog jump out, but if you put a frog into a pot of warm water, and gradually bring the pot to a boil, you will have a cooked frog.  Here the Respondents repeat and morph their definition of statistical significance until they have brought it around to their rhetorical goal of confusing statistical significance with causation.  Note that now the definition is muddled, and the Respondents are edging closer towards claiming that statistical significance signals the existence of a “relationship” between Zicam and anosmia, when in fact, the statistical significance simply means that chance is not a likely explanation for the observations.  Whether a “relationship” exists requires further analysis, and usually a good deal more evidence.

“The researcher then calculates a value (referred to as p) that reflects the probability that the observed data could have occurred even if the null hypothesis were in fact true.”

Id. at 39-40 (emphasis in original). Well, this is almost true.  It’s not “even if,” but simply “if”; that is, the p-value is based upon the assumption that the null hypothesis is correct.  The “if” is not an incidental qualifier; it is essential to the definition of statistical significance.  “Even” adds nothing but a slightly misleading rhetorical flourish.  And the p-value is not the probability that the observed data are correct; it is the probability of observing the data obtained, or data more extreme, assuming the null hypothesis is true.
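For readers who want to see the definition in action, here is a minimal sketch of a p-value calculation for a simple binomial example (the numbers are illustrative and have nothing to do with the Matrixx record):

```python
from math import comb

# Minimal sketch of what a p-value is: the probability, computed on the
# assumption that the null hypothesis is true, of data at least as
# extreme as the data observed.  Illustrative numbers only.
n, observed = 20, 15          # say, 15 "successes" in 20 trials
p_null = 0.5                  # null hypothesis: success probability of 0.5

def binom_pmf(k):
    return comb(n, k) * p_null ** k * (1 - p_null) ** (n - k)

# One-sided p-value: P(X >= 15 | H0), the null distribution's tail at
# and beyond the observed count.
p_one_sided = sum(binom_pmf(k) for k in range(observed, n + 1))
# Two-sided p-value for this symmetric null: double the one-sided tail.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
# Nothing in this calculation is "the probability that the null
# hypothesis is true"; the calculation presupposes the null hypothesis.
```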

The Respondents’/plaintiffs’ efforts at serious explication ultimately succumb to their hyperbolic rhetoric.  They explained that statistical significance may not be “practical significance,” which is true enough.  There are, of course, instances in which a statistically significant difference is not particularly interesting.  A large clinical trial, testing two cancer medications head to head, may show that one extends life expectancy by a week or two, but has a worse side-effect profile.  The statistically significant “better” drug may be refused a license by regulatory agencies, or be rejected by knowledgeable oncologists and sensible patients, who are more concerned about quality-of-life issues.

The Respondents are also correct that invoking statistical significance does not provide the simple, bright-line test the Petitioner desired.  Someone would still have to specify the level of alpha, the acceptable level of Type I error, and this would further require a specification of either a one-sided or two-sided test.  To be sure, the two-sided test, with an alpha of 5%, is generally accepted in the world of biostatistics and biomedical research.  Regulatory agencies, including the FDA, however, lower the standard to implement their precautionary principles and goals.  Furthermore, evaluation of statistical significance requires additional analysis to determine whether the observed deviation from the expected value is due to bias or confounding, or whether the statistical test has been unduly diluted by multiple comparisons, subgroup analyses, or data-mining techniques.

Of course, statistical significance today usually occurs in conjunction with an assessment of “effect size,” usually through an analysis of a confidence interval around a point estimate of a risk ratio.  The Respondents’ complaint that the p-value does not convey the magnitude of the association is a bit off the mark, but not completely illegitimate.  For instance, if there were a statistically significant finding of anosmia from Zicam use, in the form of an elevated risk that was itself small, the FDA might well decide that the risk was manageable with a warning to users to discontinue the medication if they experienced a burning sensation upon use.

The Respondents, along with their two would-be “statistical expert” amici, misrepresent the substance of many of the objections to statistical significance in the medical literature.  A telling example is the Respondents’ citation to an article by Professor David Savitz:

David A. Savitz, “Is Statistical Significance Testing Useful in Interpreting Data?” 7 Reproductive Toxicology 95, 96 (1993) (“[S]tatistical significance testing is not useful in the analysis or interpretation of scientific research.”).

Id. at 52, n. 40.

More complete quotations from Professor Savitz’ article, however, reveal a more nuanced, and rather different, message:

“Although P values and statistical significance testing have become entrenched in the practice of biomedical research, their usefulness and drawbacks should be reconsidered, particularly in observational epidemiology. The central role for the null hypothesis, assuming an infinite number of replications, and the dichotomization of results as positive or negative are argued to be detrimental to the proper design and evaluation of research. As an alternative, confidence intervals for estimated parameters convey some information about random variation without several of these limitations. Elimination of statistical significance testing as a decision rule would encourage those who present and evaluate research to more comprehensively consider the methodologic features that may yield inaccurate results and shift the focus from the potential influence of random error to a broader consideration of possible reasons for erroneous results.”

Savitz, 7 Reproductive Toxicology at 95.  Respondents’ case would hardly have been helped by replacing the call for statistical significance with a call for using confidence intervals, along with careful scrutiny of studies for possible sources of erroneous results.

“Regardless of what is taught in statistics courses or advocated by editorials, including the recent one in this journal, statistical tests are still routinely invoked as the primary criterion for assessing whether the hypothesized phenomenon has occurred.”

7 Reproductive Toxicology at 96 (internal citation omitted).

“No matter how carefully worded, “statistically significant” misleadingly conveys notions of causality and importance.”

Id. at 99.  This last quotation really unravels the Respondents’ fatuous use of citations.  Of course, the Savitz article is generally quite inconsistent with the message that the Respondents wished to convey to the Supreme Court, but intellectual honesty required a fuller acknowledgement of Prof. Savitz’ thinking about the matter.

Finally, there are some limited cases in which the failure to obtain a conventionally statistically significant result is not fatal to an assessment of causality.  Such cases usually involve instances in which it is extremely difficult to find observational or experimental data to analyze for statistical significance, but other lines of evidence support the conclusion in a way that scientists accept.  These cases are much rarer than the Respondents imagine, but they may well exist; even so, they do not detract much from Sir Ronald Fisher’s original conception of statistical significance:

“In the investigation of living beings by biological methods statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of the many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.”

Ronald A. Fisher, “The Statistical Method in Psychical Research,” 39 Proceedings of the Society for Psychical Research 189, 191 (1929). Note that Fisher was talking about experiments, not observational studies, and that he hardly was advocating a mechanical, thoughtless criterion of significance.

The Supreme Court’s decision in Castaneda illustrates how misleading statistical significance can be.  In a five-to-four decision, the Court held that a prima facie case of ethnic discrimination could be made out on the basis of statistical significance alone.  In dictum, the Court suggested that statistical evidence alone sufficed when the observed outcome was more than two or three standard deviations from the expected outcome.  Castaneda v. Partida, 430 U.S. 482, 496 n. 17 (1977).  The facts of Castaneda present a compelling case in which the statistical significance observed was likely the result of the confounding effects of reduced civic participation by poor, itinerant minorities, in a Texas county in which the ethnic minority controlled political power and made up a majority of the petit jury that convicted Mr. Partida.
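The “two or three standard deviations” dictum is, at bottom, a z-score for a binomial count.  A minimal sketch with made-up numbers (not the Castaneda record) shows the calculation, and also why a large z-score, by itself, says nothing about the confounding explanations described above:

```python
from math import sqrt

# Minimal sketch of the "two or three standard deviations" dictum as a
# binomial z-score.  The numbers are made up for illustration; they are
# not the Castaneda record.
n = 500            # persons selected for jury service over the period
p_expected = 0.60  # minority share of the presumptively eligible population
observed = 240     # minority persons actually selected

expected = n * p_expected
sd = sqrt(n * p_expected * (1 - p_expected))
z = (observed - expected) / sd
print(f"expected {expected:.0f}, observed {observed}, z = {z:.1f} standard deviations")

# A z-score of about -5.5 lies far beyond "two or three standard
# deviations," yet it says nothing about why the disparity exists:
# selection practices, confounding by eligibility or participation, or
# discrimination.
```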

The Matrixx – A Comedy of Errors

April 6th, 2011

1. Incubi Curiae

As I noted in the Matrixx Unloaded, Justice Sotomayor’s scholarship, in discussing case law under Federal Rule of Evidence 702, was seriously off base.  Of course, Matrixx Initiatives was only a pleading case, and so there was no real reason to consider rules of admissibility or sufficiency, such as Rule 702.

Fortunately, Justice Sotomayor avoided further embarrassment by not discussing the fine details of significance or hypothesis testing.  Not so the two so-called “statistics experts” who submitted an amicus brief.

Consider the following statement by McCloskey and Ziliak about adverse event reports (AER) and statistical significance.

“Suppose that a p-value for a particular test comes in at 9 percent.  Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).”

Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, at 18 (Nov. 18, 2010), 2010 WL 4657930 (U.S.) (emphasis added).

Of course, the whole enterprise of using statistical significance to evaluate AER is suspect because there is no rate, either expected or observed.  A rate could be estimated from the number of AER reported per total number of persons using the medication in some unit of time.  Pharmacoepidemiologists sometimes do engage in such speculative blue-sky enterprises to determine whether a “signal” may have been generated by the AER.  Even if a denominator were implied, and significance testing used, it would be incorrect to treat the association as causal.  Our statistics experts here have committed several serious mistakes; they have

  • treated the AERs as a rate, when they are simply a count;
  • treated the AERs as an observed rate that can be evaluated against a null hypothesis of no increase in rate, when there is no expected rate for the event in question; and
  • treated the pseudo-statistical analysis as if it provided a basis for causal assessment, when at best it would be a very weak observational study that raised an hypothesis for study.

Now that would be, and should be, enough error for any two “statistics experts” in a given day, and we might have hoped that these putative experts would have thought through their ideas before imposing themselves upon a very busy Court.  But there is another mistake, which is even more stunning for having come from self-styled “statistics experts.”  Their derivation of a probability (or an odds statement) that the null hypothesis of no increased rate of AER is false is statistically incorrect.  A p-value is based upon the assumption that the null hypothesis is true, and it measures the probability of having obtained data as extreme as, or more extreme than, the data seen in the study, relative to the expected value.  The p-value is thus a conditional probability statement of the probability of the data given the hypothesis.  As every first-year statistics student learns, you cannot reverse the order of the conditional probability statement without committing a transpositional fallacy.  In other words, you cannot obtain a statement of the probability of the hypothesis given the data from the probability of the data given the hypothesis.  Bayesians, of course, point to this limitation as a “failing” of frequentist statistics, but the limitation cannot be overcome by semantic fiat.
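A short sketch shows why the transposition fails.  Treating a “significant” result as the reported event, the posterior probability that a real effect exists depends on quantities the p-value does not supply: an assumed prior probability of a real effect and an assumed power, both chosen here purely for illustration:

```python
# Sketch of why P(data | H0) cannot simply be flipped into P(H0 | data).
# The prior probability of a real effect, the test's power, and the
# Type I error rate are all assumed, illustrative inputs.
def posterior_prob_effect(prior_effect, power, alpha):
    """P(real effect | 'significant' result) via Bayes' theorem."""
    numerator = power * prior_effect
    denominator = numerator + alpha * (1 - prior_effect)
    return numerator / denominator

# The same "significant" result yields very different posteriors,
# depending on the prior.  Here alpha is set at 0.09 to echo the amici's
# example, itself a simplification of a reported p-value of 0.09.
for prior in (0.5, 0.1, 0.01):
    post = posterior_prob_effect(prior_effect=prior, power=0.8, alpha=0.09)
    print(f"prior {prior:>4}: posterior probability of a real effect = {post:.2f}")

# The amici's "10-to-1 odds that the adverse effect is real" would
# follow only from added assumptions about priors and power, not from
# the p-value alone.
```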

No Confidence in Defendant’s Confidence Intervals

Lest anyone think I am picking on the “statistics experts,” consider the brief filed by Matrixx Initiatives.  In addition to the whole crazy business of relying upon statistical significance in the absence of a study that used a statistical test, there are the following two howlers.  You would probably think that a company putting forward a “no statistical significance” defense would want to state statistical concepts clearly, but take a look at the Petitioner’s brief:

“Various analytical methods can be used to determine whether data reflect a statistically significant result. One such method, calculating confidence intervals, is especially useful for epidemiological analysis of drug safety, because it allows the researcher to estimate the relative risk associated with taking a drug by comparing the incidence rate of an adverse event among a sample of persons who took a drug with the background incidence rate among those who did not. Dividing the former figure by the latter produces a relative risk figure (e.g., a relative risk of 2.0 indicates a 50% greater risk among the exposed population). The researcher then calculates the confidence interval surrounding the observed risk, based on the preset confidence level, to reflect the degree of certainty that the “true” risk falls within the calculated interval. If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate. Green et al., Reference Guide on Epidemiology, at 360-61. For further discussion, see id. at 348-61.”

Matrixx Initiatives Brief at p. 36 n. 18 (emphasis added). Both emphasized passages are wrong.  The Federal Judicial Center’s Reference Manual does not support them. A relative risk of 2.0 represents a 100% increase in risk, not 50%, although Matrixx Initiatives may have been thinking of a very different risk metric – the attributable risk, which would be 50% when the relative risk is 2.0.
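A quick worked check of the first point, using the standard formulas for the excess relative risk and the attributable fraction among the exposed (the quantity sometimes loosely called the attributable risk):

```latex
% Worked check, using the relative risk of 2.0 from the quoted passage
\[
RR - 1 = 2.0 - 1 = 1.0 \quad (\text{a } 100\% \text{ increase in risk}),
\qquad
\frac{RR - 1}{RR} = \frac{2.0 - 1}{2.0} = 0.5 \quad (\text{an attributable fraction of } 50\%).
\]
```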

The second emphasized statement is much worse, because there is no possible word choice that might make the brief a correct understanding of a confidence interval (CI). The CI does not permit us to make a direct probability statement about the truth of any point within the interval. Although the interval does provide some insight into the true value of the parameter, the meaning of the confidence interval must be understood operationally.  If 100 samples were taken, and a 100 × (1 − α) percent confidence interval were constructed from each (with α = 0.05, a 95% interval), we would expect about 95 of the intervals to cover, or include, the true value of the parameter.  (And α is our measure of Type I error, or the probability of false positives.)
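The operational meaning is easy to demonstrate by simulation.  Here is a minimal sketch with made-up parameters (a true mean of 10, a known standard deviation of 3, and repeated samples of 50); nothing in it comes from the Matrixx record:

```python
import random

# Minimal sketch of the "coverage" meaning of a 95% confidence interval:
# repeat the sampling many times and count how often the interval covers
# the true parameter.  All numbers are illustrative.
random.seed(0)
true_mean, sd, n, z = 10.0, 3.0, 50, 1.96

trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    se = sd / n ** 0.5                    # known-sigma case, for simplicity
    if m - z * se <= true_mean <= m + z * se:
        covered += 1

print(f"coverage over {trials} repetitions: {covered / trials:.3f}")
# The printed coverage is close to 0.95: about 95 of every 100 such
# intervals cover the true value.  No single interval carries a 95%
# probability of containing it.
```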

To realize how wrong the Petitioner’s brief is, consider the following example.  The observed relative risk is 10, but it is not statistically significant on a two-tailed test of significance, with α set at 0.05.  Suppose further that the two-sided 95% confidence interval around the observed relative risk is (0.9 to 18).  Matrixx Initiatives asserts:

“If the lower end of the interval dips below 1.0—the point at which the observed risk of an adverse event matches the background incidence rate—then the result is not statistically significant, because it is equally probable that the actual rate of adverse events following product use is identical to (or even less than) the background incidence rate.”

The Petitioner would thus have the Court believe that, in the example of a relative risk of 10, with the CI noted above, the result should be interpreted to mean that it is equally probable that the true value is 1.0 or less.  This is statistical silliness.
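To put a rough number on how lopsided the example is, here is a minimal sketch under the usual normal approximation on the log-relative-risk scale, using only the hypothetical numbers above (an illustration, not a calculation from any actual study):

```python
from math import erf, log, sqrt

# Illustrative sketch only: a hypothetical relative risk of 10 with a
# 95% CI whose lower bound is 0.9.  Under the usual normal approximation
# on the log scale, the standard error can be backed out from the
# distance between the point estimate and the lower bound.
rr_hat, lower_bound, z95 = 10.0, 0.9, 1.96
se_log = (log(rr_hat) - log(lower_bound)) / z95

def two_sided_p(null_rr):
    """Approximate p-value for testing a given null relative risk."""
    z = abs(log(rr_hat) - log(null_rr)) / se_log
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

for null_rr in (1.0, 0.9, 10.0):
    print(f"null RR = {null_rr:>4}: two-sided p of about {two_sided_p(null_rr):.2f}")

# The data are barely compatible with a relative risk of 1.0 (p of
# roughly 0.06) and maximally compatible with the point estimate of 10
# (p = 1.0).  "Equally probable" is not a defensible reading.
```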

I have collected below some statements about the CI from well-known statisticians, as an aid to avoiding such distortions of statistical concepts as we see in Matrixx.


“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval …”

Charles Poole, “Confidence Intervals Exclude Nothing,” 77 Am. J. Pub. Health 492, 493 (1987)

“Nevertheless, the difference between population means is much more likely to be near to the middle of the confidence interval than towards the extremes. Although the confidence interval is wide, the best estimate of the population difference is 6.0 mm Hg, the difference between the sample means.

* * *

“The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word “limits” suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the “limits” will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.”

Martin Gardner & Douglas Altman, “Confidence intervals rather than P values: estimation rather than hypothesis testing,” 292 Brit. Med. J. 746, 748 (1986)

“The main purpose of confidence intervals is to indicate the (im)precision of the sample study estimates as population values. Consider the following points for example: a difference of 20% between the percentages improving in two groups of 80 patients having treatments A and B was reported, with a 95% confidence interval of 6% to 34%. Firstly, a possible difference in treatment effectiveness of less than 6% or of more than 34% is not excluded by such values being outside the confidence interval – they are simply less likely than those inside the confidence interval. Secondly, the middle half of the confidence interval (13% to 27%) is more likely to contain the population value than the extreme two quarters (6% to 13% and 27% to 34%) – in fact the middle half forms a 67% confidence interval. Thirdly, regardless of the width of the confidence interval, the sample estimate is the best indicator of the population value – in this case a 20% difference in treatment response.”

Martin Gardner & Douglas Altman, “Estimating with confidence,” 296 Brit. Med. J. 1210 (1988)
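Gardner and Altman’s “67%” figure for the middle half of a 95% interval follows from the normal approximation; here is a quick check, assuming the usual normal model (my arithmetic, not theirs):

```latex
% The middle half of a 95% CI spans about 0.98 standard errors (half of
% 1.96) on either side of the point estimate.
\[
\Pr\!\left(|Z| \le \tfrac{1.96}{2}\right)
  = \Phi(0.98) - \Phi(-0.98)
  \approx 0.8365 - 0.1635
  \approx 0.67 .
\]
```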

“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible.”

“A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”

Kenneth J. Rothman, Sander Greenland, and Timothy L. Lash, Modern Epidemiology 158 (3d ed. 2008)

“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data.  But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”

Nicholas P. Jewell, Statistics for Epidemiology 23 (2004)