Sub-group Analyses in Epidemiologic Studies — Dangers of Statistical Significance as a Bright-Line Test

Both aggregation and disaggregation of outcomes pose difficult problems for statistical analysis, and for epidemiology.  If outcomes are bundled into a single composite outcome, there has to be some basis for the bundling to make sense.  Even so, a composite outcome, such as all cardiovascular disease events, could easily hide an association in a component outcome.  For instance, studies of a drug under scrutiny may show no increased risk for all cardiovascular events, but closer inspection may show an increased risk for heart attacks while also showing a decreased risk for strokes.

The opposite problem arises when studies report multiple subgroups.  The opportunity for post hoc data mining runs rampant, and the existence of multiple subgroups means that the usual level of statistical significance becomes ineffective for ruling out chance as an explanation for an increased or decreased risk in a subgroup.  This problem is well known and extensively explored in the epidemiology literature, but it receives no attention in the Federal Judicial Center’s current Reference Manual on Scientific Evidence.  I hope that the authors of the Third Edition, which is due out in a few months, give some attention to the problem of subgroup analysis in epidemiology.  This seems to be an area where judges need a good deal of assistance, and where the Reference Manual lets them down.

Litigation tends to be a fertile field for data dredging, or the Texas Sharpshooter’s approach to epidemiology. (The Texas Sharpshooter shoots first and draws the target later.) When studies look at many outcomes, or many subgroups, chance alone will lead to results that have p-values less than the usual level for statistical significance (p < 0.05).  Accepting a result as “significant” when there is a multiplicity of testing or comparisons resulting from subgroup analyses is a form of “data torturing.” Mills, “Data Torturing,” 329 New Engl. J. Med. 1196, 1196 (1993)(“If you torture the data long enough, they will confess.”).

The multiple testing or comparison issue arises in both cohort and case-control studies.  Cohort studies have the ability to look at cancer morbidity or mortality at 20 different organs, with multiple histological subtypes for each cancer.  There are hundreds of diseases, by World Health Organization disease codes, which can be a possible outcome in a cohort study.  The odds are very good that several disease outcomes will be significantly elevated or decreased by chance alone.  Similarly, in a case-control study, participants with the outcome of interest can be questioned about hundreds of lifestyle and exposure variables.  Again, the finding of a statistically significant “risk factor” is not very compelling under these circumstances.
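The scale of the problem can be illustrated with a small simulation. The sketch below is not drawn from any study discussed here. It assumes that no outcome is truly associated with the exposure, that the outcomes are independent of one another (real disease outcomes often are not), and that each p-value is therefore a uniform draw between 0 and 1. The function names are hypothetical.

```python
import random


def count_chance_findings(n_outcomes, alpha=0.05, seed=0):
    """Count outcomes reaching p < alpha in one simulated study with no true associations.

    Under the null hypothesis, the p-value of a continuous test statistic is
    uniformly distributed on [0, 1], so each outcome is modeled as one
    independent uniform draw (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_outcomes) if rng.random() < alpha)


def average_chance_findings(n_outcomes, alpha=0.05, n_studies=2000):
    """Average number of 'significant' outcomes per study over many simulated studies."""
    total = sum(
        count_chance_findings(n_outcomes, alpha, seed=s) for s in range(n_studies)
    )
    return total / n_studies


if __name__ == "__main__":
    for n in (20, 100, 300):
        print(f"{n} outcomes: about {average_chance_findings(n):.1f} chance findings per study")
```

Under these assumptions the average number of spurious "significant" outcomes is close to the number of outcomes multiplied by 0.05, so a study that examines 100 outcomes should be expected to report roughly five by chance alone.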

The problem of subgroup analyses is exacerbated by defense counsel’s emphasis on statistical significance as a “bright-line” test.  When subgroup analyses yield a statistically significant result, at the usual p < 0.05, which they will often do by chance alone, plaintiffs’ counsel obtain a “gotcha” moment.  Having built up the importance of statistical significance, defense counsel are hard pressed to dismiss the “significant” finding, even though the study design makes it highly questionable, if not downright meaningless.

Although the Reference Manual ignores this recurrent problem, several authors have warned sharply about it. For instance, Lisa Bero, who writes frequently on science and the law issues, admonishes:

“Specifying subgroup analysis after data collection for the review has already begun can be a ‘fishing expedition’ or ‘data dredging’ for statistically significant results and is not appropriate.”

L. Bero, “Evaluating Systematic Reviews and Meta-Analyses,” J. L. & Policy 569, 576 (2006).

Egger and Davey Smith, two well-respected English authors who write about methodological issues in epidemiology, warn:

“Similarly, unplanned data-driven subgroup analyses are likely to produce spurious results.”

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, at 24, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

Stewart and Parmar explain the genesis of the problem and the result of diluting the protection that statistical significance usually provides against Type I errors:

“In general, the results of these subgroup analyses can be very misleading owing to the very high probability that any observed differences is due solely to chance. For example, if 10 subgroup analyses are carried out, there is a 40% chance of finding at least one significant false-positive effect (5% significance level).  Further, when the results of subgroup analyses are reported, often only those that have yielded a significant result are presented, without noting that many other analyses have been performed.”

Stewart and Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).
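The 40% figure can be checked with a short calculation. The sketch below assumes that the subgroup tests are statistically independent and that no true effect exists in any subgroup; under those assumptions the chance of at least one false positive among k tests at level alpha is 1 - (1 - alpha)^k. The function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification for subgroups drawn from the same study.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:>2} subgroup analyses: {familywise_error_rate(k):.1%}")
```

For 10 analyses the result is about 40%, which matches the figure Stewart and Parmar give. With 20 analyses, a false positive somewhere becomes more likely than not.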

“Such data dredging must be avoided and subgroup analyses should be limited to those that are specified a priori in the trial protocol.”

Id. at 272.

“Readers and reviewers should be aware that subgroup analyses, exploratory or otherwise, are likely to be particularly unreliable in situations where no overall effect of treatment has been observed.  In this case, if one subgroup exhibits a particularly positive effect of treatment, then another subgroup has to have a counteracting negative effect.”

* * *

“Consequently, perhaps the most sensible advice to readers and reviewers is to be very skeptical about the results of subgroup analyses.”

Id.  See also Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000) (“Analysis of subgroup results in a clinical trial is surprisingly unreliable, even in a large trial.  This is the result of a combination of reduced statistical power, increased variance and the play of chance.  Reliance on such analyses is likely to be more erroneous, and hence harmful, than application of the overall proportional (or relative) result in the whole trial to the estimate of absolute risk in that subgroup.  Plausible explanations can usually be found for effects that are, in reality, simply due to the play of chance.  When clinicians believe such subgroup analyses, there is a real danger of harm to the individual patient.”).

These warnings and admonitions are important caveats to statistical significance.  In emphasizing the importance of statistical significance in evaluating statistical evidence, defense lawyers are sometimes unwittingly hoist with their own petard, in the form of studies with results that meet the usual p-value threshold of less than 5%.  Courts see these defense lawyers as engaged in special pleading when counsel argue that study multiplicity requires changing the p-value threshold to preserve the desired rate of Type I error, but that is exactly what must be done.
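Two widely used ways of tightening the threshold are the Bonferroni and Šidák corrections. Neither is prescribed by the sources quoted here; they are offered only as a sketch of what "changing the p-value threshold" can look like in practice. The function names are hypothetical, and the Šidák formula assumes independent tests.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall Type I error rate at or below alpha.

    Valid without any independence assumption, at the cost of being conservative.
    """
    return alpha / n_tests


def sidak_threshold(n_tests, alpha=0.05):
    """Per-test threshold that yields an overall Type I error rate of exactly alpha.

    Exact only when the tests are independent (an assumption of this sketch).
    """
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    for k in (5, 10, 20):
        print(
            f"{k:>2} analyses: Bonferroni p < {bonferroni_threshold(k):.4f}, "
            f"Sidak p < {sidak_threshold(k):.4f}"
        )
```

With 10 subgroup analyses, both corrections require a p-value of roughly 0.005 rather than 0.05 before a subgroup result is called significant. A subgroup finding of p = 0.03 would clear the conventional bright line and still fail either adjusted test.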

A few years ago, the New England Journal of Medicine published an article that detailed the problem and promulgated guidelines for avoiding the worst abuses.  R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).  Wang and colleagues provide some important insights into how subgroup analyses can lead to increased rates of Type I errors, and they provide guidelines for authors on appropriate descriptions of subgroup analyses:

“However, subgroup analyses also introduce analytic challenges and can lead to overstated and misleading results.”

Id. at 2189a.

“When multiple subgroup analyses are performed, the probability of a false positive finding can be substantial.”

Id. at 2190a.

“There are several methods for addressing multiplicity that are based on the use of more stringent criteria for statistical significance than the customary P < 0.05.”

Id. at 2190b.

“A pre-specified subgroup analysis is one that is planned and documented before any examination of the data, preferably in the study protocol.”

Id. at 2190b.

“Post hoc analyses refer to those in which the hypotheses being tested are not specified before any examination of the data. Such analyses are of particular concern because it is often unclear how many were undertaken and whether some were motivated by inspection of the data. However, both pre-specified and post hoc subgroup analyses are subject to inflated false positive rates arising from multiple testing. Investigators should avoid the tendency to pre-specify many subgroup analyses in the mistaken belief that these analyses are free of the multiplicity problem.”

Id. at 2190b.

“When properly planned, reported, and interpreted, subgroup analyses can provide valuable information.”

Id. at 2193b.

Although Wang and colleagues take their primary aim at the abuse of subgroup analyses in randomized clinical trials, they make clear that the abuse is equally present in observational studies:

“In other settings, including observational studies, we encourage complete and thorough reporting of the subgroup analyses in the spirit of the guidelines listed.”

Id. at 2193b.

Wang and colleagues provide some very specific guidelines for reporting subgroup analyses.  These guidelines can help courts make sober assessments of results from subgroup analyses.

Recently, another guideline initiative, STROBE, provided similar guidance in the field of observational epidemiology to authors and journals for reporting subgroup analyses:

“[M]any debate the use and value of analyses restricted to subgroups of the study population. Subgroup analyses are nevertheless often done. Readers need to know which subgroup analyses were planned in advance, and which arose while analyzing the data. Also, it is important to explain what methods were used to examine whether effects or associations differed across groups … .”

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805, 817 (2007).

“There is debate about the dangers associated with subgroup analyses, and multiplicity of analyses in general.  In our opinion, there is too great a tendency to look for evidence of subgroup-specific associations, or effect-measure modification, when overall results appear to suggest little or no effect. On the other hand, there is value in exploring whether an overall association appears consistent across several, preferably pre-specified subgroups especially when a study is large enough to have sufficient data in each subgroup. A second area of debate is about interesting subgroups that arose during the data analysis. They might be important findings, but might also arise by chance. Some argue that it is neither possible nor necessary to inform the reader about all subgroup analyses done as future analyses of other data will tell to what extent the early exciting findings stand the test of time. We advise authors to report which analyses were planned, and which were not … . This will allow readers to judge the implications of multiplicity, taking into account the study’s position on the continuum from discovery to verification or refutation.”

Id. at 826-27.

Bibliography

E. Akl, M. Briel, J.J. You, et al., “LOST to follow-up Information in Trials (LOST-IT): a protocol on the potential impact,” 10 Trials 40 (2009).

Susan Assmann, Stuart Pocock, Laura Enos, Linda Kasten, “Subgroup analysis and other (mis)uses of baseline data in clinical trials,” 355 Lancet 1064 (2000).

M. Bhandari, P.J. Devereaux, P. Li, et al., “Misuse of baseline comparison tests and subgroup analyses in surgical trials,” 447 Clin. Orthoped. Relat. Res. 247 (2006).

S. T. Brookes, E. Whitely, M. Egger, et al., “Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test,” 57 J. Clin. Epid. 229 (2004).

A-W Chan, A. Hrobjartsson, K.J. Jorgensen, et al., “Discrepancies in sample size calculations and data analyses reported in randomised trials: comparison of publications with protocols,” 337 Brit. Med. J. a2299 (2008).

L. Cui, H.M. Hung, S.J. Wang, et al., “Issues related to subgroup analysis in clinical trials,” 12 J. Biopharm. Stat. 347 (2002).

Matthias Egger & George Davey Smith, “Principles of and procedures for systematic reviews,” chap. 2, in M. Egger, G. Davey Smith, D. Altman, eds., Systematic Reviews in Health Care:  Meta-Analysis in Context (2d ed. 2001).

J. Fletcher, “Subgroup analyses: how to avoid being misled,” 335 Brit. Med. J. 96 (2007).

Nick Freemantle, “Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?” 322 Brit. Med. J. 989 (2001).

G. Guyatt, P.C. Wyer, J. Ioannidis, “When to Believe a Subgroup Analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 571-83 (2008).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Inconsistent trial assessments by the National Institute for Health and Clinical Excellence and IQWiG: standards for the performance and interpretation of subgroup analyses are needed,” 63 J. Clin. Epidem. 1298 (2010).

J. Hasford, P. Bramlage, G. Koch, W. Lehmacher, K. Einhäupl, and P.M. Rothwell, “Standards for subgroup analyses are needed? We couldn’t agree more,”  64 J. Clin. Epidem. 451 (2011).

R. Hatala, S. Keitz, P. Wyer, et al., “Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results,” 172 Can. Med. Ass’n J. 661 (2005).

A.V. Hernandez, E.W. Steyerberg, G.S. Taylor, et al., “Subgroup analysis and covariate adjustment in randomized clinical trials of traumatic brain injury: a systematic review,” 57 Neurosurgery 1244 (2005).

A.V. Hernandez, E. Boersma, G.D. Murray, et al., “Subgroup analyses in therapeutic cardiovascular clinical trials: are most of them misleading?” 151 Am. Heart J. 257 (2006).

K. Hirji & M. Fagerland, “Outcome based subgroup analysis: a neglected concern,” 10 Trials 33 (2009).

Stephen W. Lagakos, “The Challenge of Subgroup Analyses — Reporting without Distorting,” 354 New Engl. J. Med. 1667 (2006).

C.M. Martin, G. Guyatt, V. M. Montori, “The sirens are singing: the perils of trusting trials stopped early and subgroup analyses,” 33 Crit. Care Med. 1870 (2005).

D. Moher, K. Schulz, D. Altman, et al., “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” 357 Lancet 1191 (2001).

V.M. Montori, R. Jaeschke, H.J. Schunemann, et al., “Users’ guide to detecting misleading claims in clinical research reports,” 329 Brit. Med. J. 1093 (2004).

A.D. Oxman & G.H. Guyatt, “A consumer’s guide to subgroup analyses,” 116 Ann. Intern. Med. 78 (1992).

A. Oxman, G. Guyatt, L. Green, et al., “When to believe a subgroup analysis,” in G. Guyatt, et al., eds., User’s Guide to the Medical Literature: A Manual for Evidence-Based Clinical Practice 553-65 (2008).

S. Pocock, M. D. Hughes, R.J. Lee, “Statistical problems in the reporting of clinical trials:  A survey of three medical journals,” 317 New Engl. J. Med. 426 (1987).

S. Pocock, S. Assmann, L. Enos, et al., “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems,” 21 Statistics in Medicine 2917 (2002).

Peter Rothwell, “Subgroup analysis in randomised controlled trials:  importance, indications, and interpretation,” 365 Lancet 176 (2005).

Kenneth Schulz & David Grimes, “Multiplicity in randomised trials II: subgroup and interim analyses,” 365 Lancet 1657 (2005).

Sleight, “Subgroup analyses in clinical trials – fun to look at, but don’t believe them,” 1 Curr. Control Trials Cardiovasc. Med. 25 (2000).

Reuel Stallones, “The Use and Abuse of Subgroup Analysis in Epidemiological Research,” 16 Prev. Med. 183 (1987).

Stewart & Parmar, “Bias in the Analysis and Reporting of Randomized Controlled Trials,” 12 Internat’l J. Tech. Assessment in Health Care 264, 271 (1996).

Xin Sun, Matthias Briel, Jason Busse, Elie A. Akl, John J. You, Filip Mejza, Malgorzata Bala, Natalia Diaz-Granados, Dirk Bassler, Dominik Mertz, Sadeesh K Srinathan, Per Olav Vandvik, German Malaga, Mohamed Alshurafa, Philipp Dahm, Pablo Alonso-Coello, Diane M Heels-Ansdell, Neera Bhatnagar, Bradley C. Johnston, Li Wang, Stephen D. Walter, Douglas G. Altman, and Gordon Guyatt, “Subgroup Analysis of Trials Is Rarely Easy (SATIRE): a study protocol for a systematic review to characterize the analysis, reporting, and claim of subgroup effects in randomized trials,” 10 Trials 1010 (2009).

A. Trevor & G. Sheldon, “Criteria for the Implementation of Research Evidence in Policy and Practice,” in A. Haines, ed., Getting Research Findings Into Practice 11 (2d ed. 2008).

Jan P. Vandenbroucke, Erik von Elm, Douglas G. Altman, Peter C. Gøtzsche, Cynthia D. Mulrow, Stuart J. Pocock, Charles Poole, James J. Schlesselman, and Matthias Egger, for the STROBE Initiative, “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):  Explanation and Elaboration,” 18 Epidemiology 805–835 (2007).

Erik von Elm & Matthias Egger, “The scandal of poor epidemiological research: Reporting guidelines are needed for observational epidemiology,” 329 Brit. Med. J. 868 (2004).

R. Wang, S. Lagakos, J. H. Ware, et al., “Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials,” 357 New Engl. J. Med. 2189 (2007).

S. Yusuf, J. Wittes, J. Probstfield, et al., “Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials,” 266 J. Am. Med. Ass’n 93 (1991).
