Playing Dumb on Statistical Significance

For the last decade, at least, researchers have written to document, explain, and correct a high rate of false-positive research findings in biomedical research[1]. And yet, some authors complain that the traditional standard of statistical significance is too stringent. The best explanation for this paradox appears to lie in these authors’ rhetorical strategy of protecting their “scientific conclusions,” based upon weak and uncertain research findings, from criticism. The strategy includes mischaracterizing significance probability as a burden of proof, and then speciously claiming that the conventional standard of statistical significance is too high a threshold for the posterior probability of a scientific claim. See “Rhetorical Strategy in Characterizing Scientific Burdens of Proof” (Nov. 15, 2014).

Naomi Oreskes is a professor of the history of science at Harvard University. Her writings on the history of geology are well respected; her writings on climate change tend to be more adversarial, rhetorical, and ad hominem. See, e.g., Naomi Oreskes, Merchants of Doubt: How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming (N.Y. 2010). Oreskes’ abuse of the meaning of significance probability for her own rhetorical ends is on display in today’s New York Times. Naomi Oreskes, “Playing Dumb on Climate Change,” N.Y. Times Sunday Rev. at 2 (Jan. 4, 2015).

Oreskes wants her readers to believe that those who resist her conclusions about climate change are hiding behind an unreasonably high burden of proof, one that supposedly follows from the conventional standard of statistical significance. In presenting her argument, Oreskes consistently misrepresents the meaning of statistical significance and confidence intervals, treating them as the overall burden of proof for a scientific claim:

“Typically, scientists apply a 95 percent confidence limit, meaning that they will accept a causal claim only if they can show that the odds of the relationship’s occurring by chance are no more than one in 20. But it also means that if there’s more than even a scant 5 percent possibility that an event occurred by chance, scientists will reject the causal claim. It’s like not gambling in Las Vegas even though you had a nearly 95 percent chance of winning.”

Although the confidence interval is related to the pre-specified Type I error rate, alpha, and so a conventional alpha of 5% does lead to a coefficient of confidence of 95%, Oreskes has misstated the confidence interval to be a burden of proof consisting of a 95% posterior probability. The “relationship” is either true or not; the p-value or confidence interval provides a probability for observing the sample statistic, or one more extreme, on the assumption that the null hypothesis is correct. The 95% probability of confidence intervals derives from the long-run frequency with which 95% of all confidence intervals, based upon samples of the same size, will contain the true parameter of interest.
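The distinction is easy to see in a short simulation (a minimal sketch in Python, offered purely for illustration; neither the code nor its numbers come from Oreskes or her sources). The 95% attaches to the long-run performance of the interval-generating procedure over repeated samples, not to any posterior probability that a causal claim is true:

```python
# Minimal illustrative sketch: the "95%" in a 95% confidence interval is a
# long-run coverage frequency of the procedure, not a posterior probability
# that a particular hypothesis (or a particular interval) is correct.
import random
import statistics

random.seed(1)

TRUE_MEAN = 10.0   # the parameter of interest; fixed, not random
SIGMA = 2.0        # known population standard deviation (for simplicity)
N = 50             # sample size
Z = 1.96           # two-sided 95% critical value
TRIALS = 10_000    # number of repeated samples

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    half_width = Z * SIGMA / N ** 0.5
    if xbar - half_width <= TRUE_MEAN <= xbar + half_width:
        covered += 1

print(f"coverage over {TRIALS} samples: {covered / TRIALS:.3f}")
# Prints roughly 0.95: about 95% of the intervals, over many repeated samples,
# contain the true mean.  Any single interval either contains the parameter or
# it does not; nothing here assigns a probability to a scientific claim.
```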

Oreskes is an historian, but her history of statistical significance appears equally ill-considered. Here is how she describes the “severe” standard of the 95% confidence interval:

“Where does this severe standard come from? The 95 percent confidence level is generally credited to the British statistician R. A. Fisher, who was interested in the problem of how to be sure an observed effect of an experiment was not just the result of chance. While there have been enormous arguments among statisticians about what a 95 percent confidence level really means, working scientists routinely use it.”

First, Oreskes, the historian, gets the history wrong. The confidence interval is due to Jerzy Neyman, not to Sir Ronald A. Fisher. Jerzy Neyman, “Outline of a theory of statistical estimation based on the classical theory of probability,” 236 Philos. Trans. Royal Soc’y Lond. Ser. A 333 (1937). Second, although statisticians have debated the meaning of the confidence interval, they have not wandered from its essential use as an estimate of the parameter (based upon an unbiased, consistent sample statistic) and a measure of random error (not systematic error) about the sample statistic. Oreskes provides a fallacious history, with a false and misleading statistics tutorial.

Oreskes, however, goes on to misidentify the 95% coefficient of confidence with the legal standard known as “beyond a reasonable doubt”:

“But the 95 percent level has no actual basis in nature. It is a convention, a value judgment. The value it reflects is one that says that the worst mistake a scientist can make is to think an effect is real when it is not. This is the familiar “Type 1 error.” You can think of it as being gullible, fooling yourself, or having undue faith in your own ideas. To avoid it, scientists place the burden of proof on the person making an affirmative claim. But this means that science is prone to ‘Type 2 errors’: being too conservative and missing causes and effects that are really there.

Is a Type 1 error worse than a Type 2? It depends on your point of view, and on the risks inherent in getting the answer wrong. The fear of the Type 1 error asks us to play dumb; in effect, to start from scratch and act as if we know nothing. That makes sense when we really don’t know what’s going on, as in the early stages of a scientific investigation. It also makes sense in a court of law, where we presume innocence to protect ourselves from government tyranny and overzealous prosecutors — but there are no doubt prosecutors who would argue for a lower standard to protect society from crime.

When applied to evaluating environmental hazards, the fear of gullibility can lead us to understate threats. It places the burden of proof on the victim rather than, for example, on the manufacturer of a harmful product. The consequence is that we may fail to protect people who are really getting hurt.”

The truth of climate change opinions does not turn on sampling error, but rather on the soundness of inferences drawn from messy, incomplete, non-random, and inaccurate measurements, fed into models of uncertain validity. Oreskes suggests that significance probability is keeping us from acknowledging a scientific fact, but the climate change data sets are more than large enough to rule out sampling error if that were the problem. And Oreskes’ suggestion that statistical significance somehow places a burden upon the “victim” simply assumes what she hopes to prove; namely, that there is a victim (and a perpetrator).

Oreskes’ solution seems to have a Bayesian ring to it. She urges that we should start with our a priori beliefs, intuitions, and pre-existing studies, and allow them to lower our threshold for significance probability:

“And what if we aren’t dumb? What if we have evidence to support a cause-and-effect relationship? Let’s say you know how a particular chemical is harmful; for example, that it has been shown to interfere with cell function in laboratory mice. Then it might be reasonable to accept a lower statistical threshold when examining effects in people, because you already have reason to believe that the observed effect is not just chance.

This is what the United States government argued in the case of secondhand smoke. Since bystanders inhaled the same chemicals as smokers, and those chemicals were known to be carcinogenic, it stood to reason that secondhand smoke would be carcinogenic, too. That is why the Environmental Protection Agency accepted a (slightly) lower burden of proof: 90 percent instead of 95 percent.”
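If prior evidence from laboratory animals really does bear on the question, the coherent way to use it is the Bayesian one hinted at above: express it as a prior probability and update that prior with the new data, rather than quietly relax the significance threshold. A minimal sketch of that updating, with entirely hypothetical numbers chosen only to show the arithmetic (none of them come from the EPA or from Oreskes):

```python
# Minimal sketch of Bayesian updating; every number below is hypothetical.
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Posterior probability of the hypothesis given the data (Bayes' theorem)."""
    numerator = prior * p_data_given_h
    return numerator / (numerator + (1 - prior) * p_data_given_not_h)

prior = 0.30               # hypothetical prior belief based on animal evidence
p_data_given_h = 0.80      # hypothetical chance of the study result if the effect is real
p_data_given_not_h = 0.05  # hypothetical chance of the result if there is no effect

print(f"posterior: {posterior(prior, p_data_given_h, p_data_given_not_h):.2f}")
# Roughly 0.87.  The prior enters the posterior calculation directly; it is
# not a warrant for lowering alpha, which controls only the long-run rate of
# declaring chance findings "significant."
```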

Oreskes’ rhetoric misstates key aspects of scientific method. The demonstration of causality in mice, or only some perturbation of cell function in non-human animals, does not warrant lowering our standard for studies in human beings. Mice and rats are, for many purposes, poor predictors of human health effects. All medications developed for human use are tested in animals first, for safety and efficacy. A large majority of such medications, efficacious in rodents, fail to satisfy the conventional standards of significance probability in randomized clinical trials. And that standard is not lowered because the drug sponsor had previously demonstrated efficacy in mice, or some other furry rodent.

The EPA meta-analysis of passive smoking and lung cancer is a good example of how not to conduct science. The protocol for the EPA meta-analysis called for a 95% confidence interval, but the agency scientists manipulated their results by altering the pre-specified coefficient of confidence in their final report. Perhaps even more disgraceful was the selectivity of the studies included in the meta-analysis, which biased the agency’s result in a way not reflected in p-values or confidence intervals. See “EPA Cherry Picking (WOE) – EPA 1992 Meta-Analysis of ETS & Lung Cancer – Part 1” (Dec. 2, 2012); “EPA Post Hoc Statistical Tests – One Tail vs Two” (Dec. 2, 2012).
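Why the after-the-fact switch from a 95% to a 90% interval mattered is simple arithmetic: the narrower interval can exclude the null value of 1.0 even when the wider one includes it, with no change in the underlying data. A minimal sketch, using made-up numbers rather than the EPA’s actual figures:

```python
# Minimal sketch with hypothetical numbers (not the EPA's actual data): the
# same relative-risk estimate can be "not significant" with a 95% interval
# and "significant" with a 90% interval, without any new evidence.
import math

rr = 1.25            # hypothetical pooled relative risk
se_log_rr = 0.12     # hypothetical standard error on the log scale

for label, z in (("95%", 1.960), ("90%", 1.645)):
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    print(f"{label} CI: ({lower:.2f}, {upper:.2f})")

# 95% CI: (0.99, 1.58) -- includes 1.0, so "not statistically significant"
# 90% CI: (1.03, 1.52) -- excludes 1.0, so "statistically significant"
# Shrinking the coefficient of confidence after the fact changes the label,
# not the evidence.
```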

Of course, the scientists preparing for and conducting a meta-analysis on environmental tobacco smoke began with a well-justified belief that active smoking causes lung cancer. Passive smoking, however, involves very different exposure levels and raises serious issues about the human body’s defensive mechanisms against low-level exposure. Insisting on a reasonable-quality meta-analysis for passive smoking and lung cancer was not a matter of “playing dumb”; it was a recognition of our actual ignorance and uncertainty about the claim being made for low-exposure effects. The shifty confidence intervals and slippery methodology exemplify how agency scientists assume their probandum to be true, and then manipulate or adjust their methods to produce the result they had assumed all along.

Oreskes then analogizes not playing dumb on environmental tobacco smoke to not playing dumb on climate change:

“In the case of climate change, we are not dumb at all. We know that carbon dioxide is a greenhouse gas, we know that its concentration in the atmosphere has increased by about 40 percent since the industrial revolution, and we know the mechanism by which it warms the planet.

WHY don’t scientists pick the standard that is appropriate to the case at hand, instead of adhering to an absolutist one? The answer can be found in a surprising place: the history of science in relation to religion. The 95 percent confidence limit reflects a long tradition in the history of science that valorizes skepticism as an antidote to religious faith.”

I will leave the substance of the climate change issue to others, but Oreskes’ methodological misidentification of the 95% coefficient of confidence with a burden of proof is wrong. Regardless of motive, the error obscures the real debate, which is about data quality. More disturbing is that Oreskes’ error confuses significance and posterior probabilities, and distorts the meaning of the burden of proof. To be sure, Oreskes’ article is labeled opinion, and she is entitled to her opinions about climate change or anything else. To the extent that her opinions, however, are based upon obvious factual errors about statistical methodology, they are entitled to no weight at all.

[1] See, e.g., John P. A. Ioannidis, “How to Make More Published Research True,” 11 PLoS Medicine e1001747 (2014); John P. A. Ioannidis, “Why Most Published Research Findings Are False,” 2 PLoS Medicine e124 (2005); John P. A. Ioannidis, Anna-Bettina Haidich, and Joseph Lau, “Any casualties in the clash of randomised and observational evidence?” 322 Brit. Med. J. 879 (2001).