August 23, 2017

Paper: Econ Department Chair Calls for “Revolution” in Statistical Significance

Professor John List, Chairman of the Economics Department.

University of Chicago / The Chicago Maroon

Economics department chair John List is among a group of prominent academics who coauthored a recent paper proposing that the default p-value threshold for statistical significance be changed from 0.05 to 0.005.

“Clearly, what we’re going after here is a revolution,” List told The Maroon.

The paper, which was pre-published on July 22 and will appear in an upcoming edition of the science journal Nature Human Behaviour, addresses a mounting concern among scientists and academics about an apparent lack of reproducibility of studies in disciplines across the natural and social sciences.

“We believe that a leading cause of non-reproducibility has not yet been adequately addressed: Statistical standards of evidence for claiming new discoveries in many fields of science are simply too low,” the paper reads.

List says there is a “reproducibility crisis.”

“I think if we don’t begin to produce results that are more replicable, we might lose our force in policymaking and in the evidence-based world that we have right now,” he said. 

The paper urges researchers to adopt a stricter threshold for null hypothesis significance testing, in which the p-value measures the probability of observing data at least as extreme as the data actually collected if the null hypothesis (typically, that there is no effect) is assumed to be true. It recommends that results that meet the current 0.05 standard but not the proposed 0.005 threshold should be considered “suggestive” rather than “significant.”
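To make the relabeling concrete, here is a minimal sketch in Python. It assumes a simple two-sided z-test, which the paper does not prescribe, and the function names `two_sided_p_value` and `label_under_proposal` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Probability, if the null hypothesis is true, of a z-statistic
    at least as far from zero as the one observed (two-sided test)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def label_under_proposal(p: float) -> str:
    """Label a p-value using the proposed 0.005 cutoff (illustrative only)."""
    if p < 0.005:
        return "significant"
    if p < 0.05:
        # Would count as significant under the conventional 0.05 standard.
        return "suggestive"
    return "not significant"


# A z-statistic of 2.0 clears the old bar but not the proposed one.
for z in (1.0, 2.0, 3.0):
    p = two_sided_p_value(z)
    print(z, round(p, 4), label_under_proposal(p))
```

Under these assumptions, a result with a z-statistic of 2.0 (p of roughly 0.046) would move from “significant” to “suggestive,” while a z-statistic of 3.0 would still qualify as significant.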

Acknowledging the complexity of the reproducibility problem and that the p-value threshold is only one of many factors responsible for low rates of reproducibility, the authors maintain that setting a higher bar for evidence is one practical response. 

“Changing the P-value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance,” the paper reads.

University of Chicago professor and historian of statistics Stephen Stigler sees it differently. 

Although he said the authors’ proposal seems well-intentioned, he thinks there could be potentially detrimental consequences if it takes effect.

“If seriously widely adopted it could have a number of unfortunate side effects: reducing the publication of exploratory science, and simply replacing one bad idea (mindless use of 0.05) by another whose practical effects are less well understood (mindless use of 0.005),” Stigler told The Maroon.

Richard Berk, co-author of the paper and professor of criminology and statistics at the University of Pennsylvania, addressed Stigler’s criticism in a conversation with The Maroon.

“He’s absolutely right that the original decision about the 0.05 as a recommendation was essentially arbitrary and convenient. And [he’s] right, it has been used mindlessly, that’s what I’m saying—the social sciences have a history of not thinking through their methods,” Berk said. 

But he said the particular choice of the 0.005 threshold was meant to indicate a broader shift in standards of evidence, rather than create a new uncontestable numerical threshold. “It could be .007 or .003; there’s nothing magic about the 5, but the idea is basically to allow people to think about it in contrast to the 0.05 level, so it makes the leap more clear,” he said.

In 1925, Ronald Fisher, often referred to as the father of statistics, chose the p-value threshold of 0.05 as a shorthand measure of whether a dataset provides reasonable evidence to advance a scientific claim. While Fisher used the arbitrary figure as a rough standard, it has since been taken up in fields across the sciences as indicating highly probable evidence, rather than a suggestive finding.

Responding to the point that the proposed shift only makes an arbitrary threshold stricter, List argued that in an ideal world, scientists wouldn’t treat a p-value as the gold standard, but would use more dynamic methods to evaluate experimental certainty. He cited as an example Bayesian inference, a method for updating probability at each stage of the scientific process.

“If we could turn back the clock we would have of course never wanted that [standard]; we would have wanted a more continuous measure of how strong evidence is, rather than to say if it’s 0.05 or lower it’s a good result and if not it’s a bad result. Given that we’re in that world… we believe that one way to make that world better is to have a p-value threshold that is much smaller,” List said.

Berk emphasized the effects of rapidly evolving fields of computer science, statistics and econometrics on scientists who, he says, don’t always keep up with the literature. “This is not going to solve crappy research,” he said. “The message is, get up to speed. … You can’t assume that what you learned in graduate school is necessarily the appropriate way to proceed now.”

Berk and List are among the many authors of the paper who have long advocated empirical methods and field research over theoretical reasoning and conventional models.

“Researchers in the social sciences use models of the world that they assume are correct and they proceed with all of their analyses as if the model is correct, when they know in their hearts that it’s wrong. And not even in their hearts - after a couple of beers they’ll tell you it’s wrong,” Berk said. “The current practice is to proceed as if the models are right, with a little bit of cleaning up around the edges.”

Beyond the charge of arbitrariness, critics of the 0.005 recommendation have suggested that it could drive up the cost of drug trials, since by the authors’ own estimate the new standard would require increasing study sample sizes by around 70 percent.
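A rough calculation shows where a figure of about 70 percent can come from. The sketch below is an illustration, not the authors’ own derivation: it assumes a two-sided z-test, a fixed effect size, and 80 percent power, under which the required sample size is proportional to the square of the sum of two normal quantiles. The function name `relative_sample_size` is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(alpha_new: float,
                         alpha_old: float = 0.05,
                         power: float = 0.80) -> float:
    """Ratio of required sample sizes when the threshold moves from
    alpha_old to alpha_new, holding effect size and power fixed.

    Uses the normal approximation for a two-sided z-test, where
    n is proportional to (z_{1 - alpha/2} + z_{power}) ** 2.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    n_old = (z(1 - alpha_old / 2) + z(power)) ** 2
    n_new = (z(1 - alpha_new / 2) + z(power)) ** 2
    return n_new / n_old


ratio = relative_sample_size(0.005)
print(f"Sample size multiplier: {ratio:.2f}")
```

With these assumptions the multiplier comes out at roughly 1.7, which is consistent with the increase of around 70 percent cited above. Different power targets or test types would shift the number somewhat.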

Co-author Valen E. Johnson, department head of statistics at Texas A&M University, told The Maroon that, on the contrary, the measure would improve the cost effectiveness of drug trials overall. 

Pharmaceutical drugs move from smaller Phase I and II trials to multi-year, several thousand–patient Phase III testing, after which successful drugs are approved by the FDA. Rates of approval have fallen in recent years, a trend frequently attributed to lax early-stage testing.

Johnson argued that because the standard in Phase II trials is often 0.05, drug companies mistakenly believe many drugs are promising, and advance ineffective drugs to expensive late-stage testing. 

“Initiating a phase III trial of a drug that’s not effective is prohibitively expensive. And the reason that so many phase III trials are failing is because the evidence collected in the phase II trials wasn’t really adequate to justify moving to phase III,” Johnson said.

In a Center for Open Science blog post advocating for the shift, six of the paper’s authors emphasize that the benefit outweighs the cost implicit in larger sample sizes and more rigorous testing. “False positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises.”
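The size of that drop depends on how often tested hypotheses are actually true and on how well powered the studies are. The sketch below applies Bayes’ rule to estimate the share of “significant” findings that are false. The inputs used here (one true effect for every ten null effects, 80 percent power) are assumptions picked for illustration, and `false_positive_rate` is a hypothetical helper name.

```python
def false_positive_rate(alpha: float, power: float, prior_true: float) -> float:
    """Share of results below the threshold that are false positives.

    alpha      -- significance threshold (chance a null effect passes)
    power      -- chance a real effect passes the threshold
    prior_true -- assumed fraction of tested hypotheses that are real
    """
    false_hits = alpha * (1.0 - prior_true)  # null effects that pass
    true_hits = power * prior_true           # real effects that pass
    return false_hits / (false_hits + true_hits)


# Assumed inputs: 1 real effect per 10 nulls, 80 percent power.
prior = 1 / 11
old = false_positive_rate(0.05, 0.80, prior)
new = false_positive_rate(0.005, 0.80, prior)
print(f"0.05 threshold:  {old:.1%} of significant results are false")
print(f"0.005 threshold: {new:.1%} of significant results are false")
print(f"Reduction factor: {old / new:.1f}")
```

Under these particular assumptions the false positive rate falls by well over a factor of two. The calculation holds power fixed, which in practice would require the larger samples the authors describe.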