Rethinking Science’s Magic Number

By Tiffany Dill | Wednesday, February 28, 2018 | NOVA Next

Behind nearly every piece of science news you read is a magic number: the p-value.

When a new scientific finding arrives in your newsfeed, it is often too early to tell whether it’s undercooked. Consider “power posing,” the focus of a Harvard Business School study. Its authors found that standing like a superhero for one minute a day boosted testosterone, reduced stress hormones, and generally made people feel empowered.

Not long after, a “power pose” TED talk went viral, leaving a trail of anecdotal success stories and garnering millions of views. Business websites were quick to promote the “power pose” life hack—it was backed by science, after all.

And then it all came crumbling down. A nearly identical experiment, using more thorough methods and four times as many study participants, did not find evidence that power posing did any of those things. An offshoot study further failed to reproduce the magic of the “power pose.”

It’s easy to feel hoodwinked, but when the study was published, author Amy Cuddy genuinely believed her research findings. In fact, her statistics seemed to back her up, and peer reviewers had given her the thumbs up. And she is not alone. For decades, scientists and the media alike have treated claims from single publications as fact. As meticulous as scientists are, inaccurate, false, or even misleading scientific claims are surprisingly common. A 2015 experiment reported that over one-third of psychology experiments are faulty, and a similar effort in cancer research, still in progress, points to similar trends. The problem isn’t limited to those two fields, either.

Magic Number

Behind nearly every piece of science news you read is a magic number: the p-value. To generate a p-value, scientists run statistical tests comparing data sets. Their starting assumption is that there’s no difference between the things they’re comparing—new drug vs. sugar pill, power pose vs. nothing—and when they find a result that deviates significantly enough from “no difference,” they report it as a positive finding.
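One way to make that logic concrete is a permutation test, sketched below in Python. This is only an illustration under our own assumptions: the function name and the eight measurements are invented for the example, and real studies typically use other tests. If there is truly "no difference," the group labels are arbitrary, so the sketch reshuffles them every possible way and asks how often a reshuffled gap is at least as large as the one observed. That fraction is the p-value.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed_gap = abs(mean(group_a) - mean(group_b))

    at_least_as_extreme = 0
    total_splits = 0
    # Under "no difference", every way of splitting the pooled data is equally likely.
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total_splits += 1
        # Small tolerance guards against floating-point rounding in the comparison.
        if abs(mean(a) - mean(b)) >= observed_gap - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total_splits

# Hypothetical measurements, invented for this example.
treatment = [12, 14, 15, 17]
control = [8, 9, 10, 11]
print(permutation_p_value(treatment, control))
```

With these made-up numbers, only 2 of the 70 possible splits produce a gap as large as the observed one, so the p-value is 2/70, or about 0.029, which falls under the conventional 0.05 threshold.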

“If you assume that nothing is going on…and get a surprising set of data, it might make you wonder whether there is something going on,” said Daniel Lakens, an assistant professor of applied cognitive psychology at Eindhoven University of Technology in the Netherlands.

You can think of the p-value as a “warmer-colder” game for scientific knowledge: a low p-value shouts “warmer,” while a high p-value reads “colder, not much to see here.” It’s a helpful tool, but it’s not perfect. And the threshold that most scientists use is less perfect than some would like.

The p-value was originally devised by French scholar Pierre-Simon Laplace, a contemporary of Napoleon Bonaparte. Sir Ronald A. Fisher later popularized it after a famous experiment involving a lady who could taste differences in virtually identical cups of tea. Fisher proposed a p-value threshold of 0.05, and his ideas laid the groundwork for modern experimentation as it exists today.

“For decades, the conventional p-value threshold has been 0.05,” says Dr. Paul Wakim, chief of the biostatistics and clinical epidemiology service at the National Institutes of Health Clinical Center, “but it is extremely important to understand that this 0.05, there’s nothing rigorous about it. It wasn’t derived from statisticians who got together, calculated the best threshold, and then found that it is 0.05. No, it’s Ronald Fisher, who basically said, ‘Let’s use 0.05,’ and he admitted that it was arbitrary.”

But over time, p = 0.05 has morphed into a Caesar’s thumb for research, making it hard to publish anything that doesn’t fall under that threshold. What’s worse, the pressure to report a passing p-value leads some scientists to resort to misconduct.

One form of misconduct, called “p-hacking,” occurs when scientists selectively highlight a few randomly extreme data points and conceal the many insignificant ones. Their work then passes the p-value threshold, and the effects of their experiment seem more convincing than they are in reality.
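A back-of-the-envelope calculation shows why this kind of cherry-picking pays off. The Python sketch below rests on a simplifying assumption of ours, not on anything in the studies discussed here: a researcher runs several independent tests on data where nothing real is going on and reports only the best one. The function name is illustrative.

```python
def chance_of_false_positive(alpha, num_tests):
    """Probability that at least one of num_tests independent tests of a
    true "no difference" hypothesis lands below the threshold alpha."""
    return 1 - (1 - alpha) ** num_tests

# Assumed scenario: try 1, 5, or 20 comparisons and keep the best-looking one.
for k in (1, 5, 20):
    print(k, round(chance_of_false_positive(0.05, k), 3))
```

Under these assumptions, a single test passes by luck 5 percent of the time, five tests give roughly a 23 percent chance of at least one lucky pass, and twenty tests push that to about 64 percent.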

The NIH Clinical Center has reined in p-hacking in federally funded projects through improved surveillance. Researchers who register on websites like www.clinicaltrials.gov are required to outline their experiments up front. They must then report their results in that same registry. Since the information is tracked, non-significant results get regularly communicated, making it more difficult to p-hack.

Most fields don’t have a registry and instead rely heavily on publications for peer review of research, which has its shortcomings as well. Peer review can work well, but it only examines experiments after the p-values have been produced.

A Warning

Misconduct or not, many statisticians and scientists are concerned about the improper use of p-values. The American Statistical Association recently issued a stern warning against misuse of p-values, a first in its 179-year existence. “The p-value was never intended to be a substitute for scientific reasoning,” its director wrote.

That said, there may be one easy fix: lowering the p-value threshold for significance. When examining the p-values of psychology claims that successfully passed the reproducibility test, Open Science Collaboration authors noted that studies with the lowest p-values were more likely to pass, suggesting that a lower threshold could help weed out intentionally and unintentionally spurious claims.

Some research fields have voluntarily opted for more stringent thresholds because p = 0.05 was not strict enough. Particle physics, for example, has replaced the conventional p = 0.05 with “five sigma,” or p = 0.0000003, allowing physicists to announce discoveries, like the Higgs boson, with extraordinary confidence. Genetics applies similar thresholds to help scientists focus on the most relevant genes among the thousands residing in our genomes.
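The link between “five sigma” and 0.0000003 can be checked with a few lines of Python. The sketch assumes a normal (bell curve) distribution and counts only one tail, which is the assumption that reproduces the figure quoted above. The function name is ours.

```python
import math

def sigma_to_p(n_sigma, two_sided=False):
    """Tail probability of a standard normal beyond n_sigma standard deviations."""
    one_tail = 0.5 * math.erfc(n_sigma / math.sqrt(2))
    return 2 * one_tail if two_sided else one_tail

print(sigma_to_p(5))                     # about 2.9e-7, i.e. roughly 0.0000003
print(sigma_to_p(1.96, two_sided=True))  # about 0.05, the conventional threshold
```

Seen this way, the conventional 0.05 threshold corresponds to a result roughly two standard deviations from “no difference,” while particle physicists demand five.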

Another branch of statistics also supports lowering the p-value threshold. Valen Johnson of Texas A&M University analyzed scientific claims made with p-values of 0.05 using what’s known as a Bayesian approach. Most results were inconclusive. It wasn’t until the p-value neared 0.005 that the research claims satisfied both varieties of stats—and in Johnson’s eyes, were scientifically believable.

Last summer, a similar, albeit less technical, publication emerged in the journal Nature Human Behaviour, where 72 prominent scientists reiterated Johnson’s sweeping proposal. Unsurprisingly, Johnson was among the lead authors.

Think of the p-value threshold as a filter: with a high threshold, the holes in the filter are wider, and some plausible-sounding data that was actually just randomness will make it through the peer-review process. However, if you reduce the threshold to 0.005, the filter becomes much more stringent.

That sounds great, but Dr. Wakim argues that the proposed solution creates a new problem for clinical research: filtering out real discoveries that could translate into treatments, and potentially into lives saved. He’d rather see a false-positive discovery corrected later than miss a real one altogether.
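Both sides of this trade-off can be sketched with a small simulation in Python. Everything in it is an assumption chosen for illustration: two groups of 20 measurements, bell-curve noise with a known spread of 1, a simple z-test, and a modest real effect of 0.5. None of these numbers come from the studies or people quoted here.

```python
import random
from statistics import NormalDist

def z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value for equal-sized samples with known spread sigma."""
    n = len(sample_a)
    gap = sum(sample_a) / n - sum(sample_b) / n
    z = gap / (sigma * (2 / n) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def pass_rate(true_effect, alpha, n=20, trials=5000, seed=0):
    """Fraction of simulated experiments whose p-value falls below alpha."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    passed = 0
    for _ in range(trials):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p(treated, control) < alpha:
            passed += 1
    return passed / trials

for alpha in (0.05, 0.005):
    print("threshold", alpha,
          "| pure noise passing:", pass_rate(0.0, alpha),
          "| real effect passing:", pass_rate(0.5, alpha))
```

When nothing is going on, the share of experiments slipping through the filter tracks the threshold itself, so the tighter filter lets through about a tenth as much noise. But under these assumed numbers, theory puts the share of real effects that pass near 35 percent at 0.05 and near 11 percent at 0.005, which is the cost Dr. Wakim is worried about.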

Alternative Approaches

The breadth of the p = 0.005 proposal stepped on many researchers’ toes, and it wasn’t long before the internet blew up in response. Lakens seized on the wave of attention to organize a Google Doc, hosting a massive (and surprisingly organized) open dialogue on the subject. Scientists with diverse backgrounds and experience levels contemplated how a standard of p = 0.005 would be implemented.

Ultimately, they did not agree with the 72-author publication: there was too much uncertainty about whether the potential benefits would justify the costs.

Instead, the Google Doc’s authors entertained less rigid alternatives. One proposal was that each field or community might set its own threshold of significance, because the costs and benefits of changing the threshold vary depending on the discipline.

Lakens likens the conflict to a debate over speed limits. “It’s like saying we have a lot of accidents when we drive, and we could fix it if we set the driving speed to 20 miles per hour because then we’re never going to have lethal accidents,” he says. “But then of course you don’t get anywhere as quickly anymore. And maybe you cannot make some trips because they are too far. It’s sort-of the same for science, so you can make it very slow, and careful…but whether it’s actually optimal in the complete sense of weighing all the costs and benefits, it’s not very clear.”

Lakens would rather see more studies that attempt to reproduce previous experiments. After all, if a specific experiment is repeated, and the outcomes are the same as before, we can feel more confident in the results. And with scientists knowing their work could be challenged with replications, they might feel less inclined to p-hack.

But repeating an experiment is currently less dazzling than publishing a new finding, and it’s expensive. A few grants exist to address the latter issue, but the former is more intractable. New results are almost always better received than confirmations of what we already thought we knew.

Dr. Wakim would like to see scientific journals accept results no matter the p-value. “Scientific discoveries take a long time, and many of them happen by elimination of unlikely explanations,” he says. Negative results, while not as exciting, have real value.

Similar to Thomas Edison’s discovery of 999 ways not to make a lightbulb, negative results inform future research efforts, making science more efficient. What’s more, if scientists are transparent about their p-values, organizations like the Cochrane Collaboration can readily use the data for meta-analyses. A cohort of similar studies, each with a p-value slightly above 0.05, can together yield a form of significance that no stand-alone study could substantiate.
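A rough Python sketch shows how several near misses can add up. It uses Stouffer's method, a textbook way of combining p-values that we have picked for simplicity; real meta-analyses usually pool effect sizes rather than p-values, and nothing here describes Cochrane's actual procedure. The sketch assumes independent studies that all tested the same direction of effect, and the five p-values of 0.08 are invented.

```python
from statistics import NormalDist

def stouffer_combined_p(one_sided_p_values):
    """Combine independent one-sided p-values with Stouffer's method."""
    normal = NormalDist()
    # Convert each p-value to a z-score, then average them on the z scale.
    z_scores = [normal.inv_cdf(1 - p) for p in one_sided_p_values]
    combined_z = sum(z_scores) / len(z_scores) ** 0.5
    return 1 - normal.cdf(combined_z)

# Hypothetical example: five independent studies, each just missing 0.05.
print(stouffer_combined_p([0.08] * 5))
```

Under those assumptions, five studies that each "failed" at p = 0.08 combine to a p-value of roughly 0.001, well under even the stricter 0.005 threshold.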

“People say, ‘Ugh, it’s above 0.05, I wasted my time.’ No, you didn’t waste your time,” says Dr. Wakim. “If the research question is important, the result is important. Whatever it is.”

Photo credit: Dennis van Zuijlekom / Flickr (CC BY-SA 2.0)
