Science's most important (and controversial) number has its origins in a British experiment involving milk and tea.
What is a P-Value?
Published March 7, 2018
Onscreen: What if a woman claimed to tell the difference between milk poured into tea vs. tea poured into milk? How could you test her claim?
British scientist Ronald A. Fisher designed an experiment using statistics and probability to test her claim. He proposed that a reasonable test of her ability would be 8 cups: four with milk into tea and four with tea into milk, each presented randomly. The lady then would separate them back into the two groups.
That produced 70 possible combinations of the cups, but only one with them separated correctly. If she got it right, that wouldn’t prove she had a special ability. But Fisher could conclude that, if she were just guessing, it was an extremely unlikely result.
A probability of just 1.4%
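The 70 combinations and the 1.4% figure follow directly from counting. A short calculation (a sketch in Python, not part of the original film) checks both numbers:

```python
from math import comb

# Ways to choose which 4 of the 8 cups had milk poured first
combinations = comb(8, 4)  # 70 equally likely orderings under random guessing

# Only one of those orderings matches the true preparation exactly,
# so a perfect sort by pure guessing has probability 1 in 70
p_value = 1 / combinations

print(combinations)             # 70
print(round(p_value * 100, 1))  # 1.4 (percent)
```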
Thanks mainly to Ronald A. Fisher, that idea became enshrined in experimental science as the p-value. P for probability. If you assume your results were just chance, what’s the probability of those results or something even more rare?
Rebecca Goldin: If you assume that there’s a process that is completely random, and you find that it’s pretty unlikely to get your data, then you might be suspicious that something was happening. You might conclude that it’s not a random process, that it’s interesting to look at what else might be going on, and it passes some kind of sniff test.
Fisher also suggested a benchmark. Only experimental results with a p-value under .05, a probability of less than 5%, were worth a second look. If you assume your results were just due to chance, you’d see them less than 1 time out of 20. Not very likely. He called those results statistically significant.
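To make the .05 benchmark concrete, consider a hypothetical coin-flip experiment (an illustration not drawn from the film): the p-value is the chance of a result at least as extreme as the one observed, assuming pure chance.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of seeing
    k or more successes if each trial is pure chance."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Example: 15 heads in 20 flips of a fair coin
p_value = binomial_tail(20, 15)
print(round(p_value, 3))  # about 0.021, under Fisher's .05 benchmark
```

At roughly 2%, such a result would clear Fisher’s bar, though as the next speaker notes, that says nothing about whether the effect is large or important.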
Jordan Ellenberg: Statistically significant. Now, this is a terrible word. It could be quite insignificant. You could be detecting a very, very, very small effect. But it would be called, in the mathematical lingo, significant.
Since Ronald A. Fisher’s day, p-values have been used as a convenient yardstick for success by many, including most scientific journals. Since journals prefer to publish successes, and getting published is critical to career advancement, the temptation to massage and manipulate experimental data into a good p-value is enormous.
There’s even a name for it: P-hacking.
Regina Nuzzo: P-hacking is when researchers consciously or unconsciously guide their data analysis to get the result that they want. And since .05 is the bar for being able to publish, call something real, and get all your grant money, it’s usually guiding the results so that you arrive at that P of .05.
How much p-hacking really goes on is hard to know. What may be more important is to remember what was originally intended by a p-value.
Ellenberg: The p-value was always meant to be a detective, not a judge. If you do an experiment and find a result that is statistically significant, that is telling you this is an interesting place to look, to research and understand further what’s going on. Not, don’t study this further because the matter is settled.
In a sense, a low p-value is an invitation to reproduce the experiment, to help validate the result. But that doesn’t always happen. In fact, there are few career incentives for it. Journals and funders prefer novel research. There is no Nobel Prize for replication.
- Digital Producer
- Ana Aceves
- PREDICTION BY THE NUMBERS
- Written, Produced, and Directed by
- Daniel McCabe
- Cara Feinberg
- © WGBH Educational Foundation 2018
Science Photo Library