Since 2005, when Stanford University professor John Ioannidis published his paper “Why Most Published Research Findings Are False” in PLOS Medicine, reports have been mounting of studies that are false, misleading, and/or irreproducible. Two major pharmaceutical companies each took a sample of “landmark” cancer biology papers and were able to validate the findings of only 6% and 11%, respectively. A similar attempt to validate 70 potential drug targets for treating amyotrophic lateral sclerosis in mice came up with zero positive results. In psychology, an effort to replicate 100 peer-reviewed studies successfully reproduced the results of only 39. While most replication efforts have focused on biomedicine, health, and psychology, a recent survey of over 1,500 scientists from various fields suggests that the problem is widespread.
What began as a rumor among scientists has become a heated debate garnering national attention. The assertion that many published scientific studies cannot be reproduced has been covered in nearly every major newspaper, featured in TED talks, and discussed on late-night talk shows.
Interpretations of the issue seem to fall into two categories:
- This is how science works. Science is inherently uncertain, and contradictions happen all the time. The problem is that we do not know how to manage our expectations of science. The solution is to distinguish uncertain science from science that has been established beyond a reasonable doubt.
- This is not how science works. Conflicting studies expose flawed or malfunctioning science. The solution is for science to change its practices.
The evidence around reproducibility suggests that both are true: science is inherently uncertain, and it needs to change its practices.
Science at Work
If science is functioning properly, why might the same experiment yield one result one day and a different result another?
To understand this in its most basic sense, it’s helpful to imagine a simple experiment testing theories of gravity. For centuries, Aristotle’s view prevailed: objects were thought to fall at a speed proportional to their mass. If you drop a rock and a feather at the same time, the heavier rock does fall faster than the lighter feather. Does this prove Aristotle’s theory?
Now imagine Galileo, who was skeptical of Aristotle’s theory, dropping a cannonball and a musket ball at the same time. It’s a different way of testing the same theory. Although their respective weights are very different, the two balls would hit the ground at the same time. This demonstration effectively proves Aristotle’s theory wrong (though there is no evidence that Galileo himself performed it).
The moral of this story is not that Aristotle was wrong through-and-through. His observation is still true—indeed, a feather will always fall more slowly than a rock (in Earth’s atmosphere). Only his conclusion was wrong.
This thought experiment illustrates how the conclusions drawn by scientists can outrun the available evidence—a leap known as induction. Induction is a natural part of the scientific process, and the simple fact that no two experiments can ever be exactly the same helps explain why many scientific theories eventually fail.
In the 17th century, Robert Boyle’s air pump was a crucial apparatus for investigating the nature of the vacuum. Another scientist, Christiaan Huygens, built his own air pump—one of the only other air pumps in the world at the time—and produced a phenomenon where water appeared to levitate inside a glass jar inside the air pump. He called it “anomalous suspension” of water. But Boyle could not replicate the effect in his air pump and consequently rejected Huygens’ claims. After months of dispute, Huygens travelled to England to produce the result on Boyle’s own air pump. Once replicated in Boyle’s air pump, the anomalous suspension of water was accepted as a matter of fact. The explanation of why it had occurred and what it meant remained a mystery, but the experiment had been successfully reproduced.
More recently, a similar dispute occurred between Mina Bissell, a breast cancer researcher at the University of California Berkeley, and collaborator Kornelia Polyak of Harvard University. One tiny methodological difference meant the two labs were unable to replicate each other’s profiles produced by fluorescence-activated cell sorting (FACS) of human breast cells. They were eventually able to resolve the issue by performing the experiments literally side-by-side. Bissell and Polyak found that the results depended on the way in which cell samples were agitated—“vigorously stirring” compared to “rocking relatively gently.” Once this methodological difference was identified, the FACS profiles were consistent and the two labs were able to move forward.
The disputes between Aristotle and Galileo, Boyle and Huygens, and Bissell and Polyak each involved some inconsistency between the respective experiments that needed to be “fixed.” If everyone does everything the same way, the experimental phenomena behave reliably. When the results disagree, something is different. The challenge is to find out what.
But not all science works that way.
Variation, Uncertainty, and Judgment Calls
In the early 1970s, Russell Bliss, the owner of a small waste oil business in Frontenac, Missouri, was contracted by Hoffman-Taff, a company that produced Agent Orange in the Vietnam War, to dispose of synthetic herbicide byproducts that contained high concentrations of TCDD, a toxic chemical known as dioxin. Bliss also happened to run a business spraying waste oils to control dust on dirt roads and horse-riding arenas. In a Kafkaesque sequence of events, Bliss mixed the toxic waste with the petroleum-based waste oils he used for his dust-suppressing business and began spraying the contaminated oil on dirt roads, stables, and arenas throughout the region, including the entire road network of the sleepy town of Times Beach, Missouri.
When news of the disaster broke in 1982, regulatory discussions on the disposal and containment of toxic waste ramped up, and people looked to science for guidance. A 1978 study on cancer and dioxin by Richard Kociba at Dow Chemical became central to determining how dangerous these chemicals really are. In the study, rats were dosed with the chemical for two years, then liver slides were analyzed to measure tumor growth. In the original 1978 study, 20 out of the 50 rats that were exposed to a certain dose of the chemical grew liver tumors. In 1980, the EPA re-analyzed the same liver slides. This time, 29 rats were found to have tumors at that dosage. Then again in 1990, the paper industry commissioned yet another analysis which reported only nine rats with tumors. Three different results taken from precisely the same slides. (Moreover, the 1990 analysis was performed by a team of seven pathologists who had to resort to majority-rule decision making when the group couldn’t agree on what to count as a tumor.)
Kociba’s slides illustrate how variation, uncertainty, and judgment calls can alter the results of even a single set of observations within an experiment. The same thing can happen with statistical analyses.
In 2015, Brian Nosek and the Center for Open Science conducted a comparative study built around a single question: do soccer referees give more red cards to dark-skinned players than to light-skinned ones? Nosek and his colleagues gave the same dataset to 29 teams of analysts and asked each to answer that question (all of the analysts knew that their results would be compared with the others’). Just like Kociba’s liver slides, precisely the same data yielded different results. A few teams reported no difference between light-skinned and dark-skinned players, a couple reported nearly a three-fold increase in red cards for dark-skinned players, and the rest found increases of around 20–40% for dark-skinned players.
The scientific questions explored in these studies are of an entirely different nature than the ones asked by Aristotle about gravity, Boyle about the vacuum, or Bissell about FACS profiles in breast cells. When it comes to dioxin and liver cancer or skin color and red cards, the variables of interest are not easily measured and the outcomes do not behave predictably. Injecting a rat with dioxin does not guarantee that it will develop tumors. Conversely, some rats not injected with dioxin will still develop tumors. Having dark skin does not guarantee that you get red carded, and neither does having light skin always give you a free pass.
In the case of Times Beach, dioxin was ultimately determined to have a relationship with cancer, and in the case of soccer refereeing, skin color does influence the likelihood of getting a red card. For both, the details of how and to what extent are still fuzzy, but incongruous evidence is not necessarily meaningless evidence. Inconsistencies between studies like these do not necessarily indicate that something went wrong or that anything needs to be fixed. Rather, they result from noise in the system being studied or in the measurements being taken. When trying to replicate studies that contain lots of noise, “irreproducibility” may be a misnomer. If you aggregate enough repetitions of the same study (a process known as meta-analysis), the results will eventually converge on the truth of the matter.
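The convergence idea can be sketched with a small simulation (a toy model for illustration; the effect size, noise level, and study counts here are invented, not drawn from any of the studies above). Each simulated “study” measures a true effect through heavy noise, so individual studies scatter widely, yet a simple unweighted average across many studies homes in on the truth:

```python
import random

random.seed(42)

TRUE_EFFECT = 0.2   # hypothetical true effect size (invented for illustration)
NOISE_SD = 1.0      # per-observation measurement noise

def run_study(n=30):
    """One noisy study: the mean of n noisy observations of the true effect."""
    return sum(random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(n)) / n

# Individual studies scatter widely around the truth...
studies = [run_study() for _ in range(200)]

# ...but pooling them (a crude, unweighted stand-in for meta-analysis)
# converges on the true effect.
pooled = sum(studies) / len(studies)
print(f"single-study estimates range: {min(studies):.2f} to {max(studies):.2f}")
print(f"pooled estimate: {pooled:.2f}  (true effect: {TRUE_EFFECT})")
```

Any one study here can look like a failed replication of another, even though nothing went wrong; only in aggregate does the signal separate from the noise.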
When It’s Not Science
In some instances, irreproducibility reflects genuine uncertainty. In others, it can indicate sloppiness, fraud, and misconduct.
Richard Horton, editor-in-chief of The Lancet, a premier medical journal, wrote in 2015 that “Much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.” While it’s important to note that Horton’s statement refers only to the medical literature, it does call into question the value of peer review as a barometer for scientific truth.
Peer-reviewed journals have become the cultural gatekeepers for scientific credibility. But this title is on shaky ground as retractions from scientific journals increase. They jumped tenfold between 2001 and 2009, and a 2012 analysis concluded that two-thirds of retractions on PubMed, a database of biomedical articles, were due to misconduct.
These problems appear to be particularly acute in the medical sciences. For example, over 1,000 studies have been invalidated because a breast cancer cell line turned out to be a skin cancer cell line. In another case, a software flaw in specialized statistical packages potentially invalidated tens of thousands of fMRI studies. When the biotech company Amgen confirmed the results of only six out of 53 “landmark” cancer biology studies, lead author Glenn Begley noted that the “non-reproducible papers shared a number of features, including inappropriate use of key reagents, lack of positive and negative controls, inappropriate use of statistics and failure to repeat experiments. If repeated, data were often heavily selected to present results that the investigators ‘liked.’ ” In these cases, the problem may have less to do with reproducibility than with getting the experiments right in the first place.
One fixable flaw in peer review is inadequate reporting of the methods and analyses performed. Scientists who fail to disclose exactly how they went about data analysis can present their results as statistically significant—and thus publishable—even if they are not. For example, one team of social psychologists performed a real experiment to test a purposefully outrageous hypothesis: that listening to a children’s song makes people younger—not younger-feeling, but literally younger in age. The research techniques they exposed, colloquially called “p-hacking,” allowed them to “prove” that people actually did get younger after listening to a children’s song. Taking the idea one step further, the researchers applied these techniques to computer-generated random data, showing that they could obtain a significant result a staggering 61% of the time using the standard statistical threshold (p < 0.05, for the statistically inclined).
Intentional p-hacking of this kind constitutes fraud, but the line between misconduct and unintentional bias grows blurry when scientists must make judgment calls throughout their research that affect the result. Of the 29 teams that analyzed the soccer data, 20 found results that were “statistically significant,” and nine did not. Now imagine that a single researcher had performed all 29 analyses and had to choose just one to report. If you were trying to get published, which would you choose? Such choices “might be the single largest contributor to the phenomenon of nonreproducibility, or falsity, of published claims,” write John Ioannidis and his colleagues.
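Why does analytic flexibility inflate significant findings so badly? A minimal simulation makes the arithmetic concrete (this is a simplified model, not the procedure from any study above; the number of alternative analyses is invented). Under a true null hypothesis, every legitimate analysis yields a p-value uniformly distributed between 0 and 1, so a researcher who runs several analyses and reports only the best one will clear the 0.05 bar far more than 5% of the time:

```python
import random

random.seed(0)

ALPHA = 0.05
N_ANALYSES = 5        # hypothetical "researcher degrees of freedom"
N_SIMULATIONS = 10_000

def best_of_several_analyses():
    """Simulate one p-hacked study on pure noise: draw one uniform p-value
    per candidate analysis, then report only the smallest."""
    p_values = [random.random() for _ in range(N_ANALYSES)]
    return min(p_values) < ALPHA

hits = sum(best_of_several_analyses() for _ in range(N_SIMULATIONS))
false_positive_rate = hits / N_SIMULATIONS
print(f"nominal threshold: {ALPHA}")
print(f"observed false-positive rate with {N_ANALYSES} analyses: "
      f"{false_positive_rate:.2f}")
```

With five independent analyses the expected rate is 1 − 0.95⁵, roughly 0.23 rather than the advertised 0.05, and it climbs quickly as more analytic choices are tried. Real analyses are correlated rather than independent, so this sketch overstates the inflation somewhat, but the direction of the effect is the same.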
The reproducibility crisis—and a potential solution—is neatly encapsulated in a pop culture phenomenon from the 1970s. In 1976, the Viking 1 spacecraft snapped a photo of a mesa in the Cydonia region on Mars that looked like a human face. A few days later, when NASA unveiled the image, the “face” became an immediate media sensation. A book was even published claiming that a civilization of humanoids had lived on Mars and constructed pyramids.
NASA scientists, of course, dismissed it as an optical illusion, and subsequent pictures taken from different angles and at higher resolution revealed that the “face” does not look like a face at all.
Today, some scientists are seeing faces on Mars in their own data. With tight budgets and a competitive job market, they produce low-resolution evidence that builds their resume, but little else. In instances where someone has gone back to look at the same phenomenon again—perhaps with improved techniques, or different angles, as the later images of Cydonia did—the original results, for the most part, have not held up. But regardless of the outcome, replication efforts give us greater confidence in what science has uncovered—whether it really is a face or just another rock.