Cancer Biology Reproducibility Project Sees Mixed Results

How trustworthy are the findings from scientific studies?

A growing chorus of researchers says there’s a “reproducibility crisis” in science, with too many discoveries published that may be flukes or exaggerations. Now, an ambitious project to test the reproducibility of top studies in cancer research by independent laboratories has published its first five studies in the open-access journal eLife.

“These are the first public replication studies conducted in biomedical science, and that in itself is a huge achievement,” says Elizabeth Iorns, CEO of Science Exchange and one of the project’s leaders.

Cancer biology is just one of many fields being scrutinized for the reproducibility of its studies.

The Reproducibility Project: Cancer Biology is a collaboration between the non-profit Center for Open Science and the for-profit Science Exchange, which runs a network of laboratories for outsourcing biomedical research. It began in 2013 with the goal of repeating experiments from top-cited cancer papers; all of the work has been planned, executed, and published in the open, in consultation with the studies’ original authors. These papers are the first of many underway and slated to be published in the coming months.

The outcome so far has been mixed, the project leaders say. While some results are similar, none of the studies looks exactly like the original, says Tim Errington, the project’s manager. “They’re all different in some way. They’re all different in different ways.” In some studies, the experimental system didn’t behave the same. In others, the result was slightly different, or it did not hold up under the statistical scrutiny project leaders used to analyze results. All in all, project leaders report, one study failed to reproduce the original finding, two supported key aspects of the original papers, and two were inconclusive because of technical issues.

Errington says the goal is not to single out any individual study as replicable or not. “Our intent with this project is to perform these direct replications so that we can understand collectively how reproducible our research is,” he says.

Indeed, there are no agreed-upon criteria for judging whether a replication is successful. At the project’s end, he says, the team will analyze the replication studies collectively by several different standards—including simply asking scientists what they think. “We’re not going to force an agreement—we’re trying to create a discussion,” he says.

The project has been controversial; some cancer biologists say it’s designed to make them look bad at a time when federal research funding is under threat. Others have praised it for tackling a system that rewards shoddy research. If the first papers are any indication, those arguments won’t be easily settled. So far, the studies provide a window into the challenges of redoing complex laboratory studies. They also underscore that, if cancer biologists want to improve the reproducibility of their research, they will have to agree on a definition of success.

An Epidemic?

A recent survey in Nature of more than 1,500 researchers found that 70% have tried and failed to reproduce others’ experiments, and that half have failed to reproduce their own. But you wouldn’t know it by reading published studies. Academic scientists are under pressure to publish new findings, not replicate old research. There’s little funding earmarked for repeating studies, and journals favor publishing novel discoveries. Science relies on a gradual accumulation of studies that test hypotheses in new ways. If one lab makes a discovery using cell lines, for instance, the same lab or another lab might investigate the phenomenon in mice. In this way, one study extends and builds on what came before.

For many researchers, that approach—called conceptual replication, which gives supporting evidence for a previous study’s conclusion using another model—is enough. But a growing number of scientists have been advocating for repeating influential studies. Such direct replications, Errington says, “will allow us to understand how reliable each piece of evidence we have is.” Replications could improve the efficiency of future research by winnowing out false hypotheses early and help scientists recreate others’ work in order to build on it.

In the field of cancer research, some of the pressure to improve reproducibility has come from the pharmaceutical industry, where investing in a spurious hypothesis or therapy can threaten profits. In a 2012 commentary in Nature, cancer scientists Glenn Begley and Lee Ellis wrote that they had tried to reproduce 53 high-profile cancer studies while working at the pharmaceutical company Amgen, and succeeded with just six. A year earlier, scientists at Bayer HealthCare announced that they could replicate only 20–25% of 47 cancer studies. But confidentiality rules prevented both teams from sharing data from those attempts, making it difficult for the larger scientific community to assess their results.

‘No Easy Task’

Enter the Reproducibility Project: Cancer Biology. It was launched with a $1.3 million grant from the Laura and John Arnold Foundation to redo key experiments from 50 landmark cancer papers from 2010 to 2012. The work is carried out in the laboratory network of Science Exchange, a Palo Alto-based startup, and the results tracked and made available through a data-sharing platform developed by the Center for Open Science. Statisticians help design the experiments to yield rigorous results. The protocols of each experiment have been peer-reviewed and published separately as a registered report beforehand, which advocates say prevents scientists from manipulating the experiment or changing their hypothesis midstream.

The group has made painstaking efforts to redo experiments with the same methods and materials, reaching out to original laboratories for advice, data, and resources. The labs that originally wrote the studies have had to assemble information from years-old research. Studies have been delayed because of legal agreements for transferring materials from one lab to another. Faced with financial and time constraints, the team has scaled back its project; so far 29 studies have been registered, and Errington says the plan is to do as much as they can over the next year and issue a final paper.

“This is no easy task, and what they’ve done is just wonderful,” says Begley, who is now chief scientific officer at Akriveia Therapeutics and was originally on the advisory board for the project but resigned because of time constraints. His overall impression of the studies is that they largely flunked replication, even though some data from individual experiments matched. He says that for a study to be valuable, the major conclusion should be reproduced, not just one or two components of the study. This would demonstrate that the findings are a good foundation for future work. “It’s adding evidence that there’s a challenge in the scientific community we have to address,” he says.

Begley has argued that early-stage cancer research in academic labs should follow methods that clinical trials use, like randomizing subjects, blinding investigators as to which subjects are getting a treatment, using large numbers of test subjects, and testing positive and negative controls. He says that when he read the original papers under consideration for replication, he assumed they would fail because they didn’t follow these methods, even though they are top papers in the field. “This is a systemic problem; it’s not one or two labs that are behaving badly,” he says.

Details Matter

For the researchers whose work is being scrutinized, the details of each study matter. Although the project leaders insist they are not designing the project to judge individual findings—that would require devoting more resources to each study—cancer researchers have expressed concern that the project might unfairly cast doubt on their discoveries. The responses of some of those scientists so far raise issues about how replication studies should be carried out and analyzed.

One study, for instance, replicated a 2010 paper led by Erkki Ruoslahti, a cancer researcher at Sanford Burnham Prebys Medical Discovery Institute in San Diego, which identified a peptide that could stick to and penetrate tumors. Ruoslahti points to a list of subsequent studies by his lab and others that support the finding and suggest that the peptide could help deliver cancer drugs to tumors. But the replication study found that the peptide did not make tumors more permeable to drugs in mice. Ruoslahti says there could be a technical reason for the problem, but the replication team didn’t try to troubleshoot it. He’s now working to finish preclinical studies and secure funding to move the treatment into human trials through a company called Drugcendr. He worries that replication studies that fail without fully exploring why could derail efforts to develop treatments. “This has real implications to what will happen to patients,” he says.

Atul Butte, a computational biologist at the University of California, San Francisco, who led one of the original studies that was reproduced, praises the diligence of the team. “I think what they did is unbelievably disciplined,” he says. But like some other scientists, he’s puzzled by the way the team analyzed results, which can make a finding that subjectively seems correct appear as if it failed. His original study used a data-crunching model to sort through open-access genetic information and identify potential new uses for existing drugs. The model predicted that the antiulcer medication cimetidine would have an effect against lung cancer, and his team validated the model by testing the drug against lung cancer tumors in mice. The replication found very similar effects. “It’s unbelievable how well it reproduces our study,” Butte says. But the replication team analyzed the results with a statistical technique that found them not statistically significant. Butte says it’s odd that the project went to such trouble to reproduce experiments exactly, only to alter the way the results are interpreted.

Errington and Iorns acknowledge that such a statistical analysis is not common in biological research, but they say it’s part of the group’s effort to be rigorous. “The way we analyzed the result is correct statistically, and that may be different from what the standards are in the field, but they’re what people should aspire to,” Iorns says.

In some cases, results were complicated by inconsistent experimental systems. One study tested a type of experimental drug called a BET inhibitor against multiple myeloma in mice. The replication found that the drug improved the survival of diseased mice compared to controls, consistent with the original study. But the disease developed differently in the replication study, and statistical analysis of the tumor growth did not yield a significant finding. Constantine Mitsiades, the study’s lead author and a cancer researcher at the Dana-Farber Cancer Institute, says that despite the statistical analysis, the replication study’s data “are highly supportive of and consistent with our original study and with subsequent studies that also confirmed it.”

A Fundamental Debate

These papers will undoubtedly provoke debate about what the standards of replication should be. Mitsiades and other scientists say that complex biological systems like tumors are inherently variable, so it’s not surprising if replication studies don’t exactly match their originals. Inflexible study protocols and rigid statistics may not be appropriate for evaluating such systems—or needed.

Some scientists doubt the need to perform copycat studies at all. “I think science is self-correcting,” Ruoslahti says. “Yes, there’s some loss of time and money, but that’s just part of the process.” He says that, on the positive side, this project might encourage scientists to be more careful, but he also worries that it might discourage them from publishing new discoveries.

Though the researchers who led these studies are, not surprisingly, focused on the correctness of the findings, Errington says that the variability of experimental models and protocols is important to document. Advocates for replication say that current published research reflects an edited version of what happened in the lab. That’s why the Reproducibility Project has made a point of publishing all of its raw data and including experiments that seemed to go awry, whereas most researchers would troubleshoot them and try again.

“The reason to repeat experiments is to get a handle on the intrinsic variability that happens from experiment to experiment,” Begley says. With a better understanding of biology’s true messiness, replication advocates say, scientists might have a clearer sense of whether or not to put credence in a single study. And if more scientists published the full data from every experiment, those original results may look less flashy to begin with, leading fewer labs to chase over-hyped hypotheses and therapies that never pan out. An ultimate goal of the project is to identify factors that make it easier to produce replicable research, like publishing detailed protocols and validating that materials used in a study, such as antibodies, are working properly.


Beyond this project, the scientific community is already taking steps to address reproducibility. Many scientific journals are setting stricter requirements for studies and publishing registered reports of studies before they’re carried out. The National Institutes of Health has launched training and funding initiatives to promote robust and reproducible research. F1000Research, an open-access online publisher, launched a “Preclinical Reproducibility and Robustness Channel” in 2016 for researchers to publish results from replication studies. Last week several scientists published a “reproducibility manifesto” in the journal Nature Human Behaviour that lays out a broad series of steps to improve the reliability of research findings, from the way studies are planned to the way scientists are trained and promoted.

Sasha Kamb, former senior vice president at Amgen and one of the founders of the F1000 channel, says that ultimately science needs to shift its culture and incentives. “There’s seemingly too much to be gained by publishing first and fast with sexy stuff, and not enough value on the care and robustness,” he says. To change the way science is done, scientists need to be rewarded for repeating the work of others—and for doing work that others can repeat.