The Reproducibility Crisis Is Good for Science

Weak statistics are getting called out, and replication is gaining respect.

April 15, 2016, 7:08 AM

After reports of widespread problems in psychology and biomedicine, scientists have become increasingly anxious that many published studies do not stand up.
Huntstock/Thinkstock

This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. On Thursday, April 21, Future Tense will hold an event in Washington, D.C., on the reproducibility crisis in biomedicine. For more information and to RSVP, visit the New America website.

When I decided to use Stephen Jay Gould’s book to teach how science falls prey to human nature, I did not anticipate that it would break my high school students’ hearts. In The Mismeasure of Man, published first in 1981 and then expanded in 1996, Gould recounted how scientists unconsciously twisted studies to fit their prejudices. For instance, 19th-century anthropologists underestimated Africans’ cranial capacities; other biologists’ observations suggested that Africans’ prenatal development was intermediate between hairy apes and northern Europeans.

I thought debunking biased studies would be an empowering way to teach skeptical thinking. My students, mainly African American, had all encountered racism—cops showed up on our field trips, students and teachers from white-majority schools made offensive assumptions. These teens knew people often saw what they expected to see instead of what was really there. But they were still dismayed. The racism that they encountered in everyday life was not supposed to happen in science.

Now, as a reporter at Nature, I see similar disorientation within the scientific community. Science advances when new research builds on past scientific work. (Remember how Newton said that if he had seen further than others, it was only by standing on the shoulders of giants?) But after reports of widespread problems in psychology and biomedicine, scientists have become increasingly anxious that many published studies do not stand up.

Scientific careers are built on high-profile publications, and scientists are all too susceptible to picking out data that support a publishable conclusion. That means many scientific papers describe wishful analysis rather than accurate interpretation. Or the observations might be accurate, but not all the details to repeat them are shared. Either way, efforts to follow up pour time and money down the drain.

Various projects have sprung up to get a grip on how reliable the literature is. The biggest so far is the Reproducibility Project: Psychology, which recruited hundreds of scientists to design and perform 100 replication studies. When fewer than half repeated successfully, Brian Nosek, head of the Center for Open Science and the project’s main organizer, emphasized that these results could not say any particular paper was valid or invalid—some results were no doubt due to statistical flukes or methodological differences. What exactly does it mean when someone says work is not reproducible? When two studies reach different conclusions, the question “Which is right?” is often too simple. Nosek’s team used five distinct measures just to judge what would count as a replication.

One stinging critique argued that the project has nothing to say about the state of psychology, but many hailed the outcome as a confirmation that scientific literature was littered with false positives. (In Slate, Daniel Engber examined how the Reproducibility Project: Psychology called into question an influential theory about willpower.)

But to me, the most interesting question is not how much research is reproducible, but whether it is becoming more so. I think the answer is a resounding yes. The scientific community is recognizing destructive practices and learning to avoid them. Mechanisms to share and assess work before, during, and after publication are becoming part of the scientific culture. (If you are wondering whether my middle name is Pollyanna, recall that recognizing progress is not the same as denying a problem exists. Of course some solutions to improve rigor will not work, or will work only in certain fields and circumstances. The point is that scientists have moved from handwringing to acting.)

The Reproducibility Project, which was years in the making, assessed studies that had been published in 2008. Many psychologists I’ve spoken to think their field is much better now. An important wake-up call for the proper use of statistics came—for many psychologists—in 2011, when a trio of statisticians showed how “analytical flexibility” or “p-hacking” could show anything, including that listening to the Beatles makes undergraduates younger. Lots of researchers had previously believed that trying out many types of analyses was a form of rigor; they were homing in on conditions that revealed the truth of their hypotheses. In fact, researchers were gaming analyses to gain publications, albeit often unwittingly.
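To see why trying many analyses inflates false positives, here is a minimal simulation sketch. It is not the 2011 authors' method: it assumes a hypothetical study with no real effect, 20 subjects per group, a simple z-test with a known standard deviation, and a researcher who measures five independent outcomes and reports whichever one looks best. All names and parameters (`false_positive_rate`, `n_outcomes`, and so on) are illustrative.

```python
import math
import random


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sample z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # Normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))


def false_positive_rate(n_outcomes, n_per_group=20, n_studies=4000,
                        alpha=0.05, seed=0):
    """Share of no-effect studies that still report a 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: there is no effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        # The "flexible" researcher keeps only the most favorable analysis.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


honest = false_positive_rate(n_outcomes=1)
flexible = false_positive_rate(n_outcomes=5)
print(f"one planned analysis:  {honest:.3f}")
print(f"best of five analyses: {flexible:.3f}")
```

Under these assumptions, a single planned test should flag a nonexistent effect about 5 percent of the time, while picking the best of five independent tests should do so roughly 1 - 0.95^5, or about 23 percent of the time. Real analytical flexibility (dropping outliers, adding subjects, swapping covariates) involves correlated choices, so the exact inflation will differ.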

Scientific funders have joined efforts to make studies more reliable. This January, the National Institutes of Health required that grant applicants explicitly defend the validity of their experimental design and materials. CHDI, a research foundation, has created a new position devoted to helping scientists plan robust experiments. In 2014, dozens of journals endorsed principles aimed at making research more rigorous and transparent. Nature and other journals now ask authors to complete checklists describing experimental design. Several scientific societies have issued guidelines to ensure reproducibility and robustness. In March, the American Statistical Association issued specific warnings that the misuse of statistics was contributing to irreproducible findings. It said scientists should show all analyses performed, not just ones that yielded exciting results. Scientific journals and reviewers, for their part, need to stop relying on distinct statistical cutoffs as a measure of certainty.

Do scientists care? I believe so. For what it’s worth, articles about confirmation bias and the misuse of p-values are consistently among Nature’s most-read stories.

Opportunities to get credit for careful work that does not yield a flashy, new result are also expanding. In the past year, journals as diverse as the American Journal of Gastroenterology and Scientific Data actively solicited replication studies or negative results. Information Systems, a data science journal, has introduced a new type of article in which independent experts are explicitly invited to verify work from a previous publication. Last November, the U.K.’s Royal Society introduced a system known as registered reports: The decision to publish is made before results are obtained based on a pre-specified plan to address an experimental question. The F1000 Preclinical and Reproducibility Channel, launched in February, aims to give drug companies an easy way to show which scientific papers promising new paths to drugs might not deliver.

Taken together, these venues could stop researchers from blithely following up on work that cannot be reproduced or charging down paths others have charted as dead ends. Again, it’s too early to tell whether these venues will take off. Some experts, like Stanford’s John Ioannidis, famous for his calculations that most published research findings are false, worry that special labels for replications or negative results relegate good work to second-class status; journals devoted entirely to negative results popped up some time ago, but many scientists are not interested in publishing in them.

A common reason that one lab cannot reproduce another lab’s work is that a tiny variation in technique can make a big difference. For instance, experiments sorting out breast cancer cells suggested different conclusions depending on whether samples had been, as one expert put it, shaken or stirred. Researchers can now publish excruciatingly detailed descriptions in journals like Nature Protocols, upload videos to journals like the Journal of Visualized Experiments, and use unique identifiers to unambiguously describe experimental materials. Projects devoted to appropriate use of common tools like antibodies and chemical probes have been set up to warn researchers against chasing artifacts that masquerade as exciting results.

Meanwhile, technologies and traditions are maturing to let researchers check each other’s results. The Pipeline Project, for example, has established a buddy system where work is checked before submission for publication. Another proposal suggests that psychology graduate students replicate work in the published literature to qualify for their degrees. Not sharing data (and not having a good reason for not doing so) is fast becoming a faux pas, but the scientific community is just beginning to set out formal expectations.

Of course, problems loom larger when you delve into the details. Incentives for careful work are often misaligned with professional success, and scientific studies today are far more complicated than they were a century ago. Research institutions could do much more to promote reproducibility. Making science more reliable is a never-ending struggle. People will always find patterns where none exist, and the scientific enterprise still over-rewards flashy results and undervalues careful but prosaic work.

Ultimately my students learned that science is like any other human enterprise—subject to inflated claims and self-deception. But as long as scientists stay worried about reproducibility, they will work to make science better.