Science

Science Is Broken. How Much Should We Fix It?

More rigor in research could stamp out false positive results. It might also do more harm than good.

Richard Harris, author of “Rigor Mortis”
In his new book, Richard Harris makes the case that many of science’s problems could be fixed if scientists only learned to exercise a bit more rigor.

Photo by Meredith Rizzo

Richard Harris has an elephant’s memory for busted scientific claims. In his new book on the replication crisis, Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, the veteran NPR reporter serves as a tour guide to a crushing series of dead ends. It’s not that science isn’t self-correcting, Harris shows. It’s that it isn’t self-correcting fast enough.

Take the rapid method for detecting ovarian cancer that was published with much fanfare in 2002. Its “new diagnostic paradigm” was soon revealed to be a bunch of noise, an artifact of the machine used for testing blood samples. Even so, the paper has been cited almost 4,000 times. Or consider the line of breast-cancer cells from the 1970s, touted by the National Cancer Institute as a vital research tool, that was later shown to be mislabeled melanoma cells. Scientists still use them in studies of breast cancer.

Enough such anecdotes have proliferated in the past few years that it now seems as though any major research finding, especially in psychology or biomedicine, should be viewed with suspicion. But if these stories of the replication crisis invite a sense of panic, Rigor Mortis offers hope. “Simply too much of what’s published is wrong. It doesn’t have to be this way,” Harris writes.

In between his horror tales of research cul-de-sacs, Harris makes the case that many of the problems he describes would be fixed if our scientists only learned to exercise a bit more rigor. Researchers who work with cell lines, for example, could take the small, inexpensive step of making sure that they’ve been labeled properly. In studies using rats or mice, larger samples could be used, with more animals in each test condition. And scientists could preregister their experiments, to ensure they don’t fudge results (consciously or not) after collecting all the data. “Scientists have been taking shortcuts around the methods they are supposed to use to avoid fooling themselves,” he says.

Yet Harris’ overarching, optimistic theme raises several sticky questions. First, what does rigor mean, exactly, and can its definition change? Second, how much rigor is enough, and could we ever end up with too much?

Many of the fixes he proposes would be straightforward and uncontroversial. (Who could argue with a call for scientists to authenticate their cell lines?) But others are murkier. Harris notes, for example, that many scientists misapply statistics when they analyze their data. The standard measure of statistical significance—called a p-value—isn’t always used correctly; researchers routinely assume that if their p is less than .05, their hypothesis is likely to be true, when the number really says only how surprising their data would be if no effect existed at all. Yet if p-values are a problem now, they were once themselves a vehicle for adding badly needed rigor to the scientific enterprise. A landmark paper on the theory of hypothesis testing, published in the early days of p-values, aimed to find a rule of thumb for scientists to “govern our behavior” and ensure that “in the long run of experience, we shall not be too often wrong.” The fact that p-values have since spawned a hapless cult in academia reflects both an updating of the science of statistics and the ever-present risk of unintended consequences.
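To see how the trap works, here is a back-of-the-envelope simulation. It is my own sketch, not anything from Harris’ reporting, and its numbers (a one-in-ten base rate of true effects, ten subjects per group, a modest effect size) are illustrative assumptions, not estimates for any real field.

```python
# A toy simulation of the p-value trap described above (my illustrative
# assumptions, not data from "Rigor Mortis"): when real effects are rare and
# samples are small, many results that clear p < .05 are still false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 20_000
n_per_group = 10       # small samples, as in much preclinical work (assumed)
true_effect = 0.5      # assumed effect size, in standard deviations, when an effect exists
share_real = 0.10      # assume only 1 in 10 tested hypotheses is actually true

true_hits = false_hits = 0
for _ in range(n_experiments):
    effect_exists = rng.random() < share_real
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect if effect_exists else 0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)   # standard two-sample t-test
    if p < 0.05:
        if effect_exists:
            true_hits += 1
        else:
            false_hits += 1

significant = true_hits + false_hits
print(f"Results with p < .05: {significant}")
print(f"Share of those that are false positives: {false_hits / significant:.0%}")
```

Run with those made-up numbers, most of the results that clear the p < .05 bar come from experiments in which no effect existed at all.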

Given how things change, scientific rigor can be a hazy goal on the horizon, or even a mirage. In psychology, for instance, concerns about the field’s credibility have led to calls for longer articles that describe results from multiple experiments. “[S]ingle-study papers are simply more likely than multi-study papers to identify false positives,” two social psychologists argued in 2012. (Both have served as editors at prominent journals in their field.) That may be an intuitive position, but it turns out to be wrong. If each experiment in such a paper is underpowered, a report in which every one of them succeeds is itself statistically improbable, and more likely a sign that failed attempts were left in a file drawer than that the effect is real. Now we know that by demanding multistudy papers, journal editors may have granted ersatz rigor to the field’s most questionable results.

It’s also hard to say which laboratory practices should be the targets of reform. Harris tells a fascinating tale of replication failure in a pair of well-established cancer research labs. The two groups, one in Berkeley, California, and the other in Boston, were collaborating on a study of human breast cells, but for some reason they couldn’t get their findings to agree. For two years, the researchers tried to figure out what was going wrong. They’d been following the same procedures, and using identical methods and reagents, but their data wouldn’t match even when they took their tissue from the same specimen. It was only after the scientists had met in person and conducted the experiment side by side that they figured out the problem: The lab in Boston had been stirring the tissue samples in a flask with a spinning bar, while the lab in Berkeley put them in a test tube on a rocking and rotating platform.

As Harris tells it, the researchers demonstrated rigor by working to resolve the conflict in their findings. But their story also shows that rigor is a squirmy thing in practice. It may be that a laboratory’s mode of stirring should be standardized and specified across experiments, just like its sample sizes and the identities of its cell lines. But what about all the other tiny details of an experiment that might be subject to the same exactitude—the time of day, the indoor temperature, the size of the pipettes, and so on? Is there any reasonable limit to the list of things that might become a source of error? And if there isn’t, then how should we decide which aspects of our research methods should be prioritized for rigor and which should not?

These questions really matter. If careless work wastes a scientist’s time, so does being too precise. You can’t smear rigor into every crack and crevice of the lab without gumming things up a bit. Rigor may keep false results from entering the literature, but it can stifle valid findings, too. In effect, the decisions we make regarding rigor help set the balance of the scientific process—its ratio of faulty hits to facts we might have learned about the world but didn’t. If we saddled research with the strictest possible requirements, we’d eliminate bogus papers from the literature. Then again, a journal that wanted to make absolutely sure that it never published false positives would be zero pages long.
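The flip side of that earlier sketch shows what the strictest requirements would cost. Tightening the significance threshold in the same toy setup does purge false positives, but it also discards real findings; once again, every number here is an illustrative assumption of mine rather than an estimate for any actual field.

```python
# The same toy setup (illustrative assumptions, not field estimates), now varying
# the significance threshold: stricter cutoffs remove false positives from the
# record, but they also discard a growing share of the real effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20_000
n_per_group = 10
true_effect = 0.5
share_real = 0.10

p_values = np.empty(n_experiments)
is_real = np.zeros(n_experiments, dtype=bool)
for i in range(n_experiments):
    is_real[i] = rng.random() < share_real
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect if is_real[i] else 0.0, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(treated, control).pvalue

for alpha in (0.05, 0.01, 0.001):
    hits = p_values < alpha
    false_share = (hits & ~is_real).sum() / hits.sum()
    detected = (hits & is_real).sum() / is_real.sum()
    print(f"p < {alpha}: {false_share:.0%} of hits are spurious, "
          f"but only {detected:.0%} of real effects are found")
```

With these made-up inputs, tightening the cutoff from .05 to .001 shrinks the share of spurious hits but leaves most genuine effects undiscovered, which is the journal of zero pages in miniature.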

So how much rigor should we ask from scientists (or how much sloppiness should we accept) to maximize returns from research funding? Since rigor and reproducibility go hand in hand, here’s another way to ask that question: In a perfect world, how easy should it be to replicate experiments successfully?

In 2015, Brian Nosek and the other members of the Reproducibility Project: Psychology announced that, out of 100 experiments from top journals they’d tried to reproduce, just 39 appeared to work. That sounds pretty bad, but other estimates of the replication rate have been even worse. Harris says that when Malcolm Macleod, a neurologist at the University of Edinburgh, reviewed the literature on stroke research, he concluded that just 15 percent of the published studies were plausibly correct. And when the former head of cancer research at Amgen, Glenn Begley, asked his staff to replicate 53 important studies, he reported that they found success with only six of them—or 11 percent.

Thirty-nine percent, 15 percent, 11 percent. Are any of these numbers even close to what we’d want from laboratory science?

Harris quotes Columbia University neuroscientist Stuart Firestein, who suggests that even the lowest of those numbers might be OK. (Disclosure: I briefly worked in Firestein’s lab while in graduate school.) “This has been characterized as a ‘dismal’ success rate,” Firestein wrote of the Amgen study in his 2015 book Failure: Why Science Is So Successful. “Is it dismal? Are we sure that the success of 11% of landmark, highly innovative studies isn’t a bonanza?” Firestein goes on to rail against the “cottage industry of criticism that is highly charged and lewdly suggestive of science gone wrong, and of a scientific establishment that has developed some kind of rot in its core.” We should not expect to get everything right all of the time, he says, since the scientific method necessarily includes a lot of failure—and “possibly at a very high rate.”

Harris responds that many of those failures are “easily preventable,” so we’d be wise to raise the bar. That’s true enough, but let’s say we followed his advice and put in place the easiest, most unobtrusive fixes to biomedical research. (Leave aside his call to mend the structure and culture of academic research, since however useful that might be, it’s neither easy nor straightforward.) Let’s say that all our sample sizes were big enough and our cell lines were labeled properly. Then what should we expect the replication rate to be? Fifty percent? Sixty percent? Ninety percent? We really have no consensus answer.

Even if we had a target in mind, there’s one more wrinkle to consider: Rigor may not always serve the public good. In biomedicine, everyone is looking for positive results—meaningful, affirmative experiments that could one day help support a novel treatment for disease. (That’s true both for scientists who study biomedicine at universities and those employed by giant pharmaceutical companies.) In that context, rigor serves to check scientists’ ambition and enthusiasm: It reins in their wild oversteps and helps to keep experiments on track.

But not every field of research enjoys the same harmony of goals. In the sciences most relevant to policy and regulation—such as climatology, toxicology, and nutrition—academics’ focus on making new discoveries is counterbalanced by another group of researchers, funded by commercial interests, who want to do the opposite. In these fields, significant results are often used to justify government regulation of, say, what we put in packaged food or how we extract natural gas. Scientists for industry, then, are paid to undermine them. As a rule, they look for nothings in the data, sift for signs of noneffects, and valorize unsuccessful replications.

When science is political, the balance between flexibility and rigor can be very delicate. Those who call for better, more reproducible research may be responding to their own perverse incentives. In fact, the language of the replication crisis has now been adapted to strategic ends in Washington. On March 29, the House of Representatives passed a bill demanding that any science used by the Environmental Protection Agency for making rules be “subject to basic standards of transparency and reproducibility.” That’s scientific rigor, to be sure, but in this instance it’s in the service of slowing regulation, not making research more efficient.

Rigor Mortis provides an excellent summation of the case for fixing science, but the nature of that fix—how it ought to be applied and to what degree—remains frustratingly uncertain. Perhaps we need a bit more scientific rigor to figure out the sort of scientific rigor that we really need. That’s not a joke: The field of meta-science appears to be alive and well, and its researchers are working on the answers to these questions. Let’s just hope they don’t screw up.