In 2002, an article in the Washington Monthly explored a new trend called "open-source biology." It asked, "Can a band of biologists who share data freely out-innovate corporate researchers?" The basic idea: Instead of squirreling away their research so no one else could use it, scientists would pool their findings.
More than a decade later, open-source doesn't need to be in quotation marks, and the potential benefits of making scientific data freely available seem obvious. Plus, your tax dollars pay for a lot of it! But this week, researchers at the Defense Advanced Research Projects Agency's "Biology Is Technology" conference have a reality check to share: Open-source scientific data is grossly underutilized and kind of a mess.
Making scientific data open-source is a logical way to encourage interdisciplinary collaboration among researchers and democratize fields that are often stratified. It seems particularly exciting and promising when paired with big data—as computers have become powerful enough to process enormous data sets, the opportunity to make connections and draw conclusions seems irresistible.
And large data repositories have been the foundation of major biomedical discoveries and achievements. Joel Dudley, a biomedical informatics researcher at Mount Sinai, talked at the conference about a counterintuitive molecular similarity between skin disease and Alzheimer's that was discovered only because of large-scale data mapping. He also showed how broad access to patient medical histories and genotypes can reveal things like subpopulations of Type 2 diabetes patients, each predisposed to a different set of conditions alongside diabetes.
The more data sets that are openly available, the more work like this can occur. But even something as potentially powerful as the open-source movement can be dead in the water if no one wants to engage with it. "Making data available to others is not sufficient to get people to work on it," said Stephen Friend, the president and co-founder of the nonprofit open-research organization Sage Bionetworks. Friend says that a big part of the problem is a lack of incentives. Sure, building models to analyze and compare different data sets could produce meaningful results, but that takes time and other resources, and most of the work happens behind the scenes in obscurity. And scientists, well, they want a little glory.
One solution, which Sage is championing, is to create a sort of GitHub for biological data, called Synapse. GitHub is a Web-based code repository that offers project management and tracking tools for developers. Every time someone records a change to a project's code, it's called a "commit," and when they push the change to GitHub's servers, other people can see it in the project's history. The result is a log of which user was responsible for each change, however small, so everyone can see who is accountable for each decision. The flip side is that when someone does something really smart, whether it's fixing a bug or adding new functionality to a program, everyone knows. Even users who aren't responsible for the whole project can publicly get credit for the good things they do.
Sage wants Synapse to work the same way. "The heart of it is an element of provenance," Friend said. The system tracks all different types of data organization and manipulation, and works to facilitate collaboration between disparate, even competing researchers by carefully recording who does what.
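To make the idea concrete, here is a minimal sketch of what a who-did-what log could look like. It is an illustration only: the class names, fields, and sample entries are hypothetical and are not taken from Synapse or GitHub.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProvenanceEntry:
    """One recorded action: who did what to which data set (hypothetical fields)."""
    user: str
    action: str
    target: str


@dataclass
class ProvenanceLog:
    """Append-only list of entries, oldest first."""
    entries: list = field(default_factory=list)

    def record(self, user, action, target):
        # Every change, however small, gets its own entry.
        entry = ProvenanceEntry(user, action, target)
        self.entries.append(entry)
        return entry

    def history(self, target):
        # Everything that happened to one data set, in order.
        return [e for e in self.entries if e.target == target]

    def credit(self):
        # Count how many recorded actions each user contributed.
        counts = defaultdict(int)
        for e in self.entries:
            counts[e.user] += 1
        return dict(counts)


# Made-up example entries, purely for illustration.
log = ProvenanceLog()
log.record("alice", "uploaded raw measurements", "dataset-A")
log.record("bob", "normalized values", "dataset-A")
log.record("alice", "fixed mislabeled samples", "dataset-A")
```

Calling `log.history("dataset-A")` returns the three entries in order, and `log.credit()` tallies contributions per user, which is the accountability-plus-credit trade the commit model offers.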
Another problem with open-source data is that it's often an unrecognizable hodgepodge of raw numbers from different experiments. "The hard thing is not actually to dump your data into the public domain," Peter Sorger, a systems biologist at Harvard, said at the DARPA event. "It's to dump it in an intelligible way." Sorger estimates that making a project's data usable takes about 20 percent of a researcher's total work. But "the incentive to do that? Zero," he said. "We have not created a system of incentives where the liberation of data is seen as critical."
If goodwill and curiosity aren't motivating researchers to work with open-source data on their own, there is still something that probably will: human limitation. "We have tiny little brains. We can't understand the big stuff anymore," said Paul Cohen, a DARPA program manager in the Information Innovation Office. "Machines will read the literature, machines will build complicated models, because frankly we can't." When all you have to do is let your algorithms loose on a trove of publicly available data, there won't be any reason not to pull in everything that's out there.