That’s great for a discrete, artificial, coded fantasy world, but what about the real world? Here the issue of messiness re-enters and dominates. If you look at the most successful case studies in Viktor Mayer-Schönberger and Kenneth Cukier’s sensible 2013 book Big Data—from Amazon’s recommendation engine to New York’s search for illegally converted buildings to predicting exploding manholes—they are all cases of selection optimization. That is, the data is used to help select and prioritize the most relevant and crucial data points, whether those points are books you are likely to buy or manholes that are likely to explode. Big data is suited to optimization problems because such problems are generally error-tolerant: If the analysis points to some safe manholes or some books you don’t want to buy, that’s fine. Think of it as analogous to Google’s search results and ads: It’s fine if some irrelevant results or ads show up, as long as there are enough good results and ads to keep people clicking.
Similarly, such analysis can identify anomalous correlations that would otherwise go unnoticed. Statistician Andrew Gelman’s analysis of New York’s stop-and-frisk policy concluded, “The differences in stop rates among ethnic groups are real, they are substantial, and they are not explained by previous arrest rates or precincts.” Having found a meaningful correlation where, in principle, there should not have been one, Gelman could then show a meaningful disparity in stop rates based on race. Such analyses would only improve with finer-grained and more comprehensive data; as long as the interpretation is well-grounded, individual errors and incompleteness should not corrupt the results.
For contrast, consider problems where error tolerance is extremely low. In speech recognition, language translation, medical diagnosis, and many other fields, analysis and results must be complete, exhaustive, and almost perfectly fine-tuned. If you translate a sentence from Japanese to English, your margin for error is pretty much zero: Any mistake could create a total misunderstanding. This isn’t to underestimate how often Google Translate does produce sensible results, due to the corpus of megasourced translation data available to it. But it also shows why human translators won’t be out of business anytime soon.
Likewise, while I’m happy to have my doctor use big data–style analyses, such as those offered by gene analyzer 23andMe, to find potential trouble spots in my health, I want that data only to supplement my doctor’s skills, not replace them. Only in cases such as spell-checking, where the megasourced data is remarkably coherent and precise, can you reach a degree of certainty at which you would feel comfortable turning over much responsibility to a computer. Even crowdsourced spam-filtering, while impressively reliable, produces nontrivial numbers of false negatives and positives.
This is also why the government’s “vacuum cleaner” approach to collecting data rings somewhat hollow. We now know that the FBI had been warned multiple times about Boston Marathon bombing suspect Tamerlan Tsarnaev back in 2011, but the FBI never put him under surveillance, possibly due to lack of coordination and a spelling mistake. The next time a terrorist attack happens on their watch, I guarantee you that there will have been signals in their data that their analyses missed. Given a huge haystack, big data will find some needles pretty quickly, but it will never guarantee you that it’s found them all.
Big data, then, is good for when you want incremental optimization rather than a killer paradigm shift. The sorts of “discoveries” you see Facebook, OkCupid, and even Google trumpeting from big data should be greeted with caution. The real gains come in degrees of quantity rather than quality: saving time, identifying potential trouble spots, and finding the biggest bang for your buck. These gains can save huge amounts of money, time, and even lives. But they do lack some of the flashiness of, for example, Google Flu Trends, where we thought one kind of data could be conjured out of another through magically emergent correlations, only to find that the correlations were a lot less solid than they seemed. Ironically, the great increase in data only makes the failings of its imprecision more noticeable and problematic. Though that is disappointing, it’s also reassuring. Messiness is how you know that your data really does reflect real life.