Today, we live in a world of data. Twenty years ago, we didn’t. Just as computing power has increased exponentially over the last 50 years, doubling every two years or so, the amount of data we generate has been doubling at a similar rate. Ninety percent of all the data in human history was created in the last two years. And the advent of “big data” brings with it such scary and Orwellian doings as Facebook conducting mood experiments on its users.
OkCupid founder Christian Rudder jumped to Facebook’s defense on Monday, talking about how the online dating service had conducted similar experiments on its millions of users, including lying to them about how well-matched they were with potential dates. (People weren’t quite as outraged as they were with Facebook, possibly because, in the words of Gawker’s Jay Hathaway, “Online dating already feels like consenting to participate in a social experiment.”)
So however bad Facebook’s experiment was, it looks like there might be a lot more of it in our future. But unlike Facebook, which published its findings in a humorless academic paper, OkCupid treated its results with some serious skepticism, raising the question: What is big data actually good for? Does it even work?
Not all data is equal, of course. The complete works of Isaac Newton and William Shakespeare take up about as much space as a sound file of Pharrell’s “Happy.” But even if you restrict yourself to words and numbers, the great works of human civilization have now been drowned in measurements, statistics, and status updates. Songs and texts are on the order of megabytes. There are a bit more than a million megabytes in a terabyte, which is about what it would take to store the entire printed material of the Library of Congress. The total human store of information is a few billion Libraries of Congress, measured in zettabytes (a billion terabytes).
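The unit arithmetic above is easy to sanity-check. This quick sketch uses binary (power-of-two) units, which is why a terabyte holds “a bit more than a million” megabytes:

```python
# Back-of-envelope check of the data-unit conversions above,
# using binary units: 1 MiB = 2^20 bytes, 1 TiB = 2^40, 1 ZiB = 2^70.
MB = 2**20
TB = 2**40
ZB = 2**70

megabytes_per_terabyte = TB // MB    # 2^20 = 1,048,576 -- "a bit more than a million"
terabytes_per_zettabyte = ZB // TB   # 2^30 ~= a billion

print(megabytes_per_terabyte)   # 1048576
print(terabytes_per_zettabyte)  # 1073741824
```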
What do we do with all our big data? The answer is often “not much.” The National Security Agency stuffs all its surveillance data into its Utah data center despite not having the tools to analyze most of it. Data storage has become so cheap that it’s far easier to collect petabytes than to figure out what they’re useful for. Last year, market research firm Gartner put big data and many of its technologies near the top of the “peak of inflated expectations” of its hype cycle, to be followed soon by a “trough of disillusionment.”
Why the trough? Because big data has yet to yield big money. For all the hype about the quantified self, the Internet of things, and data science, big data has yet to yield a true killer app. Google Flu Trends is a fascinating idea, but extrapolating flu incidence from Google searches on flu keywords has not produced reliable results. The New York Times recently published a piece by Sendhil Mullainathan wondering whether search queries for “slow iPhone” might imply that Apple is intentionally slowing down older iPhones as new ones are released, but he concluded merely that big data doesn’t tell us enough to know for sure.
Big data really only has one unalloyed success on its track record, and it’s an old one: Google, specifically its Web search. (Disclosure: I used to work at Google, and my wife still does.) Way back in the last century, Google found that analyzing the entirety of the Web gave it a) really good results for keyword searches and b) high click-through ads for those keywords. What’s more, it didn’t require any particularly sophisticated analysis of the data. Simply examining word frequencies and the link structure of the Web was enough to obtain high-quality results. (This has changed as SEOs and click farms have tried to game the system, but the point stands.) As artificial intelligence kingpin Peter Norvig puts it, “Simple models and a lot of data trump more elaborate models based on less data.”
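The “link structure” analysis at the heart of early Google search is PageRank-style power iteration: a page is important if important pages link to it. A minimal sketch on a toy four-page graph (the graph, damping factor, and iteration count here are illustrative assumptions, not Google’s production values):

```python
# Minimal power-iteration PageRank on a toy link graph.
# Graph, damping factor, and iteration count are illustrative only.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
d = 0.85  # damping factor: probability of following a link vs. jumping randomly
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):
    # Every page starts each round with its share of "random jump" mass,
    # then receives a share of rank from each page that links to it.
    new = {p: (1 - d) / len(pages) for p in pages}
    for page, outlinks in links.items():
        for target in outlinks:
            new[target] += d * rank[page] / len(outlinks)
    rank = new

# C collects links from A, B, and D, so it ends up ranked highest.
print(max(rank, key=rank.get))  # C
```

Note that nothing here “understands” the pages; the ranking falls out of simple counting over a lot of data, which is exactly Norvig’s point.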
Many companies, including Google itself, have tried to repeat that success since then, but no one has really succeeded. Amazon is probably the No. 2 big data success because of its recommendation engine, but Amazon’s success was still not primarily dependent on big data–style analysis in the way that Google’s core business has been. Facebook has succeeded more through viral ubiquity than through big data innovation.
The recent Facebook data science experiment is telling. Moral outrage aside, the problem with Facebook’s attempt to make its users feel bad (or good) by curating their news feeds is that the manipulation was inept: the analysis relied on the inadequate Linguistic Inquiry and Word Count (LIWC) software. For example, “I don’t feel happy” and “I feel happy” both registered as “positive” updates, simply because both contain the word happy.
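That failure mode is easy to reproduce with any naive word-count classifier. This toy scorer (the word lists are made up for illustration, not LIWC’s actual dictionaries) counts sentiment words and nothing else, so negation is invisible to it:

```python
# Toy word-count sentiment scorer in the spirit of, but far simpler than,
# LIWC. The word lists are illustrative, not LIWC's real dictionaries.
POSITIVE = {"happy", "good", "love"}
NEGATIVE = {"sad", "bad", "hate"}

def score(text: str) -> int:
    # Count positive words minus negative words; negation words like
    # "don't" are simply ignored, which is the bug being illustrated.
    words = text.lower().replace("'", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Both updates register as equally "positive":
print(score("I feel happy"))        # 1
print(score("I don't feel happy"))  # 1
```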