The Big Data Paradox

It’s never complete, and it’s always messy—and if it’s not, you can’t trust it.

World of Warcraft's XT-002 Deconstructor.

Big data is messy data. It’s not enough just to collect it and count it, because there is never just one way to count it. Big data certainly doesn’t mean “the end of theory,” as Wired editor Chris Anderson notoriously put it in 2008. I came down hard on big data last week while discussing the Facebook and OkCupid experiments on users and their supposed revelations about human nature. These revelations turned out to be founded on sloppy analysis. That said, big data is undeniably important and is already responsible for great gains in efficiency and knowledge. Big data is not a miracle worker, but it is changing our lives.

One question to ask is how big data differs from regular data, other than there just being a lot more of it. Regular data doesn’t magically become big data just because you’ve got 100 million data points instead of a thousand. While computers have made large-scale number crunching far easier and faster than it was 20 or 30 years ago, that doesn’t mean that weather reports or graphs of seismic activity suddenly qualify as big data.

Contrariwise, big data is never complete either. It’s easy to think of big data as simply including all the data, but as Rachel Schutt and Cathy O’Neil put it in their excellent and skeptical book Doing Data Science, “It’s pretty much never all.” This is not a bad thing. As Jorge Luis Borges put it in “On Exactitude in Science,” a perfect map of a country—one necessarily as big as the country itself—is perfectly useless. But you must remain aware of what’s being excluded.


Aside from sheer quantity, there are three defining characteristics of big data. The first is megasourcing: taking data from huge numbers of distributed sources. If these sources are people, you can call it “crowdsourcing,” but the sources don’t need to be people. Every online ranking system, from Facebook “likes” to Reddit reputation systems, is an example of megasourcing, but so is Google Maps, which aggregates data from thousands of cars, satellites, and third-party data sources around the world.

Another is automation. The ability to analyze data as fast as it can be collected means that the results can be put in play automatically, without anyone having to examine the data manually. This is not just a benefit, but a necessity, as the sheer quantity of data is becoming too great for humans to analyze even with the benefit of extra time. Hence the danger of big data: that the analyses are garbage, as we saw in the case of Facebook’s mood experiment, where “not happy” and “happy” both got treated as positive mood indicators. There’s so much data that there’s not enough time to validate the results (unless there’s a public outcry).
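To make the “not happy” problem concrete, here is a toy sketch of a mood scorer that simply looks for positive words. The word list and the function name are invented for illustration, and this is not a reconstruction of the code used in the Facebook study; it only shows how matching words without context can go wrong.

```python
# Toy illustration only: the word list below is a made-up assumption,
# not the lexicon used in any real study.
POSITIVE_WORDS = {"happy", "great", "love"}

def naive_is_positive(post: str) -> bool:
    """Flag a post as positive if it contains any positive word.

    The check ignores context, so a negation like "not" has no effect.
    """
    words = post.lower().split()
    return any(word.strip(".,!?") in POSITIVE_WORDS for word in words)

posts = ["I am happy today", "I am not happy today"]
flags = [naive_is_positive(post) for post in posts]
print(flags)  # both posts are flagged as positive
```

Run automatically over millions of posts, an error like this would never be noticed unless someone went back and read the posts by hand.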

Finally, there is the issue of feedback. If an automated ad system decides you should see an ad for diapers because you recently “liked” a stroller on Facebook and bought wipes on Amazon, then further data on you—such as whether you clicked on that diaper ad—is interpreted as a consequence of the analysis that’s already been performed. Big data does not measure static or pristine systems; it puts its results back into these systems and changes their behavior. (This, naturally, makes the effects of big data that much more complicated and dependent.) “We’re witnessing the beginning of a massive, culturally saturated feedback loop,” write Schutt and O’Neil, “where our behavior changes the product and the product changes our behavior.”
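The feedback loop can be sketched in a few lines. Everything here is a simplifying assumption: the ad categories, the starting scores, and above all the rule that every ad shown gets clicked. The point is only that the system's later data is produced by its own earlier decisions.

```python
# Hypothetical starting signal: one "like" tilts the system toward diapers.
scores = {"diapers": 2, "books": 1, "shoes": 1}
shown_log = []

for _ in range(5):
    # The system shows whichever category currently scores highest.
    ad = max(scores, key=scores.get)
    shown_log.append(ad)
    # Assumption: the user clicks every ad shown, and the click is
    # recorded as fresh evidence of interest in that category.
    scores[ad] += 1

print(shown_log)
print(scores)
```

Under these assumptions the books and shoes scores never move, not because the user dislikes them but because those ads were never shown.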

A perfect example of these three features of big data comes from the online multiplayer game World of Warcraft. (As usual, computer gamers got here first.) To figure out how often certain rare treasure items drop, how strong certain monsters are, and where items and monsters pop up in the game world, players wrote external, automated tracking systems like Wowhead that could be installed on their computers. Anyone who used these extensions while playing would automatically upload data on all of their encounters, pickups, and statistics to a central third-party server, which would aggregate them into a searchable database and generate stats. So if you wanted to know where to find a particular monster in WoW, you could get a breakdown of probabilities, down to the specific in-world coordinates.
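The aggregation step described above might look something like the following sketch. The report format, the monster and item names, and the function name are all hypothetical; the real services' data formats are not described here. Each report stands for one kill uploaded by some player's client, along with the item that dropped, if any.

```python
from collections import Counter, defaultdict

# Hypothetical uploaded reports: (monster name, item dropped or None).
reports = [
    ("Example Boss", "Rare Sword"),
    ("Example Boss", None),
    ("Example Boss", None),
    ("Example Boss", "Rare Sword"),
    ("Other Monster", None),
]

def drop_rates(reports):
    """Estimate per-monster drop probabilities from pooled kill reports."""
    kills = Counter()              # total kills reported per monster
    drops = defaultdict(Counter)   # item counts per monster
    for monster, item in reports:
        kills[monster] += 1
        if item is not None:
            drops[monster][item] += 1
    return {
        monster: {
            item: count / kills[monster]
            for item, count in drops[monster].items()
        }
        for monster in kills
    }

rates = drop_rates(reports)
print(rates["Example Boss"]["Rare Sword"])
```

With only five reports the estimate means little; the numbers become useful only once thousands of players contribute, which is the megasourcing point.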

