An HTML for Numbers
Is Google's Public Data Explorer the first step toward a universal data format?
The Age of Data is just around the corner, right where it has been for years. As someone who spends a lot of his time creating visualizations, I've been hoping for this day to come for a very long time. "It used to be that you would get stories by chatting to people in bars," Internet godfather (and non-journalist) Tim Berners-Lee declared last year. "But now it's also going to be about poring over data and equipping yourself with the tools to analyze it." Don't buy it? This transfixing eight-part video series from Knight Journalism fellow Geoff McGhee might change your mind. Data isn't just for nerds anymore: it's beautiful, alluring, extraordinary.
It's also incredibly hard to work with. The problem with bringing data to journalism isn't convincing writers and editors that it's useful for telling stories; it's the toil required to get the numbers in a usable format. The data is already there, from federal sentencing figures and unemployment rates by county to minute-by-minute Twitter responses to the Black Eyed Peas' smoldering wreckage of a Super Bowl halftime show. The problem is that it all looks different. It is compiled by different people using different programs and represented in different formats. As a result, mashing up data isn't as simple as mashing together two balls of Silly Putty. It's more like trying to plug a bunch of American appliances into outlets in Tbilisi.
In hopes of bridging this data divide, Google is rolling out a tool called Public Data Explorer. While Data Explorer has been around for a while, it's now been extended to allow users to upload and visualize their own data sets. But that's not why Google's effort is important. If you want to make cool visualizations, IBM's Many Eyes offers more than a dozen different ways to display information. (Google currently offers four pretty standard ones.) The exciting news here is that Google is pushing for the adoption of a specific format. Users must upload their data in two files, one for all the numbers and one that describes what those numbers represent. If this feature becomes popular, it will make it a whole lot easier for people and agencies to use one another's data. It's not quite a universal format, but it's a lot closer than anything we have today.
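To make the two-file idea concrete, here is a minimal sketch in Python. It is an illustration of the general pattern only: the column names, the description layout, and the numbers are all invented, and none of it reflects Google's actual schema. One piece holds the raw numbers; the other says what each column means, so a program that has never seen the data can still interpret it.

```python
import csv
import io

# Hypothetical "numbers" file. The values are placeholders, not real statistics.
NUMBERS = """county,month,unemployment_rate
King,2010-12,5.0
Pierce,2010-12,6.5
"""

# Hypothetical "description" file, shown here as a dict. The keys ("type",
# "meaning", "unit") are invented for illustration, not Google's schema.
DESCRIPTION = {
    "county": {"type": "string", "meaning": "County name"},
    "month": {"type": "string", "meaning": "Year and month, YYYY-MM"},
    "unemployment_rate": {"type": "float", "meaning": "Unemployment rate", "unit": "percent"},
}

# Map each declared type name to the Python function that parses it.
CASTS = {"string": str, "float": float}


def read_described(numbers_text, description):
    """Parse the numbers, converting each column as the description declares."""
    rows = []
    for raw in csv.DictReader(io.StringIO(numbers_text)):
        rows.append(
            {col: CASTS[description[col]["type"]](val) for col, val in raw.items()}
        )
    return rows


rows = read_described(NUMBERS, DESCRIPTION)
for row in rows:
    unit = DESCRIPTION["unemployment_rate"]["unit"]
    print(row["county"], row["month"], row["unemployment_rate"], unit)
```

The point of the split is that the reader function knows nothing about unemployment; everything it needs comes from the description.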
The beauty of the Web—in fact, the reason the Internet can function in the first place—is that it doesn't require intensive training to publish a page in a readable format. Sure, you might have to learn a few HTML tags—or pay an 11-year-old who knows HTML—but it's a simple language that's easy to pick up. There is no equivalent for data. There are plenty of standards for making data readable by a machine, but no single format that everyone can understand and agree on.
While plenty of people have tried to develop a data standard, none of them have been named Google. A promising site called Swivel tried to become a "YouTube for data" a few years ago, but don't go looking for it now. One of Google's greatest powers is its ability to cajole Web developers into playing by the company's rules, in hopes of climbing in the rankings and generally staying in the demigod's good graces. For sure, there are well-developed languages, like XML and JSON, for organizing data in a way computers can understand. While these are great for representing data for a specific purpose, a search engine wouldn't know what to do with my code without extra information from me on what the numbers mean. This is where a standard format becomes essential.
To understand why I'm rooting for Google, consider this brief tale of woe. When I was trying to build a map of job-loss data for Slate, I started with the month-by-month, county-by-county figures from the Bureau of Labor Statistics. This data comes in huge text files with arcane codes—meaningless gibberish unless you have the software and the know-how to match those codes to the names of counties, which live in a different file. At the time, I did this in Excel with a cocktail of Byzantine macros, late nights, and emotional breakdowns.
I've since discovered better ways to crunch these figures, but I had to learn a lot of programming to get there. If the job data I wanted for my map had already been represented in Google's format, I would have saved days of work getting it into shape (even if I wanted to use my own software to visualize it instead of one of the four display options that Google offers).
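For the curious, the code-matching chore described above boils down to a lookup between two files. The Python sketch below shows the general idea under loud assumptions: the file layouts and job counts are invented for illustration and are not the Bureau of Labor Statistics' actual format. Only the pairing of a numeric code with a county name mirrors the real problem.

```python
import csv
import io

# Hypothetical figures file keyed by area code. Layout and numbers are invented.
FIGURES = """area_code,month,jobs
53033,2009-01,1000
53033,2009-02,990
17031,2009-01,2000
"""

# Hypothetical lookup file that maps each code to a readable name.
AREA_NAMES = """area_code,area_name
53033,"King County, WA"
17031,"Cook County, IL"
"""


def load_names(text):
    """Build a dict mapping each area code to its readable name."""
    return {
        row["area_code"]: row["area_name"]
        for row in csv.DictReader(io.StringIO(text))
    }


def label_figures(figures_text, names):
    """Attach a name to every coded row; unknown codes stay visible, not dropped."""
    labeled = []
    for row in csv.DictReader(io.StringIO(figures_text)):
        name = names.get(row["area_code"], "UNKNOWN " + row["area_code"])
        labeled.append((name, row["month"], int(row["jobs"])))
    return labeled


names = load_names(AREA_NAMES)
rows = label_figures(FIGURES, names)
for name, month, jobs in rows:
    print(name, month, jobs)
```

A dozen lines replace the macros, but only once you know to write them, which is the argument for a format that ships the names alongside the numbers.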
Even more compelling is the possibility that data could join the ranks of text, images, and video in Google search results. This happens in a very basic form now. If you Google "population of Italy," you see a simple graph of population over time at the top of the results page, which you can click for more detail. This is the exact same tool that's opening up to the public today. Imagine if Google's spiders, forever crawling the Web to index its contents, could smartly identify and sort data. Let's say I publish an article on YouTube view counts that includes proprietary data I collected for the piece. If it's formatted according to Google's standards, it might show up as a little bar graph when someone searches for "YouTube views," even if my article itself isn't at the top of the results. (By the way, Public Data Explorer allows you to choose whether to share your figures. Despite the name, your data won't be public if you don't want it to be.)
The breadth and relative complexity of Google's format will become clearer over time, but it can already represent many common types of information. For example, it can account for hierarchical data: a set in which the number of jobs in King County, Wash., is represented as a subset of jobs in Washington, which is in turn a subset of jobs in the United States. (This is useful for things like treemaps.) It's also good for time-lapse data, allowing you to show change over time in animated charts and graphs. While Google's system will have to evolve to accommodate less-traditional visualizations like network diagrams, its relative simplicity is a good thing. In a format like RDF, the author needs to add a ton of extra information to the source code to help computers figure out what's what. Google, by contrast, wants most of the burden to be on its shoulders.
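Hierarchical data of the kind described above is easy to sketch. The Python below is one possible representation, not Google's: each place simply names its parent, and county figures roll up into state and national totals. The job counts are invented placeholders.

```python
from collections import defaultdict

# Hypothetical hierarchy: each place points to the place that contains it.
PARENT = {
    "King County, WA": "Washington",
    "Pierce County, WA": "Washington",
    "Multnomah County, OR": "Oregon",
    "Washington": "United States",
    "Oregon": "United States",
}

# Invented job counts for the lowest level only.
COUNTY_JOBS = {
    "King County, WA": 100,
    "Pierce County, WA": 200,
    "Multnomah County, OR": 50,
}


def roll_up(leaf_values, parent):
    """Add each leaf's value to itself and to every ancestor above it."""
    totals = defaultdict(int)
    for place, value in leaf_values.items():
        node = place
        while node is not None:
            totals[node] += value
            node = parent.get(node)  # None once we pass the top of the tree
    return dict(totals)


totals = roll_up(COUNTY_JOBS, PARENT)
for place in ("King County, WA", "Washington", "United States"):
    print(place, totals[place])
```

Totals like these are what a treemap draws: each rectangle's area is its own figure, nested inside its parent's.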
Public Data Explorer is important because not many people care to read data in its raw form. A simple presentation tool (essentially, an HTML for numbers instead of words) might not be sexy, but it could do a lot to elevate data to the same importance as text in search results. Then, fingers crossed, mashing it all up could end up being just like mashing together a couple of balls of putty.
Chris Wilson is a Slate contributor.
Photograph by iStockphoto/Thinkstock.