The spell-checker is just one small example of what large-scale data mining makes possible. In my last column, I looked at another case: speech recognition, which relies on billions of audio samples and text queries to teach machines how to understand human speech. Here are some more: Google analyzes search logs to detect large-scale computer security threats; when it notices anomalous collections of searches (which viruses have been known to perform to seek out vulnerable Web servers), it can stop viruses in their tracks. Then there's "crowdsourced traffic." On many smartphones, you can turn on a Google Maps feature called "My Location," which beams back anonymous data about where your phone is at any given moment. Google collects and analyzes this information to create real-time traffic reports on highways and even surface streets.
An even more important area of research is prediction: by looking at what people are searching for today, Google can guess what's going to happen tomorrow or next month. In 2008, Google launched Flu Trends, a site that monitors increases in health-related searches in different parts of the world. The team behind the system published a research paper showing that they can accurately predict the outbreak of a flu epidemic in a given region before public health authorities catch on to it. In 2009, Hal Varian, Google's chief economist, published a paper showing that Google searches can be used to predict a bevy of economic data, too, including retail sales and unemployment claims.
There's something important to note about the spell-checker, Flu Trends, speech recognition, and other Google products based on data: They weren't planned. Google didn't begin saving search queries in order to build the spell-checker; it built the spell-checker because it began saving search queries and eventually realized that the database could be useful. "You may not think of it right away; later on, you'll come up with some use for the data, and over the years we've constantly come up with new ways to look at data to improve our products and services," Varian told me.
This suggests the danger of our maximalist attitudes about online privacy. We all argue that we'd like companies to store less of our personal information and to jump through many hoops if they want to store more. Members of Congress are now pushing for legislation that would tighten privacy controls at Web companies; there's even a move to expand the FTC's settlement with Google to all Web firms. I don't oppose greater privacy measures, so long as they're the product of an honest discussion. The trouble is, we rarely have rational discussions about privacy. Witness the annual furor over Facebook: Every year or so, the media and activists get exercised over some new and alarming slight by the social-networking company. Yet our actions belie our concerns; while we all holler about how much we hate Facebook, none of us quit it, and in fact hundreds of thousands more keep signing up.
We need to be more honest about what we mean when we say we want to protect "privacy" online. Does this mean that we want to be able to control every single bread crumb we leave behind when we're on the Web? Many activists fear that the distinction between anonymous and personally identifiable data is eroding, and that broad data-mining practices make it possible for Web companies to suss out who you are by analyzing supposedly "anonymous" data. Based on this worry, some regulators have proposed restrictions on the anonymous data that companies keep.
But if that's what we want to see happen, we ought to be clear about the costs. Yes, Web companies track a lot of what we do in our browsers. But the next time you misspell a word or get caught in a traffic jam or have your computer shut down by a virus, remember that tracking is not always a bad thing.