Last week the Federal Trade Commission and Google signed a broad privacy settlement that requires the search company to submit to "privacy audits" every two years. The agreement ended a dispute that began last year, when Google launched Buzz, the ill-fated social-messaging system built into Gmail.
Buzz was certainly a privacy boondoggle for Google—a black eye for a company that had been trying to position itself as the good guy to Facebook's bad guy. I agree with the FTC that Google should pay for the mistakes it made (the company has apologized and says it's fixed its privacy procedures to prevent another such imbroglio). And if they're done judiciously, the privacy audits may prove helpful in ensuring that Google stays on the up and up.
But that's what I worry about: Will the audits in fact be done judiciously? There's a good chance that privacy regulators—spurred by a public that doesn't really know what it wants when it comes to online privacy—may go too far, blocking Google from collecting and analyzing information about its users. That will be a terrible outcome, because while we all reflexively hate the thought of a company analyzing our digital lives, we also benefit from this practice in many ways that we don't appreciate.
I know I sound naïve, but bear with me. Yes, Google collects a lot of information about all of us. It does so on purpose, and for all sorts of reasons. Some of these reasons we don't like very much—Google, like all big Web companies, sells ads, and it can get more money for those ads when they're targeted to you. This practice pays for the Web, and it's the reason you don't pay a fee to conduct a Google search. Still, I understand why people are wary of online data collection. Too often, though, our conversations about online privacy end right here.
Broadly speaking, there are two types of data that Web companies keep on us—personally identifiable information (like your name and list of friends), and information that can't be tied to you as an individual. In our discussions about privacy, we rarely make this important distinction. While we focus on the disadvantages of companies collecting our information, we rarely look at the innovations that wouldn't be possible without our personal data. This is especially true when it comes to anonymous data—information that can't be used to identify you, but which serves as the building blocks of amazing things.
Indeed, some of Google's best and most-loved products would not be possible without our data. Take the spell-checker: How does Google know you meant Rebecca Black when you typed Rebeca Blacke? Note that this is a trick that no ordinary, dictionary-based spell-checker could perform, because these are proper nouns and we're dealing with an ephemeral personality. But since Google has stored lots of other people's search requests for Black, it knows you're looking for the phenom behind "Friday." The theory behind the spell-checker can be applied more broadly. By studying words that turn up in otherwise identical search terms (for instance, people may search for either "los angeles murder rate" or "los angeles homicide rate"), Google can detect that two completely different words may have the same meaning. This has profound implications for the future of computing: In a very real sense, mining search queries is teaching computers how to understand language (and not just English, either). If Google were forced to forget every search query right after it served up a result, none of these things would be possible.
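To see why stored queries matter here, consider a minimal sketch in Python of a log-driven correction. This is an illustration only, not Google's actual system: the function name `suggest`, the similarity cutoff of 0.8, and the tiny made-up query log are all assumptions chosen for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def suggest(query, query_log, min_similarity=0.8):
    """Return the most common logged query that closely resembles `query`.

    Hypothetical helper: it has no dictionary at all and relies only on
    what other people have typed before.
    """
    counts = Counter(q.lower() for q in query_log)
    query = query.lower()

    best, best_count = None, 0
    for candidate, count in counts.items():
        if candidate == query:
            continue
        # Character-level similarity between 0.0 and 1.0.
        similarity = SequenceMatcher(None, query, candidate).ratio()
        if similarity >= min_similarity and count > best_count:
            best, best_count = candidate, count

    # Only "correct" the query if the look-alike is more popular than
    # the query itself; otherwise leave it unchanged.
    if best is not None and best_count > counts.get(query, 0):
        return best
    return query


# Made-up log: many correct spellings, a few typos, one unrelated query.
sample_log = (
    ["rebecca black"] * 50
    + ["rebeca blacke"] * 2
    + ["los angeles homicide rate"] * 10
)
print(suggest("Rebeca Blacke", sample_log))  # prints "rebecca black"
```

Delete the log and the function has nothing to work with, which is the point: the correction comes entirely from other people's searches.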
The spell-checker is just one small example of what large-scale data mining makes possible. In my last column, I looked at another case: speech recognition, which relies on billions of audio samples and text queries to teach machines how to understand human speech. Here are some more: Google analyzes search logs to detect large-scale computer security threats; when it notices anomalous collections of searches (which viruses have been known to perform to seek out vulnerable Web servers) it can stop viruses in their tracks. Then there's "crowdsourced traffic." On many different smartphones, you can turn on a Google Maps feature called "My Location," which beams back anonymous data about where your phone is at any given moment. Google collects and analyzes this information to create real-time traffic reports on highways and even surface streets.
An even more important area of research is in prediction: by looking at what people are searching for today, Google can guess what's going to happen tomorrow or next month. In 2008, Google launched Flu Trends, a site that monitors increases in health-related searches in different parts of the world. The team behind the system published a research paper showing that they can accurately predict the outbreak of a flu epidemic in a certain region before public health authorities catch on to it. In 2009, Hal Varian, Google's chief economist, published a paper showing that Google searches can be used to predict a bevy of economic data, too, including retail sales and unemployment claims.
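The basic move behind this kind of prediction can be sketched in a few lines of Python: fit a line relating past search volume to a past real-world figure, then apply it to this week's searches. This is a toy under stated assumptions, not the model from either paper; the function names `fit_line` and `predict` are hypothetical, and the numbers are invented purely to show the mechanics.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new observation."""
    return slope * x + intercept


# Invented history: weekly flu-related searches (in thousands) and the
# number of flu-related doctor visits later reported for the same weeks.
past_searches = [100, 200, 300, 400]
past_visits = [12, 22, 32, 42]

slope, intercept = fit_line(past_searches, past_visits)

# This week's searches are available immediately; the official visit
# count is not, so the fitted line supplies an early estimate.
estimate = predict(slope, intercept, 500)
print(round(estimate, 1))
```

The estimate is only as good as the stored history it was fitted on, so a company that kept no search logs would have no line to draw.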
There's something important to note about the spell-checker, Flu Trends, speech recognition, and other Google products based on data. They weren't planned. Google didn't begin saving search queries in order to build the spell-checker; it built the spell-checker because it began saving search queries, and eventually realized that the database could be useful. "You may not think of it right away. Later on, you'll come up with some use for the data, and over the years we've constantly come up with new ways to look at data to improve our products and services," Varian told me.
This suggests the danger of our maximalist attitudes about online privacy. We all argue that we'd like companies to store less of our personal information and to jump through many hoops if they want to store more. Members of Congress are now pushing for legislation that would tighten privacy controls at Web companies; there's even a move to expand the FTC's settlement with Google to all Web firms. I don't oppose greater privacy measures, so long as they're the product of an honest discussion. The trouble is, we rarely have rational discussions about privacy. Witness the annual furor over Facebook: Every year or so, the media and activists get exercised over some new and alarming slight by the social-networking company. Yet our actions belie our concerns; while we all holler about how much we hate Facebook, none of us quit it, and, in fact, hundreds of thousands more keep signing up.
We need to be more honest about what we mean when we say we want to protect "privacy" online. Does this mean that we want to be able to control every single bread crumb we leave behind when we're on the Web? Many activists fear that the distinction between anonymous and personally identifiable data is eroding, and that broad data-mining practices make it possible for Web companies to suss out who you are by analyzing supposedly "anonymous" data. Based on this worry, some regulators have proposed restrictions on the anonymous data that companies keep.
But if that's what we want to see happen, we ought to be clear about the costs. Yes, Web companies track a lot of what we do in our browsers. But the next time you misspell a word or get caught in a traffic jam or have your computer shut down by a virus, remember that tracking is not always a bad thing.