Project Dreamcatcher

How cutting-edge text analytics can help the Obama campaign determine voters’ hopes and fears.

Jan. 13, 2012, 4:48 PM

“Share your story,” Barack Obama’s Pennsylvania website encouraged voters just before the holidays, above a text field roomy enough for even one of the president’s own discursive answers. “Tell us why you want to be involved in this campaign,” read the instructions. “How has the work President Obama has done benefited you? Why are you once again standing for change?” In Obama’s world, this is almost a tic. His transition committee solicited “[a]n American moment: your story” on the occasion of his inauguration. The Democratic National Committee later asked people to “[s]hare your story about the economic crisis.” It’s easy to see where this approach fits into the culture of Obama’s politicking: His own career is founded on the value of personal narratives and much of his field staff takes inspiration from Marshall Ganz, the former labor tactician who famously built solidarity in his organizing sessions by asking participants to talk about their backgrounds. But might a presidential campaign have another use for tens of thousands of mini-memoirs?

That’s the central thrust of a project under way in Chicago known by the code name Dreamcatcher and led by Rayid Ghani, the man who has been named Obama’s “chief scientist.” Veterans of the 2008 campaign snicker at the new set of job titles, like Ghani’s, which have been conjured to describe roles on the re-election staff, suggesting that they sound better suited to corporate life than a political operation priding itself on a grassroots sensibility. Indeed, Ghani last held the chief-scientist title at Accenture Technology Labs, just across the Chicago River from Obama’s headquarters. It was there that he developed the expertise Obama’s campaign hopes can help them turn feel-good projects like “share your story” into a source of valuable data for sorting through the electorate.

At Accenture, Ghani mined the mountains of private data that collect on corporate consumer servers to find statistical patterns that could forecast the future. In one case, he developed a system to replace health insurers’ random audits by deploying an algorithm able to anticipate which of 50,000 daily claims are most likely to require individual attention. (Up to 30 percent of an insurer’s resources can be devoted to reprocessing claims.) To help set the terms of price insurance marketed to eBay sellers, Ghani developed a model to estimate the end-price for auctions, based on each sale item’s unique characteristics.

Often, Ghani found himself trying to help businesses find patterns in consumer behavior so that his clients could develop different strategies for different individuals. (In the corporate world, this is known as “CRM,” for customer-relationship management.) To help grocery stores design personalized sale promotions that would maximize total revenue, Ghani needed to understand how shoppers interacted with different products in relation to one another. The typical store had 60,000 products on its shelves, and Ghani coded each into one of 551 categories (like dog food, laundry detergent, orange juice) that allowed him to develop statistical models of how people build a shopping list and manage their baskets.

Ghani’s algorithms assigned shoppers scores to rate their individual propensities for particular behaviors, like the “opportunistic index” (“how ‘savvy’ the customer is about getting better prices than the rest of the population”), and to see whether they had distinctive habits (like “pantry-loading”) when faced with a price drop. If there was a two-for-one deal on a certain brand of orange juice, Ghani’s models could predict who would double their purchase, who would keep buying the same amount, and who would switch from grapefruit for the week.

But Ghani realized that customers didn’t see the supermarket as a collection of 551 product categories, or even 60,000 unique items. He points to the example of a 1-liter plastic jug of Tropicana Low Pulp Vitamin-D Fortified Orange Juice. To capture how that juice actually interacted with other products in a shopper’s basket, Ghani knew the product needed to be seen as more than just an item in the “orange juice” category. So he reduced it to a series of attributes (Brand: Tropicana, Pulp: low, Fortified with: Vitamin-D, Size: 1 liter, Bottle type: plastic) that could be weighed by the algorithms. Now a retailer’s models could get closer to calculating shopping decisions as customers actually made them. A sale on low-pulp Tropicana might lure people who usually purchased a pulpier juice, but would Florida’s Natural drinkers shift to a rival brand? Would a two-for-one deal get those who typically looked for their juice in a carton to stock up on plastic?
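To make the idea concrete, here is a minimal sketch of what treating a product as a bundle of attributes might look like. It is not Ghani’s system: the `Juice` class, the `shared_attributes` helper, and every value given for the rival carton are assumptions invented for illustration. Only the Tropicana attributes come from the example above.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Juice:
    """A product described by its attributes rather than by one category."""
    brand: str
    pulp: str
    fortified_with: str
    size_liters: float
    bottle_type: str

def shared_attributes(a, b):
    """Return the names of the attributes on which two products agree."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] == db[name]}

# Attributes taken from the Tropicana example in the article.
tropicana = Juice("Tropicana", "low", "Vitamin-D", 1.0, "plastic")

# Hypothetical rival product; these attribute values are invented.
rival = Juice("Florida's Natural", "high", "none", 1.0, "carton")

print(shared_attributes(tropicana, rival))
```

Both products sit in the same “orange juice” category, yet in this sketch they agree on only one attribute, size. That gap is the kind of distinction a category label hides and an attribute list exposes.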

The challenge was, in essence, semantic: teaching computers to decode complex product descriptions and isolate their essential attributes. For another client, Ghani, along with four Accenture colleagues and a Carnegie Mellon computer scientist, used a Web crawler to pull product names and descriptions from online clothes stores and built an algorithm that could assess products based on eight different attributes, including “age group,” “formality,” “price point,” and “degree of sportiness.” Once the products had been assigned values in each of those categories, they could be manipulated numerically, the same way that Ghani’s predictive models had tried to make sense of the grocery shopping list. By reducing a shoe to its basic attributes (lightweight mesh nylon material, low-profile sole, standard lacing system), a retailer could predict sales for shoes it had never sold before by comparing them to ones it had.
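One simple way to compare a new shoe to ones a store has already sold is to average the sales of its closest look-alikes. The sketch below does that with Jaccard similarity over attribute sets. It is an assumption about how such a comparison could work, not a description of the Accenture algorithm, and the catalog, its sales figures, and the function names are all invented.

```python
def jaccard(a, b):
    """Share of attributes two products have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

# Hypothetical catalog: attribute sets paired with invented weekly unit sales.
catalog = [
    ({"mesh nylon", "low-profile sole", "standard lacing"}, 120),
    ({"leather", "thick sole", "standard lacing"}, 40),
    ({"mesh nylon", "thick sole", "velcro strap"}, 75),
]

def estimate_sales(new_shoe, known, k=2):
    """Average the sales of the k known shoes most similar to new_shoe."""
    ranked = sorted(known, key=lambda item: jaccard(new_shoe, item[0]), reverse=True)
    top = ranked[:k]
    return sum(sales for _, sales in top) / len(top)

# A shoe the store has never sold, described only by its attributes.
new_shoe = {"mesh nylon", "low-profile sole", "velcro strap"}
print(estimate_sales(new_shoe, catalog))
```

Here the new shoe shares nothing with the leather model, so the estimate rests on the two mesh nylon shoes alone. A production model would presumably weight attributes differently rather than count them equally.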

Ghani’s clients in the corporate world were companies that “analyze large amounts of transactional data but are unable to systematically ‘understand’ their products,” as his team wrote. Political campaigns struggle with much the same problem. In 2008, Obama’s campaign successfully hoarded hard data available from large commercial databases, voter files, boutique lists, and an unprecedented quantity of voter interviews it regularly conducted using paid phone banks and volunteer canvassers. Obama’s analysts used the data to build sophisticated statistical models that allowed them to sort voters by their relative likelihoods of supporting Obama (and of voting at all). The algorithms could also be programmed to predict views on particular issues, and Obama’s targeters developed a few flags that predicted binary positions on discrete, sensitive topics—like whether someone was likely pro-choice or pro-life.

But the algorithms the Obama campaign used in 2008, and that Mitt Romney has used so far this year, have trouble picking up voter positions, or the intensity around those positions, with much nuance. In other words, the analysts were getting pretty good at sorting the orange juice drinkers from the grapefruit juice drinkers. But they still didn’t have a great sense of why a given voter preferred grapefruit to O.J., and how to change his mind. Polls seemed unable to get at an honest hierarchy of personal priorities in a way that could help target messages. Before the 2008 Iowa caucuses, every Democrat’s top concern seemed to be opposition to the Iraq war; once Lehman Bros. collapsed not long after the conventions, the economy became the leading issue across demographic and ideological groups. But microtargeting surveys were unable to burrow beneath that surface unanimity to separate individual differences in attitudes toward the war or the economy. If a voter writes in a Web form that her top concern is the war in Afghanistan, should she be asked to enlist as a “Veterans for Obama” volunteer, or sent direct mail written to placate foreign-policy critics?

Campaigns do, however, take in plenty of information about what voters believe, information that is not gathered in the form of a poll. It comes in voters’ own words, often registered onto the clipboards of canvassers, during a call-center phone conversation, in an online signup sequence or a stunt like “share your story.” As part of the Dreamcatcher project, Obama campaign officials have already set out to redesign the “notes” field on individual records in the database they use to track voters so that it sits visibly at the top of the screen—encouraging volunteers to gather and enter that information. And they’ve made the field large enough to include the “stories” submitted online. (One story was 60,000 text characters long.)

What can the campaign do with this blizzard of text snippets? Theoretically, Ghani could isolate keywords and context, then use statistical patterns gleaned from the examples of millions of voters to discern meaning. Say someone prattles on about “the auto bailout” to a volunteer canvasser: Is he lauding a signature domestic-policy achievement or is he a Tea Party sympathizer who should be excluded from Obama’s future outreach efforts? An algorithm able to interpret that voter’s actual words and sort them into categories might be able to make an educated guess. “They’re trying to tease out a lot more nuanced inferences about what people care about,” says a Democratic consultant who worked closely with Obama’s data team in 2008.
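A toy version of that educated guess might look like the sketch below, which checks whether a note mentions a topic and then counts favorable and unfavorable words around it. Nothing here reflects the campaign’s actual methods, which it has not disclosed: the cue-word lists and the `stance_on` function are assumptions made up for illustration, and a real system would presumably learn its cues from many labeled examples rather than a hand-written list.

```python
import re

# Hypothetical cue words, invented for this sketch.
POSITIVE = {"saved", "thank", "proud", "helped", "support"}
NEGATIVE = {"waste", "against", "handout", "angry", "oppose"}

def stance_on(topic, note):
    """Guess whether a note is favorable toward topic, unfavorable, or unclear."""
    text = note.lower()
    if topic.lower() not in text:
        return "not mentioned"
    # Split the note into lowercase words, dropping punctuation.
    words = set(re.findall(r"[a-z']+", text))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "unclear"

print(stance_on("auto bailout", "The auto bailout saved my job and I'm proud of it"))
print(stance_on("auto bailout", "The auto bailout was a waste, I oppose it"))
```

Word counting of this kind breaks down on sarcasm, negation, and anything subtle, which is why the passage above stresses statistical patterns drawn from millions of voters rather than fixed keyword lists.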

Obama’s campaign has boasted that one of its priorities this year is something it has described only as “microlistening,” but it would not officially discuss how it intends to deploy insights gleaned from its new research into text analytics. “We have no plans to read out our data/analytics/voter contact strategy,” spokesman Ben LaBolt writes by email. “That just telegraphs to the other guys what we’re up to.”

Yet those familiar with Dreamcatcher describe it as a bet on text analytics to make sense of a whole genre of personal information that no one has ever systematically collected or put to use in politics. Obama’s targeters hope the project will allow them to make more sophisticated decisions about which voters to approach and what to say to them. “It’s not about us trying to leverage the information we have to better predict what people are doing. It’s about us being better listeners,” says a campaign official. “When a million people are talking to you at once it’s hard to listen to everything, and we need text analytics and other tools to make sense of what everyone is saying in a structured way.”