The following article is adapted from Viktor Mayer-Schönberger and Kenneth Cukier’s Big Data: A Revolution That Will Transform How We Live, Work, and Think, out now from Houghton Mifflin Harcourt.
Mike Flowers was a lawyer in the Manhattan district attorney’s office in the early 2000s, prosecuting everything from homicides to Wall Street crimes, then made the shift to a plush corporate law firm. After a boring year behind a desk, he decided to leave that job too. Looking for something more meaningful, he thought of helping to rebuild Iraq. A friendly partner at the firm made a few calls to people in high places. The next thing Flowers knew, he was heading into the Green Zone, the secure area for American troops in the center of Baghdad, as part of the legal team for the trial of Saddam Hussein.
Most of his work turned out to be logistical, not legal. He needed to identify areas of suspected mass graves to know where to send investigators digging. He needed to ferry witnesses into the Green Zone without getting them blown up by the many IED (improvised explosive device) attacks that were a grim daily reality. He noticed that the military treated these tasks as information problems. And data came to the rescue. Intelligence analysts would combine field reports with details about the location, time, and casualties of past IED attacks to predict the safest route for that day.
On his return to New York City a few years later, Flowers realized that those methods marked a more powerful way to combat crime than he’d ever had at his disposal as a prosecutor. And he found a veritable soul mate in the city’s mayor, Michael Bloomberg, who had made his fortune in data by supplying financial information to banks. Flowers was named to a special task force assigned to crunch the numbers that might unmask the villains of the subprime mortgage scandal in 2009. The unit was so successful that a year later Mayor Bloomberg asked it to expand its scope. Flowers became the city’s first “director of analytics.” His mission: to build a team of the best data scientists he could find and harness the city’s untapped troves of information to reap efficiencies covering everything and anything.
Flowers cast his net wide to find the right people. “I had no interest in very experienced statisticians,” he says. “I was a little concerned that they would be reluctant to take this novel approach to problem solving.” Earlier, when he had interviewed traditional stats guys for the financial fraud project, they had tended to raise arcane concerns about mathematical methods. “I wasn’t even thinking about what model I was going to use. I wanted actionable insight, and that was all I cared about,” he says. In the end he picked a team of five people he calls “the kids.” All but one were economics majors just a year or two out of school and without much experience living in a big city, and they all had something a bit creative about them.
Among the first challenges the team tackled was “illegal conversions,” the practice of cutting up a dwelling into many smaller units so that it can house as many as 10 times the number of people it was designed for. Such dwellings are major fire hazards, as well as cauldrons of crime, drugs, disease, and pest infestation. A tangle of extension cords may snake across the walls; hot plates sit perilously on top of bedspreads. People packed this tightly regularly die in blazes. In 2005 two firefighters died trying to rescue residents. New York City gets roughly 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them. There seemed to be no good way to distinguish cases that were simply nuisances from ones that were poised to burst into flames. To Flowers and his kids, though, this looked like a problem that could be solved with lots of data.
They started with a list of every property lot in the city—all 900,000 of them. Next they poured in datasets from 19 different agencies indicating, for example, if the building owner was delinquent in paying property taxes, if there had been foreclosure proceedings, and if anomalies in utilities usage or missed payments had led to any service cuts. They also fed in information about the type of building and when it was built, plus ambulance visits, crime rates, rodent complaints, and more. Then they compared all this information against five years of fire data ranked by severity and looked for correlations in order to generate a system that could predict which complaints should be investigated most urgently.
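For readers who want to picture the mechanics, the join-then-compare step described above might look something like the sketch below. It is only an illustration: the lot IDs, the two agency tables, and every value in them are invented, and the simple rate comparison stands in for whatever correlation analysis the team actually ran, which the account does not specify.

```python
# Toy per-agency tables keyed by a hypothetical lot ID.
# All values are invented for illustration; none come from city data.
tax_delinquent = {"lot-1": True, "lot-2": False, "lot-3": True, "lot-4": False}
rodent_complaints = {"lot-1": 3, "lot-2": 0, "lot-3": 1, "lot-4": 0}
severe_fire = {"lot-1": True, "lot-2": False, "lot-3": False, "lot-4": False}


def build_features(lot_ids):
    """Join the per-agency tables into one feature row per lot."""
    return {
        lot: {
            "tax_delinquent": tax_delinquent.get(lot, False),
            "has_rodent_complaint": rodent_complaints.get(lot, 0) > 0,
        }
        for lot in lot_ids
    }


def fire_rate_by_flag(features, outcomes, flag):
    """Return (severe-fire rate where flag is set, rate where it is not)."""
    counts = {True: [0, 0], False: [0, 0]}  # flag value -> [fires, lots]
    for lot, row in features.items():
        bucket = counts[row[flag]]
        if outcomes.get(lot, False):
            bucket[0] += 1
        bucket[1] += 1

    def rate(fires, lots):
        return fires / lots if lots else 0.0

    return rate(*counts[True]), rate(*counts[False])


features = build_features(sorted(severe_fire))
with_flag, without_flag = fire_rate_by_flag(features, severe_fire, "tax_delinquent")
print(with_flag, without_flag)
```

A signal whose two rates differ sharply is a candidate predictor; one whose rates are nearly equal adds little. With only four made-up lots the numbers mean nothing, but the shape of the comparison is the same at 900,000.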
Initially, much of the data wasn’t in usable form. For instance, the city’s record keepers did not use a single, standard way to describe location; every agency and department seemed to have its own approach. The buildings department assigns every structure a unique building number. The housing preservation department has a different numbering system. The tax department gives each property an identifier based on borough, block, and lot. The police use Cartesian coordinates. The fire department relies on a system of proximity to “call boxes” related to the location of firehouses, even though call boxes are defunct. Flowers’ kids embraced this messiness by devising a system that identifies buildings by using a small area in the front of the property based on Cartesian coordinates and then draws in geo-loco data from the other agencies’ databases. Their method was inherently inexact, but the vast amount of data they were able to use more than compensated for the imperfections.
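The idea of pinning every agency's records to one small patch of ground can be sketched in a few lines. What follows is a guess at the general shape, not the team's system: the `LotFront` rectangle, the record fields, and the coordinates are all hypothetical, and it assumes each agency's record has already been translated into an x, y point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LotFront:
    """Hypothetical rectangle at the front of a property, in x/y coordinates."""
    lot_id: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def match_lot(x, y, fronts):
    """Return the lot_id whose front area contains the point, else None."""
    for front in fronts:
        if front.contains(x, y):
            return front.lot_id
    return None


# Invented example data: two lots and three records from different agencies.
fronts = [LotFront("lot-A", 0, 0, 10, 5), LotFront("lot-B", 10.5, 0, 20, 5)]
agency_records = [
    {"agency": "buildings", "record_id": "B-100", "x": 3.0, "y": 2.0},
    {"agency": "police", "record_id": "P-7", "x": 15.0, "y": 1.0},
    {"agency": "housing", "record_id": "H-42", "x": 50.0, "y": 50.0},
]

# Group record IDs under the lot they fall in; unmatched records land under None.
merged = {}
for rec in agency_records:
    lot = match_lot(rec["x"], rec["y"], fronts)
    merged.setdefault(lot, []).append(rec["record_id"])

print(merged)
```

Records that fall outside every rectangle are kept under `None` rather than discarded, which reflects the inexactness the paragraph describes: some matches will be missed or wrong, and the bet is that the volume of data outweighs them.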
The team members weren’t content just to crunch numbers, though. They went into the field with inspectors to watch them work. They took copious notes and quizzed the pros on everything. When one grizzled chief grunted that the building they were about to examine wouldn’t be a problem, the geeks wanted to know why he felt so sure. He couldn’t quite say, but the kids gradually determined that his intuition was based on the new brickwork on the building’s exterior, which suggested to him that the owner cared about the place.
The kids went back to their cubicles and wondered how they could possibly feed “recent brickwork” into their model as a signal. After all, bricks aren’t datafied—yet. But sure enough, a city permit is required for doing any external brickwork. Adding the permit information improved their system’s predictive performance by indicating that some suspected properties were probably not major risks.
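To see how a permit record could act as a reassuring signal, consider the toy scoring sketch below. The weights, signal names, and complaints are all hypothetical, chosen only to show a feature that lowers a score; the account does not say what kind of model the team used or how it weighted anything.

```python
# Hypothetical weights: positive values raise risk, negative values lower it.
# These numbers are invented for illustration, not taken from the team's model.
WEIGHTS = {
    "tax_delinquent": 2.0,
    "utility_cutoff": 1.5,
    "recent_brickwork_permit": -2.5,
}


def risk_score(signals):
    """Sum the weights of every signal that is present for a property."""
    return sum(WEIGHTS[name] for name, present in signals.items() if present)


# Two invented complaints; the first has a recent brickwork permit on file.
complaints = {
    "complaint-1": {
        "tax_delinquent": True,
        "utility_cutoff": True,
        "recent_brickwork_permit": True,
    },
    "complaint-2": {
        "tax_delinquent": True,
        "utility_cutoff": False,
        "recent_brickwork_permit": False,
    },
}

# Highest score first: these are the complaints to inspect most urgently.
ranked = sorted(complaints, key=lambda c: risk_score(complaints[c]), reverse=True)
print(ranked)
```

In this made-up case the permit pulls the first complaint below the second even though it carries more warning signs, which is the effect the chief's hunch about new brickwork pointed to.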