The Code We Can’t Control

Frank Pasquale’s new book highlights the dangers of “runaway data” and “black box algorithms.”

Can a computer program be racist? Imagine this scenario: A program that screens rental applicants is primed with examples of personal history, debt, and the like. The program makes its decision based on lots of signals: rental history, credit record, job, salary. Engineers “train” the program on sample data. People use the program without incident until one day, someone thinks to put through two applicants of seemingly equal merit, the only difference being race. The program rejects the black applicant and accepts the white one. The engineers are horrified, yet say the program only reflected the data it was trained on. So is their algorithm racially biased?
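To see how this can happen without anyone intending it, here is a deliberately simplified sketch in Python. Every dataset, feature name, and weight below is invented for illustration; real screening systems are far larger and messier, but the mechanism is the same: if past decisions penalized a proxy for race (here, an invented ZIP-code grouping), a model trained on those decisions learns the penalty.

```python
# Hypothetical sketch: a screening model absorbs bias from its training data.
# All features, numbers, and decisions here are synthetic, invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Invented applicant features: income (in thousands), credit score, and a
# ZIP-code group that, in this synthetic world, correlates with race.
income_k = rng.normal(50, 15, n)
credit = rng.normal(650, 80, n)
zip_group = rng.integers(0, 2, n)  # 0 or 1; stands in for the proxy feature

# Historical (human) approvals: driven by income and credit, but with an
# extra penalty applied whenever zip_group == 1.
score = 0.04 * income_k + 0.01 * credit - 1.5 * zip_group
approved = (score + rng.normal(0, 1, n) > 9.0).astype(int)

# Train a model on those historical decisions.
X = np.column_stack([income_k, credit, zip_group])
model = LogisticRegression(max_iter=1000).fit(X, approved)

# Two applicants identical in every respect except the proxy feature:
applicant_a = [[55, 700, 0]]
applicant_b = [[55, 700, 1]]
print(model.predict_proba(applicant_a)[0][1])  # markedly higher approval probability
print(model.predict_proba(applicant_b)[0][1])  # lower, despite identical merit
```

The engineers never wrote a line of code that mentions race; the bias rode in with the training data.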

Yes, it definitely is, and it’s just one of the dangers that can arise from an overreliance on widespread corporate and governmental data collection. University of Maryland law professor Frank Pasquale’s notable new book, The Black Box Society, tries to come to grips with the dangers of “runaway data” and “black box algorithms” more comprehensively than any other book to date. (An essay I wrote on “The Stupidity of Computers” is quoted in the book, though I wasn’t aware of this until I read it.) It’s an important read for anyone who is interested in the hidden pitfalls of “big data” and who wants to understand just how quantified our lives have become without our knowledge.

Pasquale cites a 2013 study, “Discrimination in Online Ad Delivery,” in which Harvard professor Latanya Sweeney found that black-identified names (including her own) frequently generated Google ads like “Lakisha Simmons, Arrested?” while white-identified names did not. Because Google’s secret sauce is, well, secret, Sweeney could only speculate about the cause: perhaps her first and/or last names were specifically linked to ad templates containing “arrest,” perhaps those ads had earned higher click-through rates, or perhaps something else was at work. Though Google AdWords was certainly not programmed with any explicit racial bias, the results nonetheless showed a kind of prejudice. Yet Sweeney’s example was one that could still be tested and measured by consumers. What about the everyday profiling that goes on without anyone noticing?

Chart: In one study, black-identified names generated different ads than white-identified ones. (Courtesy Latanya Sweeney/Harvard University, http://arxiv.org/ftp/arxiv/papers/1301/1301.6822.pdf)
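The click-through-rate explanation is easy to model. Here is a hypothetical sketch, with invented ad templates and click rates that have nothing to do with Google’s actual system: an optimizer that only chases clicks ends up showing the “Arrested?” wording more often if users click it more, no racial intent required.

```python
# Hypothetical sketch of the feedback loop Sweeney speculates about: an ad
# server that favors whichever template earns the higher observed
# click-through rate. Templates, names, and rates are invented.
import random

TEMPLATES = ["{name}, Arrested?", "We found {name}"]
clicks = {t: 0 for t in TEMPLATES}
impressions = {t: 0 for t in TEMPLATES}

def observed_ctr(template):
    # Smoothed click-through rate so untried templates still get a chance.
    return (clicks[template] + 1) / (impressions[template] + 2)

def serve_ad(name, true_ctr, epsilon=0.1):
    # Epsilon-greedy: mostly exploit the best-looking template, explore a little.
    if random.random() < epsilon:
        choice = random.choice(TEMPLATES)
    else:
        choice = max(TEMPLATES, key=observed_ctr)
    impressions[choice] += 1
    if random.random() < true_ctr[choice]:
        clicks[choice] += 1
    return choice.format(name=name)

# Suppose users happen to click the "Arrested?" wording a bit more often when
# it is paired with certain names. No one coded that preference in.
true_ctr = {"{name}, Arrested?": 0.06, "We found {name}": 0.04}
for _ in range(20_000):
    serve_ad("some_name", true_ctr)

for t in TEMPLATES:
    print(t, impressions[t], round(observed_ctr(t), 3))
# The "Arrested?" wording ends up shown far more often, purely because the
# optimizer chases clicks, not because anyone told it to associate the name
# with arrest.
```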

We’ve heard a lot about the NSA and other federal agencies monitoring and profiling citizens, but private “big data” companies—from Google and Facebook to shadowy “data brokers” like Acxiom and BlueKai—are engaged in equal if not greater amounts of data collection. Their goal is chiefly “microtargeting,” knowing enough about users so that ads can be customized for tiny segments like “soccer moms with two kids who like Kim Kardashian” or “aging, cynical ex-computer programmers.”

Some of these categories are dicey enough that you wouldn’t want to be a part of them. Pasquale writes that some third-party data-broker microtargeting lists include “probably bipolar,” “daughter killed in car crash,” “rape victim,” and “gullible elderly.” There are no restrictions on marketers assembling and distributing such lists, nor any oversight, leading to what Pasquale terms “runaway data.” With such lists circulating among marketers, credit bureaus, hiring firms, and health care companies, these categories—which cross the line into racial or gender classification as well—easily slip from marketing tools into reputation indicators.

This customer information is considered highly valuable and lucrative, and Facebook has partnered with brokers like Turn (700 million user profiles) to get its hands on it (as I chronicled in my feature “You Are What You Click”). Once the genie is out of the bottle, comprehensive monitoring of your Web habits means that your purchases, searches, and social media activity go on sale and resale to the highest bidders, with everyone profiting except you—unless you consider invasive personalized ads a benefit.

Worse, there’s no way to ensure the lists are even correct. The New York Times’ Natasha Singer reported in 2013 that Acxiom’s data was wrong about Acxiom’s own CEO, listing him as having two children instead of three and being of Italian rather than Norwegian descent. Would you want to be mistakenly (or even accurately) classified into a category like “STD sufferers”? There is no clear process for fixing these errors, which makes “cyberhygiene” extraordinarily difficult. Even if the data were made public, should individuals really have to verify the information used to profit off of them? The system is so complicated that only the wealthy may be able to afford the time and assistance required to maintain a clean bill of cyberhealth.

Today, there is barely any oversight of the increasingly complex algorithms that sort and classify us, with the government mostly allowing the industry to “self-regulate,” more or less amounting to no regulation at all. As a lawyer, Pasquale looks at the problem from the outside in, considering the civil structure in which data-collection algorithms are embedded and how we could potentially regulate abusive and harmful uses of the data while still enabling beneficial “big data” studies.

Yet Pasquale underestimates the degree to which even those on the inside can’t control the effects of their algorithms. As a software engineer at Google, I spent years looking at the problem from within, so it’s not surprising that I assign less agency and motive to megacorporations like Google, Facebook, and Apple. In dealing with real-life data, computers often fudge and even misinterpret, and the reason any particular decision was made matters less to these companies than whether the algorithm makes money overall. Who has the time to validate hundreds of millions of classifications? Where Pasquale tends to see such companies moving in lockstep with the profit motive, I can say firsthand just how confusing and confused even the internal operations of these companies can be.

For example, just because someone has access to the source code of an algorithm does not always mean he or she can explain how a program works. It depends on the kind of algorithm. If you ask an engineer, “Why did your program classify Person X as a potential terrorist?” the answer could be as simple as “X had used ‘sarin’ in an email,” or it could be as complicated and nonexplanatory as, “The sum total of signals tilted X out of the ‘non-terrorist’ bucket into the ‘terrorist’ bucket, but no one signal was decisive.” It’s the latter case that is becoming more common, as machine learning and the “training” of models on data create classification algorithms that do not behave in wholly predictable ways.
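A toy example makes that second answer concrete. The signals, weights, and threshold below are invented, but the shape is typical of trained classifiers: the decision is a sum of many small contributions, and no single one crosses the line on its own.

```python
# Toy illustration (invented signals, weights, and threshold, not any real
# system) of why "no one signal was decisive": a trained classifier often
# scores a person as the sum of many small, individually innocuous weights.
signals = {
    "emailed_word_sarin": 0.9,
    "searched_flight_prices": 0.2,
    "new_prepaid_phone": 0.4,
    "paid_cash_for_chemicals": 0.7,
    # ... a real model would have hundreds or thousands more, each learned
    # from data rather than chosen by an engineer.
}

threshold = 2.0  # the learned cut-off between the two "buckets"

def classify(person_signals):
    # Add up the weights of whichever signals this person triggers.
    score = sum(signals[s] for s in person_signals if s in signals)
    label = "terrorist bucket" if score > threshold else "non-terrorist bucket"
    return label, score

# Each signal alone falls well short of the threshold; only the aggregate tips it.
print(classify(["emailed_word_sarin"]))                  # below threshold
print(classify(["emailed_word_sarin", "new_prepaid_phone",
                "paid_cash_for_chemicals", "searched_flight_prices"]))  # above
```

When the weights themselves come out of a training process rather than a human decision, even the engineer who built the system can only gesture at why any particular person landed in one bucket or the other.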

Philosophy professor Samir Chopra has discussed the dangers of such opaque programs in his book A Legal Theory for Autonomous Artificial Agents, stressing that their autonomy from even their own programmers may require them to be regulated as autonomous entities. Pasquale stresses the need for an “intelligible society,” one in which we can understand how the inputs that go into these black box algorithms generate the effects of those algorithms. I’m inclined to believe it’s already too late—and that algorithms will increasingly have effects over which even the smartest engineers will have only coarse-grained and incomplete control. It is up to us to study the effects of those algorithms, whether they are racist, sexist, error-laden, or simply invasive, and take countermeasures to mitigate the damage. With more corporate and governmental transparency, clear and effective regulation, and a widespread awareness of the dangers and mistakes that are already occurring, we can wrest back some control of our data from the algorithms that none of us fully understands.