We Need a Nuremberg Code for Big Data

The world of social-engineering surveillance is growing rapidly.

June 20, 2013, 7:17 AM

German toddlers of the “Frogs” group play in the garden at the Spreekita Kindergarten in Berlin May 3, 2007.

Photo by John MacDougall/Getty Images

Recent revelations about the federal government’s PRISM program have sparked widespread debate about the benefits and harms of state surveillance of Americans in the name of national security. But what about the surveillance we submit to in the service of more mundane activities, like improving children’s vocabularies or increasing student engagement in the classroom? This growing world of social-engineering surveillance has garnered far less attention and controversy but poses significant challenges to the future of privacy.

This spring, the city of Providence, R.I., won the grand prize in the Bloomberg Philanthropies’ Mayors Challenge, an annual competition that invites the leaders of cities to propose innovative solutions to urban problems. Providence will use the $5 million prize money to launch Providence Talks, a project targeting the so-called “word gap.” The program draws on the work of psychologists Betty Hart and Todd Risley, whose research in the 1990s on parent-child communication concluded that by the age of 3, lower-income children had heard 30 million fewer words than their better-off peers, leaving them at a disadvantage as they entered school.

Providence Talks hopes to bridge that gap, but with a technological twist. Instead of clipboard-wielding researchers fanning out into a small number of homes, as Hart and Risley did in the 1990s, Providence Talks participants (that is, infants and toddlers) will be constantly surveilled by recording devices provided by LENA, a company that specializes in language environment analysis. For 16 hours, one day a month, the kids will wear little recording devices. LENA has even devised special clothing for its research subjects, including a dapper pair of green overalls and a sweet pink pullover (festooned with the LENA company logo) to safely house the recording devices, prompting images of crafty toddlers engaging in spontaneous acts of civil disobedience by dumping applesauce in the clothing’s “high-tech pockets.”

The program targets low-income families eligible for home visits under the state’s Universal Newborn Screening program and ideally will begin recording children at birth. After analyzing the data, researchers will create evaluations, which will be passed along to the social workers and nurses who meet monthly with families. Then, the social workers will offer strategies for improving the way lower-income parents talk to their kids, such as pointing out everyday objects and responding to infants’ vocalizations. But the program has larger ambitions: As the mayor of Providence wrote in his proposal, “We believe these data will be useful for city managers as well. Aggregate data on block and neighborhood level household auditory environments would allow us to direct existing early childhood resources with a level of precision and thoughtfulness never before possible.”

Setting aside the project’s logistical challenges (will families, hyperaware of the recording devices, end up overcompensating, thus skewing the results?) and the many issues it raises with regard to social class, it is surprising how little discussion of privacy and consent the project has prompted. The program is described as “free, confidential, and completely voluntary,” and according to LENA’s descriptions of its technology, the recorder is able to encrypt what it records, although it’s not clear that encryption will be used for Providence participants. And in their proposal to Bloomberg Philanthropies, city officials claimed that the recordings would be deleted after being analyzed by LENA’s software. Of course, those data are proprietary, and LENA has so far stated no intention of making them publicly available to other researchers.

More worrisome, however, is the lack of concern about how state surveillance of private citizens—even in the interest of “improving” those citizens—is increasing with little public debate about the challenges such interventions pose to freedom and autonomy. Research has demonstrated that teaching low-income families to talk more to their children yields positive results, but why is intrusive technological surveillance necessarily better than simply having social workers emphasize that during home visits? Even if digital surveillance can provide a bit more detail about a family’s conversational patterns, is this extra information worth the cost in terms of the money spent on the technology and the loss of privacy? If Providence’s pilot project yields rich data and good results for families, will the state of Rhode Island make it mandatory for anyone applying for government assistance?

You don’t have to live in Providence to be the subject of social-engineering surveillance. If you are on Facebook, have enrolled in a MOOC, or use electronic textbooks, you are also, perhaps unwittingly, one of a growing number of human subjects used in Big Data research. MOOC providers such as edX, Coursera, and Udacity are all engaged in large-scale data collection on the students registered in their courses, ostensibly for the purpose of improving course offerings. But such information can also be sold to third parties.

At universities such as Texas A&M, professors who use digital textbooks developed by Silicon Valley startup CourseSmart can track whether or not their students are reading and annotating their digital textbooks diligently enough (not unlike what your Kindle or Nook is doing if you read on one). As Evgeny Morozov observed about CourseSmart last year, “This data may seem trivial but once merged with other data—say, their Facebook friends or their Google searches—it suddenly becomes very valuable to advertisers and potential employers.” CourseSmart generates an “engagement index” to assess student performance based on the tracking data; more than 3 million students use its textbooks, which generate masses of data for the company. As a recent story in the New York Times revealed, the efforts of students who take their notes by hand or in a file on their personal computer instead of in the digital text itself are not included in the engagement index, leaving them vulnerable to giving their instructors the impression that they aren’t spending serious time with the course textbook, even if they are.

In the past few years, parents of children in public schools in New York and Massachusetts have demanded greater transparency about the kind and quantity of student information their respective state education departments are providing to the Shared Learning Collaborative. The collaborative, an online venture funded by the Gates Foundation, uses student data to help third parties develop and market for-profit educational products. Because the information in the collaborative is personally identifiable and includes things such as student test scores, grades, and attendance records, parents have asked that the state require parental consent for participation in the database and that education departments release the terms of their contracts with the Gates Foundation (and its new corporation, inBloom). So far they have had little success in achieving even these reasonable goals.

And at Harvard, sociology professor Nicholas Christakis is studying the Facebook profiles of students at an unidentified East Coast college for his research on how social relationships form. The students don’t know they are being studied, although Christakis did get approval from the college’s administrators and from an Institutional Review Board at Harvard to conduct his surveillance. “We’re on the cusp of a new way of doing social science,” Christakis told the New York Times. “Our predecessors could only dream of the kind of data we now have.” And yet the rules that researchers are applying are surprisingly analog—such as the idea that trolling student profiles on Facebook is the same as observing strangers in public because they have made their information available for the world to see.

Our increasing surveillance capabilities, coupled with the rise of Big Data, have not as yet been matched by a sustained effort to craft ethical rules for digital human subject research. Safeguards are erected in piecemeal fashion, if at all, and questions about what informed consent even means in this environment are left largely unanswered. In the past, ethical principles about research on human subjects all too often developed in reaction to gross abuses (such as the experiments conducted in Tuskegee or Stanley Milgram’s obedience studies). As a result, hospitals, universities, and research facilities have created elaborate ethical oversight structures in the form of Institutional Review Boards, and the federal government has crafted regulations for how federal funds can be used for human subject research. This architecture of ethical oversight isn’t perfect, but it is a reasonable attempt to acknowledge the potential dangers and the need for greater accountability by researchers. We need something similar for digital research. We need a Nuremberg Code for Big Data.

The goals of these research projects are laudable—improving literacy, “personalizing” the learning experience—but that doesn’t excuse us from asking tough questions about the methods involved. So far we’ve been satisfied with paltry reassurances that our data are not personally identifiable and are merely being collected to enhance the “user experience” (a phrase Orwell would have loved). But downloading that textbook or signing up for that MOOC often opens us up to having our data sold to third parties whose identities we’ll never know.

We also have a history of underestimating the possibility that science and technology will push past the privacy boundaries we’ve erected. Years ago, when state and federal governments began constructing DNA databases, privacy advocates’ concerns were dismissed as overblown since the databases would use so-called “junk DNA,” which at the time was thought to give no individual information about a person’s genetic predispositions or other medical history. Years later, scientists discovered that a great deal of that kind of information can be gleaned from junk DNA, and the future will likely bring more such discoveries.

As technological surveillance continues to replace traditional observational science, we need to ask better questions: Are there alternatives to intrusive technological surveillance? Are researchers scrupulous about seeking truly informed consent? When and how should we create structures to provide ethical review and oversight of these technologies? In the era of Big Data, we are all potential research subjects.

This article arises from Future Tense, a collaboration among Arizona State University, the New America Foundation, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.