How Big Data and Spam Bots Threaten Online Discussion

What's to come?
Oct. 26 2012 9:42 AM

Muzzled by the Bots

Intermediaries online are more powerful, and more subtle, than ever before.

Iranians outside France's embassy protest the publication of a cartoon about the Prophet Mohammed on Sept. 23, 2012

Photograph by Atta Kenare/AFP/Getty Images.

“Disintermediation” is often heralded as the defining feature of the digital age. Thanks to innovative new technologies, middlemen of all stripes are believed to be going the way of the dodo. Once editors, publishers, and bookstores wither, the story goes, our public life will finally be liberated from their biases, inefficiencies, and hidden agendas. To quote Amazon's Jeff Bezos (a master slayer of intermediaries if ever there was one), “even well-meaning gatekeepers slow innovation. When a platform is self-service, even the improbable ideas can get tried, because there’s no expert gatekeeper ready to say ‘that will never work!’” Even if Bezos is right, he's missing one important aspect of this story: The digitization of our public life is also giving rise to many new intermediaries, most of them of the invisible (and possibly suspect) variety.

Consider blogging. When the first generation of bloggers got online in the late 1990s, the only intermediaries between them and the rest of the world were their hosting companies and their Internet service providers. Anyone starting a blog in 2012 is likely to end up on a commercial platform like Tumblr or WordPress, with all of their blog comments run through a third-party company like Disqus. But the intermediaries don't just stop there: Disqus itself cooperates with a company called Impermium, which relies on various machine learning tools to check whether comments posted are spam. It's the proliferation, not the elimination, of intermediaries that has made blogging so widespread. The right term here is “hyperintermediation,” not “disintermediation.”

Impermium's new service goes even further: The company claims to have developed a technology to “identify not only spam and malicious links, but all kinds of harmful content—such as violence, racism, flagrant profanity, and hate speech—and allows site owners to act on it in real-time, before it reaches readers.” It says it has 300,000 websites as clients (which is not all that surprising, if it's incorporated into widely used third-party tools like Disqus). As far as intermediaries go, this sounds very impressive: a single Californian company making decisions over what counts as hate speech and profanity for some of the world's most popular sites without anyone ever examining whether its own algorithms might be biased or excessively conservative.

Impermium's model is interesting because it adds a “big data” layer to the usual process of determining what counts as spam or hate speech. It used to be that anyone who mentions “Viagra” in his comment or blog post would be deemed a spammer and thus blocked immediately. Now Impermium claims that, by leveraging user data that come from its network of 300,000 participating websites, it can actually distinguish jokes about Viagra from spam about Viagra.

This might seem liberating: Adding context to the moderation decision could save legitimate jokes. However, in other contexts, this marriage of big data and automated content moderation might also have a darker side, particularly in undemocratic regimes, for whom a war on spam and hate speech—waged with the help of domestic spam-fighting champions—is just a pretense to suppress dissenting opinions. In their hands, solutions like Impermium's might make censorship more fine-grained and customized, eliminating the gaps that plague “dumb” systems that censor in bulk.
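
The gap between a “dumb” bulk filter and a context-aware one can be illustrated with a toy Python sketch. Nothing here reflects Impermium's actual technology, whose workings are not public in any detail described above; the one-word blocklist, the link check, the `author_spam_rate` parameter, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re

# Hypothetical one-word blocklist, standing in for a "dumb" bulk filter.
BLOCKLIST = {"viagra"}


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_filter(text):
    """Bulk approach: flag any comment containing a blocklisted word."""
    return any(t in BLOCKLIST for t in tokens(text))


def contextual_filter(text, author_spam_rate):
    """Toy context-aware rule (an assumption, not a real product's logic):
    a blocklisted word counts as spam only when it appears alongside
    another signal, here a link or an author with a history of spam."""
    if not keyword_filter(text):
        return False
    lowered = text.lower()
    has_link = "http://" in lowered or "https://" in lowered
    return has_link or author_spam_rate > 0.5


joke = "My grandfather says crossword puzzles are his Viagra."
spam = "Cheap VIAGRA here: http://example.com/buy"

print(keyword_filter(joke), contextual_filter(joke, 0.0))
print(keyword_filter(spam), contextual_filter(spam, 0.0))
```

The bulk filter flags both comments; the contextual rule lets the joke through and still flags the spam. The same extra signals that rescue the joke (who the author is, what else he or she has posted) are what would make such a filter more precise in a censor's hands.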

Bloggers in China, for example, regularly employ euphemisms and allusions to trick the censorship algorithms of the country's online platforms. A seemingly innocuous expression like “river crab” often stands in for “Internet censorship” while “vacation therapy” has been used to refer to arrests of government officials. Left uncensored—since they don't use big words like “human rights” or “democracy”—such expressions quickly become memes and trigger critical discussions about Chinese politics.

With the help of “big data,” content-moderation software can check the relative frequency with which such expressions have been used on other popular sites and investigate the actual commentators using them—who are their friends? what other articles have they commented on?—to spot suspicious euphemisms. Or they might investigate where some of the posts containing those euphemisms come from. Just imagine what kind of new censorship possibilities open up once moderation decisions can incorporate geolocational information (what some researchers already call “spatial big data”): Why not block comments, videos, or photos uploaded by anyone located in, say, Tahrir Square or some other politically explosive location?

Or autocrats could be even craftier and hijack, rather than simply block, new content trends they find threatening. Following the Arab Spring uprisings, anyone posting critical comments about Bahrain or Syria on Twitter was likely to receive angry corrections from government loyalists or, more likely, their bots. Likewise, Tibetan activists lament that several Tibet-related Twitter hashtags (#tibet and #freetibet in particular) feature so much junk created by spambots that they are no longer useful.

Now big-data technology can make such propaganda more precise. For governments and corporations alike, the next frontier is to learn how to identify, pre-empt, and disrupt emerging memes before they coalesce behind a catchy hashtag; this is where “big data” analytics would be most helpful. Thus, one of the Russian security agencies has recently awarded a tender to create bots that can both spot the formation of memes and disrupt and counter them in real time through “mass distribution of messages in social networks with a view to the formation of public opinion.” Moscow is learning from Washington here: Last year the Pentagon awarded a $2.7 million contract to the San Diego-based firm Ntrepid to build software that creates multiple fake online identities in order to “counter violent extremist and enemy propaganda outside the US.” “Big data”-powered analytics would make spotting such “enemy propaganda” much easier.

Why would anyone bother with such tactics, given how hard it might be for a bot—which has few contacts and no meaningful history of tweeting—to persuade humans? First, persuasion may not be the goal. Some bots exist only to make it harder to discover timely factual information about, say, some ongoing political protests. All that investment in bots may have paid off for the Kremlin: During the protests that followed the disputed parliamentary elections in December 2011, Twitter was brimming with fake accounts that sought to overwhelm the popular hashtags with useless information. One recent study claims that of 46,846 Twitter accounts that participated in discussing the disputed Russian elections, 25,860—more than half!—were bots, posting 440,793 tweets on the subject.
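
The study's figures can be checked with a little arithmetic. The short Python sketch below uses only the three numbers quoted above; the variable names are illustrative, and the per-bot average is a derived figure, not one attributed to the study here.

```python
# Figures as quoted from the study of the disputed Russian elections.
total_accounts = 46_846   # accounts that took part in the discussion
bot_accounts = 25_860     # accounts the study classified as bots
bot_tweets = 440_793      # tweets posted by those bots

# Share of participating accounts that were bots.
bot_share = bot_accounts / total_accounts

# Average number of tweets per bot account (derived, not from the study).
tweets_per_bot = bot_tweets / bot_accounts

print(f"Bot share of accounts: {bot_share:.1%}")
print(f"Average tweets per bot: {tweets_per_bot:.1f}")
```

On these numbers, bots made up about 55 percent of the participating accounts and posted roughly 17 tweets each on average.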

Second, bots might help to add numerical ballast to memes that are already promoted by prominent humans in order to push them to the top of the viral charts. During this year's presidential elections in Mexico, the PRI (whose candidate won the election) was accused of programming thousands of bots to tweet specific words and phrases in order to land their preferred message on Twitter's “trending topics.” But the PRI also did a very good job getting its supporters to tweet en masse. One campaign had them all tweet a certain hashtag at the same time. It's through such a combination of humans and bots that memes emerge.

Third, the smartest of bots can serve another very interesting function: They can introduce humans to one another—for example, by mentioning both of their Twitter handles in one tweet. A 2011 experiment by PacSocial, an analytics company focused on bots, revealed that bots can, indeed, increase connections between users. In the PacSocial experiment, the connection rate increased by 43 percent (as compared to a testing period where no bots were present). Thus, with just some clever manipulation, bots might get you to follow the right humans—and it's the humans, not bots, who would then influence your thinking.

Digitization will increase, not decrease, the number of intermediaries in our public life. There is nothing inherently evil about intermediaries as long as we remember to keep them in check. Instead of celebrating the mythical nirvana of disintermediation, we should peer inside the black boxes of spam algorithms and propaganda bots. Our public debate might be only as good as our memes, but we shouldn't forget that not all memes are created equal: some are not created at all, while others are formed with a heavy dose of clever and insidious planning.

This article arises from Future Tense, a collaboration among Arizona State University, the New America Foundation, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture. To read more, visit the Future Tense blog and the Future Tense home page. You can also follow us on Twitter.

Evgeny Morozov is a contributing editor at the New Republic and the author of To Save Everything, Click Here: The Folly of Technological Solutionism.
