How Big Data and Spam Bots Threaten Online Discussion

What's to come?
Oct. 26 2012 9:42 AM

Muzzled by the Bots

Intermediaries online are more powerful, and more subtle, than ever before.

Iranians outside France's embassy protest the publication of a cartoon about the Prophet Mohammed on Sept. 23, 2012.

Photograph by Atta Kenare/AFP/Getty Images.

“Disintermediation” is often heralded as the defining feature of the digital age. Thanks to innovative new technologies, middlemen of all stripes are believed to be going the way of the dodo. Once editors, publishers, and bookstores wither, the story goes, our public life will finally be liberated from their biases, inefficiencies, and hidden agendas. To quote Amazon's Jeff Bezos, a master slayer of intermediaries if there ever was one, “even well-meaning gatekeepers slow innovation. When a platform is self-service, even the improbable ideas can get tried, because there’s no expert gatekeeper ready to say ‘that will never work!’” Even if Bezos is right, he's missing one important aspect of this story: The digitization of our public life is also giving rise to many new intermediaries, mostly of the invisible (and possibly suspect) variety.

Consider blogging. When the first generation of bloggers got online in the late 1990s, the only intermediaries between them and the rest of the world were their hosting companies and their Internet service providers. Anyone starting a blog in 2012 is likely to end up on a commercial platform like Tumblr or WordPress, with all of their blog comments run through a third-party company like Disqus. But the intermediaries don't just stop there: Disqus itself cooperates with a company called Impermium, which relies on various machine learning tools to check whether comments posted are spam. It's the proliferation, not the elimination, of intermediaries that has made blogging so widespread. The right term here is “hyperintermediation,” not “disintermediation.”

Impermium's new service goes even further: The company claims to have developed a technology to “identify not only spam and malicious links, but all kinds of harmful content—such as violence, racism, flagrant profanity, and hate speech—and allows site owners to act on it in real-time, before it reaches readers.” It says it has 300,000 websites as clients (which is not all that surprising, if it's incorporated into widely used third-party tools like Disqus). As far as intermediaries go, this sounds very impressive: a single Californian company making decisions over what counts as hate speech and profanity for some of the world's most popular sites without anyone ever examining whether its own algorithms might be biased or excessively conservative.

Impermium's model is interesting because it adds a “big data” layer to the usual process of determining what counts as spam or hate speech. It used to be that anyone who mentioned “Viagra” in a comment or blog post would be deemed a spammer and blocked immediately. Now Impermium claims that, by leveraging user data that come from its network of 300,000 participating websites, it can actually distinguish jokes about Viagra from spam about Viagra.
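
To see why the old keyword approach was so blunt, consider a minimal sketch of a “dumb” filter. It is purely illustrative: the blocklist, the function name, and the sample comments are all invented here, and it makes no attempt to model how Impermium or any real moderation service works.

```python
import re

# Hypothetical one-word blocklist, invented for this illustration.
BLOCKED_TERMS = {"viagra"}


def is_blocked(comment: str) -> bool:
    """Return True if any blocklisted term appears as a word in the comment."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", comment.lower())
    return any(word in BLOCKED_TERMS for word in words)


# Invented sample comments: one spam, one joke, one unrelated.
spam = "Buy cheap VIAGRA online now!!!"
joke = "My old laptop needs Viagra just to boot up."
harmless = "Great post, thanks for sharing."

print(is_blocked(spam))      # True
print(is_blocked(joke))      # True
print(is_blocked(harmless))  # False
```

The joke is blocked right along with the spam, because the filter sees only the word and nothing about who wrote it or why. Telling the two apart would take additional signals about the commenter and the context, which is the kind of data Impermium says it draws from its network.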

This might seem liberating: Adding context to the moderation decision could save legitimate jokes. However, in other contexts, this marriage of big data and automated content moderation might also have a darker side, particularly in undemocratic regimes, for which a war on spam and hate speech, waged with the help of domestic spam-fighting champions, is just a pretext to suppress dissenting opinions. In their hands, solutions like Impermium's might make censorship more fine-grained and customized, eliminating the gaps that plague “dumb” systems that censor in bulk.

Bloggers in China, for example, regularly employ euphemisms and allusions to trick the censorship algorithms of the country's online platforms. A seemingly innocuous expression like “river crab” often stands in for “Internet censorship” while “vacation therapy” has been used to refer to arrests of government officials. Left uncensored—since they don't use big words like “human rights” or “democracy”—such expressions quickly become memes and trigger critical discussions about Chinese politics.

With the help of “big data,” content-moderation software can check the relative frequency with which such expressions have been used on other popular sites and investigate the actual commenters using them (who are their friends? what other articles have they commented on?) to spot suspicious euphemisms. Or it might investigate where some of the posts containing those euphemisms come from. Just imagine what kind of new censorship possibilities open up once moderation decisions can incorporate geolocational information (what some researchers already call “spatial big data”): Why not block comments, videos, or photos uploaded by anyone located in, say, Tahrir Square or some other politically explosive location?
