
How Creative Is Your Computer?

The Lovelace test is a better measure of artificial intelligence than the Turing test.

Watercolor portrait of Ada Lovelace, whom the test is named for.

Courtesy of Wikimedia

This article originally appeared in New Scientist.

The Turing test is too easy—creativity should be the benchmark of humanlike intelligence, says Mark Riedl, associate professor at Georgia Tech’s School of Interactive Computing in Atlanta. His work straddles artificial intelligence, virtual worlds, and storytelling. He has developed a new form of the Turing test, called the Lovelace 2.0 test.

What are the elements of the Turing test?

The Turing test was a thought experiment suggesting that if someone can’t tell the difference between a human and a computer when communicating with them using just text chat or something similar, then whatever they’re chatting with must be intelligent. When Alan Turing wrote his seminal paper on the topic in 1950, he wasn’t proposing that the test should actually be run. He was trying to convince people that it might be possible for computers to have humanlike abilities, but he had a hard time defining what intelligence was.

Why do you think the test needs upgrading?

It has been beaten at least three times now by chatbots, which almost every artificial intelligence researcher will tell you are not very intelligent.

A 2001 test called the Lovelace test tried to address this, right?

Yes. That test, named after the 19th-century mathematician Ada Lovelace, was based on the notion that if you want to look at humanlike capabilities in AI, you mustn’t forget that humans create things, and that requires intelligence. So creativity became a proxy for intelligence. The researchers who developed that test proposed that an AI can be asked to create something—a story or poem, say—and the test would be passed only if the AI’s programmer could not explain how it came up with its answer. The problem is, I’m not sure that the test actually works because it’s very unlikely that the programmer couldn’t work out how their AI created something.

How is your Lovelace 2.0 test different?

In my test, we have a human judge sitting at a computer. They know they’re interacting with an AI, and they give it a task with two components. First, they ask for a creative artifact such as a story, poem, or picture. Second, they provide a criterion. For example: “Tell me a story about a cat that saves the day,” or “Draw me a picture of a man holding a penguin.”

Must the artifacts be aesthetically pleasing?

Not necessarily. I didn’t want to conflate intelligence with skill: The average human can play Pictionary but can’t produce a Picasso. So we shouldn’t demand superintelligence from our AIs.

What happens after the AI presents the artifact?

If the judge is satisfied with the result, he or she makes another, more difficult request. This goes on until the AI is judged to have failed a task, or the judge is satisfied that it has demonstrated sufficient intelligence. The multiple rounds mean you get a score rather than a simple pass or fail. And we can record a judge’s various requests so that they can be tested against many different AIs.
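The procedure Riedl describes amounts to an iterative scoring loop: the judge issues a creative task, the AI responds, and the exchange repeats with harder tasks until a failure. The Python below is a hypothetical sketch only; ai_create, judge_accepts, next_task, and the max_rounds cap are assumed stand-ins for the AI system and the human judge, whose judgment the real test does not automate.

# Hypothetical sketch of the Lovelace 2.0 procedure described above.
# A judge issues creative tasks of increasing difficulty; the run ends
# on the first failed task, and the number of rounds passed is the score.

from dataclasses import dataclass

@dataclass
class Task:
    artifact_type: str   # e.g. "story", "poem", "picture"
    criterion: str       # e.g. "about a cat that saves the day"
    difficulty: int      # tasks get harder as the test proceeds

def lovelace_2_score(ai_create, judge_accepts, next_task, max_rounds=10):
    """Return the number of consecutive tasks the AI satisfies.

    ai_create(task) -> artifact, judge_accepts(task, artifact) -> bool,
    and next_task(round_number) -> Task are placeholders for the AI
    system and the human judge in the description above.
    """
    score = 0
    for round_number in range(1, max_rounds + 1):
        task = next_task(round_number)          # harder each round
        artifact = ai_create(task)              # AI produces story/poem/picture
        if not judge_accepts(task, artifact):   # human judgment, not automated
            break                               # test ends on the first failure
        score += 1                              # each satisfied task adds to the score
    return score

Under these assumptions, comparing two AIs is simply a matter of replaying the same recorded sequence of tasks through the loop and comparing the returned scores.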

So your test is more of an AI comparison tool?

Exactly. I’d hate to make a definitive prediction of what it will take for an AI to achieve humanlike intelligence. That’s a dangerous sort of thing to say.