Google’s AI passed a famous test — and showed how the test is broken
The Turing test has long been a benchmark for machine intelligence. But what it really measures is deception.
In 1950, the pioneering British computer scientist Alan Turing proposed a thought experiment he called the Imitation Game. An interviewer converses via typewriter with two subjects, knowing one is human and the other a machine. If a machine could consistently fool the interviewer into believing it was the human, Turing suggested, we might speak of it as capable of something like thinking.
Whether machines could actually think, Turing believed, was a question “too meaningless to deserve discussion.” Nonetheless, the “Turing test” became a benchmark for machine intelligence. Over the decades, various computer programs vied to pass it using cheap conversational tricks, with some success.
In recent years, wealthy tech firms including Google, Facebook and OpenAI have developed a new class of computer programs known as “large language models,” with conversational capabilities far beyond the rudimentary chatbots of yore. One of those models — Google’s LaMDA — has convinced Google engineer Blake Lemoine that it is not only intelligent but conscious and sentient.
If Lemoine could be taken in by LaMDA's lifelike responses, it seems plausible that many other people with far less understanding of artificial intelligence (AI) could be as well, which speaks to the technology's potential as a tool of deception and manipulation in the wrong hands.
To many in the field, then, LaMDA’s remarkable aptitude at Turing’s Imitation Game is not an achievement to be celebrated. If anything, it shows that the venerable test has outlived its use as a lodestar for artificial intelligence.
“These tests aren’t really getting at intelligence,” said Gary Marcus, a cognitive scientist and co-author of the book “Rebooting AI.” What they’re getting at is the capacity of a given software program to pass as human, at least under certain conditions. Which, come to think of it, might not be such a good thing for society.
“I don’t think it’s an advance toward intelligence,” Marcus said of programs like LaMDA generating humanlike prose or conversation. “It’s an advance toward fooling people that you have intelligence.”
Lemoine may be an outlier among his peers in the industry. Both Google and outside experts on AI say that the program does not, and could not possibly, possess anything like the inner life he imagines. We don’t need to worry about LaMDA turning into Skynet, the malevolent machine mind from the Terminator movies, anytime soon.
But there is cause for a different set of worries, now that we live in the world Turing predicted: one in which computer programs are advanced enough that they can seem to people to possess agency of their own, even if they actually don’t.
Cutting-edge artificial intelligence programs, such as OpenAI’s GPT-3 text generator and image generator DALL-E 2, are focused on generating uncannily humanlike creations by drawing on immense data sets and vast computing power. They represent a far more powerful, sophisticated approach to software development than was possible when programmers in the 1960s gave a chatbot called ELIZA canned responses to various verbal cues in a bid to hoodwink human interlocutors. And they may have commercial applications in everyday tools, such as search engines, autocomplete suggestions, and voice assistants like Apple’s Siri and Amazon’s Alexa.
It’s also worth noting that the AI sector has largely moved on from using the Turing test as an explicit benchmark. The designers of large language models now aim for high scores on tests such as the General Language Understanding Evaluation, or GLUE, and the Stanford Question Answering Dataset, or SQuAD. And unlike ELIZA, LaMDA wasn’t built with the specific intention of passing as human; it’s just very good at stitching together and spitting out plausible-sounding responses to all kinds of questions.
Yet beneath that sophistication, today’s models and tests share with the Turing test the underlying goal of producing outputs that are as humanlike as possible. That “arms race,” as the AI ethicist Margaret Mitchell called it in a Twitter Spaces conversation with Washington Post reporters on Wednesday, has come at the expense of all sorts of other possible goals for language models. Those include ensuring that their workings are understandable and that they don’t mislead people or inadvertently reinforce harmful biases. Mitchell and her former colleague Timnit Gebru were fired by Google in 2021 and 2020, respectively, after they co-authored a paper highlighting those and other risks of large language models.
While Google has distanced itself from Lemoine’s claims, it and other industry leaders have at other times celebrated their systems’ ability to trick people, as Jeremy Kahn pointed out this week in his Fortune newsletter, “Eye on A.I.” At a public event in 2018, for instance, the company proudly played recordings of a voice assistant called Duplex, complete with verbal tics like “umm” and “mm-hm,” that fooled receptionists into thinking it was a human when it called to book appointments. (After a backlash, Google promised the system would identify itself as automated.)
“The Turing Test’s most troubling legacy is an ethical one: The test is fundamentally about deception,” Kahn wrote. “And here the test’s impact on the field has been very real and disturbing.”
Kahn reiterated a call, often voiced by AI critics and commentators, to retire the Turing test and move on. Of course, the industry already has, in the sense that it has replaced the Imitation Game with more scientific benchmarks.
But the Lemoine story suggests that perhaps the Turing test could serve a different purpose in an era when machines are increasingly adept at sounding human. Rather than being an aspirational standard, the Turing test should serve as an ethical red flag: Any system capable of passing it carries the danger of deceiving people.
How an AI managed to confuse humans in an imitation game
Here’s how “hints of humanness” may have come into play in the Italian experiment.
Researchers from the Italian Institute of Technology have trained an artificially intelligent computer program to pass a nonverbal Turing test. By behaving as a human would, for example by varying its reaction times while playing an interactive shape- and color-matching game, the AI was able to hide its true identity from people. The study was published this week in the journal Science Robotics.

A Turing test is a standard used by computer scientists to determine whether a machine can display a convincing enough level of humanlike intelligence to fool another human into believing that it, too, could be human. This is usually done through a version of “the imitation game,” which works like this: There are three players in total. One player is unable to see the other two but can communicate with them, asking questions or interacting with them to determine which of the two is human and which is machine. A computer passes the Turing test when the human questioner can’t distinguish between the answers of the human and those of the computer.

“Given how our brains are tuned to detect very subtle and implicit signals of human-like behavior, it is worth asking whether a humanoid robot can ever pass the nonverbal version of the Turing test by embodying human characteristics in its physical actions,” the researchers wrote in the study.

In their experiment, they put a human and a robot together in a room. Both had to observe colored shapes on a screen, and each was in charge of responding to one color only. For example, one participant might be tasked with pressing a button only when a red square appeared on the screen, while the other pressed a button only when a green square appeared. But here’s the twist: The robotic teammate was remotely controlled, sometimes by a human and sometimes by AI.
The AI took into account how long its human teammate took to respond and randomly varied its own reaction time between seeing the shape and pressing the button, in order to appear more human.
After running through the game together, the human participants were asked to guess whether their robot teammate was run by another human or by the AI-powered computer program. While people could usually tell when another human was behind the robot, they had trouble when it came to the AI, and guessed wrong more than 50 percent of the time. “Our results suggest that hints of humanness, such as the range of behavioral variability, might be used by observers to ascribe humanness to a humanoid robot,” the researchers wrote. “This provides indications for robot design, which aims at endowing robots with behavior that can be perceived by users as human-like.”
This is not the first time a machine has passed the Turing test. In a 2014 event organized by the University of Reading, a computer program convinced one-third of the human judges at the Royal Society in London that it was a 13-year-old boy. And just last month, one of Google’s AI systems also passed a version of the test, igniting controversy over the ethics of these types of programs. Many scientists, though, have noted that while passing the Turing test is a meaningful milestone, inherent flaws in the test’s design mean it cannot measure whether machines are actually thinking, and therefore cannot prove true general intelligence.