Has Speech Crossed the Uncanny Valley?

Japanese robotics professor Masahiro Mori first proposed the concept of the uncanny valley in a 1970 essay, suggesting that computerized systems that too closely resemble human beings provoke uncanny or strange feelings of uneasiness in the people who encounter them.

Mori’s original hypothesis stated that as robots are made to appear more human, some observers’ emotional responses become increasingly positive and empathetic, until they reach a point beyond which the response quickly becomes strong revulsion. However, as the robot’s appearance continues to become less distinguishable from a human being, the emotional response becomes positive once again and approaches human-to-human empathy levels.

To give the theory some context, consider the following paradox: On one hand, bestowing machines with human attributes gives them the advantage of familiarity, which can facilitate communication. On the other hand, machines’ human appearance or behavior might trigger users to expect human abilities that cannot always be fulfilled.

That was 50 years ago, when computer technology wasn’t nearly as advanced as it is today.

Thanks to huge leaps forward in artificial intelligence, natural language processing, advanced analytics, neural networks, speech synthesis, voice cloning, noise cancellation, and so many other technologies, almost every type of speech engine these days claims humanlike results. In some cases and environments, speech technology vendors boast that their speech recognition engines yield word error rates even lower than those of human transcribers and that humans exposed to their text-to-speech voices can barely distinguish them from actual humans. Speech technology, the industry asserts, is more accurate, more realistic, comparable to human speech in many ways, and able to reproduce the nuances of human speech, speaking patterns, and language.

So how comfortable are people interacting with these technologies today? They have far less apprehension than Professor Mori might have thought.

As more speech interfaces become available and users see the benefits of using them, speech will emerge from the uncanny valley, predicts Jim Larson, senior advisor to the Open Voice Network. “Many users are now used to interacting with [Amazon’s] Alexa and no longer think of her as creepy because they find her convenient when performing simple tasks,” he says.

Dan Miller, Opus Research’s founder and lead analyst, largely agrees. “Conversational interactions are starting to feel more natural, but they have largely been through chatbots, search boxes (text), and other text messaging platforms,” he says. “We humans are getting trained to feel more comfortable conversing with bots. That’s how we’re building our own tactics for getting results from interacting with self-service resources.”

People might no longer fear these automated interactions outright, but one element still leaves some of them with a sense of trepidation.

The perception still exists—albeit undeserved—that Apple’s Siri, Amazon’s Alexa, and similar systems linger in active listening mode, collecting utterances not intended for them, storing them in the cloud, and passing them on to companies so they can personalize advertising based on the conversations that they overhear. Makers of those systems have fought back vigorously against that perception, saying that their systems only act after the user utters the prescribed wake word, such as “Alexa” or “Hey Siri.”
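The wake-word design the vendors describe can be sketched in a few lines: audio is inspected locally, and nothing leaves the device until the keyword is spotted. This is an illustrative simplification only — the plain string match below stands in for the on-device keyword-spotting model a real assistant would use:

```python
WAKE_WORDS = ("alexa", "hey siri")

def detect_wake_word(transcript: str) -> bool:
    """Stand-in for an on-device keyword spotter: fires only on a wake word."""
    return any(w in transcript.lower() for w in WAKE_WORDS)

def process_audio_stream(transcripts):
    """Forward utterances to the cloud only after a wake word is heard.

    Everything heard before the wake word is dropped locally, mirroring
    the vendors' claim that devices act only once the wake word is spoken.
    """
    sent_to_cloud = []
    listening = False
    for utterance in transcripts:
        if not listening:
            if detect_wake_word(utterance):
                listening = True   # start capturing the actual request
        else:
            sent_to_cloud.append(utterance)
            listening = False      # one request per wake word, then go idle
    return sent_to_cloud

requests = process_audio_stream([
    "private dinner conversation",
    "alexa",
    "what's the weather tomorrow",
    "more private chatter",
])
# Only the utterance following the wake word is forwarded.
```

In this sketch the surrounding conversation never reaches `sent_to_cloud`, which is the behavior the device makers say their products implement.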

Debbie Dahl, principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group, isn’t as convinced that we’ve completely crossed the chasm just yet. “Text-to-speech by itself is getting very good, especially over the last year or so. I’ve listened to many synthetic voices that are good enough to be confused with human speech, at least for short utterances. I don’t know if that’s enough to make people more comfortable talking to the robots that use them, though,” she says.

Part of why Dahl hesitates to say that the uncanny valley has been completely left behind stems from some aspects that have less to do with the speech technologies themselves.

“The uncanny valley comes not just from speech but from everything we can perceive about an artificial being—not only its speech, but the words and phrasings that it chooses and especially its movements and appearance, if you can see it,” she says. “Any of these can send the artificial being directly into the uncanny valley, even if one of them is perfectly humanlike.

“It might even make people more uncomfortable if something else about the robot is uncanny valley-like,” she continues, “because a weird appearance or strange wording choices will contaminate the whole experience.”

Some of it does have to do with current technology and its apparent limitations, according to Dahl.

“The most fascinating unmet challenge to me about synthetic speech is how speech that sounds extremely realistic over short utterances starts to sound more and more mechanical as the passages get longer,” she says. “This is still true, even with today’s technology.”

Dahl explains that this happens because as the text passages get longer, it becomes increasingly important to nail down the prosody, including pitch, tone, volume, pauses, and emphasis, so that the intended meaning is conveyed.
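Dahl's point about prosody can be made concrete. Many speech synthesis APIs accept SSML, the W3C Speech Synthesis Markup Language, whose `<prosody>` and `<break>` elements let an application spell out the pitch, rate, and pauses it wants rather than leaving them to the engine. A minimal sketch follows; the `to_ssml` helper and its tuple format are illustrative, not from any particular vendor's API:

```python
from xml.sax.saxutils import escape

def to_ssml(sentences):
    """Wrap (text, pitch, rate, pause_ms) tuples in SSML prosody markup.

    SSML's <prosody> element controls pitch and speaking rate; <break>
    inserts an explicit pause. Values such as "+5%" and "slow" follow
    the W3C SSML 1.1 specification.
    """
    parts = ["<speak>"]
    for text, pitch, rate, pause_ms in sentences:
        parts.append(
            f'<prosody pitch="{pitch}" rate="{rate}">{escape(text)}</prosody>'
        )
        parts.append(f'<break time="{pause_ms}ms"/>')
    parts.append("</speak>")
    return "".join(parts)

# A narrator might slow down and drop pitch for the second clause:
ssml = to_ssml([
    ("It was the best of times,", "+5%", "medium", 300),
    ("it was the worst of times.", "-10%", "slow", 600),
])
print(ssml)
```

The difficulty Dahl describes is that for a long passage someone, or something, still has to decide what those pitch, rate, and pause values should be at every point; the markup only carries the decision.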

Addressing this challenge will take more work from the vendors of synthetic speech technologies, according to Dahl, who cites the need for two key technical advances.

“First, we have to get the natural language understanding for longer passages right. Second, we have to understand how to represent that understanding with the right prosody,” she says.

But that might actually make the technology more like a human. As Dahl argues, “Not very many people can do a good job reading long passages with the right prosodic nuances. It takes a pretty skilled actor to do a good job recording audiobooks. It is hardly surprising that it’s challenging for speech systems.”

Several research studies on that topic have uncovered some rather surprising results that further shape uncanny valley discussions. The most relevant was done at the University of Central Florida; it found that higher-fidelity synthesized speech (or speech that sounds more human) is not necessarily more favorable.

The research compared standard computer-generated speech, higher-fidelity computer-generated speech based on neural networks, and synthetic speech based on recordings of the voices of real humans. It found that humans by and large viewed the recorded speech more positively than both synthetic speech options and that there was little change in perception whether neural network or standard concatenative text-to-speech was used.

Similar studies found that animated movie characters voiced by human actors were viewed more favorably than those voiced by synthetic TTS.

Still other research has yielded one very simple truth: Humans don’t have a problem with voice recognition when it works, but when they have to repeat themselves or it just doesn’t understand them, the level of frustration can be very high.

Moshe Yudkowsky, president of Disaggregate Consulting, places speech at another precarious position, teetering close to yet another stage in the development of technology that he calls the “annoying valley.” In proposing this stage, Yudkowsky defines the annoying valley as the “long stretch of time between when the technology is kicked out the door and slams into the customer’s face and when it improves enough to stop being a nuisance to work around.”

Larson likes Yudkowsky’s annoying valley terminology but defines it just a little differently. For him, the annoying valley lies “between introduction of a new technology and when developers create useful and usable applications of the technology that don’t annoy users.”

Getting past the annoying stage might be harder than getting past the uncanny stage, according to Larson. “Some speech applications are being dismissed due to flaws in how speech is used,” he states. “While Alexa is becoming less creepy, she is annoying when she bursts into conversations, especially Zoom conversations, uninvited. Other speech applications are annoying because they add to noise pollution or interrupt the user.”

There are several theories about where the annoying stage begins and ends and what it will take to bridge that gap.

“Natural language processing married to [automatic speech recognition] is getting pretty good, but most of the solutions in place as speech-enabled [interactive voice response systems] are still pretty rigid, hard-wired, and prone to the classic failure to recognize topics or intents that are outside their vocabulary. That’s the annoying part,” Miller laments.

One Step Forward, Two Steps Back?

And then there is always the risk that even after the uncanny valley has been crossed, speech technology could fall back into the creepy abyss or be thrown there by other forces.

“Users will be thrown back into the uncanny valley when they encounter AI-induced falsehoods (called hallucinations) in speech and conversational text generated by AI-based agents trained on large databases of knowledge containing false statements gleaned from the web,” Larson suspects. “Many people will not distinguish between the creepiness of speech recognition and the super-creepiness of these AI-induced hallucinations. I suspect hallucinations and regurgitated untruths will be a major problem with conversational systems.”

Miller sees another outside influence that can be even more detrimental:

“There are bad actors that discover and foster creepy or malicious applications. Already we’ve seen synthesized voice used to impersonate a bank customer. We’ve had mixed feedback about the use of speech synthesis as a translator or accent eraser. And I can picture use cases where a lifelike voicebot doesn’t have to identify itself as a robot in order to carry out tasks for customers or employees,” Miller says.

The accent eraser to which Miller was referring comes from Sanas, which offers real-time speech-to-speech accent translation technology. Sanas’s software intercepts audio and converts accents through a speech-to-speech approach, building a virtual bridge between the audio device and the computer and then sending the new signal to whichever communication app (Zoom, Google Hangouts, etc.) is in use. Almost instantly, the accent of a customer care representative could be matched to the accent of an incoming caller; an agent in Spain, for example, could have her accent artificially altered to sound German when communicating with a customer in Austria.

Admittedly, more research is needed to determine how receptive customers might be to that capability. Many industry studies have found that people value qualities like honesty, sincerity, and authenticity in their dealings with technology.

Maxim Serebryakov, CEO of Sanas, says his company’s accent-changing technology has huge potential to change the world.

“The world has shrunk, and people are doing business globally, while at the same time they have real difficulty understanding each other,” he said in a statement. “Digital communication is critical for our daily lives.”

Like many other speech technologies, the first applications for Sanas’s technology are in customer care and technical support, but Serebryakov also sees possibilities in entertainment and media, for example, where producers could make their films and programs understandable in different parts of the world by matching accents to localities.

But the real need now is in the contact center, where Larson sees a day when synthesized speech could replace human agents, who might come with accents or other barriers to intelligibility. Background noise and long wait times in call center applications are other areas of improvement where speech can help.

Going forward, for speech to stay a few steps ahead of the uncanny valley, the industry will need to keep innovating and bringing the technology to other domains. The stage is already set, according to Miller.

“Speech-enabled IVRs and voicebots are not as far along as chatbots and the personal assistants that pop up to assist customers or employees,” he says. “Text-based assistants are demonstrating value by doing such things as summarizing conversations on Zoom, Slack, or between customers and agents; drafting emails or chat responses which can be personalized by an employee; or carrying out background searches in the midst of a conversation among people.”

Such technologies, he adds, “are engaging and often have positive outcomes.”

The next step, Miller adds, is to speech-enable popular apps for text-based chatbots or personal assistants. “It won’t feel creepy if it’s truly helpful,” he says.

Yudkowsky also sees an opportunity here. “Even now, ChatGPT can transcend its text-based limitations. But its creativity is still questionable,” he says.

And the opportunities for speech don’t stop there.

In the end, few would dispute that smart devices will keep proliferating in homes and businesses, and speech technology is likely to become increasingly ubiquitous as it gets integrated into more devices, platforms, and applications. Greater acceptance of the technology can only follow from there.

Leonard Klie is the editor of Speech Technology. He can be reached at lklie@infotoday.com.
