SpeechTEK Speakers Call for Conversational Technologies
NEW YORK—While there have been no major breakthroughs in the industry in a few years, speech technology vendors are making small advances in updating speech systems to be more conversational, adaptive, and natural, speakers maintained during an early morning panel at the SpeechTEK conference Monday.
At IBM, for example, much of the current focus is not just on speech recognition but on "technologies that make machines act more like us," said David Nahamoo, speech chief technology officer at IBM Research.
Speech recognition, Nahamoo said, has benefited from recent advances in neural networks and machine learning. Together with Web technologies, neural networks have sped up the pace of innovation and improved recognition accuracy, he said. Even so, the technology still has a way to go before it can mimic true dialogue.
But where speech continues to come up short, he added, is in the ability to conduct a real dialogue. "There's been no real progress on this in thirty years," he stated. "It's been stuck in the mud for a while."
Dan Miller, founder and senior analyst at Opus Research, also sees the need for speech to become more conversational. This, he said, is his vision for the technology.
While advances have taken place in semantics, natural language understanding, and even artificial intelligence, making speech more natural and adaptive has been elusive, according to Miller.
The biggest change needed, he added, is not in quality but in "how fluid it is and how easy it is to interact with."
Roberto Pieraccini, director of advanced communication technologies at Jibo, said another technology on the horizon blends speech with cameras and facial recognition to allow systems to read lips as a way to improve speech recognition accuracy.
In an afternoon session, Nandini Stocker, senior voice user interface (VUI) designer at Flare Design, pointed out that speech systems will always have difficulty mirroring human-to-human conversations because computers cannot pick up the nonverbal turn-taking cues that people give off when engaging with others.
There's a real art and science to designing interactive voice response (IVR) systems that get the timing right between questions, she said. And then it takes a lot of trial and error to provide the right amount of time for the caller to respond to a question in the IVR, she added.
Even then, it's hard for a system to anticipate all the ways that customers can respond, and simple things like repeating an answer can throw the system into error mode, Stocker said.
VUI designers can help the system by turning barge-in on and off as appropriate and giving prompts that guide how the caller responds, she added, but the most important step is for the designer "to be clear about what you're looking for."
To that end, Stocker strongly urged session attendees to include statements that tell the caller how to answer a prompt. With a bank IVR, for example, a prompt that asks the caller the reason for the call could end with "You can say account balance or make a payment."
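As a rough illustration of Stocker's advice, the sketch below builds a prompt that ends by telling the caller what to say and then checks a transcribed reply against those options. It is a minimal sketch, not any vendor's IVR API: the `Prompt` class, its `barge_in` and `timeout_seconds` fields, the default values, and the `match_response` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prompt:
    """Hypothetical IVR prompt; field names and defaults are illustrative only."""
    question: str
    options: List[str]
    barge_in: bool = True          # assumed flag: may the caller interrupt the prompt?
    timeout_seconds: float = 4.0   # assumed wait time; in practice tuned by trial and error

    def text(self) -> str:
        # End the prompt by telling the caller how to answer.
        if len(self.options) == 1:
            hint = self.options[0]
        else:
            hint = ", ".join(self.options[:-1]) + " or " + self.options[-1]
        return f"{self.question} You can say {hint}."


def match_response(prompt: Prompt, utterance: str) -> Optional[str]:
    """Return the first expected option found in the transcribed utterance."""
    heard = utterance.lower()
    for option in prompt.options:
        # Substring matching means a repeated answer still resolves to one option.
        if option.lower() in heard:
            return option
    return None  # no match: the caller would be re-prompted


bank_prompt = Prompt(
    question="What are you calling about?",
    options=["account balance", "make a payment"],
)
print(bank_prompt.text())
print(match_response(bank_prompt, "Account balance, account balance"))
```

Matching on substrings rather than exact strings is one simple way to keep a repeated answer from pushing the system into an error state. A production system would rely on its recognizer's grammar or a natural language understanding component instead.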
Also key to making speech systems more conversational will be incorporating emotion detection, Nahamoo contended. This involves not just identifying the emotional state of a caller to an IVR, for example, but also shaping how the system responds. The emotional state the system portrays needs to be appropriate to the type of application, the company, and other factors, he and other speakers maintained.
Speech will also need to become more multichannel and multimodal, particularly taking advantage of advances in mobile technology, added Tom Schalk, vice president of voice technology at satellite radio provider Sirius XM. "With the use of mobile, the use of voice has gone up as well," he stated.
In that regard, the opportunities for more virtual assistant applications such as Apple's Siri and Microsoft's Cortana will expand greatly, Schalk argued.
As those applications expand, authentication will also need to be multifactor, blending technologies beyond just basic speaker identification and voice biometrics, according to Bernie Brafman, vice president of marketing at Sensory. Sensory's TrulySecure solution, for example, merges speech with facial recognition to improve authentication for mobile devices, Brafman said.