Speech Recognition Nears Human Accuracy
IBM says its speech recognition system has reached what it calls a new industry-record error rate of 5.5 percent on the Switchboard conversational speech benchmark, down from the 6.9 percent achieved a year ago and close to the 5.1 percent error rate of humans. The company credits improvements in deep learning and computer processing for the increased speech capability.
“The important thing is that these [artificial intelligence] technologies are producing increasing improvement in cognitive learning and speech recognition,” says Michael Picheny, senior manager of IBM Watson multimodal. “It used to take many years for these types of improvements, but deep learning is very powerful.”
In a company blog post, George Saon, IBM’s principal research scientist, explained that IBM combined long short-term memory (LSTM) and WaveNet language models with three strong acoustic models, including one that had multiple feature inputs and another that was trained with speaker-adversarial multitask learning.
“The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples, so it gets smarter as it goes and performs better where similar speech patterns are repeated,” Saon wrote.
“We worked to reproduce human-level results with the help of our partner, Appen, which provides speech and search technology services,” Saon added. “As part of our research efforts, we connected with different industry experts to get their input on this matter too.”
Microsoft, meanwhile, has made similar breakthroughs in speech recognition accuracy. Last October, researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition word error rate of just 5.9 percent.
“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, executive vice president of the Microsoft Artificial Intelligence and Research group.
In Microsoft’s case, much of the recognition accuracy is being attributed to the use of the latest neural network technology available through the Microsoft Cognitive Toolkit.
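For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not IBM's or Microsoft's scoring code, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization, whereas published benchmarks typically normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: words are separated by whitespace and compared exactly.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

By this measure, a 5.9 percent word error rate corresponds to roughly six word-level errors for every 100 words in the reference transcript.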
Moving forward, Microsoft researchers are working on ways to make sure that speech recognition works well in more real-life settings and to help the technology recognize individual speakers when multiple people are talking. A longer-term goal is to teach computers not just to transcribe the words that come out of people’s mouths but to answer questions or take action based on what they are told.
Mary Meeker, a partner at Silicon Valley venture capital firm Kleiner Perkins Caufield & Byers, also noted the trend toward greater speech recognition accuracy in her company’s Internet Trends Report last year. When it comes to speech-enabled search, she cited Chinese firm Baidu’s accuracy at 96 percent, SoundHound’s Hound and Apple’s Siri at 95 percent, and Google Now at 92 percent.
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex,” added Julia Hirschberg, professor and chair of the Department of Computer Science at New York’s Columbia University, in a statement. “When we compare automatic recognition to human performance, it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated. This scientific achievement is in its way as impressive as the performance of [IBM’s] current ASR technology and shows that we still have a way to go for machines to match human speech understanding.”