IBM Speech Recognition Nears Human Rate

IBM says its Switchboard speech recognition system has reached what it is calling a new industry-record error rate of just 5.5 percent, down from the 6.9 percent achieved a year ago and approaching the 5.1 percent error rate of humans.
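The percentages being compared here are word error rates (WER): the number of word substitutions, insertions, and deletions in the recognizer's output, divided by the number of words in a reference transcript. As a rough illustration (not IBM's evaluation code), WER can be computed with a word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six: WER ≈ 16.7%
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A 5.5 percent WER thus means roughly one word in eighteen is recognized incorrectly, against a human baseline of about one in twenty.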

"The important thing is that these AI technologies are producing increasing improvement in cognitive learning and speech recognition," says Michael Picheny, senior manager of IBM Watson multimodal. "It used to take many years for these types of improvements, but deep learning is very powerful. There is increased computational capability. Computing GPUs are increasing by a factor of two every year."

The combination of deep learning and GPU capacity is what makes these leaps in speech recognition capability possible, Picheny explains.

In a company blog post, George Saon, IBM's principal research scientist, explained that IBM combined Long Short-Term Memory (LSTM) and WaveNet language models with three strong acoustic models. Of the three acoustic models, the first two were six-layer bidirectional LSTMs. One of these had multiple feature inputs, while the other was trained with speaker-adversarial multi-task learning.

"The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples so it gets smarter as it goes and performs better where similar speech patterns are repeated," Saon wrote.
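To make the architecture description concrete, the sketch below shows what a six-layer bidirectional LSTM acoustic model looks like in PyTorch. All dimensions, layer sizes, and the output-class count here are illustrative assumptions, not IBM's published configuration, and the adversarial training branch is omitted:

```python
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    """Hypothetical six-layer bidirectional LSTM acoustic model.
    Maps per-frame acoustic features to per-frame phonetic-state scores."""

    def __init__(self, n_features=40, hidden=512, n_layers=6, n_classes=9000):
        super().__init__()
        # bidirectional=True runs the sequence forward and backward,
        # so each frame's representation sees both past and future context
        self.lstm = nn.LSTM(n_features, hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        # forward + backward states are concatenated, hence 2 * hidden
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, features) acoustic feature frames
        out, _ = self.lstm(x)
        return self.classifier(out)  # (batch, time, n_classes) frame scores

model = BiLSTMAcousticModel()
frames = torch.randn(2, 100, 40)  # 2 utterances, 100 frames of 40-dim features
scores = model(frames)
print(scores.shape)  # torch.Size([2, 100, 9000])
```

In the speaker-adversarial variant Saon describes, a second classifier head would try to predict the speaker from the same hidden states while a gradient-reversal layer penalizes it, pushing the shared layers toward speaker-independent representations.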

"We worked to reproduce human-level results with the help of our partner, Appen, which provides speech and search technology services," Saon added. "As part of our research efforts, we connected with different industry experts to get their input on this matter too."

Companies, including IBM, are advancing their cognitive learning capabilities by leveraging cloud-based systems to add vast amounts of computing capability at very low cost, Picheny adds. "It's a very exciting time for cognitive computing."

"The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex," added Julia Hirschberg, Columbia University professor and chair of the department of computer science, in a statement. "When we compare automatic recognition to human performance, it's extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated. This scientific achievement is in its way as impressive as the performance of [IBM's] current ASR technology, and shows that we still have a way to go for machines to match human speech understanding."

As the recognition rate continues to improve, Picheny sees speech recognition becoming embedded into sensor technology at an ever faster rate.

"We can quickly add intelligence to almost anything in the environment," Picheny says. "For example, in health care, someone wearing a sensor can fall down, then quickly interact with a health care professional to describe any problem."
