Baidu's Deep Speech Recognition Bests Google, Apple, and Bing

If you've taken a call in a public place, you're likely familiar with the practice of holding a finger in your non-phone ear to hear the person on the other end of the call. But this could become a thing of the past, thanks to Baidu, the Chinese Web search engine giant that recently unveiled a speech recognition system called Deep Speech.

Baidu's lead scientist, Andrew Ng, and his colleagues at Baidu Research have developed Deep Speech, which improves speech recognition accuracy in noisy environments as well as far-field and reverberant scenarios. The researchers said that central to Deep Speech's success is a "well-optimized recurrent neural net training system that uses multiple graphics processing units." Results of the tests were published in December on the Cornell University–funded Web site arXiv.

"Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments," wrote the authors of the paper. "In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects.

"We do not need a phoneme dictionary, nor even the concept of a phoneme. Using a combination of collected and synthesized data, our system learns robustness to realistic noise and speaker variation," the article continued. "Taken together, these ideas suffice to build an end-to-end speech system that is at once simpler than traditional pipelines yet also performs better on difficult speech tasks."

Baidu scientists found that:

Deep Speech outperformed previously published results on the widely studied Switchboard Hub500 (a corpus of speech data used across systems), obtaining a 16.5 percent word error rate.
Deep Speech outperformed public Web APIs (Google Web Speech, Wit.ai), as well as commercial systems (Bing Speech Services, Apple Dictation), especially in noisy backgrounds.
In noisy environments, Deep Speech's word error rate was 10 percent better than competing products.
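
For readers unfamiliar with the metric, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Baidu's evaluation code, the function name `word_error_rate` is hypothetical, and it assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: whitespace tokenization and no normalization
    are assumptions, not details taken from the Deep Speech paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 16.5 percent word error rate means roughly one word-level error for every six words in the reference transcript.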

"The Baidu results for speech recognition are extremely promising, and it seems very likely that as this technology moves out from the lab into real-world applications, it'll provide significantly more accurate speech recognition, especially for speech in noisy environments," said Deborah Dahl, principal of Conversational Technologies, in an email. "The Baidu deep learning approach, combined with use of massive computer resources for training, makes it much more practical to use large amounts of speech data for training, and more data is the biggest contributor to speech recognition performance. The test results on the Switchboard corpus are especially encouraging, because the Switchboard data is very challenging."

Besides Baidu, many other companies, including Google, Apple, IBM, and Microsoft, are working on deep neural networks for speech recognition. Bill Meisel, president of TMA Associates, says neural networks are not a new idea.

"These are more complex models than the conventional Hidden Markov Model [HMM] technique, which has been around since at least the 1980s," said Meisel in an email. "I do see Deep Speech as disruptive. HMM technology involves making some assumptions, such as which phonemes to use in the modeling. Deep neural networks can avoid making some assumptions by, in effect, letting layers of the network derive the best representation. It may even be more disruptive if such models are used for the next step, natural language interpretation."

Dahl has a different take on Baidu. While she praises the company's research, she believes that it is "more incremental than disruptive."

"There's a lot of room for improvement," she says, noting that the Baidu system still has a reported 16.5 percent word error rate. "In addition, while improvements in speech recognition by itself will be helpful for applications such as Web search and dictation, intelligent interactive systems will also require improvements in natural language processing and dialogue management."

Ng, who has worked with deep learning and speech recognition technology for several years and was responsible for founding and leading the team at Google's Brain project, says that he is pleased with Deep Speech's results.

"Deep learning, trained on a huge data set—over 100,000 hours of synthesized data—is letting us achieve significant improvements in speech recognition," he said in a statement. "I'm excited by this progress because I believe speech will transform mobile devices, as well as the Internet of Things. This is just the beginning."