Speech Technology Magazine

 

Q&A: Recent Deep Neural Net (DNN) Advances in Speech

Posted Apr 4, 2018

For over 30 years, David Thomson has developed and managed software and algorithms for speech recognition, speech synthesis, speech compression, and voice biometrics, and has managed speech R&D for voice interface systems that have handled more than 20 billion calls. He is presenting the SpeechTEK University course, “Recent Deep Neural Net (DNN) Advances in Speech,” at SpeechTEK 2018. He recently joined CaptionCall, a company that builds services for the deaf and hard of hearing. Conference program chair James A. Larson interviewed Thomson in advance of this year’s event.

Q: How have speech recognition methods and accuracy changed in recent years?

A: Automatic speech recognition (ASR) accuracy began to climb around 2010 with the application of deep neural nets (DNNs). We also have access to more data, faster processing, and more experience.
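
To make that concrete, below is a minimal sketch (in Python, using PyTorch) of the kind of feedforward DNN acoustic model behind those gains: it maps a window of stacked acoustic feature frames to scores over phoneme-like states. The layer sizes, the 40-dimensional filterbank features, and the 11-frame context window are illustrative assumptions, not any particular production system.

    # Hypothetical feedforward acoustic model: stacked feature frames in,
    # one score per HMM-style state out. All sizes are illustrative.
    import torch
    import torch.nn as nn

    class AcousticDNN(nn.Module):
        def __init__(self, feat_dim=40, context=11, hidden=1024, num_states=3000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim * context, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_states),  # one logit per state
            )

        def forward(self, x):
            # x: (batch, feat_dim * context), a window of stacked frames
            return self.net(x)

    model = AcousticDNN()
    frames = torch.randn(8, 40 * 11)              # 8 windows of fake features
    log_post = model(frames).log_softmax(dim=-1)  # log state posteriors
    print(log_post.shape)                         # torch.Size([8, 3000])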

Q: Has ASR matched human accuracy?

A: No. You see press reports on how ASR matches or beats humans on a particular test set, but in general, humans still reign. ASR is getting better and the gap is narrowing, but we’re still 10 to 20 years from ASR that scores consistently better than a live agent.
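
For context, the accuracy behind “matches or beats humans” claims is usually scored as word error rate (WER): substitutions, deletions, and insertions divided by the number of reference words, computed with an edit distance. Here is a generic sketch in Python, not tied to any particular benchmark or toolkit.

    # Word error rate via Levenshtein edit distance over words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edits between the first i ref words and first j hyp words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("how may i help you", "how may i held you"))  # 0.2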

Q: What can we do with the new ASR technology we couldn't before?

A: It isn’t so much that we have new applications; rather, the old applications are now good enough to attract mainstream users. Since vocabulary size and accuracy have improved significantly, applications like video captioning, call transcription, “How may I help you?” prompts, virtual assistants, and language translation that sort of worked before are now much more comfortable, more natural, and more popular.

Q: Why did we have to wait so long for neural nets (invented in 1958) to deliver these big improvements?

A: We needed a lot more data and a lot more processing power. Also, the pre-DNN methods (like Gaussian mixture models, or GMMs) were highly refined, so it took the research community a long time to get everything exactly right before DNNs overtook them.
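
As a rough illustration of that pre-DNN recipe, the sketch below (Python, using scikit-learn) fits one Gaussian mixture model per sound class over acoustic features and classifies a frame by which model scores it higher. The 13-dimensional features and the two classes are made-up stand-ins for real MFCC vectors and a real phone inventory.

    # Toy GMM classifier: one mixture model per class, pick the class
    # whose model assigns the frame the higher log-likelihood.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    class_a = rng.normal(0.0, 1.0, size=(500, 13))  # fake frames, class "a"
    class_b = rng.normal(2.0, 1.0, size=(500, 13))  # fake frames, class "b"

    gmm_a = GaussianMixture(n_components=4, random_state=0).fit(class_a)
    gmm_b = GaussianMixture(n_components=4, random_state=0).fit(class_b)

    frame = rng.normal(2.0, 1.0, size=(1, 13))      # unseen frame to classify
    scores = {"a": gmm_a.score(frame), "b": gmm_b.score(frame)}
    print(max(scores, key=scores.get))              # prints "b" here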

Q: What do you expect to happen in the next 5 years?

A: DNNs have a lot of mileage left, so we’re going to see continued accuracy gains. We’re also going to see a lot more DNNs running on custom hardware. I know this prediction has been made for decades and general-purpose processors always end up winning, but I’m going to make it again: DNNs will run far better and cheaper on custom silicon. Finally, ASR is now good enough that it’s often not the limiting factor anymore, so I think more attention from machine learning experts is going to go toward natural language processing and advanced dialog design.

Q: What else (besides ASR) can you use neural nets for?

A: The range of stuff using DNNs is mind-boggling and the performance is astonishing. DNNs can color black and white photographs, predict the stock market, identify objects in an image, perform handwriting recognition, make a medical diagnosis from an X-ray, and power guidance for self-driving cars.
