Q&A: Recent Deep Neural Net (DNN) Advances in Speech
For over 30 years, David Thomson has developed and managed software and algorithms for speech recognition, speech synthesis, speech compression, and voice biometrics, and has managed speech R&D for voice interface systems that have handled over 20 billion calls. He is presenting the SpeechTEK University course, “Recent Deep Neural Net (DNN) Advances in Speech,” at SpeechTEK 2018. He recently joined CaptionCall, a company that builds services for the deaf and hard of hearing. Conference chair James A. Larson interviewed Thomson in advance of this year’s event.
Q: How have speech recognition methods and accuracy changed in recent years?
A: ASR accuracy began to climb around 2010 with the application of deep neural nets (DNNs). We also have access to more data, faster processing, and more experience.
Q: Has ASR matched human accuracy?
A: No. You see press reports on how ASR matches or beats humans on a particular test set, but in general, humans still reign. ASR is getting better and the gap is narrowing, but we’re still 10-20 years from ASR that scores consistently better than a live agent.
Q: What can we do with the new ASR technology we couldn't before?
A: It isn’t so much that we have new applications; rather, the old applications are now good enough to attract mainstream users. Since vocabulary size and accuracy have improved significantly, applications like video captioning, call transcription, “How may I help you?” prompts, virtual assistants, and language translation that sort of worked before are now much more comfortable, more natural, and more popular.
Q: Why did we have to wait so long for neural nets (invented in 1958) to deliver these big improvements?
A: We needed a lot more data and a lot more processing. Also, the pre-DNN methods (like GMMs) were highly refined, so it took the research community a long time to get everything exactly right before DNNs overtook them.
Q: What do you expect to happen in the next 5 years?
A: DNNs have a lot of mileage left, so we’re going to see continued accuracy gains. We’re going to see a lot more DNNs running on custom hardware. I know this prediction has been made for decades and general-purpose processors always end up winning, but I’m going to make it again. DNNs will run far better and cheaper on custom silicon. Finally, ASR is good enough that it’s often not the limiting factor anymore, so I think more attention from machine learning experts is going to go toward natural language processing and advanced dialog design.
Q: What else (besides ASR) can you use neural nets for?
A: The range of stuff using DNNs is mind-boggling and the performance is astonishing. DNNs can color black and white photographs, predict the stock market, identify objects in an image, perform handwriting recognition, make a medical diagnosis from an X-ray, and power guidance for self-driving cars.
So you've decided you need an Intelligent Virtual Assistant. Your competitors have one, and your customers expect to be able to communicate with you this way. But an IVA is an extension of your brand that deserves as much thought as any other platform. Speech Technology Magazine recently asked Jane Price, SVP of Marketing at Interactions, about the ins and outs of constructing an IVA and giving it personality through a persona.
At SpeechTEK 2018 in Washington D.C., the Speech Technology Magazine team had the chance to talk with a series of experts about the developments they anticipate in speech technology over the next year. Michael McTear, Allyson Boudousquie, Debra Cancro, and Crispin Reedy sat down to talk with us about the developments they see coming down the pike--and what they would like to see improved.
We all know a good presentation when we hear it--but the question is, which speech characteristics contribute to that greatness? Speech analytics can tell us.
Bruce Balentine is a Chief Scientist at Enterprise Integration Group specializing in speech, audio, and multimodal user interfaces. In almost three decades of work with speech recognition and related speech technologies, Balentine has designed user interfaces for telecommunications, desktop multimedia, entertainment, language training, medical, in-vehicle, and home automation products. Balentine will moderate the panel "How Can Digital Agents Make Use of User Emotion?" at the SpeechTEK conference in April. Conference chair James A. Larson interviewed Balentine in advance of this year's conference.
Crispin Reedy is a Voice User Experience designer and usability professional at Versay Solutions. She has over 15 years of experience on the front lines of the speech industry, in the design, usability, and tuning disciplines. She is presenting the SpeechTEK University course "Strategizing Customer Experiences for Speech" on Wednesday April 11 at SpeechTEK 2018. SpeechTEK program chair, James Larson, talked to Reedy in advance of her conference session.