The Outlook for Deep Neural Networks and Speech Technology

Deep neural networks (DNNs) are substantially improving the performance of speech-enabled interactive voice response (IVR) systems, enabling companies to expand their use of both technologies to automate calls.

Though neural networks showed great promise in the mid-1990s, their results proved disappointing, and the approach was all but dead until recent DNN methods began to power breakthroughs in performance and training speed.

Now DNNs are sophisticated enough not only to deliver natural-sounding speech, as in common consumer applications such as Apple’s Siri, Microsoft’s Cortana, and Amazon’s Echo, but also to power enterprise customer service systems and text-to-speech and speech-to-text applications.

Speech technology experts say that developers and users of these systems are just starting to scratch the surface of the interactions that these systems can perform, so they expect much greater market penetration and expansion of capabilities and usage over the next couple of years.

How They Work

DNNs act much like the human brain, explains David Thomson, vice president of speech research for Interactions.

Neural networks, which have been used in other applications for a few decades, are pieces of software that, in speech technology, sense signals. Just as the human brain has multiple layers of neurons that people use to piece together and understand elements of speech, DNNs use multiple layers of software code to parse signals into understandable elements of speech.

The initial layers take in the audio in stages and eliminate irrelevant information, such as the background noise that interferes with many older speech recognition systems, Thomson explains. The middle layers transform the audio signals into something meaningful. The final layers deliver the output in a form that the system software can understand.
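The layered flow Thomson describes can be pictured with a small sketch. This is only an illustration under stated assumptions, not how any vendor's system is built: the layer sizes, the random untrained weights, the six-number audio frame, and the four-item `SOUNDS` list are all hypothetical. It shows one frame of audio features passing through early and middle layers and ending in a final layer that scores each candidate sound.

```python
import math
import random


def relu(values):
    # Zero out negative activations so only positive signals pass on.
    return [max(0.0, v) for v in values]


def softmax(values):
    # Turn raw scores into probabilities that sum to 1.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # One fully connected layer: each output is a weighted sum of all inputs.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    # Hypothetical untrained weights; a real system learns these from data.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(frame, layers):
    # Early and middle layers transform the frame step by step.
    activations = frame
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    # The final layer outputs one probability per candidate sound.
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
SOUNDS = ["k", "ae", "t", "silence"]  # assumed toy label set
layers = [random_layer(6, 8, rng),
          random_layer(8, 8, rng),
          random_layer(8, len(SOUNDS), rng)]

frame = [0.2, -0.1, 0.7, 0.0, 0.3, -0.4]  # made-up features for one audio frame
probs = forward(frame, layers)
best = SOUNDS[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Because the weights here are random, the chosen sound is meaningless; the point is only the shape of the computation, in which each layer feeds the next and the last layer produces something the rest of the software can act on.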

Then the DNNs quickly stitch together those separate sounds into individual words and, in turn, string the words together into proper sentence structure and syntax. Additionally, the DNNs eliminate grammatical errors, as well as greatly decrease the number of unrecognizable words that require callers to default to a call center agent or touchtone entry. The DNNs also correct for regional accents when the caller is speaking, though they have yet to mature to the level that they can use different regional accents in their responses. They are available in a few different languages, however.
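One common way to picture the stitching step is to merge repeated per-frame labels into a sequence of sounds and then look that sequence up in a pronunciation dictionary. The sketch below assumes that approach for illustration; the article does not say which method any particular system uses. The blank marker `_`, the helper names, and the two-word `LEXICON` are hypothetical.

```python
from itertools import groupby

BLANK = "_"  # assumed marker for frames that carry no sound


def collapse_frames(frame_labels, blank=BLANK):
    # Merge runs of identical labels, then drop the blank frames.
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def sounds_to_word(sounds, lexicon):
    # Look up the sound sequence; returns None when no word matches.
    return lexicon.get(tuple(sounds))


# Hypothetical toy pronunciation dictionary.
LEXICON = {("k", "ae", "t"): "cat", ("y", "eh", "s"): "yes"}

# Made-up per-frame labels, one per short slice of audio.
frames = ["_", "k", "k", "_", "ae", "ae", "ae", "t", "_", "_"]
sounds = collapse_frames(frames)
word = sounds_to_word(sounds, LEXICON)
print(sounds, word)
```

A production recognizer would also weigh which word sequences are likely in context, which is how it can arrive at proper sentence structure; this toy lookup leaves that out.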

In that way, DNNs are superior to rules-based, speech-enabled IVR systems. The legacy systems require callers to follow fairly strict wording and sentence structure. Though a few of the more advanced systems can handle some language variation, the general rule is that if the caller’s speech falls outside that structure, the call defaults to an agent, but typically only after the caller has been asked to repeat the request one or more times. This often means a frustrated caller, the higher expense and staffing complexities of live agents, and the chance of lost business.

With improvements in speech recognition from DNNs, however, calls are handled more accurately. That sharply reduces caller frustration and means that live agents are needed only for the most complex calls or ones needing special attention, like a call about credit card fraud. Even in those instances, the speech-enabled IVRs can record some of the important initial information, reducing the time that a more expensive live agent needs to be on the phone. Therefore, speech technology experts expect further deployment of DNN-enabled speech technology to expand the self-help capabilities that companies across various industries can offer to their customers.
