Speech Recognition Just Got Supercharged


At last month's Interspeech, a technical conference on speech processing and applications, it was evident that academia and industry are pushing the limits of what's possible for automatic speech recognition (ASR) by leveraging advances in deep learning. ASR alone is expected to become a $30 billion market.

Thanks to the recent development of efficient training methods and the availability of more computing power, deep neural networks have enabled ASR systems to perform astoundingly well in a number of application domains, well beyond consumer voice assistants like Siri and Alexa. OpenAI's general-purpose language model, GPT-3, has demonstrated impressive results and will certainly spark further innovation in the natural language processing research community.

For the last five years, LibriSpeech, a speech corpus of 1,000 hours of transcribed audiobooks, has been the most-used benchmark dataset for ASR research in both academia and industry. Many prestigious research groups have been testing their new ideas using this dataset, which has rapidly advanced ASR results.

Facebook, ASAPP, and Google all broke the 98 percent speech recognition accuracy barrier on LibriSpeech in 2020. The race will continue, and so will innovation. Ultimately these innovations make customers the winners, and accurate real-time transcription also feeds directly into various business benefits, whether it's augmenting and amplifying customer service agents' skills and emotional intelligence or transcribing videos with speech-to-text.
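Headline figures like "98 percent accuracy" are typically derived from word error rate (WER), the standard ASR benchmark metric: accuracy is roughly 1 minus WER, so breaking 98 percent accuracy on LibriSpeech corresponds to a WER below 2 percent. Here is a minimal sketch of how WER is computed with word-level edit distance (illustrative only, not any research group's official scoring script):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[-1][-1] / len(ref)

# "Accuracy" as reported in headlines is roughly 1 - WER.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

One substitution ("a" for "the") over six reference words gives a WER of about 16.7 percent; a system at 98 percent accuracy makes roughly one such error every 50 words.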

Multistream convolutional neural networks (CNNs), which process audio through parallel streams to improve robustness to noise, are one of the main contributors to the successful research outcomes of the LibriSpeech race. The extra processing these models require has been optimized so that latency stays low without sacrificing accuracy, and ASR systems can now take advantage of models that offer reliable transcriptions in noisy environments like agent-customer conversations in contact centers.
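The multistream idea can be illustrated schematically: parallel convolution streams examine the same acoustic frames at different temporal resolutions, and their outputs are combined for downstream layers. A toy pure-Python sketch, where the kernels and frame values are hypothetical (real systems learn the filters and operate on spectral features):

```python
def conv1d(x, kernel):
    """Valid-mode 1-D convolution (cross-correlation) over a feature sequence."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def multistream(x, kernels):
    """Run parallel convolution streams with different kernel widths
    (hence different temporal resolutions) and concatenate their outputs,
    as a downstream layer would consume them."""
    return [out for kern in kernels for out in conv1d(x, kern)]

frames = [0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.7]  # toy acoustic feature sequence
streams = [[1.0, -1.0],                        # narrow kernel: fine detail
           [0.25, 0.25, 0.25, 0.25]]           # wide kernel: smoothed context
features = multistream(frames, streams)
```

The narrow stream reacts to frame-to-frame changes while the wide stream averages over context, which is one intuition for why combining streams helps when some of the signal is masked by noise.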

In my conversations with technology operations leaders at Fortune 500 contact centers, I've heard a consistent theme when discussing speech analytics: there are no new insights into why customers are calling, and there's nothing actionable in the reports, so little changes.

Change is coming, and the advancements in artificial intelligence are raising the bar for what should be expected in returns for investments in speech recognition. Contact centers that support both voice and digital channels will benefit from ASR as a backbone for transcription, voice of the customer (VoC), customer sentiment, and coaching.

To date, transcription has been an expensive component of contact center technology with little return. Most companies record, batch, and store between 10 percent and 20 percent of calls for later analysis and transcription. Resulting business decisions are based on this narrow dataset, which is historical by the time it's reviewed and not an accurate representation of everything happening across customer experience (CX) operations. With advanced ASR systems, transcription of 100 percent of calls can be achieved in real time. It's a game changer: companies can radically improve their CX operations, retaining happier customers and growing faster.

Paired with real-time AI-driven analysis, real-time transcription makes it possible to support and prompt customer service agents with the best responses and actions to take with customers. It allows VoC insight to be captured from every call and enables automation of thousands of micro-processes and routine tasks, like call summaries. A major North American communications company is already leveraging AI to automate the dispositioning of notes at the end of each agent call, leading to a 65 percent reduction in handling time for that specific task.

Customer Sentiment Analysis and Prediction

CX leaders currently use customer satisfaction (CSAT) or Net Promoter Score (NPS) surveys, which typically see response rates from 5 percent to 15 percent. Now innovative machine learning algorithms can leverage the transcriptions and speech analytics of every customer interaction to predict both the sentiment and satisfaction of customers.
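As a schematic of the idea, even a minimal bag-of-words Naive Bayes classifier can assign a sentiment label to a call transcript. The training examples below are hypothetical, and production systems use far richer lexical and acoustic features, but the sketch shows the shape of the prediction step:

```python
from collections import Counter
import math

# Toy labeled transcripts -- hypothetical data for illustration only.
TRAIN = [
    ("thank you so much this was really helpful", "positive"),
    ("great service my issue is resolved", "positive"),
    ("i am very frustrated this is the third call", "negative"),
    ("this is unacceptable i want to cancel", "negative"),
]

def train_nb(examples):
    """Fit per-label word counts for a multinomial Naive Bayes model."""
    counts = {}                 # label -> Counter of words
    label_totals = Counter()    # label -> number of examples
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
        label_totals[label] += 1
    return counts, label_totals

def predict(text, counts, label_totals):
    """Return the most likely label, with add-one (Laplace) smoothing."""
    vocab = {w for c in counts.values() for w in c}
    best_label, best_logp = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        logp = math.log(label_totals[label] / sum(label_totals.values()))
        for w in text.split():
            logp += math.log((c[w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

counts, totals = train_nb(TRAIN)
label = predict("i am frustrated with this service", counts, totals)
```

Scoring every transcript this way, rather than waiting on survey responses, is what lets sentiment cover 100 percent of interactions instead of the 5 to 15 percent who answer a survey.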

Data can be delivered in real time to discern the trending intents of customer calls and automatically categorize each reason, which can provide both a ground-level and a big-picture view of what's happening in the customer landscape. This deep understanding of why customers are calling, and how that compares over time, delivers rich insights and trends, not just to customer service teams but also to product, marketing, and sales teams. When businesses can adjust and respond quickly to consumers—even applying anomaly detection to identify real-time crisis issues and address them before they become catastrophic—revenue and satisfaction can rise.
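The anomaly-detection step can be sketched with a simple z-score test over historical call volumes for a given intent. The counts and the three-sigma threshold below are hypothetical, and production systems use more sophisticated time-series models, but the principle is the same:

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's call volume for an intent if it deviates from the
    historical mean by more than `threshold` standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:              # flat history: any change is anomalous
        return today != mu
    return abs(today - mu) / sigma > threshold

# Daily counts of "billing dispute" calls over the past two weeks
# (hypothetical numbers for illustration).
history = [120, 115, 130, 125, 118, 122, 127, 119, 124, 121, 126, 123, 117, 128]
alert = is_anomalous(history, today=410)  # sudden spike, e.g. a billing outage
```

A spike like this, surfaced within minutes rather than in next quarter's report, is exactly the kind of real-time crisis signal the paragraph above describes.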

Large organizations that leverage this modern ASR approach to transcription can reduce their overall CX spending by 30 percent, which can translate into hundreds of millions of dollars in annual savings.

Deep learning is providing revolutionary changes to ASR, which has taken more major leaps in the last decade than it did in the 30 years prior. This radical improvement in ASR accuracy will allow enterprises and their customers to embrace voice recognition products more comfortably than at any time in history, so let's get talking.

Rachel Knaster is senior vice president of product at ASAPP, an artificial intelligence provider. She previously worked with the IBM Watson team.
