October 19, 2023
By Kevin Brown, enterprise architect, Miratech
Inside Speech

Real-Time Transcription Serves an Immediate Need (or Lots of Them)

For all the talk about generative AI and large language models in the speech technology industry, the current buzz around contact center water coolers, virtual or otherwise, is real-time transcription of calls.

After-the-fact or lagging call transcription has been around for nearly two decades, with solid use cases driving the requirement in contact centers. Real-time monitoring and coaching started in earnest about 10 years ago with the likes of CallMiner, Cogito, Observe.ai, and others.

But for various reasons, a growing number of organizations want real-time transcription results. One large health organization wishes to analyze topics and sentiment in real time across all calls and feed the results to a dashboard. Think speech analytics on steroids, with a dose of sentiment analysis. A business process outsourcing company, as another example, wants to immediately capture caller conversations for its post-call documentation, using callers’ exact verbiage, thereby eliminating agents’ potential interpretation differences.

Many organizations are interested in captioning calls, much like closed captioning on TV, to help employees with hearing impairments. This also could help all agents, who at times might lose focus; real-time transcription with desktop captioning would allow agents to quickly look at what was said and catch up with the conversation.

Real-time translation is becoming so accurate that several organizations are considering using real-time transcription to fuel translation. This would allow an agent who doesn’t speak a caller’s language to read her side of the conversation and then type a response that is spoken back to her in the appropriate language using text-to-speech.

So what is the key that sets real-time transcription apart from its slower cousin? This isn’t a trick question—yes, speed is certainly a major requirement. Use cases such as translation or audio captioning have zero tolerance for delay. In this day of microservices in the cloud, the differentiators that provide nearly instant transcription from speech to text are the latest speech recognizers (to be covered in a future column) and good old-fashioned computing power. An automotive analogy would be a sleeker design with a lower coefficient of drag and a more powerful engine. Yes, a car can go quicker without a more powerful engine, but add in that extra power with lower drag and you have speed!

A quick look back 15 years to 2008 shows that the computing power needed for post-call transcription to feed post-call speech analytics meant many servers, many points of failure, constant care and feeding of operating systems and security updates, and the overhead of hardware and networking costs, all for a call center of 1,000 to 2,000 concurrent agents. And double that for redundancy. Today that power is virtualized in microservices running in an always-available fabric across geographies. Now scaling up and down is extremely fast, with the added benefit of licensing costs that are based on usage instead of peak volumes. And the cloud approach provides real-time transcription for call centers of 50,000-plus agents, which wasn’t possible 15 years ago.

The other key component of useful real-time transcription is accuracy. A real-life example: If you read a transcription of a conference call after the fact, inevitably you will find strange renderings from the speech-to-text engine. But usually, they are infrequent enough, and within context you can understand what was spoken.

Running captions on live television events show how a lack of accuracy undermines the value of having captions turned on. Inevitably, when you need captions to provide information you can’t clearly hear, the captioning will be wildly inaccurate or will drop out.
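
One common way to put a number on transcription accuracy is word error rate (WER): the count of substituted, dropped, and inserted words divided by the number of words actually spoken. This column does not name a metric, so treat the following as a minimal sketch under that assumption; the function name and the sample sentences are invented for illustration and do not come from any vendor's product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count.

    Illustrative helper only (hypothetical name); real scoring tools also
    normalize punctuation, numbers, and spelling variants before comparing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of six spoken words.
spoken = "please reset my online banking password"
transcribed = "please reset my online baking password"
print(f"WER: {word_error_rate(spoken, transcribed):.1%}")  # prints "WER: 16.7%"
```

A single wrong word in a six-word sentence already yields a 16.7 percent error rate, which is easy to read past in an after-the-fact transcript but far more disruptive in a live caption or a translated reply.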

The CCaaS platform vendors are being held to a much higher accuracy standard with real-time transcription, so the move to higher-performing speech-to-text engines is paramount to their success. With the increasing demand for real-time transcription, the higher costs of increased speed and accuracy are now part of the “ticket to play” to be competitive. It remains to be seen whether CCaaS vendors will pass the costs directly to customers or use the additional functionality to prod organizations to move from on-premises to the cloud.

Kevin Brown is an enterprise architect at Miratech with more than 25 years of experience designing and delivering speech-enabled solutions. He can be reached at kevin.brown@miratechgroup.com.