July 7, 2022
By Kevin Brown, enterprise architect, Miratech.
Inside Speech

The Conversational Cloud Promises Breakthroughs, and Plenty of Complexity

As speech services become ubiquitous, reliance on them drives yet more pressure to deliver seamless and accurate communication with both humans and non-humans. Who would have thought just a few years ago that satisfaction with a particular car, for example, could hinge on the quality of its speech technology?

Speech technologies have evolved from physically present—embedded in devices, on-premises applications, and platforms like interactive voice response for contact centers—to cloud-delivered and, in some use cases, part-embedded and part-cloud. Opus Research coined the term “conversational cloud” to recognize this reality while pointing out that terms such as contact center-as-a-service (CCaaS) and unified communications-as-a-service (UCaaS) do not address the fact that multiple services are required for contact centers and even communications. Opus describes the conversational cloud as a product of combining call processing, speech processing (both automated speech recognition and text-to-speech rendering), speech and text analytics, and multiple flavors of cognitive resources.

As I noted here in February 2020, in “Speech Recognition Has Finally Come of Age. Now What?”, technology pundits have tended to overlook speech technologies in recent years because they are delivered via APIs instead of being embedded within applications. But what they’re missing, which Opus Research recognizes, is that the intermingling of artificial intelligence and data-infused resources in the conversational cloud has caused a leap in speech capabilities that makes Moore’s Law look trivial by comparison.

The conversational cloud isn’t necessarily limited to contact centers or a specific speech interface; it can be a hybrid of organizations and devices. Prior to the cloud, speech was maintained in silos within organizations. Copying a well-tuned speech recognizer model to a different data center within the same organization was a massive task not long ago, and mixing speech models from different speech applications was impossible. With the conversational cloud, imagine a healthcare organization where providers use dictation and their administrative organizations, including contact centers and third-party payer companies, benefit from speech tuning of healthcare procedures, pharmaceutical names, patient names, and more. Thanks to AI tuning capabilities, it is suddenly faster and more convenient to use a speech interface than a keyboard.

With the conversational cloud, a nearly infinite number of use cases can be created, tuned, and accessed. Going back to cars, manufacturers used to develop new products in secret and attempt to keep details from their competitors for as long as possible. But at some point, that “secret gizmo” became a commodity. The Center for Automotive Research frequently bases its predictions of future automotive capabilities on financial and human resource constraints. So it should be no surprise that manufacturers moved away from proprietary speech-enabled infotainment, first to Apple CarPlay, beginning in 2014, and then to Android Auto and Amazon Alexa.

The next step to speech-enable everything controlled by a driver is under way, and the conversational cloud is fueling it. Manufacturers are currently competing on the hardware for the interface, but as we’ve seen in many other automotive developments, it will likely become a commodity sooner rather than later.

Are you thinking that the conversational cloud, like the recent speech technology gains that seemed to appear by magic, will arrive with only minor technical hurdles to cross?

Think again.

Thought must be given to protecting personally identifiable information (PII), payment card industry (PCI) information, and more. Existing regulations such as the GDPR point toward a need to provide an opt-out for users who don’t want their audio recorded. But should organizations then allow customers to use speech services if they decline to have their inputs recorded for tuning purposes?

Current work in several conversational cloud provider organizations is focused on anonymizing users’ recordings enough to satisfy privacy requirements, meaning both regulatory compliance and end users’ expectations. That work will also need to effectively scrub or mask information such as payment card data and PII. All of this could lead to the creation of an independent auditing council that will apply a seal of approval consumers can look for when choosing which products and services to purchase or use, much like today’s PCI compliance, which determines whether organizations can accept payment other than cash.
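To make the scrubbing and masking idea concrete, here is a minimal sketch in Python of one small piece of it: replacing likely payment card numbers in a text transcript with a placeholder. It is an illustration under stated assumptions, not a description of how any provider does this. It assumes the recognizer has already produced text with numerals rather than spelled-out digits, and the names `CARD_CANDIDATE`, `luhn_valid`, and `mask_card_numbers` are hypothetical. The Luhn checksum is used only to cut down on false positives such as order numbers.

```python
import re

# Assumed candidate pattern: 13 to 19 digits, optionally separated by
# single spaces or hyphens. Real transcripts may need a looser pattern.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_card_numbers(transcript: str, mask: str = "[CARD]") -> str:
    """Replace digit runs that look like card numbers with a placeholder."""

    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Leave the text untouched when the checksum fails.
        return mask if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(replace, transcript)


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real card.
    print(mask_card_numbers("My card is 4111 1111 1111 1111, thanks."))
```

A pattern-plus-checksum filter like this handles only the easy cases. Names, addresses, and digits spoken as words would call for different techniques, which is part of why the problem is harder than it first appears.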

In the world of speech technology, we are now technically moving too fast for people, processes, and regulations to keep up. Some deep thought will be necessary to prevent having to make major course corrections. Speech technologies are entering a cloudy phase indeed.

Kevin Brown is a customer experience architect with more than 25 years of experience designing and delivering speech-enabled solutions. He can be reached at kevin.brown@voxperitus.com.
