July 25, 2022
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

Conversational Agents Move Toward Interoperability

Conversational voice agents that can speak and listen to users via voice-enabled devices containing speakers and microphones have gone from novelty to ubiquity since Apple’s Siri was released more than a decade ago. Today there are numerous examples of conversational agents—with which users can also communicate via typing and reading text—in both the consumer and business realms:

conversational agents that need special hardware and software, such as the aforementioned Siri, Amazon Alexa, and Google Assistant;
chatbots that provide text and/or voice interfaces for users and run on general-purpose hardware;
interactive voice response (IVR) agents that run on servers and use telephony services.

To be widely useful, though, conversational agents should be interoperable—users must be able to do these three things:

Use any voice-enabled device to connect to any conversational agent anywhere in the world.
Switch between conversational agents.
Share data among conversational agents.

Unfortunately, many of today’s conversational agents do not fully provide these features. Let’s look at four approaches (three used today and one potentially in the future) that support interoperability to varying degrees.

Walled Gardens

Amazon Alexa and Google Assistant each have several capabilities available from Amazon or Google that are executed on the vendor’s platform. Users access capabilities using smart speakers (containing both microphones and speakers) from the same vendor. In Amazon Alexa, capabilities are called skills; in Google Assistant, they are called actions. Amazon Alexa and Google Assistant are sometimes called walled gardens because users must use speakers supplied by the vendor to access capabilities executed on the vendor’s platform.

Vendors encourage developers to create capabilities for their walled gardens by providing these elements:

Resources including speech recognition, natural language processing, dialogue management, and speech synthesis software. Developers write capabilities that use these resources.
Tools for developing, debugging, and monitoring conversational capabilities. These tools make it easy for developers to write capabilities.
Conventions and guidelines for developing capabilities that enable easy and consistent user interfaces for end users.

In exchange, developers agree to let vendors do the following:

Capture data during the execution of their capabilities and use it to train the vendor’s resource software to improve performance.
Exclusively market their capabilities in the vendor’s online store, making the capabilities available worldwide and easy to download and install on personal devices.

Write Once, Deploy Twice

To make a capability available to users of multiple walled gardens, developers write multiple versions of each capability for deployment to different walled gardens. Unfortunately, it takes time and effort to create and maintain multiple versions of the same capability. Third-party tools reduce this problem: tools from vendors such as Jovo, Voiceflow, BotTalk, and True Reply allow developers to write each capability once and then generate equivalent capabilities for each walled garden. But if an update is made directly to only one deployed capability, it diverges from its equivalent capability on another platform, which can lead to maintenance problems.
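The write-once idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Capability` and `Intent` classes, the two generator functions, and the output layouts for "platform_a" and "platform_b" are hypothetical and do not reproduce the actual formats used by Alexa, Google Assistant, or any of the third-party tools named above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intent:
    """One thing the user can ask for, described once in a neutral form."""
    name: str
    sample_phrases: List[str]
    response: str


@dataclass
class Capability:
    """Platform-neutral source of truth for a capability (hypothetical)."""
    name: str
    intents: List[Intent] = field(default_factory=list)


def to_platform_a(cap: Capability) -> Dict:
    """Generate a list-based layout for a hypothetical first platform."""
    return {
        "skillName": cap.name,
        "intents": [
            {"name": i.name, "samples": list(i.sample_phrases), "speech": i.response}
            for i in cap.intents
        ],
    }


def to_platform_b(cap: Capability) -> Dict:
    """Generate a handler-map layout for a hypothetical second platform."""
    return {
        "actionName": cap.name,
        "handlers": {
            i.name: {"trainingPhrases": list(i.sample_phrases), "prompt": i.response}
            for i in cap.intents
        },
    }


def generate_all(cap: Capability) -> Dict[str, Dict]:
    """Regenerate every platform version from the single shared definition."""
    return {"platform_a": to_platform_a(cap), "platform_b": to_platform_b(cap)}


store_hours = Capability(
    name="store_hours",
    intents=[
        Intent(
            "GetHours",
            ["when are you open", "what are your hours"],
            "We are open 9 to 5.",
        )
    ],
)
builds = generate_all(store_hours)
```

In this arrangement, the versions stay equivalent only as long as every change is made to the shared definition and both outputs are regenerated. Editing one generated output by hand is what causes the divergence described above.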

Voice Interoperability Initiative (VII)

With the Voice Interoperability Initiative (VII) from Amazon, a user speaks into a voice-enabled device with special hardware, causing it to switch between conversational agents. Each conversational agent has its own “wake word” that enables customers to speak to their desired conversational agent by simply saying its wake word. For example, using special hardware within a voice-enabled device, users can switch between a domain-specific collaborative agent and Alexa, a general-purpose conversational agent. To see how effective this can be for users, see the videotaped demonstrations at https://www.youtube.com/watch?v=j53qMUBql5M.
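The routing behavior behind wake words can be sketched in a few lines. This is an illustration only, not how VII devices work: real wake-word detection runs on audio, whereas this sketch assumes the utterance has already been transcribed to text. The `WakeWordRouter` class, the "chef" wake word, and the handler functions are hypothetical.

```python
from typing import Callable, Dict, Optional

BOUNDARY_CHARS = " ,.!?"


class WakeWordRouter:
    """Routes transcribed utterances to the agent whose wake word was spoken."""

    def __init__(self) -> None:
        self._agents: Dict[str, Callable[[str], str]] = {}
        self.active: Optional[str] = None  # wake word of the current agent

    def add_agent(self, wake_word: str, handler: Callable[[str], str]) -> None:
        self._agents[wake_word.lower()] = handler

    def handle(self, utterance: str) -> Optional[str]:
        text = utterance.strip()
        lowered = text.lower()
        for wake_word in self._agents:
            if not lowered.startswith(wake_word):
                continue
            rest = text[len(wake_word):]
            # Require a word boundary so "chefs" does not trigger "chef".
            if rest and rest[0] not in BOUNDARY_CHARS:
                continue
            self.active = wake_word          # switch agents
            text = rest.lstrip(BOUNDARY_CHARS)
            break
        if self.active is None:
            return None                      # no agent has been woken yet
        return self._agents[self.active](text)


router = WakeWordRouter()
router.add_agent("alexa", lambda text: "general: " + text)
router.add_agent("chef", lambda text: "cooking: " + text)
```

An utterance that begins with a wake word switches the active agent, and later utterances without a wake word go to whichever agent was addressed last.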

Interoperable Conversational Agents

As noted, interoperable agents would enable users to connect to any voice agent; switch between voice agents; and share data among voice agents.

Two techniques can be used to locate conversational agents so users can connect to them:

Developers register their conversational agents with a DNS-like service that enables users to locate and connect to conversational agents.
Developers supply metadata describing the capabilities and locations for their respective conversational agents. Browsers, search engines, and aggregators use this metadata to (a) discover conversational agents relevant to the user request; (b) prioritize them with respect to user supplied criteria; and then (c) connect the user to a remote agent.
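Both techniques can be sketched together in a small in-memory example. It is a simplification under stated assumptions: the `AgentRecord` fields, the `AgentRegistry` class, the agent names, and the `.example` endpoints are all hypothetical, and a real service would be distributed rather than a single dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class AgentRecord:
    """Metadata a developer might publish for an agent (hypothetical fields)."""
    name: str
    endpoint: str
    topics: FrozenSet[str]
    rating: float


class AgentRegistry:
    def __init__(self) -> None:
        self._by_name: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._by_name[record.name.lower()] = record

    def resolve(self, name: str) -> Optional[AgentRecord]:
        """Technique 1: DNS-like lookup of a known agent by name."""
        return self._by_name.get(name.lower())

    def discover(
        self,
        request_words: Iterable[str],
        priority: Callable[[AgentRecord], float] = lambda r: r.rating,
    ) -> List[AgentRecord]:
        """Technique 2: (a) find agents whose topics match the request,
        then (b) order them by a user-supplied criterion, best first."""
        words = {w.lower() for w in request_words}
        matches = [r for r in self._by_name.values() if r.topics & words]
        return sorted(matches, key=priority, reverse=True)


registry = AgentRegistry()
registry.register(AgentRecord(
    "WeatherNow", "https://weathernow.example/agent",
    frozenset({"weather", "forecast"}), 4.2))
registry.register(AgentRecord(
    "SkyCast", "https://skycast.example/agent",
    frozenset({"weather"}), 4.8))
registry.register(AgentRecord(
    "PizzaBot", "https://pizzabot.example/agent",
    frozenset({"pizza", "order"}), 4.5))

candidates = registry.discover("what is the weather today".split())
```

Step (c), connecting the user, would then use the `endpoint` of the first candidate. The default ordering by `rating` is just one possible criterion; callers can pass their own `priority` function.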

Interoperable agents would enable users to employ software rather than special device hardware to switch between conversational agents.

To share data among multiple interoperable agents, a user first speaks values to a conversational agent; then other conversational agents request that those values be shared, subject to privacy and security constraints. This saves the user from having to remember and speak the same values multiple times, and privacy constraints can ensure that private data is not shared.
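A minimal sketch of this sharing model follows. It assumes a per-user store in which each value carries an explicit list of agents allowed to read it; the `SharedProfile` class, the agent names, and the keys are hypothetical, and a real system would also need authentication and secure transport.

```python
from typing import Dict, Iterable, Optional, Set


class SharedProfile:
    """Values a user has spoken once, with per-value sharing permissions."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._allowed: Dict[str, Set[str]] = {}  # key -> agents allowed to read

    def store(self, key: str, value: str, share_with: Iterable[str] = ()) -> None:
        """Record a value and the agents that may request it (none by default)."""
        self._values[key] = value
        self._allowed[key] = set(share_with)

    def request(self, agent: str, key: str) -> Optional[str]:
        """Return the value only if this agent is permitted to see it."""
        if key in self._values and agent in self._allowed[key]:
            return self._values[key]
        return None


profile = SharedProfile()
# The user speaks an address once and allows two agents to reuse it.
profile.store("delivery_address", "12 Main Street",
              share_with=["pizza_agent", "taxi_agent"])
# A payment detail is stored but marked private (shared with no agent).
profile.store("card_number", "0000-0000-0000-0000")
```

Here the pizza and taxi agents can both obtain the address without the user repeating it, while a request for the card number from any agent returns nothing.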

Both Stanford University and the Open Voice Network (an independently funded and governed nonprofit industry association, which operates as an open-source association of the Linux Foundation) are pursuing interoperable voice agent initiatives. While the goals are similar, the two initiatives use different technical approaches.

Our Interoperable Future

Once interoperability initiatives are realized, a whole new level of convenience will open to users. You’ll be able to use any device to connect to any conversational agent anywhere in the world, and it will be easy to switch between agents and share data among them.

Users may be familiar with similar interoperability features in desktop and mobile devices that take advantage of the device’s operating system. But with no operating system, it is a challenge to provide these features for independent conversational agents.

Efforts under way at Stanford University and the Open Voice Network will lead the way to conversational agents becoming fully interoperable. Experiments like these will produce techniques for commercial implementations of interoperable voice agents that overcome the shortcomings of today’s approaches.

James A. Larson, Ph.D., is senior advisor to the Open Voice Network and is the co-program chair of the SpeechTEK Conference. He can be reached at jlarson@intotoday.com.
