The Ongoing Effort for Interoperable IVAs
Intelligent virtual assistants (IVAs) like Apple’s Siri, Amazon’s Alexa, and Google Assistant are now available to nearly half the people in the world. Who would have thought this possible 10 years ago? But billions of people have smartphones with assistants like Siri and Google Assistant, and millions more have smart speakers like the Amazon Echo. It’s becoming routine to talk with these generic assistants, and they are getting pretty good at answering everyday questions about sports scores, celebrities, or the weather.
But they could do much more. Developers can now create specific add-on applications that are accessible through the generic assistants, using tools like the Alexa Skills Kit or Google DialogFlow. This makes it possible for businesses, government organizations, schools, and even individuals to build voice applications that serve their customers and users, much like websites.
Although thousands of add-on applications have been developed, they don’t yet work on more than one platform—Alexa Skills work only on Alexa, and DialogFlow applications work only on Google. This means that if an organization wants to use voice assistants for customer support, it has to develop and, even more important, maintain several versions of them. This is expensive and time-consuming. Not only can errors lead to confusing user experiences, but inconsistent information about pricing or inventory can also cost the enterprise if it has to make good on discrepancies (“But Alexa told me that the price was $49.99!”). If applications could work on more than one platform, it wouldn’t matter whether the user has an Amazon Echo, a Google Home, or a mobile phone—they would still be able to talk to their bank, their school, their grocery store, or their local government, from whichever device they prefer, with the same user experience.
If we want one application to run on several different platforms, then we need (of course) standards. Standards can let applications cooperate, just as web pages do, to solve complex user problems. With the right standards in place, users will be able to interact with one assistant, find another assistant that specializes in a different topic, get some information, and then come back to the first—an activity akin to how we browse the web now.
Several standards efforts are in fact looking at these problems right now. I talked about some early activities in last year’s “A Tangled Web of Intelligent Assistants”, and it’s time for an update.
Here are the three major efforts:
The World Wide Web Consortium Voice Interaction Community Group (Voice Interaction CG)
This Community Group of the World Wide Web Consortium is working on an architecture that will allow intelligent virtual assistants to find each other, share data, and generally work together. The group has just published version 1.2 of an architecture report that defines components for interoperable intelligent processing. The group will soon start work on standard ways that these components can exchange information. An important part of the architecture is the Provider Selection Service, which has the goal of making it possible for IVAs to find other IVAs that can provide a specific service. Think of it as “Google for Voice.”
The Open Voice Network (OVON)
The Open Voice Network is a directed fund of the Linux Foundation that, according to the website, is “dedicated to making voice assistance worthy of user trust—especially for a future of voice assistance that will be multi-platform, multi-device, multi-modal, and multi-use.” This group is also working on use cases and requirements for a voice assistant architecture as well as a Voice Registry System (VRS). The VRS is similar to the Provider Selection Service mentioned above. It enables users to find voice services that meet their needs, independent of the platforms on which they are hosted.
Amazon Voice Interoperability Initiative (Amazon VII)
Amazon has initiated an effort called the “Amazon Voice Interoperability Initiative” or Amazon VII. The goal of this effort is to enable users to interact with different IVAs on the same platform; users would simply use an IVA’s wake word to summon it. Users could invoke another IVA if the first one can’t satisfy their request. Amazon VII has published a design guide to enable developers to use these features.
How Are These Efforts Related?
These initiatives are all independent of each other; they have similar goals but slightly different ways of achieving them. Amazon VII differs from both the OVON and Voice Interaction CG models in that it enables two or more IVAs to be coresident on a single device and lets the user choose which IVA to invoke. This means that users still have to know which IVA goes with each application, but at least they don’t need multiple smart speakers to access multiple IVAs. The OVON and Voice Interaction CG models, in contrast, assume that there is a primary IVA that connects users to additional, enterprise-specific IVAs. Amazon VII also does not appear to have an automatic discovery process; instead, users are encouraged to explore available agents and learn about their capabilities.
OVON and the Voice Interaction CG are more similar. Both of their architectures provide for automatic discovery of IVAs: In the case of OVON, it’s the Voice Registry System; in the case of the Voice Interaction CG, it’s the Provider Selection Service. As each group explores possible solutions for automatic discovery, it’s likely that common patterns will emerge.
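To make the discovery idea behind both the Voice Registry System and the Provider Selection Service concrete, here is a minimal sketch of a topic-based registry that a primary IVA might query. All class, field, and provider names here are illustrative assumptions of mine, not part of the OVON or W3C proposals:

```python
# Hypothetical sketch of IVA discovery. Names (Provider, Registry,
# "CityBankIVA", etc.) are invented for illustration and do not come
# from the OVON VRS or the W3C Provider Selection Service drafts.
from dataclasses import dataclass, field


@dataclass
class Provider:
    """A registered IVA offering services on one or more topics."""
    name: str
    topics: set = field(default_factory=set)


class Registry:
    """Toy stand-in for a VRS / Provider Selection Service."""

    def __init__(self):
        self._providers = []

    def register(self, provider: Provider) -> None:
        self._providers.append(provider)

    def find(self, topic: str) -> list:
        # A real service would also rank results by trust, locale,
        # and capabilities; here we just match on topic.
        return [p for p in self._providers if topic in p.topics]


registry = Registry()
registry.register(Provider("CityBankIVA", {"banking"}))
registry.register(Provider("GroceryIVA", {"groceries"}))

# The user's primary assistant asks the registry for banking help:
matches = registry.find("banking")
print([p.name for p in matches])  # ['CityBankIVA']
```

The point of the sketch is the indirection: the user’s primary assistant queries a platform-independent registry rather than hard-coding which enterprise IVA to hand off to.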
How Will IVA Standards Get Adopted?
Standards will be adopted when they are useful, easy to use, and have open-source implementations. A common misconception about standards is that adoption isn’t possible without the participation of an industry’s major players. But the major players often feel it’s not in their best interests to support standards, and so they might not participate. An open-source alternative based on standards can often win out over even large proprietary systems. There are many examples of this, but I think the best one is the World Wide Web itself, which overcame proprietary systems like AOL, CompuServe, and Prodigy—remember those?
Please look into these organizations, read their proposals, join their efforts, and maybe even try your hand at an implementation!
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.