November 10, 2016
By Deborah Dahl Principal - Conversational Technologies
Standards

Let’s Get Siri, Cortana, and Alexa to Work Together

Virtual assistants, whether personal assistants like Siri, Cortana, Google Now, or Alexa, or enterprise assistants tailored for a company, are becoming an increasingly common and important form of voice application: Business Wire recently reported that the market for virtual assistants will reach $15.8 billion by 2021. Unlike traditional IVR systems, these virtual assistants are characterized by open-ended natural language input, either spoken or typed. Interactions can be initiated by the user, or by both the system and the user, which results in a much more natural interface than the strictly system-driven dialogues common in IVRs. Virtual assistants are also beginning to support limited conversational exchanges, at least to the point where the user can ask follow-up questions.

The best-known assistants are all based on proprietary technologies. We are starting to see some opening up of APIs, like Alexa Skills Kit and SiriKit, to third-party development, and companies such as Openstream and Pandorabots offer standard and/or open authoring frameworks for virtual assistant applications. True interoperability between platforms, however, remains elusive.

Yet such interoperability would bring huge benefits. What if applications developed for one platform could also run on other platforms? Right now, if you wanted to develop a customer service application using Alexa as a user interface, and then wanted to support Siri, Cortana, or Google Now, you would have to develop a brand-new application for each one. This might remind you of the situation with IVRs 20 years ago, when every IVR platform had a completely different set of development tools, APIs, and authoring approaches. Moving to a new platform was extremely difficult, and any developers who wanted to work with several platforms had to learn a new set of skills for each platform.

With IVRs, the solution was the establishment of standards published by the W3C Voice Browser Working Group (now closed). These standards include VoiceXML for defining dialogues, SRGS for controlling speech recognition, and SSML for controlling text-to-speech output. The VoiceXML family of standards made IVR applications substantially more interoperable and greatly reduced the need for developers to master APIs from different platforms.

What would we need to do to get the benefits of these kinds of standards for virtual assistants? Can the existing standards help? What new standards are needed? The W3C has started a new group, the Voice Interaction Community Group (https://www.w3.org/community/voiceinteraction/), to look into these questions. W3C community groups, unlike the more traditional working groups, are exploratory and investigative in nature, which is appropriate for new technologies like virtual assistants. The work of community groups includes publishing reports on use cases and requirements to help clarify what standards are needed in a specific area. They can also publish reports on frameworks describing how existing and potential standards could work together. Finally, community groups can create proposed specs for new standards. These specs can be sent to a working group for formal standardization or can simply be used by the community to test applications and features.

Another important feature of community groups is that they’re open to everyone, without cost, and without requiring a W3C membership. This policy encourages input from as wide a variety of participants as possible. Basically, anyone who has the interest and motivation can contribute to the work of a W3C community group.

The Voice Interaction Community Group is just beginning to define topics to work on. Some ideas discussed so far include (1) languages for defining intelligent, conversational dialogues; (2) communication standards between different virtual assistants; (3) standards for statistical language models to support more flexible speech recognition; and (4) standard semantic representations for concepts common in virtual assistant applications, like time and location. Communication between virtual assistants is an especially interesting topic. You may have a personal assistant that knows about you, your interests, and your preferences, but what if you need a more specialized enterprise assistant for a specific task like shopping, customer support, or interacting with a smart environment? Could your personal assistant help discover an appropriate specialized assistant? Could it tell the specialized assistant about your shopping preferences?

This is a great time to get involved—sign up, start a discussion on the mailing list, and share your ideas!

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.