October 1, 2008
By Deborah Dahl, Principal, Conversational Technologies
Standards

Opening the World of Multimodality

We are starting to see some very exciting multimodal applications, especially in the area of voice search. Being able to ask for information by speaking, and then seeing the results on a screen, is a powerful paradigm for interacting with a mobile device. On the one hand, speech input avoids annoyances inherent in graphical user interfaces (GUIs), such as scrolling through long lists and torturous keypad entry. On the other hand, GUI output avoids some of the annoyances inherent in voice interfaces, such as the problem of conveying large amounts of information by voice.

Today’s multimodal voice search applications are available from companies like Tellme Networks, Google, and vlingo. But these are proprietary applications. What if third-party developers could create similar applications just as easily as millions of developers today can create Web pages?

The key to opening up multimodal applications to a wider developer base is standards. Like HTML for the Web and VoiceXML in the voice world, upcoming multimodal standards will enable many more developers to create multimodal applications.

My column, "A Framework for Multimodal Apps" (July/August 2008), covered the Multimodal Interaction (MMI) Architecture being developed by the World Wide Web Consortium. This architecture defines an Interaction Manager that coordinates components to support speech, graphics, pointing, and other modalities. It has some very attractive features, such as a natural extensibility to new modalities and excellent support for distributed applications. Because components can be distributed, an application can draw on components hosted in widely dispersed locations. This makes modality mashups possible, in which modality components are developed by third parties with specialized expertise and are then integrated into new applications using the MMI Architecture.

However, the architecture by itself isn’t enough to create applications. We need specific software, such as Web browsers and voice browsers, and markup languages like VoiceXML and HTML, to actually build applications. For example, a Web server running an Interaction Manager could communicate over the Web with distributed modality components, such as a remote speech recognizer or text-to-speech engine. In this case, the components would communicate using HTTP, the standard Web communication protocol.
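As a rough illustration of the distributed case, the sketch below shows how an Interaction Manager might package a request to a remote text-to-speech component as an HTTP POST. The URL, the JSON field names, and the function name are all assumptions made up for this example; the MMI Architecture defines its own message formats, which are not reproduced here.

```python
import json
import urllib.request


def build_component_request(component_url, action, data):
    """Build (but do not send) an HTTP POST addressed to a modality component."""
    # Hypothetical message layout: an action name plus a data payload.
    body = json.dumps({"action": action, "data": data}).encode("utf-8")
    return urllib.request.Request(
        component_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address of a remote text-to-speech component.
request = build_component_request(
    "http://tts.example.com/mmi",
    "speak",
    {"text": "What would you like to order?"},
)
# Sending it would be: urllib.request.urlopen(request)
```

Because the message travels over plain HTTP, the component could be hosted anywhere on the Web, which is what makes the distributed arrangement practical.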

Alternatively, one or more modality components or the Interaction Manager might run locally on a device. Both of these approaches would use the MMI Architecture, but with different implementations of components.

To make the MMI Architecture more concrete, the W3C Multimodal Interaction Working Group recently published a document on authoring called "Authoring Applications for the Multimodal Architecture" (www.w3.org/TR/mmi-auth). It provides one example of how to create multimodal applications using the MMI Architecture. The sample application, a simple online ordering application, is based on current standard technologies: the Interaction Manager is implemented with SCXML, the graphical modality with HTML, and the voice modality with VoiceXML. The SCXML Interaction Manager sends messages to the Web browser and voice browser telling them to speak or display information. The Web browser and voice browser interact with the user and deliver the user’s input back to the Interaction Manager, which then moves on to the next step in the application.
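The control flow described above can be sketched in a few lines of Python. This is only a stand-in for the idea, not the SCXML from the authoring document: the state names, the prompts, and the InteractionManager class are hypothetical, and the two browsers are represented by simple callables that record what they were told to present.

```python
# Hypothetical order flow: each state names the prompt to present
# and the state to enter once the user has answered.
STATES = {
    "ask_item": {"prompt": "What would you like to order?", "next": "ask_quantity"},
    "ask_quantity": {"prompt": "How many?", "next": "confirm"},
    "confirm": {"prompt": "Shall I place the order?", "next": "done"},
}


class InteractionManager:
    def __init__(self, components):
        self.components = components  # callables standing in for the browsers
        self.state = "ask_item"
        self.answers = {}

    def prompt(self):
        """Tell every modality component to present the current prompt."""
        for send in self.components:
            send(STATES[self.state]["prompt"])

    def on_user_input(self, value):
        """Record input delivered by any component and move to the next step."""
        self.answers[self.state] = value
        self.state = STATES[self.state]["next"]
        if self.state != "done":
            self.prompt()


spoken, displayed = [], []  # what the voice and Web browsers were asked to present
im = InteractionManager([spoken.append, displayed.append])
im.prompt()
im.on_user_input("pizza")  # e.g. recognized by the voice browser
im.on_user_input("2")      # e.g. typed into the Web page
im.on_user_input("yes")
```

The point of the design is that neither browser decides what happens next. Each one only presents what it is told and reports input back, so the same step can be answered by voice or through the page.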

This design supports distributed applications well. However, today’s Web and voice browsers have no direct way of receiving messages from an external component like the Interaction Manager. Both types of browsers were designed to take charge of the interaction with the user. This makes sense in a unimodal voice or Web application, but multimodality by its nature requires coordination across modalities. Introducing the Interaction Manager for this coordination means that a separate component, not the HTML or VoiceXML application, directs the interaction with the user.

Future versions of HTML and VoiceXML might respond to external control messages directly. Even with today’s browsers, however, there are straightforward techniques for simulating the ability to receive messages. These are illustrated in detail in the authoring document.
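The authoring document's own techniques are not reproduced here. As one general illustration of how message delivery can be simulated for a client that can only issue requests, the sketch below uses polling: the Interaction Manager leaves messages in a mailbox, and the browser-side component checks that mailbox periodically. The Mailbox class and the message text are hypothetical.

```python
import queue


class Mailbox:
    """Holds messages for a component that cannot be contacted directly."""

    def __init__(self):
        self._pending = queue.Queue()

    def post(self, message):
        # Called by the Interaction Manager.
        self._pending.put(message)

    def poll(self):
        # Called by the browser-side component; returns None if nothing is waiting.
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


mailbox = Mailbox()
first_poll = mailbox.poll()             # nothing has been posted yet
mailbox.post("display: order summary")
second_poll = mailbox.poll()            # picks up the posted message
```

In a deployed application the poll would typically be an HTTP request repeated at a short interval, so the trade-off is between responsiveness and the number of requests.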

With the principles of the MMI Architecture and the specific examples in the authoring document, we have come a long way toward opening up multimodal applications to a wide range of developers. This will, in turn, accelerate the development of a whole new world of multimodal applications.

Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
