September 10, 2012
By Deborah Dahl, Principal, Conversational Technologies
Standards

Discovering Multimodal Components

When we use applications on smartphones and tablets, we tend to think about interacting just with the device and, in some cases, with a cloud-based server as well. But this doesn't have to be the case. An application can support interactions that span multiple devices, with each device contributing services that take advantage of its unique capabilities. For example, a smartphone is more convenient than a large monitor for interacting with touch, but a large monitor is better for displaying information to a group of people. Why not take advantage of the capabilities of multiple devices to do what each does best, and control the large display through the smartphone?

But an application that uses capabilities from multiple devices first needs to find them. Capabilities local to a device are easy to find. A developer can just tell the application to use the microphone when it needs to capture speech input. Specifications like the W3C's Media Capture API provide a uniform API for accessing a device's own microphone and cameras, but they can't find a capability that exists in the environment and is provided by a different device.

The W3C Multimodal Interaction Working Group has recently published a note that starts to address this issue. The note is a compilation of use cases and requirements for registration and discovery of multimodal modality components. To give you an idea of the things that could be possible with standardized ways to discover and register multimodal components, here are a few examples from the note.

1. Smart cars. Automobile functions could be standardized so that they can be managed by other devices, allowing users to control and personalize the car's various adjustments (seat, navigation, radio, or temperature settings) simply by using their mobile device, with their preferences stored in the cloud or on the device.

2. Intelligent conference rooms. Today's conference rooms are full of sophisticated technology, including projectors, audio systems, interactive whiteboards, and video systems that support remote participation. Using this technology isn't always easy. A standard way for participants' devices to discover and make use of the conference room's services could be extremely valuable.

3. Health notifiers. The variety and capabilities of medical sensors and imaging devices are continually increasing. Every sensor has its own user interface, which may be very complex. In many cases, users have to undergo special training to operate the device and interpret its displays, which raises the possibility of medical errors caused by user mistakes. However, a multimodal interface on a mobile device could provide standardized options to control sensor operations by voice or gesture. Applications could synthesize information from multiple sensors (for example, blood pressure and heart rate) and transmit a notification to the doctor in the event of an unusual combination of readings.

4. Smart home. Home appliances, entertainment systems, and other devices can be controlled and monitored through a user's own device, either at home or remotely. Adult children could monitor the systems at an elderly parent's home to make sure that the temperature is not getting too high or too low.

5. Public spaces. Users could access public information (a map of a mall, facts about a historical site, or information about the exhibits in a museum) through their devices. Public multidevice/multiuser applications could be used for fun, such as for musical interaction or controlling a lighting display.

The W3C Discovery and Registration work builds on the W3C Multimodal Architecture and Interfaces (MMI Architecture) specification, which provides a high-level communication API for the components of multimodal systems. The new work adds several ideas to the MMI Architecture, including a manifest that components can publish to provide information about:

The identity of the component;
The behavior of the component;
The list of content or media handled by the component;
The commands or actions to which the component responds;
Information about the context of use of the component; and
The semantics of the component for a specific domain.
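
To make the manifest idea concrete, here is a minimal sketch in Python of what such a record might hold. The note does not define a manifest format, so the class name, field names, and sample values below are all assumptions chosen only to mirror the six items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ComponentManifest:
    """Hypothetical manifest; every field name here is illustrative."""
    identity: str                                             # identity of the component
    behavior: str                                             # behavior of the component
    media: List[str] = field(default_factory=list)            # content or media handled
    commands: List[str] = field(default_factory=list)         # commands it responds to
    context: Dict[str, str] = field(default_factory=dict)     # context of use
    semantics: Dict[str, str] = field(default_factory=dict)   # domain-specific semantics

    def to_json(self) -> str:
        """Serialize the manifest so it could be published to other devices."""
        return json.dumps(asdict(self), indent=2)


# Example: a large display in a conference room (values are made up).
display = ComponentManifest(
    identity="conference-room-display-1",
    behavior="renders visual output for a group",
    media=["image/png", "text/html"],
    commands=["show", "clear"],
    context={"location": "conference room"},
    semantics={"domain": "meetings"},
)

print(display.to_json())
```

A smartphone application that received this record would know, without any device-specific code, that the display accepts "show" and "clear" commands and can render HTML or PNG content.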

The note also discusses the processes of discovery (or finding available components), registration of components, and, once the components have become part of an application instance through registration, querying them. Future versions of the note will elaborate on these. The discovery and registration work is at a very early stage of development. The first publication covers use cases and requirements and discusses general ideas about how distributed components can interact. This is an ideal time to provide input on these ideas. Please send comments to the group's mailing list at www-multimodal@w3.org.
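
For readers who want a feel for how discovery, registration, and querying fit together, the following Python sketch models the three steps in memory. It is an illustration under stated assumptions, not the note's design: the Component and Registry classes, their method names, and the sample components are all hypothetical, and a real system would discover components over a network rather than from a list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Component:
    """Hypothetical stand-in for a modality component in the environment."""
    identity: str
    commands: List[str]


class Registry:
    """Hypothetical in-memory registry for one application instance."""

    def __init__(self) -> None:
        self._registered: Dict[str, Component] = {}

    def discover(self, advertised: List[Component], command: str) -> List[Component]:
        """Discovery: find advertised components that respond to a command."""
        return [c for c in advertised if command in c.commands]

    def register(self, component: Component) -> None:
        """Registration: make the component part of this application instance."""
        self._registered[component.identity] = component

    def query(self, identity: str) -> Optional[Component]:
        """Querying: look up a registered component; None if it is not registered."""
        return self._registered.get(identity)


# Components the environment advertises (made-up examples).
advertised = [
    Component("wall-display", ["show", "clear"]),
    Component("room-audio", ["play", "mute"]),
]

registry = Registry()
found = registry.discover(advertised, "show")   # step 1: discovery
for component in found:
    registry.register(component)                # step 2: registration

print(registry.query("wall-display"))           # step 3: querying
print(registry.query("room-audio"))             # never registered
```

Only the display is registered here, because it is the only advertised component that responds to "show"; the audio system stays discoverable but is not part of the application instance.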

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
