Standards in speech technology form an exciting and rapidly evolving world, which is why this magazine has launched this column. It will cover established speech standards, emerging standards, and innovative ways to apply them, both to improve existing applications and to create entirely new ones.
In this column's first installment we'll look at a new standard for representing user inputs: EMMA (Extensible MultiModal Annotation), from the World Wide Web Consortium (W3C).
Two very important components of voice interaction applications are speech recognizers and voice browsers (typically driven by VoiceXML documents). In voice applications, these components play very distinct roles. Speech recognizers convert the user’s speech to text, while voice browsers take action based on the meaning of the user’s speech. Conceptually these roles are separate, but in current platforms they are combined.
If speech recognizers and voice browsers were implemented as distinct components communicating with a standard interface, then both could be mixed and matched interoperably. A single voice browser could use different vendors’ speech recognizers to handle different languages, depending on which languages each was able to handle. However, this kind of separation requires agreement in the industry on a standard way of representing the output of speech recognizers.
This problem is even more severe with multimodal applications. Not only is it necessary to represent users' spoken inputs, but also inputs from a range of other modalities, such as handwriting, keyboard, mouse, and movement of the device. Multimodal inputs can even require understanding how several different types of input combine to yield a single meaning. An example would be a combination of voice and pointing, such as saying "What Italian restaurants are near here?" accompanied by a mouse click on a map. Clearly, proprietary approaches to representing user input across modalities would quickly become extremely complex and make multimodal application development very expensive.
EMMA is a new standard from the W3C to address these problems. Essentially, EMMA standardizes the representation of user inputs across modalities, the intentions behind them, and extra information about the input (the annotations). Here are some of the important features of EMMA:
• It can help to decouple speech recognizers from voice browsers. Because the EMMA format is standard, a result that one vendor's speech recognizer produces from a user's speech can then be consumed by a different vendor's voice browser.
• EMMA can uniformly represent inputs created in any of several modalities, such as speech, handwriting, typing, or pointing, so that the application doesn't have to deal with modality-specific details.
• The standard can help to combine inputs from several modalities, such as speech and pointing, to create a single interpretation.
• Through annotations, it can represent the level of certainty of an input along with alternate possibilities. This is a familiar concept in speech applications, but it can also be useful in traditional graphical user interfaces, particularly for accessibility. For example, the exact location of a mouse click might be uncertain if the user has a tremor. So instead of sending just the raw mouse coordinates to an application, an EMMA representation of a mouse click can also include several sets of alternative coordinates and the likelihood of each alternative.
• It can provide the basis for a standardized archiving format for user inputs; logging software, for example, could take an EMMA document and store it in a database. A standardized archive format based on EMMA could then provide a standard way to query the database and create reports about system performance.
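To make the uncertain-mouse-click idea above concrete, here is a small sketch (not from the column) of what such an input might look like as an EMMA document, processed with Python's standard XML library. The emma:one-of element holds the alternative interpretations, each carrying an emma:confidence annotation; these element and attribute names follow the EMMA specification, while the click payload element and its coordinate values are hypothetical.

```python
# Sketch: an EMMA document for a mouse click whose exact location is
# uncertain, with two alternative coordinate sets ranked by confidence.
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA namespace URI

emma_doc = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="click1" emma:medium="tactile" emma:mode="gui">
    <emma:interpretation id="alt1" emma:confidence="0.75">
      <click><x>312</x><y>408</y></click>
    </emma:interpretation>
    <emma:interpretation id="alt2" emma:confidence="0.25">
      <click><x>305</x><y>415</y></click>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""

root = ET.fromstring(emma_doc)

# An application can pick the alternative with the highest confidence.
interps = root.findall(f".//{{{EMMA_NS}}}interpretation")
best = max(interps, key=lambda i: float(i.get(f"{{{EMMA_NS}}}confidence")))
print(best.get("id"))  # the most likely click location
```

Because the alternatives and their confidences travel inside the document itself, the consuming application needs no modality-specific protocol to handle the uncertainty.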
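The combination of modalities mentioned earlier, such as the spoken restaurant query accompanied by a mouse click, can be sketched the same way. EMMA's emma:group element (again, an element from the EMMA specification) collects the interpretations from each modality into one composite input; the query and location payload elements here are hypothetical.

```python
# Sketch: grouping a spoken query and a mouse click into a single
# composite EMMA input using <emma:group>.
import xml.etree.ElementTree as ET

EMMA = "http://www.w3.org/2003/04/emma"  # EMMA namespace URI

composite = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group id="turn1">
    <emma:interpretation id="speech1" emma:medium="acoustic"
        emma:mode="voice"
        emma:tokens="what italian restaurants are near here">
      <query><cuisine>Italian</cuisine></query>
    </emma:interpretation>
    <emma:interpretation id="point1" emma:medium="tactile" emma:mode="gui">
      <location><x>312</x><y>408</y></location>
    </emma:interpretation>
  </emma:group>
</emma:emma>
"""

root = ET.fromstring(composite)
group = root.find(f"{{{EMMA}}}group")

# List the modes that contributed to this one composite user input.
modes = [i.get(f"{{{EMMA}}}mode")
         for i in group.findall(f"{{{EMMA}}}interpretation")]
print(modes)
```

A downstream component that fuses the two interpretations can then resolve "here" in the spoken query against the clicked map location.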
More information about EMMA can be found at www.w3.org/2002/mmi. It’s also possible to provide feedback about EMMA by sending comments to the multimodal mailing list at email@example.com.
The W3C is currently seeking implementations of EMMA to confirm the standard's usability and value in speech and multimodal applications. Getting involved is easy, and new contributors who can offer fresh ideas for the standard and help bring them to fruition are always welcome.
Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.