Enter EMMA 1.1

Commercial speech platforms for IVR began to emerge around 15 years ago. They were completely proprietary and vendor-specific, and it was expensive, time-consuming, and difficult to create speech applications.

The W3C VoiceXML standard, introduced in 1999, made speech applications much easier to build, and led directly to today's multibillion-dollar voice-enabled IVR industry. However, as the number of VoiceXML implementations began to increase, the need for a standard interface between speech recognizers and VoiceXML platforms became evident.

In response, the W3C Multimodal Interaction Working Group worked to standardize a way of representing the results of speech recognition, resulting in the EMMA 1.0 specification, which became a formal W3C standard in 2009. Though originally intended to support speech recognizer interfaces, EMMA 1.0 goes well beyond, representing inputs from other modalities (ink, camera, biometrics), supporting integration of composite multimodal inputs that combine speech and clicking or typing, and providing detailed annotations such as timestamping, which can be extremely useful for logging, analysis, and tuning.
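To make that representation concrete, here is a minimal Python sketch that reads one hand-written EMMA 1.0 style result and pulls out the fields a logging or tuning tool might want. The utterance, confidence score, timestamps, and the origin and destination elements are invented for illustration. The namespace and the emma:tokens, emma:confidence, emma:start, and emma:end annotations come from EMMA 1.0.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace

# Hand-written sample; the utterance, score, and timestamps are invented.
SAMPLE = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.75"
      emma:tokens="flights from boston to denver"
      emma:start="1328788800000" emma:end="1328788802500">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>
"""


def summarize(emma_xml):
    """Pull the fields a tuning log would typically want from one result."""
    root = ET.fromstring(emma_xml.strip())
    interp = root.find(f"{{{EMMA_NS}}}interpretation")

    def annotation(name):
        # EMMA annotations are attributes in the EMMA namespace.
        return interp.get(f"{{{EMMA_NS}}}{name}")

    # emma:start and emma:end are absolute times in milliseconds.
    duration_ms = int(annotation("end")) - int(annotation("start"))
    return {
        "id": interp.get("id"),
        "tokens": annotation("tokens"),
        "confidence": float(annotation("confidence")),
        "duration_ms": duration_ms,
        # Application-specific child elements carry the interpretation itself.
        "slots": {child.tag: child.text for child in interp},
    }


summary = summarize(SAMPLE)
print(summary)
```

Because the recognizer's words, score, and timing travel together in one document, a logging tool can be written once and reused across recognizers that emit EMMA.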

EMMA 1.0 has been implemented a number of times, including in AT&T's Speech Mashup, Openstream's Cue-Me platform, Microsoft's Tellme platform, and Microsoft Office 2010 (for ink input). The Multimodal Interaction Working Group has received considerable feedback from these commercial implementations about features that would make EMMA easier to use, more powerful, and more convenient.

The first draft of EMMA 1.1, published February 9, 2012, addresses this feedback. This column looks at just two of the new features: support for human annotation and better integration with Emotion Markup Language (EmotionML). Readers are encouraged to go to the spec (http://www.w3.org/TR/emma11) to see the full set of new features.

Support for Human Annotation

During the development of a commercial speech application, and even after deployment, measuring and tuning recognizer accuracy is critical to success. For example, initial recognizer performance in development may be low because speech grammars need to be adjusted and recognizer parameters, such as confidence thresholds and timeouts, are not yet optimal. The first step in accuracy measurement and tuning is to compare what the recognizer actually did with what it should have done. Users might say words that developers didn't anticipate, which leads to incorrect recognitions (so-called "out of vocabulary" errors). So part of the testing process involves human annotators listening to the users' speech and deciding what the correct recognition should have been. If the annotators find that words or phrases are missing, those words can be added to the grammar. Clearly, human annotation of the results from processors such as speech recognizers is extremely important, and it would be convenient if it were supported in a standard, vendor-independent way. Although human annotation can be done in EMMA 1.0 using the open-ended "info" feature, that version offers no standard way to annotate recognizer results.

EMMA 1.1 provides a new annotation element that allows the correct result to be recorded, as well as information about the annotator and the annotator's confidence in the annotations. This will make it much easier to analyze speech recognition results.
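As a rough illustration of that kind of analysis, the Python sketch below attaches a human transcript to each recognizer result and then compares the two. It is a sketch under stated assumptions, not the markup defined in the draft: only the existence of an annotation element comes from the description above, while the "annotator" and "confidence" attribute names, the use of text content for the transcript, and all of the sample utterances are hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace
ET.register_namespace("emma", EMMA_NS)


def q(name):
    """Qualify a local name with the EMMA namespace."""
    return f"{{{EMMA_NS}}}{name}"


def annotated_result(result_id, recognized, transcript, annotator, confidence):
    """Build one recognizer result with a human annotation attached."""
    interp = ET.Element(q("interpretation"), {"id": result_id})
    interp.set(q("tokens"), recognized)  # what the recognizer heard
    # ASSUMPTION: the attribute names and the text content used here are
    # placeholders, not the names defined in the EMMA 1.1 draft.
    ann = ET.SubElement(interp, q("annotation"))
    ann.set("annotator", annotator)
    ann.set("confidence", f"{confidence:.2f}")
    ann.text = transcript  # what the annotator decided was said
    return interp


def review(results, vocabulary):
    """Return (sentence accuracy, sorted out-of-vocabulary words)."""
    correct = 0
    missing = set()
    for interp in results:
        recognized = interp.get(q("tokens")).lower().split()
        reference = interp.find(q("annotation")).text.lower().split()
        if recognized == reference:
            correct += 1
        # Words the user said that the grammar cannot recognize.
        missing.update(w for w in reference if w not in vocabulary)
    return correct / len(results), sorted(missing)


# Hypothetical grammar vocabulary and test utterances.
GRAMMAR_WORDS = {"i", "want", "a", "flight", "to", "boston", "denver"}
RESULTS = [
    annotated_result("r1", "a flight to boston", "a flight to boston", "annotator1", 0.95),
    annotated_result("r2", "a flight to denver", "a ticket to denver", "annotator1", 0.90),
    annotated_result("r3", "i want a flight", "i need a flight to austin", "annotator2", 0.80),
]

accuracy, oov = review(RESULTS, GRAMMAR_WORDS)
print(f"sentence accuracy: {accuracy:.2f}")
print("words to add to the grammar:", oov)
```

With these made-up utterances, only the first result matches its annotation, and the sketch flags "austin," "need," and "ticket" as candidates for the grammar. Once the correct result sits next to the recognizer's result in a standard format, this comparison no longer depends on any vendor's log layout.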

Integration with EmotionML

EmotionML is a language for representing emotions that the Multimodal Interaction Working Group is standardizing along with EMMA 1.1. It is designed to represent the results of processors that recognize emotions from face, voice, or other modalities; to provide input to TTS or avatars that express emotion; and to support research on emotion. Identifying strong emotions in customer calls is clearly important in call centers. How emotions are expressed is also important, so while EMMA 1.0 supports annotation of the medium (acoustic, visual, or tactile) and the mode (voice, mouse, keyboard, pen, camera, etc.) of an input, EMMA 1.1 adds "expressed-through" for more precision. For example, if emotion recognition was done with computer vision techniques operating on a video, the medium would be "visual" and the mode would be "video." However, it would be useful to have more specific information on exactly how the emotion was expressed. An emotion expressed through the face would be represented as expressed-through="face"; an emotion expressed through body motion would be represented as expressed-through="locomotion".
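The video example can be sketched in Python as follows. The medium, mode, and expressed-through values are the ones given above. Placing expressed-through in the EMMA namespace alongside medium and mode is an assumption, the result ids are invented, and the EmotionML payload that would describe the emotion itself is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace
ET.register_namespace("emma", EMMA_NS)


def q(name):
    """Qualify a local name with the EMMA namespace."""
    return f"{{{EMMA_NS}}}{name}"


def emotion_result(result_id, expressed_through):
    """Build an interpretation for an emotion recognized from video."""
    interp = ET.Element(q("interpretation"), {"id": result_id})
    interp.set(q("medium"), "visual")
    interp.set(q("mode"), "video")
    # ASSUMPTION: expressed-through sits in the EMMA namespace, like
    # medium and mode; check the draft for the exact attribute name.
    interp.set(q("expressed-through"), expressed_through)
    # The EmotionML description of the emotion would be added as a child
    # element here; it is omitted from this sketch.
    return interp


# Hypothetical results from a video-based emotion recognizer.
results = [
    emotion_result("e1", "face"),
    emotion_result("e2", "locomotion"),
    emotion_result("e3", "face"),
]

# Medium and mode are identical for all three results; only the new
# annotation says how each emotion was expressed.
by_channel = Counter(r.get(q("expressed-through")) for r in results)

print(ET.tostring(results[0], encoding="unicode"))
print(dict(by_channel))
```

All three results share the same medium and mode, so without the extra annotation a consumer could not tell a facial expression from body motion.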

EMMA 1.1 is at a very early stage of development. The Multimodal Interaction Working Group is very interested in comments, which can be sent to www-multimodal@w3.org.

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.
