Updating the Standard for Spoken Dialogues
VoiceXML has catalyzed the growth of the speech industry over the past 10 years by providing a standard markup language for defining voice dialogues. As VoiceXML was used in applications, requests for new features emerged, and the World Wide Web Consortium’s (W3C) Voice Browser Working Group is now working on a new version. The latest draft specification of VoiceXML 3.0 was published in August. To find out more about it, I recently talked with Dan Burnett, director of speech technologies at Voxeo and co-editor of VoiceXML 3.0.
Dahl: Why did the Voice Browser Working Group decide to work on a new VoiceXML version?
Burnett: There are three reasons we created VoiceXML 3.0. First, there have been requests for new features. In particular, we’ve had requests for better media control and for speaker identification and verification (SIV) capabilities. Second, there are many people who would like extensions or alternatives to the VoiceXML Form Interpretation Algorithm (FIA), which defines how VoiceXML markup is to be executed, or who would like to have no FIA. Third, there are other W3C specifications outside of the voice and multimodal areas that we would like to be able to work with VoiceXML, such as HTML and SMIL (Synchronized Multimedia Integration Language).
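As background, the FIA that Burnett mentions is the loop in VoiceXML 2.x that visits each unfilled form item in document order, plays its prompt, and collects the caller’s input. A minimal VoiceXML 2.1 form of the kind the FIA executes might look like the following sketch; the form name, field names, and submit target are illustrative, not taken from any specification:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="transfer">
    <!-- The FIA visits each unfilled <field> in document order. -->
    <field name="amount" type="currency">
      <prompt>How much would you like to transfer?</prompt>
    </field>
    <field name="confirm" type="boolean">
      <prompt>Transfer <value expr="amount"/>. Is that correct?</prompt>
      <filled>
        <if cond="confirm">
          <!-- Illustrative server URL -->
          <submit next="transfer.jsp" namelist="amount"/>
        <else/>
          <!-- Clearing the fields makes the FIA revisit them. -->
          <clear namelist="amount confirm"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The `currency` and `boolean` built-in grammar types spare the author from writing grammars for these common field types; it is this fixed visit-prompt-fill cycle that some developers want to extend or bypass in VoiceXML 3.0.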
Dahl: What’s new in VoiceXML 3.0?
Burnett: To support these feature requests, we divided the functionality of VoiceXML 2.1 into modules connected by an extensible framework. In addition, we are defining profiles, which are collections of modules providing desired combinations of functionality. For example, there will be a legacy profile, whose goal is to be as similar to VoiceXML 2.1 as we can manage. We will also define a basic profile with simple media capabilities, the ability to play prompts and recognize speech, but without VoiceXML forms or the FIA.
Specific new functionality includes improved media control, with support for video; SIV capabilities; and media synchronization primitives, which not only enhance media control but also allow more precise specification of simultaneous input handling. In VoiceXML 3.0, speech recognition, recording, and SIV can all be done simultaneously. VoiceXML 3.0 will also offer more extensibility in terms of how module functionality is pieced together. For example, we’re considering allowing author override of default interform and interdocument transitions.
Dahl: What kinds of new applications will VoiceXML 3.0 make possible?
Burnett: VoiceXML 3.0 will allow for authoring everything from simple video players all the way up to sophisticated conversational systems. Because VoiceXML 3.0 will integrate easily with other W3C languages while conforming to the Multimodal Architecture, it should be simpler to build mashups in which voice is only one component.
Dahl: Can VoiceXML 3.0 be used to develop multimodal applications?
Burnett: Yes. One could construct an interaction manager in the Multimodal Architecture using SCXML, a graphical user interface modality component based on HTML, and a voice modality component using VoiceXML 3.0.
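To make the composition Burnett describes concrete, an interaction manager written in SCXML might coordinate the two modality components roughly as follows. This is only a sketch: the component names (`voice-mc`, `gui-mc`) are invented for illustration, and the life-cycle event names are abbreviated from the Multimodal Architecture’s event vocabulary rather than quoted from the VoiceXML 3.0 draft:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="welcome">
  <!-- Illustrative interaction manager; targets and event
       names are assumptions, not drawn from the draft. -->
  <state id="welcome">
    <onentry>
      <!-- Ask the VoiceXML 3.0 modality component to start its dialogue. -->
      <send target="voice-mc" event="mmi:StartRequest"/>
      <!-- Ask the HTML modality component to display the matching page. -->
      <send target="gui-mc" event="mmi:StartRequest"/>
    </onentry>
    <transition event="mmi:DoneNotification" target="finished"/>
  </state>
  <final id="finished"/>
</scxml>
```

In this arrangement, the SCXML state machine owns the application flow, while each modality component handles only its own medium and reports back through life-cycle events.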
Dahl: Will existing VoiceXML applications still work with VoiceXML 3.0?
Burnett: The legacy profile will be as similar to VoiceXML 2.0 as we can manage. It may not be 100 percent code-compatible with VoiceXML 2.0, but someone who can code in VoiceXML 2.0 should have no difficulty coding in the legacy profile.
Dahl: How long will it be before VoiceXML 3.0 is a standard?
Burnett: The specification is currently in the working draft stage and making good progress. We are aiming for all core functionality to be in the specification by the end of 2010. Voxeo is committed to seeing VoiceXML 3.0 progress and to seeing it become an effective successor to VoiceXML 2.0.
Dahl: How can Speech Technology magazine readers learn more and get involved with VoiceXML 3.0?
Burnett: The latest draft of the specification is available at www.w3.org/TR/voicexml30, and the Working Group welcomes comments. Send public comments to the mailing list email@example.com. To help drive the specification, you can join the W3C and the Voice Browser Working Group. Contact me at firstname.lastname@example.org, or Jim Larson at email@example.com, for more information.
Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.