VoiceXML: A Developer's View

Sit in on a discussion of speech recognition topics these days, and chances are the conversation will soon turn to VoiceXML and its expected impact on the industry. But will it really register as a major seismic event in the recognition industry, or as just another tremor barely noted in passing?

VoiceXML is sometimes easier to describe by what it is not. VoiceXML is not a universal solution to making web pages voice accessible. It is not the "voice version" of HTML, and - no - it is not the same as VoxML. It is, however, a promising technology for building telephone-based speech recognition applications faster and cheaper, while leveraging a company's existing investment in web technology.

Most experts are heralding the "coming of age" of speech technology. Even the cynics grudgingly admit that Automated Speech Recognition (ASR) has arrived and that PC-class computers really can understand speech. Why, then, is the market not awash with great ASR-based "killer apps?" Why can't VCRs listen to spoken instructions and why do we still hear the dreaded "press one for a listing of options" when we call our friendly neighborhood bank?

Although excellent speech recognition engines are available, speech recognition applications are still difficult to build. Assembling a world-class ASR-enabled call center, for example, requires a background in software design, telephony, networking, client-server systems management, programming, databases, linguistics and speech recognition technology, not to mention project management and considerable political skills.

VoiceXML is an attempt to remedy this problem, at least for telephone-based applications. VoiceXML consists of a scripting language and related technologies designed for development and deployment of speech-recognition enabled software applications. Some of the most common applications of VoiceXML will be systems to allow customers to access information in a company database over the phone. Some simple examples include movie, weather, and traffic information phones, order tracking applications, e-mail readers and personal information managers. More complex applications include speech-recognition enabled call centers for catalog shopping, airline reservations, stock trades and financial services management.

Perhaps the key virtue of a VoiceXML system is its ability to retrieve and utilize information already stored in a corporate web server. This allows a company to leverage work already done in creating a web site and avoids having to directly access corporate databases.

Overview
(The basic architecture of a VoiceXML system is shown in Figure One.) A customer, using a conventional landline or cellular phone, calls a designated phone number. The call is answered by a computer system at a VoiceXML gateway site. The gateway system retrieves the initial VoiceXML script from a VoiceXML content server, which can be local to the gateway or located anywhere on the World Wide Web. A portion of the VoiceXML gateway called the interpreter parses and executes this script, playing prompts, hearing responses and passing them on to a speech recognition engine that is also part of the gateway system. When the script has collected all the necessary responses from the user, the interpreter assembles them into a request to the VoiceXML content server. The content server responds with a dynamically generated VoiceXML page containing the information requested by the user. The process can be repeated indefinitely to produce the appearance of a conversation between the user and the VoiceXML server.

Components
Any web site can be a VoiceXML content server. No special hardware or software is necessary. Servers respond to requests by returning either static or dynamically generated VoiceXML scripts, which are passed by HTTP back to the gateway.

VoiceXML scripts look very much like HTML documents. For example, a PROMPT tag indicates that the gateway system should play a prompt to the customer, either as recorded audio or as synthesized speech. A FIELD tag is used to indicate an input field. The presence of the FIELD tag is a cue to the speech recognition engine to listen for user input and interpret it according to a grammar specified in the script. Like conventional web pages, VoiceXML scripts may have embedded server-side or client (gateway-side) script. A specialized tag called OBJECT allows the incorporation of platform-specific functionality. Many VoiceXML scripts will probably contain a combination of "pure" VoiceXML and pre-written modular components, such as Java or ActiveX components.

Interpretation of the script and the interaction with the user is controlled by the VoiceXML gateway. Gateways are special collections of hardware and software that form the core of VoiceXML technology. Essentially they provide the presentation services component of VoiceXML, analogous to the web browser in conventional HTTP service. (Figure Two lists the various tasks handled by the VoiceXML gateway.)

Incoming calls are answered by the telephony services and signal-processing component. Gateway systems are provisioned in a manner similar to IVR systems and can in fact be located "downstream" of a PBX or automatic call distributor (ACD). This architecture allows callers to request transfer to a live operator if they encounter problems.

Once a call is received, the VoiceXML interpreter begins parsing and executing the instructions in the VoiceXML script. When the script indicates that user input is required, the interpreter hands off control to a speech recognition engine that "hears" and interprets the spoken response. The speech recognition component is entirely separate from the other components of the gateway. A VoiceXML interpreter can use any compatible client-server recognition engine, or even switch engines during a script to improve performance. DTMF (touch-tone) input can also be interpreted, allowing the application to use a construct like "press 0 or say operator." Grammars for speech recognition are stored with the content on the VoiceXML server. They may be in a custom format specific to the recognition engine used, or be written in the standard Java Speech Grammar Format (JSGF).
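To make the tag descriptions above more concrete, here is a minimal Python sketch that assembles a small script for the "press 0 or say operator" case. It is an illustration only: the element and attribute names follow one reading of the VoiceXML 1.0 draft and should be checked against the specification, while the function name `build_operator_script`, the field name `choice` and the grammar URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_operator_script():
    """Return a small VoiceXML-style script as a string (illustrative only)."""
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form")

    # FIELD: tells the gateway to listen for caller input.
    field = ET.SubElement(form, "field", name="choice")

    # PROMPT: what the gateway plays or speaks to the caller.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Press 0 or say operator."

    # Speech grammar stored with the content on the server
    # (hypothetical URL; the JSGF media type is an assumption).
    ET.SubElement(
        field,
        "grammar",
        src="http://example.com/grammars/choice.gram",
        type="application/x-jsgf",
    )

    # DTMF grammar so that the touch-tone key 0 is accepted as well.
    dtmf = ET.SubElement(field, "dtmf", type="application/x-jsgf")
    dtmf.text = "0"

    return ET.tostring(vxml, encoding="unicode")


print(build_operator_script())
```

A real content server would more likely emit such a page from a template. Building it with an XML library simply guarantees that the result is well formed, which matters because the interpreter parses the script before executing it.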
VoiceXML™ Forum Founders Submit VoiceXML 1.0 Specification
Piscataway, N.J. - The VoiceXML Forum recently announced that the World Wide Web Consortium (W3C) has acknowledged the submission of Version 1.0 of the VoiceXML specification. At its May 10-12 meetings in Paris, the W3C's Voice Browser Working Group agreed to adopt VoiceXML 1.0 as the basis for the development of a W3C dialog markup language.

The Forum's founding members, AT&T, IBM, Lucent Technologies and Motorola, made the W3C submission. Acknowledgement by the W3C will help to accelerate and expand the reach of the Internet through voice-enabled Web content and services. The VoiceXML Forum will host the next meeting of the W3C Voice Browser Working Group in September 2000.

"As the W3C Voice Browser Working Group begins to define the speech interface framework that extends the Web to voice-based devices, we will use VoiceXML as a model for our dialog markup language. The W3C speech interface framework will include integrated markup languages for dialog, grammar, speech synthesis, natural language semantics and multimodal dialogs, as well as a standard list of reusable dialogs," said Jim Larson of the Intel Architecture Labs, who is Co-chair of the W3C Voice Browser Working Group.

The VoiceXML 1.0 specification is based on years of research and development at AT&T, IBM, Lucent Technologies and Motorola, as well as on comments from VoiceXML Forum supporters. Since the release of VoiceXML 1.0 in March 2000, the Forum has nearly doubled its supporter membership to more than 150 companies. Based on the World Wide Web Consortium's industry-standard Extensible Markup Language (XML), Version 1.0 of the VoiceXML specification provides a high-level programming interface to speech and telephony resources for application developers, service providers and equipment manufacturers. Standardization of VoiceXML will:

  • simplify creation and delivery of Web-based, personalized interactive voice-response services;
  • enable phone and voice access to integrated call center databases, information and services on Web sites and company intranets;
  • and help enable new voice-capable devices and appliances.

The VoiceXML Forum will continue its activities to support and promote VoiceXML as a standard method for providing voice access to Internet content and services.

VoiceXML currently provides two mechanisms for generating outgoing speech or other audio. Recorded sound in WAV or similar formats can be used to speak information or prompts. A text-to-speech (TTS) engine can perform the same function. Like the recognition engine, any compatible TTS engine can be used to power a VoiceXML gateway. Recorded audio is served by specifying the URL of the WAV file. The audio files may be located on the remote web servers or cached locally on the VoiceXML gateway to reduce the traffic load between gateway and web server. Support for streaming audio is expected in the future.

Finally, the gateway also serves as an HTTP client, sending requests to the web server and receiving VoiceXML pages in return. Communications between the gateway and the content server follow the standard HTTP protocol. Outgoing requests take the form of an HTTP GET or POST request. It is this feature of VoiceXML that is key to rapid application development. Since the incoming request from the gateway looks exactly like a request from a conventional browser, a company can leverage its existing investment in web technology by using VoiceXML. Although new VoiceXML pages may need to be created, the underlying infrastructure, including database design, stored procedures and CGI scripts, can be reused with little or no modification.
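The reuse argument can be illustrated with a short Python sketch. It assumes a hypothetical order-tracking site: `ORDERS`, `lookup_order`, `render_html`, `render_vxml`, `gateway_request` and the audio URL are all invented for the example, and the VoiceXML markup follows one reading of the 1.0 draft rather than any vendor's gateway. No network traffic is involved; the point is only that the same query string and the same back-end lookup can feed both an HTML page and a VoiceXML page.

```python
from urllib.parse import parse_qs, urlencode
from xml.sax.saxutils import escape

# Made-up sample data standing in for an existing corporate back end.
ORDERS = {"1001": "shipped"}


def lookup_order(order_id):
    """Back-end logic shared by the HTML and VoiceXML front ends."""
    return ORDERS.get(order_id, "not found")


def render_html(query):
    """What a conventional browser would receive for this query string."""
    order_id = parse_qs(query).get("order", [""])[0]
    status = lookup_order(order_id)
    return (
        "<html><body><p>"
        f"Order {escape(order_id)}: {escape(status)}"
        "</p></body></html>"
    )


def render_vxml(query):
    """What the gateway would receive for the same query string."""
    order_id = parse_qs(query).get("order", [""])[0]
    status = lookup_order(order_id)
    return (
        '<vxml version="1.0"><form><block><prompt>'
        # Recorded audio is referenced by URL (hypothetical address).
        '<audio src="http://example.com/audio/order_status.wav"/>'
        # The variable part is left as text for the TTS engine.
        f"Order {escape(order_id)} is {escape(status)}."
        "</prompt></block></form></vxml>"
    )


def gateway_request(field_values):
    """Encode collected field values the way a browser encodes a form."""
    return urlencode(field_values)


query = gateway_request({"order": "1001"})
print(render_html(query))
print(render_vxml(query))
```

Only the rendering functions differ between the two front ends. The lookup, and whatever database or stored procedure sits behind it, is untouched, which is the sense in which existing infrastructure can be reused.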
What's In a Name?
Unfortunately, one of the barriers to adoption of VoiceXML has been its history of name changes. Misinformation abounds about VoiceXML, VoxML and VXML and the differences among them. The concept for a voice markup language arose simultaneously in several industrial settings including Motorola, IBM, Lucent and AT&T. Motorola based its early work on an extension of the XML standard and called it VXML. In early 1999 the four companies combined their efforts on a voice markup language under the VXML name. Motorola, concerned about intellectual property issues, cooperated with the other three, but trademarked the name VoxML to describe its own flavor of the language. Subsequently the big four settled on VoiceXML as the name for the proposed markup language and organized the VoiceXML Forum to guide development of the language as an open standard. VoiceXML is now the name of choice and eventually will be the sole flavor of the language. Until then, developers should be aware that the specifics of VoxML and VoiceXML differ and that, in general, scripts written for one will not run on an interpreter built for the other.
Voice Portals
As mentioned earlier, a VoiceXML content server may be located anywhere on the World Wide Web. In practice, many companies will choose to host their own VoiceXML gateway co-located with their web server farms. This reduces the time delay that might result from having the VoiceXML gateway separated from the content server by miles or even continents. It also allows the company to control the quality of the presentation of the content by specifying its own choice of recognition engines and TTS voices. By controlling both the gateway and the content servers, companies can also reduce the possibility of security problems.

Despite the advantages, some companies may choose to provide content only, and allow gateway services to be handled by independent "voice portal" services. More than a dozen such voice portal companies are now planned. Early promoters of VoiceXML technology suggested that such voice portals were the wave of the future and that existing web pages could be "voicified" by the addition of a few VoiceXML tags. In fact the goal of turning existing web pages into "voice pages" has proven elusive. The basic design principles that make a good web page are very different from those used to create a high-quality voice interface. Most high-quality voice content on the web will originate from pages designed from the ground up to serve the VoiceXML market.

VoiceXML is only now approaching its first birthday and is very much a technology in the making. VoiceXML is an open standard. Anyone can write a VoiceXML script or a VoiceXML interpreter. The VoiceXML Forum, which serves as a governing organization, has recently released its Version 1.0 specification (see "VoiceXML Forum" sidebar) and at least two companies offer free toolkits related to VoiceXML or VoxML. Developers hoping to plunge into commercial development may have to wait a while. At this writing, no commercial-grade VoiceXML gateways or development products have been released, although several are in the pipeline.
Steve Ihnen is the chief technology officer of Voice Applications, Inc. and can be reached at steve_1@nwlink.com.