November 7, 2005
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

VoiceXML on Steroids

Researchers and practitioners are extending VoiceXML in several ways to provide new functionality. These extensions include the RDC tag library, the xHMI meta language, and a prototype VoiceXML implementation that supports dictation speech recognition.

RDC Tag Library – Developers frequently use Struts or other application frameworks to generate HTML. The goal of the Reusable Dialog Component (RDC) project is to provide a similar framework for VoiceXML. Like Struts, RDC has a tag library that hides the details of VoiceXML markup and embodies the best practices of VUI design, thus making it possible for developers to easily create VoiceXML applications.

The first release of RDC supports about two dozen components, including <date>, <time>, <creditcardAmount>, <creditcardNumber>, <creditcardType>, <usMajorCity>, and <usState>. These components are essentially variations of the VoiceXML <field> element, and each contains references to a prompt and a grammar. Because those references are easy to redirect to alternative prompts and grammars, the same components can support other spoken languages.
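To make the prompt and grammar references concrete, here is a minimal Python sketch of what such a component boils down to. It is not the RDC implementation (RDC is a JSP tag library); the `RESOURCES` table, the file names, and the `render_field` helper are all hypothetical. The sketch emits a VoiceXML <field> whose prompt and grammar references are chosen by language.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-language resource table; every path here is made up.
RESOURCES = {
    "en-US": {"prompt": "prompts/en/state.wav", "grammar": "grammars/en/state.grxml"},
    "fr-FR": {"prompt": "prompts/fr/state.wav", "grammar": "grammars/fr/state.grxml"},
}

def render_field(name, lang):
    """Return VoiceXML <field> markup that references the prompt and grammar for lang."""
    refs = RESOURCES[lang]
    field = ET.Element("field", {"name": name})
    prompt = ET.SubElement(field, "prompt")
    ET.SubElement(prompt, "audio", {"src": refs["prompt"]})
    ET.SubElement(field, "grammar", {"src": refs["grammar"], "type": "application/srgs+xml"})
    return ET.tostring(field, encoding="unicode")

# The same component, redirected to two different languages.
print(render_field("state", "en-US"))
print(render_field("state", "fr-FR"))
```

Only the references change between the two calls; the structure of the generated field stays the same, which is the property that lets one component serve several spoken languages.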

Developers may combine atomic components into composite components, for example, combining the atomic components <creditcardNumber>, <creditcardType>, and <creditcardExpiry> into the <creditcardInfo> composite component. The <group> element aggregates several components that may be active at the same time. There are currently two management strategies for groups:

(1) a simple directed dialog, in which the child components execute in document order, producing the system-directed dialogs widely used in VoiceXML, and (2) a rule-based directed dialog, in which the child components execute according to rules that the developer defines in a "navigational rule set." Depending on that rule set, the rule-based strategy can prompt users for empty fields in the same sequence as a corresponding paper form, in an order based on values the user has already spoken, or in an order determined by values retrieved from a database.
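The difference between the two strategies can be sketched in a few lines of Python. This is an illustration only, not RDC's rule syntax: the function names, the component names, and the representation of a rule as a (condition, target) pair are assumptions, as is the fallback to document order when no rule applies.

```python
def document_order(components, filled):
    """Simple directed strategy: pick the next unfilled child in document order."""
    for name in components:
        if name not in filled:
            return name
    return None

def rule_based(components, rules, filled):
    """Rule-based strategy: the first rule whose condition holds picks the next child."""
    for condition, target in rules:
        if target not in filled and condition(filled):
            return target
    # Assumption: fall back to document order when no rule applies.
    return document_order(components, filled)

def run(choose_next, answers):
    """Fill components one at a time and record the order in which they were asked."""
    filled, order = {}, []
    while (name := choose_next(filled)) is not None:
        filled[name] = answers[name]
        order.append(name)
    return order

COMPONENTS = ["cardType", "cardNumber", "expiry"]
ANSWERS = {"cardType": "visa", "cardNumber": "test-number", "expiry": "12/07"}
# Hypothetical rule: once the card type is known to be "visa", ask for the expiry next.
RULES = [(lambda filled: filled.get("cardType") == "visa", "expiry")]

simple = run(lambda filled: document_order(COMPONENTS, filled), ANSWERS)
ruled = run(lambda filled: rule_based(COMPONENTS, RULES, filled), ANSWERS)
print(simple)  # ['cardType', 'cardNumber', 'expiry']
print(ruled)   # ['cardType', 'expiry', 'cardNumber']
```

In the second run the order depends on a value the user has already spoken, which is one of the orderings a navigational rule set is meant to allow.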

The RDC template provides a mechanism for rapid prototyping before committing to a specific component for production. Developers may start with a template of generic components, refine them iteratively based on user testing, and finally commit the improved components to production.

RDC also supports <push>, <pop>, and <peek> for managing a stack. A component may also contain a finite state machine, a powerful mechanism for defining dialogs. The RDC tag library² requires JSP 2.0.
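As a rough illustration of how a finite state machine can define a dialog, the Python sketch below drives a small ask-and-confirm exchange from a transition table. The state names, event names, and the `run_machine` helper are hypothetical and are not taken from RDC.

```python
# Transition table: (current state, event) -> next state. All names are hypothetical.
TRANSITIONS = {
    ("ask", "heard"): "confirm",
    ("ask", "no_input"): "ask",      # re-prompt when the user says nothing
    ("confirm", "yes"): "done",
    ("confirm", "no"): "ask",        # start over when the user rejects the value
}

def run_machine(events, start="ask"):
    """Feed events to the machine and return the sequence of states visited."""
    state = start
    path = [state]
    for event in events:
        # Assumption: events with no matching transition leave the state unchanged.
        state = TRANSITIONS.get((state, event), state)
        path.append(state)
    return path

print(run_machine(["no_input", "heard", "no", "heard", "yes"]))
# ['ask', 'ask', 'confirm', 'ask', 'confirm', 'done']
```

The whole dialog lives in the table, so adding a new branch, such as a help request, means adding a row rather than rewriting control flow.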

xHMI (an XML language for flow control and configuration of server-side dialog systems) – The xHMI initiative supports additional tags that streamline dialogs to make them more "natural." For example, the tag

<verify yesno="YESNO" vcl="state city"/>

results in the explicit verification of values for state and city:

System: What state?
User: Oregon
System: Did you say Oregon?
User: Yes
System: What city?
User: Portland
System: Did you say Portland?
User: Yes

while the tag

<verify actor="city" vcl="state"/>

results in the implicit verification of state as the user is prompted to speak the name of the city:

System: What state?
User: Oregon
System: What city in Oregon?
User: Portland

The latter dialog requires only two turns, whereas the former requires four.
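The turn counts can be checked with a short Python simulation. This sketch does not implement xHMI or interpret its attributes; the function names are hypothetical, and the user is assumed to answer every confirmation with "Yes." It only reproduces the two dialog shapes shown above and counts their turns.

```python
def explicit_dialog(answers):
    """Ask each field, then confirm it with a separate yes/no question."""
    turns = []
    for field, value in answers.items():
        turns.append((f"What {field}?", value))
        turns.append((f"Did you say {value}?", "Yes"))  # assumes the user always confirms
    return turns

def implicit_dialog(answers):
    """Echo the previous answer inside the next prompt instead of asking to confirm it."""
    turns = []
    previous = None
    for field, value in answers.items():
        if previous is None:
            prompt = f"What {field}?"
        else:
            prompt = f"What {field} in {previous}?"
        turns.append((prompt, value))
        previous = value
    return turns

ANSWERS = {"state": "Oregon", "city": "Portland"}
print(len(explicit_dialog(ANSWERS)))  # 4
print(len(implicit_dialog(ANSWERS)))  # 2
```

The savings come with a trade-off visible in the sketch: the last value in the implicit dialog (the city) is never echoed back, so only the state is verified.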

Prototype VoiceXML Implementation Using Dictation – Researchers at the Université des Sciences et Technologies de Lille⁵ have implemented a <transcribe> tag within VoiceXML that converts speech to text using a dictation recognition engine rather than the conversational speech engines traditionally used in VoiceXML applications. The <transcribe> tag recognizes free-form text without a developer-specified grammar. This tag could be useful in several situations. For example, it lets the system handle utterances that users speak but that the application does not model, and then provide feedback to users by reusing their input.

Researchers and practitioners are adding new tags to the standard tags of VoiceXML 2.0 and 2.1. Many of these tags will find their way into VoiceXML 3.0. In the meantime, using these new tags can help developers build higher-quality speech applications faster and more efficiently, but at the cost of reduced portability, because the resulting applications will not run on platforms that do not support the new tags.

James A. Larson is manager of advanced human input/output at Intel and author of the home study guide and reference The VXMLGuide http://www.vxmlguide.com/ . His Web site is http://www.larson-tech.com/ .
