State-of-the-Art Speech Application Development

Writing speech applications with proprietary languages used to be difficult. The W3C speech framework languages (VoiceXML 2.0 and 2.1 for specifying dialogues, SRGS for specifying grammars, SSML for specifying how to render text as voice, and CCXML for specifying call management) have greatly simplified speech application development by providing standard, portable languages for a growing speech application development community.
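To make the division of labor among these languages concrete, here is a minimal sketch of a single VoiceXML 2.0 document that embeds an SRGS grammar and SSML markup, wrapped in a few lines of Python that only check that the markup is well formed. The form, field, and menu items are invented for illustration, and CCXML is left out because call control normally lives in a separate document.

```python
import xml.etree.ElementTree as ET

# Illustrative document only: the "order" form and the drink choices are made up.
VXML_DOC = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="order">
    <field name="drink">
      <!-- SSML elements inside a VoiceXML prompt control how text is spoken -->
      <prompt>
        Would you like <emphasis>coffee</emphasis> <break time="300ms"/> or tea?
      </prompt>
      <!-- An inline SRGS (XML form) grammar lists what the caller may say -->
      <grammar version="1.0" root="drink" mode="voice" type="application/srgs+xml">
        <rule id="drink">
          <one-of>
            <item>coffee</item>
            <item>tea</item>
          </one-of>
        </rule>
      </grammar>
      <!-- VoiceXML dialogue logic: runs once the field has a value -->
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""

VXML_NS = "{http://www.w3.org/2001/vxml}"

# Parsing confirms well-formedness only; it does not validate against the DTD.
root = ET.fromstring(VXML_DOC)
items = [item.text for item in root.iter(VXML_NS + "item")]
print(items)
```

Running the document in an actual voice browser requires one of the platforms listed in the tables that follow; the Python wrapper is only a convenient way to inspect the markup.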

Open source browsers
With the increasing popularity of speech applications, several open source browsers (Table 1) and speech engines (Table 2) are now available. The VoiceXML Forum has created the VoiceXML Student Resource Exchange (http://www.voicexml.org/resources/vxml_university/education_exchange.html) for listing resources available to instructors of university and college courses involving VoiceXML training and research.

Table 1: Available Open Source Browsers

Name	URL	Capabilities
OpenVXI	http://fife.speech.cs.cmu.edu/openvxi/	VoiceXML 2.0 browser with simulated speech recognition, prompt, and TTS capabilities
publicVoiceXML	http://www.publicvoicexml.org/	VoiceXML 2.0 browser using Festival synthesis software; speech recognition is not currently working
OpenSALT	http://hap.speech.cs.cmu.edu/salt/	SALT 1.0 browser using open source Sphinx recognition and Festival synthesis software
Oktopous	http://www.phonologies.com/license.txt	CCXML 1.0 interpreter
CCXML4J	http://sourceforge.net/projects/ccxml4j/	CCXML 1.0 interpreter written in Java

Table 2: Available Open Source Speech Engines

Name	URL	Capabilities
Festival	http://www.cstr.ed.ac.uk/projects/festival/	Speech synthesis engine
Sphinx	http://cmusphinx.sourceforge.net/html/cmusphinx.php	Speech recognition engine

Commercial quality speech platforms containing both browsers and speech engines are available for license from several VoiceXML platform vendors, who also provide reusable audio clips, grammars, and subdialogs—but only for use with their own platforms.

Reusable dialog components
Enter IBM… In September, IBM contributed its Reusable Dialog Components (RDCs) to the Apache Software Foundation and its markup editors for the W3C standard speech languages to Eclipse. Apache will provide the RDCs, and Eclipse the editors, as open source tools.

Developed by IBM Research, RDCs are JavaServer Pages (JSP) tags that enable dynamic development of voice applications. JSPs that incorporate RDC tags automatically generate W3C VoiceXML 2.0 at runtime, and that VoiceXML can execute on any VoiceXML-compliant platform. Speech dialogs built using RDCs will work together, regardless of the vendor that created them. Both the framework and a set of example tags are to be contributed to the Apache Software Foundation (http://www.apache.org/).
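The idea of generating VoiceXML at runtime from a reusable component can be sketched in a few lines. The example below is an analogy written in Python, not the RDC tag library: the function name render_choice_dialog and its parameters are hypothetical, and real RDCs are JSP tags whose API is not shown in this column. It only illustrates how a server-side component can turn a handful of parameters into a complete VoiceXML 2.0 dialog.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_choice_dialog(field_name, prompt_text, choices):
    """Hypothetical component: return VoiceXML 2.0 asking the caller to pick one choice."""
    # The namespace is written as a plain xmlns attribute on the root element,
    # so every child element inherits it when the document is parsed.
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": field_name + "_form"})
    field = ET.SubElement(form, "field", {"name": field_name})
    ET.SubElement(field, "prompt").text = prompt_text

    # Inline SRGS grammar built from the list of allowed answers.
    grammar = ET.SubElement(
        field, "grammar", {"version": "1.0", "root": field_name, "mode": "voice"}
    )
    rule = ET.SubElement(grammar, "rule", {"id": field_name})
    one_of = ET.SubElement(rule, "one-of")
    for choice in choices:
        ET.SubElement(one_of, "item").text = choice

    return ET.tostring(vxml, encoding="unicode")


# The same component serves different dialogs by changing only its parameters.
doc = render_choice_dialog("drink", "Would you like coffee or tea?", ["coffee", "tea"])
print(doc)
```

Because the output is standard VoiceXML rather than a vendor format, any compliant browser could in principle execute it, which is the portability argument made above.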

IBM's contribution of speech markup editors to Eclipse (http://www.eclipse.org/proposals/eclipse-voicetools/index.html) is aimed at making it easier for developers to write standards-based speech applications as well as create and use RDCs within those applications. Initially, the voice tools will consist of editors for VoiceXML, the XML form of SRGS (Speech Recognition Grammar Specification), and CCXML (Call Control eXtensible Markup Language). These editors will syntactically validate the markup and provide content assistance for the tags of each of the W3C standards. In addition, a new-file wizard will be provided for each markup type to create an "empty" file based on the required markup defined by the chosen DTD.
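As a rough picture of what a new-file wizard and a syntax check do, the sketch below produces an "empty" skeleton for each of the three markup types and reports the position of the first syntax error in a document. It is an assumption-laden stand-in, not the Eclipse tooling: the helper names new_file and check_well_formed are hypothetical, the choice of attributes in each skeleton is a guess at a reasonable minimum, DOCTYPE declarations are omitted, and the check covers well-formedness only, not validation against a DTD.

```python
import xml.etree.ElementTree as ET

# Root element for each markup type. The namespaces come from the W3C
# specifications; which attributes an "empty" file should carry is an
# assumption, not what the Eclipse wizard actually emits.
SKELETONS = {
    "vxml": '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"/>',
    "srgs": '<grammar version="1.0" mode="voice" xml:lang="en-US" '
            'xmlns="http://www.w3.org/2001/06/grammar"/>',
    "ccxml": '<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml"/>',
}


def new_file(kind):
    """Hypothetical wizard: return the text of an empty file of the given kind."""
    return '<?xml version="1.0"?>\n' + SKELETONS[kind] + "\n"


def check_well_formed(text):
    """Return None if the text parses, else a 'line:column message' string."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return f"{line}:{column} {err}"
    return None


for kind in SKELETONS:
    print(kind, check_well_formed(new_file(kind)))

# A missing </form> end tag is the kind of slip an editor would flag while you type.
print(check_well_formed("<vxml><form></vxml>"))
```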

In coordination with its .NET servers, Microsoft provides a free Speech Server 2004 Standard Edition 180-day evaluation CD containing Microsoft Speech Server and a full version of the Microsoft Speech Application SDK. The SDK contains a powerful extension to Visual Studio .NET for developing both telephony and multimodal applications. To download it, use the search engine on http://www.microsoft.com/ to locate the Microsoft Speech Application Software Development Kit (SASDK) Version 1.0. You can SALT-enable your Internet Explorer by downloading and installing the SALT plug-in. An open source version of SALT is available on the CMU Web site shown in Table 1.

Speech wars
In my "Technology Trends" column in the May/June 2003 issue of Speech Technology Magazine, I predicted: "Vendors will no longer boast 'our platform supports VoiceXML and yours doesn't' (the speech wars of yesteryear). Vendors will now do battle with a range of tools that make life easier for speech application developers." As the Microsoft .NET and J2EE camps do battle, the focus is indeed shifting from SALT vs. VoiceXML to tools for developing speech applications, with developers being the winners.

When the Apache and Eclipse open source tools are available, writing VoiceXML will become even easier. So easy that any Web developer will be able to churn out poor speech applications faster and more efficiently. While I can teach students how to code VoiceXML 2.0/2.1 in a few hours, I must spend the rest of the term teaching those students how to design good voice user interfaces.

Designing great voice user interfaces
The speech community needs to bridge the gap between great development tools and quality dialogue design. Let's not repeat the problems with the early Web, when thousands of developers used the "trial and error" approach to development, resulting in thousands of barely usable Web sites. The speech community needs to establish guidelines and best practices for designing quality voice user interfaces, as well as:

Provide examples of good designs for new developers to examine and mimic
Provide training, delivered in several forms including books, classroom courses, online tutorials, and workbooks
Embed accepted guidelines into new tools to generate good designs

It's time to elevate our discussions from which language to use to the real issue: how to design good voice user interfaces for their intended users.

Dr. James A. Larson is manager, advanced human input/output, Intel, and author of the home study course VoiceXMLGuide, http://www.vxmlguide.com. He can be reached at jim@larson-tech.com.
