
Building Speech Applications: Part One - The Application and its Development Infrastructure

A look at what it really takes to build a successful speech application

The design and deployment of a high quality speech application presents a unique challenge, requiring not only traditional systems analysis and software development skills but also the specialized skills of the speech scientist, human factors expert and business process analyst. In this issue we describe what a speech application is, what a designer’s primary considerations are, and the infrastructure in which a speech application is constructed. Part 2 will focus on details of the application development process, from requirements gathering through design, implementation using service creation tools, and deployment using alternative infrastructures.

The task of producing a quality speech application can be approached from the starting point of the traditional software life cycle. As defined by the Software Engineering Institute (http://www.sei.cmu.edu/), the traditional software development life cycle (SDLC) has at least the following activities:

  • identification of business needs and constraints
  • elicitation and collection of requirements
  • architecture design
  • detailed design
  • implementation
  • testing
  • deployment
  • maintenance

The trio of architecture design, detailed design and implementation, which has served as the core of application development for a generation, is an insufficient recipe for producing a speech application. Experience has shown that successful speech application development requires additional skills typically offered by non-traditional contributors such as speech scientists, linguists, human factors experts, and business analysts. Speech scientists and linguists address a speech application’s requirement for audio prompts expressed in the user’s colloquial native language, and manage its requirement for spoken language recognition. Human factors expertise guides the process of dialog design. Business process analysis clarifies the integration between the voice user interface and the back-end business activity at the application’s core. The following summary of the speech application production process focuses primarily on these new roles.

VUI design - persona, style, new vs. repeat users
A speech application is modeled as a conversation between two participants who share a single, well-defined goal. But just as you form an impression of your conversational partner during a conversation, speech application users will form an impression of the application, imputing human-like attitudes and behaviors to it. Thus an essential goal for a speech application designer is selecting the right “persona” for the application: tailoring the content of the prompts through careful choice of voice gender and age, and selecting a general demeanor appropriate to the application’s domain.

The designer must also select the appropriate dialog style for the application. In a directed dialog application the prompts precisely enumerate the user’s choices, restricting users to respond with one item from a list of words or short phrases. In a natural language dialog application, the user is given much more freedom through the use of general prompts such as “How may I help you?” The burden then falls on the application to recognize and ‘understand’ the wide variety of possible responses users are likely to produce. Mixed initiative applications permit users to respond with natural language within a tightly constrained domain, falling back to directed dialog if the user responds with partial answers rather than longer sentences containing complete answers. Finally, form filling, a variation of the directed dialog style, may be considered, in which the user verbally populates a form by responding to prompts that identify each field (e.g., “zip code”… “quantity desired”… etc.).
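For readers who want to see what a form-filling directed dialog looks like in markup, the following is a minimal sketch in VoiceXML 2.0; the field names, prompt wording and submit URL are illustrative assumptions rather than fragments of any real deployment.

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="order">
        <!-- Each field plays a short directed prompt and constrains the
             caller with a built-in grammar type. -->
        <field name="zipcode" type="digits?length=5">
          <prompt>Zip code?</prompt>
        </field>
        <field name="quantity" type="number">
          <prompt>Quantity desired?</prompt>
        </field>
        <!-- Once every field is filled, hand the values to the back end. -->
        <filled>
          <submit next="http://example.com/order" namelist="zipcode quantity"/>
        </filled>
      </form>
    </vxml>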

The style of an application designed for repeat users differs sharply from applications built for new users (or infrequent repeat users). Repeat user applications typically have terse, abbreviated prompts often containing domain-specific jargon. Grammars are designed to permit users to speak using the same domain-specific jargon, and built-in help is minimal or absent altogether. All of these attributes are seen as beneficial by users who want to conduct specific business rapidly and efficiently. By contrast, applications designed for new users have carefully worded, unambiguous prompts which avoid esoteric or domain-specific terminology wherever possible. ‘Help’ is included to guide the newcomer, and graded prompts (those that become more explicit on each consecutive repetition) are common.
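Graded prompts and built-in help map directly onto standard VoiceXML constructs. The sketch below is illustrative only (the field name and wording are hypothetical): the count attribute selects progressively more explicit prompts on successive attempts, and the help element supplies guidance for the newcomer.

    <field name="account" type="digits">
      <!-- First attempt: terse prompt suited to the experienced caller. -->
      <prompt count="1">Account number?</prompt>
      <!-- Later attempts become more explicit. -->
      <prompt count="2">Please say your account number, one digit at a time.</prompt>
      <help>Your account number is printed at the top of your monthly statement.</help>
      <noinput><reprompt/></noinput>
      <nomatch>Sorry, I didn't understand. <reprompt/></nomatch>
    </field>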

Prompt creation
Both art and mechanics are involved in the preparation of prompts. Much has been written on the art of prompt design: the necessity to carefully balance clarity of intent against excessive wordiness, precision against redundancy, and sterility of discourse against colloquialism and jargon. Once designers master the art of prompt composition, they are faced with the mechanical task of decomposing prompts into variable versus constant components and choosing whether to render prompts using text-to-speech (TTS) or audio recordings. The prompt decomposition mechanics are not independent of the application design since it is the application’s responsibility to reassemble complete prompts from fragments, where some fragments are only defined at runtime and must be retrieved from a database or back-end.

Developers typically use tools to manage the mechanics of prompt construction. The tools permit decomposition of the prompt into phrases, fragments, or words that are reassembled by the application at runtime. Mature tools permit easy recombination of fragments into sentences, then direct auditioning of various combinations to ensure seamless blending of the fragments. TTS technology has matured to the point where it can be used as the sole mechanism for prompt generation, or used just for dynamic information and blended seamlessly with pre-recorded static prompts. Thus a prompt such as “Would you like the medium item for five dollars or the large item for six dollars?” could be assembled from four pre-recorded carrier phrases (“Would you like the,” “for,” “or the,” and a second “for”) and four dynamically generated items (the two item names and the two prices) rendered using TTS. Application logic would control the assembly of the fragments.
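In VoiceXML terms, that assembly might be sketched as follows; the audio file names and variable names are assumptions, and the variables holding the item names and prices would be set earlier in the document or supplied by the server. The text inside each audio element serves as a TTS fallback if the recording cannot be fetched.

    <prompt>
      <audio src="would_you_like_the.wav">Would you like the</audio>
      <value expr="mediumItem"/>
      <audio src="for.wav">for</audio>
      <value expr="mediumPrice"/>
      <audio src="or_the.wav">or the</audio>
      <value expr="largeItem"/>
      <audio src="for.wav">for</audio>
      <value expr="largePrice"/>
    </prompt>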

Grammars vs. SLMs
The majority of speech applications deployed over the past decade use grammar-constrained recognizers; that is, for every turn of the dialog, the recognizer is re-armed with a fresh grammar constraining it to accept only those words and phrases with which we anticipate users will respond. Since the recognizer is constrained to recognize only the words and phrases present in the current grammar, recognition performance is no better than our ability to successfully anticipate users’ responses to the prompt they were given. Fortunately this restriction can be significantly reduced through the use of Statistical Language Models (SLMs), which are built from a large corpus of sample responses recorded live from users responding to a prompt. The samples in the corpus are statistically analyzed to produce the SLM, which is then used in place of a grammar to constrain the recognizer. The positive side of SLM use is that far greater variability in responses can be successfully recognized. The downside is expense: the necessary corpus typically requires thousands of samples, each of which must be transcribed in preparation for the statistical analysis needed to build the SLM.
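To make the constraint concrete, here is a minimal sketch of a per-turn grammar expressed in SRGS XML and attached inline to a VoiceXML field; the vocabulary is illustrative. A recognizer armed with this grammar will accept nothing but the three listed words, which is precisely the limitation an SLM is designed to relax.

    <field name="size">
      <prompt>Would you like a small, medium, or large?</prompt>
      <grammar mode="voice" version="1.0" root="size"
               xmlns="http://www.w3.org/2001/06/grammar">
        <rule id="size">
          <one-of>
            <item>small</item>
            <item>medium</item>
            <item>large</item>
          </one-of>
        </rule>
      </grammar>
    </field>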

Back-end integration
The ultimate purpose of nearly any speech application is to automate access to some information store — possibly in the form of a database, a legacy application, or a custom implementation of a collection of business rules. In any case, the voice user interface front end must somehow be integrated with the back end system. The versatility of the Internet client-server architecture supports a number of strategies.

CGI (Common Gateway Interface)
If the application is coded in a static markup language (VoiceXML, SALT), the time-proven CGI technique can be used, in which the client-side markup reaches back to the server to invoke procedures in a cgi-bin directory, written in any convenient programming language. The invoked procedures in turn interface with the legacy or back-end task.
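A minimal sketch of the client side of this arrangement follows; the CGI script path and field name are hypothetical. The static VoiceXML page collects a value, then submits it to a procedure in the cgi-bin directory, which queries the legacy system and returns the next VoiceXML document.

    <form id="balance">
      <field name="account" type="digits">
        <prompt>Please say your account number.</prompt>
        <filled>
          <!-- Post the recognized value back to the server-side procedure. -->
          <submit next="http://example.com/cgi-bin/balance.pl"
                  method="post" namelist="account"/>
        </filled>
      </field>
    </form>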

Server-side procedures
A more common implementation architecture is to execute code on the server at runtime (ASP, JSP, servlet) that communicates with the client using dynamically generated markup, and with the back-end or legacy process through an open API. JDBC or ODBC can be used for direct database interaction, or XML data can be exchanged with a LAN- or Web-connected target. A typical deployment architecture is illustrated in Figure 1.
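As a sketch of the dynamic alternative, a JSP page might render the VoiceXML at runtime with the result of a JDBC query already embedded; the variable name below is hypothetical and the database code is omitted.

    <%-- "balance" is assumed to have been retrieved via JDBC earlier in the page. --%>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form>
        <block>
          <prompt>Your current balance is <%= balance %> dollars.</prompt>
          <exit/>
        </block>
      </form>
    </vxml>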

Templates and reuse
Building successful speech applications “from scratch” is expensive. Dialog design, usability evaluation, prompt design, grammar development, back-end integration, testing and deployment are all costly, time-consuming activities. If each of these activities had to be performed in its entirety for every speech application, the cost of development would become the single greatest barrier to entry. But the speech industry has been alive for several decades, and it is increasingly likely that many of the components needed in a new application have already been built and tested. Component reuse is key to controlling the cost of application development. Major players such as Nuance and ScanSoft have been offering their clients reusable components for years to manage common activities such as address entry, credit-card entry, confirm-and-correct, speaker verification, and others.

In addition to reusable components, the industry has seen recent growth in the availability of application templates: nearly finished applications that need only vendor-specific customization to be ready to use. For example, ScanSoft provides templates for directory assistance, voice-activated dialing, auto-attendant and call center applications for healthcare and other industry verticals, such as utilities, hospitality and insurance. Vialto offers a customizable suite of components to provide intelligent, voice-driven access to Exchange. Nuance offers a packaged application to serve as a corporate greeter and call router. Similar templates for Corporate Voice Dialers (CVDs) are available from ScanSoft and Unisys.

The conclusion is obvious: before embarking on any new application creation project, search for templates, or at least for reusable components, to simplify your task. Don’t build from scratch when you can reuse earlier work, often for a fraction of the cost of new construction. As development of your new application progresses, architect it for future component reuse wherever possible by isolating potentially reusable functions so they can be packaged and retained for easy access by future developers.

Develop in Markup Language vs. Programming Language
Two fundamentally different approaches to speech application development are competing for popularity today. Both use an open standard markup language (VoiceXML, X+V, SALT) as the vehicle for passing instructions from the application server to the media gateway client. But one development approach favors direct generation of the application in the target markup language, while the other favors developing the application in a high-level language and then rendering the markup language dynamically at runtime. Directly coding an application in markup language tends to result in greater efficiency and better performance (faster response time), especially if server-to-client traffic uses the public Internet. On the other hand, applications coded in high-level languages (Java, C#, C++) that render markup dynamically at runtime tend to be easier to develop and maintain, and can even defer until runtime the decision of which markup to use. Users of the BeVocal Café and hosting service have shown considerable success with the direct markup coding approach, while users of tool suites such as those offered by Audium and Unisys are producing successful deployments using the latter approach.

Debugging environment
Testing and debugging a speech application each have their own unique complexities. Near the top of the list is usability testing, the process by which the usability of the voice user interface is evaluated. The primary focus of usability testing is to validate the clarity of prompts and the utterance coverage of grammars; in other words, to verify that the prompts elicit the desired understanding in users, and that whatever words or phrases they use in reply are understood by the computer. Usability testing can be time consuming and expensive if a working prototype of an application has to be actually deployed to perform the tests, especially if the result is the need to change the call flow, prompts, or grammars. A less expensive alternative is Wizard of Oz (WOZ) testing, in which the application designer or developer serves as a wizard, listening live to a caller’s utterances through a computer and using mouse clicks to steer the application through a graphic rendering of the call flow. The wizard thus serves as both the speech recognizer and the call flow processor for the application, yet creates an experience for the caller that is indistinguishable from interacting with a deployed application. WOZ testing can be used to collect usability data before committing time and resources to implementing the call flow and building the necessary recognizer grammars. Multiple iterations of design, test, revise, and retest can be completed inexpensively before actual deployment. The WOZ infrastructure can also serve as an effective method for collecting the acoustic samples required for SLMs.

Once development has proceeded to the prototype or pilot stage, different tools are required for testing. At this stage the new application is exposed to a wider user community, so trouble spots begin to appear here for the first time. Problems might consist of prompts that, despite careful WOZ testing, still confuse or mislead a significant number of users. Trouble might also result from a grammar that omits words or phrases used by an unexpectedly large proportion of users. Key to this analysis are quality analytic tools that can statistically summarize the data in application logs for all dialog states.

Packaging: canonical representation rendered at runtime
Although some developers prefer developing applications directly in markup language, many developers have had success in building and packaging applications as a mixture of new code, library modules and XML call flow summaries. Tools are used to facilitate design and packaging, resulting in a Web archive package for a given application containing Java class archives, XML call flows, prompts (audio files or text) and grammars. The Web archive is then moved to the application server where it is executed at runtime. Included in the library modules is a markup language rendering engine that processes a media gateway’s User Agent ID, then returns dynamically-generated SALT or VoiceXML appropriate to that gateway. Thus an application developer need not focus on target markup language at development time. Instead, tools are used to construct the application and build a canonical representation of its call flow which is bound only at run time to the target markup language.

Lessons learned: suggestions from experienced project managers and dialogue designers (By Dr. K. W. (Bill) Scholz)

  • VoiceXML and SALT are succeeding not because they are good development environments, but because they allow application development to be decoupled from the underlying hardware and/or underlying speech engine. Whatever happens with lower-level standards like VoiceXML and SALT, it’s likely that the voice user interface (VUI) market will continue to support many application vendors, each with their own form of higher-level development tools, because VUI requirements are highly specific to each company.

  • It’s essential to fully understand the underlying functionality of the business process being speech enabled BEFORE performing the dialog design. Attempts to overlap functional decomposition and dialog design are doomed to failure. Probe the customer to ensure the full functionality is known.

  • Educate the customer on the characteristics of speech and speech applications before falling prey to their advice on how the application should operate.

  • Rigorously document all assumptions rather than permitting them to remain implicit.

  • Dialog design should be done by a team working in close consultation with the client. The client should be walked through the final design to be sure the design truly reflects what the system is intended to do.

  • Wizard-of-Oz testing is absolutely necessary, particularly if you are developing a new user interface; theories on how users can best interact with the application should be tested before development is formally initiated.

  • Speech enablement (building a speech application for a customer with an existing related application) may entail moving “up” from an existing DTMF application or “down” from an existing call center application. Moving up from DTMF: the resulting speech application will take less telephone time and will result in greater user satisfaction. Moving down from call center: the resulting speech application will have a greater need for confirmation and error correction than the corresponding activity performed in conversation with a live agent, since the speech application lacks the benefit of understanding and context. Therefore the speech app may take considerably longer to complete than the equivalent transaction performed by a live call center agent.

  • For form-filling applications, directed dialog is most effective (more so than mixed initiative or natural language dialog). But if some fields on the form are optional, simple directed dialog is difficult; it is easier to present each field sequentially and let the user say either a value or ‘next’ (see the sketch following this list).

  • Establish a Customer Communication Plan. Schedule at least weekly meetings with client stakeholders and with the project team.
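The “value or next” pattern mentioned in the form-filling bullet above can be sketched in VoiceXML as follows; the field name and vocabulary are illustrative assumptions. Saying “next” fills the field with a sentinel value, so the form interpretation algorithm simply moves on to the next unfilled field.

    <field name="middleInitial">
      <prompt>Middle initial, or say next to skip.</prompt>
      <grammar mode="voice" version="1.0" root="answer"
               xmlns="http://www.w3.org/2001/06/grammar">
        <rule id="answer">
          <one-of>
            <item>next</item>
            <item><ruleref uri="#letter"/></item>
          </one-of>
        </rule>
        <rule id="letter">
          <one-of><item>a</item><item>b</item><item>c</item></one-of>
          <!-- remaining letters omitted for brevity -->
        </rule>
      </grammar>
      <filled>
        <if cond="middleInitial == 'next'">
          <!-- Record the skip and let the dialog continue. -->
          <assign name="middleInitial" expr="'skipped'"/>
        </if>
      </filled>
    </field>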

Conclusion
The development and deployment of a successful speech application requires part science, part art and part focus on software development fundamentals. As Mike Cohen pointed out in last month’s issue of STM, now that we have overcome skepticism about the capabilities of the technology, our primary focus has become the design of the voice user interface, with emphasis on conversational context, rigorous methodology and close interaction with the growing community of practitioners.

Part two of this article will focus on details of the application development process, from requirements gathering through design, implementation using service creation tools and deployment using alternative infrastructures.


Dr. K. W. (Bill) Scholz is architect director, voice and business mobilization solutions for Unisys Corporation. He can be reached at Bill.Scholz@unisys.com. Contributions to the article were made by Beata Chrulkiewicz, Raymond Diedrichs, Joe Oh, Richard Smith, Suzanne Taylor, Brough Turner and Alan Weiman.
