Speech in My Pocket

Smartphones are pocket-size computers bundled with a telephone and many other gadgets, such as a camera, GPS device, MP3 player, and compass. Thousands of applications have been created for the market’s three major smartphones, extending their functionality and usefulness. In addition, smartphones are replacing many landline and cell phones. 

Smartphone users listen to and read prompts, menus, and messages, and enter information by clicking, typing, and speaking. They choose their information entry mode based on environmental conditions: In cold environments they typically speak rather than type, and in noisy environments they typically type rather than speak. Users driving cars typically speak and listen rather than type and read. Many users prefer speaking to typing on the mini-keyboards available on smartphones. When an automatic speech recognition (ASR) system fails to recognize a spoken word, the smartphone can display the candidate words that most closely match what the user said; the user can then click the correct word rather than type or speak the correction. 
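The click-to-correct behavior described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's API: the `Hypothesis` class, the `resolve_utterance` function, the 0.8 acceptance threshold, and the limit of three displayed candidates are all assumptions made for the example. The recognizer is assumed to return several hypotheses, each with a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hypothesis:
    """One recognition result from a hypothetical ASR engine."""
    text: str
    confidence: float  # assumed to range from 0.0 to 1.0


def resolve_utterance(
    hypotheses: List[Hypothesis],
    accept_threshold: float = 0.8,  # assumed value, tune per application
    max_candidates: int = 3,        # assumed number of words that fit on screen
) -> Tuple[Optional[str], List[str]]:
    """Return (accepted_text, candidates_to_display).

    If the best hypothesis is confident enough, accept it and show nothing.
    Otherwise accept nothing and return the closest matches so the user
    can click one instead of typing or repeating the word.
    """
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    if ranked and ranked[0].confidence >= accept_threshold:
        return ranked[0].text, []
    return None, [h.text for h in ranked[:max_candidates]]
```

With this sketch, a confident result is accepted silently, while an uncertain one produces a short list that the smartphone can show as clickable choices.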

Smartphones’ increasing popularity is changing the way many users access interactive voice response (IVR) systems. IVR system vendors can take advantage of the rich user input/output methods provided on smartphones by using the following strategies:

1. Use existing IVR speech servers. Smartphone users call an IVR server and interact solely via voice, just as they do with cell phones and other mobile devices. The voice server executes a dialogue manager (often written using VoiceXML), which then listens to and speaks with the caller using speech recognition, touch-tone, speech synthesis, and prerecorded verbal messages. 

2. Transmit video and images to the smartphone. Enabling callers to see video and graphics in addition to listening to audio greatly extends the usefulness of IVR systems. Callers can see how to assemble, use, and troubleshoot equipment they just purchased, and review illustrations that clarify audio messages. Updates to the application are implemented on a server and automatically made available to users. Some VoiceXML browsers, such as Genesys and Voice Pilot, already transmit video in addition to recorded and synthesized voice. VoiceXML 3.0 also will support images and audio when it becomes an industry standard. 

3. Transmit text to the smartphone. If the VoiceXML browser is extended to send textual prompts and messages to smartphones, callers can read prompts on their smartphone screens as well as listen to them through their smartphone speakers. Users can continue to use existing VoiceXML 2.0/2.1 applications by listening to and reading prompts and responding by clicking, typing, or speaking. 

4. Allow users to download applications to their smartphones and execute them there, rather than execute IVR applications on a server. Smartphone applications are usually written in HTML and scripting languages. The World Wide Web Consortium is working on an extension to HTML that will support ASR and text-to-speech (TTS). Because many of the less-expensive smartphones will not support ASR and TTS functions, server-based ASR and TTS engines can be used instead. A disadvantage of this approach is that users must download the latest version of the application to take advantage of updates and changes. Another problem is the need to create multiple versions of the application, one for each smartphone operating system. Application frameworks, such as Rhomobile, Appcelerator, Ansca, and Nitobi, could enable developers to author an application once and distribute it on all of the major smartphone operating systems.

5. Move the dialogue manager to the smartphone. Vendors should provide a lean and mean dialogue manager that executes on smartphones. This application downloads dialogue instructions from the server and uses the smartphone’s ASR and TTS, if available; otherwise, the application uses the server’s ASR and TTS functions. With this approach, users need to download only a single application while accessing up-to-date instructions and data from the server. 
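The fifth strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing product: the `LeanDialogueManager` class, the callable-based engine interfaces, and the instruction format (a list of steps, each with a `prompt` and a `field`) are all hypothetical. The point is the selection logic: the application uses the smartphone's own ASR and TTS when they are present and falls back to the server's engines otherwise, while the dialogue instructions always come from the server.

```python
from typing import Callable, Dict, List, Optional

Speak = Callable[[str], None]  # TTS: renders one prompt to the user
Listen = Callable[[], str]     # ASR: returns the recognized reply
Instructions = List[Dict[str, str]]  # hypothetical format: prompt + field name


class LeanDialogueManager:
    """Hypothetical on-device dialogue manager with server fallback."""

    def __init__(
        self,
        fetch_instructions: Callable[[], Instructions],
        server_tts: Speak,
        server_asr: Listen,
        device_tts: Optional[Speak] = None,
        device_asr: Optional[Listen] = None,
    ) -> None:
        self.fetch_instructions = fetch_instructions
        # Prefer the smartphone's engines; otherwise use the server's.
        self.tts = device_tts if device_tts is not None else server_tts
        self.asr = device_asr if device_asr is not None else server_asr

    def run(self) -> Dict[str, str]:
        """Download the current dialogue and walk through it step by step."""
        answers: Dict[str, str] = {}
        for step in self.fetch_instructions():
            self.tts(step["prompt"])
            answers[step["field"]] = self.asr()
        return answers
```

Because the instructions are fetched each time `run` is called, a change made on the server reaches users without a new download of the application itself, which is the advantage this strategy has over the fourth.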

Developing smartphone applications is like living in the Old West. Chaos reigns and few industry standards exist. I’m betting that eventually smartphones will contain their own dialogue managers. These systems will be able to execute mashup applications that access a variety of Web services. When that happens, it will make the smartphone useful for traditional IVR applications, including customer support, financial transactions, and voice search. 

James Larson, Ph.D., is a speech applications consultant, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, co-program chair for SpeechTEK 2011, and program chair for SpeechTEK Europe 2011. He can be reached at jim@larson-tech.com.
