Can Speech Improve Cell Phone Interfaces?

Interactive Voice Response (IVR) systems are everywhere. For the past decade, consumers have dialed regularly into computer systems to obtain automated bank, portfolio, and account information, airline schedules, movie times, and the like. We have become so accustomed to using IVRs that, as with answering machines, we may no longer be aware of whether or not we like them. They are useful and relatively easy to use, but above all they are so prevalent that we frequently have no choice but to use them. By contrast, automatic speech recognition (ASR) systems are only beginning to make a comparable impact on our daily lives. Automatic dictation systems, designed for individual users, have matured technologically faster than voice systems designed for use by the general public. The majority of the population has had limited opportunity to interact with computers and devices by voice, save for saying "one" instead of pressing the digit on their telephone keypads.
When I started designing voice system user interfaces (UIs) for GTE Labs in the mid-eighties, clients and product managers were excited about ASR, but somewhat leery of it, particularly in terms of customer acceptance and expectations. My colleagues and I designed, evaluated, and tried to improve any number of IVR application UIs while we waited for our clients to commit to offering ASR applications. Like all voice application designers, we were confident that ASR would offer a more natural form of communication, would be more efficient for completing transactions, and might even provide callers with a more pleasing interaction than IVR could. And so, like many voice UI designers, we faithfully conducted ASR-related research, including conversational analysis and human-machine dialog design. More than a decade has passed, and it is arguable whether speech recognition technology has achieved mainstream status.
When asked their opinions, each of the industry leaders profiled in previous issues of this magazine identified exciting improvements in speech algorithms, natural language processing, vocabulary management, and language modeling, but still conceded that ASR's mainstream status is still emerging. Most people agree that the technology is improving, if more slowly than some of us had hoped. Magazines report advances in speech technology regularly. But the problem of consumer expectation remains. UI designers approach products by analyzing the users' mental model of a product, that is, what they think a product can do and how they can make it do what they need it to do. In the case of speech technology, consumers have always had unrealistic expectations of what controlling or interacting with products by voice can accomplish. It doesn't make sense to consumers that a computer "understands" only a few words and sentences. The important point is that user expectations, even unrealistic ones, have to be reflected in the design of a user interface for the product to work as intended. The fact is that many speech recognition systems have limited vocabularies and grammars. Often, when people begin to interact with conversational systems, they tend to be conversational in return. Consequently, a good speech interface is one that will steer them into providing only responses that the system can recognize. Understanding what speech recognition technology can provide today, as well as its limitations, I am enthusiastic about its potential for improving current product interfaces. The remainder of this article focuses on issues related to the design of voice interfaces for one class of products, cellular phones.

Functionality
Cell phone interfaces aren't keeping pace with their functionality. Although cellular phones look like the telephones we have been using our whole lives, they provide many capabilities besides placing and receiving calls. They can be programmed for advanced calling features such as caller ID, call waiting, last number dialed, and call forwarding. Cell phones can contain call timers, ringer options, volume controls, and display options. Some incorporate paging and other non-telephone-like functions. And finally, cell phones can hold personal phone directories containing hundreds of names and phone numbers. In recent usability evaluations, I have observed scores of people using almost every kind and brand of cellular phone. Consequently, I have developed a pretty good sense of which aspects of their usage are easy and which are difficult. In general, the functions that people know how to perform with a regular phone, such as placing and receiving calls, are easy to do. By contrast, using features that involve programming (i.e., following prescribed sequences of steps) seems to be quite difficult. Cell phone interfaces require users to customize options, to input names and phone numbers into the phone's memory, and to change and retrieve stored information using nine keypad buttons and a few oddly labeled function keys. To make matters worse, because phone displays are small (and getting smaller), text instructions and feedback are so abbreviated that they are often less than helpful.

Can voice help?
It is clear to me that programming or customizing these features could do with a better user interface than a cell phone's tiny keypad and telegraphic display prompts. As an alternative, might a voice interface provide a smoother user interaction? My answer depends on what type of task the user is trying to accomplish. In the examples below, I argue that different user tasks call for different interaction mechanics.
First, consider what is involved in programming features that have a small, discrete set of options, such as changing the phone's ringer alert from ring to vibrate mode. For these kinds of tasks, voice prompts might well guide the user through the sequence of steps more smoothly than scrolling through menu options. Now, consider the user's response to a spoken voice prompt. Would it be easier to press buttons or to use spoken commands to select among options? Note that the ASR solution requires speech-enabling the cell phone. For toggling features on and off, this may be a "nice-to-have" feature, but it may be more trouble than it's worth, given that the biggest challenge in designing even a simple speech dialog is constraining user input. IVR may offer a simpler solution. IVR requires users to translate a natural response (e.g., "yes") to a less natural one ("press one"). In that sense IVR seems less desirable than speaking the word "yes," but as long as the number of allowable user responses is small, the interaction is practicable.
Next, consider phone list management, which is currently very unwieldy using a cell phone interface. The user is prompted to type in names and phone numbers by following [abbreviated] text instructions on the phone's [small] display. Inputting a single name and associated phone number into a phone's memory requires the user to complete a sequence of four to ten steps flawlessly, typing in the information using nine very small buttons and function keys. (I am not a fan of using a phone keypad as a typewriter.) Finally, because users can access names and phone numbers most efficiently using speed dialing sequences, they have to associate each phone entry with a speed dial location and remember which is which. Additionally, although cell phones have the capacity to hold tens or even hundreds of directory entries, the average person programs only a few frequently dialed numbers into the phone's memory. Contrast this with personal phone books and personal digital assistants that contain all the numbers a person might ever need to access and dial, not just the five or ten most frequently used. Moreover, their entries are organized by categories that are meaningful to the owner (e.g., last name, first name, company name) as opposed to numbered phone memory locations. Finally, phone lists should be dynamic, in that it should be easy to add, change, or delete entries. While it is certainly possible to program a comprehensive phone list into a cell phone's directory, I have not observed that this is either easily or frequently accomplished.

Phone vs. voice
A class of communication services is starting to appear that may ease the management of phone lists and, at the same time, bypass some of the interaction problems I've described. These are services that let users program and store directory information onto a central server, a computer, or another device, rather than into the phone itself. The speech recognition software is also located remotely, eliminating the need to speech-enable the phone. Different systems will offer different interaction methods for inputting directory information. In one type, a user will input his or her information by voice. An automated speech recognition attendant will control the interaction by prompting the user for certain information and responding conversationally to the user's requests. A second type of service allows the user to type directory information into fields on a personal digital assistant, in software, or on an Internet page. Text-to-speech technology, along with name language modeling, translates the typed information into a spoken name. (Note that getting a speech system to pronounce typed names correctly is a complicated problem.) In any case, once the directory information is stored, the user calls the access number and interacts by voice with the automated assistant to place calls to individuals listed in his or her personal directory. The conversational nature of the interface promises a natural, easy, and pleasant solution to list data management problems. I have thought for some time that advanced cellular phone technology and capabilities have outpaced their push-button, display-driven user interfaces. The current phone interface is not flexible enough for the customizing and programming that make wireless phones so valuable. There are a variety of features, services, and functions that wireless phones offer people that are simply too hard to set up and manage. I think that voice interfaces have the potential to ease the burden on telephone interfaces.
There is now greater choice in how to offer voice capability, i.e., to speech-enable the phone, or to use the phone as an interface to speech services. The latter seems particularly promising for offering the natural, efficient, easy-to-use, and pleasant interaction that UI designers have been promising all along.
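To make the second option concrete, here is a minimal sketch in Python of a network-hosted directory that is dialed by name rather than by speed dial location. It is an illustration under stated assumptions, not a description of any actual service: the class `VoiceDirectory`, its methods, and the sample number are hypothetical, and the speech recognizer is assumed to have already converted the caller's utterance into text. The point it shows is the one made earlier about constraining input: only the form "call <name>" is accepted, and anything else produces a reprompt that tells the caller exactly what to say.

```python
class VoiceDirectory:
    """Hypothetical server-side phone list, keyed by spoken name (not by memory slot)."""

    def __init__(self):
        # Maps a lowercased name to (display name, phone number).
        self.entries = {}

    def add(self, name, number):
        """Store or overwrite an entry; lookups ignore case and extra spaces."""
        key = " ".join(name.lower().split())
        self.entries[key] = (name.strip(), number)

    def handle(self, utterance):
        """Respond to recognized text; only 'call <name>' is in the grammar."""
        words = utterance.lower().split()
        if len(words) >= 2 and words[0] == "call":
            key = " ".join(words[1:])
            entry = self.entries.get(key)
            if entry is not None:
                display, number = entry
                return f"Calling {display} at {number}."
            # In-grammar request, but the name is not in the directory.
            return (f"I don't have an entry for {key}. "
                    "Please say 'call' and a name from your directory.")
        # Out-of-grammar input: steer the caller back to the allowed form.
        return "Please say 'call' followed by a name."


directory = VoiceDirectory()
directory.add("Pat Lee", "555-0100")  # made-up entry for illustration
print(directory.handle("Call Pat Lee"))
print(directory.handle("um, I need to reach Pat"))
```

A real service would sit behind a recognizer with its own grammar and confidence handling; the sketch only illustrates why a name-keyed list with a tightly worded reprompt asks less of the user's memory than numbered speed dial locations do.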
Lydia E. Volaitis is Principal Research Scientist at the American Institutes for Research, 490 Virginia Road, Concord, MA 01742.