Why Tap When You Can Talk?

Editor's Note: This is an exclusive excerpt from the book Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, released September 1. Some content has been edited for space and style.

As mobile devices become more capable, the user interface is the last remaining barrier to the scope of applications and services that can be made available to users of these devices. 

Clearly, speech has an important role to play in creating better interfaces on small mobile devices. Not only is speech the most natural form of communication for people, it is the only mobile device interface that is not constrained by form factor. Even so, speech interfaces should be developed in conjunction with other modalities. People can speak much faster than they can type (especially on a small mobile device), but they can read much faster than they can listen.

While there are many situations where it would be safer and more convenient to speak and listen, there are also situations where it would be inappropriate to use a speech interface at all. So the overall mobile user interface needs to support speech along with other modalities of input and output, and allow the user to freely and easily switch between modalities, depending on preferences and situations.

While the benefits are apparent, creating usable speech interfaces presents significant challenges from technological and human factors perspectives. This is why, notwithstanding significant investment in this technology during the past few decades, speech interfaces have remained constrained in both their functionality and market success.

Existing Speech Interfaces

While there has been ongoing research on fully natural spoken dialogue systems for many years, designers of most successfully deployed speech interfaces have taken the approach of tightly constraining what can be said and then constructing the application around these constraints.

These highly constrained systems tend to employ a distinctive formula:

  1. Constrain the speech recognition task as much as possible; 
  2. Construct the application around those constraints; and
  3. Require integration of semantic meaning with speech recognition. 

But, despite the market success of these systems, users tend not to like them. Users perceive them as inflexible and error-prone, even in cases where the system is eventually able to satisfy the task for which it was designed. 

Although significant, these issues could be overcome if the goal were simply to allow the use of speech interfaces for a few key applications. Users can eventually learn to successfully interact with such applications if they are sufficiently motivated to use the speech interface. However, if the goal is to support the full range of applications that users might have on their phones, we cannot expect users to learn what they can speak into every state of each application.

In addition to these key usability issues, mobile application providers don’t want to construct applications around the constraints of a speech recognizer. Even if the intent were there, the vast majority of mobile application developers do not have the relevant domain expertise or available resources to handle the grammar development, speech user interface design, and ongoing recognition and grammar-tuning activities required.

Natural Language Dialogue Approaches

The field of speech technology has witnessed two challenging decades of work in combining speech recognition with natural language dialogue processing to create automated systems that can communicate in a more human-like manner. Obviously, if we could really achieve the goal of creating automated systems that have human-level spoken dialogue skills across a broad range of domains, this could be used to create highly functional user interfaces. Unfortunately, if we get only part of the way to human-level performance, the interfaces become even harder for people to use than the more constrained interfaces. There are two key reasons for this: boundary-finding and efficiency.

For any user interface to succeed, users need to have a mental model of what the system can and cannot do. Simple constrained systems can make this obvious to users by asking very specific questions (What city?) or by telling them their choices (say either “send,” “delete,” or “forward”). On the other extreme, if we could make a system understand everything a person understands, users could learn that they can talk to the machine in the same way they speak to another person.

The problem is that if you make the system understand much of what a human can understand, but not everything, how do you make this apparent to the user? How can the user know the boundaries of what the system can and cannot understand? This is not just limited to the words and sentences the system can interpret, but extends to dialogue constructs as well. That is, when people talk with other people, they don’t just respond to individual utterances—they use a tremendous amount of shared knowledge about the current interaction, state of knowledge of the other party, and knowledge of the world. Unless you can fully simulate this in an automated dialogue, how can you give the user a reasonable model of what the automated system can and cannot handle?

In the absence of the deep, contextual information present in human-to-human communication, natural language dialogue systems are necessarily inefficient. While this can create an annoyance for search tasks, it becomes completely untenable when applied to all but the simplest messaging tasks, particularly when we consider the fact that a user might need to correct some of what is recognized by the system. 

Unconstrained Mobile Speech Interfaces

In our work at Vlingo, we have been making use of these principles in designing a broad speech-driven interface for mobile devices. Rather than building constrained speech-specific applications or attempting to make use of more complex natural language dialogue approaches, we have been working to create a simple but broad interface that can be used across any application on a mobile device.

Our efforts to create a simple, transparent model for users have resulted in the following product principles:

Provide multimodal feedback throughout the recognition process: Using a combination of graphical and audio feedback, we let a Vlingo user know what action will be taken. A good speech recognition system provides a combination of tactile, auditory, and visual cues to keep users informed of what is happening. When the user first presses the voice key, we display a listening pop-up and reinforce the display with a vibration or ascending tone. When we finish recording and begin processing audio, we change both the wording and color of the on-screen display and play either another vibration or a descending tone. This feedback is useful on all platforms, but especially on touchscreen devices where users do not have the immediate haptic feedback of feeling a physical key depress and release. When the user’s recognition results are available, we play a success tone and provide text-to-speech (TTS)  confirmation. TTS allows the user to confirm without glancing at the screen that we understood his intention. In this way, a user who might be multitasking is alerted to return his attention to the task at hand. Finally, in cases such as auto-dialing where we are about to initiate a significant action, we show a temporary confirmation dialogue, play a variation of the success tone, and again use TTS to confirm the action we are about to take. When we have correctly understood the intended contact, the pop-up, tone, and TTS provide assurance; in the case of misrecognition, the multimodal feedback calls the user’s attention to the problem so he can correct his entry before we initiate the action.
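
As a rough illustration of this feedback cycle, the sketch below pairs each stage of recognition with the visual, haptic, and audio cues described above. The cue functions are simple print-based stand-ins for hypothetical platform APIs (pop-up display, vibration, tone playback, text-to-speech), not Vlingo's actual client code.

```python
# Sketch of the multimodal feedback cycle. The cue functions below are
# placeholders for real platform APIs (pop-ups, haptics, audio, TTS).

def show_popup(text: str) -> None:
    print(f"[popup] {text}")

def vibrate() -> None:
    print("[haptic] vibrate")

def play_tone(name: str) -> None:
    print(f"[audio] {name} tone")

def speak(text: str) -> None:
    print(f"[tts] {text}")

def on_voice_key_pressed() -> None:
    show_popup("Listening...")
    vibrate()                      # reinforces the pop-up on touchscreens
    play_tone("ascending")

def on_recording_finished() -> None:
    show_popup("Working...")       # new wording (and, on a device, a new color)
    play_tone("descending")

def on_results_ready(transcript: str) -> None:
    play_tone("success")
    speak(transcript)              # lets the user confirm without looking

def on_significant_action(action: str, target: str) -> None:
    # e.g. auto-dialing: confirm before acting so a misrecognition can be caught
    show_popup(f"About to {action} {target}")
    play_tone("success-variant")
    speak(f"{action} {target}")
```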

Show the user what was heard and understood: When a user speaks, we show him what was recognized, how the system interpreted that speech, and what action will be performed as a result. Traditional IVR-descended mobile applications show how the system interpreted speech, but do not show exactly what the system heard. This can cause confusion in cases where misrecognition causes an unexpected and undesirable action to be taken. Additionally, there are cases where the user’s speech is correctly recognized by the system, but, for various reasons, the search engine provides unexpected results. Here, again, by showing the exact words that Vlingo heard, we help the user realize how to proceed. If the words were correctly recognized, but the search engine does not return the desired results, it is clear that the problem is not one of speech recognition. Speaking the same words again will not help; rather, the user needs to modify the terms of the search.

Allow the user to edit the results: When faced with speech recognition results, a user can perceive the results as correct, incorrect, or almost correct. Rather than repeating the task in the almost-correct case, he might choose to build on his previous effort by correcting the results. Depending on the tasks and the user’s level of expertise, he might choose to correct recognition errors by speaking again, editing by typing, selecting from alternate speech-recognition results, or some combination of these methods. These mechanisms are mainly used to correct speech recognition errors, but they also handle cases where a user makes mistakes or changes what he wants to do.

Preserve other input modalities: As a corollary to the principle above, speech is one way for a user to provide input, though he should be able to use other input mechanisms in cases where speech recognition is not practical or working well. Traditional IVR-descended applications require the user to speak again in the case of misrecognition, as opposed to Vlingo’s model of displaying results in an editable text field. In the traditional model, the user can lose confidence: Why trust the system will correctly understand a reutterance if the first attempt was not successful? If a second attempt is also unsuccessful, the user might abandon the task or application, deciding it is easier to type. Our model provides recognition results in a fully editable text field, allowing the user to correct errors in the mode he prefers: speaking again, choosing from a list of options, or using the familiar keypad interaction. This correction ability, particularly when paired with an adaptive loop that enables Vlingo to learn from successes and errors, increases user confidence in a voice-based system.

Allow the user to add to results: When composing a message, a user often needs time to gather his thoughts. It is relatively common for the user in attempting to speak a text message to start the beginning of his message, pause for a few seconds while he decides what else to say, and then complete his dictation. It is also common for the user to reread what he has dictated and decide he wants to say more. The multimodal nature of our approach makes this use case easy to support. For messaging tasks, we place the cursor at the end of the recognized text. Once the user sees what we recognized, he can instantly initiate new recognitions and append text to his message.

Give the user explicit control over actions taken: The action that is taken depends on the state of the mobile phone. In the case where there is currently an application in the foreground, the action taken is very simple: We fill the current text field with whatever the user just spoke. However, we believe that speech interfaces can serve a function beyond simply replacing the keyboard; they can also be used to help a user navigate through the various applications available on his phone. Thus, in addition to acting like a keyboard to allow the user to fill text fields, we also handle high-level application routing.
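
A minimal sketch of this two-level behavior follows: if a text field currently has focus, the recognized words simply fill it; otherwise the utterance is treated as top-level application routing. The FocusedField class and ROUTING_KEYWORDS table are illustrative placeholders, not Vlingo's actual API.

```python
from typing import Optional

class FocusedField:
    """Stand-in for whatever text field currently has focus on the device."""
    def __init__(self) -> None:
        self.text = ""

    def append(self, words: str) -> None:
        self.text = (self.text + " " + words).strip()

# Illustrative mapping from canonical routing prefixes to applications.
ROUTING_KEYWORDS = {
    "web search": "browser",
    "navigate to": "navigation",
    "send message to": "messaging",
    "call": "dialer",
}

def dispatch(utterance: str, focused_field: Optional[FocusedField]) -> str:
    if focused_field is not None:
        # Act like a keyboard: fill whatever field currently has focus.
        focused_field.append(utterance)
        return "filled current text field"
    for prefix, app in ROUTING_KEYWORDS.items():
        if utterance.lower().startswith(prefix):
            content = utterance[len(prefix):].strip()
            return f"open {app} with '{content}'"
    return "show disambiguation choices to the user"

print(dispatch("web search restaurants in Cambridge", None))
```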

Ensure the user is aware of the system’s adaptive nature: The most common user complaint about any speech recognition system is that it is not accurate enough. However, Vlingo includes an adaptive loop based on acoustic and language characteristics of the user and of all speakers of the user’s language. This component continuously improves the models of the system based on the ongoing usage—including corrections that a user makes to the system’s responses. Not surprisingly, a user who is dissatisfied with recognition results is significantly more patient and more likely to continue to use Vlingo if he is told that speech recognition improves over time.

Technology for Unconstrained Speech Input

A key enabler of this style of speech input is eliminating the need for application-constrained speech input. If we had to restrict users to particular words and phrases, we would not be able to provide the simple model we described above.

This, of course, presents a challenge since speech recognition on truly unconstrained input is not practical. We instead need to use modeling and adaptation techniques to achieve something close to this. 

In particular, we have been successful in creating these interfaces using a set of techniques:

Hierarchical language model-based speech recognition: We have replaced constrained grammars with very large-vocabulary hierarchical language models (HLMs) based on well-defined statistical models to predict what users are likely to say given the words they have spoken so far (let’s meet at ___ is likely to be followed by something like 1 p.m. or the name of a place). Unlike previous generations of statistical language models, the new HLM technology scales to tasks requiring the modeling of millions of possible words (such as open Web search, directory assistance, navigation, or other tasks where users are likely to use any of a very large number of words). 
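
To make the idea concrete, here is a toy class-based model in the spirit of this hierarchical approach (not Vlingo's actual implementation): a top-level bigram predicts either ordinary words or class tokens such as <TIME> and <PLACE>, and separate sub-models spread probability over the potentially enormous membership of each class. The probability tables are invented purely for illustration.

```python
import math

# Top-level model over words and class tokens (illustrative values).
TOP_LEVEL_BIGRAMS = {
    ("meet", "at"): 0.4,
    ("at", "<TIME>"): 0.5,
    ("at", "<PLACE>"): 0.3,
}

# Sub-models that expand each class; in practice these cover all times,
# or millions of place and business names.
CLASS_MODELS = {
    "<TIME>": {"1 p.m.": 0.1, "noon": 0.05},
    "<PLACE>": {"harvard square": 1e-5},
}

def score(prev_word: str, token: str) -> float:
    """Log-probability of `token` following `prev_word`."""
    if token in CLASS_MODELS:
        raise ValueError("score concrete words, not class tokens")
    best = float("-inf")
    # Case 1: the token is predicted directly by the top-level model.
    direct = TOP_LEVEL_BIGRAMS.get((prev_word, token))
    if direct:
        best = max(best, math.log(direct))
    # Case 2: the token is generated through a class the top-level model predicts.
    for cls, members in CLASS_MODELS.items():
        p_class = TOP_LEVEL_BIGRAMS.get((prev_word, cls), 0.0)
        p_word = members.get(token, 0.0)
        if p_class and p_word:
            best = max(best, math.log(p_class * p_word))
    return best

print(score("at", "noon"))             # reached via the <TIME> class
print(score("at", "harvard square"))   # reached via the <PLACE> class
```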

Adaptation: To achieve high accuracy, we make use of significant amounts of automatic adaptation. In addition to adapting the HLMs, the system adapts to many user and application attributes, such as learning the speech patterns of individuals and groups of users, learning new words, learning which words are more likely to be spoken into a particular application or by a particular user, learning pronunciations of words based on usage, and learning accents. 
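
One simple form such adaptation can take is interpolating statistics observed for an individual user with the shared global model, so that words the user actually speaks (including words recovered through corrections) become more likely over time. The sketch below illustrates this idea with a unigram model and an arbitrary interpolation weight; it is a simplification, not a description of Vlingo's adaptation machinery.

```python
from collections import Counter

class AdaptiveUnigram:
    """Interpolates a user-specific word model with a shared global model."""

    def __init__(self, global_counts: Counter, user_weight: float = 0.3) -> None:
        self.global_counts = global_counts
        self.user_counts: Counter = Counter()
        self.user_weight = user_weight   # illustrative interpolation weight

    def observe(self, accepted_text: str) -> None:
        # Called whenever the user accepts (or corrects) a recognition result.
        self.user_counts.update(accepted_text.lower().split())

    def prob(self, word: str) -> float:
        g_total = sum(self.global_counts.values()) or 1
        u_total = sum(self.user_counts.values()) or 1
        p_global = self.global_counts[word] / g_total
        p_user = self.user_counts[word] / u_total
        return (1 - self.user_weight) * p_global + self.user_weight * p_user

model = AdaptiveUnigram(Counter({"meeting": 50, "dunster": 1, "lunch": 30}))
model.observe("meet me at dunster street")   # the user's corrected dictation
print(model.prob("dunster"))                 # higher than the global model alone
```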

Server-side processing: The Vlingo deployment architecture uses a small amount of software (between 50 kilobytes and 90 kilobytes, depending on platform) on the mobile device for handling audio capture and the user interface. This client software communicates over the mobile data network to a set of servers, which run the bulk of the speech processing. 
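
The thin-client split can be pictured with a sketch like the one below, in which the device-side code does little more than ship captured audio to a recognition server and read back the transcript. The endpoint URL, headers, and JSON response format are hypothetical placeholders, not Vlingo's actual protocol.

```python
import json
import urllib.request

RECOGNIZER_URL = "https://speech.example.com/recognize"   # placeholder endpoint

def recognize(audio_bytes: bytes, language: str = "en-US") -> str:
    """Send captured audio to the server and return the recognized text."""
    request = urllib.request.Request(
        RECOGNIZER_URL,
        data=audio_bytes,
        headers={
            "Content-Type": "audio/wav",
            "Accept-Language": language,
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        result = json.load(response)
    # Assumed response shape: {"transcript": "...", "alternates": [...]}.
    return result["transcript"]
```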

Correction interface: We have designed the user interface to allow the user to freely mix keypad and speech entry and to correct the words coming back from the speech recognizer. Users can navigate through alternate choices from the speech recognizer (using the navigation buttons), delete words or characters, type or speak over any selected word, and type or speak to insert or append new text wherever the cursor is positioned. We think this correction interface is the key to allowing users to feel confident that they can efficiently enter any arbitrary text through the combination of speech and keypad entry.
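
A data structure supporting these correction operations might look roughly like the sketch below, where each recognized word carries its alternate hypotheses and the user can select an alternate, type over a word, or insert new typed or spoken words at the cursor. The class names and operations are an illustration of the idea, not Vlingo's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecognizedWord:
    text: str
    alternates: List[str] = field(default_factory=list)

@dataclass
class EditableResult:
    words: List[RecognizedWord]
    cursor: int = 0   # index where new text is inserted

    def select_alternate(self, index: int, choice: int) -> None:
        # Swap the displayed word with one of the recognizer's alternates.
        word = self.words[index]
        word.text, word.alternates[choice] = word.alternates[choice], word.text

    def type_over(self, index: int, new_text: str) -> None:
        self.words[index] = RecognizedWord(new_text)

    def insert(self, new_words: List[str]) -> None:
        # New typed or spoken words go in at the cursor position.
        for offset, text in enumerate(new_words):
            self.words.insert(self.cursor + offset, RecognizedWord(text))
        self.cursor += len(new_words)

    def text(self) -> str:
        return " ".join(w.text for w in self.words)

result = EditableResult([RecognizedWord("call", ["dial"]),
                         RecognizedWord("john", ["joan", "juan"])])
result.select_alternate(1, 1)   # the user meant "juan"
print(result.text())            # "call juan"
```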

As an example of the effects of adaptation, when Vlingo launched, even initial users experienced high success rates of 82 percent, which grew to more than 90 percent during the subsequent 15 weeks. This significant improvement is due to a combination of accuracy gains from adapting to usage data, repeat usage that is more focused on real tasks instead of experimenting, and users learning to interact with the system more effectively.

Technology for Mapping to Actions

The other main technology component is to take word strings from users and map them to actions, such as in the case where users are speaking a top-level input (such as send message to…).

Our goal is to do this in a very broad way, allowing users to say whatever they want and then finding some appropriate action to take based on that input. Because we want this to be broad, it needs to be shallow. It is reasonable for the speech interface to determine which application is best suited to handle the input.

For this “intent modeling,” we again use statistical techniques, developing models that map input word strings to actions. We seed these models to a reasonable starting point using knowledge of the domain, and then adapt them to real input based on usage. 
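
As a rough illustration of this seed-then-adapt approach, the sketch below keeps per-action word counts that are initialized from domain knowledge, updated from observed usage, and used to route an utterance to the best-scoring action. The seed words, smoothing, and vocabulary-size constant are arbitrary choices for the sketch, not Vlingo's models.

```python
import math
from collections import Counter, defaultdict

VOCAB_SIZE = 1000   # stand-in for the size of the shared vocabulary

class IntentModel:
    def __init__(self, seeds: dict) -> None:
        # seeds: action -> words we expect for it, from domain knowledge
        self.counts: defaultdict = defaultdict(Counter)
        for action, words in seeds.items():
            self.counts[action].update(words)

    def observe(self, utterance: str, action: str) -> None:
        # Called once usage reveals which action the user actually wanted.
        self.counts[action].update(utterance.lower().split())

    def classify(self, utterance: str) -> str:
        words = utterance.lower().split()

        def log_score(action: str) -> float:
            c = self.counts[action]
            total = sum(c.values())
            # Add-one smoothing over an assumed shared vocabulary.
            return sum(math.log((c[w] + 1) / (total + VOCAB_SIZE)) for w in words)

        return max(self.counts, key=log_score)

model = IntentModel({
    "web_search": ["search", "web", "find"],
    "navigation": ["navigate", "directions", "drive"],
    "messaging": ["send", "message", "text"],
})
model.observe("send message to dave running late", "messaging")
print(model.classify("send a text message to joe"))   # -> messaging
```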

We can reduce the input variety by giving users feedback on a “canonical” way of expressing top-level routing input. The general form is “<application_or_action> <content>,” such as Web search restaurants in Cambridge or navigate to 17 Dunster Street Cambridge Massachusetts. We provide this guidance in help screens and in audio feedback.
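
Parsing that canonical form can be as simple as longest-prefix matching over a list of known action phrases, as in the sketch below; the phrase list is illustrative, not Vlingo's actual inventory.

```python
# Illustrative action phrases for the "<application_or_action> <content>" form.
ACTION_PHRASES = ["web search", "navigate to", "send message to", "call", "play"]

def parse_canonical(utterance: str):
    lowered = utterance.lower().strip()
    # Try longer phrases first so "send message to" wins over "send".
    for phrase in sorted(ACTION_PHRASES, key=len, reverse=True):
        if lowered.startswith(phrase):
            return phrase, utterance[len(phrase):].strip()
    return None, utterance   # fall back to intent modeling on the full string

print(parse_canonical("web search restaurants in Cambridge"))
print(parse_canonical("navigate to 17 Dunster Street Cambridge Massachusetts"))
```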

The combined effect of these approaches has led to successful deployments of these unconstrained speech interfaces. Not only are users able to achieve sufficient accuracy for a wide variety of tasks, but they have also come to view these interfaces as broadly applicable.

Usability Metrics and Results

To maintain a single-pointed focus on meeting user goals rather than advancing technology for its own sake, we incorporate user research, data mining, and usability testing into every major project release. For each key feature, we revisit the list of user personas, identifying what goals we will help our primary users achieve, what context they will be in when attempting to achieve those goals, and what elements are required to make the process easier, more efficient, and more satisfying. Throughout the product life cycle, we draw on the following tools from the field of user experience: 

  • During the release definition phase: usage data, surveys, interviews, and focus groups;
  • During the design and development phase: iterative design, and usability testing of paper prototypes and live software;
  • During the quality assurance phase: beta testing; and
  • Post-release: usage data, surveys, and reviewing support incidents.

The Future of Mobile Speech Interfaces

There has been a tremendous amount of progress over the past few years. Just a few years ago, the state-of-the-art of mobile speech interfaces was mainly limited to very constrained device-based applications, such as voice dialing. In addition to the systems that we are deploying, we now see speech interfaces in a number of point applications, including unconstrained speech recognition in voice search from multiple sources, such as Microsoft’s voice-enabled Bing, and Google Search by Voice. We are also seeing dictation applications from major players, such as Nuance Communications. In addition, the latest Android phone released by Google includes a voice interface attached to the virtual keyboard, so any place where you can type, you can now speak.

But there is still a long way to go to reach a truly ubiquitous multimodal interface that works well across all applications and situations. Top-level application launching, combined with speech input into any text field, is the first step toward this broad user interface. But to fully make use of this functionality, applications need to be designed with the speech interface in mind. While allowing speech into any text field does allow broad usage, if the application is designed to avoid text entry, it might not make good use of speech. 

Once speech becomes part of the phone’s operating system, applications will evolve to take advantage of the changes in user behavior that come with having spoken input available across applications. 

This is similar to what happened with touchscreen interfaces. While there were limited deployments of touchscreen interfaces on various devices, the situation changed dramatically when Apple released the iPhone in 2007. By integrating touch as a key part of the operating system, Apple transformed the user experience not only on its own devices, but across the industry. In addition to prompting other mobile device makers to incorporate similar interfaces in their own devices, application developers started taking advantage of this interface in their application design to create a wide array of successful applications. We expect a similar transformation to take place in the next few years as speech is built into the operating systems of devices. Once that happens, we can truly make use of the potential of speech interfaces to allow much richer applications. 

Given the current constraints of text entry on mobile devices, mobile applications are designed to minimize the need for text entry—constraining the goals to what can be achieved with button and menu choices and small amounts of text entry (except, of course, for messaging applications, which cannot avoid the need for text entry). Once people have a much easier and more natural way  to interact with their mobile devices, applications can be much more ambitious about what they can do. In particular, we can start to provide more open interfaces for people to perform a wide variety of tasks. 

Our overall goal is to allow people to say whatever they want, and then have their phones do the right thing across a broad set of possibilities. So people should be able to say things like, Schedule a meeting with me, Dave, and Joe tomorrow around lunchtime, and the phone should be able to interpret this, find the right applications that can handle it, and provide appropriate feedback to the user. While this is ambitious, it is an example of something application developers and phone makers could not even contemplate without a speech interface. Once we see ubiquitous deployments of mobile speech interfaces, we expect applications to be developed with these more ambitious goals and we expect they will become more successful over time. 


About the Book

Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics provides a forum for today’s speech technology industry leaders—drawn from private enterprises and academic institutions all over the world—to discuss the challenges, advances, and aspirations of voice technology, which has become part of the working machinery of everyday life. 

This anthology is divided into three sections—mobile environments, call centers, and clinics—and represents the research findings of more than 30 speech engineers, system designers, linguists, and IT and management information systems specialists. Advances in Speech Recognition is edited by Amy Neustein and features an introduction by Judith Markowitz and Bill Scholz, who jointly wrote the book’s foreword. The book ends with an epilogue by Jim Larson, who forecasts the promises, and sometimes the perils, of advanced speech recognition technology.
