
Are Vendors Asking Speech To Do Too Much?

Have vendors asked too much of speech recognition in the non-telephony world? Is it natural, in human communication, for speech to be the sole means of interaction? When given a choice between using the graphical user interface (GUI) and the "command-and-control" speech recognizer on their PCs, why do users tend to choose the GUI? Could it be because speech recognition vendors have been asking speech to perform an unnatural act - trying to get speech to do the "whole thing" when other modes of communication would be more appropriate? Have they attempted to mimic the GUI (e.g., its menu structure) rather than take advantage of the unique strengths of speech?
Interface Strengths

When, then, is speech appropriate? Speech recognition is at its best over the telephone network because, of course, speech is the most natural tool available when using a phone. Touch-tone/DTMF input is time-consuming and can be frustrating, while speech-driven menus offer the caller a more personal touch. Beyond the telephone, speech is well suited to hands-busy tasks and to small devices such as cell phones and (soon) PDAs. Speech also offers the possibility of bypassing large hierarchical menu systems, taking the user directly to the objects of interest. Finally, spoken language can describe objects of interest, selecting from large sets of entities or referring to entities that are not on the screen.
Weaknesses

Conversely, speech is less useful for describing locations and objects that are already perceptually accessible to the user. For such cases, GUIs and other modes of communication are often more appropriate; modes such as pen-based interaction are better at supporting pointing, drawing, gestures and signatures. However, GUIs are weak precisely where speech is strong; for example, they support description poorly. In summary, there is a need to build multimodal interfaces that allow users to employ each mode for its strengths and to overcome the weaknesses of the others.
Robust Operation

A major reason to consider multimodal interfaces is their potential for robust operation. Studies by Professor Sharon Oviatt at the Oregon Graduate Institute of Science and Technology indicate that people are often good judges of their inability to pronounce certain words (such as foreign surnames) and will switch to a written modality if one is available. People will also switch modes if one mode seems likely to fail, perhaps because of external noise or vibration. Finally, people frequently switch modes in response to recognition errors.

Apart from using different modes as alternatives, there can be significant advantages to using them in combination. First, the opportunity to interact multimodally changes what is spoken, often making it easier to recognize. Oviatt has found that for map-based tasks, multimodal input results in 23% fewer words, 35% fewer spoken disfluencies, 36% fewer task performance errors and 10% faster task performance, as compared with speech-only interaction. Another illustration of the benefit of joint use of modalities is audiovisual speech processing, in which the visual channel helps reduce the uncertainty of speech recognition in noisy conditions. More generally, by determining the best joint interpretation of the multiple input streams, multimodal systems can offer mutual compensation of modalities, using information from one mode to correct interpretation errors in another. The QuickSet system is a good illustration of such a multimodal system.
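Before turning to QuickSet, here is the idea of mutual compensation in miniature. The Python sketch below is purely illustrative and is not drawn from any particular system; the hypotheses, probabilities and compatibility rule are all invented. Two recognizers each produce an n-best list, and the most probable semantically compatible pair wins, even though it contains neither recognizer's top choice.

```python
# A toy illustration of mutual compensation between two recognizers.
# All hypotheses, probabilities and the compatibility rule are invented.

from itertools import product

# Speech n-best list: (interpretation, posterior probability, gesture kind it needs)
speech_nbest = [
    ({"action": "create", "object": "flood zone"}, 0.45, "area"),
    ({"action": "create", "object": "road"},       0.35, "line"),
    ({"action": "create", "object": "checkpoint"}, 0.20, "point"),
]

# Gesture n-best list: (interpretation, posterior probability)
gesture_nbest = [
    ({"kind": "point"}, 0.50),   # the short pen stroke looks most like a point...
    ({"kind": "line"},  0.30),   # ...but could also be a line
    ({"kind": "area"},  0.20),   # ...or a small closed area
]

def compatible(speech_hyp, gesture_hyp):
    """A joint interpretation is coherent only if the spoken object
    can be drawn with the recognized kind of gesture."""
    return speech_hyp[2] == gesture_hyp[0]["kind"]

# Pick the most probable, semantically coherent joint interpretation.
best_speech, best_gesture, best_score = max(
    ((s, g, s[1] * g[1])
     for s, g in product(speech_nbest, gesture_nbest)
     if compatible(s, g)),
    key=lambda triple: triple[2],
)

print(best_speech[0], best_gesture[0], round(best_score, 3))
# -> {'action': 'create', 'object': 'road'} {'kind': 'line'} 0.105
# Neither the top speech hypothesis ("flood zone") nor the top gesture
# hypothesis ("point") survives: each mode has corrected the other.
```

In a real system the coherence test is far richer, of course, but the principle is the same: evidence from one mode re-ranks hypotheses in the other.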
QuickSet

QuickSet is a multimodal (pen/voice) interface for map-based tasks. With this system, a user can create entities on a map by simultaneously speaking and drawing. With pen-based, spoken or multimodal input, the user can annotate the map, creating points, lines and areas of various types. The system controls numerous backend applications, including 3D terrain visualization, military simulation, disaster management and medical informatics.

QuickSet consists of a collection of "agents" that include speech recognition, gesture recognition, natural language understanding, multimodal integration, a map-based user interface and a database, all running standalone on a PC or distributed over a network. The multimodal interface runs on machines as small as Windows CE devices, as well as on wearable, handheld, table-sized and wall-sized displays. The components of the system are integrated via OGI's Adaptive Agent Architecture, which offers facilitated communication, plug-and-play connection, dynamic discovery of agents, asynchronous operation and "wrapper" libraries in C++, Prolog, Java and other languages.

When the pen is placed on the screen, the speech recognizer is activated, allowing users to speak and gesture simultaneously. The user either selects a spot on the map and speaks the name of an entity to be placed there, or draws a linear or area feature while speaking its name. Speech and gesture are recognized in parallel, with the speech interpreted by a natural language parser. The interpretations of the two streams are then fused semantically via a unification procedure, and QuickSet creates the appropriate icon on its map and asks for confirmation. For speech recognition, IBM's VoiceType Application Factory, Microsoft's Whisper and Dragon Systems' NaturallySpeaking can be used interchangeably.

In general, analysis of spoken language and of gesture each produces an n-best list of interpretations and their posterior probabilities. Multimodal integration searches among these for the most likely, semantically and temporally coherent joint interpretation. The best such interpretation may contain neither the best-scoring spoken nor the best-scoring gestural interpretation; in such cases, the multimodal architecture has corrected recognition errors. In a recent study with QuickSet, such mutual compensation provided a 41% reduction in spoken error rate over speech-only interaction for non-native speakers (Oviatt, 1999a). Similar studies of mobile multimodal systems are in progress.
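For the curious, the fragment below sketches that fusion step, using plain Python dictionaries as stand-ins for QuickSet's typed feature structures. The field names, coordinates and four-second temporal window are illustrative assumptions, not QuickSet's actual representation or parameters.

```python
# A minimal sketch of unification-based multimodal integration: the spoken
# interpretation supplies the entity type, the gesture supplies the location,
# and the two merge only if no feature values conflict and the inputs occurred
# close together in time. All names and values here are invented.

from typing import Optional

def unify(a: dict, b: dict) -> Optional[dict]:
    """Merge two partial interpretations; fail (return None) on any conflict."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None          # semantic clash, e.g. a "line" drawn for an "area"
        merged[key] = value
    return merged

# "Flood zone" spoken while the user draws a closed region on the map.
speech_feats = {"action": "create", "object": "flood zone", "shape": "area"}
gesture_feats = {"shape": "area",
                 "location": [(45.52, -122.68), (45.53, -122.67), (45.51, -122.66)]}
speech_time, gesture_time = 12.1, 11.8   # seconds; hypothetical time stamps

# Temporal coherence: accept the pair only if the inputs fall within a short window.
if abs(speech_time - gesture_time) <= 4.0:
    command = unify(speech_feats, gesture_feats)
    if command is not None:
        print("Create", command["object"], "at", command["location"])
```

Run over every pairing of speech and gesture hypotheses, a scoring loop like the one in the earlier sketch would then select the most probable pair that unifies.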
More Testing

How much difference does multimodal interaction make in practice? A comparison of a standard direct-manipulation graphical user interface with the QuickSet interface on a map-based military task offers some evidence. For this task, the entities placed on the map included military units and linear and area features. The users also employed a menu-based GUI designed by another company according to industry-standard interface guidelines. Four military personnel created and entered their own simulation scenarios via both interfaces. Analysis revealed that use of the multimodal interface resulted in a 3- to 4-fold speed improvement in average entity creation time, including all error handling. Time to repair errors was also 6-fold faster with the multimodal interface. Finally, the subjects uniformly preferred multimodal interaction. Although this is just one comparison, the results suggest potentially large niches of applicability for multimodal interfaces. And since the GUI followed standard guidelines, there may also be significant benefits to augmenting a major class of GUIs with multimodal technology.
Next Steps

The virtue of multimodal interaction is that it leverages the power of speech for what it does best, while allowing users to employ other modes where appropriate. The challenge will be to harness speech effectively in a hybrid symbolic/statistical architecture that supports modality synergy at run-time. An important next step will be the development of an interface toolkit that will enable developers to build a new generation of multimodal systems without having to become experts in the underlying technologies.
Philip R. Cohen is Professor and Co-Director, Center for Human-Computer Communication, at the Oregon Graduate Institute of Science and Technology (http://www.cse.ogi.edu/CHCC), and President of Natural Interaction Systems, LLC. He can be reached at 503-690-1326 or pcohen@cse.ogi.edu.
Further Reading
Anyone interested in learning more about the development of multimodal interfaces can consult these works:

Cohen, P. R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., & Clow, J. (1997). QuickSet: Multimodal interaction for distributed applications. Proceedings of the Fifth ACM International Multimedia Conference, 31-40. New York: ACM Press.

Cohen, P. R., McGee, D., & Clow, J. (2000). The efficiency of multimodal interaction for a map-based task. Proceedings of the Applied Natural Language Processing Conference, Association for Computational Linguistics, Seattle, April 2000.

Oviatt, S. L. (1999a). Mutual disambiguation of recognition errors in a multimodal architecture. Proceedings of the Conference on Human Factors in Computing Systems (CHI'99), 576-583. New York: ACM Press.

Oviatt, S. L. (1999b). Ten myths of multimodal interaction. Communications of the ACM, November 1999.

Oviatt, S. L. (1997). Multimodal interactive maps: Designing for human performance. Human-Computer Interaction (special issue on Multimodal Interfaces), 12, 93-129.
