Who Will Win the GUI-VUI Race?
Will a graphical user interface (GUI) or voice user interface (VUI) emerge as the standard user interface on mobile devices? VUI emerged from the starting gate first in the early telephone days when callers spoke with human operators to place a call. At that time, a voice user interface was the only option. VUIs pulled further ahead when speech recognition enabled callers to both speak and press touchtone keys. This formed the basis of today’s VUIs, with callers responding to questions presented by a speech synthesis engine or a prerecorded message by speaking and/or pressing touchtone keys. (Technically, this is a multimodal user interface, or MMUI, because users have two modes for entering data, but most people refer to this as a VUI.)
When mobile devices entered the market, GUIs began to outpace VUIs. Most mobile devices contain a small screen on which information, especially menu options, is placed for the user to read. Most users select menu options by moving a cursor across the options and then pressing a select/enter key. Many mobile phone users have become proficient at entering information by pressing the keys on a miniature keypad instead of using their voice to enter the data. The recent excitement about Apple’s iPhone and Google’s Android cell phones moved GUIs further ahead of VUIs. (At press time, neither Apple’s nor Google’s phones natively support speech recognition.)
End of the Line?
Does this mean the gradual end of VUIs? Yes. VUIs will gradually be replaced by fourth-generation interactive response systems that support multiple modes of input and output. As more powerful wireless networks become widely available, the speech/listen style of VoiceXML 2.1 will be replaced by a listen/read/speak/press style of interaction. Some vendors already have extended VoiceXML 2.1 to support video and still images for cell phone displays. These features will be standard in the forthcoming VoiceXML 3.0 language.
MMUI will follow two strategies: user’s choice and natural multimodal. In the user’s choice strategy, users can receive information via two channels, audio and visual. In eyes-busy, hands-busy environments, users can elect to use the audio channel. In a meeting or at a noisy airport, users might elect to use the visual channel and read all output on the small display screen. When entering information, some users will prefer to speak while others will prefer to press buttons.
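The user's-choice strategy can be sketched in a few lines. This is a minimal, hypothetical illustration (the Presenter, AudioChannel, and VisualChannel names are invented for the example, not taken from any real device API): the same message can travel over either channel, and the user elects which one.

```python
# Hypothetical sketch of the "user's choice" strategy: one message,
# two delivery channels, with the user electing which to use.

class AudioChannel:
    def deliver(self, message):
        # On a real device this would drive a speech synthesis engine.
        return f"[spoken] {message}"

class VisualChannel:
    def deliver(self, message):
        # On a real device this would render to the display screen.
        return f"[displayed] {message}"

class Presenter:
    def __init__(self):
        self.channels = {"audio": AudioChannel(), "visual": VisualChannel()}
        self.mode = "audio"  # default for eyes-busy, hands-busy settings

    def set_mode(self, mode):
        if mode not in self.channels:
            raise ValueError(f"unknown channel: {mode}")
        self.mode = mode

    def present(self, message):
        return self.channels[self.mode].deliver(message)

presenter = Presenter()
print(presenter.present("You have new mail"))   # audio by default
presenter.set_mode("visual")                    # e.g., in a noisy airport
print(presenter.present("You have new mail"))
```

The point of the sketch is that the application logic never changes; only the output channel does, and the choice belongs to the user.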
To support a natural MMUI, designers create user interfaces that use either speech or keypad input, whichever suits the task. For example, users often find it easier to select from among several options presented visually than to listen to a voice menu and then repeat the desired option, and easier to scroll through a textual result than to listen to synthesized speech. However, messages such as “You have new mail” or “Someone is trying to call you” can be presented aurally, often with special sounds called earcons (the audio equivalent of GUI icons). One of the pesky problems with VUIs occurs when the speech recognition system fails to recognize a spoken word. VUI designers apply many strategies to avoid and correct these recognition failures; a mobile device with a screen can simply display the n-best list (the words the speech recognition system considered while trying to recognize what the user said) and let the user select the word that was spoken.
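That n-best recovery step can be sketched as follows. This is an illustrative fragment, not any vendor's API; the function name and the hard-coded hypothesis list are invented for the example. The recognizer returns its candidate words in order of likelihood, the device displays them, and the user taps the one that was actually spoken.

```python
# Hypothetical sketch of n-best error recovery: instead of asking the
# caller to repeat a misrecognized word, show the recognizer's ranked
# candidates on the screen and let the user pick one.

def recover_from_misrecognition(n_best, user_choice_index):
    """n_best: recognizer hypotheses as (word, confidence) pairs,
    ordered from most to least likely. user_choice_index: the entry
    the user taps on the display."""
    words = [word for word, confidence in n_best]
    # The device displays `words`; the user selects one by position.
    return words[user_choice_index]

# The recognizer's best guess is wrong; the user actually said "Boston".
n_best = [("Austin", 0.61), ("Boston", 0.58), ("Houston", 0.22)]
print(recover_from_misrecognition(n_best, 1))  # Boston
```

One tap resolves an ambiguity that might otherwise take several rounds of “I didn’t understand, please repeat” over the audio channel.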
Users sometimes switch channels while entering a single command or request. For example, a user might say, “Put that there,” then select the object to be moved and the target location on the screen. Switching channels also seems natural for transferring funds in banking applications: the user speaks the transfer command, then selects the amount of the transfer and the target account.
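Fusing the two channels into one command can be sketched as below. This is a toy illustration of the idea, with invented names throughout: the spoken phrase supplies the verb and leaves deictic slots (“that,” “there”) open, and the user's screen taps fill those slots in order.

```python
# Hypothetical sketch of multimodal fusion in the spirit of
# "Put that there": speech supplies the action, touch supplies
# the missing arguments.

DEICTIC_WORDS = ("that", "there")

def fuse(spoken_command, taps):
    """spoken_command: recognized text containing deictic slots.
    taps: the user's screen selections, in the order they were made."""
    words = spoken_command.split()
    slots = [w for w in words if w in DEICTIC_WORDS]
    if len(slots) != len(taps):
        raise ValueError("each 'that'/'there' needs a matching tap")
    resolved = iter(taps)
    # Replace each deictic word with the corresponding tapped value.
    return " ".join(next(resolved) if w in DEICTIC_WORDS else w
                    for w in words)

# User says "transfer that to there", then taps an amount and an account.
print(fuse("transfer that to there", ["$200", "savings"]))
# transfer $200 to savings
```

A real system would fuse timestamped recognition results with touch events rather than plain strings, but the division of labor is the same: each channel contributes the part of the command it expresses best.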
In time, callers will use both visual and audio channels. This new style of user interface, the MMUI, will combine GUI and VUI. Rather than a horse race between GUI and VUI, the two user interface styles will merge, each benefiting from the other’s strengths.
So goodbye, VUI. You have served us well, but your time is nearly gone. Hello, MMUI. Glad to meet you. It’s about time we used multiple modes of communication rather than just visual or just oral communication modes.
James A. Larson, Ph.D., is co-program chair for the SpeechTEK 2008 Conference, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and author of the home-study guide The VoiceXML Guide (www.vxmlguide.com). He can be reached at firstname.lastname@example.org.