June 14, 2016
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

In a Mobile World, Voice and Graphical User Interfaces Need to Blend

Until a few years ago, graphical user interfaces (GUIs) and voice user interfaces (VUIs) occupied separate universes. GUIs were what you experienced on your desktop or laptop; your encounters with VUIs were likely limited to your cable company’s automated assistance. Now, with the ubiquity of smartphones, wearables, and other mobile devices with screens (some of them small), designers are faced with creating user interfaces that merge GUIs and VUIs into multimodal user interfaces (MMUIs).

But integrating components of GUIs and VUIs into an MMUI that is easy to learn and use is not a simple undertaking. For one thing, the following human limits greatly affect how GUI components should be integrated into MMUIs:

People have limited memory capacity. According to the so-called Miller’s Law (which some have contested, but we’ll use it here), people can generally hold between five and nine items in their short-term memories. Data displayed on screens eases this memory constraint, because the screen holds the items so the user does not have to. It is easier for many users to read and select options displayed on a screen than it is to remember and speak or type the options. To-do lists, appointments, and frequently used tools and applications can be positioned and displayed in specific locations on a screen.

People also have limited capacity for symbolic manipulation. Most output data should be displayed on a screen so users can examine, browse, and, most importantly, manipulate the data. People edit text, draw pictures, read documents, and perform spreadsheet-like calculations on displayed data. While some gifted people can manipulate complex mathematical equations entirely in their heads, most users find it easier to compute mathematical formulas displayed on a screen.

As for VUI components, an even greater range of human limits and tendencies influences their MMUI integration:

People respond to audio warnings. While blinking and jiggling text attracts the attention of users looking at a screen, audio attracts the attention of users looking anywhere. This suggests that error and warning messages should also be presented as audio, so users quickly perceive the problem.

When engaged in eye-intensive activities like driving, people can be distracted by GUI displays. Drivers’ eyes should focus on the road without switching to attention-grabbing displays. To avoid this kind of distraction, VUIs are preferred over GUIs.

People have difficulty performing multiple tasks at the same time. During tasks that require motion, which can include a variety of physical activities, many users find it difficult to select objects from a screen, either by pointing with their fingers or by using a mouse. So when users are active and moving, VUIs are more convenient than GUIs.

People have difficulty reading small fonts. Wearables such as smart watches and rings have limited screen space or no screen space at all. VUIs are obviously the practical solution for these devices.

The right balance between GUI and VUI components may change depending on the device people are using (desktop, tablet, mobile, wearable, car), the application they are running (entertainment, business productivity apps), or the environment they are in (work, home, car). Users should be able to switch easily between GUI and VUI components as they shift among these different circumstances.
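For developers who want to see how the limits above might translate into logic, here is a minimal Python sketch of a modality chooser. It is only an illustration: the Context fields, the choose_modality function, and the order of the rules are all assumptions made for this example, not an established guideline or a tested design.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    GUI = "gui"
    VUI = "vui"
    BOTH = "both"


@dataclass(frozen=True)
class Context:
    """Hypothetical snapshot of the user's situation (all fields are illustrative)."""
    has_screen: bool = True           # False for screenless wearables such as rings
    small_screen: bool = False        # True for watches and similar devices
    eyes_busy: bool = False           # e.g. the user is driving
    in_motion: bool = False           # e.g. the user is walking or exercising
    is_warning: bool = False          # the output is an error or warning message
    needs_manipulation: bool = False  # the user must edit, browse, or calculate


def choose_modality(ctx: Context) -> Modality:
    """Pick a primary modality; the rule order is an assumption of this sketch."""
    # With no screen at all, voice is the only channel available.
    if not ctx.has_screen:
        return Modality.VUI
    # Eye-intensive activities: keep attention off the display.
    if ctx.eyes_busy:
        return Modality.VUI
    # Warnings are shown on screen and also spoken, so they are noticed quickly.
    if ctx.is_warning:
        return Modality.BOTH
    # Users in motion or reading tiny fonts are better served by voice.
    if ctx.in_motion or ctx.small_screen:
        return Modality.VUI
    # Data the user must examine or manipulate belongs on a screen.
    if ctx.needs_manipulation:
        return Modality.GUI
    # Otherwise offer both and let the user switch freely.
    return Modality.BOTH
```

A call such as choose_modality(Context(eyes_busy=True)) selects the voice interface, while choose_modality(Context(needs_manipulation=True)) selects the screen. A real MMUI would let users override the choice at any time and would tune such rules through usability testing.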

MMUI developers need very specific guidelines, preferably ones based on empirical evidence, to assist them in designing and implementing such interfaces. VUI guidelines can be found by visiting the Association for Voice Interaction Design’s Web site or by checking out How to Build a Speech Recognition Application: Second Edition: A Style Guide for Telephony Dialogues by David P. Morgan and Bruce Balentine. Many device vendors list GUI guidelines specific to their devices.

Such guidelines should explain which VUI and GUI components are necessary or useful for a range of situations, and they should be based on well-known principles of human perception and systematic usability testing. And for developers, the overarching goal should be to create MMUIs that are easy to learn—and easy (and safe) to use.

James A. Larson, Ph.D., is an independent speech technology consultant and teaches courses in speech technologies and user interfaces at Portland State University in Oregon.