Voice-First User Interfaces Speak to the Omnichannel Future
Until relatively recently, there were screen-oriented devices (TVs, desktops, laptops, and tablets) that presented information visually via a screen or display, and there were voice-oriented devices (phones) that used voice for both presentation and capture—two forms of user interface that evolved separately. But now the two interfaces have begun to merge.
Screen-oriented devices have been enhanced with voice: Comcast has voice-enabled TV controllers so viewers can use speech to manage their TVs; computer operating systems accept voice commands for many applications; and car dashboard systems respond to drivers' spoken commands.
Meanwhile, voice-oriented devices have been enhanced with screens: Telephone handsets have largely been replaced by smartphones with touchscreens, and VoiceXML, the language used to develop voice-only interfaces for phones, has been extended to Visual VoiceXML to support the display of prompts, menus, videos, and illustrations on mobile device screens. Omnichannel systems coordinate the voice and visual channels.
Even smart speakers now have screens. Amazon recently introduced the Echo Show, which extends the verbal dialogues of smart speakers to include the display of onscreen information. This enables do-it-yourself applications that listen to verbal instructions and present short demonstrations of specific tasks. Users control the dialogue by speaking while keeping their hands free. For example, users can view each step for preparing and baking a cake by speaking commands, such as “next” or “repeat,” to progress through short video segments.
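The hands-free walkthrough described above boils down to a small amount of dialogue logic: a list of steps plus a handler that maps recognized voice commands to navigation. The sketch below illustrates the idea in Python; the class and step names are hypothetical and do not reflect any actual Alexa API.

```python
# Minimal sketch of a voice-driven walkthrough: spoken commands
# such as "next" and "repeat" move through a fixed list of demo
# steps. All names here are illustrative, not a real device API.

class Walkthrough:
    def __init__(self, steps):
        self.steps = steps
        self.index = 0  # position of the step currently on screen

    def current(self):
        return self.steps[self.index]

    def handle(self, command):
        """Apply a recognized voice command; return the step to display."""
        if command == "next" and self.index < len(self.steps) - 1:
            self.index += 1
        elif command == "previous" and self.index > 0:
            self.index -= 1
        # "repeat" (or anything unrecognized) replays the current step
        return self.current()

demo = Walkthrough(["Mix the batter", "Pour into pan", "Bake 30 minutes"])
print(demo.handle("next"))    # → Pour into pan
print(demo.handle("repeat"))  # → Pour into pan
```

A real system would attach a short video segment to each step, but the command-to-state mapping stays this simple.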
The term voice-first describes a system that accepts user input primarily via voice commands and may augment audio output with a screen display. Voice-first interaction requires rethinking the design of voice-only dialogues. Users speak commands for search requests and view results on one or more screens. The presence of screens extends the functionality of voice systems in the following ways:
- Screens make invisible data visible. Voice-only user interfaces make it harder for users to visualize the object they are attempting to review or manipulate. Being able to see an account statement, for example, makes it easier for users to review the details.
- They extend human memory. A screen enables users to see what they might not be able to recall quickly. For example, helping users select and speak one of several visible options is more efficient than having them try to remember desired options.
- They simplify complex tasks. Tasks are made easier when users can select and concentrate on a single task step. For example, in fillable form applications, users can select and complete the specific answer to a particular question.
- They display current status. When resuming complex tasks after a disruption, a visual road map of steps, both complete and incomplete, helps users review their progress.
- They display feedback. Users feel uneasy if they don’t receive feedback confirming that their actions were completed successfully. When displayed objects are manipulated, users can view the results of their interactions. Visual feedback also gives a signal that users can speak.
- They enable discovery. One of the biggest problems users face with a new system is determining what they can do with it and how to use it. The process of discovery enables users to ask high-level questions such as “What can I do here?”, which displays commands on the device screen. Discovery also saves users from having to memorize sets of commands.
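The discovery point above suggests a concrete pattern: a voice-first dialogue manager keeps a registry of commands so that a high-level question like “What can I do here?” can be answered by displaying the available commands on screen, sparing users from memorizing them. The sketch below illustrates that pattern; the class, method, and command names are hypothetical, not taken from any particular platform.

```python
# Hedged sketch of command discovery in a voice-first interface.
# Registered commands double as the answer to "what can I do here":
# the dialogue manager returns them for on-screen display instead
# of forcing users to memorize them. All names are hypothetical.

class VoiceFirstDialogue:
    def __init__(self):
        self.commands = {}  # phrase -> (handler, description)

    def register(self, phrase, handler, description):
        self.commands[phrase] = (handler, description)

    def handle(self, utterance):
        """Route a recognized utterance to a spoken or displayed response."""
        utterance = utterance.strip().lower()
        if utterance == "what can i do here":
            # Discovery: list every registered command for the screen.
            return {"display": [f"{p} - {d}"
                                for p, (_, d) in self.commands.items()]}
        if utterance in self.commands:
            handler, _ = self.commands[utterance]
            return handler()
        return {"speak": "Sorry, I didn't understand."}

dlg = VoiceFirstDialogue()
dlg.register("show balance",
             lambda: {"display": ["Balance: $120.00"]},
             "show your account statement on screen")
print(dlg.handle("What can I do here"))
print(dlg.handle("show balance"))
```

Returning a dictionary keyed by output channel (`speak` vs. `display`) mirrors the article's point that voice-first design must decide, per response, what is presented verbally and what is presented visually.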
There are plenty of user interface issues to resolve. We need more case studies, better usability metrics, and extensive experimental data about how voice-first dialogues should be designed and implemented. Indeed, the design of such dialogues is still in its infancy. What information should be presented verbally and what information should be presented visually? Should users select visual options by touching the screen or by speaking? If you want to separate visible information into categories such as “active” or “status,” where and how should each category be displayed—in the corner of the screen, on a separate screen, or made visible upon demand? Can users view information on multiple screens? Can multiple users share screens—and how do you manage screen sharing?
To be sure, voice-only interfaces won’t disappear entirely; they will continue to be used for eye-busy applications. But as screens, microphones, and speakers become more available, whether co-located on the same device or connected via the Internet, the use of voice-only dialogues will decrease and voice-first dialogues will become more popular and widely used, taking full advantage of all those proliferating screens.
James A. Larson is program chair for SpeechTEK 2019. He can be reached at firstname.lastname@example.org.