Move over GUI, Hello SVUI
As the name suggests, a multimodal user interface supports multiple modes of interaction, modes that include text, images, and audio. People generally operate in a single mode per device interaction: For example, one could type and read text in one interaction, click on and view images in another interaction, and speak and listen to audio in yet another. But I’d like to focus here on a special type of multimodal user interface, which I’ll call screen and voice user interface (SVUI), that uses one mode for input and a different mode for output during a single interaction.
Specifically, users speak audio into a microphone and view text and images on a screen. SVUI is becoming popular on mobile devices, which have both microphone and screen. SVUI is also a popular addition to smart speakers with an attached screen. The screens serve several essential functions: They present pictures and videos and can display a dashboard containing current status and available controls/commands that users may speak to or manipulate directly. Screens also serve as extensions to user memory, displaying the results of recent actions, alerts, and help information.
As these kinds of interfaces become more prevalent, they will lead to the following advances in functionality and hardware:
1. Techniques for selecting options will expand.
There are several techniques for selecting options displayed on screens. The most widely used technique today is to use a mouse to position a cursor over the desired option in a menu and click. SVUI techniques involving a screen and voice control include these: (1) users can simply speak the desired option; (2) if an option is represented by an icon or picture, users can request that numbers be displayed by each option and then speak the number of the desired option; (3) if filling in a form, users can speak natural language phrases, such as “Send the package to Fred at 14 Elm Street in Dallas, Texas.” The natural language processing facility determines that the recipient is Fred, the address is 14 Elm Street, the city is Dallas, and the state is Texas. SVUI saves users from recalling data from memory; instead, they can recognize it on a screen and speak a value or option.
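The first two techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a speech recognizer has already turned the user's audio into a text transcript, and the names `number_options` and `resolve_selection` are hypothetical, not part of any existing SVUI toolkit.

```python
# Hypothetical sketch: resolve a spoken selection against options shown on screen.
# Assumes speech recognition has already produced a text transcript.

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9}


def number_options(options):
    """Build the on-screen labels, placing a number beside each option."""
    return [f"{i}. {name}" for i, name in enumerate(options, start=1)]


def resolve_selection(transcript, options):
    """Map a recognized utterance to a displayed option, or None if no match."""
    spoken = transcript.strip().lower()

    # Technique 1: the user speaks the option name itself.
    for name in options:
        if spoken == name.lower():
            return name

    # Technique 2: the user speaks the number displayed beside the option,
    # either as a word ("two") or as digits ("2").
    index = NUMBER_WORDS.get(spoken)
    if index is None and spoken.isdecimal():
        index = int(spoken)
    if index is not None and 1 <= index <= len(options):
        return options[index - 1]

    return None
```

With the options `["Play", "Pause", "Stop"]` on screen, speaking either "pause" or "two" would select Pause. A real system would also need to handle recognition errors and near matches, which this sketch ignores.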
SVUIs can also take advantage of user history and AI techniques to provide predictive suggestions for completing user actions. Here are some examples:
- Word completion. When spelling a word (such as the name of an individual), the SVUI could display a menu of words beginning with the letters that the user spoke, so the user can select a word from the menu and/or speak the final letters of the word.
- Sentence completion. The SVUI could display a collection of words that complete a phrase or sentence, based on similar phrases the user previously entered.
- Automatic form filling. The SVUI could fill in a displayed form with values that the user entered in previously completed forms. The user can edit any form value that was incorrectly supplied by the SVUI.
- Help in context. A dashboard or schematic of a process is displayed on a screen. If the user asks for help, the dashboard/schematic is overlaid with the names of commands, enabling the user to control the process by speaking.
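Word completion, the first example above, could be approximated with a simple prefix match against words the user has entered before. The sketch below is one possible approach, not a description of any shipping product; the function name `complete_word` and the idea of ranking by how often a word appears in the user's history are assumptions made for illustration.

```python
# Hypothetical sketch: suggest words that begin with the letters spoken so far.
from collections import Counter


def complete_word(spoken_letters, history, limit=5):
    """Return up to `limit` previously used words that start with the spoken letters.

    Words the user has entered more often are listed first; ties are
    broken alphabetically so the on-screen menu order is stable.
    """
    prefix = "".join(spoken_letters).lower()
    counts = Counter(history)
    matches = [word for word in counts if word.lower().startswith(prefix)]
    matches.sort(key=lambda word: (-counts[word], word))
    return matches[:limit]
```

If the user has previously entered Fred twice and Frank, Fiona, and Francesca once each, then speaking "F, R" would produce the menu Fred, Francesca, Frank. The user could then select an entry by name or by number rather than spelling out the rest of the word.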
2. New hardware—both microphones and screens—will be everywhere.
MEMS microphones, which use microelectromechanical systems (MEMS) technology to put a microphone on a chip, will make microphones cheap and plentiful. They can be deployed in wearables (watches, lapel microphones, jewelry), home appliances (refrigerators, stoves, coffee makers, TVs), and control boxes (for garage openers, lighting controls, and environmental controls). Using an SVUI, users can speak commands to control devices equipped with MEMS microphones.
And in addition to large, high-resolution screens for watching movies and TV, as well as specialized display devices for games and virtual reality, small, inexpensive screens will also pop up everywhere, in wearables, home appliances, and control boxes. Using the combination of SVUIs, embedded small screens, and microphones, users will be able to directly interact with any device.
3. Centralized control will move closer to reality.
The ubiquitous mobile phone will evolve to become a hub for managing devices and appliances. Using an SVUI, users will be able to control any device in the house. Mobile devices will be able to send messages to and receive messages from devices throughout the home to automatically turn them on and off. The mobile phone hub will also be able to send messages to and receive messages from servers, both in the house and in the cloud. Users will be able to switch between the SVUIs for various devices, giving the user centralized control of everything in the home.
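To make the hub idea concrete, here is a minimal in-memory sketch. It assumes that each device understands only "on" and "off," that speech has already been transcribed to text, and that messaging is simulated by updating a dictionary rather than by any real home-automation protocol. The class name `Hub` and its methods are hypothetical.

```python
# Hypothetical sketch: a phone-based hub that switches between device SVUIs.
# Device messaging is simulated in memory; no real network protocol is used.

class Hub:
    def __init__(self):
        self._devices = {}   # device name -> current state ("on" or "off")
        self.active = None   # device whose SVUI is currently on screen

    def register(self, name):
        """Add a device to the hub; devices start in the off state."""
        self._devices[name] = "off"

    def switch_to(self, name):
        """Bring a registered device's SVUI to the foreground."""
        if name not in self._devices:
            raise KeyError(name)
        self.active = name

    def handle(self, transcript):
        """Apply a spoken on/off command to the active device.

        Returns the status line that the screen would display.
        """
        if self.active is None:
            return "no device selected"
        words = transcript.lower().split()
        if "on" in words:
            self._devices[self.active] = "on"
        elif "off" in words:
            self._devices[self.active] = "off"
        else:
            return f"{self.active}: command not recognized"
        return f"{self.active}: {self._devices[self.active]}"

    def status(self):
        """Return a copy of every device's state for a dashboard view."""
        return dict(self._devices)
```

After registering a lamp and a coffee maker, a user could switch to the lamp, say "turn on," then switch to the coffee maker and say "off." The dashboard returned by `status()` would show both devices at once, which is the centralized view described above.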
The future promised by SVUIs comes with several challenges familiar to existing system designers, including these:
- ensuring user privacy, especially with concerns about third-party eavesdropping and surveillance;
- protecting against hackers;
- managing complex systems (and hiding that complexity from the user);
- recognizing speech in noisy environments; and
- collecting and applying training data for voice recognizers and dialogue managers.
In addition, new challenges are emerging, including how to format and arrange content onscreen; choose the command names and options to be spoken by users; and develop guidelines, methodologies, and tools to construct SVUIs. And problems we haven’t anticipated will undoubtedly arise.
Call to Action
SVUIs are in the future, but we can be proactive and take steps now to experiment with predictive suggestions. What do users find useful, inhibiting, and/or distracting? When it comes to designing user-device interactions using spoken voice for input and screens for output, what should the user speak and what should be presented to the user onscreen? Now is the time to anticipate and plan how SVUIs will be used correctly and efficiently in the future.
What’s more, how do we identify and validate the new guidelines, methodologies, and tools for designing dialogues over multiple interactions? How do users move among multiple individual user-device interactions? The answers to these questions will help inform the design of our user interfaces of the future.
James A. Larson, Ph.D., is senior advisor to the Open Voice Network and is the co-program chair of the Speech Technology Conference. He can be reached at email@example.com.