Multimodal User Interfaces Supplanting Voice-Only Apps
Combining multiple technologies can spawn exciting experiences. Silent movies once entertained audiences with a stream of video alone; enhancing that stream with audio produced the modern motion picture.
The old black telephone enabled conversations between two people; with interactive voice response (IVR) applications, one of those parties can now be a computer. Graphical user interfaces let users converse with computers visually. Smartphones can combine the two, supporting rich conversations that blend audio and video. As rich conversations evolve, users will prefer them to audio-only or video-only conversations.
The shift from a single mode to multiple modes of interaction with computers and mobile devices has begun in earnest. Smartphones enable users to both “dial by voice” and “key in” phone numbers. Users may navigate and launch apps by speaking voice commands, clicking keypads, or touching screens. These primitive forms of multimodal interaction will lead to user-computer interactions resembling natural person-to-person conversation.
Trends. Several trends are leading toward more natural user-computer interactions, including the increased use of:
• statistical language models to allow more flexibility in how users phrase their communication;
• speaker verification and other biometric technologies, including voice and vision, to replace user IDs and passwords;
• video clips for visual demos and improved usability;
• multilingual applications to interact in users’ preferred languages; and
• personalization to generate conversations relevant to users’ current interests and context.
Trendsetters. Current platforms and languages, such as VoiceXML 2.0 and HTML 5.0, are being extended to support the above trends. A W3C effort will define a speech technology API for HTML so that Web pages can speak with and listen to users. Currently, the Voice Browser Working Group is developing VoiceXML 3.0 to let users “see” as well as “hear.” Several VoiceXML 2.0 vendors, including Converse and VoxPilot, have extended their platforms to support video output.
Proceed cautiously. Many companies have large investments in voice-only VoiceXML 2.0 applications, so some developers consider it prudent to make only minimal changes to those applications to support the new trends. As a stopgap measure, several suggestions have emerged for upgrading VoiceXML 2.0 apps without rewriting them:
1. Route textual prompts to both the verbal and visual channels so users can read and hear the prompts on their mobile screens.
2. Provide a graphical view of menu options that illustrates the relationship among options, such as an illustration of the components of an engine and how they fit together or a flowchart to illustrate the steps in a process. The visual structure makes it easier for users to locate a desired option.
3. When the speech recognition engine reports low confidence in a recognized word, display the list of candidate words (the “n-best list”) on the screen and ask the user to select the word they uttered.
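The third suggestion can be sketched in a few lines of server-side logic. This is a minimal illustration, not part of any VoiceXML API; the function name, result format, and confidence threshold are all assumptions chosen for the example.

```python
# Hypothetical dialog logic for suggestion 3: when recognition
# confidence is low, fall back to showing the n-best list so the
# user can pick the word they actually said.

CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff, tuned per application


def handle_recognition(nbest):
    """nbest: list of (word, confidence) pairs, best hypothesis first."""
    best_word, best_conf = nbest[0]
    if best_conf >= CONFIDENCE_THRESHOLD:
        # High confidence: accept the top hypothesis outright.
        return {"action": "accept", "word": best_word}
    # Low confidence: render the candidates on the handset's screen
    # and prompt the user to choose among them.
    return {"action": "display_nbest", "candidates": [w for w, _ in nbest]}
```

A high-confidence result proceeds as in the original voice-only dialog; only the low-confidence path engages the visual channel, which is why the change leaves the application's basic structure intact.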
Those changes are easy to implement because they don’t alter the basic structure of the voice application, yet they let users employ both the voice and visual channels. As a result, developers gain time to redesign the application to take full advantage of multiple modes of user input.
New class of applications. Simple visuals integrated with traditional VoiceXML-style applications enable a new class of multimodal applications that were awkward or impossible to support with a voice-only interface: hands-busy applications such as hands-on training, assembling, diagnosing, troubleshooting, and repairing. Visual representations of those tasks are useful because the user can see and hear instructions while their hands manipulate objects.
Benefits. VoiceXML augmented with videos and graphics provides a limited form of rich user interface. Users whose mobile devices have suitable displays will reap the benefits of visual presentations, while traditional phone users will still benefit from the voice-only portion of the application. These apps can be tested and analyzed before a full-fledged multimodal user interface is constructed in the future.
By developing and deploying multimodal applications for computers of all sizes and shapes, we will make those computers easier to use.
James A. Larson, Ph.D., is an independent speech consultant. He is program chair for SpeechTEK and its sister conference, SpeechTEK Europe. A former chair of the W3C Voice Browser Working Group, he teaches courses in speech user interfaces at Portland State University and the Oregon Institute of Technology. He can be reached at firstname.lastname@example.org.