June 23, 2021
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

3 Trends That Will Shape IVA Development

As we’ve all witnessed in recent years, the world of voice and speech technology is constantly changing, and perhaps no sector exemplifies this better than intelligent virtual assistants (IVAs). Once more of a consumer novelty, IVAs are becoming critical parts of businesses and organizations of all types. The following three trends will shape the development of future voice assistants.

Trend No. 1: A Voice Assistant for Every Business

General voice assistants—like Siri, Alexa, Google Assistant, and Bixby—were designed to be jacks of all trades, covering a wide range of topics. However, assistants are expensive to create, so new general voice assistants are unlikely to appear. Instead, future voice assistants will focus on serving the specific needs of organizations and their customers.

Just as every business has a web page to describe its products and services, every business will have a voice assistant to engage current and potential customers with its offerings. Examples of more narrowly focused voice assistants include Beeb, the BBC’s voice assistant; JiLL, developed for real estate firm JLL; MBUX, Mercedes-Benz’s voice assistant; and eno, from Capital One.

There is a growing need for interactive, flexible conversational assistants in addition to the traditional web-based assistants. The benefits are enormous: Voice assistants can guide customers through every self-help step, including selecting, ordering, delivery tracking, installing, troubleshooting, debugging, and repairing a company’s products, and can connect them to a live-help agent when necessary.

And IVAs are becoming easier to create. Start-ups such as Jovo, RASA, Alan AI, and SpokeStack provide development tools, templates, and code generators for voice assistants. Educational books, webinars, presentations, workshops, videos, and classes are available for developer training. A community of expert speech developers is now available to implement voice assistants.

The upshot: Every business will soon have its own voice assistant, and these voice assistants won’t be limited to a single platform, as we’ll see in the next trend.

Trend No. 2: Platform-Independent Voice Assistants

There are several very compelling reasons to make voice assistants work on multiple platforms:

Customer convenience. Voice assistants need to be convenient for customers. They are always with you if they can be accessed from anywhere (at home, at work, while traveling) through the microphones and speakers embedded in smartphones, wearables, cars, appliances, and smart speakers.

Concerns about vendor behavior. A particular platform vendor may change its usage policies and service fees. Some organizations may suspect that vendors eavesdrop on private conversations between organizations and their customers.

Customer usage changes. Customers may switch platforms as new ones become available. Your company may find new ways of messaging and connecting with customers that require alternative platforms.

Limited capabilities. Some platforms may lack the most current security techniques. Other platforms may not be able to discern customer sentiments, which could affect how your voice assistant responds to customers. Other vendors may prevent the use of sonic branding.

To improve usability, voice assistants will support a variety of modes (more on this point later), such as voice, text, graphics, video, and, possibly, tactile devices.

There’s a disadvantage to developing nearly identical voice assistants for multiple platforms: When updates are required, the developer must either modify each code base or modify a master copy and regenerate the code for each platform. To overcome this problem, the Open Voice Network is formulating protocols for interoperable agents so that an agent on one platform can be invoked by agents from another platform.

Trend No. 3: Multimodal

U.S. data suggests that, on average, humans can speak three times faster than they can type (120 words per minute vs. 40) and that they can read (and understand) twice as fast as they can accurately listen. This suggests that user interfaces should support speech data in, screen-based data out. Text has other advantages as well; it can be used in situations where voice is difficult to hear (noisy airports), there is noise pollution (traffic, construction), or there are privacy concerns (restaurants, hotel lobbies).

Voice assistants (such as those implemented in VoiceXML) will be enhanced to overcome the shortcomings of speech-only interaction. Voice menus will be displayed on a screen ideally located close to the user, and long verbal descriptions will be presented as text and/or graphics. Convenience will be a factor here. Earphones augmented with displays, possibly on the user’s watch or some other wearable, would seem to win out over stand-alone speakers, whose popularity will likely decrease as they gain displays.

Of course, the shift to multimodal means that text-only chat assistants will adapt as well, as they’ll be enhanced with voice—with emotion embedded in the voice presentations—to overcome the shortcomings of using only text.

Web pages frequently are “responsive” to the size of the screen on which they are displayed, automatically adjusting the layout to the screen’s dimensions. But when a screen is small or nonexistent, the content could be presented as voice content. A picture, for example, could be rendered as a spoken caption. With such multimodal capabilities in place, users could easily switch between devices as they go about their daily business without losing any of their interactions’ content.
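As a rough illustration of the idea, the Python sketch below picks an output mode for a picture depending on the device in use. Everything in it is an assumption made for illustration: the Device and Picture classes, the render function, the 200-pixel cutoff, and the example URL are all hypothetical and do not come from any particular platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical description of the device the user is currently on."""
    name: str
    screen_width_px: Optional[int]  # None means the device has no screen
    has_speaker: bool


@dataclass
class Picture:
    """A content item with both a visual form and a caption."""
    image_url: str
    caption: str


# Assumed cutoff below which a screen is treated as too small for an image.
MIN_IMAGE_WIDTH_PX = 200


def render(picture: Picture, device: Device) -> dict:
    """Choose an output mode for the picture based on the device's screen."""
    width = device.screen_width_px
    if width is not None and width >= MIN_IMAGE_WIDTH_PX:
        return {"mode": "image", "content": picture.image_url}
    if device.has_speaker:
        # Small or missing screen: speak the caption instead.
        return {"mode": "speech", "content": picture.caption}
    # Small screen and no speaker: show the caption as text.
    return {"mode": "text", "content": picture.caption}


photo = Picture("https://example.com/router.png",
                "A router with the reset button on the back panel.")
phone = Device("phone", 390, True)
speaker = Device("smart speaker", None, True)

print(render(photo, phone)["mode"])    # image
print(render(photo, speaker)["mode"])  # speech
```

Because the picture carries its caption with it, the same content item can follow the user from a phone to a smart speaker without being rewritten for each device.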

The next step is emotion detection. If a voice assistant can detect the user’s emotional state, then it can respond appropriately. If the user’s emotion suddenly changes from calm to angry, for example, the assistant might respond by changing its phrasing to be more calming. Capturing the user’s likes and dislikes is also important; help chats and advertisements could be targeted to users’ interests.
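A minimal sketch of the calm-to-angry example might look like the following. It assumes a hypothetical upstream detector that labels each user turn as "calm" or "angry"; the respond function, the labels, and the calming wording are illustrative assumptions, not part of any real product.

```python
# Assumed wording for a more calming opening; any real assistant would tune this.
CALMING_PREFIX = "I'm sorry for the trouble. Let's sort this out together. "


def respond(message: str, previous_emotion: str, current_emotion: str) -> str:
    """Return the message, softened if the user has just shifted from calm to angry."""
    if previous_emotion == "calm" and current_emotion == "angry":
        return CALMING_PREFIX + message
    return message


print(respond("Please restart the router.", "calm", "angry"))
```

How reliably emotion can be detected, and how much the phrasing should change, are exactly the open questions that still need research and testing.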

The use of emotion and the tradeoffs among speech, graphics, text, video, and other modes are still not fully understood. More research, experience, and testing are needed to develop guidelines for the correct balance of speech and other modes, and for how to use emotion appropriately in user interfaces.

To put it mildly, voice systems are evolving. If you have not done so already, start preparing for your business to support voice assistants. And plan on your new voice system being platform-independent and multimodal.

James A. Larson is senior advisor to the Open Voice Network and co-program chair of the SpeechTEK Conference. He can be reached at jlarson@infotoday.com.
