May 3, 2010
Features

Will Multimodal Kill the Speech Star?

In September 1979, a British New Wave group called The Buggles released their debut single, “Video Killed the Radio Star”—a song about a music career cut short by the advent of music videos. But on another level, the song also was about change, evolution, and the need to keep pace with new and emerging technologies. And in that respect it offers some vital lessons for the worldwide speech technology community.

As everyone knows, speech technologies are evolving. We now live in a multimodal, multichannel world. And all of us—speech vendors, enterprises, and end users—need to keep pace with the latest innovations or be left behind like washed-up radio stars.

Before the growth and rise of multimodality, the speech industry had a “sort of monopoly” on the way people communicated with computers, according to speech application consultant Jim Larson, who is also co-chair of the World Wide Web Consortium’s Voice Browser Working Group. If a customer wanted to check her bank balance, then the only real option was to make a phone call and interact with an interactive voice response (IVR) system.

“Speech technologies were practically the only way people could interact with the computer,” he says. “Those days voice was it. And I suspect all the voice browsers felt pretty good about that.”

But along came new innovations: mobile phones, smartphones, and applications that allowed text and video to appear on the screens of handheld devices. In some cases, that text or video had the potential to replace speech technology. When this happened, Larson says there were two reactions: People viewed the change either with fear or as an opportunity. Larson, who has been in the speech game for decades, was firmly in the latter camp, recognizing an opportunity to provide speech to a new breed of applications.

“I see new opportunities for speech in this multimodal, cross-modal environment to enable applications that could not really be done before or could not be done easily,” he says. “So I see an opportunity for speech vendors to supply more speech technology because the marketplace is expanding into new areas.

“Some people view multimodal as competition to speech. I view it as not a competition, but a collaborative, a good assist to speech…in many cases it complements speech.”

Bill Scholz, president of the Applied Voice Input/Output Society (AVIOS), agrees with Larson’s assessment. As the speech technology marketplace moved forward, Scholz says, it became increasingly appropriate to pair speech with other modes of input.

“Early on [multimodality] really imposed something of a threat because suddenly people had alternatives to the simple use of speech,” he says. “That then turned from being a threat to being a challenge for the speech recognition folks.”

Larson notes that because applications and technology are used in a variety of environments and under many different circumstances, multiple input modes are not just convenient but necessary. In an office or automobile, speech recognition works well. However, this is not necessarily the case in a noisy warehouse or crowded airport.

“It’s nice to have some kind of backup input mode,” Larson says, citing keypads and touchscreens as examples. “[I] believe that every application should have at least one backup form of input if, for some reason, the users are not able to use the primary form.”

In addition to convenience and ease of use, Larson notes that multiple modes of input are important for providing access for disabled users. “I think it’s necessary,” he says. “It’s a very natural thing to use multiple modes of input to the computer.”

Again Scholz concurs, asserting that multimodality provides an opportunity for end users to exploit different means of communication at different times: “It allows the end user to use a particular communication method that is most appropriate both for the situation and to the particular communication that the individual is trying to convey.”

Avoiding Extinction

So if multiple channels and multimodality really are a boon for the speech industry, then what should vendors and enterprises do to keep pace as technology rushes forward?

“I think the successful people will look at these things with anticipation and see opportunities where if they move fast enough they can take advantage of them,” Larson says. “The companies that look at this and are paralyzed by fear will eventually fall by the wayside.”

Larson stresses that companies should conduct extensive user requirements analysis and research to discern what users need and how they behave in various circumstances. However, he is quick to add that users don’t always see the big picture, so companies might need to supplement users’ opinions and preferences with insight from designers.

When implementing multiple modes of input, Larson cites two general approaches. The first, which corresponds with the approach of the W3C’s Multimodal Interaction Working Group, he calls the distributed approach.

Larson describes this approach as a fairly sophisticated system of cooperating modules or entities—speech recognition and global positioning systems, for example—that communicate with one another. “There is a whole framework devoted to what I think is a fairly sophisticated approach to multimodal that could work in both centralized and distributed environments,” he says.

The second approach, which Larson says is much simpler, is an ad-hoc approach in which vendors decide which Web-based services are needed for an application and then create mashups by basically integrating those services.

“It’s a way of bringing together diverse services and new, exciting, and unusual ways to create new applications,” he says, noting no standards govern these mashups and they aren’t easy to migrate from machine to machine. “In general, I think these two approaches will eventually merge together.”

Larson predicts a convergence of the discipline of the former and the creativity of the latter. “Eventually we’ll take the best of both,” he says.

Larson also encourages companies to work toward setting standards so that applications can work on multiple devices; simple standards will benefit both users and vendors by allowing applications written for one machine to work on another. “Right now it’s the Wild West,” he says. “Everybody does everything differently, and it’s kind of an ad-hoc, chaotic situation.”

Scholz suggests that vendors focus on tight integration of input from other modalities, such as when a touchscreen is used to reduce ambiguity or increase the accuracy of a speech recognition response.

“That’s integration of two different modalities—touch input and speech input—but they are not operating independently,” he says. “They are working in conjunction with one another. That again creates a need for the recognizer vendors to focus explicitly on the multimodal, multichannel input and integrate the two in the input interpretation process.”
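The kind of integration Scholz describes can be pictured with a small sketch. The Python below is a hypothetical illustration, not any vendor's recognizer: the `Hypothesis` class, the `fuse_with_touch` function, the confidence scores, and the `boost` value are all assumptions made for the example. It takes a recognizer's ranked guesses and raises the score of any guess that mentions the item the user touched on screen.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str     # candidate transcription from the recognizer (illustrative)
    score: float  # recognizer confidence, assumed to lie in [0, 1]


def fuse_with_touch(hypotheses, touched_label, boost=0.3):
    """Pick the best hypothesis after boosting those that mention the touched item.

    `boost` is an arbitrary illustrative weight, not a tuned value.
    """
    def fused_score(h):
        mentions_touch = touched_label.lower() in h.text.lower()
        return h.score + (boost if mentions_touch else 0.0)

    return max(hypotheses, key=fused_score)


# Speech alone slightly favors "Austin"; a touch on "Boston" tips the result.
n_best = [
    Hypothesis("show flights to Austin", 0.48),
    Hypothesis("show flights to Boston", 0.45),
]
best = fuse_with_touch(n_best, touched_label="Boston")
print(best.text)
```

In this toy case the two modalities are interpreted together rather than independently: the touch does not replace the spoken request, it only resolves which of two similar-sounding city names the speaker meant.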

Additionally, Scholz stresses that because statistical language models, which are expensive and complicated to create, are being used instead of handwritten grammars to constrain utterances, speech recognition vendors must drive down the cost of creating those models.

Multichannel Integration

Ryan Joe and Aphrodite Brinsmead, associate analysts at Ovum, also see the speech industry evolving with the rise of multimodality and multichannel communications. However, Joe notes that the voice channel is still the primary way customers contact enterprises, which is why vendors need a strategy that enables them to properly integrate multichannel solutions into existing portfolios.

“Just because it’s there doesn’t necessarily mean they’re going to be able to market it or deliver it in a way that adds value for their enterprise customers,” Joe says, stressing the importance of data transference from one channel to another, such as from email to voice interaction.

“There needs to be that unsiloed flow,” he adds, noting that some companies send outbound SMS messages to customers, but lack the ability to read and reply to customer responses—many of which, Joe says, end up in a “black hole.”

Enterprises need to work closely with vendors to monitor customer behavior, determining which channels are being used and whether the investment in a multichannel solution is relevant, Brinsmead adds. She points to NICE Systems, with its multichannel analytics, and ClickFox as companies leading the multimodal, multichannel pack.

Another company working to stay ahead of the curve is Nuance Communications. According to Amy Livingstone, the company’s senior director of enterprise marketing, the majority of customer touches to Nuance systems come from handheld devices.

“It’s no longer whether multimodal, but in what form,” she says. “We really find the relevance of speech has increased along with this trend.”

Livingstone asserts that smartphone adoption has dramatically changed the traditional view of contact center and customer care technology. “Enterprises are building applications for customer care for smartphones, and we work with a lot of our customers who have done that or are doing that,” she says, noting that Nuance is heavily involved in customer care applications for both feature phones and smartphones.

But, Livingstone stresses, speech is still important, and customers must have the ability to quickly and easily connect with a live agent: “When I hit a snag, when I need more help, I need to be able to get through to an agent, and then that agent will need to pick up [the entire] context from the interactions up to that point.”

Nuance sees the rise of multimodality and multichannel communications as a boon for speech, Livingstone adds, predicting speech will remain a central mode of input—just not the only mode.

“There is a vision we have for a greater interaction between automation and the agent that’s made possible by this kind of multimodal interaction,” Livingstone says. “The demand for more usable and compelling applications is simply growing, and speech is a critical part of that equation.”

As the speech industry prepares for its multimodal future, it would be useful for vendors and enterprises—particularly those in the U.S.—to study the European market and learn from its example, successes, and failures.

According to Scholz, a fluid use of multimodality requires high bandwidth. To quickly and efficiently push comprehensive video information down to users’ handheld devices, and to collect speech and multichannel input from devices and push it up to the server that’s processing it, you need bandwidth—and lots of it.

“Europe has traditionally been ahead of us in the implementation of networks with good bandwidth,” Scholz says. “We [in the U.S.] are struggling with what we call 3G right now, while much of the rest of the world is already happily doing 4G.”

This, Scholz says, makes all of the difference and allows mobile devices to be used to complete important tasks. He predicts that as 4G becomes more available in the U.S., the use of multimodal applications will increase and people will begin to take them for granted. He also predicts the rise of a new type of user interface, and with it a new acronym: multimodal user interfaces or MUIs.

“MUI is a target term right now,” he says. “It hasn’t really become fully accepted by the community, but it is being used more and more.… This is indicative of what is to come: that there will be a multimodal user interface sub-area within the profession that will focus on all [these] issues.

“We’re going to see a hockey-stick curve increasing the capability of smartphones and what they are able to do as MUI design increases and bandwidth to the palm of your hand increases.”

Larson also sees differences between the U.S. and the more advanced European markets—differences that affect vendors, enterprises, and end users—but expects the situation will resolve itself as the U.S. builds its communications infrastructure.

“I think we should look toward Europe as a way to do things, learn from their experiences, figure out what worked right, figure out what worked wrong, and not make their mistakes, and leverage their successes,” he says. “We’re in a good position now.”

Looking toward the future, Larson says more functions will be placed on mobile devices. “[The cell phone] will become a sort of Swiss Army knife for everything under the sun,” he says, noting that to some degree this has already happened with people using mobile phones as alarm clocks and cameras.

As a result of this trend, Larson predicts that smartphones—with their limited screen size—will become more complex to use, resulting in a new kind of interface that includes, but is not limited to, speech.

Additionally, Larson imagines a future in which the form factor of mobile devices will change dramatically—even resembling the stuff of science fiction—with component parts embedded into jewelry, belts, etc.

“Hearing aids have Bluetooth enabled so you can pipe output from your computer into your hearing aid,” he says. “Perhaps there will be glasses that will have screens embedded in the lenses so you can see things as well as hear them.”

As with any new endeavor, Larson says, mistakes will be made and products will fail. However, this is natural when people attempt to harness new technologies to solve new problems.

“We can anticipate that there will be some failures, but there will also be some thriving successes,” he says. “With energy and creativity, people will find really interesting ways of using things that we can’t imagine right now. There are applications out there that we can’t even think of now that will suddenly become apparent.”
