Waiting by the Phone

It’s a familiar scenario: A young couple drives down a deserted Texas road in the middle of the night. The moon is full, a wolf howls off in the distance, a dark, dilapidated house ruptures the horizon ahead, and then the inevitable happens—the young couple’s car gets a flat tire. Pulling over to the side of the road, they exit the car and scratch their respective heads, unsure of how to change the flat. At this point, our young couple has two options:
1) Go to the aforementioned dilapidated house for help and risk being massacred by a chainsaw-wielding madman, or
2) Use a 3G mobile phone to call into their motor club’s interactive voice and video response (IVVR) system, proceed through a number of voice prompts, watch a video that shows—right on their handset’s display screen—how to change a flat, and continue on their way, safe, sound, and in one piece.

Clearly, the second option is preferable. However, since our young lovers find themselves in the United States—and not, say, Europe or Asia—they will be forced to seek assistance elsewhere. Option two just isn’t a reality in the U.S.

"We don’t have any IVVR deployments in the U.S.," says Daniel Hong, senior analyst at research firm Datamonitor. "Quite simply for us, we don’t have IVVR. There’s definitely nothing in the U.S. that I know of that has gone live."

According to Hong—who cites the potential benefits of IVVR as improved customer experience, reduced costs, and revenue generation—such voice and video systems require a robust infrastructure, a standardized 3G wireless network, and the proliferation of 3G mobile devices and applications, all of which are, for now, sorely lacking in the U.S.

"Right now in the U.S. market, it’s still too early," he says.

Hong’s sentiments are echoed by Bill Scholz, founder of the consulting firm NewSpeech Solutions and president of the Applied Voice Input/Output Society (AVIOS). "There has been a significant problem launching any of this technology in the U.S. because of the fact that we don’t have a single, standardized 3G wireless network," he says.

"The 3G capabilities in the rest of the world are well-standardized and even compatible. There’s no way, no convenient way, that a vendor could sit down and prepare and deploy a set of applications that would be universally accessible over these networks in the United States. And that is a huge deterrent to trying to move forward."

Advanced Overseas
Both Hong and Scholz agree that in Europe and Asia, IVVR technology is more advanced because carriers there have adopted a standardized 3G network—which would obviously benefit our young couple. More important for the speech technology industry, it has already benefited the work of Umberto Basso, founder and CEO of Italy’s H-Care.

H-Care develops multimodal, multichannel, self-service, and customer care platforms. Most recently, the company deployed its Human Digital Assistant (HDA), an advanced IVVR solution, for Fiat Group Automobiles (FGA).

"What we are doing at H-Care is working on a platform which enables multichannel, self-service capability through the Web, mobile video calls, and multimedia messages," Basso says. "Basically what we enable is to have a real-time, very high-quality face rendering based on a 3D model which would represent the brand’s customer care rep, and this…[creates] on-the-fly communication [that is] very personalized for every customer."

Using H-Care’s IVVR product, Fiat customers can access an online car configuration program to design a personalized car and book a test drive. The system will place an outbound call to remind them of the scheduled date, and can even call again after the test drive to deliver a customer satisfaction survey.

The system also has an inbound telephone component: Users can call into the IVVR, navigate through menus with voice commands, and watch streaming videos.

"The component used in [HDA] is called a ‘Face Engine,’ which allows for real-time, high-quality video creation, and it delivers a streaming version of the video useable for Web applications or a streaming version of the video useable for IVVR," Basso says. "We bring to life a virtual character. Everything is done on the server. This Face Engine is capable of managing thousands of concurrent, different video renderings to serve a large customer base."

Basso says that behind the Face Engine is the "Brain Server," which is used both on the Web and in the IVVR or standard IVR channels to build the logic behind the face so that all of the dialogues, paths, and prompts are generated dynamically.

For Fiat, the HDA is a pixie-haired avatar named "Chiara," who has a fine bone structure and pleasant demeanor; she politely phones users with reminders and assists them through menu options via their mobile devices.

"H-Care is a leading company which supplies new multichannel solutions," says Mauro Veglia, senior vice president of customer services at Fiat. "FGA selected H-Care for piloting the HDA experience because of its technical excellence in graphical rendering solutions. H-Care’s HDA platform is supporting a 3D server-side rendering suitable for streaming over Web, kiosk, and mobile channels, and backed with a single point authoring tool for process design and content management."

According to Veglia, the Fiat HDA interacts with almost 160,000 unique visitors each month.

"[Fiat is] embracing the technology on both the Web and on the video call. Most of their customers are mobile customers—they’re in the car—so they have to reach customers while they are on the go," Basso says. "So you want to know how to repair a broken tire and, yes, of course, you can have a standard prerecorded video delivered to your phone."

Could Take Years
And while H-Care advances its solution in Europe (and recently added text-to-speech and speech recognition technologies from fellow Italian firm Loquendo to its platform), here in the U.S., we may have to wait years for true IVVR solutions to become available, according to Hong.

"I think it will take another three to five years," he says. "Our [mobile] devices just can’t do it right now."

The only vendor coming close to taking advantage of some IVVR technology in the U.S. is CosmoCom. According to Steve Kowarsky, executive vice president of CosmoCom, the company has one live deployment in the U.S. Well, sort of.

"There’s one live application, and it’s thriving," he says. "It’s a rather special application. That application is in conjunction with a service called video relay service for the hearing impaired."

Kowarsky says that the Americans with Disabilities Act entitles the hearing impaired to telephone service. In the past, small keyboards connected via acoustic coupler modems to phones allowed users to place a telephone call through a relay service, type a message, and have it spoken out the other end. The recipient of the call would then speak a message that would be typed and relayed to the caller.

"The new generation of relay is video relay," Kowarsky says. "The hearing impaired person signs to the person in the middle, and the person in the middle speaks what is signed and signs back what is said. It is much faster and much more comfortable."

So is this an IVVR? Kowarsky admits the video relay service provided for Communication Services for the Deaf is less of an IVVR and more of an IVR system, in which "the single V stands for video and not voice."

Please Invest
So with no pure IVVR deployments in the U.S., the obvious question becomes whether enough is being done to further deployment. Companies like Convergys, Avaya, CosmoCom, Nuance Communications, VoiceObjects, Genesys Telecommunications Laboratories, Envox Worldwide, Intervoice, and many others are all working with voice and video toward that end. But as one might expect, the progress reviews are mixed.

"The IVR ports being shipped have a lot of video functionality, but there’s an additional investment that has to be made on top of the IVR platform, like purchasing a video server," Hong says. "And, not to mention, you have to create the applications."

Most current deployments are pretty simple video applications, Hong adds. "A lot of the people deploying video don’t know what the hell to do," he states. "They just want to try it out. They think that they can potentially generate revenue later on or improve their user interface. Having that video component can improve the customer service, but, as of right now, just in terms of the design of what should be there, we’re really at the beginning."

Scholz also sees limited progress to date, but remains somewhat more optimistic. "[Apple CEO and cofounder] Steve Jobs is doing everything he can to convince the world that if everyone would just quietly standardize on the iPhone AT&T network, all the problems of the universe would be solved," he says. "I’m not sure I fully embrace Steve Jobs’ Holy Grail. But, nonetheless, the iPhone and the attempt to push the AT&T 3G technology is certainly one way to start to move this whole area forward."

Scholz sees this as a way to get mobile devices that are compatible and capable of executing the same applications into the hands of millions of people. But he also sees a problem.

"The technology that can best support a video application is a technology that lets the video output be displayed—originating from a server and being displayed on your handheld device—and lets you talk back to it," he says. "That means we have a true session in place."

Given the existing standards and protocols that support a true session, the leader of the pack is Session Initiation Protocol (SIP), according to Scholz.

"At this point then, there are a number of Voice over IP clients that utilize SIP and can maintain a session as I just described it," he says. "The problem is, though, that not all wireless vendors will permit SIP to be used on their particular brand of a 3G network. Very specifically, AT&T appears to feel that if they let people use SIP and Voice over IP on the AT&T 3G network, then people will start making free Voice over IP calls rather than buying the digital minutes from the vendor."

Other vendors don’t have those same barriers, Scholz says.

"For example, with Sprint Nextel there are devices that you can run video-enabled SIP calls over the EVDO network—the 3G network," he says. "But we get back to the same complaint: That’s not universal; that’s just the small subset of the American wireless public that happens to use Sprint Nextel."

Vendor Neutrality
Scholz thinks the closest we’ll get to any kind of universality in the U.S. will come when "we will start seeing applications that become accessible across multiple vendors’ handsets almost invisibly." He adds, "So that will certainly provide some of the benefits of true universality of standards as they have in Europe and Asia, but through a somewhat different technique."

Additionally, Scholz looks to the forthcoming version 3 of VoiceXML as a step in the right direction.

"That will help guide the platform vendors into creating compliant VoiceXML interpreters, and that will further facilitate the production of video," he says. "They’re going to add a new tag. Right now video is produced through VoiceXML by abusing the audio tag. They will have their own unique media tag that will handle it with a good deal more versatility in V3. So that’s coming very shortly."
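To make Scholz’s point about the audio tag concrete, here is a minimal sketch in Python that builds a VoiceXML 2.1 page whose prompt points the audio element at a video clip. It is an illustration under stated assumptions, not any vendor’s actual application: the function name, form id, and example.com URL are hypothetical, and whether a given platform plays video from an audio element depends on that platform.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.x namespace


def build_video_prompt(video_url: str, fallback_text: str) -> str:
    """Return a VoiceXML 2.1 document that references a video clip via <audio>.

    Hypothetical helper for illustration only. Platforms that support video
    this way treat the <audio> source as a media file; others would fall back
    to the element's text content.
    """
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "flat_tire_help"})  # assumed id
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # The "abuse": an audio element whose src is a video file.
    audio = ET.SubElement(prompt, "audio", {"src": video_url})
    # Text inside <audio> is the alternate content if the media cannot be played.
    audio.text = fallback_text
    return ET.tostring(vxml, encoding="unicode")


# Hypothetical URL; a real deployment would point at its own media server.
doc = build_video_prompt(
    "http://example.com/media/change_tire.3gp",
    "Video is not available on this call.",
)
print(doc)
```

The sketch shows why a dedicated media tag is attractive: the page carries no explicit signal that the clip is video, so the interpreter has to infer it from the file itself.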

Deborah Dahl, principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s (W3C) Multimodal Interaction Working Group, also sees progress.

"The standards are getting more mature," she says. "The basics are in place. There are a few applications here and there, but I think there needs to be more imagination about what the potential is for this kind of technology."

Dahl also points to the Multimodal Interaction (MMI) architecture recently drafted by the W3C as a positive move. "The MMI architecture is going to be able to easily support combining voice and video applications," she says.

Current IVVR applications lack the ability to interact with the screen itself, Dahl maintains. "But with a true multimodal application you can also touch the screen, move things around, and interactively zoom and shrink things by voice and by touching," she says. "So I’d say probably that the true multimodal application might be more in the next wave. You can do so much without it, but it’s really unlimited once you have that multimodal interaction."

Still, despite their technological limitations, excitement remains high in the U.S. for IVVR technologies. Among those most excited about IVVR and its potential uses is James Larson, an independent speech applications consultant, VoiceXML trainer, and cochair of the W3C’s Voice Browser Working Group.

"I can see [IVVR] helping people to replace a faucet in the sink, to replace a battery in their car, to check the air pressure on their car, and to do all kinds of home repair things," Larson says.

Better with Video
"I think almost all IVR applications can be improved by the use of video, especially those that want to sell or upsell," Larson says. "Then you can display images of what the device or the gadget is that you’re looking at and show how it works. You can apply all of the techniques that advertisers have used in commercials to IVR systems."

Larson also sees opportunities for IVVR to be used as an educational and marketing tool. "This kind of information on demand with video is going to be an important part of how people in developing countries can learn how to do things much faster and better than a traditional educational system. I think it’s going to help a major part of the world. It’s going to change not only education, not only entertainment, but I can see it being a truly economic force in the world to change how we live. I think this is going to be on a par with the printing press or with the telephone. I think it’s going to be a big deal."

But he is still keenly aware of the technology’s current limitations as well. "None of this is prevalent because it’s not really available [here]," Larson says.

And Hong doesn’t see that changing in the immediate future. He stands by his original assessment—which is, of course, bad news for our young couple from Texas and their flat tire. "I’d say in about three years we’re going to start seeing some live deployments, but it will still be a very low number."

Editor's Note: To see the Human Digital Assistant created for Fiat by H-Care in action, click here.
