
The User Becomes a Standard Bearer


As one of its core principles, the GetHuman movement suggests that those who design and build speech applications need to regularly survey users of their systems and then "use this data to trend improvements over time...and to benchmark against the industry." Certainly, when it comes to designing speech applications, benchmarking and a commonly accepted set of practices are sorely missing. "I’d be shocked to find two firms with the same process," says Juan Gilbert, an associate professor of computer science and software engineering at Auburn University in Alabama.

It’s not like the industry hasn’t come up with benchmarks in the past. In fact, voice user interface (VUI) designers and design firms have gathered a number of best practices and created templates, but they have been understandably unwilling to share them with the rest of the industry. "A VUI is proprietary intellectual property. There certainly are templates for designing certain types of dialogues, developed by companies with hundreds of applications that have already gone through rigorous testing and deployment efforts, but they’re all siloed," says Christoph Mosing, vice president of professional services at Envox Worldwide. "Nuance keeps its template to itself and will not share it with Envox, and vice versa.

"The design aspects of a speech project are very expensive, and companies leverage their experience on previous designs to win new clients. When you’ve done 10 to 15 large speech applications, you gather a certain amount of knowledge that gives you a competitive advantage, and it’s not something you’re going to share."

Other designers agree and don’t see things changing anytime soon. Having a strong process based on years of experience and lessons learned "is one of the ways as a designer that I get clients. It’s a selling point for me," notes Susan Hura, founder of SpeechUsability and a member of the board of the Applied Voice Input/Output Society (AVIOS).

So how, then, do you determine whether an application is good if you can’t expect your colleagues in the industry to hand you the yardstick? While there are no universally agreed-upon standards, templates, or formulas that lay out a set application design methodology, the one concept that comes as close to a standard as the design field might see is user-centric design. Some would go so far as to call it the industry’s one de facto standard. Others just call it intuitive.

User-centricity is "a design philosophy you should have for making anything, whether you’re making a speech app or an electric can opener," says VoiceXML expert Ken Rehor, founder of the VoiceXML Forum. Hura agrees. "In business, with any decision you make, you have to ask yourself how this is going to affect the end user," she says.

Erin Smith, senior VUI designer at Convergys, defines user-centric design as a process that first involves "taking a step back and putting yourself in the users’ shoes," and then "stepping actual customers through the system to determine where it’s working well and where they are struggling."

"I like to get the end users involved before I write a single prompt," Hura adds. "Even something as simple as writing a sample question and sitting down with customers can be extremely valuable in terms of setting the direction for the application."

This kind of end-user input should be gathered not only during the creation of the application, but again when it’s rolled out and every time after that when any kind of change is made to the system, she and other design veterans suggest. "A user-centric design is not limited to the design phase," says Lizanne Kaiser, senior principal consultant for voice services at Genesys Telecommunications Laboratories. "It needs to take place at iterative stages throughout the whole process."

The process takes several steps, according to Kaiser. From the beginning, "you want to understand who the customer is and create a profile of the prototypical customer—what is she motivated by, what are her pet peeves, and what is she looking to get from the system?" she says. "You then want to create use-case scenarios written from the customer perspective, and once you get a preliminary design going, you want to role-play through the design based on those profiles."

Before fielding the actual application, Kaiser recommends that the designer make a prototype of the application, conduct a usability study with it among customers, and then do a limited rollout with real customers. "You’ll find that customers act differently with real money than they do when they’re in a controlled testing environment," she says.

"You can bring users in, let them use it, and see if it meets their goals," Hura adds. "What’s so beautiful about user-centric design is that it gives you a good way to measure an application. If people can use it, then it works and it’s a good design."

Learning from the Past
It was those kinds of activities that eventually led the industry to conclude that in converting a touchtone-based system to a speech-enabled one, it’s never a good idea to write speech applications on top of existing dual-tone multifrequency (DTMF) dialogues. That, Smith says, was one of the biggest design methodology flaws of the past. "Speechifying a simple DTMF is not beneficial. It adds no value and doesn’t enhance the system," she explains. "The company liked the system, wanted to make it sexy with speech, and figured you could just add it. Users just got upset because it took longer to get through the system."

Today’s designers are aware that what customers can do with DTMF is not necessarily what they want in a speech application, Kaiser adds. "Speech reacts very differently than DTMF, and people react very differently to them as well," Envox’s Mosing maintains.
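
To make the contrast concrete, here is a minimal VoiceXML 2.0 sketch of the redesign these designers describe: rather than reading the old keypress tree aloud, a single open prompt backed by an inline phrase grammar collapses several DTMF menu levels into one spoken turn. The prompt wording, the menu choices, and the route.jsp target are illustrative placeholders, not taken from any deployed system.

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="main_menu">
        <field name="topic">
          <!-- One open prompt replaces "press 1 for billing, press 2 for..." -->
          <prompt>Briefly, what are you calling about today?</prompt>
          <!-- Inline SRGS grammar listing the phrases callers may say -->
          <grammar mode="voice" version="1.0" root="topic"
                   xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar">
            <rule id="topic">
              <one-of>
                <item>billing</item>
                <item>technical support</item>
                <item>new service</item>
              </one-of>
            </rule>
          </grammar>
          <filled>
            <!-- Hand the recognized topic to server-side routing logic -->
            <submit next="route.jsp" namelist="topic"/>
          </filled>
        </field>
      </form>
    </vxml>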

User-centric design was also the basis for discovering that designers couldn’t just rely on a straight menu system for all interactions. "Having a system pose a question, the user input an answer, and then pose another question, that’s a mistake," Smith adds. "Callers are looking for a more natural conversation flow with an application."
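
VoiceXML 2.0’s form interpretation algorithm supports exactly this kind of flow. In the hedged sketch below (travel.grxml stands in for a form-level grammar), a caller who volunteers "a flight to Boston on Friday" fills both fields in a single turn, while a caller who offers less is walked through directed prompts only for whatever is still missing.

    <form id="book_trip">
      <!-- A form-level grammar lets one utterance fill several fields -->
      <grammar src="travel.grxml" type="application/srgs+xml"/>
      <initial name="start">
        <prompt>How can I help you with your trip?</prompt>
      </initial>
      <!-- Directed fallback prompts, used only for unfilled slots -->
      <field name="city">
        <prompt>Which city are you traveling to?</prompt>
      </field>
      <field name="day">
        <prompt>And on what day?</prompt>
      </field>
      <filled mode="all" namelist="city day">
        <prompt>Booking your trip to <value expr="city"/>
          on <value expr="day"/>.</prompt>
      </filled>
    </form>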

It also helped uncover the fact that an application with very basic prompts doesn’t work either. "There was no art to it. It was boring," Smith continues. "It was just repeating the same prompt when the system did not get what the user said, and that was no help to [the end user]."
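
In VoiceXML terms, the remedy is tapered error handling: the count attribute on catch handlers escalates the help given on each successive failure instead of recycling one prompt. A minimal sketch, with account.grxml and the agent_transfer dialogue as placeholders:

    <field name="account">
      <grammar src="account.grxml" type="application/srgs+xml"/>
      <prompt>What is your account number?</prompt>
      <!-- First miss: a quick, polite retry -->
      <nomatch count="1">Sorry, I didn't catch that.
        Please say your account number.</nomatch>
      <!-- Second miss: add detail about what the system expects -->
      <nomatch count="2">Let's try once more. Your account number is the
        ten-digit number at the top of your statement.
        Please say it or key it in.</nomatch>
      <!-- Third miss: stop reprompting and get the caller to a person -->
      <nomatch count="3"><goto next="#agent_transfer"/></nomatch>
    </field>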

Before long, user-centricity prompted designers to move away from creating applications word for word based on what clients told them to do. "We now know that it’s not for the business to come to us as VUI designers and say, ‘Here’s what I want you to follow,’" Hura says.

"There’s always a big trade-off between what the business wants, what’s do-able in a speech application, and what the customer wants," Mosing adds. "What the client wants and what should be done are not always the same. It might be in sync with what the marketing department wants, what sales wants, what call center agents want, but that’s not necessarily what the customer wants."

Companies typically look only at the bottom line, and do not really consider the user and how he will use the system, Smith notes.

As an example, Genesys’ Kaiser recalls one design in which the business’ sole concern was keeping users contained within the system. "To keep up with its finances, the business needed to keep customers in the system. They locked customers in and forced them to make a self-service choice," she says. "Because the business wants it that way, and we design it that way, doesn’t mean the customers will use it. We found that customers are very smart in figuring ways around it. They were zeroing out, hanging up, or intentionally putting in an error to get to a live agent.

"Now, with much better user data and awareness, we know we can’t just contain them. They work around it, and it just ends up costing [the business] more."

The problem for the designer really boils down to a question of loyalty. The company, and not the end user, is the one ultimately paying the bill. In that environment, it’s next to impossible to strictly adopt a user-centric methodology. "When you design, you have to be really clear about the business’ objectives from the beginning," Mosing says. "Then you have to balance the business requirements with the callers’ expectations."

But finding that happy medium between what the business wants and what the user expects is no easy task, and that, more than anything else, is what experts believe causes most designs to fail. It’s also another reason for the lack of a prescribed set of guidelines for a design specification.

"In the end, nothing can be packaged because there’s a lot of customization that gets done based on business and customer goals," Mosing states.

"There really can’t be one VUI design practice across the board," Kaiser agrees. "Each one has to be customer-specific. I could see things evolve around best practices, and having a shared industry-wide understanding would be a good thing, but you can’t have a uniform standard because each individual company’s individual customer interactions are so unique."

All the experts agree that measuring how a design meets business and user goals depends on the business, and therefore, it’s always going to be an individual thing. "There are so many things that are dependent on an application’s goals and target audience that you’re not able to have a formula," Rehor contends.

VoiceXML’s Role
Adding to the design complexity is the abundance of speech application vendors. Creating a design that works for one vendor’s platform might not be the best thing if the business decides to go with a different vendor. Add to that the fact that different vendors have different strengths when it comes to their applications. Take, for example, the many languages that companies need to speak in today’s global economy.

"Companies can leverage several different [speech recognition engines] for different languages," Envox’s Mosing explains. "One vendor may have a stronger product for one language and not for another. As our customers become smarter about speech components and the cost of speech applications, they want to be less dependent on one vendor for all their speech applications. The real value of speech is on the application side. As the number of players grows, people want to use more options."

This multivendor environment is forcing speech application designers to abandon vendor-specific, proprietary application-building kits, tools, platforms, and interfaces in favor of applications built to the VoiceXML specifications. With the introduction of the Media Resource Control Protocol (MRCP) in April 2006, speech solution designers now have a choice in how they integrate different speech engines into their applications: they can write the application in VoiceXML and rely on MRCP to plug the engines into the platform, or they can write directly to a vendor’s specific application programming interface (API).

MRCP controls media resources, such as speech synthesizers, speech recognition engines, and voice biometrics servers, over a network. In essence, MRCP gives developers a common language for managing these diverse resources, and the servers needed to run them. The application gets written in VoiceXML, and MRCP is the layer that plugs the engines in.

"MRCP is the middleware that enables the design to be independent of the ASR technology that will be selected eventually, so you do not have to write for one specific ASR brand," Mosing explains. "It’s especially a good idea for very large deployments because you can add as many ASRs as you need."

In the past, the purchase of a VoiceXML platform locked the buyer into a single speech engine, or a small handful of them. Platform vendors then required custom integrations with each supported speech engine, and those integrations often came at a heavy price.

No Longer Proprietary
Today, "it’s a bad idea to adopt anyone’s proprietary API, just because it’s no longer necessary," Rehor says. At least that’s been the case since the creation of VoiceXML. "The whole thing now is to design with portability," he says. "VoiceXML takes advantage of all the platforms out there. MRCP affords the customer the ability to switch out one vendor for another. You can also configure an application for several different engines."

"I can see where it would be a benefit," Convergys’s Smith maintains. "It gives the customer flexibility in that if they want to move to another platform, they can do it."

But, as with anything else in the speech world, there are some caveats to implementing a speech application using MRCP. According to speech application designers at LumenVox, drawbacks include a lack of backward compatibility, control over core engine features that is limited to what the MRCP definition exposes, and subtle differences in each vendor’s implementation of the standard.
At LumenVox, officials have identified the following benefits of using MRCP:

  • Little transition time implementing multiple core speech technologies;
  • The engine client does not need to directly link to the vendor’s DLLs or shared libraries; and
  • Any communication between client and server is done strictly over the network, thereby removing any limits on operating systems or hardware on which the client can be implemented.

Conversely, the negatives associated with MRCP include:

  • A generally more time-consuming implementation due to compliance issues;
  • A lack of backward compatibility;
  • If done from scratch, enabling the client to handle the details of network communication with the servers can be a large project; and
  • Some vendors might add custom options to the MRCP specification, thereby limiting the advantages MRCP has as a vendor-agnostic choice.

Writing directly to the API also has its pros and cons.
The pros, according to LumenVox, include:

  • Backward compatibility;
  • Greater functionality and control over grammars;
  • Vendors can better accommodate specific requests for changes to how an engine works;
  • A shared workload between client and server eliminates the need to use network resources; and
  • The vendor can allow a group of servers to be treated as a single resource center, providing for greater load balancing and automatic fail-over.

On the negative side, writing to the API often means:

  • A complete rewrite of the interface every time it is integrated with a new speech engine;
  • Clients are limited to operating systems and hardware supported by the vendor; and
  • An increase in the memory footprint of the client-side application.

"Choosing an integration path for your speech application project boils down to what is right for your organization," LumenVox advises on its Web site. "If you’re looking for more flexibility to access a mix and match of speech technologies, then using MRCP may be the best choice for you. If you need greater functionality and an easier integration process, then writing to the API might be the best solution."

And while VoiceXML and MRCP are quickly becoming standards in the industry, other design standards are probably years away. "VoiceXML took years to get done, and the only reason it happened was because large corporate buyers pushed for it and demanded that the industry come up with it," Mosing says.

But, just because standards are hard to identify doesn’t mean designers wouldn’t welcome them. "There have been a number of instances where people came out with a few best practices," Rehor says. "Taking it to the next step and defining standards and best practices is a good goal for the industry."

So who’s going to do it? "The majority of people working on these things come from the engineering side. We need someone from academia with a background in psychology to do this kind of research," Hura offers. "This is the kind of work that gets done in universities. It’s not the kind of thing that businesses develop for themselves."



Is a Standard on the Horizon?

The closest thing to a standard right now for voice application design is something the VoiceXML Forum first introduced at the SpeechTEK conference in New York this past August. Called SLAML (short for the Session Log Annotation Markup Language), it describes a methodology for collecting, storing, and retrieving runtime data for speech-based services and applications. It includes specifications for logging application servers, automatic speech recognition, and VoiceXML browsers. The SLAML specification is still in its early phases, as the VoiceXML Forum is just now starting to collect feedback from the industry.

"When you’re developing a speech system, it’s important to tune the application and all the engines," says Ken Rehor, one of the original authors of VoiceXML and founder of the VoiceXML Forum. "Often, there are different software components from different vendors running different systems,and you need something to pull them together to test the applications."

"Data generated by a speech-based application during runtime can provide valuable visibility into the performance of the application and the user interaction," explains David Thomson, chairman of the VoiceXML Forum’s Tools Committee. "The SLAML specification will enable service providers to mix and match development tools, application servers, and VoiceXML browsers while maintaining a consistent data-logging format." Industrywide adoption of the standard, he says, "will make field data easier to analyze and use, improving speech system performance and usability."

The VoiceXML Forum is also putting together a special committee to draft a set of metrics for application performance, both from an execution standpoint and for measuring the quality of the user experience. Without such a yardstick, "when you make a change in the application, how do you measure the effect that change will have on the whole application?" Rehor asks.

The lack of an industrywide standard to test applications has slowed adoption of speech technologies across the board, according to Bruce Sherman, product manager at Syntellect. "As the entire industry develops technology, not a single application has been entirely tested from end to end," he says. "If you look at the applications and all the states and pathways they can go through, less than 25 percent of those have been tested all the way through."

That has created problems for the entire industry, Sherman continues. "Customers have been disappointed because the deployment cycle took longer than they thought, the application cost more than they expected, and the application did not perform right," he says. "The organizations’ momentum [toward speech] got lost as a result."
