APPLICATION DEVELOPMENT: Speech Comes to the Call Center

Deployment of robust voice user interface (VUI) applications is now possible using the current generation of automatic speech recognition and natural language technology. And just in time! The traditional touch-tone (DTMF) 0-9 interface on interactive voice response (IVR) systems is being pushed to the limit of customer acceptance.

A recent study of more than 400 IVR systems from eight different industries was conducted by Enterprise Integration Group (EIG) of San Ramon, California. "The facts show that users generally have good reason to hate automated touch tone systems. The optimum number of menu options is three, but some systems offered as many as 27. We observed many opportunities for improvement, especially with regard to user friendliness and the ease of use," said Rex Stringham, president of EIG.

The next generation of inbound customer service applications is upon us. Callers can now use their own voice to "talk" to an automated service system. These applications shorten calls, provide self-service set-up, and can even handle an entire call, producing a compelling return on investment. Hard-dollar justification for the technology is evident in dramatic increases in the yield factor for customer service representatives. Further, these integrated technologies open a whole new world of service applications that could not be considered with the traditional touch-tone interface.

Traditional IVR applications can be time-consuming and are often frustrating to callers. These applications are system-led rather than caller-led. Human speech, on the other hand, is natural and information-rich. The new generation of IVR applications will combine continuous speech recognition (for telephony, this means ASR engines from Lernout & Hauspie, Nuance Communications, AT&T, PureSpeech, and Lucent) and speech interpretation.

These spoken language understanding applications will allow callers to exchange considerable amounts of information in a manner similar to communicating with another human being. Callers will not be restricted to simple words and numeric responses: they can actually ask questions, phrase the same response in varying ways, provide multiple variables in one request, and solicit help at any point during a dialog. For example, consider an inbound application for obtaining a mortgage payment quotation.

System: "Welcome to our Mortgage Assistant hotline. Do you want a payment quote for purchasing a new home, a quote for refinancing your current home, or do you want to hear about our current interest rates?"

Caller 1: "I wanted to talk about FHA loans."

System: "This system only handles conventional mortgages; I am transferring you to our loan desk."

Caller 2: "Uh, yeah, I am buying a new house. Can you give me a quote?"

System: "Payment quote for a new home. How much money do you want to borrow?"

Caller 3: "I am purchasing a new home. I want a fixed rate loan for $200,000."

System: "Thank you." (Recognizing that three variables have been received from the caller, the application skips the prompts asking for the amount and whether the caller wants an adjustable or fixed rate loan. This capability is called dynamic dialog re-configuration.) "$200,000 fixed rate new home loan. Do you want the term of your loan to be 10, 20, or 30 years?"

Caller 3: "I want the longest term available."

System: "30-year term. How many points are you willing to pay: 0, 1 1/2, or 2 points?"

Caller 3: "I don't know."

System: "A point is an up-front interest charge on your mortgage. Each point is equal to one percent of the amount you want to borrow. Paying fewer points usually means you will pay a higher interest rate. Are you willing to pay 0, 1 1/2, or 2 points?"

Caller 3: "As few as possible then."

System: "Okay, zero points. I have a $200,000 loan for thirty years, paying zero points. The interest rate will be 8.25% and the APR is 8.5%. Your monthly payment for principal and interest would be approximately $1,345."

"What would you like to do next? Would you like to see if you can qualify for this monthly payment?"

Caller 3: "No, please transfer me to a loan officer."

System: "Certainly, who is calling please?"

Caller 3: "This is Nat Lang."

System: "Thank you. I am transferring you to loan officer John Smith. Good luck on your new home." (The system now transfers all the data collected on this call via a CTI screen pop to John Smith.)

Loan officer: "Hello, Mr. Lang. I see you're looking for a 30-year, fixed rate mortgage in the $200,000 range, and you want to pay zero points. Do you want to proceed with an application or..."
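The "dynamic dialog re-configuration" shown in the exchange above can be sketched as a slot-filling loop: the dialog manager tracks which variables the caller has already supplied and prompts only for the ones still missing. The slot names and prompts below are illustrative, not from any vendor's tool kit.

```python
# Hypothetical sketch of dynamic dialog re-configuration: prompt only
# for the slots the caller has not already filled.

SLOTS = ["purpose", "rate_type", "amount", "term_years", "points"]

PROMPTS = {
    "purpose": "New purchase or refinance?",
    "rate_type": "Fixed or adjustable rate?",
    "amount": "How much money do you want to borrow?",
    "term_years": "10, 20, or 30 years?",
    "points": "0, 1 1/2, or 2 points?",
}

def next_prompt(filled):
    """Return the prompt for the first unfilled slot, or None when done."""
    for slot in SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return None

# Caller 3 supplied three variables in one utterance, so the dialog
# skips straight to asking about the loan term.
filled = {"purpose": "purchase", "rate_type": "fixed", "amount": 200_000}
print(next_prompt(filled))  # prints: 10, 20, or 30 years?
```

A system-led DTMF application would have walked through every prompt in fixed order; here the prompt order is recomputed after each caller turn.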

Technology Issues

Let's take a look at the technologies involved. Automatic speech recognition (ASR) changes spoken sounds into words (ASCII text). It does not understand what these words mean or represent and would flunk a grade school comprehension test.

Speaker-dependent recognition matches spoken words or phrases with a pre-processed speech sample of the speaker to optimize accuracy. This type of recognition is used in many current voice dictation products. It is similar to the way that parrots or dogs respond to human speech. They recognize the utterance (voice pattern) to perform a particular activity, command, or produce a word.

Speaker-independent ASR utilizing continuous speech (human interaction speed) does not have to be "trained" to an individual voice. It recognizes speech components such as vowels and consonants and matches groups of these speech components with words in a large vocabulary (2000 words or higher).

When high-end ASR is combined with a natural language understanding system, applications like the example above can be created. Depending on the content and meaning of a spoken message, these systems can carry on a dialog similar to the way humans respond to speech. These applications will soon be a major component of inbound call centers.

The text is then sent to a natural language understanding engine for interpretation. It can understand the difference between phrases like "I want my checking account balance" and "I want to balance my checking account." In the mortgage example above, a caller could ask for an adjustable rate mortgage by saying "I want a variable rate loan" and be led down the correct path.

Two Business Cases

There are two cases for discussion: one for the IVR provider and one for the call center.

For the IVR provider, the use of a natural language speech interpreter improves the accuracy of ASRs, simplifies the development effort of VUI applications, and protects their application investment.

NL improves accuracy by knowing the expected input at any turn in the dialog. For instance, in our mortgage example, imagine the caller had said he would pay "three points." Now, suppose the ASR output gave its ranked guesses (known as the N-best list) of what was said as: "tree ponds," "free points," and "three points." The NL would pick the lower-ranked "three points" and respond correctly because of the state of the application. The application was "looking" for numeric input.
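This re-ranking can be sketched in a few lines: walk the recognizer's ranked hypotheses and keep the first one that fits the input type the dialog state expects. The parsing rule below is a deliberately minimal assumption, not any vendor's actual NL engine.

```python
# Hypothetical sketch of NL re-ranking an ASR N-best list using dialog
# state: when the application expects numeric input, hypotheses that do
# not parse as a number are discarded.

NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3}

def parse_points(hypothesis):
    """Return the numeric value if the hypothesis looks like a points
    answer (e.g. 'three points'), else None."""
    words = hypothesis.lower().split()
    if len(words) == 2 and words[1] == "points" and words[0] in NUMBER_WORDS:
        return NUMBER_WORDS[words[0]]
    return None

def pick_from_n_best(n_best):
    """Return the first (hypothesis, value) pair that fits the expected
    input type, scanning the list in ranked order."""
    for hypothesis in n_best:
        value = parse_points(hypothesis)
        if value is not None:
            return hypothesis, value
    return None, None

# The ASR's ranked guesses from the example above:
n_best = ["tree ponds", "free points", "three points"]
print(pick_from_n_best(n_best))  # prints: ('three points', 3)
```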

By using NL in concert with ASR, only the grammars have to be developed on the ASR. The speech recognizer provides the words; the NL provides the interpretation. This approach offers the IVR vendor some comfort in selecting an ASR technology. Since there is no clear-cut leader in the high-end ASR market (who will win the speech recognizer wars?), developing speech applications around the ASR is just plain risky.

If a selected ASR company were to falter, either in technology or as a business entity, all of the applications created would have to be re-created with new technology. With an NL and ASR marriage, the only work required to convert to a new recognizer would be to rewrite the grammars in the correct format. The application stays intact! In other words, the natural language approach offers speech recognizer independence to the IVR vendor and its clients, making the ASR a commodity component.

The next generation of ASR for telephony will be statistically based (speaker-independent, continuous speech, using huge vocabularies at dictation speeds). With grammars extinct, determining action based on strings of words becomes a daunting task for ASR. Natural language will not only be required; its full power will be unleashed. Compound transactions and multiple-sentence analysis will become the norm.

Call Center Value

What is driving the call center toward spoken language applications? Mathematics. The telephone industry serves as an example. A pundit once predicted that the growth of the phone network would require every working person to become a long distance operator. They would be needed to handle all the call switching on the patch boards. The same was said of the demand for programmers for computers. In each case, a technology solved the problem. Switches became automated and programming a computer is being automated with point and click as well as speech tools.

With the growth of call centers, CSR costs and applications cannot be sustained by today's 0-9 telephone keypad. Now, spoken language understanding will usher in a new wave of applications.

Call Center Enterprises has evaluated the VUI in comparison to traditional IVR systems. Its study revealed that a call center with 100,000 calls per month could save about $1.2 million per year. This is based on:

  1. Increased CSR hourly yield.
  2. Lower WATS line costs (shorter calls with direct voice navigation versus DTMF menus).
  3. Handling of rotary-phone calls.
  4. Reduced pass-through rates, due to a caller-led conversation versus system-led keypad input.

The total expected time savings for the application studied was just over four minutes. This study indicates that an IVR system with a natural language and speech recognition application would pay back in 12-18 months with just one application.
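The savings figure is easy to sanity-check from the numbers above. The per-minute cost used below is an illustrative assumption of ours; the call volume and time saved come from the study cited in the article.

```python
# Back-of-the-envelope check of the savings claim above. The blended
# cost per call minute ($0.25) is an assumed figure for illustration;
# the call volume and minutes saved are from the study.

calls_per_month = 100_000
minutes_saved_per_call = 4.0   # "just over four minutes"
cost_per_minute = 0.25         # assumed blended CSR + WATS cost

annual_savings = calls_per_month * 12 * minutes_saved_per_call * cost_per_minute
print(f"${annual_savings:,.0f}")  # prints: $1,200,000
```

At that assumed rate, 4.8 million saved minutes per year lands almost exactly on the study's $1.2 million figure.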

Our group has documented new revenue-generating applications that could return over $1,000,000 per year just by handling sales calls that were stuck in queues. In another case, by speech-enabling an internal application currently handled by live agents, the savings were estimated at $1.2 million to $1.4 million over seven to 12 months.

Market Constraints

So, if these types of applications and technologies are deployable and have economic justification, what is behind the slow marketplace acceptance and deployment of telephony-based spoken language systems? There are only a few systems publicly deployed around the world. From lessons learned by our group, we now see how the marketplace can be unlocked. Enabling the IVR developer channel is the key to unlocking the flow of thousands of speech applications.

There have been many constraining factors slowing down the marketplace:

  • large vocabulary speech recognition vendors promising more than could be delivered;
  • accuracy of digit strings in large vocabulary recognizers versus the accuracy of text utterances;
  • scaling of the software for large line counts;
  • difficult development and learning curve for developers.

From our experience with many speech recognizers, the capability envelope for practical deployments has just been reached in the past 9-12 months. We are now confident that several recognizers can be used to deploy speech based applications. Accuracy can achieve 90% and above with well-crafted applications.

Scaling issues are being addressed on two fronts: first, by putting the speech recognition software on telephony DSP (digital signal processor) boards, and second, by cost reductions in processing power. Intel silicon seems to get cheaper every month, and there are systems with as many as 10 processors in one box. How many host-based recognizers could you run on such a system? (We will be testing that soon.)

The single largest roadblock is that current IVR application developers, who use various IVR tool kits, lack some of the skills needed to bring speech applications to the end user.

To deploy telephony-based automatic speech recognition, the speech recognizer must be programmed by creating context-free grammars. These grammars are usually written in a format called BNF (Backus-Naur Form). Two factors compound the creation of BNF grammars.

One is that all ASR vendors use their own unique syntax for representing grammars; there is no standard across ASRs. The syntax itself is very cryptic when compared to IVR tool kit development palettes. Project management and debugging capabilities are also very weak. Because of these complexities, most deployments today are being done by speech recognition company engineers. In fact, most automatic speech recognition revenues come from services rather than license fees for software. Where does one go to get BNF grammar training?
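To make the idea concrete, here is a toy grammar in a generic BNF-like style, represented as Python rules so it can be expanded. The rules and phrasings are invented for illustration; every vendor's real syntax differs.

```python
# A toy context-free grammar (generic BNF-like style, shown as Python
# rules). Expanding it shows how quickly the space of utterances a
# recognizer must accept grows.

GRAMMAR = {
    "<request>": [["<verb>", "<loan>"], ["<verb>", "<loan>", "please"]],
    "<verb>": [["give me"], ["i want"], ["i need"]],
    "<loan>": [["an adjustable rate mortgage"], ["an arm"],
               ["an adjustable rate loan"]],
}

def expand(symbol):
    """Recursively expand a symbol into every phrase it can produce.
    Terminals (not in GRAMMAR) expand to themselves."""
    if symbol not in GRAMMAR:
        return [symbol]
    phrases = []
    for production in GRAMMAR[symbol]:
        partials = [""]
        for part in production:
            partials = [(p + " " + e).strip()
                        for p in partials for e in expand(part)]
        phrases.extend(partials)
    return phrases

phrases = expand("<request>")
print(len(phrases))  # 3 verbs x 3 loans x (with/without "please") = 18
```

Even this three-rule grammar yields 18 distinct utterances; a production mortgage grammar multiplies far faster, which is why hand-maintaining vendor-specific grammar files is so painful.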

Another skill required for speech deployments is speech dialog design. This skill involves a combination of human factors (for soliciting valid responses or "grammars" from the caller) and artistry in the prompts to make the session sound and feel as human as possible.

Speech Interpretation

On the runtime side of speech applications, another important development factor is often overlooked: how is the interpretation of the text produced by a speech recognizer performed? To navigate a speech application call flow, the text that comes out of a speech recognizer must be interpreted. This interpretation reduces all the possible utterances of the caller to one "action token" that is returned to the IVR call flow manager to move to the next prompt or action. ASR vendors do not always provide a mechanism for text interpretation or an easy integration with the current IVR system. They have left the interpretation programming up to the developer.

Can you imagine all of the if, then, else statements a programmer would have to write to interpret the following?

Give me an adjustable rate loan.
Give me an adjustable rate mortgage.
I want an adjustable rate mortgage.
I need an adjustable rate mortgage.
I want an ARM.
What is an adjustable rate mortgage?
I need some help with that.
Now add the word "please" to all of these sentences!
etc., etc.

The development cycle for this is just too long, too custom, too tedious, and too expensive. Think of the ongoing maintenance costs.
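A hand-rolled interpreter of exactly this kind is sketched below: every phrasing of the same request collapses to a single action token for the call flow manager. The token names and matching rules are illustrative assumptions; a real NL engine derives the interpretation from the grammars instead of hard-coded string tests.

```python
# A minimal hand-coded interpreter of the kind an NL engine replaces.
# Checks are ordered: help requests and questions are detected before
# the generic ARM request. Naive substring matching, for illustration.

def interpret(utterance):
    """Map a caller utterance to one action token for the call flow."""
    text = utterance.lower()
    if "help" in text:
        return "HELP"
    if text.startswith(("what is", "what's")):
        return "EXPLAIN_ARM"
    if "adjustable rate" in text or "arm" in text or "variable rate" in text:
        return "REQUEST_ARM"
    return "UNKNOWN"

for utterance in ["Give me an adjustable rate loan.",
                  "I want an ARM, please.",
                  "What is an adjustable rate mortgage?",
                  "I need some help with that."]:
    print(utterance, "->", interpret(utterance))
```

Even this toy version shows the maintenance trap: every new phrasing, and every interaction between rules, means another branch to write and test by hand.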


IVR vendors incorporating both technologies are just beginning to appear. Leading the way are Parity Software (Sausalito, CA), International Public Access Technologies (Cincinnati, Ohio), Voice Processing Plus (Troy, MI), Periphonics Corporation (Bohemia, NY) and MediaSoft Telcom (Mont Royal, Quebec, Canada). They have integrated the NL Speech Assistant from Unisys Corporation (Blue Bell, PA) with large vocabulary ASR. New spoken language applications are being developed for government, banking, insurance and transportation by these companies and their development VARs. The first implementations will be installed during the first half of this year.

The implementation methodology also allows current DTMF applications to be extended to a VUI. This saves significant development time because the design and database interfaces remain the same. These companies will be putting a "speech wrapper" around existing applications and extending their current capabilities.

The environment provides a full suite of development tools that shields the IVR developer from learning to write ASR grammar syntax. Instead, a familiar spreadsheet presentation is used. The developer then selects which ASR syntax to generate from a pull-down menu. As a by-product of creating the grammars, a complete speech interpreter is automatically created for use during the runtime execution of the application. This module operates through an open API (application program interface) on the IVR platform. The interpreter "understands" all the text that can be delivered in the speech application. This approach makes the speech portion of the application portable across platforms and relieves the developer from writing interpretation software or maintaining it when changes or additions are made to the grammars.

What's Next

The IVR and call center solution providers are now entering speech application engagements. Developers are now seeing integrated spoken language tools delivered as a cohesive environment. (Parity Software started delivering an integrated CD on March 1.) End users can now feel confident that speech recognition can begin to deliver on its promises. Big things are happening! The wave of the future just got a lot easier to catch.

Richard Barchard is the Director for the Natural Language Group at Unisys Corporation. He received his MBA from the University of Iowa. He can be reached at richard.barchard@unisys.com or 610-648-7065.

[Figure 1]

Figure 1 depicts the technology flow: the caller's utterances are converted into a stream of text. First, the utterances are broken into digitized sound. This sound is then matched against acoustical models of speech known as phonemes. The phonemes are then matched against the words in the ASR vocabulary. If a match occurs, the message is converted to a stream of text.

Eliminating the Keypad

The DTMF interface has a number of drawbacks, including the fact that a significant number of users have rotary phones. In the U.S., 25% of phones are still rotary pulse phones. In Europe, the percentage is much higher (75%+). This poses the question: Can European call centers leapfrog DTMF and go right to speech-enabled applications? Stay tuned!

