Speech Goes to the Races

Speech recognition is at the heart of the convergence between computers and telecommunications. It allows people to interact naturally with computer systems and applications. Nowhere are these benefits more fully realized than in a call center.

And perhaps no call center serves customers more demanding of immediate information than the one in Perth managed by the West Australian Totaliser Agency Board (WA TAB). Speech technology was introduced there to provide an automated telephone information service to the public in the Land Down Under, where gambling on horse and dog races is a way of life.

Call centers present speech technology with many challenges. Instead of the one-on-one communication process evident in PC interface applications, the call center is characterized by a ‘many-to-one’ communication process, with many different callers requesting access to service. Therefore, the speech technology solution must be speaker independent.

Furthermore, callers tend to be unrestrained in the way they respond to the application, speaking without pausing between words, to put it mildly. This means continuous speech recognition and word spotting technologies are required. The technology must also be highly robust against noise and distortion. And all this has to be done in a way that accommodates the “peaky” nature of call traffic in betting services, where large numbers of callers place bets just before a horse or dog race begins.

Meeting all of these requirements was not possible until quite recently.

Speech Goes to the Dogs

The WA TAB needed to increase efficiency in its call center operations to provide more timely information and betting services for clients, while expanding its client base to service a wider geographic region in a cost-effective way.

The agency sought a solution using speech recognition technology to automate its information services, traditionally provided by call center agents. Automating this service would free more agents for the critical task of taking bets, especially in the peak times just before the start of a race.

The WA TAB had investigated the use of traditional touch-tone IVR technology for its information services function, but the complexity and non-numeric nature of the menus meant a touch-tone solution would not work.

Implementing speech recognition for deployment into call center services requires a four-stage process:

Stage 1: Dialogue Analysis - to determine the vocabulary and structure of the application
Stage 2: Trial Application and Data Capture - from which the optimized recognition solution is created
Stage 3: Modeling - the process of creating the vocabulary
Stage 4: Bench-marking and Deployment - integrating the recognition solution into the production system

Dialogue Analysis

The system should mimic the conversation carried out by a human agent. Because the WA TAB application was for a service already in existence, accessed by a large population, caller acceptance would hinge on how closely the automated service resembled the system already in place.

This first stage involves an analysis of what people say when they call the service. The analysis identifies the dialogue and structure (or flow) of the application, along with the vocabulary that the recognizer will have to support in order to satisfy the customer.

An example of a typical dialogue between caller and human agent is shown in Table 1 below.

Table 1. Caller-Agent Dialogue

Agent	Which service would you like?
Caller	Tomorrow’s scratchings
Agent	Which event would you like scratchings for?
Caller	Brisbane race five
Agent	Scratchings for Brisbane race five are: three and seven. Would you like another service?
Caller	Yes
Agent	Which event?
Caller	Sydney race seven
Agent	Scratchings for Sydney race seven are....

The analysis of the dialogue identified three separate vocabularies:

The service vocabulary:	“Yesterday’s”, “Today’s”, “Tomorrow’s”, “Scratchings” and “Results”
The event vocabulary:	“Adelaide”, “Brisbane”, “Melbourne”, “Sydney”, “Race”, “Dog”, “Trots”, the numbers “one” to “twenty”, “Extra” and “All”
The control vocabulary:	“yes”, “no”, “next” and “previous”
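
As a rough illustration of how these three vocabularies constrain what the recognizer has to handle, the sketch below stores them as Python sets and picks out in-vocabulary words from a caller's reply. It works on text that has already been transcribed, not on audio, so it only mimics the effect of word spotting. The names NUMBER_WORDS, SERVICE_VOCAB, EVENT_VOCAB, CONTROL_VOCAB and spot_words are invented for this example and are not part of the WA TAB system.

```python
# Illustrative only: vocabularies from the dialogue analysis as plain sets.
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
]

SERVICE_VOCAB = {"yesterday's", "today's", "tomorrow's", "scratchings", "results"}
EVENT_VOCAB = {
    "adelaide", "brisbane", "melbourne", "sydney",
    "race", "dog", "trots", "extra", "all",
} | set(NUMBER_WORDS)
CONTROL_VOCAB = {"yes", "no", "next", "previous"}


def spot_words(utterance, vocabulary):
    """Return the in-vocabulary words in spoken order, ignoring all others.

    This is a text-level stand-in for word spotting: filler words such as
    "um" or "please" are simply dropped.
    """
    # Normalize case and curly apostrophes so "Tomorrow’s" matches "tomorrow's".
    tokens = utterance.lower().replace("\u2019", "'").split()
    cleaned = [token.strip(".,?!") for token in tokens]
    return [token for token in cleaned if token in vocabulary]


if __name__ == "__main__":
    print(spot_words("um, Brisbane race five please", EVENT_VOCAB))
```

Keeping the vocabularies separate means each step of the dialogue only has to discriminate among the handful of words that are valid at that point.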

Trial Application and Data Capture

Once the dialogue and recognition vocabulary have been identified, the objectives in this next stage are to test the dialogue and monitor calls and user acceptance of an automated version of the service. This also provides the environment for capturing speech data from callers, from which an optimized speech recognition solution will be created.

A trial dialogue was implemented based on the analysis carried out during Stage 1. Figure 1 shows how implementation and deployment of the speech recognizer are achieved. A voice response system is connected to a dialogue manager, which implements the automated voice response component of the service under construction.

Figure 1: Speech Recognition Implementation and Deployment

The problem during this initial stage of the trial is that the vocabulary for the recognition component has yet to be created. To overcome this, a call center agent substitutes for the speech recognition function.

The agent station is set up with a tool (called SyWoZ) that is used to collect speech data from callers. It is from this data that the speech recognition solution is created. The set-up is such that the agent can listen to callers but cannot speak to them. A screen display presents the speech recognition vocabulary under construction and a set of dialogue responses that the agent can use as the situation warrants.

This set-up reinforces the callers’ belief that they are interacting with an automated service and it elicits responses which are natural for them under those conditions. Based on caller responses, the agent selects words displayed on the screen of the SyWoZ operator station. These are passed to the dialogue manager, which in turn selects the most appropriate voice response to the caller. In the event that a caller makes a mistake, or if the agent cannot understand what the caller says, the agent can select an appropriate help, or repeat, response from the dialogue menu.
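
The flow described above, where the agent's selections drive the dialogue manager, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SyWoZ implementation: the class DialogueManager, its methods, the COMMAND_PROMPTS wording and the "Looking up ..." reply are all hypothetical. Only the two prompts modeled on Table 1 and the four dialogue responses (repeat, help, too quiet, too loud) come from the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical prompt wording for the dialogue responses on the agent screen.
COMMAND_PROMPTS = {
    "help": "Please say the service you want, for example: today's results.",
    "too quiet": "Sorry, please speak a little louder.",
    "too loud": "Sorry, please speak a little more softly.",
}


@dataclass
class DialogueManager:
    """Chooses the next voice response from what the agent selects."""
    service: Optional[str] = None
    last_prompt: str = "Which service would you like?"
    # Every agent selection is logged; in a real trial this record would be
    # paired with the recorded caller audio to form the speech database.
    log: List[Tuple[str, str]] = field(default_factory=list)

    def on_words(self, words: List[str]) -> str:
        """Handle vocabulary words the agent picked for the caller's reply."""
        if not words:
            raise ValueError("the agent must select at least one word")
        phrase = " ".join(words)
        self.log.append(("words", phrase))
        if self.service is None:
            # First reply names the service; ask for the event next.
            self.service = phrase
            self.last_prompt = f"Which event would you like {words[-1]} for?"
        else:
            # Second reply names the event; placeholder for the real lookup.
            self.last_prompt = f"Looking up {self.service} for {phrase}."
            self.service = None
        return self.last_prompt

    def on_command(self, command: str) -> str:
        """Handle one of the dialogue responses (repeat, help, and so on)."""
        self.log.append(("command", command))
        if command == "repeat":
            return self.last_prompt
        return COMMAND_PROMPTS[command]


if __name__ == "__main__":
    dm = DialogueManager()
    print(dm.on_words(["tomorrow's", "scratchings"]))
    print(dm.on_command("repeat"))
    print(dm.on_words(["brisbane", "race", "five"]))
```

Because the dialogue manager only ever sees selected words, a trained recognizer can later take the agent's place without any change to the dialogue logic.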

An example of the SyWoZ speech capture tool used in the WA TAB application is shown in Figure 2. The available selection for the events vocabulary, as identified during the dialogue analysis, is shown. Fields also display the data entered by the agent so that the agent can check the transcription. The Dialogue field shows four possible responses the agent can issue to the caller: “Repeat”, “Help”, “Too Quiet” and “Too Loud”. Hot keys are used extensively to enable the agent to transcribe data quickly in response to a caller.

Figure 2. SyWoZ agent screen used during data capture phase

For the caller, the entire system is perceived as a purely automated speech recognition system. When speaking to automated services, callers significantly modify their behavior and speaking characteristics. Stage 2 ensures that the speech data captured by the process reflects the way callers naturally interact with the target automated service. What is more, the resultant speech database contains the words specific to the application, as spoken in the target telephone environment and in the noise environment specific to the application. This forms the database from which an ideal vocabulary-specific application can be created.

Vocabulary Creation

The speech data, including agent responses, collected in Stage 2 from the trial application, is next fed into the recognizer training facility. This facility is maintained off-line and consists of two parts: the segmentation process and the modeling process (Figure 3).

Figure 3. The Speech Recognition Training Process

During segmentation, the speech data is broken into its component words, or sub-word units. Segmentation is performed using a specially configured recognition system which identifies specific words spoken by the caller.

The segmentation process also isolates words from background noise either generated by the caller’s environment or introduced by the telecommunications network. The segmentation process creates a new database, where each entry in the database consists of many examples (typically 500 to 3,000) of each word of the target vocabularies identified in Stage 1, as spoken by the caller population in the trial.
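
The shape of that new database, one entry per vocabulary word holding many spoken examples, can be sketched as below. This is a hedged illustration only: the function names build_word_database and under_sampled and the clip identifiers are invented here, and treating the lower end of the "typically 500 to 3,000" range as a hard minimum is an assumption made for the example.

```python
from collections import defaultdict


def build_word_database(segments):
    """Group segmented clips by the word they contain.

    segments is an iterable of (word_label, clip_id) pairs, as might come
    out of a segmentation pass over the trial recordings.
    """
    database = defaultdict(list)
    for word, clip_id in segments:
        database[word].append(clip_id)
    return dict(database)


def under_sampled(database, vocabulary, minimum=500):
    """List vocabulary words with fewer than `minimum` examples so far."""
    return sorted(
        word for word in vocabulary if len(database.get(word, [])) < minimum
    )


if __name__ == "__main__":
    segments = [("yes", f"clip{i}") for i in range(600)] + [("no", "clipA")]
    db = build_word_database(segments)
    print(under_sampled(db, {"yes", "no", "next"}))
```

A check like this indicates which words still need more caller examples before modeling begins.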

This speech data is then statistically analyzed to create the vocabulary models specific to the customer’s application. Because the speech data is collected from actual callers to the trial application, it reflects the vocabulary, accent and language of the bettors using the system.

Because the models are created from speech samples collected from within a live trial application, they also reflect the noise, distortion and transmission characteristics of the telephone network and environment from which callers typically access the service. As part of the process, special “garbage” models are created which represent these noise and transmission characteristics. These garbage models increase the robustness of the recognizer deployed into the production system.

Deployment

In Stage 4, the speech recognition vocabulary models created in Stage 3 are loaded into the recognizer deployed in the trial application. The arrangement depicted in Figure 1 is then used to deploy the specially created speech recognition solution into a production environment, this time using a bench-marking tool.

Called SyWatch, the bench-marking tool measures the accuracy and performance of the recognizer under development against the performance of the human agent who has been substituting for the speech recognition function. This provides an accurate and authentic measure of how well the recognizer is performing “live.” Once the recognizer attains the performance required for deployment, the system is simply switched over to automatic operation.
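
One simple way to picture this comparison is an agreement rate: the fraction of caller utterances for which the recognizer produces exactly the words the agent selected. The sketch below is not how SyWatch works internally, which the article does not describe. The function names and the 0.95 switchover threshold are assumptions chosen for illustration.

```python
def agreement_rate(agent_transcripts, recognizer_outputs):
    """Fraction of utterances where the recognizer matches the agent exactly.

    Both arguments are lists with one word list per caller utterance.
    """
    if len(agent_transcripts) != len(recognizer_outputs):
        raise ValueError("both lists must cover the same utterances")
    if not agent_transcripts:
        raise ValueError("at least one utterance is required")
    matches = sum(
        agent == recognized
        for agent, recognized in zip(agent_transcripts, recognizer_outputs)
    )
    return matches / len(agent_transcripts)


def ready_for_switchover(rate, required=0.95):
    """True once the recognizer reaches the (hypothetical) required rate."""
    return rate >= required


if __name__ == "__main__":
    agent = [["brisbane", "race", "five"], ["yes"],
             ["sydney", "race", "seven"], ["no"]]
    recognizer = [["brisbane", "race", "five"], ["yes"],
                  ["sydney", "race", "eleven"], ["no"]]
    rate = agreement_rate(agent, recognizer)
    print(rate, ready_for_switchover(rate))
```

Scoring whole utterances rather than individual words is a deliberately strict choice here, since a single wrong word would send the caller to the wrong race.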

From the caller’s perspective nothing has changed. The application sounds and responds the same as it did when first introduced with a call center agent performing the recognition function.

But from the perspective of WA TAB management, everything has changed for the better, because they are fielding more calls at a lower cost.

Clive Summerfield is the technical director of Syrinx Systems.

Speech is a Good Call Center Bet

The approach used at WA TAB has been used in a number of call centers to automate routine applications, releasing agents to concentrate on high value, business critical tasks within the call center.

The cost benefits of traditional IVR technology (based on touch-tone interaction) are generally well understood. However, many of the functions currently undertaken by operators in call centers either cannot be implemented with touch-tone technology, or are so clumsy when implemented on touch-tone IVR systems that they irritate and frustrate callers and fail to meet customer satisfaction levels.

With the development of advanced speech recognition technologies, supported by development tools and services, automation of many of these services becomes a viable proposition.

The WA TAB case is an example of how speech recognition can be successfully deployed into applications which are not suitable for standard IVR techniques. Indeed, the service described is a pure speech recognition solution, with no fall back to IVR touch tone at all.

The use of speech recognition presents the opportunity for call centers to realize significant cost savings through the automation of a range of services previously considered inappropriate for automation, allowing the re-deployment of agent teams to high value tasks, which can dramatically improve service quality, service accessibility and operating viability.
