Issues with Speech Deployments

In today’s mobile “access anywhere” society, the increasing reliance on up-to-date personalized information presents a challenge to application developers, who are tasked with providing solutions and services whenever and wherever they are needed. The prevalence of various access devices such as desktop PCs, laptops, PDAs and SMS- or WAP-enabled mobile phones only complicates the task for the developer. While the next generation of PDA or mobile phone always seems to be just around the corner, and access to a desktop or a laptop may not always be possible, the trusty and ubiquitous phone is always there to access and deliver the information needed.

Telephony Integration

Perhaps the least considered task, but one that can make or break a solution, is determining what kind of phone access is available to the service. In all cases, the speech system must be able to deal with telephony call control, and this can involve significant expense and technical difficulty. Enterprises deploying automated speech applications will need to provision telecom lines either directly from a service provider or as additional connections to their Private Branch Exchange (PBX). For call center functionality, PBXs may be enhanced with Automatic Call Distributor (ACD) software to provide flexible rules-based call routing and call statistics for management reports.

In any of these scenarios, carefully planned numbering schemes can help direct callers to specific services where the speech system can be tuned for optimum performance. This is useful in call centers where a number of different services are available, or where different languages are supported. In addition to direct dial numbers, Caller ID presentation allows incoming calls to be tagged to a location, which helps to identify the caller or to present location-based service menus. Speech applications involved in directing calls (call steering) will need to perform complex call transfers.
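The numbering-scheme and Caller ID routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phone numbers, service names, area-code table and the `route_call` function are all hypothetical, and a real deployment would take the dialed number and Caller ID from the telephony interface rather than from function arguments.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical numbering plan: dialed number -> (service, language).
# Numbers and service names are made up for illustration.
NUMBERING_PLAN = {
    "8005550101": ("account_balance", "en-US"),
    "8005550102": ("account_balance", "es-US"),
    "8005550103": ("order_status", "en-US"),
}

# Hypothetical mapping from Caller ID area code to a regional menu.
REGION_BY_AREA_CODE = {
    "617": "boston",
    "415": "san_francisco",
}


@dataclass(frozen=True)
class Route:
    service: str
    language: str
    region: str


def route_call(dialed_number: str, caller_id: Optional[str]) -> Route:
    """Pick a service and language from the dialed number, and a
    regional menu from the Caller ID when one is presented."""
    service, language = NUMBERING_PLAN.get(dialed_number, ("main_menu", "en-US"))
    region = "default"
    if caller_id:
        region = REGION_BY_AREA_CODE.get(caller_id[:3], "default")
    return Route(service, language, region)
```

Because each dialed number maps to one service and one language, the recognizer behind that number can be loaded with a grammar and acoustic settings tuned for that case only.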
Network-based ISDN services allow calls to be transferred across the network using two B-channel transfer (TBCT). Once the speech system determines that someone has picked up the call, it can even elect to stay in the call as a virtual operator, in case the user decides to make a follow-on call or conference in another caller. If the application isn’t required to stay on the call, it can tell the network to drop it out of the loop after transfer, and the network optimizes the call path accordingly. Enterprise speech applications may also require the ability to make transfers via the enterprise PBX or ACD. The command used to effect the transfer depends on the protocol supported by the PBX or ACD itself, but many ISDN flavors support explicit call transfer (ECT) for this purpose. Applications such as virtual assistants that provide remote access to calendars and contact databases will require the system to stay in the loop.

If the network service provider or enterprise switch does not support network call transfer, then a technique known as “tromboning” can be used. This requires the speech platform to create the bridge between the caller and the called party. The downside is that each call takes up two channels on the platform. However, there are cases where tromboning is desirable, because it allows the application to remain in the call path. For example, the system may need to supervise the transfer of a call, pass information about the caller (such as an account number) to an agent before completing the transfer (whispering), or go into “stand-by” and wait for the user to utter a “wake-up” command to reactivate it.

The final issue to consider for telephony integration in call centers is computer-telephony integration (CTI) middleware. These costly systems make it possible to reduce the time an agent is engaged on a call by removing the routine call set-up and close-down segments.
For example, CTI systems can use call data to retrieve a client’s data records and pop them up on the screen of the agent to whom the call is being transferred. CTI can also be adapted to retrieve and update records for a caller who is being handled by the automated speech system.

Application Integration

One of the issues that has done most to hold back the adoption of speech-enabled applications is the length of time it takes to provide and maintain up-to-date content. Proprietary systems require each set of services and scripts to be written from scratch, which adds to the cost of deployment and maintenance.

With the rise of the Internet, we now have more information than we know what to do with. Although it is a misnomer to call voice a “browser”, the ability to use voice to retrieve information or drive a transaction from the Internet provides a compelling case for deploying enterprise speech portals. In addition, there are many more computer-based enterprise applications, such as customer interaction management systems for sales, and logistics and supply management systems, that would become more effective if they offered remote access to a mobile workforce.

The rapid growth of standards-based speech interfaces such as VoiceXML provides a relatively straightforward route to speech enablement. A properly designed VoiceXML gateway will allow any HTML or XML-scripted application to be speech-enabled very quickly. Speech platforms typically contain components that are accessed through a low-level application programming interface (API). A good VoiceXML interpreter will encapsulate these API calls to allow developers to author solutions directly in VoiceXML. The VoiceXML gateway is responsible for managing communications with Web servers and integrating with the underlying ASR and TTS engines. This integration should be effected in such a way that it makes the VoiceXML speech application portable.
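To make the division of labor concrete, here is a minimal sketch of how a Web application might generate a VoiceXML page for a gateway to fetch and interpret. It assumes a VoiceXML 2.0 gateway; the helper name `build_vxml_form`, the grammar file and the submit URL are hypothetical placeholders, not part of any particular product.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_vxml_form(field_name: str, prompt_text: str,
                    grammar_src: str, submit_url: str) -> str:
    """Return a one-field VoiceXML 2.0 document as a string.

    The gateway, not this code, drives the ASR and TTS engines;
    the application only describes the dialog.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "main"})
    field = ET.SubElement(form, "field", {"name": field_name})

    # Text the TTS engine will speak to the caller.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = prompt_text

    # External grammar that constrains what the ASR engine listens for.
    ET.SubElement(field, "grammar",
                  {"src": grammar_src, "type": "application/srgs+xml"})

    # Once the field is recognized, post the result back to the Web server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit",
                  {"next": submit_url, "namelist": field_name})

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical grammar file and URL, for illustration only.
    print(build_vxml_form("city", "Which city would you like the weather for?",
                          "city.grxml", "http://example.com/weather"))
```

Nothing in the generated page refers to a specific recognizer or telephony board, which is the sense in which a VoiceXML application can stay portable across gateways.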
However, VoiceXML provides only limited call control and multimodal capabilities, and complementary technologies can help fill the gaps. Call Control XML (CCXML) supports more comprehensive call transfer features, and can be implemented with or without VoiceXML. Multimodal functionality, ideal for simplified voice access to complex information presented on mobile devices, can be achieved through “X+V”, which combines XHTML for text/data handling with VoiceXML for voice input. SALT is yet another approach to delivering multimodal capabilities, and although it appeared on the scene more recently, it is rapidly gaining momentum as Microsoft begins development of SALT-based services within Windows. These alternative approaches to implementing multimodal support have not yet converged into a single universal standard. As a result, developers must carefully consider the relatively subtle implications of each approach and the availability of development tools and platforms that support it.

Application Performance

Application integration is only the first step in creating and deploying an effective, scalable speech system. System performance is not determined solely by the choice of speech platform, speech engine or even server hardware; it also depends on system bottlenecks that must be addressed for optimal performance and scalability. These bottlenecks occur in the initial processing of the speech signal, which removes noise, silence and echo from the audio before it is sent to the system for recognition. The speech software running on the client ASR system can perform some of this processing, but only at the expense of system performance and scalability. The latest generation of telephony boards has been specifically designed to address these bottlenecks. Recent advances in long-tail echo cancellation improve recognition during barge-in, addressing accuracy problems that would normally become evident only during actual deployment.
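As an illustration of the silence-removal step, the following sketch finds where speech starts and stops using a simple frame-energy threshold. It is a deliberately naive, host-side example and not how any particular telephony board works; the 160-sample frame (20 ms at the 8 kHz telephony sampling rate), the energy threshold and the one-frame hangover are all assumed values.

```python
from typing import List, Optional, Tuple


def frame_energies(samples: List[int], frame_size: int) -> List[float]:
    """Mean squared amplitude of each complete frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(s * s for s in frame) / frame_size)
    return energies


def find_endpoints(samples: List[int],
                   frame_size: int = 160,      # assumed: 20 ms at 8 kHz
                   threshold: float = 1.0e6,   # assumed energy threshold
                   hangover_frames: int = 1    # assumed trailing margin
                   ) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample offsets of the speech region,
    or None if no frame reaches the threshold."""
    energies = frame_energies(samples, frame_size)
    active = [i for i, energy in enumerate(energies) if energy >= threshold]
    if not active:
        return None
    start = active[0] * frame_size
    # Keep a short margin after the last active frame so that quiet
    # word endings are not clipped, without running past the audio.
    last = min(active[-1] + 1 + hangover_frames, len(energies))
    return start, last * frame_size


if __name__ == "__main__":
    silence = [0] * 480
    speech = [2000] * 320
    print(find_endpoints(silence + speech + silence))
```

Only the samples between the two offsets need to be passed on to the recognizer, which is why doing this work before the audio reaches the host reduces both processing load and response delay.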
Specialized voice detection, designed for compatibility with speech systems, performs first-pass endpointing to improve system responsiveness and performance by reducing the audio processing and delays within the system. Unlike traditional telephony hardware, these advanced telephony boards offload the host processor to allow improved accuracy and scalability, especially in the challenging conditions faced when deploying speech applications worldwide.

The successful deployment of speech-driven applications requires careful attention to the design of the entire system, not just the speech user interface and application software. Network capabilities and configuration can determine call transfer capabilities; the choice of application integration approaches will define the system architecture and development environment for years to come; and the selection of telephony hardware will impact the performance and future scalability of the overall system. Fueled by market demand, speech technology continues to evolve rapidly to support the widespread deployment of real-world speech applications and systems.

Keith Byerly is senior market development manager and Paul Jackson is global markets manager for Brooktrout Technology. They can be reached at kbyerly@brooktrout.com and pjackson@brooktrout.com.
