Software-Only Vs. Embedded: Which Architecture Is Best For You?
Speech technologies have become an essential element in successful state-of-the-art computer telephony (CT) applications. The integration of automatic speech recognition (ASR) and text-to-speech (TTS) in applications greatly increases the effectiveness of interactive voice response CT for a wide range of uses. For example, speech technologies give callers without touch tone phones the ability to easily navigate IVR menus. Speech-enabling auto-attendants make it possible for a caller to simply speak the name of an individual or department instead of trying to type in a name using the phone keypad. Financial, travel reservation, and sophisticated personal assistant applications, often with vocabularies of more than 50,000 words, offer the convenience of fully automated transactions.
Two viable architectures for implementing speech technologies are software-only and embedded. In a software-only architecture, speech technologies run on the same host CPU as the CT application; no specialized hardware is necessary. In embedded architectures, the speech technologies run on dedicated digital signal processing (DSP) hardware. For some applications, the best choice is a hybrid of these two architectures. There are many factors to consider in deciding which architecture is right for a specific application.
The Types of Architectures
Software-only speech technology systems do not require any additional hardware to run the speech technology, making them a cost-effective solution. This type of architecture leverages advances in CPU price/performance, as well as the sheer CPU processing power now available on PCs. Moreover, according to Moore's Law, CPU processing power should continue to double every 18 months for even better price/performance ratios in the future. In addition to saving money, software-only architectures reduce the maintenance sometimes associated with hardware that is specific to speech technology.
Embedded systems offer the advantages of scalability, easier problem isolation and diagnosis, and more deterministic system behavior for resource provisioning. Based on DSP boards, embedded speech technology systems move the processing load for the speech technology away from the host CPU, offering greater scalability and providing the ability to create high-density, speech-enabled applications. This architecture lets you add channels of speech technology by adding more DSP boards without having to increase CPU power as you would with a software-only system. Embedded systems also make it easier for a developer to determine system resource requirements by clearly isolating the speech technology processing load. By localizing the speech technology to a specific board or group of boards, this architecture better insulates the CPU from any problems that might occur with the speech technology. If any errors do occur, it is also easier to isolate and correct them. This is obviously essential for call center and PSTN-based CT applications, given their high system availability/reliability requirements. Finally, the DSPs used in embedded systems are more efficient for some aspects of speech technologies, especially those associated with ASR processing.
For extremely large-vocabulary phonetic ASR call center CT systems, a third architecture, a hybrid of the embedded and software-only models, is proving to be very efficient. This efficiency results from distributing components of the ASR processing to the place where each is handled best. For example, since DSPs are designed for signal processing, they are naturally quite efficient at what is referred to in ASR as front-end processing. This includes functions such as echo cancellation, detecting the onset and termination of an utterance by the caller (so-called end pointing), filtering/adaptation for varying line conditions, ASR feature extraction, and so forth. One application for using these front-end processes is performing barge-in, where the speech recognition system recognizes an utterance from the caller during a system voice prompt. Permitting speech input during a prompt is important to users familiar with the system who do not want additional prompting to complete their transaction. By performing barge-in on a DSP board, the input does not need to be processed by the host until the caller actually speaks. By having the host only process data when valid speech data is present (i.e., not having to process silence), the ASR channel densities achievable on the host dramatically increase. In contrast, the host computer, with its large and inexpensive memory, is better suited to tasks like storing very large vocabularies and searching for a specific utterance within a vocabulary. Also, adding sophisticated grammar processing to speech systems, enabling fully automated systems to understand caller input (for example, "I want to make a reservation for a flight from San Francisco to London next Tuesday morning"), is most efficiently done on the host.
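The end pointing step described above can be illustrated with a simple energy-threshold detector. This is only a minimal sketch in Python, not how any particular DSP board implements it: the function names, the frame length, the energy threshold, and the "hangover" count of quiet frames are all hypothetical values chosen for illustration, and production front ends typically use more robust methods.

```python
from typing import List, Optional, Tuple


def frame_energies(samples: List[int], frame_len: int) -> List[int]:
    """Sum of squared sample values for each complete frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(
    samples: List[int],
    frame_len: int = 80,              # assumed: 10 ms at 8 kHz telephony audio
    energy_threshold: float = 1.0e6,  # assumed: tuned per line conditions
    hangover_frames: int = 3,         # assumed: quiet frames that end an utterance
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the first utterance, or None.

    Onset is the first frame at or above the threshold. Termination is
    the start of the first run of `hangover_frames` quiet frames after onset.
    """
    energies = frame_energies(samples, frame_len)
    start = None
    quiet = 0
    for idx, energy in enumerate(energies):
        if start is None:
            if energy >= energy_threshold:
                start = idx
        elif energy < energy_threshold:
            quiet += 1
            if quiet >= hangover_frames:
                end_frame = idx - hangover_frames + 1
                return start * frame_len, end_frame * frame_len
        else:
            quiet = 0
    if start is None:
        return None  # only silence: nothing to hand to the host
    # Audio ended before enough trailing silence was seen.
    return start * frame_len, (len(energies) - quiet) * frame_len
```

In a hybrid design of the kind described, logic like this would run on the DSP side, and only the samples between the returned indices would be passed to the host recognizer, which is what spares the host from processing silence.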
As a result, the hybrid architecture, with front-end processing on a DSP board and vocabulary look-up and grammar processing on the host, is an optimized combination that allocates specific tasks to those environments (DSP or host) where they are handled most efficiently.
Choosing an Architecture
There are several considerations in choosing the right architecture for deploying speech-enabled CT applications in a call center. A small call center system requiring only a few channels of TTS and ASR may save money by choosing a software-only solution with no speech-technology-specific hardware. Using today's technology, it is possible to reach densities of about 8 channels per single Pentium 200, with enough reserve CPU processing power for the CT application. To ensure reliable performance with this architecture, it's essential to consider the tasks the speech technologies need to accomplish, determining how much of the time a recognizer is expected to be active (its duty cycle), as well as how much CPU processing power would be used by all of the recognizers simultaneously. This must be followed by actual benchmarking tests to ensure a reliable, effective solution.
In larger call center environments, where high-density, high-availability systems are crucial, embedded architectures provide a more scalable, maintainable solution. Based upon the specific type of technology used, current densities can be in the area of 12 ASR or 24 TTS resources per board. In these environments, there is great flexibility in being able to confidently scale the speech technology component of a system by adding boards that do not load the host CPU. It is also crucial to be able to quickly replace a defective board with a new one to bring a system back online.
Choosing the right architecture is important, but not always difficult. For low-density speech applications where cost is a primary concern, software-only, host-based technologies are ideal. This configuration requires benchmarking work to ensure adequate system performance and may lead to greater difficulty in problem isolation and diagnosis. Host-based products can also serve as a good testing ground or prototyping platform for higher density speech-enabled CT applications.
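The sizing considerations above (duty cycle, simultaneous recognizer load, and per-board densities) lend themselves to a quick back-of-the-envelope calculation before benchmarking. The Python sketch below is illustrative only: the function names and the per-recognizer CPU fraction are hypothetical, the board densities default to the approximate figures quoted above, and it assumes each board is dedicated to either ASR or TTS. It does not replace the benchmarking the article calls for.

```python
import math


def average_cpu_load(channels: int, cpu_per_active: float, duty_cycle: float) -> float:
    """Expected host CPU fraction when each recognizer is active
    `duty_cycle` of the time (0.0 to 1.0)."""
    return channels * cpu_per_active * duty_cycle


def peak_cpu_load(channels: int, cpu_per_active: float) -> float:
    """Host CPU fraction if every recognizer is active at the same time."""
    return channels * cpu_per_active


def boards_needed(
    asr_channels: int,
    tts_channels: int,
    asr_per_board: int = 12,  # approximate density quoted in the article
    tts_per_board: int = 24,  # approximate density quoted in the article
) -> int:
    """DSP boards required, assuming each board carries only ASR or only TTS."""
    return (math.ceil(asr_channels / asr_per_board)
            + math.ceil(tts_channels / tts_per_board))


if __name__ == "__main__":
    # Hypothetical software-only system: 8 channels, each active recognizer
    # assumed to cost 10% of the host CPU, active half of the time.
    print("average load:", average_cpu_load(8, 0.10, 0.5))
    print("peak load:   ", peak_cpu_load(8, 0.10))
    # Hypothetical embedded system: 48 ASR and 48 TTS channels.
    print("boards:      ", boards_needed(48, 48))
```

Comparing the average and peak figures shows why the duty cycle matters: a host that looks comfortable on average can still be saturated if all recognizers become active together, which is the case benchmarking needs to cover.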
For high-density, high-availability applications, embedded systems are a better choice, offering greater scalability and more deterministic behavior for provisioning. Also, embedded systems lend themselves to easier problem isolation and diagnosis and can quickly be brought back online. Finally, for large-vocabulary, fully-automated transaction applications, which are increasingly common in call centers, a distributed DSP/host-CPU hybrid architecture provides the most efficient platform. This hybrid architecture moves specific aspects of speech recognition to the environment (DSP or host) where they are handled most effectively.
Gene Eagle is Speech Product Line Manager, Dialogic Corporation, 1515 Route 10, Parsippany, N.J. 07950 and can be reached at 973-993-3000.