VoIP's Impact on Speech Recognition

Voice over Internet Protocol (VoIP) is rapidly replacing traditional phone service in the enterprise, and that's good news for speech recognition. According to Synergy Research, revenues associated with enterprise IP telephony topped $4 billion in 2005, a 31 percent increase over 2004, and the figure has undoubtedly grown further this year. This rapid growth in enterprise VoIP, fueled by the use of industry standards and improvements in network design, is helping to drive deployment of speech recognition.

In the past, if enterprises wanted to integrate speech recognition into their traditional phone systems, they typically had to rely on a single vendor's closed, monolithic equipment; undergo extensive training on that vendor's proprietary tools; and have in-depth knowledge of PBX-specific protocols. Enterprises also had to bear the cost, effort, and time investment to maintain separate systems for their phone and data services. These obstacles made deploying speech recognition expensive, complex, and limited primarily to the largest enterprises. As enterprises move to the VoIP network model, they can replace monolithic, vendor-specific network elements with distributed, open-standards equipment (see illustration). The VoIP network model allows voice and data packets to share a single converged network infrastructure. All of these VoIP benefits make deploying speech recognition simpler and less expensive.

In a modern enterprise VoIP network, there are three main elements: the IP Private Branch eXchange (IP-PBX) to manage the handsets on the network, the VoIP gateway for the phone service connection, and the IP media server for media processing. Enterprises can buy these network elements from a single vendor if they choose, but they are also free to select best-of-breed components from different vendors. These three network elements are tied together using two industry standards: Session Initiation Protocol (SIP) and Real-time Transport Protocol (RTP). SIP, at the application level, is used for setting up, changing, and terminating multimedia sessions between devices, and RTP carries multimedia traffic over the network. The IP-PBX manages the telephone handsets throughout the enterprise, allowing calls to be placed over an IP network instead of over standard telephone infrastructure. Telephone handsets connect to the IP-PBX using SIP, and then the audio traffic is carried over the network by RTP.
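To make the role of RTP more concrete, the following is a minimal sketch of how the 12-byte fixed header that precedes the audio in each RTP packet could be decoded. The field layout follows RFC 3550; the function name, the class name, and the example packet values are illustrative assumptions, not part of any product described here.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source (SSRC).
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,
        padding=bool(first & 0x20),
        extension=bool(first & 0x10),
        csrc_count=first & 0x0F,
        marker=bool(second & 0x80),
        payload_type=second & 0x7F,
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
    )


# Hypothetical packet: version 2, payload type 0 (PCMU, i.e. G.711 mu-law),
# sequence number 1, timestamp 160, a made-up SSRC, and 160 placeholder
# payload bytes.
example = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x12345678) + bytes(160)
header = parse_rtp_header(example)
```

The sequence number and timestamp are what allow a receiver, such as a media server feeding a speech engine, to reorder packets and reconstruct the timing of the audio.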

Most business voice traffic that leaves the enterprise network still travels over the Public Switched Telephone Network (PSTN), which uses non-packetized Time Division Multiplexing (TDM). So the enterprise needs a translator between the interior packet network and the exterior non-packet network. That translator is the VoIP gateway. The VoIP gateway performs two-way translation between the packetized SIP signaling of the IP-PBX and the non-packetized T1 and E1 phone circuits of the PSTN. The VoIP gateway can accomplish the translation because it handles different voice coding algorithms, such as G.711 for traditional phone service and G.729 for VoIP applications. The VoIP gateway can also be used to transcode between different packet protocols. VoIP gateways are especially important for speech recognition because they handle line signaling and echo cancellation, which improve speech recognition accuracy.
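As a rough illustration of why the choice of voice coding algorithm matters, the sketch below computes how many bytes of encoded audio one packet carries for each codec. The nominal bit rates (64 kbit/s for G.711, 8 kbit/s for G.729) are standard figures; the 20-millisecond packet interval is an assumption chosen because it is a common default, and the function name is hypothetical.

```python
# Nominal codec bit rates in bits per second.
CODEC_BIT_RATE = {"G.711": 64_000, "G.729": 8_000}


def payload_bytes(codec: str, packet_ms: int = 20) -> int:
    """Bytes of encoded audio carried in one packet of the given duration."""
    bits = CODEC_BIT_RATE[codec] * packet_ms // 1000
    return bits // 8


g711_payload = payload_bytes("G.711")  # 160 bytes per 20 ms packet
g729_payload = payload_bytes("G.729")  # 20 bytes per 20 ms packet
```

Under these assumptions, transcoding from G.711 to G.729 cuts the audio payload by a factor of eight, not counting packet header overhead.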

Speech recognition and other applications added to the enterprise VoIP network typically reside on an application server. When the application server needs media processing resources for one or more of its applications, it sends a request to an IP media server. The IP media server handles the request, managing and allocating the packetized media streams to match the application requirements. For Interactive Voice Response (IVR) applications, the IP media server plays prompts, detects Dual Tone Multi-Frequency (DTMF), and processes scripts with VoiceXML.
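The DTMF detection mentioned above can be sketched with the Goertzel algorithm, a widely used technique for measuring signal power at a handful of known frequencies. This is an illustrative sketch only, not how any particular media server implements detection: the function names are hypothetical, the 8 kHz sample rate is assumed, and the detector always returns a key because it omits the power and timing thresholds a production detector would need.

```python
import math

LOW_FREQS = (697, 770, 852, 941)        # keypad row tones in Hz
HIGH_FREQS = (1209, 1336, 1477, 1633)   # keypad column tones in Hz
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Relative signal power at `freq`, using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate=8000):
    """Return the key whose row and column tones carry the most power."""
    row = max(range(4),
              key=lambda i: goertzel_power(samples, LOW_FREQS[i], sample_rate))
    col = max(range(4),
              key=lambda i: goertzel_power(samples, HIGH_FREQS[i], sample_rate))
    return KEYPAD[row][col]


def make_tone(key, sample_rate=8000, duration=0.05):
    """Synthesize a clean DTMF tone for `key` (test signal only)."""
    row = next(i for i, keys in enumerate(KEYPAD) if key in keys)
    col = KEYPAD[row].index(key)
    count = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * LOW_FREQS[row] * t / sample_rate)
        + math.sin(2 * math.pi * HIGH_FREQS[col] * t / sample_rate)
        for t in range(count)
    ]


detected = detect_dtmf(make_tone("5"))
```

Each key press is the sum of one row tone and one column tone, so checking eight frequencies is enough to identify all sixteen keys.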

These three main network elements (the IP-PBX, the VoIP gateway, and the IP media server), along with an application server, make up the VoIP network and create an infrastructure that makes it much easier to implement speech recognition.

The development environment plays an important role in the adoption of speech recognition. In the past, applications had to be built using vendor-specific development languages. But mark-up languages such as Speech Application Language Tags (SALT) and VoiceXML provide an open standard, so developers can use them to build applications faster and more cost-effectively, with familiar drag-and-drop tools. Web mark-up languages also help leverage existing investments in Web technology, for example, by utilizing information already stored on a corporate Web server. Back-end databases used for graphical Web queries can be reused and extended to support telephony access via speech queries.
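To show what such mark-up looks like in practice, here is a minimal sketch of a Web back end generating a small VoiceXML 2.0 dialog. The element names (vxml, form, field, prompt, grammar) come from the VoiceXML specification; the function name, the prompt wording, the form and field names, and the grammar file name are all hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_account_dialog(grammar_url: str) -> str:
    """Return a minimal VoiceXML 2.0 document that asks for an account number."""
    root = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(root, "form", {"id": "balance"})
    # A field collects one piece of caller input, here by speech.
    field = ET.SubElement(form, "field", {"name": "account"})
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Please say your account number."
    # The grammar tells the recognizer which utterances are valid.
    ET.SubElement(field, "grammar",
                  {"src": grammar_url, "type": "application/srgs+xml"})
    return ET.tostring(root, encoding="unicode")


document = build_account_dialog("account.grxml")
```

Because the dialog is ordinary XML served over HTTP, the same server-side code that builds Web pages from a database could, in principle, build pages like this one for callers.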

Once the VoIP network is set up, adding speech recognition is a simple matter of introducing another standard, Media Resource Control Protocol (MRCP). MRCP is a standard communication protocol for speech resources across VoIP networks, and it allows the IP-PBX and IP media server to communicate over RTP streams with a speech engine. MRCP also controls resources such as automated speech recognition and text-to-speech. MRCP adoption is growing significantly. Because MRCP lets developers independently manage their speech technology with minimal impact on their application, it makes deployments and upgrades much simpler and faster. With a VoIP network model, the speech engine and application server have been transformed from a single monolithic block to disaggregated, distributed, independently scalable network elements that can be quickly upgraded.

IMS Blueprint in the Enterprise: Distribution and Scaling

This new VoIP network model benefits enterprises in much the same way that an IP Multimedia Subsystem (IMS) model benefits the carrier market. The IMS blueprint allows developers to move away from application silos to a distributed architecture that supports the rapid creation of VoIP-based speech applications. VoIP allows for location-independent speech and IVR equipment. Just as in the IMS model, the enterprise VoIP network model allows each functional element to be scaled independently. For example, if a company needs to accommodate growing call volumes in the enterprise, it can simply add more VoIP gateways, without any need to reconfigure the IP-PBX. If a company needs higher reliability, it can duplicate elements such as the IP media server. If a company wants to add more speech recognition applications, it can simply add new application servers.
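As a back-of-the-envelope illustration of scaling one element independently, the sketch below estimates how many VoIP gateways a given peak call volume would require. A T1 circuit carries 24 voice channels; the number of T1 ports per gateway and the call volume in the example are hypothetical figures, and the function name is an assumption.

```python
import math

T1_VOICE_CHANNELS = 24  # voice channels on one T1 circuit


def gateways_needed(peak_calls: int, t1_ports_per_gateway: int) -> int:
    """Smallest number of gateways whose T1 capacity covers the peak call load."""
    capacity_per_gateway = t1_ports_per_gateway * T1_VOICE_CHANNELS
    return math.ceil(peak_calls / capacity_per_gateway)


# Hypothetical example: 500 simultaneous PSTN calls, gateways with 4 T1 ports.
required = gateways_needed(500, 4)
```

In this example each gateway handles 96 calls, so 500 simultaneous calls call for six gateways; growth in call volume changes only that count, not the IP-PBX or media server configuration.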


VoIP offers integration between speech and data that is impossible with traditional TDM phone systems. With VoIP, designers can pick and choose best-of-breed products that fit the specific business needs and the overall goals of the enterprise. They can also simplify design and leverage their existing assets. By moving to a VoIP network model and using industry standards, enterprises can develop and deploy applications more quickly. Developers move from proprietary tools and products to an open, industry-standard infrastructure that allows speech to be easily added and leveraged across multiple applications. VoIP simply makes deploying speech recognition much faster, simpler, and more affordable.

Scott Wieder is director of market development at Cantata Technology, a provider of enabling communications hardware and software that empowers the creation and delivery of anytime, anywhere IP-based communications applications.
