Choosing Hardware for Your Speech Application

"If you limit your choices only to what seems possible or reasonable, you disconnect yourself from what you truly want, and all that is left is a compromise," - Robert Fritz, composer, filmmaker and organizational consultant.

It is debatable whether speech technologies have become mainstream and, certainly, a great deal of the attention is based on skepticism. Nevertheless, more and more companies are looking at what speech can do for them and their customers.

There are many critical choices to be made when deploying speech. In the world of speech technology solutions, especially when getting started with speech, the choices made can mean the difference between a long-term, repeatable success and a compromise - possibly the proverbial "flash in the pan."

Underneath the Hood
The choice of underpinning hardware and software components is, therefore, just as critical as the choice of a speech engine when seeking to construct a solution.

Check Off List

Choose Underpinning Hardware and Software
  Digital access connections (e.g., TDM and IP)
  Diverse protocols (e.g., SIP or SS7)
  Languages
  Speech engines (possibly from multiple vendors)
Requirements for Providing End Users Access to Services
  IP Option: Protocols (using IP only may limit scope for product)
    H.323
    SIP
  Voice calls made by users
    Incoming media - could be encoded as G.711
    Playback of voice prompts - could be encoded differently to reduce bandwidth and required memory space
    Transcoding between codecs
  PSTN Option
    Check number of parameters
    Check physical connection
    Ensure the ability to switch between T1 and E1, µ-law and A-law without changing hardware
      Selection via application programming interface (API)
      Automatically selected via the protocol loaded
    Choose protocol to be used
      ISDN
      Legacy channel associated signaling (CAS)
      Signaling system number 7 (SS7)
  Media Gateway Option
Requirements of Voice Applications Running on Speech Recognition
  Buffered speech feed
  Silence elimination
  Grunt detection
  Echo cancellation
  Multi-function cards
  DSP resource cards or host media processing software
Consider the Lifecycle of Your Product
  How will the product evolve
  Consider future needs for end users
  Make your solution readily scalable
  Make sure your solution can handle more than
    Play
    Record
    Echo cancellation
    DTMF handling

End users' needs and expectations, different implementation languages, alternative application development environments (based on SALT or VoiceXML, for example), a standards-based (e.g., MRCP) or proprietary telephony/speech server audio interface, and the commercial imperatives of vendors all mean that customers will always have choices to make beyond the speech engine.

At a lower level in the architecture, careful choice of standards-based hardware and software components means you can create a repeatable model that will be usable in different scenarios. These might include the need for digital network access connections such as TDM and IP, using diverse protocols such as SIP or SS7, languages, and even speech engines from multiple vendors.

When the German directory assistance service provider Telix was looking to offer its customers an automated solution with high performance but low-cost service, they made a clear choice. "With a single card needed for telephony and DSP resources, we were able to exploit the high channel densities of Prosody for such a large deployment. This, coupled with its proven stability and reliability, meant we had complete confidence in the performance of our solution for Telix," said Johannes Wagner, head of automated directory solutions at Telix.

So for those getting started with speech, it will pay to get a better understanding of some of the key selection criteria for hardware and software components used in speech applications.

PSTN and/or VoIP
Bearing in mind that 90 percent of customer service interactions still occur over the phone, it is essential for the designer of any speech system to fully consider the requirements for providing end-users with a means of accessing the service. This might seem self-evident, but it's easily overlooked, simply because it's so obvious.

A basic question facing designers today is whether to focus purely on IP network connectivity or to retain a PSTN option as well. By choosing IP only, they run the risk of limiting the market scope for their product. IP connectivity is relatively clear-cut, although there are still some options to think about. Protocols are a straight choice between H.323 and SIP, and with SIP already seeming to dominate in new designs, it's not so much a choice as a given.

In terms of the voice call the user makes, there are variables to consider carefully. For example, the incoming media could be encoded as standard 64 kbit/s telephone speech (known as G.711) to ensure maximum performance from a speech engine, but the voice prompts played back could well be encoded using some other compression scheme that reduces both bandwidth consumption and the memory space needed (G.723.1 and G.729 are options). This means that choosing hardware capable of handling all these coding schemes (codecs) is essential. It will be even more important if transcoding between codecs is also a requirement.
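To put rough numbers on that trade-off, the following is a minimal sketch that compares the storage needed for a prompt under each codec mentioned above. It uses the nominal bit rates of the codecs (payload only, ignoring file headers and any RTP/UDP/IP overhead); the function and table names are illustrative and not taken from any vendor's API.

```python
# Nominal codec payload bit rates in bit/s. These are the standard
# published rates; packet and file-format overhead are ignored here.
CODEC_BITS_PER_SECOND = {
    "G.711": 64000,
    "G.729": 8000,
    "G.723.1 (6.3 kbit/s)": 6300,
    "G.723.1 (5.3 kbit/s)": 5300,
}


def prompt_storage_bytes(codec: str, seconds: int) -> int:
    """Approximate bytes needed to store `seconds` of audio in `codec`.

    Hypothetical helper for illustration only.
    """
    bits = CODEC_BITS_PER_SECOND[codec] * seconds
    # Round up to a whole byte.
    return (bits + 7) // 8


if __name__ == "__main__":
    for name in CODEC_BITS_PER_SECOND:
        print(f"{name:22s} 10 s prompt: {prompt_storage_bytes(name, 10)} bytes")
```

On these nominal figures, a prompt stored as G.729 takes one eighth of the space of the same prompt stored as G.711, which is why playback and recognition input may sensibly use different codecs.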

For a telephone option (PSTN or ISDN), there are a number of parameters that need to be checked and determined before choosing the right product to meet your needs. In addition to the physical connection, it will pay to ensure that you can quickly change between the North American and European telephony signal encoding schemes (T1 and E1, µ-law and A-law, respectively), without having to change hardware. This is usually done these days by software selection via the application programming interface (API) or automatically by the protocol loaded, and in terms of deploying your solution without having to "knife and fork" an upgrade, this level of capability is a godsend.
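As an illustration of what "software selection" can look like from the application side, here is a small sketch in which the trunk type is a configuration value rather than a hardware property. The class and function names are hypothetical and do not correspond to any particular vendor's API; the line rates, timeslot counts and companding laws are the standard T1 and E1 figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrunkProfile:
    """Hypothetical description of a digital trunk type."""
    name: str
    timeslots: int        # 64 kbit/s timeslots per frame
    line_rate_kbits: int  # total line rate in kbit/s
    companding: str       # G.711 companding law normally used


# T1: 24 timeslots plus 8 kbit/s of framing = 1544 kbit/s, mu-law.
# E1: 32 timeslots (including framing and signaling slots) = 2048 kbit/s, A-law.
PROFILES = {
    "T1": TrunkProfile("T1", 24, 1544, "mu-law"),
    "E1": TrunkProfile("E1", 32, 2048, "A-law"),
}


def select_profile(trunk_type: str) -> TrunkProfile:
    """Pick a trunk profile by name, as an application might at start-up."""
    try:
        return PROFILES[trunk_type.upper()]
    except KeyError:
        raise ValueError(f"unknown trunk type: {trunk_type!r}") from None


if __name__ == "__main__":
    profile = select_profile("e1")
    print(profile.name, profile.timeslots, profile.companding)
```

The point of the sketch is only that moving a deployment from North America to Europe becomes a one-line configuration change when the card supports both modes.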

A key point in this category of choice is the protocol to be used: ISDN, legacy channel associated signaling (CAS), or signaling system number 7 (SS7). The best advice is to make sure your hardware vendor gives you a wide choice of tried and tested digital network access protocols that won't inhibit you from deploying your product anywhere in the world.

Having mentioned both IP and PSTN, it is a good idea to consider the option of media gateway functionality. This is necessary for a number of scenarios, for example, interfacing from an IP environment to legacy PBX and ACD equipment or providing a channel to your IP-based speech solution for calls incoming from a PSTN network. So whether you are IP-enabling your legacy solution or making sure your next generation product can still offer services to PSTN users, make sure the hardware you choose has a gateway option.

A poor choice up front will mean the costly alternative of trying to include additional network access cards - or a stand-alone gateway - at an untimely later date.

End-pointing and Echoes
A common use for speech technologies in early applications was to replace DTMF-based user menus. The basic media processing resource combinations of playback, record and DTMF handling used in these systems remain very useful for applications.

The majority of voice application servers running speech recognition require a "buffered speech feed" - that is, the recording and end-pointing of incoming speech and its controlled input to the recognizer at a pace the speech engine can accommodate - in order to ensure an accurate recognition result. Silence elimination and the splendidly named grunt detection are essential media processing features, which together enable the effective end-pointing of speech data so that a recognizer does not have to consume excessive CPU time inefficiently differentiating between spoken words and periods of noise or silence.
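The idea behind end-pointing can be sketched with a very simple energy-based detector. This is a simplified illustration, not the algorithm used by any particular card or speech engine: it assumes 16-bit linear samples at 8 kHz (so 160 samples is a 20 ms frame), and the threshold and hangover values are arbitrary choices for the example. It does not attempt grunt detection.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of linear PCM samples."""
    return sum(s * s for s in frame) / len(frame)


def find_endpoints(samples, frame_len=160, threshold=1.0e5, hangover_frames=10):
    """Return (start, end) sample indices of the first utterance, or None.

    A frame counts as speech when its energy reaches `threshold`. The
    utterance ends once `hangover_frames` consecutive quiet frames follow
    speech, so short pauses between words do not cut it off early.
    """
    start = None
    end = None
    quiet_run = 0
    for offset in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[offset:offset + frame_len]
        if frame_energy(frame) >= threshold:
            if start is None:
                start = offset
            end = offset + frame_len
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                break
    if start is None:
        return None
    return (start, end)


if __name__ == "__main__":
    # 100 ms of silence, 100 ms of a loud square wave, 400 ms of silence.
    audio = [0] * 800 + [1000, -1000] * 400 + [0] * 3200
    print(find_endpoints(audio))
```

Only the samples between the returned indices would need to be passed on to the recognizer, which is how silence elimination saves it from processing audio that contains no speech.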

An additional critical resource feature is echo cancellation. It is essential in any practical TDM-based speech application (TDM, or time-division multiplexing, carries multiple conversations over one link by giving each its own time slot), where it is used to facilitate user barge-in, allowing the caller to interrupt prompts and gain faster service as a result.
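For barge-in to work, the echo of the outgoing prompt has to be removed from the incoming audio so that the recognizer hears the caller rather than the prompt. One common textbook way to do this is a normalized least-mean-squares (NLMS) adaptive filter. The sketch below is a minimal floating-point illustration of that technique under idealized assumptions (a short, fixed echo path and no caller speech); it is not a description of how any specific DSP card implements echo cancellation, and the function and parameter names are made up for the example.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from `mic`.

    far_end: samples of the prompt being played out.
    mic:     samples received back from the line (echo plus any speech).
    Returns the residual signal after the estimated echo is removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        # Estimated echo is the filter output for the current history.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        # Normalize the step size by the recent far-end signal power.
        power = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        residual.append(error)
    return residual


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    prompt = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    # Simulated echo path: the prompt comes back 3 samples late at 60% level.
    line_in = [0.0] * 3 + [0.6 * s for s in prompt[:-3]]
    cleaned = nlms_echo_cancel(prompt, line_in)
    print(sum(e * e for e in cleaned[-200:]))
```

Once the filter has adapted, the residual contains little of the prompt, so any energy that appears in it while a prompt is playing is likely to be the caller speaking.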

Using multi-function cards that combine these media processing resources with digital network access means that the incoming speech utterances can be manipulated and recorded in a single operation, before being fed to the recognizer. An added bonus, beyond the obvious efficiency gained for your application, is that this also means you avoid the costs of buying separate cards for different functions.

The fact remains that choosing any alternative to multi-function resource cards adds to your overall costs, with the added disadvantages of a longer time to market and reduced margins.

Don't Get Fooled Again
Don't be fooled into thinking only in terms of ASR and TTS when considering speech-based applications, as these are not ends in themselves. They are technologies used to enhance a variety of real-world, end-user applications. A customer-service contact center, for example, may also need to include conferencing and Group 3 fax capabilities. And, even with ASR, you may still need to offer DTMF handling. So, it is probable that you will need other resource features as well. All of these features are applicable to DSP resource cards - or host media processing software - which are themselves essential components of any speech system.

In getting started with speech it is imperative that you fully consider the lifecycle of your product and how it will need to evolve to continue to meet the needs of your customers. Initial requirements are only part of the story; don't forget about future needs - you should always expect to add new features. Furthermore, make sure you can readily scale your solution - be prepared for when you will need to.

Choose a proven, ready-made integration between your chosen speech engine and the media processing resources. Keep your options open: use multi-function resource cards that can be readily integrated, that have a consistent API across the product range, and that support the other functionality your application needs in addition to the essentials of play, record, echo cancellation and DTMF handling.

Don't be afraid to push your vendors to provide you with what you want. They can do it. Good luck with your developments.
