The State of Embedded Speech

Most of what is written about speech technology focuses on server-based speech processing. But there is another speech technology out there that’s powerful enough to sit on a stamp-sized microchip. It’s called “embedded speech.” Advancements in computing power gave server-side speech the boost it needed in the early 1990s. Now that same rocket fuel is launching embedded speech into the limelight.

What Encompasses Embedded Speech?
Embedded speech is the umbrella term for small-footprint speech recognition, speech synthesis (text-to-speech, or TTS), speaker verification and speaker identification performed on a single device. These devices are typically battery powered, and range from small (e.g., toys, telephones) to large (e.g., soda machines, automobiles). Because they are so compact, embedded speech products are largely application- and device-specific. Some applications make it difficult for the end user to tell which speech technology is actually in use. Cell phones, for example, can support name dialing using either embedded speech or server-based speech. To the application developer, the difference is that server-based speech uses a network connection to perform any type of speech processing, whereas embedded speech performs 100% of the speech processing functions on the phone (device).

Looking at Embedded Speech Recognition and TTS
The embedded speech recognition process is similar to server-side recognition. The spoken audio is pre-processed (echo cancellation, end-pointing, adaptation), then digitized to map the voice properties (spectral and temporal representation), so that recognition and language understanding methods (phoneme/word recognition, interpretation/meaning) can interpret the input and return final results to the application. Embedded speech recognition decoding methods are similar to those of server-side speech (hidden Markov models, artificial neural networks), as are language models and dictionaries. Embedded recognition also uses dynamic grammars and finite-state grammars (limited to approximately 50,000 words). Speaker-independent recognition is now common among embedded engines, though speaker-dependent capability is still reserved for some application instances. ART Advanced Recognition Technologies Inc. supports multiple users on the same device without the need for training.

Embedded speech recognition uses compression algorithms and modified search techniques to conserve processing, storage and memory. For large-vocabulary needs (Internet browsing, email retrieval), embedded speech recognition is improved with phoneme recognition and with spelling and pronunciation rules that make it possible to recognize words that don’t exist in the dictionary. Embedded speech recognition is available in many languages to support deployments around the world. Voice Signal Technologies recently announced “multiple language recognition capability” in its embedded speech recognition products.

Embedded TTS engines share language models and dictionaries with embedded speech recognition to further increase system efficiency. Embedded speech synthesis has made great strides in recent years, as has its cousin on the server side. Last year, Speechworks announced custom “branded” voice capability on its concatenative TTS products for both embedded and server-based installations.
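To make the grammar idea from this section concrete, here is a minimal Python sketch of a finite-state grammar for name dialing, in which the contact list supplies the dynamic part. It is an illustration under stated assumptions, not any vendor's grammar format: the state names, the command words ("call", "dial") and the functions build_grammar and accepts are all hypothetical. A real engine would score acoustic hypotheses against the grammar rather than check words that have already been recognized.

```python
# Hypothetical sketch: a tiny finite-state grammar for name dialing.
# State names, command words and function names are illustrative assumptions.

ACCEPTING = {"AFTER_NAME", "END"}  # states where an utterance may legally stop


def build_grammar(contacts):
    """Return a transition table mapping state -> {word: next_state}."""
    grammar = {
        "START": {"call": "NAME", "dial": "NAME"},
        "NAME": {},  # filled in below from the address book
        "AFTER_NAME": {"at": "PLACE", "on": "PLACE"},
        "PLACE": {"home": "END", "mobile": "END", "work": "END"},
    }
    # The "dynamic" part: contact names are added at run time.
    for name in contacts:
        grammar["NAME"][name.lower()] = "AFTER_NAME"
    return grammar


def accepts(grammar, words):
    """Return True if the word sequence is a complete path through the grammar."""
    state = "START"
    for word in words:
        state = grammar.get(state, {}).get(word.lower())
        if state is None:  # no transition for this word: reject
            return False
    return state in ACCEPTING


grammar = build_grammar(["Alice", "Bob"])
print(accepts(grammar, "call alice at home".split()))  # True
print(accepts(grammar, "call carol".split()))          # False: not in the address book
```

Restricting the recognizer to word sequences the grammar allows is one way a small-footprint engine could keep its search space manageable.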
Now synthetic voices can match branded voice talent, yielding two important results: increased application capability (especially in commerce) and personalization, or the ability to make individually tailored suggestions.

Speaker Verification and Speaker Identification at the Device Level
Speaker verification confirms that a person is who he or she claims to be, whereas speaker identification picks the correct person out of a group of people. Processing and memory advances now support speaker verification and speaker identification on board standard microprocessors, eliminating the need for separate voice processing and expanded storage. (Voice Security Systems says its speaker verification user enrollment data takes up less than 800 bytes of storage space.) Embedded verification has been deployed for access control, intruder prevention on children’s diaries, and to unlock cell phones. When speaker verification and identification are done at the device level, there is no cross-channel or cross-device mismatch problem (as when a user enrolls on a landline phone and verifies on a cell phone). Speaker identification is easier with embedded, as the population against which the speaker is compared is usually small. In automotive, where embedded speaker identification is used for personalization, the population consists of family members who share the same car, so the identification process is relatively straightforward. When Mom is identified as the driver, the vehicle amenities (climate control, radio station, mirror positioning) are tailored to her needs and not to those of the last person who used the car: her teenage son.

Hardware and Software Drive Embedded Growth
Hardware for embedded speech varies greatly, and today there are many more configurations to choose from. Silicon Recognition Inc. makes application-specific integrated circuits designed especially for the high-function, low-power needs of speech processing. Sensory Inc. delivers a best-selling speech chip (its RSC series) by taking general-purpose microcontrollers (low cost and low power) and adding special-purpose speech input/output capabilities for recognition, synthesis and control. Mainstream microprocessor and DSP manufacturers (Intel, Texas Instruments, etc.) now support embedded speech processing with high-function, low-power chips as well. The expanded choice of hardware (coupled with increased processing power and memory) has brought down development costs and driven growth in embedded applications. Voice signal processing software is also contributing to the advancement of embedded speech and is especially important for quality performance in telematics and in the noisy environments inherent in wireless use. Voice signal processing software providers include Clarity, SRS Labs and Wavemakers.

Getting to Market
Cost-sensitive applications make winning a numbers game for embedded speech players. Locking up OEMs is a big key to financial success. That said, direct sale opportunities do exist, and direct touch points are certainly required to generate demand with automotive giants. Server-side players tout “out of the box” readiness, while embedded speech players stress the need for custom integration to properly launch a particular application. Integration services ensure that the hardware capabilities of the device can support the desired speech processing task(s).

Animation Complements Embedded Speech
Devices that use embedded speech often come with a screen display as well. Embedded speech providers are taking advantage of this by accessorizing their embedded TTS engines with the synchronized facial expressions and lip movements of a custom 3D character. Marge is a sensible, lovable animated character from Sensory. She may be just what’s needed to encourage users to subscribe to a wireless audio email service.

Distributed Speech Recognition
Distributed Speech Recognition (DSR) is another complementary technology to embedded speech. DSR sits between server-side and embedded on the speech continuum. DSR pre-processes speech input on the device and then uses a telephony channel to send the digitized speech to a remote server for recognition. The European Telecommunications Standards Institute has established a standard approach to this front-end processing (Aurora). The standard theoretically allows DSR and embedded speech to “co-exist” if they both support Aurora. For mobile phone applications, this combination will allow users to have some application functionality even when a wireless network connection is not available (during roaming or in-building use). This may be important for mobile workforces, especially if they know they can synchronize data when a network connection is reestablished. Another advantage is that DSR expands access to remote content and to third-party applications (potentially important for telematics portal plays).

Embedded Market Breakdown
Embedded speech market opportunities fall into three application categories: Appliances, Automotive and Toys/Games.

Embedded ASR Unit Shipments – by Application

Source: Voice Information Associates

Voice Information Associates offers a detailed review of the embedded speech market in its Automatic Speech Recognition: A Study of the World Wide Market report.

Appliances Sector: Telephones (both cordless and wireless) are driving the growth of this market segment, where command-and-control and dialing applications predominate. Optimistic analysts expect more users to access Internet and Web-based services via mobile devices than PCs by 2004. The embedded speech company that has probably made the most progress with regard to mobile phones and PDAs is ART Advanced Recognition Technologies Inc. Its compact footprint and integration with half a dozen of the top mobile chipsets have positioned it to win OEM deals with a large number of international handset manufacturers. LG Electronics of Korea recently announced that it will use Qualcomm’s MSM 6200™ chipset to power its third-generation mobile phones. These 3G Qualcomm chipsets have ART’s phoneme-based name dialing solution embedded on board. This speaker-independent solution provides a friendly user experience, as users are not required to train each contact in their address book. At a recent speech conference, Voice Signal Technologies highlighted speech-to-text SMS dictation as the hot new application for its large-vocabulary engine. With all the new advancements and applications, mobile network operators have more choice than ever before. Many of the mobility applications can use either embedded or server-side speech engines, so the tryouts line is long. Embedded brings with it an upfront cost advantage, but that value is somewhat diminished because the price of embedded speech is typically bundled into the handset price, which many network operators are still subsidizing at the retail level.

Sensory has made headway in the home appliance space with its Password Control Center™ product, produced by GirlTech. Consumers plug two electrical appliances (stereo, lamp, TV, etc.) into the control center and then assign a name to each appliance. From then on, consumers can turn each appliance on or off by speaking its given name.

The hot business application in the appliance sector is in warehousing and logistics: embedding speech processing in wearable devices. VoCollect’s Talkman® solution operates as a thick client using a wearable wireless terminal that supports both speech recognition and speech synthesis on the device. Corporate Express uses more than 300 Talkman terminals for order processing of business-to-business office supplies and computer products. On the warehouse floor, the voice interface helps to improve order accuracy and productivity.

Automotive Sector: Voice Information Associates reports that this segment has the strongest projected growth. Alan Schwartz leads a specialized group within Speechworks focused on automotive and other embedded markets. “Automotive manufacturers are getting directly involved with speech as part of the design of the car…paying $5-20 per car in a fleet to support multiple embedded applications.” Compelling automotive applications include navigation systems with hands-free voice input and graphical/audio output of directions. For $429.95, consumers can buy the KDLX50, JVC's top-of-the-line El Kameleon II car CD receiver with voice recognition. Sensory supplies the embedded speech engine inside the car stereo, which gives the driver command-and-control capability to operate the entire unit by voice.

Toys and Games Sector: This category has traditionally been the low-cost, high-volume segment of talking dolls and diaries, but recent demos include higher-end, Jetsons-like toy robots that aspire to take on the duties of Rosie while entertaining children ages 8 and up. Hasbro’s R2-D2 Interactive Astromech Droid, powered by Voice Signal Technologies, has three voice-activated play modes (Companion, Game, and Command) and will respond to 40 different commands, such as “Go forward 5 units”. The droid displays emotion when you ask it questions like “Do you remember Darth Vader?” This toy’s suggested retail price is $100.

Looking Forward
Advancements in computing power will continue to push forward the capabilities of speech on embedded devices. The line between embedded speech and server-based speech is starting to blur, giving more options to application developers than ever before.

Kathy Frostad is principal advisor for Voice Web Consulting, a specialty consulting firm solely focused on voice recognition technologies and their business application in wireless, enterprise, Internet and PSTN environments. She can be reached at kfrostad@voicewebconsulting.com.
