Embedded Speech Recognition: Is It Poised for Growth?
The Early Days of Embedded Speech Recognition
The embedded speech recognition market has been around about as long as the speech technology industry itself. Over twenty years ago, products such as telephones and toys emerged on the market with algorithms running on 8-bit microcontrollers. Most of these products used speaker-dependent recognition, which requires training, but some used early DSPs that implemented speaker-independent algorithms. A U.S. subsidiary of Tomy Corporation was formed to market voice recognition products, including a telephone that had a single button for voice dialing and speaker-dependent digit dialing. Embedded implementations throughout the 1980s tended to be either high in cost or poor in performance.
Chip and Software Companies Focus on Embedded Speech During the 1990s
The 1990s saw the first public offering of a company focused on selling embedded products with speech recognition. Voice Powered Technologies (VPTI) was a pioneer in speech-controlled consumer products and introduced a speaker-dependent, voice-operated remote control in the early 1990s. The remote was backed by one of the very early telemarketing campaigns and included a videotape on how to use it. VPTI had a string of follow-on products, eventually leading to a voice organizer that had reasonable market success, but not enough to save the company from eventually going out of business.
With advances in semiconductor processing technology, the first dedicated, low-cost, high-quality speech recognition ICs came into production during the mid-1990s from companies such as OKI Semiconductor (using technology from Voice Control Systems), Sensory, and Hualon Microelectronics Corporation (HMC). As memory and processing power became relatively less expensive, software-based recognizers started appearing on digital signal processors (DSPs) for markets such as automobiles and telephones, with companies including ART, Conversay, Lernout & Hauspie, Sensory, and Temic providing the embedded speech recognition software engines.
Attention has recently shifted toward the embedded speech space. The convergence of the Internet, PDAs, cell phones, and various media devices holds a lot of promise for speech recognition. Although the successes are still few and far between, there has been substantial hype that has analysts focusing on, and new players moving toward, the embedded speech markets. Over the past few years, companies such as IBM and Philips have expanded from their large-vocabulary dictation roots and have refocused on telephony and embedded applications. More recently, SpeechWorks has expanded its speech recognition efforts beyond its initial telephony segment and into embedded speech software. 
Almost a dozen other smaller companies have emerged around the world that now focus on small-footprint solutions for speech recognition. Market researchers from firms such as Frost and Sullivan, IDC, Morgan Keegan, and JP Morgan H&Q have begun covering the embedded speech markets. For the first time, they are thinking of speech recognition beyond telephony and computer dictation and are, in fact, projecting fast growth for the embedded market segments.
Market Opportunity for Embedded Speech Recognition
The embedded speech recognition story is very compelling. The user interface on electronic products has changed very little in the past 30 years. We have moved from analog to digital displays and have steadily improved the quality of LCDs, but the basic knobs, switches, and buttons have stayed the same. Access to data through the Internet, satellite, cable, CD-ROM, and other media has exploded, and we now have more available information than we can possibly access or organize. Being able to reach this information by voice is an obvious attraction. Devices are getting more feature-rich and therefore more complex, but our access to the information remains primarily through manual manipulation. Computing devices have shrunk over time and their keyboards have become smaller and more compact, but our fingers have not shrunk. It appears that we have now reached the point where any product with a user interface will soon incorporate a voice user interface.
The speech recognition market in general has always held a widespread and intuitive appeal. We want to communicate with our products in the same way we communicate with each other. We want full-featured products that are easy to use. We want to access loads of information without navigating through menu structures or reading manuals.
Although these concepts are appealing, the markets have not yet attained their potential. Few success stories exist in the speech recognition industry overall, and the embedded markets are no exception. There are no publicly traded speech recognition companies that have reached profitability, and very few of the private companies have reached this point. To make things worse, several high-profile players in embedded speech recognition have gone out of business over the past year, and several of the larger, well-financed players in the industry are expected to pull out in the months and years ahead. 
Many of the most promising markets within the embedded speech recognition industry pose huge challenges. Despite most speech recognition vendors' attempts to handle high noise through echo cancellation, noise subtraction, and other noise reduction techniques, executives in the automotive industry say that nobody's speech recognition engine works well enough in a noisy environment. Most of the leading cell phone companies have developed their own speech recognition technology and are therefore reluctant to go outside for minor improvements in accuracy or features. Competition within the embedded software and IC space has driven prices down, making it difficult for all but the leanest and best-financed players to survive.
With the funding boom during the late 1990s, there was substantial investment in the embedded speech technology space. Hundreds of millions of dollars were spent on development and commercialization of technologies. Many of these technologies are only now coming to market. For example, Sensory Inc. acquired Fluent Speech Technologies in 1999 and has invested millions of dollars in compressing the footprint and productizing the Fluent technologies. Sensory is only now starting to roll out development tools so that external developers can create embedded applications with a very high-powered engine that combines text-to-speech with speech recognition. New noise-immune technologies, continuous digits, and large-vocabulary, small-footprint engines are about to be released and offer a substantial improvement over the state-of-the-art embedded products that are on the market.
What Lies Ahead for Embedded Speech Technologies?
Although the embedded speech recognition market has historically under-performed expectations, now is the best time in history for manufacturers to start implementing speech technologies. New introductions are enabling, or soon will enable, a wealth of platform-specific development tools with very high-quality, small-footprint solutions. Substantial efforts to improve performance in noise are now under way. Chip and software pricing has been pushed down by competing players and should no longer limit sales opportunities in high-volume segments. The near future holds combinations of recognition, synthesis, and animation that bring multimodal I/O techniques into a small-footprint engine. The broad appeal of voice access and control has never waned, and now is an excellent time for products to become voice activated.
A big part of the embedded speech business is making products easier to use and allowing information access in a convenient and safe manner. Embedded speech applications are popping up all across the automotive, home, and personal electronics markets. At the recent Consumer Electronics Show, over 20 products on the floor used speech recognition technology. This number has grown at CES in each of the past three years.
Speech recognition alone is not the solution for the future. Speech synthesis, whether through a compressed digital recording or text-to-speech, is a critical component of a user interface system. By combining synthesis and recognition into a common engine, technology vendors are able to create a much smaller footprint than the sum of the individual parts. The continuing mantra of personal electronics is "smaller, lighter, cheaper," so this type of integration is very desirable to product manufacturers. The future of embedded speech technology is very exciting and includes the integration of new embedded speech technologies such as animated speech. 
Animated speech (when combined with recognition and synthesis) allows the creation of an agent or avatar that can talk, hear, and look very realistic in its lip synchronization and emotional displays. Companies such as Sensory and LIPSinc are pioneering these animation technologies for a future that holds the ultimate dream of a home with animated agents hidden in every wall. These agents can pop up and announce telephone calls, or they can be told to record or play your favorite TV show!
Certainly the speech technology industry has gone through its struggles, and the embedded segment has only recently emerged as a substantive component of the overall market. The embedded opportunities are very persuasive, and some of the solutions offered today are quite compelling. The technologies are better than ever and are being priced aggressively, and manufacturers are starting to deploy embedded speech technologies more widely.
Todd Mozer is president, CEO and chairman of Sensory Inc., which he co-founded in 1994. Mr. Mozer has spent over 20 years in the field of speech technology working with high tech companies in positions of sales, marketing, product development and general management.