Milestones in Speech Technology – Past and Future!

The Past:
Speech and language, and the mysteries and magic surrounding them, have a long and venerable history, reaching back into mythological time. Only in this past half century have serious inroads been made into understanding them well enough to emulate them with computer technology. Many top-notch researchers and engineers worldwide have contributed critical pieces to these puzzles. The examples discussed here illustrate just a few of the key milestones, both technical and commercial. Successful research breakthroughs eventually give rise to new products and applications, sometimes quickly, though often not as soon as desired. Progress has been driven chiefly by a growing understanding of the speech and language processes themselves, in concert with ever-increasing and less expensive computing power.

Speech synthesis technology harkens back to von Kempelen's 1791 "talking machines," which could generate intelligible speech at the hands of well-trained technicians skillfully manipulating a set of bellows to force air through various tubes and apertures that mimicked the shapes and cavities of the vocal tract. In the mid-1870s, Alexander Graham Bell tried to create speech recognition to provide an instrument for the deaf that would turn speech into text. Failing that, he focused his energy on creating what, in 1876, became the telephone!

Over the past half century, speech synthesis techniques have centered around (1) extracting key characteristics, using formants, pitch, and/or other parameterizations such as LPC (Linear Predictive Coding), and then using these to generate intelligible playback (e.g. formant synthesizers, LPC synthesizers, etc.), or (2) modeling the sounds themselves, and combinations of them, and then seamlessly joining them together (e.g. concatenative synthesis). The first set of techniques, though trickier to implement well, has the virtue of requiring low bit rates and much less computation; the second set of techniques, though much more memory-intensive, typically generates more natural sounding speech output. Major commercial laboratories (e.g. Bell Labs, NTT, etc.) as well as academic and government laboratories (e.g. Univ. Amsterdam, JSRU, KTH, MIT, Univ. Tokyo) spearheaded both basic speech production research and synthesis methodologies. Numerous smaller laboratories have also contributed key synthesis techniques and applications.
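
To make the first family of techniques concrete, here is a minimal sketch of LPC analysis in Python. It is an illustration only, not the method of any laboratory named above: it assumes the common autocorrelation formulation solved with the Levinson-Durbin recursion, and the function names (autocorrelation, lpc_coefficients) and the synthetic test signal are hypothetical. The point is that a whole frame of samples is summarized by a handful of predictor coefficients, which is why this family of techniques can work at low bit rates.

```python
import math


def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[lag] = sum of signal[i] * signal[i + lag]."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Estimate predictor coefficients a[1..order] so that
    signal[n] is approximated by sum(a[k] * signal[n - k]).

    Uses the autocorrelation method with the Levinson-Durbin recursion.
    Returns (coefficients, residual_energy).
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]                     # prediction error energy at order 0
    for i in range(1, order + 1):
        if error == 0.0:             # silent or perfectly predicted frame
            break
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


if __name__ == "__main__":
    # Hypothetical test signal: the decaying ring of a single resonance,
    # loosely analogous to one formant excited by one glottal pulse.
    radius, angle = 0.9, 0.5
    frame = [
        radius ** n * math.sin(angle * (n + 1)) / math.sin(angle)
        for n in range(400)
    ]
    coefficients, residual = lpc_coefficients(frame, order=2)
    print("predictor coefficients:", coefficients)
    print("residual energy:", residual)
```

For this synthetic resonance the two coefficients should come out close to 2 x 0.9 x cos(0.5) and -0.81, the values that define the resonance. A practical coder would apply the same analysis to short windowed frames of real speech, typically with an order of 10 or more, and would also transmit pitch and gain.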

In 1936, U.K. Tel introduced a "speaking clock" to tell time. Homer Dudley of Bell Labs demonstrated his "Voder" (a manually controlled speech synthesizer) at the 1939 World's Fair. "Reading machines for the blind" were introduced in the mid-1970s by Kurzweil in the U.S. and NEC in Japan. In 1978, Texas Instruments introduced the very popular "Speak and Spell" learning toy, which contained their new TMS5220 integrated circuit (IC) chip. Laboratory text-to-speech systems started evolving into commercial services and products, such as MIT's "Klattalk," introduced in 1983 as "DECtalk." As processors became more powerful, a host of new synthesizers became available in software in many world languages. Starting in the late 1980s, large-scale concatenative synthesis (e.g. Sagisaka at ATR) became progressively more prevalent. The same approach also became popular for music synthesizers.

Speech recognition has been actively pursued globally by numerous laboratories in the commercial, academic, and government sectors. In 1922, a sound-activated toy dog named "Rex" (from Elmwood Button Co.) could be called by name from his doghouse. Small vocabulary recognition was demonstrated for digits over the telephone by Bell Labs in 1952. At the Seattle World's Fair in 1962, IBM demonstrated their "Shoebox" recognizer with 16 words (digits plus command/control words) interfaced with a mechanical calculator for performing arithmetic computations by voice. Based on mathematical modeling and optimization techniques learned at IDA (now the Center for Communications Research, Princeton), Jim Baker introduced stochastic processing with Hidden Markov Models (HMM) to speech recognition while at Carnegie Mellon University in 1972. In the same time frame, Jelinek et al., coming from a background of information theory, independently developed HMM techniques for speech recognition at IBM. Over the next 10-15 years, as other labs gradually tested, understood, and applied this methodology, it became the dominant approach to speech recognition. Recent performance improvements have been achieved through the incorporation of discriminative training (e.g. Cambridge University, LIMSI, etc.) and large databases for training.
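
The stochastic approach described above scores an utterance by how probable it is under a statistical model of each candidate word. A minimal sketch of that scoring step, the HMM forward algorithm, follows in Python. It is not a reconstruction of the CMU or IBM systems: the two toy word models, their probabilities, and the two-symbol "acoustic" alphabet are all invented for illustration, and the names forward_likelihood and recognize are hypothetical.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Probability of a non-empty observation sequence under a discrete HMM.

    initial[s]        probability of starting in state s
    transition[p][s]  probability of moving from state p to state s
    emission[s][o]    probability that state s emits symbol o
    """
    if not observations:
        raise ValueError("observations must not be empty")
    states = range(len(initial))
    # alpha[s]: probability of the symbols seen so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][symbol]
            for s in states
        ]
    return sum(alpha)


def recognize(observations, word_models):
    """Return the name of the word model that gives the highest likelihood."""
    return max(
        word_models,
        key=lambda name: forward_likelihood(observations, *word_models[name]),
    )


# Two invented left-to-right word models over the symbols 0 and 1.
# Both share the same state structure and differ only in what each state emits.
INITIAL = [1.0, 0.0]
TRANSITION = [[0.6, 0.4],
              [0.0, 1.0]]
WORD_MODELS = {
    "word_a": (INITIAL, TRANSITION, [[0.9, 0.1], [0.2, 0.8]]),  # mostly 0s, then 1s
    "word_b": (INITIAL, TRANSITION, [[0.1, 0.9], [0.8, 0.2]]),  # mostly 1s, then 0s
}

if __name__ == "__main__":
    utterance = [0, 0, 1, 1]
    print(recognize(utterance, WORD_MODELS))
```

Production recognizers differ in almost every detail (continuous acoustic features, log-domain arithmetic to avoid underflow, language models, and trained rather than hand-set parameters), but the principle of comparing model likelihoods is the same.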

Starting in the 1970s, government funding agencies throughout the world (e.g. Alvey, ATR, DARPA, Esprit, etc.) began making a major impact on expanding and directing speech technology for strategic purposes. These efforts have resulted in significant advances, especially for speech recognition, and have created large, widely available databases in many languages while fostering rigorous comparative testing and evaluation methodologies.

In the mid-1970s, small vocabulary commercial recognizers utilizing expensive custom hardware were introduced by Threshold Technology and NEC, primarily for hands-free industrial applications. In the late 1970s, Verbex (a division of Exxon Enterprises), also using custom special-purpose hardware, was commercializing small vocabulary applications over the telephone, primarily for telephone toll management and financial services (e.g. Fidelity fund inquiries). By the mid-1990s, as computers became progressively more powerful, even large vocabulary speech recognition applications progressed from requiring hardware assists to being implementable entirely in software. As performance and capabilities increased, prices dropped.

In 1990, Dragon Systems introduced a general-purpose discrete dictation system (i.e. requiring pauses between spoken words), and in 1997, Dragon started shipping general-purpose continuous speech dictation systems, allowing any user to speak naturally to their computer instead of, or in addition to, typing. IBM rapidly followed suit, as did Lernout and Hauspie (using technology acquired from Kurzweil Applied Intelligence), Philips, and more recently, Microsoft. Medical reporting and legal dictation are two of the largest market segments for this technology. Although intended for use by typical PC users, this technology has proven especially valuable to disabled or physically impaired users, including many who suffer from Repetitive Stress Injury (RSI).

AT&T introduced their automated operator system (e.g. "collect call," "operator," etc.) in 1992. In 1996, Nuance supplied recognition technology to allow customers of Charles Schwab to get stock quotes and to engage in financial transactions over the telephone. Similar recognition applications were also supplied by SpeechWorks. Today, it is possible to book airline reservations with British Airways, make a train reservation with Amtrak, and obtain weather forecasts and telephone directory information, all by using speech recognition technology.

Other important speech technologies include speaker verification/identification and spoken language learning for both literacy and interactive foreign language instruction. For information search and retrieval applications (e.g. audio mining) by voice, large vocabulary recognition preprocessing has proven highly effective, preserving acoustic as well as statistical semantic/syntactic information. This approach also has broad applications for speaker identification, language identification, etc.

What's Coming:
Computer processing power will continue to increase, with lower costs for both processor and memory components. The systems that support even the most sophisticated speech applications will move from centralized locales (e.g. a computer center or server), to distributed configurations (i.e. with some processing done local to the user and the balance done elsewhere), and ultimately to being located primarily with the end user. This trend has been repeated many times (e.g. with computers, telephones, etc.).

On the research side, a great deal of progress has been made, but much remains to be done. Unfortunately, in the wake of the economic downturn and heavy consolidation of speech technology companies over the past five years, corporate and government funding has declined. The technology presently is good enough for certain products and services to be successfully sold and incrementally improved. A great deal more opportunity exists when the fundamentals of the core technology can be thoroughly explored and tested (not possible with previous processing limitations) to remove known sub-optimizations and to enable major new applications. Experienced researchers are not short of ideas to make fundamental improvements; they are short of the resources to implement many of them.

The promise and the opportunities to be realized for speech technologies, and the time-frames for these, are gated by the resources available to pursue these ideas. The first beneficiaries of this new era in speech technology are likely to be the institutions willing and able to look beyond short-term incremental gains to break new ground. Until remedied, present performance limitations will continue to inhibit the utility and commercial returns of products and services. Nonetheless, some very exciting entrants are on the near-term horizon!

We can expect that full, general-purpose, continuous dictation systems will become available in a variety of handheld devices. Speech technologies will be embedded in handheld computers, cell phones, remote controls, automotive navigation systems, appliances, foreign language phrase books, toys, and a lot more!

Speech technology will gradually be incorporated into a wide range of services and products, becoming progressively more pervasive. Multiple speech technologies (recognition, synthesis, verification, etc.) will become increasingly better integrated and bundled together. More natural language dialog systems with better user interfaces should mean that many enterprise applications, such as customer and technical support, can be conducted automatically with huge cost savings and, eventually, greater customer satisfaction.

Lecture and meeting transcripts, as well as broadcast news and your favorite TV shows, will be readily searchable by voice. Voice portals will become better enabled with speech input and output. Speaker verification will become a more prevalent technology, especially used in combination with other security protections (passwords, hand geometry, fingerprints, retinal scans, etc.). More systems will incorporate natural language capabilities, directed dialogs, and multilinguality as needed.

You will be able to talk and give orders to the characters in your video and simulation adventure games. You can expect customized pronunciation help when you are trying to learn a new foreign language on your own. Children will be able to get personalized, friendly reading support on their own, as will adults in need of private literacy instruction. In some stores and bus stations, and on some street corners, you will be able to ask for information from roving robot information kiosks! Key components of each of these future applications have already been demonstrated (at least in prototype form). Speech isn't just for people any more!

References:

History and Explanation of Hidden Markov Models:
Poritz, A.B., Hidden Markov Models: A Guided Tour, IEEE Proc. ICASSP 88, NYC, Vol. 1, 1988.

The IEEE History Center, Automatic Speech Synthesis and Recognition:
http://www.ieee.org/organizations/history_center/sloan/ASSR/assr_index.html

The Saras Institute History of Speech Technology Project:
http://www.sarasinstitute.org/

The Smithsonian Speech Synthesis History Project:
http://www.mindspring.com/~ssshp/ssshp_cd/ss_home.htm

Janet M. Baker worked for IBM Research's Continuous Speech Recognition Group in time-domain signal processing, and then served as VP of research for Verbex, a division of Exxon Enterprises. She was co-founder and CEO of Dragon Systems (Newton, MA). Founded in 1982 with bootstrap financing, Dragon Systems grew to nearly 400 employees globally and $70 million per year in revenues before its sale in 2000. Presently, she heads the Saras Institute and works with the Dibner Institute at MIT to collect, preserve, and make generally available information on the history of speech and language technology.
