
Speech's Holy Grail

Through a remarkable technology developed over decades of research, it is now possible to dictate free text to a computer and have it “recognize” one’s speech and “type” it out, without fingers ever touching the keyboard. Sound waves, vibrations of air, are transduced into electrical impulses by microphones.

The electrical waves are digitized and then analyzed by computer along a variety of parameters, such as length of utterance, tone, rate, rhythm, frequency, and association with other sounds.

Digitized sound is usually subjected to a mathematical transformation, such as the fast Fourier transform, and then further analyzed, or “recognized,” using a variety of methods, which can be weighted and combined in different ways. Probability theory, stochastic analysis, and the hidden Markov model are all employed, and attempts have been made to use neural networks and various forms of artificial intelligence to decipher the correct “meaning” of the sounds.
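To make that front end concrete, here is a minimal sketch in Python with numpy (a modern stand-in for illustration, not anything the systems discussed here actually use) of slicing digitized sound into short frames and taking a fast Fourier transform of each. The frame length, overlap, and the synthetic 440 Hz tone are all illustrative assumptions.

```python
import numpy as np

def spectral_frames(samples, frame_len=256, hop=128):
    """Return the magnitude spectrum of each overlapping frame of audio."""
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # the "fast Fourier" step
    return np.array(frames)

# A one-second 440 Hz tone sampled at 8 kHz stands in for microphone input.
t = np.arange(8000) / 8000.0
features = spectral_frames(np.sin(2 * np.pi * 440 * t))
print(features.shape)  # (number of frames, frequency bins per frame)
```

These per-frame spectra are the raw material that the probabilistic methods above then try to match against known words.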

These data form patterns, which are “recognized” by the computer, or “matched” with typed textual output, to produce intelligible (or unintelligible) words, phrases, sentences, paragraphs, and so on. The problem of speech recognition is a special case of the larger problem of pattern recognition.

In a simple, small-vocabulary example, say recognizing “yes” or “no,” the task can be achieved relatively easily and reliably, as witnessed by its use now by telephone companies: (“Do you wish to accept this collect call?... Would you like us to dial the telephone number you have just retrieved from information for an additional 35 cents? If so, please say ‘yes’ after the tone.”)
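As a toy illustration of why such a small-vocabulary task is tractable, the sketch below matches an unknown utterance’s features against one stored template per word and picks the nearest. The three-number “features” are invented for illustration; real systems use far richer acoustic representations and models.

```python
import numpy as np

# Hypothetical averaged feature vectors, one template per vocabulary word.
templates = {
    "yes": np.array([0.9, 0.1, 0.4]),
    "no":  np.array([0.2, 0.8, 0.5]),
}

def recognize(utterance_features):
    """Pick the vocabulary word whose template is closest to the input."""
    return min(templates, key=lambda word:
               np.linalg.norm(templates[word] - utterance_features))

print(recognize(np.array([0.85, 0.15, 0.45])))  # -> "yes"
```

With only two well-separated words, even this crude nearest-template matching is reliable; the difficulty grows enormously as the vocabulary does.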

The “Holy Grail” of Fast Speech
In fact, computer speech recognition has now reached new heights, and may be about to take off into the stratosphere. The long-awaited “Holy Grail” of easy, fast speech recognition is arriving, slowly but surely. In the “discrete” word dictation systems that have been the rule until now, the speaker is required to pause briefly between words, dictating in an unnatural, staccato manner. That required pause is diminishing, and with some systems it is beginning to disappear.

The machines shown at the recent SpeechTEK 96 exhibition and conference, held in October in New York, demonstrated surprising speed and accuracy. Most excitingly, continuous speech dictation systems have begun to arrive, and more are forthcoming.

IBM has many recent accomplishments in this area. First, IBM’s dictation system, VoiceType, has been upgraded for Windows 95. It can now run on a 16-bit SoundBlaster-compatible card and no longer requires the special IBM accelerator card or IBM PCMCIA card. It will also dictate directly into some applications, rather than only into a “dictation window” from which text had to be copied into an application.

IBM has released a new Windows 95 developer’s kit for incorporating VoiceType into Windows applications.

Perhaps most exciting of all, IBM has chosen to incorporate most of VoiceType, essentially for free, into version 4 of its PC operating system OS/2, nicknamed Merlin. This means that if you buy the new operating system for perhaps $150, you get a $700 dictation package free. It is true that you may need a fast processor and 16-24 MB of RAM (more is always desirable) to run it; nevertheless, this might be a very good bargain for someone who wants to test the waters and try a very high-powered speech recognition system.

Medical Modules
Even more remarkable is the advent of the first IBM continuous speech recognition system. This is a medical radiology (x-ray report) module running on Windows NT and not appropriate for other purposes. IBM is sure to follow with other specialized vocabularies, though a general English version is likely to take much longer to develop. Nevertheless, this is a historic and significant step for speech recognition.

For the continuous speech recognition product, IBM has used essentially the same speech recognition “engine” as VoiceType’s, perhaps upgraded a bit for version 3 of VoiceType. This is the same engine found in the original version of VoiceType (the so-called Speech Server Series), which ran on RISC hardware under Unix AIX on RS/6000 technology. (It is not to be confused with the first product sold under the “VoiceType” name, which was based on licensed Dragon Dictate technology but has since been supplanted by the IBM VoiceType engine. In some ways, the world of speech recognition is a small one!)

In summary, what we have from IBM is a new version of VoiceType for Windows 95, a version of VoiceType included free with OS/2 version 4 (a.k.a. Merlin), a new VoiceType developer’s kit, and a new continuous speech recognition system for x-ray reports, with additional specialized continuous-vocabulary modules sure to follow.

Dragon Systems, a leader in the large-vocabulary speech recognition field, has recently introduced a new version of Dragon Dictate for Windows with some new features. It accepts words spoken at a faster rate than previous versions. Dragon has also incorporated speech synthesis into its system, so that it can now read back whatever has been dictated or typed, or indeed any text on screen, a very useful feature. (There is some evidence that IBM may incorporate similar technology into its own future products; Microsoft will then be under competitive pressure to do the same.)

The Windows versions of Dragon Dictate will now run on SoundBlaster-compatible boards, and faster processors may yield more accuracy and speed with these Dragon products. Dragon is well known for its user-friendly interface, fine, reliable products, and excellent long-term customer support.

Dragon in recent years received a capital infusion from Seagate, and has recently acquired part of Articulate Systems, which ported Dragon dictation technology to the Macintosh platform. There is a very nice version of Dragon Dictate for the Macintosh, selling for about $700 for the classic 30,000-word dictation version and about a thousand dollars more for the 60,000-word “power” version.

Dragon, like Kurzweil, does not currently embed the dictated sounds behind the text, unlike the IBM and Philips products, which allow one to click on a dictated, recognized (or misrecognized) word and play the sound behind it. This can help one remember or verify what was said and correct the dictated material for misrecognitions. The feature is especially helpful if one dictates much material before correcting, defers correction, or has someone other than the original writer do the correcting.

Philips Enters the Arena
Meanwhile, Philips Speech Processing has entered the large-vocabulary arena with continuous large-vocabulary speech recognition systems. Commercially available now are emergency medicine and mental health report dictation modules, with many other specialized vocabularies in the pipeline; an internal medicine module is scheduled to appear in January, and many more medical, legal, and other modules are expected to follow. Built on the same basic recognition engine, each of these modules has approximately 60,000 slots for utterances, of which 20,000 to 30,000 come filled with the specialized vocabulary; the remaining slots are free for words added by the user. The specialized vocabularies, prepared by various companies using the Philips recognition engine, are produced by examining the context of language in samples of many millions of words of text typical of a particular area of endeavor.
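One way to picture how such a module’s slots might be pre-filled is sketched below: count word frequencies over a large domain text sample and seed the slots with the most common words, leaving the rest free for user additions. The slot figures echo those quoted above, while the one-line “corpus” and the seed_vocabulary helper are purely illustrative, not Philips’ actual process.

```python
from collections import Counter

TOTAL_SLOTS = 60_000
PREFILLED_SLOTS = 25_000  # roughly the 20,000-30,000 quoted above

def seed_vocabulary(corpus_text):
    """Pre-fill slots with the most frequent words of a domain corpus."""
    counts = Counter(corpus_text.lower().split())
    prefilled = [word for word, _ in counts.most_common(PREFILLED_SLOTS)]
    free_slots = TOTAL_SLOTS - len(prefilled)
    return prefilled, free_slots

# A tiny stand-in for millions of words of radiology reports.
words, free = seed_vocabulary("chest film shows no acute disease chest film clear")
print(words[:3], free)
```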

Like Dragon’s, Philips’ large-vocabulary end-user systems will sometimes replace a rarely spoken word with a frequently used one. The systems require a sound board as well as a Philips accelerator board; however, a new version of the Philips engine, expected early next year, may run exclusively on a sound card such as the SoundBlaster. The systems run now on Windows 3.x but are expected to run on Windows 95 soon.

In addition, Philips has released a small hand-held digital recorder which records digitized sound directly onto PCMCIA cards. These cards can then be placed in a computer for recognition. A new edition of this digital recorder is also expected early next year. Philips is also offering telephony and smaller-vocabulary products.

Philips’ large-vocabulary dictation systems accept rapidly and normally spoken language in a truly continuous manner: “natural language.” They have two basic modules. The first digitizes language and stores it on hard disk for later recognition; this unrecognized sound may then be moved around, transferred to another machine, or even sent elsewhere electronically for human- or machine-mediated transcription (the new version of IBM VoiceType also offers a delayed recognition option). The current Philips recognition engine requires a special accelerator board, but the new version expected early next year is anticipated to run on a SoundBlaster-type board, which would allow easier use in a notebook computer. Philips’ correction paradigm requires the user to type over incorrect material, replacing it with the correct version; when the voice/language adaptation file for the recognition engine is then updated, the machine learns the speaker’s pronunciation and use of language and vocabulary.
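The correction-then-adapt cycle might be pictured as in the following sketch, where corrected text is folded back into a simple word-usage store and mismatches are noted for retraining. This is only a guess at the general shape of such a loop; Philips’ actual adaptation files are surely far more elaborate.

```python
from collections import Counter

adaptation_counts = Counter()  # stands in for the voice/language adaptation file

def adapt(recognized, corrected):
    """Fold a user's correction back into the adaptation store."""
    # Learn the user's actual vocabulary and usage from the corrected text...
    adaptation_counts.update(corrected.split())
    # ...and note which recognized words were wrong, for later retraining.
    wrong = [r for r, c in zip(recognized.split(), corrected.split()) if r != c]
    return wrong

misses = adapt("the patient has a fever", "the patient had a fever")
print(misses, adaptation_counts["had"])  # -> ['has'] 1
```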

Which Should You Choose?
So which system is best? Dragon’s is user-friendly, reliable, and stable, with excellent customer support. IBM VoiceType is quite fast and accurate. The Philips and IBM continuous systems point the way to the future but at present are available only for certain specialized vocabularies. Continuous dictation is much more natural and may greatly increase the spread and use of speech recognition systems. Which system to choose, and whether to use a system at all, depends on personal preference, personal writing style, how much writing one does, and budget. It is also possible to use different speech recognition systems for different purposes: one system for initial dictation, another for editing, and a third for entering data into spreadsheets or databases.

It has been pointed out, and it is certainly true in our experience, that shorter words are the ones more likely to need correction; this appears to be a general principle, true of all the systems, with longer words more likely to be recognized correctly. Fortunately, because the shorter words are short, correcting them is relatively little work: typing “a,” “to,” and the like is a rapid and easy task.

Natural Speech
It is perhaps worth noting some differences in the rate and type of speech that speech recognition systems accept. First, there is a spectrum between discrete and continuous systems: the former require words widely spaced in time, while the latter accept words “run together,” a phenomenon known as co-articulation, in which adjacent words are articulated together.

Discrete speech refers to speaking one word at a time, slowly, with a pause on either side of each word. Continuous speech refers to running one’s words together. The phrase “natural speech” may be used to refer to, among other things, the way speech is normally and naturally used. Natural speech does in fact contain natural pauses and is not totally continuous: some people pause more than others, there are natural places to pause, and some words tend to be spoken in groups with others, naturally co-articulated, forming what are called “utterances.” It is possible to train a discrete-word system such as Dragon or IBM VoiceType to recognize such a phrase as a unit. This speeds the rate of dictation and moves it closer to natural language, forming a spectrum between discrete-word and continuous dictation. IBM VoiceType was a large step toward a system that accepts faster and faster speech; in fact, the IBM VoiceType engine will attempt to “parse,” namely dissect into different words, speech uttered continuously. Moreover, essentially the same IBM engine is used in the completely continuous speech recognition system just released with the first specialized vocabulary, for x-ray reports.
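To illustrate what “parsing” a continuous utterance into separate words involves, here is a toy dynamic-programming segmenter in Python. The pause-free letter string stands in for a stream of recognized sounds; real engines of course score acoustic and language evidence jointly rather than matching spellings against a word list.

```python
# A toy vocabulary; note "wearing" could tempt a greedy left-to-right match.
VOCAB = {"where", "are", "we", "going", "wearing"}

def segment(stream):
    """Find one segmentation of the stream into known vocabulary words."""
    best = {0: []}  # best[i] holds a valid word sequence covering stream[:i]
    for i in range(1, len(stream) + 1):
        for j in sorted(best):
            if j < i and stream[j:i] in VOCAB:
                best[i] = best[j] + [stream[j:i]]
                break
    return best.get(len(stream))  # None if no segmentation exists

print(segment("wherearewegoing"))  # -> ['where', 'are', 'we', 'going']
```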

Philips enters the speech dictation market with exclusively continuous speech recognition for large-vocabulary dictation, and is thus at the other end of the spectrum. BBN has also shown a continuous system in the research phase, which accepted continuous, speaker-independent speech using the Wall Street Journal vocabulary. This particular system appeared to want one to speak continuously but at a fairly even, regular pace.

The Philips system, and perhaps the new IBM continuous system, appear to accept speech with the normal accelerations and decelerations that many people incorporate into their natural speech. Dragon, too, has several times demonstrated a research-phase continuous large-vocabulary speech recognition system based on the Wall Street Journal vocabulary model.

Future of Continuous Speech
Computer “intuition” of punctuation remains unrealized in speech recognition. One currently needs to dictate punctuation to both discrete and continuous systems: “comma,” “period,” and “new paragraph” must be explicitly stated, interrupting the flow of one’s thought, another deviation from truly “natural” speech. The challenge remains for computers to learn to recognize punctuation. For example, given the utterance PAUSE “Where are we going” PAUSE, with the last word spoken in a rising tone, one may intuit a question mark at the end: “Where are we going?”
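A speculative sketch of that intuition follows: if a pitch tracker (assumed to exist upstream; the pitch values here are invented) reports a rising contour over the last few samples of an utterance, guess a question mark, otherwise a period. Nothing in this sketch reflects how any shipping product works.

```python
def guess_punctuation(pitch_contour_hz):
    """Guess '?' if pitch rises at the end of the utterance, else '.'."""
    tail = pitch_contour_hz[-3:]  # look at the last few pitch samples
    rising = all(a < b for a, b in zip(tail, tail[1:]))
    return "?" if rising else "."

# Hypothetical pitch track (Hz) rising on the final word "going".
print("Where are we going" + guess_punctuation([180, 175, 190, 210, 230]))
```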

We are moving closer and closer to continuous large-vocabulary speech recognition systems that accept natural language as it tends to be spoken. The advantage is that the speaker need not think about the dictation process but can concentrate on what is being said, increasing speed and, theoretically, effectiveness.

In the future, we anticipate greater accuracy from dictation systems, which will probably become more and more continuous and eventually include computer intuition of punctuation. Greater accuracy may be achieved by using larger windows of data for analysis, looking more and more at the context of words and the expected, grammatical ways in which words are used, yielding better and better systems.
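One simple form such contextual analysis can take is a bigram word model, sketched below with invented counts: when the acoustics leave several candidates, the word most often seen after the previous word wins.

```python
from collections import defaultdict

# Invented counts of how often each word followed "the" in some training text.
bigrams = defaultdict(dict)
bigrams["the"] = {"two": 30, "to": 1, "too": 1}

def pick(prev_word, acoustic_candidates):
    """Choose the candidate seen most often after the previous word."""
    return max(acoustic_candidates,
               key=lambda w: bigrams[prev_word].get(w, 0))

# "two", "to", and "too" sound alike; context after "the" favors "two".
print(pick("the", ["two", "to", "too"]))
```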

We expect user interfaces to become friendlier and easier to use. Speech recognition will probably become more integrated within operating systems and applications, as is happening with the integration of IBM VoiceType into the new release of OS/2, and will begin to run more silently in the background. The command-and-control mode will become better integrated with the dictation mode. These developments will take place gradually, over many years.

In some ways, speech recognition is in its infancy, despite several decades of work. It is perhaps analogous to the beginning of the automobile industry, when people drove a Model A Ford to town slowly, changed tires several times on a short trip, and drove with difficulty but with fun. Now cars have become practical and almost intuitive in the complex ways they work. Ironically, little computers are being included within the structures of automobiles, running silently but also rendering the functioning and repair of cars more complex.

One might also anticipate replacing one’s speech recognition system every few years, just as one replaces one’s car and computer every few years as improvements occur. It is arbitrary where one jumps in, and there is no need always to wait for the latest model. Improvement in speech recognition systems is slow and incremental, and skills in the use of these machines may, for the most part, be transferred from system to system. The beginning is the biggest leap, when one enters for the first time the world of driving, or of dictating; new doors of opportunity and experience are then opened.

In summary, speech recognition technology is ready now to help new users write and transcribe more efficiently, by voice rather than by typing. The new systems are quite fast and accurate, and with the advent of the continuous engines the technology is becoming even faster and more natural. As in the automotive industry, we anticipate that speech recognition systems will become better and better, year after year, and that one may replace one’s recognition engine every few years, just as one may replace one’s car. Since dictation to a computer is now not only possible but fairly easy and inexpensive, the writer has the opportunity to concentrate on the challenge of composition itself, without being burdened by the process of typing.

Bon Voyage, HAL!

(This paper was dictated on IBM VoiceType, and edited by a Dragon system, a mouse, a keyboard, and a pencil.)

Peter Fleming and Robert Andersen, speech recognition consultants at Aristotle Systems may be reached at aris@world.std.com or (617) 923-9356.
