The Verdict is in for Continuous Speech
For someone who has been using and testing speech recognition systems for over five years, the new continuous speech, large vocabulary dictation systems available now are like breaths of fresh air, truly marvelous, the industry's version of the Holy Grail.
Systems are coming to market that are powerful, fast and accurate. They have sophisticated paradigms. These are marvelous utilitarian programs that will change the entire process of writing and word processing. It is particularly exciting to see two new systems from industry leaders being released at about the same time - Dragon NaturallySpeaking and IBM ViaVoice.
The Dragon NaturallySpeaking product was in general release as we went to press and we were able to compare it to a beta version of the IBM ViaVoice product. The ViaVoice should be ready for shipping by the time this review appears. In addition, Dragon has announced that it intends to ship a "Deluxe" edition of its product sometime this fall. The Deluxe edition was not available for us to review at press time, but we will be covering it soon.
We had the opportunity to test Dragon's new NaturallySpeaking version 1.0, as well as the beta version of the new continuous IBM dictation system called ViaVoice. We present our preliminary impressions here. More thorough evaluation will require additional testing over time.
Hardware
Both the IBM and the Dragon ran well and concurrently on a notebook system: an Ergo Brick 3: 200 MHz Pentium MMX with 48 Megs of RAM. We chose the Ergo because it has a reputation for being a high quality, rugged, low cost notebook, with a good sound system, and from a small company which offers good individualized service.
The minimum requirement for the Dragon is a Pentium 133 MHz, but not necessarily an MMX, with 32 Megs of RAM running on Windows 95, or 48 Megs of RAM on Windows NT 4.0, and at least 60 MB of hard disk space. For the IBM ViaVoice, the minimum requirement is a Pentium 150 MHz with MMX, and again 32 Megs of RAM on Windows 95, with at least 125 MB of hard disk space. Both require a high-quality sound card compatible with Sound Blaster 16 bit sound or above. The sound card which came embedded in the Ergo worked admirably.
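The minimums above can be written down as a small checklist. The following is only a sketch in Python: the class and function names are hypothetical, the free disk space figures are assumed for illustration (the review does not state them), and sound card compatibility is not modeled. Only the MHz, MMX, RAM, and disk minimums come from the text.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    cpu_mhz: int
    has_mmx: bool
    ram_mb: int
    free_disk_mb: int  # assumed value; the review does not report free disk space
    os: str


@dataclass
class Requirements:
    product: str
    min_cpu_mhz: int
    needs_mmx: bool
    min_ram_mb_by_os: dict  # OS name -> minimum RAM in MB, as stated in the review
    min_disk_mb: int


# Minimums as quoted in the review.
DRAGON = Requirements("Dragon NaturallySpeaking", 133, False,
                      {"Windows 95": 32, "Windows NT 4.0": 48}, 60)
VIAVOICE = Requirements("IBM ViaVoice", 150, True, {"Windows 95": 32}, 125)


def shortfalls(machine, req):
    """Return the reasons a machine misses the stated minimums (empty if none)."""
    problems = []
    if machine.cpu_mhz < req.min_cpu_mhz:
        problems.append(f"CPU {machine.cpu_mhz} MHz is below {req.min_cpu_mhz} MHz")
    if req.needs_mmx and not machine.has_mmx:
        problems.append("MMX is required")
    min_ram = req.min_ram_mb_by_os.get(machine.os)
    if min_ram is None:
        problems.append(f"no RAM minimum stated for {machine.os}")
    elif machine.ram_mb < min_ram:
        problems.append(f"RAM {machine.ram_mb} MB is below {min_ram} MB")
    if machine.free_disk_mb < req.min_disk_mb:
        problems.append(f"free disk {machine.free_disk_mb} MB is below {req.min_disk_mb} MB")
    return problems


# The test notebook from the review, plus a hypothetical older desktop.
ERGO = Machine(cpu_mhz=200, has_mmx=True, ram_mb=48, free_disk_mb=500, os="Windows 95")
OLDER = Machine(cpu_mhz=133, has_mmx=False, ram_mb=32, free_disk_mb=500, os="Windows 95")

for req in (DRAGON, VIAVOICE):
    for label, machine in (("Ergo", ERGO), ("older PC", OLDER)):
        print(req.product, "on", label, ":", shortfalls(machine, req) or "meets the minimums")
```

Under these assumptions the Ergo clears both sets of minimums, while a plain 133 MHz Pentium clears only the Dragon's.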
Our Dragon came with a VXI microphone which included a battery power source, although some Dragons may come with other microphones. The IBM came with an Andrea headset microphone Model NC-50u. It was able to utilize the phantom power from the laptop. It came with an optional battery power source, which proved unnecessary.
We found the Andrea NC-50u microphone to be somewhat more attractive. It had two plugs, one for the microphone and one for the headset speakers. The Andrea was lighter, smoother, did not require batteries, and fit more comfortably on the head. The VXI contained a sharp metal edge and a wire headset piece. The Andrea folded nicely and fit easily into a briefcase or the laptop carrying case. Both appear to be directional noise canceling high quality electret microphones with good acoustical properties.
Dragon and IBM should consider offering a telephone-handset-style microphone as well, for people who may feel self-conscious about wearing a headset in public or at work.
Setup And Training
Each system took about an hour to set up, and about another hour to train to one's voice by the initial enrollment process.
The recognition rate seemed to be somewhat better with the Dragon. Of course it is difficult to compare accuracy rates and have adequate control of the many variables such as microphone positions and the quality of enrollment. Also, it is unclear how each system will be affected by training over time.
The IBM program opens somewhat faster than the Dragon, which takes a little longer to load each time one clicks on its icon. The Dragon opens with a color photograph of a man sitting in front of a computer. We grew tired of this image. Why not just the fascinating drawing of the red dragon?
The Dragon microphone seems to turn on and off more quickly than the IBM microphone. After turning on the IBM microphone, a voice from the computer instructs one to "begin dictation." In some ways, this is annoying to hear each time, but it can be a useful feedback device. Similarly, the IBM announces when the dictation mode has stopped. (IBM might consider making these announcements an optional feature which could be activated by the user.)
The Dragon seemed a little crisper, cleaner and faster, but this may be a subjective impression. There are reports that the Dragon works faster and better with a faster processor and more memory. It may actually function with a processor slower than the 133 Megahertz recommended by Dragon, but it is said to perform better with an even faster processor. We were happy with the performance on a 200 Megahertz MMX with 48 Megs of RAM on Windows 95.
The greatest difference between the two systems was the correction paradigm used. The IBM had the advantage of having the recorded sound behind the words, so that one could double-click on a word and the system would play back the dictated sound. This aided in the correction process as well as in editing. The IBM also contained a speech synthesis system which one of us found useful in the process of editing. In addition, the IBM allowed one to work directly in Microsoft Word, whereas one could work in the Windows 95 WordPad with either system. From WordPad, material could then be moved elsewhere by cutting and pasting, or using file management. The document could be saved as a text file or as a Word document.
IBM misrecognitions were most easily corrected by typing the correct word. In theory, one could double-click on the misrecognized word and then say the correct word; in practice, however, this method of correction did not work well in our hands. We are told by technical support that this method of correction does not update one's voice files. Only corrections typed in the dialogue box are said to update one's voice files and thus improve accuracy.
The Dragon also allowed one to correct misrecognitions by typing the correct word into a dialogue box, similar to the method of the IBM. In addition, the Dragon has other correction methods which can be performed by voice. One can try correcting the misrecognized word by voice. One can spell a misrecognized word in the dialogue box by voice, and one can select an alternative choice from the dialogue box by voice. One can correct the Dragon by typing or redictating or by any method, without fear of corrupting one's voice files, or at least so the instructions stated. IBM said that this principle applied to their system as well.
On paper, this approach would seem desirable in that one only corrects and revises one's voice file when a mis-recognition is actually corrected. In point of fact, how both systems will perform over time, whether one will improve more than another, whether one will degrade more than another, can only be tested over time. (We will attempt to keep you posted here about these conditions and developments.)
Another aspect of this correction process is one of personal taste. Some people may prefer to perform their corrections by typing, which we found slower, but which is aided by the IBM system's recorded sound. The recording is also a useful feature if someone other than the original dictator is performing the corrections.
Positives And Negatives
Perhaps it is a matter of settings, but we found the IBM microphone to be very sensitive when it was on, misinterpreting extraneous noises. One had the desire to turn on the IBM microphone simply for dictating and then turn it off quickly.
It was more natural to leave the microphone on with the Dragon system. At times the Dragon system would mis-interpret stray noise as unwanted words, but it seemed slightly less irritable than the IBM microphone.
With the IBM system, one had the impression that while the microphone was on, one was continuously recording in a linear manner on the hard disk, and temporarily using up much hard disk space. This is the "down-side" of recording the sound behind the words. With Dragon, when the microphone was on, it did not seem to be recording any noise, but merely waiting for the next signal to process in order to produce some text or command. Perhaps in the future, IBM and Dragon will allow users to turn on, or turn off, the capacity to record the speaker's voice.
The systems only occasionally mistook commands for dictated text, or the reverse. These two modes seem to be fairly well separated by design. However, we did not stress the systems in this regard.
Usually it seems safer and more efficient to carry out commands by mouse or keyboard, unless one is physically challenged, although this is a matter of personal preference. It was possible in the Dragon to hold down the shift key in order to be sure that what one said would be transcribed rather than taken as a command.
It is of interest that one can run both systems together on the same computer and even have both windows open at the same time, or at least so we found to be the case with the Ergo.
Dragon had many new, carefully thought out features dealing with sophisticated aspects of speech recognition dictation. IBM seemed to be ahead of the game with their previous product, and perhaps this is a matter of leapfrog.
Nevertheless we have now reached a very sophisticated plateau. If it does not degrade with time, and especially if it improves with time, the Dragon is certainly an outstanding product, and a very powerful tool for writing, word processing, transcription, dictation and communication.
The IBM system is also extremely sophisticated with great utility and may prove preferable to many people because of its recorded sound behind the words and its built-in speech synthesizer, combined with a very accurate recognizer able to take speech rapidly and in a continuous manner. However these differences between the two products, namely voice recording and speech synthesis, may disappear with the coming of the new Dragon Deluxe edition. IBM, in turn, is no doubt planning upgrades to its product, with further improvements and polishing. Stay tuned!
Pricing And Availability
The IBM is scheduled to be priced at less than $200 and may even become available for less than $100, as IBM has done with its previous product. The Dragon, while listed at about $700, has been selling for about $300, and the street price may drop further to about $200, and be available in stores such as CompUSA.
These are no longer "expensive" products. However the computers to run them on are not cheap, and many people will need to buy a new machine or upgrade their computer processor and memory in order to run them.
We believe these products have now reached the level and the price point which will begin to bring them into widespread use among those performing significant amounts of word processing or writing. The Dragon on a thousand dollar Compaq Presario Model 2200 ($800 if you already have a monitor), with a little added memory to get up to 32 Megs of RAM, could be available as a desktop machine for between $1200 and $1800 total price including the Dragon, although we have not tested the compatibility of the Compaq sound card.
Or consider the inexpensive but good quality "PCs For Everyone" available at pcsforeveryone.com (617) 868-0068. These come with an actual Sound Blaster card, and therefore should theoretically work well. The Dragon might even run on a slower processor than the 133 MHz, with a slower response time and an unclear effect on accuracy. The Dragon would also run on a less expensive notebook at 133 MHz and might be available in this rendition for perhaps between $2500 and $3000.
Consider the Ergo Brick 2. There is a suggestion that the Dragon may actually run better on a faster processor. The IBM does require a slightly more expensive machine with at least a 150 MHz MMX processor, recommended by IBM. We have no data whether or how it would run on a slower or non-MMX machine.
Each dictation product is available from its respective company directly, or from value added resellers. Now these products are beginning to make their way into computer stores as well. This may be a sign that they are beginning to become widespread.
See ergo-computing.com (800) 495-3746, pcsforeveryone.com (617) 868-0068, compaq.com (800) 308-7774, dragonsys.com (617) 965-5200 (area code about to change), and ibm.com (search "ViaVoice") (800) 426-2255.
There are differences between these two systems but there are also many similarities. In some ways, they are more similar than different. With the expected revision of the Dragon product, this statement may become increasingly true.
Both systems require a new user to learn how to use a microphone properly, and both require patience in learning complex software programs, especially learning how to correct recognition mistakes, which is essential to the training and proper functioning of these speech recognition systems. Thus a naive user cannot simply pick up the system and instantly use it. They must learn the system and train it. But nevertheless the process is quite natural.
The IBM continuous product, now in beta, builds on the beautiful work of its predecessor. VoiceType and Simply-Speaking went beyond simple discrete word recognition in accepting phrases and quite rapid speech, recognized in a very accurate manner.
IBM has two very valuable features which are not yet present in the Dragon. First, the IBM system allows one to play back one's recorded speech, which helps one both with the transcription process and with the correction process. Secondly, the IBM contains an integrated speech synthesizer, allowing one to play back what has been "typed out" on the screen, read by a "computer voice". One of us found this to be useful for correcting and editing. However, these differences may diminish with the coming of the Dragon upgrade expected sometime this fall.
In our hands, the Dragon appeared slightly more accurate. However, it is difficult to control for the idiosyncrasies involved in enrollment and microphone use, which continue to play such an important role in shaping the accuracy of a speech recognition system.
Is there room for improvement?
Of course! Nonetheless these are two beautiful products with enormous functionality, signaling the beginning of a new era in word processing and human-computer interface design.
Specialized Vocabularies Expected
We anticipate that both Dragon and IBM will release products with specialized vocabularies in areas such as law and medicine.
These may be sold as add-ons to the existing products. IBM already has a radiology version of this continuous speech recognition product designed for writing x-ray reports. One can add vocabulary to the general English IBM product reviewed here, and thus tailor the IBM to one's specialized needs. The Dragon allows one to introduce one's own specialized vocabulary with its feature of the vocabulary builder. The new anticipated upgrades should allow a single user to employ several different vocabulary sets. However, what we have here, in each product, the IBM and the Dragon, should be more than adequate for many writers.
Moreover for a couple hundred dollars extra one could purchase both products, and use them together! One can then add the upgrades as they become available.
Other companies already have, or may soon release, continuous dictation products, however none of these are yet a general English version: Kurzweil, recently acquired by Lernout and Hauspie, has suggested that they may ship a continuous large vocabulary speech recognition system by the end of this year. They have recently released a continuous speech recognition command and control system.
Philips currently offers a continuous dictation engine which can be used with the specialized medical and legal vocabularies now available. The Philips model also records the sound behind each word. Thus the world of continuous dictation is suddenly exploding in 1997.
When To Buy?
Is this a good time for a new user or company or institution to jump into the use of speech recognition products for dictation?
We would say, definitely yes. These are products for which we have been waiting for years. They still require training, and correction, but then, so does the human brain. It makes sense that good systems will improve with training. Nonetheless the baseline recognition on these systems is truly phenomenal. We would expect gradual enhancements and upgrades throughout our lifetime. These will be gratefully accepted. However the beautiful basic engines are now here and available. They should be extremely useful in their present state of development. These are elegant and very powerful products just as they are.
The transition from discrete word to continuous recognition is a truly wonderful time in computer science. We anticipate justified widespread interest in these systems. They are most useful to people who need to (or choose to) write a considerable amount.
The noise canceling microphones with which these systems come allow them to be used even in noisy environments. When one speaks into the microphone (which should be very near the mouth), the microphone focuses on the sounds which are near, rather than sounds farther away. One may speak quietly into the microphone, allowing for some degree of privacy even in a crowded work environment, just as people talk on the telephone in such settings.
For the writer, or especially the writer who does not wish to type, or who finds typing uncomfortable or difficult, or who prefers to dictate, and that includes a very large number of people in the world, these are actually model systems. Moreover they would be useful to transcriptionists.
Both the IBM and the Dragon are outstanding products. Both are able to accept speech very rapidly and very accurately, and both are the result of many years of hard research by many people over time.
This review was dictated using a combination of IBM ViaVoice and Dragon NaturallySpeaking. Editing was performed by voice, typing, and pencil.
Peter Fleming and Robert Andersen, consultants in speech recognition, are available at (617)923-9356 or firstname.lastname@example.org.
IBM Releases ViaVoice
As this issue of Speech Technology went to press, IBM made the first shipments of ViaVoice, the company's first general purpose, continuous dictation product for the consumer market. Speech Technology reviewed a beta copy of ViaVoice for the article in the current edition.
"Businesses and individuals are clamoring for products that make computing easier, more productive and fun," said W.S. (Ozzie) Osborne, general manager of IBM Speech Systems. "ViaVoice accomplishes all of these goals - further extending IBM's effort to make speech recognition available and affordable to the mainstream computing market, by adding the exclusive features that our customers have requested."
People using ViaVoice can expect to enter text at up to 140 words per minute, more than three times faster than the average computer user can type.
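The arithmetic behind that comparison is easy to check. The sketch below, in Python, uses the 140 words per minute figure from the announcement; the 40 words per minute typing rate and the 1,000-word document are assumptions chosen for illustration, and time spent correcting misrecognitions is not counted.

```python
DICTATION_WPM = 140   # figure quoted in the announcement
CLAIMED_SPEEDUP = 3   # "more than three times faster"

# For the claim to hold, the average typist must be slower than this rate.
implied_typing_ceiling_wpm = DICTATION_WPM / CLAIMED_SPEEDUP


def minutes_to_enter(words, wpm):
    """Minutes needed to enter a given number of words at a steady rate."""
    return words / wpm


ASSUMED_TYPING_WPM = 40  # illustrative assumption, not a figure from the article
DOCUMENT_WORDS = 1000    # illustrative document length

print(f"Implied typing ceiling: {implied_typing_ceiling_wpm:.1f} wpm")
print(f"Dictating {DOCUMENT_WORDS} words: {minutes_to_enter(DOCUMENT_WORDS, DICTATION_WPM):.1f} minutes")
print(f"Typing {DOCUMENT_WORDS} words: {minutes_to_enter(DOCUMENT_WORDS, ASSUMED_TYPING_WPM):.1f} minutes")
```

In practice the gap narrows by however long the user spends correcting recognition errors, which the quoted figure does not address.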
"With the introduction of ViaVoice, IBM has redefined what voice technology is in the PC marketplace," said John Bunkle, president Workgroup Strategic Service. "IBM has delivered on the promise of true continuous speech recognition. Not only is this an IBM first, it is an industry first, reshaping and re-defining the "man/machine" interface and common applications which become voice capable with this technology."
ViaVoice works with most Pentium computers available today. The ViaVoice is priced at $99, which includes a noise canceling headset microphone.
For more information about IBM speech systems, visit the web at www.software.ibm.com/is/voicetype.
Dragon Deluxe Debuts
Dragon Systems, quickly following up on the announcement of IBM's ViaVoice, has announced the first addition to their Naturally Speaking product line.
Aimed at business and corporate markets, the Dragon Naturally Speaking Deluxe Edition adds feature enhancements, including multiple user and topic configurations, increased active vocabulary sizes, text-to-speech capabilities and integration with the Dragon Dictate software which allows for completely hands-free operation of a PC.
The deluxe edition is available through Dragon's VAR channel at the list price of $695. The price of the personal edition of Naturally Speaking will be dropped to $349. The personal edition will be available in catalogs and major retail outlets including CompUSA.
"The market for Dragon Naturally Speaking is expanding rapidly due to industry awards, many positive reviews and overwhelming response," said Roger Matus. "Deluxe edition is just the first of several Dragon Naturally Speaking products that will be aimed at people who must create text to get their work done."
New features of the deluxe edition include:
Multiple User Support - allowing more than one user to create their own voice and store the voices on the same computer.
Multiple Topic Support - allowing users to store custom words and language usage information on several topics to increase accuracy for each topic.
Recorded speech - allowing users to listen to what they said for easier proof reading and editing.
Text-to-Speech - allowing users to have words that appear on the screen read aloud to them using high quality text-to-speech from Elan Informatique.
Integration with Hands-Free Software - Dragon Dictate Classic Edition is included with the deluxe edition to provide hands-free computer control.
Both the personal and deluxe editions support Microsoft Windows 95 and Windows NT 4.0.
For more information about the Dragon products, contact Renee Blodgett, Dragon Systems, Inc., 617-965-5200 ext. 348.
Companies and Suppliers Mentioned