Language Models: Making Sense of Speech
The general public knows it will take a few days to get to know a new car. They will drive gingerly, carefully checking mirrors and self-consciously touching the gas pedal until they know the feel of the machine.
But due to unfortunate, often repeated references to Star Trek, 2001: A Space Odyssey and other science fiction stories, the general public has no such tolerance for speech recognition dictation. "It's not plug and play," "You'll spend hours training it," and "It's no good if you have the wrong headset," are common complaints. We live in the age of instant gratification. How can these lofty standards be met, particularly for users with extensive specialized vocabularies?
One unique solution offered by VoiceAutomated of Huntington Beach, Calif., is language modeling. A language model is to speech recognition as grammar is to the English language. It is the model that tells the system that a group of words recognized as "read corvette" in general English was most likely said with the intended meaning of "red Corvette." It is the model that ensures a doctor's report shows the doctor said "Patient demonstrates normal sinus rate without murmur," instead of "Patent demonstrates normal sign us gate add for them."
Such models are known by many different names. IBM refers to a continuous speech language model as a "domain," Philips calls it a "context," and Dragon Systems commonly refers to them as "vocabularies."
There are two primary pieces of technology that enable speech recognition to work: the software and the microphone. The software is commonly referred to as a phonetic recognition engine. This engine breaks down the spoken word into phonemes, the smallest measurable pieces of speech.
The microphone produces an analog signal, which is converted to a digital signal through a series of converters on the computer's sound card. The digital representation of the sounds is then broken down into mathematical representations of the audio, referred to as feature vectors.
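That framing step can be sketched in a few lines. This is a simplified illustration, not a real engine: the frame size and the two features below (mean amplitude and energy) are stand-ins for the richer spectral features production systems compute.

```python
def feature_vectors(samples, frame_size=4):
    """Split a digitized signal into fixed-size frames and compute a tiny
    'feature vector' (mean amplitude, energy) for each frame."""
    vectors = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / frame_size
        energy = sum(s * s for s in frame) / frame_size
        vectors.append((round(mean, 3), round(energy, 3)))
    return vectors

# A short digitized signal, as the sound card's converters might produce it.
signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
print(feature_vectors(signal))  # → [(0.5, 0.375), (-0.5, 0.375)]
```

Each tuple summarizes one slice of audio; it is these compact summaries, not the raw waveform, that the recognizer compares against the user's speech profile.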
Training is Not Enough
A new user must train the system by dictating from standardized scripts to provide the system with examples of how they speak. This allows the system to recognize the user's voice patterns and create the user's individual "speech profile," which becomes the foundation of the system's processing capabilities. The process takes about half an hour.
The feature vectors are compared to the user's speech profile, and from that comparison the system determines which phonemes the user most likely spoke. Once a long string of phonemes is obtained, the system tries to determine which words were most likely spoken, given the order of the phonemes. These probable words can then be used in the next stage of recognition to predict what the user actually said.
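The phoneme-to-word step amounts to a dictionary lookup that can return several candidates per sound. A minimal sketch follows; the phoneme spellings and the tiny dictionary are invented for illustration, not a real phone set.

```python
# Toy pronunciation dictionary: phoneme strings -> words that sound like them.
PRONUNCIATIONS = {
    "r eh d": ["red", "read"],   # "red" and past-tense "read" sound alike
    "r iy d": ["reed", "read"],  # "reed" and present-tense "read" sound alike
    "b uh k": ["book"],
}

def candidate_words(phoneme_groups):
    """Return the list of candidate words for each group of phonemes."""
    return [PRONUNCIATIONS.get(p, ["<unknown>"]) for p in phoneme_groups]

print(candidate_words(["r eh d", "b uh k"]))  # → [['red', 'read'], ['book']]
```

The acoustics alone cannot decide between "red" and "read"; resolving that ambiguity is exactly the job the language model takes on next.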
A user may speak "Please read the red book Mrs. Wright wrote," which the system hears as, "Please reed the red book miss his right wrote." As long as the system is only "listening" for your words, without any analysis based on a custom language model, you can expect these types of errors.
Creating an accurate language model for a profession means saturating a topic: building a statistical representation of all the words in the model and their positional interrelationships. For medicine, that requires processing millions of words from reports dictated by physicians.
The reports being analyzed must be error-free, since any errors in these documents would be introduced into the model and result in non-recognition of spoken words or correct recognition of a misspelled word. In addition, the words need to be handled in a very consistent manner, since speech recognition systems are case sensitive. For example, the word "avenue" could appear in the reports as "Avenue, AVE, Ave, Ave., AVENUE or avenue."
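That consistency pass can be sketched as a small normalization table. The mapping below is a hypothetical illustration of the idea, not the actual tooling described in the article.

```python
# Canonical spellings for surface forms that should count as one token.
CANONICAL = {
    "ave": "Avenue",
    "avenue": "Avenue",
}

def normalize(token):
    """Collapse the many surface forms of a word to one canonical spelling,
    so the case-sensitive language model sees consistent tokens."""
    key = token.lower().rstrip(".")
    return CANONICAL.get(key, token)

# Every variant from the reports collapses to a single token.
for form in ["Avenue", "AVE", "Ave", "Ave.", "AVENUE", "avenue"]:
    print(normalize(form))  # prints "Avenue" each time
```

Without this step, the six spellings of "avenue" would be counted as six different words, diluting the statistics the model depends on.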
Traditionally, collecting the input word sample, cleaning it, and compiling it into a language model has required teams of people working for more than a year, at a cost of hundreds of thousands of dollars.
Voice Automated has developed software tools to create language models which are convertible to most continuous speech recognition technologies. For example, the Voice Automated General Practice and Primary Care Language Models were each created from a corpus of 10 million words.
Until speech recognition is truly "plug and play," developers will continue to find creative ways, such as language modeling, to cut overall development time and enhance users' early experiences with dictation products.
In the meantime, let's not forget that even a new haircut takes a few days to look right.
Specialized Language Models
Specialized language models for medical and other professions are popular with VARs who are eager to find ways to build margin back into dictation packages which are dropping in price. As a result, language models are proliferating from Voice Automated, Voice Input, Dragon and others.
The medical category is receiving the most attention, primarily because the environment for medical professionals is so highly regulated. With the advent of managed care and increasing requirements for clinical documentation related to reimbursement, more and more medical practices and hospital departments are turning to speech as a way to reduce operating expenses and improve the documentation and file management processes.
Voice Input Technologies (VIT) announced that SpeechWriter(tm) is now available for cardiology, orthopedics, oncology, emergency medicine and radiology, with general surgery scheduled for release later this year. SpeechWriter is a large vocabulary continuous speech dictation product for health care administrators and practitioners. "When your business depends on the correct application and interpretation of a very specialized vocabulary, you need a very specialized product," observed Andrew Friedman, president of Voice Input. VIT, an early leader in the development of specialty language models, introduced SpeechWriter for Mental Health in May of 1996. VIT's language model is fully integrated with both MS Word and WordPerfect. VIT is also developing a custom language model for a system of three hospitals. Learn more at www.speechwriter.com.
VoiceAutomated provides a broad range of language models for the medical market, including General Medicine, General Surgery, Urgent Care, Podiatry, Cardiology, Workers' Comp, Psychiatry, Psychology, Radiology, Pathology and Rehab, with Oncology, Orthopedics, Neurology and OB-Gyn coming soon. The products are designed for IBM's ViaVoice and Dragon's NaturallySpeaking. The company also announced a new product line to help medical professionals rapidly complete forms using speech recognition dictation. The MedFlow product contains templates and macros enabling physicians, nurses, nurse practitioners, medical assistants, therapists and others responsible for direct patient care to quickly generate AMA-compliant documentation through IBM's ViaVoice. MedFlow is physician-written, with strict adherence to new HCFA guidelines. Learn more at www.voiceautomated.com.
New medical speech dictation software from Dragon, announced in April, is specifically designed for medical professionals to create patient records, medical reports, notes, correspondence and other documents. Medical professionals can dictate directly into MS Word, Corel WordPerfect and most other Windows applications, including medical applications. Dragon NaturallySpeaking Medical Suite is an enhanced version of the NaturallySpeaking dictation software; Dragon claims text can be dictated naturally and continuously at up to 160 words per minute, with words appearing immediately on the screen. Accuracy rates are said to be 95% and up for experienced users. Learn more at www.naturalspeech.com.