January 9, 2004
By Caroline Henton Founder & CTO - Talknowledgy

Speech in the Healthcare Industry

Medical care providers, ranging from receptionists to insurance providers to specialist surgeons, now have a wide array of speech applications available that may increase their productivity. Other beneficiaries include the recently established National Patient Safety Network and developers in the growing field of bio-informatics. Speech technology offers clear advantages in medical and healthcare applications such as record keeping and transcription, but using similar tools to place pharmaceutical orders carries many linguistic dangers, with potentially lethal consequences.

Approximately 15 years ago, researchers in automatic speech recognition (ASR) realized that one of its most effective uses was in routine data entry. To this end, they focused on developing specialized, limited-vocabulary applications that stood a better chance of productization and success in niche markets, such as medical (or legal) note-taking and record-keeping. When users' utterances are restricted to single words, and no-match or out-of-grammar utterances are avoided, recognition in these scenarios is likely to be more accurate. The software has indeed had greater adoption rates and provides greater help than, say, spoken command-and-control of the PC desktop. Speech-enabled medical documentation systems allow physicians to use ASR to create and dispatch patient notes, medical records and referral letters and, most recently, to place prescription orders.

X-RAYS TO X-FILES
Ten years ago ASR was bundled into a handheld device that resembled a personal memo recorder, so that radiologists could record their analyses of X-ray plates. The data would then be loaded into the appropriate fields in medical records on a larger computer. The maturity of this type of application was illustrated recently when Speech Technology Magazine recognized Ramapo Radiology Associates with a Most Innovative Solution award in 2003. A combination of Dragon's NaturallySpeaking (from ScanSoft) with VoiceBrook's VoiceOver tools makes it possible for radiologists to deliver prompt diagnoses for better patient care, rather than spend time on repetitive, routine administrative tasks. Ramapo's description of their product encapsulates this successful deployment.

"Speech recognition solutions can effectively replace traditional transcription, reducing cost and speeding response to referring physicians," said Dr. Robert Tash. "Document creation in realtime can be achieved without significantly altering the radiologists' daily workflow. In addition, speech recognition software is always available, and the rapid turnaround of reports is a major benefit for us. We are very pleased with our results with speech recognition technology and consider it a vital tool."

Managing healthcare information such as patient names and insurance records is a successful and safe use of speech technology. The challenges of recognizing and verifying personal and other proper names are essentially no greater than in other routine record-keeping applications (Henton, 2003). Well-designed user interfaces combine ASR with graphical user interfaces and custom templates. Macros eliminate repetitive tasks, reducing the time taken to create documents by as much as 50 percent, and transcription happens in real time.

Physicians working in shared practices, hospitals, clinics and other specialty groups benefit from expedited exchange of, and access to, dictated records, notes and prescriptions in a centralized document database. Medical professionals can save time, accelerate reimbursements, cut processing costs and increase revenues.

DO NO HARM
In an emerging and potentially powerful application of speech technology, physicians can now speak prescription orders into a wireless handheld device, like a PocketPC©. Embedded speaker-independent, non-continuous ASR is then used to enter the spoken items in pre-determined fields. After recognition has been performed, text appears on the small screen for confirmation and the prescription is relayed to a central server for rapid filling at the pharmacy of the patient's choice. The expectation is that physicians and pharmacists will review, at the end of the day, all prescriptions placed wirelessly, but we are all aware of the noise levels in public areas, the size restrictions on PDA screens, and the tedium of having to review forms.

Typical orders spoken by harried doctors walking along busy hospital corridors take the form: "Ibuprofen. 600 milligrams. Every 4 hours. A.C. For pain." Given the many opportunities for mistakes (in the speech recognition, in mixed-up drug and/or patient identification, in dosage, etc.) this scenario may provoke chills in many of us. How might the linguistic diversity of physicians who do not speak English as a native language affect the effectiveness of these speech-driven devices?

YOU SAY TRACHEA, I SAY TRACHEA
Medical dictation systems must support far larger than normal vocabularies - more than 250,000 words, to include medication names, medical procedures, diagnoses, diseases, etc. Shaw wasn't considering this when he called America and Britain "two countries divided by a common language" (Henton, 2002), but the divisions are as strong here as elsewhere in English. The list below presents a few well-known differences in the terminology used (to designate semantically the same thing) and the varying pronunciations of these scientific/medical terms by American and British speakers. All pronunciations appear according to the International Phonetic Alphabet (IPA) transcription standard; primary stress is indicated by a raised bar before the stressed vowel.

The impact of these significant pronunciation divergences - in stress placement, in the number of syllables and in vowel length - on speech recognition is perhaps not the most obvious one. ASR providers should know these variants and load appropriately different grammars (with their associated pronunciation models) into the localized software used in the U.S., Canada or the UK. The real problem lies with physicians and medical technologists who have learned English (perhaps as a second or other language) outside North America or the British Isles, but who are resident in the U.S. or the UK.

Linguistic speculation accounts for these varying pronunciations by assuming that (native) speakers of English draw different analogies according to their perception of the morphological origins of these neologisms, and regularize them with the stress patterns preferred in their dialect. Speakers of Indian or Singaporean English will have learned primarily British English but they may practice in Chicago or Vancouver; similarly, Australian English-speaking doctors and dentists who studied in Hong Kong may have moved to London. Their accented varieties of English will be one impediment to reliable recognition by systems built for other standard accents, and their learnt/preferred pronunciation of the terminology will add another layer of potential confusion or failure.

UNSPEAKABLE NAMES
For legal purposes, names and trademarks need to be spelled correctly. However, it is not possible to dictate legally how they are pronounced. This has important and varied repercussions when names are (re)produced using text-to-speech (TTS). In naming a new company or product, it is now de rigueur to combine upper- and lower-case characters in one alphabetic string, with no white space, or to alter the spelling for eye appeal. This typographical rule-breaking also comes from company mergers, giving rise to such unwieldy strings as those exemplified in the following list of some pharmaceutical giants and their product brand names. Boldface sequences show non-English spellings; the hash mark (#) shows a TTS-normalized text string that breaks the normal spelling (phonotactic) rules of English, which may in turn cause the TTS system to produce an unpredictable or weird interpretation.

Some drug names are familiar enough to physicians and patients alike that they should not present pronunciation/recognition difficulties for an automated spoken system (e.g. aspirin, codeine, Valium™). For native speakers of English, however, other drug and/or compound names range from fairly unambiguous, to opaque/ambiguous, to those for which speakers have no idea of either pronunciation or stress placement. The three lists below illustrate these issues, in descending order of difficulty for humans, and by deduction, those which present increasing difficulties for TTS systems:

In a vain attempt to help speakers with unpredictable stress placement and/or vowel quality in drug names, pharmaceutical companies and health maintenance organizations (HMOs) sometimes give pronunciation hints in an ad hoc, dictionary-style transcription. For example, the following are taken from product advertisements and prescription leaflets from the HMO:

This information is completely unsystematic: note the three different renditions of unstressed syllables, the use of either a post-positioned single quote or upper case to indicate stress, and the unjustified or inconsistent use of upper case in general. It helps neither native nor non-native speakers of English, nor those confused by quasi-phonetic notation. Problems with the unknowables (the great majority) remain unalleviated by drug manufacturers providing such pseudo-pronunciations. More often than not, we are left to our own (wobbly) intuitions about stress placement, short vowel /I/, long vowel /i/, or diphthong /aI/; 'hard' or 'soft' letter "c", i.e. /k/ or /s/, etc. Anyone who has listened to a radio doctor's call-in show, where people question a physician about the drugs they have been prescribed, knows that lay people (us) stumble and hesitate over the pronunciation of the drugs they're taking, and ultimately resort to spelling them for the doctor.

Given these many (socio)linguistic variables, it is impossible to attribute a degree of certainty to attempts to recognize many drug names. All commercial recognizers rely on certainty/confidence factors to supply a match. Recently Walter Rolandi (2003) supplied a useful, critical analogy for this recognition problem: "Imagine an English-only speaker being asked a question by someone speaking in French ... The English speaker instantly knows that what the other person said was not English, i.e. that the speakers' utterance was not in the listener's grammar ... having a recognizer capable of accurately determining whether ... an utterance is in its grammar would be a significant step toward more intelligent voice user interfaces."
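The idea of a confidence factor deciding between a match and an out-of-grammar rejection can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the four-word vocabulary, the function name match_with_confidence and the 0.8 threshold are all hypothetical, and plain string similarity (from the standard difflib module) stands in for the acoustic and language-model scores that a real recognizer would compute.

```python
from difflib import SequenceMatcher

# Hypothetical mini-grammar; a real medical vocabulary is vastly larger.
DRUG_GRAMMAR = ["ibuprofen", "codeine", "aspirin", "valium"]


def match_with_confidence(heard, grammar=DRUG_GRAMMAR, threshold=0.8):
    """Return (entry, score) for the closest grammar entry.

    If the best score falls below the (arbitrary, assumed) threshold,
    return (None, score): the input is treated as out of grammar
    instead of being forced onto the nearest drug name.
    """
    # Text similarity is only a stand-in for a recognizer's confidence score.
    scored = [
        (SequenceMatcher(None, heard.lower(), entry).ratio(), entry)
        for entry in grammar
    ]
    score, entry = max(scored)
    if score < threshold:
        return None, score
    return entry, score


if __name__ == "__main__":
    for heard in ("ibuprofin", "bonjour"):
        entry, score = match_with_confidence(heard)
        print(heard, "->", entry, round(score, 2))
```

With these assumed settings, a near miss such as "ibuprofin" is accepted as "ibuprofen", while "bonjour" is rejected rather than mapped to the closest drug name. Where the threshold sits is the design choice that matters: set too low, the system silently accepts wrong drugs; set too high, it keeps asking harried users to repeat themselves.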
Medical and healthcare systems that verify recognized diseases, procedures and drug names by speaking them back using TTS (to prompt checking and re-entry by hand if necessary) would be more than an intelligent and significant step. They are a vital, preventative step if these devices are to be used more widely by all medical practitioners. Computerized order entry systems typically offer physicians and medical institutions the ability to "streamline workflow, reduce error, save time, money and lives" (www.validus.com). Given the many and varied linguistic and phonetic barriers described here, it is not clear how errors can be reduced, let alone avoided, and how lives may be saved.

RX FOR REMEDIES
There are still three hurdles to wider adoption of digital dictation devices to increase efficiency for health-care professionals. First, there are understandable concerns about confidentiality/security. Second is the fragility or fallibility of recognition accuracy. Third is the lack of immediate spoken guidance cum confirmation. What can we suggest to mitigate these factors?

The first is the easiest: users need to be sensitized to the need to enter the data in a quiet, semiprivate location. Walking out from a consultation, or from a patient's room, or standing near the nurses' station in the center of a bustling ward are not ideal environments in which to speak delicate, private facts about a patient's prognosis or prescriptions. These are also very noisy places, which in turn will affect the accuracy of the recognizer adversely, leading to repeated attempts and giving rise to increased frustration rather than efficiency. The second problem will then become tolerable, if not solved.

The last, and most important, improvement in these speech scenarios is for the user to have some guidance and immediate confirmation of what they have spoken. Many early adopters in U.S. radiology departments have since abandoned spoken record keeping, because the need for repetition and the high failure rates were simply too frustrating. According to Philips Speech Recognition Systems, however, their product SpeechMagic™ (available in 22 languages) is now used in some European countries by more than 60 percent of radiologists (STM NewsBlast, December 10, 2003). The product has recently expanded into other specialized areas, such as cardiology, pathology and surgery. Clearly the speech recognition component has improved over the past 15 years. And perhaps the working conditions of these non-U.S. professionals provide better, quieter and more private surroundings.

There remain skeptics in the U.S. medical profession who are not convinced that doctor-patient confidentiality is being preserved, and who also do not trust the accuracy of the speech recognition. This may be because the ability to talk back is NOT there. None of the current implementations includes TTS, which would let the system talk back. TTS can guide users to speak a personal or product name correctly (i.e. the way the name has been entered phonemically in the recognizer's dictionary), and it can safely confirm entries that have been made using ASR and/or the graphical interface. Every doctor, specialist and pharmacist would welcome a system with such features, IF their HMO accountants or company paid for the installation, training and setup fees.

References
Henton, C. (2002) You say 'zee', and I say 'zed'. Issues in localizing voice-driven applications. Speech Technology Magazine, May/June, 28-31.
Henton, C. (2003) The name game: pronunciation puzzles for TTS. Speech Technology Magazine, September/October, 32-35.
Rolandi, W. (2003) When you don't know when you don't know. Speech Technology Magazine, July/August, 28.

Dr. Caroline Henton is Founder and CTO of Talknowledgy.com. Dr. Henton can be reached at carolinehenton@hotmail.com or 831.457.0402.
