The State of Desktop Speech

This article focuses primarily on state-of-the-art speech applications presently running on full-function PCs, both desktop and smaller. The speech software runs on the PC itself, and is typically used by a single user at a time. Applications and customers, drawn from the US market, are representative of those throughout the world.

"In the beginning was the Word ..." We depend on words. Despite its fleeting and ethereal nature, speech is the most common means of communication between people. Acquiring speech and language is a critical developmental activity, starting in infancy, for people everywhere. However, to span time and space in a more permanent fashion, we need to turn spoken words and data into text (handwritten, printed, typed, etc.).

The creation of moveable type to mass produce printed materials has been with us for 550 years. For just 125 years, people have been able to use a typewriter keyboard to create the printed word themselves. Each of these technological breakthroughs, improved and refined over the years, has dramatically changed our environment, our work and how we do it.

To go to the next step, enabling that text to be rapidly processed and disseminated, we need to turn it into computer-readable form. Word processing with keyboard input has come into popular use only over the past 35 years. The advent of commercial large-vocabulary, general-purpose, continuous speech recognition dictation products just over five years ago forms the basis for today's desktop speech capabilities. In addition to typing, or instead of it, users can now use speech effectively as an input modality. Users speak, and their computers can act on spoken commands and, more significantly, can immediately transcribe natural speech into arbitrary text on the screen as they speak. Recorded speech can be played back for transcription, proofreading and other applications. High-quality synthetic speech can read text and data on demand, allowing people to listen to email or other materials while they are otherwise engaged.

While far from perfect, these speech input and output capabilities are presently relied upon by significant user populations. Further research and product development will continue to improve system performance and expand the markets and user groups adopting this technology.

In this day and age, with most office workers responsible for generating their own email, reports, etc., typing is a time-consuming activity. Although "thought" time is often the gating factor for job throughput, keyboarding is still an important component.

Programmers, journalists and secretaries are among the world's fastest keyboard aficionados, typing at 100 words per minute (wpm) or more. The average office worker, however, has been reported to type at 30-40 wpm. We routinely speak conversationally at 150-200 wpm.
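To make the gap concrete, here is a rough back-of-the-envelope sketch in Python. The 1,000-word report length is an illustrative assumption, not a figure from any survey. The rates are the end points of the ranges quoted above, and the calculation deliberately ignores "thought" time and any time spent correcting typing or recognition errors.

```python
# Rough comparison of raw text-entry time at the rates quoted above.
# REPORT_WORDS is an illustrative assumption, not a measured figure.
REPORT_WORDS = 1000

# Labels and rates taken from the ranges quoted in the text (words per minute).
RATES_WPM = {
    "average typist (30 wpm)": 30,
    "average typist (40 wpm)": 40,
    "fast typist (100 wpm)": 100,
    "conversational speech (150 wpm)": 150,
    "conversational speech (200 wpm)": 200,
}


def minutes_to_enter(words: int, words_per_minute: float) -> float:
    """Return raw entry time in minutes, ignoring thinking and correction time."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return words / words_per_minute


def entry_times(words: int = REPORT_WORDS) -> dict:
    """Map each labelled rate to the minutes needed to enter `words` words."""
    return {label: minutes_to_enter(words, wpm) for label, wpm in RATES_WPM.items()}


if __name__ == "__main__":
    for label, minutes in entry_times().items():
        print(f"{label}: {minutes:.1f} min")
```

Under these assumptions, a 1,000-word report takes roughly 25 to 33 minutes to type at average speed, against about 5 to 7 minutes to speak. Real-world gains would be smaller once correction time is counted.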

Not surprisingly, two professions that have most strongly embraced desktop dictation are doctors and lawyers. These groups have to generate copious amounts of text, and have to do it under time pressure. As with many professionals, the work products through which they communicate their expertise and opinions, and for which they are ultimately compensated, are usually text. Individual doctors and lawyers typically generate thousands of pages of text annually. Some of these are turned around in a day; many more take days or weeks.

The human and economic cost savings in improving throughput and turn-around time are tremendous. As a public defender once put it to me melodramatically, "When my paperwork is late, my client sits in jail!"

Large legal firms, hospitals, etc. typically staff their own around-the-clock or on-call transcription services to transcribe recordings of dictated materials. Some U.S. hospitals have even resorted to using offshore transcription services. Small to medium-sized legal groups and medical practices often scramble to obtain satisfactory daytime coverage. The two-step, record-then-transcribe process used in all of these situations gives rise to errors that are not caught by the originators of the reports in later reviews. Serious consequences have resulted from common transcription errors, specifically the omission of short words, such as "no" in "no evidence of cancer."

When health-care professionals use speech recognition directly for their dictation needs, they receive immediate feedback, are spared a separate review cycle, and can catch and correct errors while the information is fresh. Latencies in transcribed documents and reports result in the unavailability of timely information, which is especially critical in the medical arena when multiple doctors confer on a given patient. Delays in reimbursements, especially from third-party health insurers, are directly tied to the submission of satisfactory finished reports. So even when an emergency arises, significant resources are brought to bear, and the crisis is resolved, costs are not recoverable until the complete reports can be submitted. A major advantage of rendering text and data immediately into computer-readable form is the opportunity, and greater likelihood, of entering it into integrated, streamlined workflow processes for document creation, customer relationship management, hospital information systems, etc. Centralizing information directly improves its integrity, consistency, availability and trackability, while reducing redundancy, multiple sources of errors and time delays. It's analogous to the difference between producing a typewritten report and a word processing document, or between a handwritten note and an e-mail message.

Dictaphone, Philips and Sony have each integrated desktop dictation capabilities into their centralized dictation products. These are marketed primarily to doctors, lawyers and large enterprises. Another significant group of users consists of people who use dictation software because of disabilities or impairments. For many of these people, dictation software allows them to work or pursue their education; without it, they couldn't. Disabilities where this technology has proven very useful include mobility impairments, paralysis, cerebral palsy, muscular dystrophy, dyslexia and carpal tunnel syndrome. Carpal tunnel syndrome, also known as repetitive stress injury (RSI), is the single largest occupational disability in the United States today. A growing number of companies, including Chevron, Kodak and Southern California Edison, make dictation software available to employees who have been injured or who are at risk. A real "equal opportunity" disability today, RSI not only afflicts factory workers, laborers and musicians; it also plagues office workers and professionals who spend too much time typing.

Many disabled students, from young children to adults, now depend on speech recognition to do schoolwork, conduct Internet searches, etc. The multi-partner Liberated Learning Project uses dictation software to project real-time text transcription during college lectures for the benefit of disabled and able-bodied students alike. Text-to-speech has also proven very valuable. It enables blind and visually impaired people to access computer-readable information.

For people who can't speak clearly, synthetic speech provides an effective communication alternative. It is worth reflecting that the needs of disabled users instigated many of today's major office technologies, including the typewriter, the telephone and even the ballpoint pen!

Surveys previously reported by Dragon Systems and IBM found that heavy users of dictation software span many classes of business, government and home users. This technology has become routine for transcribers of dictated materials (e.g. Veterans Administration hospitals for medical reports), document creators (e.g. Berrocal & Wilkins, P.A., and Sidley Austin Brown & Wood for legal briefs, etc.), news reporters capturing stories (e.g. Herald News, Joliet, IL), foreign language translators who routinely dictate their translations (e.g. United Nations), quality control inspectors working in "hands-free/eyes-free" environments (e.g. Volkswagen), law enforcement officers (e.g. Los Angeles Police Department) and many more. Young people write school papers by voice, while senior citizens talk to compose email. For these latter two groups, the very young and the very old, special speech models capturing each group's typical acoustics and language usage have been built into some dictation products.

Form-filling applications are well suited to speech input. Forms typically combine fields of well-defined, application-specific, restricted data (numbers, dates, codes, states, conditions, etc.) with fields for free text (observations, detailed descriptions, special instructions, exception reporting, etc.). Applications in this arena are diverse, from financial trading floors to manufacturing floors. A growing number of mobile workers, especially business people and law enforcement officers, routinely record customer reports, expenses, time billing, data and other observations into high-quality handheld recorders. When these people return to their PCs, they download their recorded acoustic data to obtain a transcript automatically. Data and memoranda recorded on the spot are demonstrably more accurate and complete than later recollections.
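The distinction between restricted fields and free-text fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FormField class, the field names and the small vocabulary are all hypothetical, and the recognized text is assumed to arrive as a plain string from some speech engine, with no particular product's API implied.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class FormField:
    """One field of a hypothetical voice-filled form.

    `allowed` is None for a free-text field; otherwise it holds the
    restricted vocabulary (lowercase) that the field will accept.
    """
    name: str
    allowed: Optional[FrozenSet[str]] = None
    value: Optional[str] = None

    def fill(self, recognized_text: str) -> bool:
        """Store the recognized text if it is valid for this field."""
        text = recognized_text.strip()
        if not text:
            return False
        if self.allowed is None:
            # Free-text field: accept whatever was dictated.
            self.value = text
            return True
        normalized = text.lower()
        if normalized in self.allowed:
            # Restricted field: accept only entries in the vocabulary.
            self.value = normalized
            return True
        return False  # Out-of-vocabulary entry is rejected; value is unchanged.


if __name__ == "__main__":
    # Hypothetical inspection form with one restricted and one free-text field.
    condition = FormField("condition", frozenset({"pass", "fail", "rework"}))
    notes = FormField("observations")

    print(condition.fill("Pass"), condition.value)
    print(condition.fill("maybe"), condition.value)
    print(notes.fill("paint chipped near the hinge"), notes.value)
```

Restricting a field to a small vocabulary is one reason constrained fields tend to be easier for a recognizer than open dictation, while the free-text fields rely on the general-purpose dictation capability described earlier.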

The principal players today offering speech dictation and speech synthesis capabilities for desktops are IBM and ScanSoft, followed by Microsoft and, more distantly, Philips Electronics. Focusing primarily on speech recognition, IBM has developed its own technology, commencing seriously in the early 1970s. IBM offers an extensive line of ViaVoice products, available in 11 world languages. Application development tools and runtime licenses are also available through partners. IBM's products are noted for high quality and are widely marketed.

Through an acquisition in the Delaware Bankruptcy Court in 2002, ScanSoft gained rights to the market-leading Dragon NaturallySpeaking product line, as well as other Dragon Systems assets. Its creator, Dragon Systems, had been acquired in a 100 percent stock swap by Lernout and Hauspie (L&H) in June 2000. Shortly thereafter, allegations of Enron-like fraud by L&H drove the company into bankruptcy, rendering the stock virtually worthless. Despite the exodus of most former Dragon employees, an extensive line of Dragon NaturallySpeaking products, tools and services continues to be developed. Dragon products are also available in many world languages and marketed internationally.

Both the ViaVoice and Dragon NaturallySpeaking products and licenses are marketed and distributed through multiple channels, including retail, catalog and Web sales; Value Added Resellers (VARs); Independent Software Vendors (ISVs); and Original Equipment Manufacturers (OEMs) marketing bundled hardware and/or software products.

Microsoft, a more recent entrant to the market, has introduced its dictation speech engine "Whisper" and its text-to-speech engine "Whistler" in its SAPI software developer kit. These are shipped through Microsoft's Speech.Net initiative for inclusion in major Microsoft products, including Microsoft Encarta, Windows 2000, Office XP and Windows XP. Although the features and performance of these engines are not as advanced as the Dragon and ViaVoice products, these engines have become readily available and free to large numbers of users. Microsoft has not significantly promoted or marketed these capabilities to date. They are available in English, Chinese and Japanese.

Philips positions its SpeechMagic engine as an adjunct to its server-based dictation systems. Its FreeSpeech dictation product was withdrawn from the highly competitive U.S. retail market several years ago. Philips' speech engine is primarily used for medical and legal applications marketed in Europe.

Besides desktop speech, all of these companies have been or are becoming actively engaged in applying speech technology to a range of platforms, from servers to embedded devices. A number of other companies are supplying component technology for desktops and/or other platforms. Hundreds of VARs and ISVs integrate and customize systems for individual customers and market segments. A limited number of companies and university departments specialize in making fundamental advances in the core speech technology, and in tackling especially challenging operational tasks.

Today's desktop speech is the springboard for advanced speech capabilities on a myriad of convenient handheld and other mobile devices. Major desktop applications, such as word processing and email, become major headaches on small devices with minuscule keyboards. Graffiti is inherently slow, and no one really wants to enter "text by toothpick" on a set of tiny keys.

The advent of more powerful small devices will usher in platforms where speech I/O has unique advantages. As with the transition from mini and mainframe computers to PCs, local processing on wireless PDAs and high-end cell phones will enable users to work without the delays and problems inherent in remote server access for distributed speech processing. In addition to standard database queries checking weather, stocks and sports scores, users will be able to compose email, conduct queries of open-ended search engines, create sales reports, maintain customer databases and record field observations directly. Connecting to servers and networks, while still very valuable and essential for many applications, will no longer be a requirement for speech processing. New, more sophisticated desktop applications will also gain currency. Some of today's server-based applications will be ported onto desktop platforms. Desktop computers with ever faster processors, networking and Internet access, more substantial storage, etc. will be able to support more advanced speech and language capabilities, including local audiomining (audio search engines), multispeaker meeting and telephony transcription, real-time spoken language translation and progressively more natural language database queries.

Feasibility prototypes for all these devices, small and large, have already been demonstrated. It is a matter of time and additional R&D improvements to bring them successfully to market for consumers. User interface improvements will make it easier for new users to start using speech systems with less effort, to make corrections more intuitively, to move seamlessly between diverse devices, etc. Ongoing improvements in speech recognition accuracy, more natural text-to-speech, and overall system capabilities are of paramount importance in creating ever more useful and attractive products. Market conditioning, an essential component for the wide-scale adoption of all new technology, is now becoming ever more evident for speech technology. Many people are now beginning to encounter speech technology through brief speech interactions for constrained command/control or database tasks over the telephone. Automated directory assistance and prompts such as "say or touch 1" or "say collect call or operator" are typical examples.

More recently, telephone callers seeking information such as Amtrak train schedules even encounter artificial interactive personae, such as Amtrak's "Julie." Low-end desktop speech capabilities have been widely disseminated through low-cost retail products, product offerings through AOL, and hardware/software product bundles. The shower of industry awards collected by dictation software over the past five years also helps raise consumer awareness. Increasing familiarity with effective speech technology speeds its adoption in all sectors.

It is, of course, the customers and consumers of all kinds of speech technology who realize the greatest economic benefits from time savings and convenience, reduced labor and other cost savings. Those benefits include the time a medical doctor saves writing reports, or the ability of a disabled person to rejoin the workforce, as well as offloading a telephone operator with an automated attendant, or providing the safety of hands-free, cell-phone dialing in an automotive environment.

Desktop speech products started emerging about 20 years ago. Over the past five years, with the arrival of general-purpose dictation software, millions of copies have shipped worldwide. According to PC Data's monthly surveys of the U.S. retail sector alone, desktop dictation software sales (both units and dollars) have ranked as a top category of business software sales for the past several years. Significant revenues for desktop dictation also derive from direct sales and licensing to corporate and government customers, VARs/ISVs, OEMs, etc.

L&H's failure to market and ship products during its debacle in the late 2000 to early 2002 timeframe set back industry product sales substantially. Given the computer industry malaise and general economic downturn, recovery of desktop speech sales, though steady, promises to be slow. Nonetheless, the companies supplying desktop speech capabilities (as contrasted with speech companies supplying server-based telephony or embedded speech technology) have consistently, both past and present, garnered the lion's share of all speech company revenues and profits. Despite recessionary fears, serious setbacks and delays, a number of market prognosticators still project annual speech industry revenues in the several hundred million to multi-billion dollar range over the next decade. As in the past, market advances are likely to be directly tied to significant technological advances.

Presently, companies such as IBM (with WebSphere) and Microsoft (with its .NET initiative) are focusing much of their speech and language R&D on the burgeoning Web application server market, using speech input/output as an adjunct, especially to handheld devices and cell phones with minimalist keyboards. Initial systems will focus on giving users voice access to constrained database applications, similar to those presently popularized by telephony-focused speech companies. Even greater user and economic benefits, however, will arise from the porting of today's advanced desktop speech capabilities onto these small form-factor platforms themselves.

Meanwhile, a major issue presently in contention, with respect to multiple platforms accessing Web server applications, is the choice of standards. Microsoft is promoting the SALT initiative in opposition to X+V (a combination of XHTML and VoiceXML), advocated by IBM and others. Harmonizing these two standards would confer major benefits on suppliers, developers, customers and the market. As the U.S. cell phone industry has amply shown, multiple standards confuse, fragment and delay the industry while dramatically increasing costs and reducing functionality. Technological innovations are the key to future success; improved recognition accuracy and more natural text-to-speech will drive higher utility, market acceptance and return on investment. Common interface standards and consistent, intuitive user interfaces will enhance products and expand the user base. Improved noise handling expands the environments where this technology is effective; affordable low-power processors will enable handheld devices and cell phones to rival today's desktop PCs. Progressively, speech will become a principal communication mode, offering speed and convenience, between people and machines as well as between people!

Dr. Janet M. Baker is co-founder of Dragon Systems and has been active in the speech industry for more than 30 years. She presently lectures and writes on speech technology, entrepreneurship, transferring technology to the market, etc. for business audiences worldwide. She can be reached at janet_baker@email.com.
