Making TTS Real
Text-to-speech (TTS) technology is a computer system's ability to translate text into synthesized speech. Today's deployment of TTS can be divided into three segments: enterprise and telecommunications; automotive and mobile; and consumer applications. These segments in turn demand differing sizes of TTS footprint: large footprint is host or server-based, accessed by multiple remote clients, and small footprint is embedded. Large footprint TTS is ideal for high-volume enterprise and telecommunications speech services, while small footprint suits automotive systems, PDAs, cell phones and other devices. These deployments are reviewed here in order of both market and footprint size. Some compelling current integrated TTS systems are also described.
IVR and Telephony Systems: Server Applications are Tops
In server applications, TTS is used to speak dynamic text in any situation where pre-recorded speech is cumbersome or impossible. Examples of TTS working in server applications include:
- Reading e-mail messages
- Address data
- Part numbers
- Financial information, including account balances, stock prices, etc.
- Dynamic Web content
- Alerts (e.g. reminder to take a prescription drug)
- Notifications (e.g. when a stock limit is met; new items)
- Audible help tips, in kiosks, booths, etc.
The most powerful argument for using TTS technology in the telecom world is that it reduces costs and expedites development by supplanting human recordings. By using TTS prompts, an organization can reduce staffing requirements while improving the speed and quality of customer service. In some cases, TTS supports scores of languages for which human voice talent would otherwise be unavailable.
TTS in the telecom space is widespread and varied. It allows changing data from any content source, such as databases and the Internet, to be spoken on the fly. Market leaders in voice, video and data convergence over broadband networks have licensed TTS for a variety of applications that provide businesses with innovative ways to manage information and resources. Users may improve their productivity by accessing voice mail, fax and e-mail from one central location.
The practice of combining digitized voice-over speech (e.g. for prompts in interactive dialogue systems) with synthetic speech (e.g. for reading the information in IVR systems) is growing. This is a legitimate, efficient and user-friendly combination of the two types of spoken information delivery, especially if the prompts and the synthetic voice are based on recordings by the same voice talent. Highest quality, near-human TTS is important in call centers because the ability to copy and automate the speech of a voice talent aids the seamless integration of dynamic information into sequences of static pre-recorded prompts. Sometimes it's possible for all voice prompts to be TTS-only.
Unified messaging now means primarily "e-mail by phone". Both YahooByPhone and AOLByPhone incorporate a large footprint, flexible TTS system. Since AOLByPhone was launched in 2000, more than 24 million people have received their e-mail messages over the phone.
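The mix of static recorded prompts and synthesized dynamic information described above can be pictured as a simple playback plan. The following is a minimal sketch in Python, not any vendor's API: the class names, the function names and the audio file names are all hypothetical, and a real IVR platform would hand each step to its own audio player or TTS engine.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Recorded:
    """A static prompt recorded in advance by a voice talent."""
    audio_file: str  # hypothetical file name, for illustration only


@dataclass
class Synthesized:
    """Dynamic text that must be spoken by the TTS engine."""
    text: str


Segment = Union[Recorded, Synthesized]


def build_balance_prompt(balance_text: str) -> List[Segment]:
    """Wrap one dynamic value in two static, pre-recorded prompts."""
    return [
        Recorded("your_balance_is.wav"),
        Synthesized(balance_text),
        Recorded("anything_else.wav"),
    ]


def describe(plan: List[Segment]) -> List[str]:
    """Turn a plan into readable steps: PLAY for recordings, SPEAK for TTS."""
    steps = []
    for segment in plan:
        if isinstance(segment, Recorded):
            steps.append(f"PLAY {segment.audio_file}")
        else:
            steps.append(f"SPEAK {segment.text}")
    return steps


if __name__ == "__main__":
    for step in describe(build_balance_prompt("1,204 dollars and 16 cents")):
        print(step)
```

Keeping the recorded and synthesized segments in one ordered list makes the point in the text concrete: the closer the TTS voice is to the voice talent, the less audible the joins between PLAY and SPEAK steps.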
These deployments may be the most difficult for a TTS supplier, owing to the high volume, the variety of hardware and the highly varying physical circumstances of the users (in an airport, in an open-top car, on a noisy factory floor, etc.). In another taxing environment, the National Weather Service (NWS) now uses a high quality concatenative TTS voice; the older parametric TTS voice was difficult to understand, so listeners could not always make out the forecast correctly. In areas where weather changes are frequent and dramatic, as with tornadoes or hurricanes, or in situations where conditions may alter minute by minute (for pilots of small airplanes), it is essential to have immediate and intelligible dynamic updates. Only TTS can deliver this information continually.
In Europe, TTS is being deployed through robots and audiotext services for broadcasters and media clients. TV viewers call servers in response to questions asked or transmitted during programs in order to win prizes, participate in game shows, etc. These services are scheduled at prime broadcasting times, and are designed to encourage massive response and create additional revenues for broadcasters. Large peaks occur several times a day, with a total of over six million calls per month.
Automotive and mobile devices
TTS is a critical component in complying with laws that require hands-free operation in vehicles. Speech technology for mobile applications is designed to create an intuitive voice-controlled user interface for safe productivity while driving. TTS forms part of the speech option in higher-end cars, the best-known of which in the US is OnStar. Typically, travelers hear personalized traffic and directions (spoken street names, city names, points of interest, freeway names) and numbers, addresses and any other text that the navigation system is designed to supply, such as news, sports, stock quotes and weather reports. In-dash personal assistants may incorporate automatic speech recognition (ASR) and TTS to allow drivers to perform a number of tasks by talking, while keeping their hands on the wheel and eyes on the road. The availability of customized voices should help reduce the risk of driver distraction. Deployments for navigation and information far outnumber those for embedded telematics.
Car-independent spoken navigation systems also exist. German developers of navigation systems for mobile and automotive use, along with software providers, have chosen a small footprint TTS system to speech-enable navigation software for mobile devices and notebooks. Drivers enter their destination; all directions are then spoken out loud. In addition, the exact position of the vehicle can be calculated via GPS and precise instructions given for a journey. TTS thus provides low-cost route guidance and improves both safety and autonomy.
Consumer software products
Despite the decade-long availability of a large variety of TTS voices on personal computers, "...computer speech has been thoroughly ignored by the average consumer" (1). TTS has been touted as a proof-reader, screen-reader, talking browser, translation tool, spelling/grammar aid, personal assistant and time/alert speaker, and as the voice of talking agents or characters providing on-line assistance. Quite simply, mainstream desktop PC users do not need text to be read to them from the screen while they do something else; and to be constantly told the time (in quarter, half or full hour increments) when it's displayed on most screens is unnecessary at best and annoying at worst. (For TTS deployment for blind users, those with physical disabilities, and for help with language or reading skills, see sidebar.)
Recently, however, there has been a wave of interest in TTS embedded in handheld devices as a "scheduling assistant", reading appointments or tasks from a calendar, and as an MP3-enabled text reader. In the MP3 scenario, TTS allows documents, e-mail, fax messages, manuals, electronic books and Web-based text content to be downloaded and spoken during a commuter's journey, in the gym or while gardening. This deployment is more likely to find a wider demographic of fans, since it is both an aid to productivity and a form of "edutainment". Text files can be any length; text may be copied to the clipboard, appear as files saved to a play list, or be typed in by the user. But the means by which text files have to be copied and pasted into the special reading programs is often clunky and irritating. The user interface in such programs needs to be more transparent, and the stages involved in getting the MP3 files read out need to be fewer. Ideally, converted text files (Word and PDF) would be loaded directly onto the MP3 player, and only the highest quality synthetic voices would read them, rather than the sub-optimal warbling or monotone robotic voices that persist in many applications.
One industry insider's predictions
When asked for his views on the best deployment of TTS, Matt Marx, vice president of customer solutions for Tellme, started by reaching into several pockets and laying a clutch of handheld devices on the table: a cell phone, a pager and a PDA. Why three? Clearly he needs to be in contact constantly. The cell phone's battery dies too often, needing a recharge every night; the battery in the pager dies every couple of months; the PDA provides a calendar, addresses and Web site accessibility while commuting or traveling. Obviously one handheld device (with one battery) would be preferable, but it has to be small and light, and its battery must last longer than that of a standard cell phone. The advent of "smart phones" that combine the functions of all three devices (as well as "name dialing" using ASR, storing and retrieving contact information, and accessing information services such as news, stock quotes and sports) is none too soon for the continuously connected communicators.
For such users, the everyday necessity of TTS lies in its ability to deliver unstructured data, for example, to scan and read e-mail messages remotely, by phone. Structured data such as weather reports, sports scores and restaurant listings can all be recorded by a voice talent or accessed directly from broadcast sources. For most menu offerings in a voice portal (e.g. street names, movie titles, stars and synopses) information can be updated on a weekly basis; but sometimes the schedule slips and TTS can be used to "plug the gap" until the voice talent can catch up. Mobile professionals call their e-mail dozens of times a day because it gives instant remote (intranet) access to e-mail between meetings (even during meetings), and avoids a return trip to the desk or office. It is not a stretch to equate the ubiquity of TTS with that of cell phones: TTS is the only way to stay connected when traveling, whether on the road, in the air or just walking between meeting rooms.
At home, the cell phone may become the preferred way to collect information from the office. TTS is used most effectively to scan e-mail message contents for immediacy or urgency. Responses can then be made by phone, rather than by e-mail, which is more cumbersome and time-consuming (logging on, password, Internet connection down, etc.). Despite the decade-long availability of competent algorithms to normalize the text of e-mail headers, top TTS systems still perform poorly, reading headers character by character. This is especially bad in the "Reply" and "Forward" functions.
Future improvements
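The e-mail header problem noted above is one place where a little text normalization before synthesis would help. The following is a minimal sketch in Python, assuming that only the common "Re:", "Fw:" and "Fwd:" subject prefixes need handling; the function name and the spoken wording are hypothetical choices for illustration, not a description of how any commercial TTS system works.

```python
import re

# Matches one leading reply/forward marker such as "RE:", "Fw:" or "fwd :".
_PREFIX = re.compile(r"^\s*(re|fw|fwd)\s*:\s*", re.IGNORECASE)

# Hypothetical spoken forms for each marker.
_SPOKEN = {"re": "Reply", "fw": "Forwarded", "fwd": "Forwarded"}


def normalize_subject(subject: str) -> str:
    """Replace stacked Re:/Fwd: markers with words a TTS voice can read."""
    spoken = []
    rest = subject
    while True:
        match = _PREFIX.match(rest)
        if match is None:
            break
        word = _SPOKEN[match.group(1).lower()]
        # Collapse immediate repeats such as "Re: Re: Re:".
        if not spoken or spoken[-1] != word:
            spoken.append(word)
        rest = rest[match.end():]
    rest = rest.strip()
    if not spoken:
        return rest
    return f"{', '.join(spoken)}: {rest}"


if __name__ == "__main__":
    print(normalize_subject("RE: Re: FW: Q3 budget"))
```

With this kind of pre-processing the engine receives whole words instead of a string of capitals and colons, so it no longer has a reason to spell the header out letter by letter.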
The latest TTS products offer an array of individual voices, a variety of personae, regional accents, and a choice of characters to interact with inside applications. The most pressing area for improvement remains prosody: finer control of voice quality and intonation contours is still required. Users are not so much demanding a variety of voices (e.g. a male or child's voice) or accents, or even different languages; they do, however, require accuracy, intelligibility and freedom from droning monotony. For deployment on a larger scale, operational issues remain important, as do questions relating to quality and usability. Measuring port densities and latencies on standard benchmark equipment is essential. There is currently some confusion in the industry as to how these two figures ought to be measured and reported. As the demand for e-mail access by phone grows, so will TTS deployment. Technology enthusiasts like it. Will leery Luddites, technology cynics and non-geeks use it? The (general population) jury is still out.
Acknowledgements
This article is based on current information obtained by direct communication with, or from Web sites of, six leading TTS suppliers: AT&T Labs, Elan, fonix, Rhetorical Systems, ScanSoft and Speechworks. I am grateful for input from individuals at these companies.
Footnote
1. David Pogue, "Hearing Text, Not Tunes, on Your MP3 Player." The New York Times, May 2, 2002.
Dr. Caroline Henton is CTO of Talknowledgy.com. Dr. Henton can be reached at firstname.lastname@example.org.