Beyond Voice Quality

Text-to-speech (TTS) technology leaves many application designers at a crossroads. Although TTS technology is deployable - and it frequently sounds great - it still makes mistakes. What's more, some of those mistakes seem painfully obvious to us. However, there are many great applications that can't exist without high-quality TTS. Imagine a telephone shopping application that can make proactive, personally tailored suggestions to callers; or a ticket booking system that can confirm every caller's address automatically and accurately; or a telephone banking system that doesn't just tell you what you spent, but - far more usefully - where you spent it.

Of course, high-quality TTS is not just for newly deployed applications. There are many DTMF and dialog systems on the market whose functionality could be greatly improved by the addition of high-quality TTS.

RAISING THE BAR FOR TTS DEPLOYMENTS

The question we must ask ourselves is: when we deploy TTS technology, how do we maximize the benefits and minimize the errors? To answer this, we must fundamentally alter the way that we treat TTS in our dialog applications.

Historically, developers have treated TTS as a black box - something that can simply be plugged into an application and start talking. This works to some extent. Most TTS systems will read pretty much anything you pass into them. Presumably, however, we want our TTS to speak well. It seems counter-intuitive, but rather than validating the black box approach, increasingly natural TTS has underscored its risks. Today, highly natural TTS systems are held to a higher standard than earlier systems. Errors made by highly natural TTS systems are somehow even less acceptable than the same errors made by older systems.

As with other speech technologies, TTS systems need a fair amount of coaxing to perform at their peak, and there are steps that application developers can take to enhance the output of today's TTS systems.
These steps fall generally into four categories:

1) Application design
2) Preparing the text
3) Using tools effectively
4) Advanced features

Application Design: Making the Voices Work

An important factor in dialog applications is branding. When people talk to a speech service, they get a strong feeling about the company behind that service, and the way that company treats its callers. Music, sound effects and the casting of a particular voice talent all contribute to the voice brand that extends a corporate identity and differentiates an organization from the competition. This is perhaps most pertinent to the voice that is chosen to represent a company or product in an application.

When we use TTS more liberally, we must decide from the outset whether we will use the same voice talent for the TTS as for the recorded prompts, or different voices. Using different voice talents is the most common approach to date, mainly because the ability to develop high-quality custom voices is very recent. There are various User Interface (UI) techniques to make the transition from recorded prompts to TTS as subtle as possible. These include the following (in the first two examples, the opening phrase is a recorded prompt and the text after it is TTS):

- Introductions: "The computer will now read your details…3 Gray Street…"
- Audio icons: "The message body follows. [BING!] Hi Tony…"
- Using a variety of female/male voices for the prompts and the TTS
- Using similar-sounding voices for the prompts and the TTS

With care, these techniques can sound acceptable for some applications, but the transitions can still be jarring, confusing or irritating in longer calls. Be cautious; using similar-sounding voices is strongly discouraged. This practice is the aural equivalent of wearing a suit jacket that is nearly, but noticeably not, the same color as your suit pants.

Using the Same Voice Talent to Create a Sophisticated Dialog

Using the same voice talent can be very effective. It involves combining recorded prompts with TTS derived from the same speaker. This practice is only aesthetically possible now that TTS has reached such a high level of naturalness. This approach can deliver compelling output - all the emotion you need from your pre-recorded prompts, combined with all the flexibility of TTS, in the same voice. For example, for a banking application, a recorded prompt could read, "Your last card transaction was a debit at…" and TTS could then complete the phrase with dynamic content, say "…Pleasant Valley Restaurant."

Applications that achieve this level of sophistication avoid jarring transitions and create a smooth dialog that is enjoyable to the caller. While this isn't trivial, it can be achieved successfully with careful design. It's also worth noting that certain core functionality is a prerequisite for this level of blending to work convincingly (e.g. the TTS engine must support rate and volume control).

Preparing the Text: "It's All Greek to Me"

Remember the expression, "Garbage in, garbage out"? The same rule applies to TTS. Imagine that you don't understand a foreign language, Greek for instance. Fortunately, Greek has a transparent writing system, so without too much trouble you can learn to read Greek out loud. However, because you can't understand the language, you have no notion of the meaning of what you're reading.
From this point of view, each sentence is a single item, unconnected to the previous and following sentences. The reader is unaware of ambiguities, spelling errors or text that is not logical. As a result, someone who understands Greek would find the reader comical or, worse, irritating. The same is true when TTS output does not sound the way the listener expects. The best way to illustrate the importance of correct pragmatics is to show some examples of what can happen if the text is sub-optimal.

Example A:
Two dozen eggs
4 quarts of Milk
Orange Juice
Bread (Whole Grain)

Without proper text preparation, this would be read as: "Two dozen eggs four quarts of milk orange juice bread, whole grain." However, if we add punctuation, it would read: "Two dozen eggs. Four quarts of milk. Orange juice. Bread, whole grain." All we need to do is add a period to the end of each line. The TTS system does not have real-world knowledge and cannot deduce from the content of the text that this is a shopping list. Likewise, it cannot assume that it should pause every time it sees a newline character.

Example B:
The weahter is expected to be cloudy tomorrow.

This would be read as: "The wheat er is expected to be cloudy tomorrow."

TTS systems do not usually have spell-checkers (although they could), so if a misspelled word is passed in, the TTS engine will not correct it.

Another issue is case sensitivity. In e-mail, many people write only in lower case and often without punctuation.

Example C:
i was finally accepted at mit

This would be read as: "I was finally accepted at mitt."

Example D:
I was finally accepted at MIT

This would be read as: "I was finally accepted at M I T."

Depending on the type of information an application must read, it is critical to pay close attention to the format of the text to be synthesized. Having total control over the input text produces fewer problems. If the text is dynamic, implementing a custom pre-processor should be considered. This is a tool that alters the input text to make the output speech sound better. The National Weather Service, for example, uses a custom pre-processor for its NOAA Weather Radio application.

TTS Tools

A high-quality TTS system will have a solid "front end" that analyzes text before sending it to the speech engine. This function determines, for example, how a "/" (forward slash) character should be read and whether "Dr." is read as "doctor" or "drive". Sometimes more focused text analysis capabilities are needed, which is where dictionaries, tags, language identifiers and pre-processors can be essential.

TTS dictionaries allow application developers to override default processing rules in order to force a particular pronunciation of words, acronyms and more. Mispronounced words can be corrected with a simple dictionary entry.

Specialized tags for addresses, URLs, dates or phone numbers can be extremely useful when the application controls parts of the input text, such as when reading from a database. Portions of text can be identified and tagged, which invokes a special set of processing rules applicable only to the tagged text. For example, the default engine might read NW as "N W", but when the text is tagged as an address it would read it as "northwest."

A language identifier is essential for deployments in which text from more than one language is presented to the engine. An e-mail message, for example, that has portions of text in both English and German cannot be properly synthesized by a single TTS voice. The identifier determines the language of the incoming text and routes it to the proper voice.
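The routing step just described can be sketched in a few lines of Python. This is only a toy illustration, not how any particular TTS product works: the stopword lists, the voice names and the functions `identify_language` and `route_paragraphs` are all assumptions made for this example, and a production identifier would rely on a trained statistical model rather than word counting.

```python
import re

# Tiny stopword lists (assumption): a real identifier would use a trained model.
STOPWORDS = {
    "en": {"the", "and", "is", "you", "for", "with", "this", "that", "are", "not"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ich", "mit", "für", "sie"},
}

# Hypothetical voice names, used only for illustration.
VOICES = {"en": "english_voice", "de": "german_voice"}


def identify_language(text, default="en"):
    """Guess the language of `text` by counting stopword hits per language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when nothing matched (for example, very short text).
    return best if scores[best] > 0 else default


def route_paragraphs(message):
    """Split a message on blank lines and pair each paragraph with a voice."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", message) if p.strip()]
    return [(VOICES[identify_language(p)], p) for p in paragraphs]


if __name__ == "__main__":
    mail = "This is the report for you.\n\nDas ist nicht der Bericht für sie."
    for voice, paragraph in route_paragraphs(mail):
        print(voice, "->", paragraph)
```

Routing paragraph by paragraph is a design choice made for this sketch; it gives the identifier a reasonable amount of text per decision while still allowing a mixed-language message to switch voices.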
The more text available to an identifier, the more accurate it can be. Some TTS engines require as few as 20 characters.

Application-Specific Pre-Processing

Specialized pre-processors combine a variety of enhancements into a focused tool used to improve text rendering for an application. The most widely known for TTS is the e-mail pre-processor, which performs such tasks as extracting date, sender and subject information from the header, parsing the message, and voicing commonly used texts such as emoticons. Pre-processors can perform similar functions for other applications. A specialized pre-processor for a navigation system might ensure that vehicle and driving-related input texts are pronounced correctly.

Advanced Techniques in Application Development

Let's imagine a designer has: 1) selected a state-of-the-art TTS engine that speaks in a highly natural voice; 2) chosen the right application; 3) designed a compelling TTS presentation by mixing recorded prompts and synthesized speech; 4) set up a system that makes the input text impeccable; and 5) used dictionaries, tags, identifiers and pre-processors to limit pronunciation ambiguity and further control the output. Incredibly, there are still options available to further enhance synthesized speech.

Customers with significant investments in their talking heads (think James Earl Jones or Larry King) can extend this investment in corporate branding by creating a custom TTS voice, using the voice talent to do the voice database recordings. A custom voice can be tuned to match a desired personality and specific application domains such as weather, traffic reports, banking, store locators, etc.

Voice user interface designers have often complained about introducing TTS to their audio masterpieces. They felt a robotic or stumbling voice with limited intelligibility interrupted the flow and distracted callers. But today, even out of the box, TTS has reached a new level of acceptability.
Yet, enhancements such as those discussed here can take TTS to even greater heights. With advances in processing power, natural language understanding, statistical modeling and linguistic theory, the next great strides in TTS will include:
- Injecting even greater personality into TTS voices (both in terms of a single personality and being able to emotionally load different renditions of the same text)
- Seamlessly combining pre-recorded prompts and TTS and, increasingly, replacing pre-recorded prompts with TTS
- Further efficiency and performance gains
Application developers may ultimately move back towards the black box model of TTS deployment as these advances take hold. In the meantime, the methods introduced here could help developers achieve greater success from this technology today.
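As one illustration of the text preparation, dictionary and tagging methods discussed above, here is a minimal Python sketch of a custom pre-processor. It is an assumption-laden example rather than a description of any real engine: the function names, the `PRONUNCIATIONS` and `ADDRESS_EXPANSIONS` tables and the sample street address are all invented for this sketch, and it does not attempt spelling correction (Example B).

```python
import re

# Hypothetical pronunciation dictionary: written form -> spoken form.
# The lower-case "mit" entry is an application-specific choice (Example C).
PRONUNCIATIONS = {"mit": "M I T"}

# Hypothetical expansions applied only to text the caller tags as an address.
ADDRESS_EXPANSIONS = {
    "NW": "northwest",
    "NE": "northeast",
    "SW": "southwest",
    "SE": "southeast",
}


def end_lines_with_periods(text):
    """Turn each non-empty line into a sentence so the engine pauses after it."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line[-1] not in ".!?":
            line += "."
        sentences.append(line)
    return " ".join(sentences)


def apply_dictionary(text, dictionary=PRONUNCIATIONS):
    """Replace whole words that have a dictionary entry; leave the rest alone."""
    def replace(match):
        word = match.group(0)
        return dictionary.get(word, word)

    return re.sub(r"[A-Za-z]+", replace, text)


def expand_address(text):
    """Expand abbreviations in text that has been tagged as an address."""
    return " ".join(ADDRESS_EXPANSIONS.get(token, token) for token in text.split())


if __name__ == "__main__":
    shopping_list = "Two dozen eggs\n4 quarts of Milk\nOrange Juice\nBread (Whole Grain)"
    print(end_lines_with_periods(shopping_list))
    print(apply_dictionary(end_lines_with_periods("i was finally accepted at mit")))
    # prints: i was finally accepted at M I T.
    print(expand_address("1200 K Street NW"))
```

Keeping the address expansions separate from the general dictionary mirrors the tagging idea: "NW" is only expanded when the application knows the text is an address.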
Robert Rieger is the text-to-speech product marketing manager and Dan Faulkner is a head of core technology/TTS for SpeechWorks International Inc. They can be contacted at robert.rieger@speechworks.com and dan.faulkner@speechworks.com.
