Taking a Look at TTS

What are the current trends in TTS? Some indications appeared during a novel session at SpeechTEK 2002, where the author moderated a Show and Tell: TTS Solutions panel in which ten international TTS vendors accepted an invitation to participate: Aculab, AT&T, Elan Speech, IBM, Loquendo, NeoSpeech, Rhetorical Systems, ScanSoft, SpeechWorks, and SVOX. (The list of vendors, together with their URLs, appears in the sidebar.) Critical review of the session’s results forms the basis of this article.

Each TTS vendor was instructed to provide a file lasting no longer than one minute for demonstration and testing purposes. The contents were stipulated to be: (a) the company name; (b) a statement of the number of languages and voices the company offers; (c) a short sentence created by the vendor to show off its system; and (d) a sentence generated by the moderator to test the TTS systems. The sampling rate could be set at whatever level vendors chose, provided they stated the rate used. Different voices could be included in the clip, as long as they conformed to the specified format. The demonstrations and statements were to be delivered as a .wav file on a CD-ROM. Vendors offering both concatenative and parametric speech synthesis could submit two clips, one for each method.

The test sentence was: “From Laurel Canyon Blvd., turn left onto Mulholland Dr.; Dr. O’Shaughnessy lives at the yellow house on the corner at the first ‘Stop’ sign: 2529 Wonderland Ave.”

While it was not explicit in the instructions, vendors were assumed to have applied the ‘honor’ code and to have delivered samples that were true, unedited output from their TTS systems. Hand-correcting (tweaking) demonstration files serves little promotional purpose, since the shortfalls of any system become apparent immediately when a customer is allowed to test-drive it by entering text freely.

Hard nuts to crack
The content of the test sentence was designed to represent a growing application for TTS: spoken directions in automobiles. The sentence was also crafted to contain several well-known ‘old chestnuts’ in text-to-speech production, namely:
• correct resolution of (ambiguous) abbreviations
• phrasal stress
• appropriateness of phrasing and intonation tunes around punctuation
• segment-specific blending quicksand (liquids and glides)
• field-appropriate number pronunciation
• proper name pronunciation

All systems expanded all the abbreviations (Blvd., Dr., Dr., and Ave.) correctly, although one English-speaking company mispronounced “boulevard” as /buləvɑd/, with the wrong vowel in the first syllable. Both UK and US English pronounce the first syllable of this word as /bʊl/. This is a typical error that requires further careful checking by speakers of the language being synthesized. Similarly, all systems resolved the ambiguous abbreviations “Dr.” (‘drive’) vs. “Dr.” (‘doctor’) correctly.

Assigning the correct phrasal stress in the two noun phrases “Mulholland Drive” and “‘Stop’ sign” proved challenging for many of the systems. When the name “Mulholland” is spoken in isolation, the primary stress falls on the first syllable; in the phrase “Mulholland Drive”, the phrasal accent shifts to “Drive”, and consequently the degree of stress in “Mulholland” changes. This is a well-documented phenomenon in English phonetics; such knowledge needs to be included for better naturalness in TTS systems.

While all the systems correctly ignored the single inverted commas around ‘Stop’, several delivered a sub-optimal intonation tune on the word “sign” in the phrase “‘Stop’ sign:”. Many parsers treat the colon as a terminator, similar to a period, and apply a fall. A falling intonation tune is incorrect. Others try for a ‘list-initiating’ continuation rising tune. Also incorrect. What is needed is a dynamic fall-rise tune, which may be absent from the inventory recorded for many US English systems. These enduring problems in generating appropriate phrasal stress and intonation tunes are explored further below.

Phonetic knowledge also indicates that certain segments, known variously as approximants, liquids and glides, are harder to concatenate without detectable glitches (warble). The acoustics are understood, and the sounds /r, w, l, j/ are generally handled well in current synthesizers.
The initial sound in “yellow” nevertheless sounded ill-selected or poorly blended in a couple of the samples played. When TTS voices are produced by non-native speakers of English, some crucial language-specific knowledge may be absent. Such was the case with a (German-speaking) rendition of the word “house”, which was pronounced incorrectly as /haʊz/, with a final voiced fricative /z/ that appears only in the verb form ‘(to) house’. This mistake reveals a German residue. Careful checking of the pronunciation rules, or ensuring the use of an English-only inventory of sounds, would eliminate such glaring errors. In a different sample, the length of some vowels (e.g. the /ɪ/ in “fit”) was noticeably too long, indicating an insensitivity to important phonetic length/quality differences in English that do not apply, for example, in Italian.

Let’s do the numbers
There still remain language-specific issues of text-field normalization. In the test sentence, a couple of European systems failed to read the street number correctly for US English; notably, these systems were created by non-English-speaking companies. In both cases, the street number (“2529”) was spoken as “two thousand five hundred (and) twenty-nine.” The format and pronunciation of items in postal addresses differ greatly according to different international standards. In the process of defining text fields, members of the set {Number} may need to include a special subset to be used in the text field ‘Address’. This would ensure that US English street numbers are read in ‘Digits’ (“two five two nine”) or ‘Number Pairs’ (“twenty-five, twenty-nine”) mode (for further details, see Henton, 2002a).

What’s in a name?

The majority of systems showed remarkable resilience and accuracy when faced with the (admittedly challenging) name “O’Shaughnessy”. Lively debate in the audience showed that there remains much room for improvement in the pronunciation of proper and personal names. A relatively small number of Anglo-Saxon and Spanish personal names can be pronounced using clean grapheme-to-phoneme correspondences; the vast majority, however, require hand-checking. It is estimated that there are over 2 million distinct last names in the USA alone. A commercial TTS system not only has to pronounce names containing sequences from their origins in Spanish, Irish, Russian, German, Chinese, Vietnamese, Norwegian, etc.; it also has to provide alternatives for the many personal names whose owners pronounce them in divergent ways: ‘Marie’ as /məri/ or /mɑri/; ‘Sanchez’ as /sanʃes/ or /santʃɛz/; ‘Menzies’ as /mɛnziz/ or /mɪŋɪz/, for example. For vendors who claim multi-lingual coverage in their offerings, reliable accuracy in this sphere is particularly daunting. Correct treatment of input text cannot be measured from a few examples.
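The ‘Digits’ and ‘Number Pairs’ address modes described under “Let’s do the numbers” can be sketched as a minimal, hypothetical normalizer; the function and mode names below are purely illustrative and do not come from any vendor’s system:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def two_digits(group: str) -> str:
    """Read a one- or two-digit group as English words ("25" -> "twenty-five")."""
    n = int(group)
    if n < 10:
        # "05" is read "oh five" in spoken addresses; a bare "5" is just "five"
        return ("oh " + ONES[n]) if len(group) == 2 else ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def read_street_number(number: str, mode: str = "pairs") -> str:
    """Read a US street number in 'digits' or 'pairs' mode, never as a cardinal."""
    if mode == "digits":
        return " ".join(ONES[int(d)] for d in number)
    # split into two-digit groups from the left: "2529" -> "25", "29"
    groups = [number[i:i + 2] for i in range(0, len(number), 2)]
    return ", ".join(two_digits(g) for g in groups)
```

With this sketch, `read_street_number("2529", "digits")` yields “two five two nine” and `read_street_number("2529")` yields “twenty-five, twenty-nine”; the point is that the address field selects a reading mode, so the cardinal reading “two thousand five hundred twenty-nine” is never produced.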
Large databases of text (100,000,000 to 1,000,000,000 entries) across a wide range of applications and languages can serve as automated and repeatable reference mechanisms. Lexical coverage (out-of-vocabulary rates) can also be measured using offline, database-driven tests for reference.

Broader trends that emerged from the ten systems

-Concatenative speech synthesis is now pervasive.
-Large footprint, large database, non-uniform unit (NUU) or variable-unit concatenative systems are commonplace.
-Some systems produce a remarkably natural sound and expressive quality.
-Large numbers of languages are offered (21 is the most claimed).
-Many different voices (varied according to age, regional accents, mood, etc.) are widely available.
-Female and male voices are always available; female voices seem to be deployed more widely; children’s voices are not common.
-The need for different accents, speaking styles and variety is being met.
-Driving instructions, e-mail reading, and delivery of up-to-the-minute news and information are frequent TTS applications.
-Concatenative systems still suffer from occasional poor segmental selection and/or blending, resulting in unnatural speech effects: ‘warbling’, ‘bubbling’ or ‘burping’.
-Further work is needed in the areas of intonation tunes, phrasing and pauses.
-More work is needed to ensure language/dialect-specific accuracy in reading certain text fields appropriately.
-Much work is needed in letter-to-phoneme rules or dictionary creation to ensure correct pronunciation of language-specific and cross-language proper names.
None of the vendors demonstrated a parametric TTS system, so one of the anticipated talking points for the conference session - the relative benefits of parametric vs. concatenative synthesis - became moot. The comparative merits of these two synthesis approaches are discussed in Henton (2002b). Vendors uniformly provide a choice between 16-bit linear and 8-bit µ-law output, but none stated which format was used to produce their demonstration files. Obviously, the lower bit rate has to be available for all standard telephony applications.

Broader directions
A specific discussion point anticipated before the session was: “What are the biggest challenges for TTS today?” In brief, the TTS industry needs to make better use of linguistic, phonetic and acoustic analyses, rather than refining digital signal processing techniques or other engineering and computer-telephony issues of implementation. In the files produced using linear predictive coding (LPC), pitch-period manipulation, or similar compression techniques, the resulting unnatural voice quality (sometimes likened to a ghastly nasal residual) and robotic pitch monotony were clearly audible. Such compression techniques are largely outmoded: they detract from the human-like quality of synthetic voices now expected by customers, and size constraints have been addressed by more efficient database construction and search algorithms.

Pressure of time prevented discussion of two further important questions: how much time is needed to make a new voice in a language already available, and how long it takes to create the first voice in a new language. Such inquiries may frame the discussion for a future TTS session, along with exploration of the professional services for application tuning that are available from each vendor.

Intonation, timing and pauses
The latest TTS products offer an array of individual voices, a variety of personae, regional accents, and the option of choosing between characters to interact with in applications. The most pressing area for improvement remains prosody: finer control of shifts in intonation contours and of associated voice quality is still required. Analysis and coding of intonation – as well as other non-segmental aspects of speech – is rather crude for US English. The ToBI system – which claims to be “a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety” – relies on binary labeling of ‘highs’ and ‘lows’ in the intonation tune. This is insufficient for the creation of synthetic speech, and it does not integrate well with other semantic, syntactic and pragmatic determiners of intonation. Application of such a simplistic analysis results in intonation contours in TTS systems that are too invariable, with a tendency either to be overly dynamic (too enthusiastic or emotional) or to drone in a predictable, soporific, bored and boring monotone. Intonation models for UK English contain more dynamic intonation tunes; it would behoove researchers to compare the relative success of the different schemata.

A similar argument applies to the levels of word stress incorporated in a TTS system. Generally, no more than eight levels, including primary stress, secondary stress, unstressed, and reduced (not accented), should suffice. The best rewards are obtained if an ‘adequate sufficiency’ of pitch-level definition and movement, as well as of stress levels, is derived from empirical observation or from user-adjustable pitch-contour pattern-matching. Timing (the insertion of pauses and their duration), overall rate, and loudness of the synthetic speech may be user-controlled in a concatenative system.
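One crude way such rate control can be realized in a pitch-synchronous concatenative framework is by duplicating or omitting whole pitch periods, the idea behind the PSOLA family of methods. The sketch below is purely illustrative (it omits the windowed overlap-add step of real implementations) and clamps the scale factor to the artifact-safe range noted in the text:

```python
def stretch_periods(periods, factor):
    """Toy duration modification in the spirit of TD-PSOLA: repeat or drop
    whole pitch periods to slow down or speed up speech without changing
    pitch. `periods` is a list of per-pitch-period sample sequences."""
    # Keep the scale factor within the range such systems tolerate well.
    factor = max(0.5, min(2.0, factor))
    n_out = round(len(periods) * factor)
    # Map each output slot back to the nearest input period.
    indices = [min(len(periods) - 1, int(i / factor)) for i in range(n_out)]
    return [periods[i] for i in indices]
```

For example, stretching ten pitch periods by a factor of 1.5 yields fifteen output periods, with roughly every other input period duplicated; requesting a factor of 5.0 is silently clamped to 2.0.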
However, TD-PSOLA concatenative synthesis generally limits the pitch and duration modification factors to between 0.5 and 2.0; otherwise, substantial artifacts are introduced into the speech. Improved modeling of intonation, natural pauses and breaths, and discourse-appropriate variation in rates of speech remain major challenges in the evolution of text-to-speech.

SpeechTEK’s theme was Real Issues. Real Solutions. Conference attendees came to share experiences and to learn how discussing mutual problems can help their products. A prime goal of this report on the Show and Tell: TTS Solutions session was to encourage panelists to re-examine their own and others’ demonstrations, and to educate all on the benefits of phonetic-linguistic analysis of speech in the pursuit of excellence for TTS.

References
1. HENTON, C. (2002a). “You Say Zee, I Say Zed”: Issues in Localizing Voice-driven Applications. Speech Technology Magazine, May/June 2002.
2. HENTON, C. (2002b). Challenges and rewards in using parametric or concatenative speech synthesis. International Journal of Speech Technology, 5: 117-131.
Dr. Caroline Henton is CTO of Talknowledgy.com. Dr. Henton can be reached at carolinehenton@hotmail.com or (831) 457-0402.