How TTS Works: The Technology Behind Text
Unlimited vocabulary speech synthesis, or text-to-speech (TTS), has had an impressive history over the last two decades. During this time, it has moved out of the laboratory and onto the desktop. A variety of TTS systems are now available; such systems are becoming increasingly natural-sounding, intelligible, and even expressive. Yes, TTS technology is here today, and cost-effective applications that utilize it are starting to proliferate.
To examine what underlies this technology, let's consider the complexity of the task at hand. Unlike limited vocabulary systems, which must produce only a small predefined set of possible utterances, TTS systems must be able to intelligently handle any input text - a task more difficult than might be expected.
Consider, for example, the sentence "There are 1904 barking dogs that live on St. John St." In order to correctly produce this sentence, the system must determine that "1904" is to be pronounced "one thousand nine hundred and four" (as opposed to "nineteen o four"); that "live" rhymes with "give" (and is not used as in the phrase "a live show"); that the first "St." represents "Saint" and the second "Street"; that "there" rhymes with "hair," and not "here"; and that the sentence does not end after the first period.
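The text-analysis decisions just described can be made concrete with a deliberately tiny sketch. It is not how any particular product works: the function names (`cardinal`, `year`, `normalize`), the `numbers_as` option, and the rule "St. followed by a capitalized word means Saint" are all assumptions made for illustration, and the heuristic would fail on many real sentences.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n):
    """Spell out 0..9999 as a quantity, e.g. 1904 dogs."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(last))
    return " ".join(parts)

def year(n):
    """Spell out a four-digit number as a year (toy version, e.g. 1904, 1997)."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " o " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

def normalize(text, numbers_as="cardinal"):
    """Expand four-digit numbers and the abbreviation 'St.' word by word."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "St.":
            # Assumed heuristic: before a capitalized name it is "Saint",
            # otherwise (including at the end of the text) it is "Street".
            out.append("Saint" if nxt[:1].isupper() else "Street")
        elif re.fullmatch(r"\d{4}", tok):
            n = int(tok)
            out.append(year(n) if numbers_as == "year" else cardinal(n))
        else:
            out.append(tok)
    return " ".join(out)

sentence = "There are 1904 barking dogs that live on St. John St."
print(normalize(sentence))
print(normalize(sentence, numbers_as="year"))
```

With the default setting the number is read as a quantity; passing `numbers_as="year"` switches to the "nineteen o four" reading. Choosing between "live" as in "give" and "live" as in "a live show" needs grammatical context and is left out of this sketch.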
But analyzing the input text is only half the battle. Once the system has determined the desired pronunciations of the input, it must generate the actual sounds that accurately produce these pronunciations.
This task is complicated by the fact that perceptually-identical sounds, or phonemes, are acoustically quite different in different phonetic contexts. For example, while the "p"s of "speech" and "peach" are perceptually very similar, they are acoustically very different.
The precise duration and frequencies of a sound depend on many factors - which segments precede and follow it, its position in the word, syllable, or phrase, whether the syllable containing it is emphasized, whether the speech is fast or slow, whether the voice is that of a male or female, and so on.
Not only does a TTS system need to produce the speech sounds themselves, but it must also generate the overall prosody (timing and intonation) for the utterance as a whole, emphasizing and de-emphasizing the appropriate words.
For example, in our sample sentence, the word "John" would usually receive the most emphasis. However, what if the sentence were uttered in response to the assertion "There are not 1904 barking dogs that live on St. John St."? Then the word "are" would receive the heaviest emphasis, and other words would be de-emphasized accordingly.
While current text-to-speech systems generally are not sophisticated enough to make these kinds of contextual distinctions automatically, more and more work on discourse analysis is being done, and the next generation of systems will be increasingly sophisticated in this regard.
Currently, some TTS systems, including Eloquent Technology's ETI-Eloquence, allow users to annotate utterances with emphasis markers and other kinds of tags to create special intonational effects like contradiction, or even general moods like boredom and excitement. And overall, many of today's systems automatically generate appropriate intonation for most utterances.
To accomplish the text processing and speech production tasks, text-to-speech systems generally have two main components: the text module and the speech module. The text module parses the input text into the appropriate linguistic units (e.g., phonemes), while the speech module uses the information produced by the text module to create the actual speech.
The Text Module
The text module contains the algorithms, or rules, that process the input text. In order to provide the information needed for high-quality speech output, a text module must divide the text into sentences, the sentence into phrases, the phrases into words, the words into syllables, and the syllables into phonemes.
For accurate phoneme generation in many languages, this module must also analyze words into morphs (prefixes, roots, and suffixes). In English, for example, the "ed" of "naked" is pronounced very differently from the "ed" of "baked," because in "naked" it is part of the root, while in "baked" it is the past-tense suffix.
Similarly, the "th" of the two-root word "hothouse" is realized as two sounds, while the "th" of the single-root "mother" is realized as a single sound. Note that putting all possible words in a dictionary is not a viable solution to phoneme generation, since new words are constantly introduced and, in many languages, words or roots can be combined at will into larger compounds.
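To see why morph analysis matters for pronunciation, here is a minimal sketch of the "naked" versus "baked" case. The root lexicon `ROOTS`, the function names, and the informal output symbols ("t", "d", "Id") are invented for this illustration, and the last letter of the stem is used as a rough stand-in for its final sound. A real text module would work from phonemes rather than spelling.

```python
# Hypothetical miniature root lexicon; a real system would have far more.
ROOTS = {"bake", "naked", "wait", "play"}

def analyze(word):
    """Split a word into (root, suffix); the suffix is '' if none is found."""
    if word in ROOTS:
        return word, ""          # "naked" is itself a root
    if word.endswith("ed"):
        # Try dropping "ed" ("waited" -> "wait"), then just "d" ("baked" -> "bake").
        for root in (word[:-2], word[:-1]):
            if root in ROOTS:
                return root, "ed"
    return word, ""

def ed_pronunciation(word):
    """Return how a past-tense 'ed' suffix is realized, or None if there is no suffix."""
    root, suffix = analyze(word)
    if suffix != "ed":
        return None
    stem = root[:-1] if root.endswith("e") else root   # ignore a silent final "e"
    last = stem[-1]
    if last in "td":
        return "Id"   # extra syllable, as in "waited"
    if last in "pkfsx":
        return "t"    # after a voiceless sound, as in "baked"
    return "d"        # otherwise voiced, as in "played"

for w in ("baked", "naked", "waited", "played"):
    print(w, analyze(w), ed_pronunciation(w))
```

Because "naked" is found as a root, no suffix rule applies to it, and its "ed" is left for the root's own pronunciation to handle.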
Text-to-speech products differ in the strategies they use to parse the input text into linguistic units. In its text-to-speech product, ETI-Eloquence, Eloquent Technology, Inc. (ETI) has adopted a novel and powerful approach developed through years of research by its founder and her associates at both Cornell University and ETI.
This approach centers around a unique, multi-tiered utterance representation, called a delta, in which all units necessary for high-quality speech generation (e.g., sentences, phrases, words, morphs, syllables, phonemes) are explicitly represented on separate, time-coordinated streams.
The following, for example, is a small fragment of the delta (for the two words "barking dogs") that would be produced by ETI-Eloquence for the sample sentence discussed above:
text: |b|a|r|k|i|n|g | |d|o|g|s |
word: |wrd | |wrd |
morph: |root |suffix| |root |suffix |
syllable: |str1 |str0 | |str1 |
phoneme: |b|a|r|k|I|ng | |d|c|g|z |
Although not shown here, the units in each stream also have various associated attributes that the rules in ETI-Eloquence use to make the appropriate linguistic generalizations. Phoneme tokens, for example, have information about phoneme type (e.g., vowel vs. consonant), place of articulation, manner of articulation, whether the phoneme is voiced or voiceless, etc.
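The idea of time-coordinated streams can be approximated in a few lines of ordinary code. The sketch below is not ETI's Delta language or its internal representation. The `Token` class, the `build` and `within` helpers, and the use of integer sync points are assumptions chosen for illustration. Only the word "barking" is modeled, and only the streams whose alignment is unambiguous.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    start: int                 # sync point at the token's left edge
    end: int                   # sync point at the token's right edge
    label: str
    attrs: dict = field(default_factory=dict)

def build(segments):
    """Lay (label, width) pairs end to end, starting at sync point 0."""
    tokens, pos = [], 0
    for label, width in segments:
        tokens.append(Token(pos, pos + width, label))
        pos += width
    return tokens

# One sync point per letter boundary of "barking" (0 through 7).
delta = {
    "text": build([(c, 1) for c in "barking"]),
    "word": build([("wrd", 7)]),
    "morph": build([("root", 4), ("suffix", 3)]),
    # The phoneme "ng" spans the two letters "n" and "g".
    "phoneme": build([("b", 1), ("a", 1), ("r", 1), ("k", 1), ("I", 1), ("ng", 2)]),
}

# Attach a simple attribute to each phoneme token.
VOWELS = {"a", "I"}
for tok in delta["phoneme"]:
    tok.attrs["type"] = "vowel" if tok.label in VOWELS else "consonant"

def within(delta, stream, token):
    """Labels of the tokens in `stream` that lie inside the span of `token`."""
    return [t.label for t in delta[stream]
            if token.start <= t.start and t.end <= token.end]

suffix = delta["morph"][1]
print(within(delta, "phoneme", suffix))   # ['I', 'ng']
print(within(delta, "text", suffix))      # ['i', 'n', 'g']
```

Because every stream shares the same sync points, a rule can ask questions that cut across levels, such as "which phonemes belong to this suffix?", without re-parsing the text.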
The parsing algorithms that make up the ETI-Eloquence text module are formulated in ETI's special Delta programming language, which is designed for straightforward expression and testing of rules that operate on multi-tiered deltas.
Typically these rules test the delta for particular patterns, and manipulate it accordingly. (Delta programs are compiled into C programs for portability to a wide variety of computer platforms.)
ETI's approach can be contrasted with more conventional approaches, which use linear utterance representations and general-purpose programming languages. Such systems cannot express the linguistic rules that underlie speech in nearly as straightforward a way as systems that use more linguistically-appropriate modes of representation.
As a result, long-term expansion and maintainability are serious problems, and the development of more sophisticated models is often impossible. While many systems may sound comparable today, those that have been designed with expansion and ease of development in mind will come out ahead in the long run.
The Speech Module
The speech module uses the linguistic structure generated by the text module (in ETI's case, the multi-stream delta) in producing the speech output. As mentioned above, the speech module must generate the overall prosody of the utterance, as well as the appropriate acoustic patterns for the individual speech sounds.
To generate prosody, all systems use rules that manipulate parameters like pitch and durations, although the degree of sophistication of the prosody rules differs dramatically across systems.
ETI-Eloquence takes advantage of the rich linguistic structure produced by the text module in generating highly-natural intonation patterns for most utterances.
Concatenative or Rule-Based
To generate the acoustics of the speech segments themselves, systems generally use one of two main strategies: concatenative or rule-based. Most commercial systems in use today are based on concatenation.
In concatenative systems, speech fragments (such as syllables or parts of syllables), originally extracted from natural speech, are pieced together to produce the intended utterance. Depending on a variety of factors - how large the units are, how many units are stored, how the units are represented (e.g., as actual speech waveform fragments or in terms of a smaller number of parameters extracted from waveforms), and many others - concatenative systems can produce quite natural-sounding voice quality, capturing much of the voice quality of the original speaker.
However, because these systems often depend so critically on the speaker(s) from whom the units were extracted, generating a variety of voices and speech styles (e.g., whispering) can be difficult, and each new voice or style may require extracting an entirely new set of units from the model speaker.
Also, depending on the strategy, it is often difficult to express the rules responsible for generating the overall prosody; while a given system may produce quite natural voice quality, it may also produce unnatural prosody.
Finally, again depending on the details of the approach, memory requirements for concatenative systems can be excessive, even when only a small number of voices are provided.
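As a rough picture of the concatenative strategy, the sketch below joins stored waveform fragments with a short linear crossfade at each boundary. Everything in it is a stand-in: the `UNITS` inventory holds synthetic sine tones rather than recorded speech, and the unit names, sample rate, and overlap length are arbitrary assumptions. Real systems also smooth pitch and spectral mismatches at the joins, which this sketch ignores.

```python
import math

def tone(freq, n, rate=8000):
    """Placeholder 'recording': n samples of a sine wave at freq Hz."""
    return [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

# Hypothetical unit inventory; a real one stores fragments cut from natural speech.
UNITS = {"bar": tone(220, 100), "king": tone(330, 100)}

def crossfade_concat(units, overlap):
    """Join waveforms, blending the last `overlap` samples of the running
    output with the first `overlap` samples of each following unit."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # weight ramps toward the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

utterance = crossfade_concat([UNITS["bar"], UNITS["king"]], overlap=20)
print(len(utterance))   # 100 + (100 - 20) = 180 samples
```

The memory cost mentioned above follows directly from this design: every unit for every voice has to be stored, whether as raw samples (as here) or as extracted parameters.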
In rule-based systems, the acoustic parameter values for the utterance are generated entirely by algorithmic means. In ETI-Eloquence, a set of rules sensitive to the linguistic structure in the delta generates a collection of acoustic values (frequencies, bandwidths, amplitudes) that captures the perceptually-important cues for reproducing the spoken utterance.
A set of voice filters (implemented as small program procedures) modifies these cues in accordance with the values specified for a number of parameters (like gender, head size, breathiness, roughness, pitch baseline, and others) to produce the desired voice quality; a synthesizer generates the final speech waveform from the parameter values.
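The pipeline of rules, voice filters, and synthesizer can be pictured with a small sketch of the middle stage. This is an illustration only, not ETI-Eloquence's actual filters: the `Frame` and `Voice` types, the parameter names, the reference pitch of 100 Hz, and the assumption that a smaller head size simply scales formant frequencies upward are all hypothetical simplifications.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Frame:
    f0: float            # pitch in Hz
    formants: tuple      # formant frequencies in Hz
    bandwidths: tuple    # formant bandwidths in Hz
    amplitude: float

@dataclass(frozen=True)
class Voice:
    pitch_baseline: float   # Hz
    head_size: float        # 1.0 = reference; smaller values raise the formants

def apply_voice(frames, voice, reference_pitch=100.0):
    """Return new frames with pitch and formants adjusted for the given voice."""
    scale = 1.0 / voice.head_size
    ratio = voice.pitch_baseline / reference_pitch
    return [
        replace(f, f0=f.f0 * ratio, formants=tuple(x * scale for x in f.formants))
        for f in frames
    ]

# One frame of voice-neutral parameters, as the acoustic rules might emit it.
neutral = [Frame(f0=100.0, formants=(700.0, 1200.0, 2600.0),
                 bandwidths=(60.0, 90.0, 120.0), amplitude=1.0)]

child = Voice(pitch_baseline=250.0, head_size=0.8)
print(apply_voice(neutral, child)[0])
```

The point of the design is that the rules produce one voice-neutral parameter track, and each voice is just a different set of filter settings applied to it, so adding a voice does not require new recordings.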
Rule-based approaches require extensive knowledge and understanding of the sound patterns of speech. While acquiring this knowledge can be expensive and time-consuming, rule-based approaches have the long-term advantage that knowledge is cumulative.
Like the rules in the text module, the rules in the ETI-Eloquence speech module are based on novel linguistic models derived from some twenty years of research. These models have resulted in a succinct set of rules that accurately reflects the linguistic regularities underlying speech.
The ETI-Eloquence speech rules are grouped into a relatively large universal component and smaller language-specific/dialect-universal and dialect-specific components.
As a result, new languages and dialects can be developed quickly, and can be integrated into a single system that takes relatively little memory.
It is even possible to mix and match text and speech modules to produce foreign accents, say a British speaker speaking Castilian Spanish or an American speaking Mexican Spanish. The speech parameters for any language or dialect can be filtered by the common voice filters to produce a limitless number of voices.
While concatenative and rule-based systems each have their respective advantages, the flexibility and long-term potential of a rule-based system that is based on appropriate underlying linguistic models cannot be matched.
Text-to-Speech Toolkits
TTS systems are made available to application developers in the form of toolkits that can be accessed through an application program interface (API). While TTS toolkits share a common core set of features, the complete set of capabilities differs from system to system, generally reflecting the capabilities of the underlying technology.
For example, depending on the sophistication of the text module, a toolkit may or may not provide text interpretation options that allow developers to specify for different applications how to treat ambiguous text strings, such as four-digit numbers like 1904.
Depending on the nature of the speech module, a toolkit may or may not give developers the ability to define their own voices; it may or may not allow for voice changes in the middle of a sentence; and it may or may not provide mechanisms for creating special intonational effects.
In choosing a toolkit, a developer might wish to consider not only what the toolkit can do today, but how it can be expanded in the future.
The ETI-Eloquence toolkit is a highly flexible toolkit that capitalizes on the powerful linguistic models and development tools that underlie it. The toolkit provides eight fully customizable voices, including those of adults and children, both male and female. It provides functions that make it well-suited for a wide range of applications. These include real-time mouth position data for phonemes spoken (e.g., for creating animated talking faces), text indices for synchronizing actions with the speech output (e.g., highlighting words on the screen as they are spoken), special dialog boxes for customizing user dictionaries, multi-channel output (for telephony applications), and more.
The toolkit is presently available for Windows-based PCs, and is being ported to other platforms. In addition to ETI's own API, the system supports the Microsoft Speech Application Program Interface (SAPI). ETI is continuing to enhance the toolkit with new features. Most significantly, 1997 will bring six new languages/dialects.
As more potential users become aware of the advantages of TTS, there will be an increased demand for, and availability of, applications that take advantage of this exciting, cost-effective, human-machine interface technology.
Anticipating this demand, IBM recently announced that ETI-Eloquence will be supplied as part of its VoiceType speech toolkit. Soon TTS-enabled applications will be as commonplace as word processors are today.
Sue Hertz is the president of Eloquent Technology. Further information about ETI-Eloquence and its technology can be found on the company's web site, http://www.eloq.com