Innovations: Apple's New TTS

Under the umbrella of speech technologies, automated speech recognition (ASR) has garnered the lion’s share of press, even though within many applications speech synthesis, or text-to-speech (TTS), is running not so silently in the background. Any time you have rapidly changing or large amounts of text that need to be spoken to an end user, it makes sense to employ TTS rather than attempting to prerecord speech.

Understandably, wherever ASR has been introduced, TTS has been right alongside it, from unified communications and mobile phone applications to business applications and assistive technology. In addition, there are many applications in these domains where speech recognition is not robust or accurate enough to be useful, but where speech synthesis makes information delivery both possible and useful.

People have become inured to synthesized speech in IVR applications, and acceptance has accelerated rapidly through applications such as GPS navigation and email reading.

So often we write about how good speech recognition has become and the many new ways it’s being deployed, but what about TTS? The answer is that synthesized speech has become very good, and it’s everywhere. In the past decade the industry has catered to different user preferences through dozens of new voices and accents, both male and female, and the addition of more natural intonation and expressiveness. As TTS output has become more natural, end users have started to talk about speech personas.

Grounded in Theory
One particularly brilliant example of how far we have come in synthesized speech technology is the voice and engine in Apple’s new Mac OS X Leopard operating system. Leopard introduces a new approach to speech synthesis with the Alex voice.

Alex uses a hierarchical approach to concatenative speech synthesis, many aspects of which are motivated by psycholinguistic theories of speech perception and understanding. Put simply, text spoken by Alex provides listeners with more of the cognitive cues they get when listening to human speech.

For example, when we speak we generally take a breath at major pauses, unconsciously shaping our mouths for the next phoneme we are about to speak. These pauses give the listener hints about the topic structure, and the anticipatory mouth shape changes the way each breath sounds, giving the listener clues as to what the next word will be. By adding this behavior to Alex, Apple noticeably reduces the cognitive load on listeners. Because the first phoneme of the next word is predictable from the breath sound, any word not starting with that phoneme is ruled out, and the listener’s search space shrinks by about 98 percent. Having fewer possible words to choose from means a higher likelihood of perceiving the correct word, which lowers the demand on the listener’s attention.
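
The figure of about 98 percent can be sanity-checked with some simple arithmetic. The sketch below is not Apple's method or data. It assumes, purely for illustration, that English has roughly 40 phonemes and that each is equally likely to begin a word (real English is not that evenly distributed). Under that assumption, knowing the first phoneme leaves 1 word in 40 as a candidate, a reduction of 97.5 percent. The function names and the tiny lexicon are hypothetical.

```python
def uniform_reduction(num_phonemes):
    """Fraction of candidate words ruled out when the first phoneme is known,
    ASSUMING every phoneme is equally likely to start a word."""
    if num_phonemes < 1:
        raise ValueError("num_phonemes must be at least 1")
    return 1 - 1 / num_phonemes


def lexicon_reduction(lexicon, first_phoneme):
    """Same idea measured on an actual word list.

    lexicon maps each word to its list of phonemes; only words whose
    first phoneme matches stay in the listener's search space.
    """
    if not lexicon:
        raise ValueError("lexicon must not be empty")
    remaining = [
        word for word, phonemes in lexicon.items()
        if phonemes[0] == first_phoneme
    ]
    return 1 - len(remaining) / len(lexicon)


# Hypothetical toy lexicon with informal phoneme labels (illustration only).
TOY_LEXICON = {
    "bat": ["b", "ae", "t"],
    "bin": ["b", "ih", "n"],
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fish": ["f", "ih", "sh"],
    "map": ["m", "ae", "p"],
    "sun": ["s", "ah", "n"],
    "top": ["t", "aa", "p"],
}

if __name__ == "__main__":
    # Assumed inventory of about 40 phonemes.
    print(f"Uniform assumption: {uniform_reduction(40):.1%} of words ruled out")
    print(f"Toy lexicon, first phoneme 'b': {lexicon_reduction(TOY_LEXICON, 'b'):.1%}")
```

With a real pronunciation dictionary in place of the toy lexicon, the reduction would vary by phoneme, since common word-initial sounds rule out fewer candidates than rare ones.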

Similarly, most synthesized speech reads one sentence at a time, disregarding the context of what comes before and after, but humans don’t speak that way. Apple has added that human factor by working one paragraph at a time, taking into account context, paragraph topic, and sentence order so that each sentence is read in a way that reflects its context and placement in the paragraph. Better still, Alex ties this in with the breathing patterns humans use when reading in context, to better mimic real speech.

Finally, since Apple has garnered a large share of the assistive technology market, it has tailored Alex for that group, whose members typically prefer formant synthesis over concatenated speech because it can be played back faster. To address this issue, Apple used an unconventional method to speed up concatenated speech, making this new TTS engine highly accessible to the visually impaired.

I listened to Alex. He was great. Five years ago you wouldn’t have heard me say that any TTS was great, but this isn’t the same TTS as five years ago. I’m hoping that Apple extends this development to other platforms, like mobile phones, and gets Alex a sister, English cousin, Japanese friend, etc. The world would welcome these new additions.

Nancy Jamison is principal analyst at Jamison Consulting. She can be reached at nsj@jamisons.com.
