Speech Technology Magazine


IBM Makes Watson TTS More Expressive

IBM makes its Watson Text to Speech API more human-sounding by drawing on more than 40 hours of speech to tag utterances.
By Tye Pemberton - Posted Feb 29, 2016

Furthering its efforts to incorporate what it calls "Emotional IQ" into its Watson technology platform, IBM has updated the Watson Text to Speech (TTS) API to allow developers to add believable affects to their synthesized speech, rebranding the service as Expressive TTS.

The service, which developers can control using Speech Synthesis Markup Language (SSML), is the product of machine learning, manual utterance tagging, and direct coding, drawing from more than 40 hours of speech to develop the algorithms that give Expressive TTS its human-like quality.

"We exposed the Watson machine learning system to speakers speaking in different tones of voice that we annotated in advance," says Michael Picheny, senior manager at the Watson Multimodal Lab for IBM Research, who goes on to explain why Expressive TTS sounds believably human:

"You could imagine achieving expressive speech in a more naive way than we have, where you just take one sentence recorded in one mode and another sentence recorded in another mode, and when you want to synthesize the combination you ask the algorithm to take the average between them. But that's very jarring. What we've done is develop algorithms that allow us to have very smooth transitions between these different modalities so that it sounds like a person is actually talking."

As a result, Expressive TTS gauges the overall affect of each synthesized utterance, rather than applying simple rules like "good news equals a raised tone" to an entire passage.

"If Expressive TTS is commanded to sound apologetic when it would cause some jarring effect on the user, the system will back off to a more neutral style of voice to make sure everything comes out smoothly," says Picheny.

Currently, only the U.S. English Allison voice is SSML-enabled, with three default tones that Picheny calls "affects." They are Good News, Apology, and Uncertainty.
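For developers wondering what selecting one of these affects through SSML might look like, here is a minimal Python sketch that wraps a sentence in affect markup. The article does not give the markup itself, so the express-as element name, its type attribute, and the type values GoodNews, Apology, and Uncertainty are assumptions. The helper express_as is hypothetical and not part of any IBM library.

```python
from xml.sax.saxutils import escape

# Assumed mapping from the affect names in the article to markup values.
# The element name and the type values are NOT given in the article.
AFFECT_TYPES = {
    "Good News": "GoodNews",
    "Apology": "Apology",
    "Uncertainty": "Uncertainty",
}


def express_as(text: str, affect: str) -> str:
    """Wrap text in an assumed express-as element inside an SSML speak root."""
    if affect not in AFFECT_TYPES:
        raise ValueError(f"unsupported affect: {affect!r}")
    # Escape &, <, and > so the text cannot break the surrounding markup.
    body = escape(text)
    return (
        "<speak>"
        f'<express-as type="{AFFECT_TYPES[affect]}">{body}</express-as>'
        "</speak>"
    )


if __name__ == "__main__":
    print(express_as("Your replacement card is on its way.", "Good News"))
```

Rejecting unknown affect names up front mirrors the limited set described above: a request for an affect such as anger fails in the client rather than being sent to the service.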

"We spent a lot of time studying what affects might actually be useful to developers and the service industry to begin with," says Picheny. "Affects like anger seemed inappropriate for the initial use cases we were imagining. But that's not to say that additional affects wouldn't be useful in non-commercial TTS."

In fact, while the current deployment of Expressive TTS is limited, the Watson team is paying close attention to the developer community for guidance. "Let's get some feedback from people and see what else they might want. Our real goal here is to give developers tools they find useful. We want to be useful," says Picheny, hinting at the possibility of additional languages, voices, and affects, dependent on feedback and requests from developers.

The Watson team is also hoping that developers will find ways to leverage Expressive TTS with the wealth of Watson’s other services available through the IBM Watson Developer Cloud on Bluemix. In particular, Picheny believes Expressive TTS might help to create a fully functioning intelligent system for customer service and other similar human-to-machine interactions when combined with Watson services like Tone Analyzer (which detects emotional and social cues in text) and Personality Insights (which builds emotional and personality profiles of individuals and groups based on digital cues, such as email, forum posts, and Tweets). 

IBM is offering Expressive TTS free for the first million characters and will charge 2 cents per thousand characters thereafter.
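To make the pricing concrete, the following Python sketch estimates a bill from a character count. It rests on two assumptions the article does not confirm: that the free million characters is a single allowance rather than a recurring one, and that usage beyond it is prorated per character rather than rounded up to whole thousands. The function name estimated_cost_cents is illustrative only.

```python
FREE_CHARACTERS = 1_000_000   # free allowance stated in the article
CENTS_PER_THOUSAND = 2        # rate stated in the article


def estimated_cost_cents(characters: int) -> float:
    """Estimate the charge in cents for a given number of synthesized characters.

    Assumes per-character proration beyond the free allowance.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    billable = max(0, characters - FREE_CHARACTERS)
    return billable * CENTS_PER_THOUSAND / 1000


if __name__ == "__main__":
    # 1.5 million characters: 500,000 billable at 2 cents per thousand.
    print(estimated_cost_cents(1_500_000) / 100)  # prints 10.0 (dollars)
```

Under these assumptions, an application that synthesizes 1.5 million characters would pay about $10, and anything at or under one million characters would cost nothing.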
