TTS Is Finding Its Way

Hundreds of choices or one perfect voice? Accuracy or variety? Vendor confidence and enthusiasm or customer mistrust and skepticism? When it comes to text-to-speech (TTS), questions remain and hesitations linger about its proper direction. Some vendors toil at crafting what they hope will be the most accurate, natural-sounding voice; others believe customers want hundreds of unique voices. But despite the growing presence of TTS in expanding markets such as in-car navigation units, most within the industry blame either consumers or the technology for TTS’s failure to reach the bar they say it should have cleared years ago.

"Concatenative TTS has mostly been a bust," Walt Tetschner wrote in the June edition of ASRNews. Following a departure from the acoustic model, concatenative TTS entered the market in the 1980s with the promise of greater ease of understanding and a more natural-sounding voice. But even if these applications were above par, use of concatenative TTS didn’t increase, because vendors overhyped products and applications that consumers didn’t necessarily want or need, Tetschner asserts.

Still, some say concatenative TTS has not only evolved, but will continue to do so as developers push for more applications. Caroline Henton, chief technology officer of Talknowledgy, has worked in the concatenative TTS field for 18 years and says developers acknowledge the technology’s flaws, but continue with advancements nonetheless. "Many people expect concatenative speech to be able to read whole paragraphs, whole Web pages, as though they were being read by a human being," she says. "We know the areas that we have to make developments in."

And with the emergence of TTS in car navigation and GPS units, coupled with more in-depth work in speech synthesis, signs of progress have some industry players hoping for TTS’s large-scale adoption.

Where Did It Go Wrong?
The TTS industry has yet to hit the billion-dollar mark that many believed it would following the release of concatenative TTS offerings, says Craig Campbell, CEO of TTS provider Cepstral. Even if companies combined all the costs associated with the technology (such as services and hardware), TTS is nowhere near that benchmark, he says, primarily because the technology has not evolved at the right pace and because most customers and end users were never as excited about it as developers were.

"In some sense, the technology hasn’t shifted to the degree that we had anticipated or would have liked," Campbell states. "I’m not sure there has been broad enthusiasm. I would say some of the industry is at odds."

Tetschner adds that TTS is still being developed by vendors catering to the "geek market," noting that many want technology that looks and acts slick and sophisticated. "Talking to a machine or a machine talking to you is not perceived as cool," he says.

He also blames TTS’s lack of growth on vendors that haven’t taken customer opinion into account or that have convinced themselves that they understand customer needs. "[TTS vendors] even tell you that callers don’t like a certain synthesized voice, and they have no data to back that up," Tetschner states. "I’ve seen surveys that have been run; they come back and they show that the caller doesn’t care. Callers are more irritated that [TTS] doesn’t work and they have to repeat things."

Further stifling industry growth has been a lack of applications. Today, GPS units, PDAs, smart phones, and unified communications have spurred growth within TTS, but it has taken time to realize that TTS works well in niche environments like hands-busy, eyes-busy situations (such as driving), call centers, and assistive technologies for people with disabilities. Two of the most common vertical markets employing TTS solutions today are healthcare and finance. And though TTS remains for the time being the only viable solution for GPS navigation systems, markets such as assistive technologies are too small to provide substantial market share, and TTS implementations in call centers remain costly for small to mid-sized firms.

Further development cannot be made in the call center until there is a complete change away from ill-conceived VUI designs, according to Tetschner. "You’ve got this army of VUI designers that are out there and companies that make a big deal out of VUI design and then mess it up," he says. "It’s embarrassing to watch it; it’s out of the Stone Age."

Despite a myriad of obstacles within the TTS market, each problem has its root cause in the same issue: speech synthesis. Without the proper voice, accurate algorithms, and intonation, synthesized voices used by TTS fall flat.

Two camps have formed within the industry: One pushes for a singular, perfect speaker, while the other seeks variety in the form of 1,000 voices. In early October, Cepstral launched VoiceForge, a program enabling new and existing clients to craft their ideal synthesized voice. Campbell says VoiceForge, which offers thousands of different voices, is a response to customer demand not only for accuracy but also for personality. "These huge, monolithic voices are not going to satisfy the marketplace," he says. "Quality is very subjective, and enormous variety and mass customization is one way to deal with this gap in expectations."

Following the success of personas such as Amtrak’s Julie, Campbell believes corporations will demand a lower-cost alternative in TTS—a synthesized voice that will not only have accurate pronunciation but will also play a role as the voice of a corporation. Steve Tomasco, a media relations director for IBM Research, agrees. "Now that we have achieved good naturalness and intelligibility with state-of-the-art concatenative TTS systems, significant effort has shifted toward making speech synthesis more expressive," he says. "[IBM] pioneered research in expressive concatenative TTS several years ago, and this has caught on in the TTS community since then."

Defining Expression
Expressiveness can mean anything from giving a voice a Southern drawl to speeding it up to reflect the fast pace of a teenager or adjusting the voice’s intonation for various phrases in different scenarios. As Campbell explains, a company in the South might not be so willing to accept a persona called "Yankee Voice." People may be willing to "trade off some of that quality" for a voice that speaks to end users in the intonation or dialect native to their location, he adds.

Some developers have also added emotion to their synthesized voices, which works particularly well in the contact center, explains Rob Kassel, senior manager at Nuance Communications. "Folks have some idea of the scripting that they want to do, even identifying particular phrases that they would like to have said in a variety of ways that is far from neutral," he says. "If the wait time is very long, they might want the voice to say I’m sorry, but they want it to sound like it means it. We have multiple versions of yes and no, so now dialogue designers can add emotion to TTS."

But delving too deeply into emotion is potentially problematic, Talknowledgy’s Henton says. Voice tones and volumes vary across languages, and what might be considered normal volume in Germany could sound forceful in England. The bottom line is that there is no one formula for the perfect voice.

"Emotion is a field fraught with disputes and problems because neither psychologists, linguists, nor branders can all agree with what a voice is that sounds angry," she says. "There are various flavors of anger, intimacy, tenderness. [Emotion] is an area we should keep away from."

Despite efforts to produce unique or emotional synthesized speech in TTS, other developers have focused all their energies on improving accuracy. Lessac Technologies offers the most natural-sounding voices, according to Tetschner. The company runs on the teachings and principles of Arthur Lessac, a voice coach for orators, actors, and singers, and offers a new perspective on TTS. Lessac operates under the premise that better speech is achieved by feeling what one speaks, being constantly aware of the voice’s changing expression or intonation. Gary Marple, a Lessac Technologies cofounder and its current chief technology officer and chairman, says the company is the "technological application of Lessac’s voice coaching," produced via synthesis and TTS. Since the company’s founding in 2001, it has pushed to create a natural voice and to tackle accuracy and intonation.

"Most [synthesized voices] are flat monotones and basically are not in use beyond four or five words because the listener won’t accept it," explains John Reichenbach, president of Lessac Technologies. "We will be successful if we can demonstrate that our product sounds just like a human. If we can only demonstrate that we’re 20 percent better, we’re not successful."

The company still plans to further its synthesized voice offerings by developing various personas for consumer use. For now, it will continue to focus on expanding its dictionary, a step Reichenbach thinks is more important than jumping into the market with a dizzying array of different personas. "[Other vendors] out there use about 60 English phonemes; we have extended that to slightly over 1,000," he says. "Nearly all of them are vowel variations, and it’s a natural extension based on the Lessac method. Each voice is normed, or has a prime frequency, or amplitude, and what we’re doing is determining how far up it goes in amplitude, resonance, and what the frequency jumps are. They’re all versus a normed voice; in this case our demonstrator is based on a specific woman."

Others, like IBM’s Tomasco, believe different issues within TTS and synthesized voice take precedence over perfection. Maintaining baseline quality over a wide range of applications, while balancing memory footprint against TTS quality, is a prominent problem with TTS and plays a major role in the newest applications, he says.

Whether the end user is an executive with a hectic schedule or a college student who makes frequent, last-minute trips, in-vehicle GPS systems like TomTom, Magellan, and Garmin have made a larger-than-life footprint in the navigation marketplace. Powered by more than a bevy of preloaded street and highway names and numbers, GPS systems are among the few applications on the market that could not operate without TTS technology. The two make for a perfect match: GPS units need a way to pronounce an almost never-ending stream of directions and proper street names. While commonly used words can be preprogrammed, more unusual words or phrases can be rendered only by stringing together preprogrammed phonemes with a strong TTS system.
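The split described above, stored pronunciations for common words and phonemes strung together for everything else, can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the lexicon entries, the ARPAbet-style phoneme symbols, the toy spelling rules, and the function names are all assumptions made for the example.

```python
# Hypothetical lexicon: common navigation words with stored pronunciations.
LEXICON = {
    "turn": ["T", "ER", "N"],
    "left": ["L", "EH", "F", "T"],
    "on": ["AA", "N"],
    "main": ["M", "EY", "N"],
    "street": ["S", "T", "R", "IY", "T"],
}

# Toy letter-to-sound rules (illustrative only; real systems use far richer rules).
DIGRAPHS = {"sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ee": ["IY"], "oo": ["UW"]}
LETTERS = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def letter_to_sound(word):
    """Build a phoneme string for a word by applying the spelling rules in order."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:          # two-letter rules take priority
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        elif word[i] in LETTERS:      # otherwise fall back to single letters
            phonemes.extend(LETTERS[word[i]])
            i += 1
        else:                         # skip characters that have no rule
            i += 1
    return phonemes


def pronounce(word):
    """Return (phonemes, source): stored entry if known, rule-based guess if not."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    return letter_to_sound(key), "rules"


def phonemize(text):
    """Phonemize each whitespace-separated word of a prompt."""
    return [(word, *pronounce(word)) for word in text.split()]


if __name__ == "__main__":
    for word, phonemes, source in phonemize("Turn left on Shoob Street"):
        print(f"{word:8} {source:8} {' '.join(phonemes)}")
```

In this sketch the made-up street name "Shoob" misses the lexicon and falls through to the spelling rules, which is the case where pronunciation quality depends most on the TTS engine.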

GPS Is Here to Stay
Across the board, most experts acknowledge the staying power of TTS within the GPS market. All the elements for an ideal situation for a TTS application are present: driving is a hands-busy, eyes-busy activity, and a GPS system must have the ability to pronounce thousands of words.

"The navigator application is a winner, big-time," Tetschner says. "For a whole trip, you can’t be watching the map while you’re driving along. That’s why TTS is a big winner for a hands- and eyes-busy operation."

GPS sales numbers speak volumes. Market research company RNCOS predicts that the GPS market will become a $757 billion industry by 2017.

Extreme growth within TTS systems for GPS units also comes with its own set of obstacles, including size and quality, Campbell says. Despite widespread adoption, end users still expect a higher level of quality since most consumer units fall in the $200 to $400 price range. "You’re selling a voice [in a GPS system] with a [$55,000] Cadillac Escalade," he says. "The same expectations, or higher, are present."

In addition, quality can be undermined by the unit’s size, as smaller units must cram in massive amounts of data while still providing the same level of quality. "You’re looking to take a product that’s working pretty well on the server side and squeeze it into such tight constraints that size-wise it becomes an issue," Campbell explains. "What we can do well on a server doesn’t necessarily mean it’s going to sound that good on a device."

Nevertheless, despite the need for further tuning and problem-solving, the GPS market remains a steady and reliable avenue for TTS technologies.

Other emerging markets for TTS include toll-free directory assistance services like GOOG-411, and technologies that can read lengthy amounts of text from blogs, newspapers, books, or Web sites. Each has different requirements than personal navigation devices. While an end user might be willing to sacrifice exact pronunciation for accurate directions, a TTS system that reads complex information and requires proper pronunciation, intonation, and speed demands a more sophisticated application.

This is where companies like Lessac Technologies say their services will excel. With a focus on complex algorithms and dictionaries, Reichenbach acknowledges that his company would not "add any value whatsoever" to the GPS market, but that its focus right now is solely on long-format text. "News articles, blogs, or books—those are expensive, so you have to be able to process text for its meaning, as well as phrasing, and assigning an intonation pattern that will help clarify the meaning," Marple says. "Right now, you can’t do TTS books; they sound awful, and people lose their way and can’t comprehend what was said in the last paragraph. We believe we have that solved. We see ourselves as finally enabling a whole group of market applications."

Doing More with Less
This market, Talknowledgy’s Henton says, could be a lucrative avenue for companies, providing they propose a lower-cost alternative to the technologies currently offered. Most people may not want to have the entire newspaper read aloud to them, but those with eyesight problems have a vested interest in the synthesis of text for longer articles, she maintains.

"The number of people having sight issues is growing, as is the aging population," she states. "Those blind at birth or by accident desperately need to have a wider array of technologies offered to them. The real problem in the past has always been pricing. The people who make specialized, integrated devices for the blind have done it on a small scale, so it’s hard for them to make a profit."

As more countries and businesses deal with increasingly multilingual populations, TTS vendors have been faced with adapting their dictionaries to work with foreign languages. While Spanish remains ubiquitous in most U.S. contact centers, TTS vendors have also responded to emerging markets like China and India, prompting them to expand their language options. In September, for example, GPS provider Magellan released three in-vehicle navigation units in China.

TTS and speech-to-text are also expected to play vital roles in countries where low literacy rates are common, where people might have difficulty entering contact information or producing and reading text or email messages. Countries with fast-paced economic growth also might require more advanced TTS systems to accommodate new business contact centers.

"Increasingly, the Asian and southeast Asian markets are needing TTS," Henton says. "In India, where they have an excess of 40 mutually unintelligible languages, as well as English and Hindi, they need some means of getting information over the phone, or a language that can be used in IVR systems that they can interact with."

Henton adds that some users in other countries have used TTS as a tool for improving their pronunciation of English words and phrases. Users can write something in English and use TTS to "hear how it sounds when spoken by a native speaker of English."

With each new language comes another set of challenges, like new dictionaries, proper voice tone, volume, intonation, and dialects.

But despite a number of disagreements and problems with the technology, TTS remains a commonplace and sensible solution for a highly mobile population. GPS navigation systems have provided one strong outlet for TTS, but others remain a challenge. Developers must continue to produce voices that not only sound natural, but can mimic the unique pronunciation and tone of a human voice. Whether one flawless voice or 1,000 one-of-a-kind voices hold the key to further TTS success, developers acknowledge the technology’s faults, but remain confident in its progress and increased growth.

"I believe in the industry that we can advance, that speech and text is everywhere all around us, and the ability to get access to that is a value," Campbell says. "For the revolution to happen, it should be demanded by people who want their information to be portable, that all that information be seamlessly morphed between the different modalities, visual and auditory."
