Golden Words

On August 8, a white figure jumping out of a background of bright red blazed across more than 2 billion television screens. Opening ceremonies of the 2008 Summer Olympic Games in Beijing had begun. For most viewers across the world, this emblem simply signified the start of the games. However, for the 383 million people watching in China, this character meant something much more complex.

The figure appeared to be a man running with his arms outstretched, but it was actually a stylized version of the character that represents the word for "capital" in Chinese. When paired with the character that means "north," it morphs into the word for Beijing. In the form of the red and white image, created by prominent Chinese graphic designer Guo Chunning, the symbol means "dancing Beijing," and its red background is meant to evoke a seal symbolizing the promise of the city. The likeness to a person is meant to capture the idea of a hero. The outstretched arms mean "welcome," and the curving lines represent the "form of a dragon."

Imagine how you would say that character to capture its full meaning. Imagine how this would sound in Chinese so that it could be understood by all speakers, many of whom speak dialects that are mutually unintelligible.

Now imagine how it would be spoken by your average text-to-speech (TTS) system.

For an event that was expected to bring more than 2 million people to Beijing, the city spent a great deal of time, energy, and money—a total of $40.9 million—to ensure it was prepared. A keystone in this preparation was the goal laid out in the Action Plan for the Olympic Games, created by the Beijing Organizing Committee for the Games of the XXIX Olympiad (BOCOG): "That anybody, at any time and in any place, can enjoy the information service that is affordable, comprehensive, personalized, and multilingual."

This was no small task considering the volume of information that had to be delivered to staff, sponsors, volunteers, athletes, guests, and tourists. To make good on its promise, Beijing needed to have information available not only visually, but audibly as well for those who could not read, could not see, or were on the go. However, extremely dynamic information, such as schedule changes and driving directions, would make prerecording impossible.

This is where iFlytek came in. As the leading developer of speech and language technology in China, creator of the Chinese Speech Synthesis Markup Language (CSSML), and maker of a nationally award-winning TTS system, the Chinese company was chosen to provide the TTS technology for the Olympic Games. With iFlytek’s technology and support, Beijing was able to achieve its goal and deliver pertinent information throughout the games.

Olympic Game Plan
In 2003, CAPINFO, the main provider of data and communication services in Beijing, began developing a prototype to submit to the BOCOG for approval. iFlytek was already on board for all TTS needs. "The whole system was quite complicated," says Qiang Bai, vice president of iFlytek, who cited five central aspects of the project:

GPS services, delivered primarily on handhelds or tourists’ PDAs, required TTS support for those driving or unable to read their screens. Bai notes that this was especially important due to road closings that changed daily. "The mapping information was a very strong part: to be able to tell you the traffic is being controlled so if you want to go here, you need to use this other road," he says.
Services for visitors to receive information related to the games as well as tourism support were also needed. Delivered via kiosks set up around the city, on handhelds, and through the official Beijing Olympics Web site, iFlytek’s TTS was used to make this information accessible via audio delivery.
Information and training services for volunteers needed to be delivered. Beijing recruited more than 5,000 volunteers to support media operations, act as VIP escorts, and provide translations, medical help, directions, and more. TTS was used in kiosks, call centers, and on a Web site where volunteers went for instructions. "There were so many volunteers during the Olympics, and because many were young students, they did not know the information that they needed to have," Bai says.
For the numerous sponsors, including McDonald’s, Nike, and Coca-Cola, Beijing created several call centers, using TTS to deliver information to their personnel in both Chinese and English.
Translation services were required for those visiting Beijing, a particular issue for Westerners because the Chinese writing system makes looking up words in a dictionary difficult. (For more on translation development, see "You Say Potato, I Say...")

To ensure that all information was speech-enabled, iFlytek worked to embed its TTS technology into the mobile, kiosk, online, and call center platforms. Bai notes that this process was quite simple. "We did not need to do anything fancy, as our TTS has been around for many years and is the best on the market for Chinese," he says.

After the prototype was approved in 2005 by the BOCOG, little customization was needed outside of providing additional TTS engines and completing testing to ensure that the engines worked properly. "Very standard work," Bai says.

However, to understand how it was possible for the phrase "Welcome to Beijing" to be heard with a click of a button throughout China’s capital city on August 8, we must look back several decades to the creation of this technology and back nearly 3,000 years at one of the oldest, most complex, and most rapidly changing languages in the world today.

The Hurdles
The first speech synthesis devices built for a computer were developed in the late 1950s for Western languages. TTS systems for Chinese, however, came later in the game, beginning in the wake of former Chinese Communist Party leader Deng Xiaoping’s Open Door Policy, which he launched in 1978 to allow foreign trade and economic investment, and which led to a rapid computerization of the nation.

In the mid-1980s, with the support of the Chinese government, iFlytek began developing speech synthesis systems, and it has since developed the most widely used system for the Chinese language worldwide. "Speech technology in China is a relatively new market," Bai notes. "We’ve had to work very hard to educate and nurture the market."

The rapid development of iFlytek’s CSSML is impressive given not only the cultural challenges the company faced but also the difficulty of the Chinese language itself. The Chinese writing system comprises nearly 40,000 characters, although speakers recognize only 6,000 to 7,000. Compared with the 26 letters of the English alphabet, the sheer number of characters a TTS system must recognize is daunting.

During the 1950s, China oversaw a massive overhaul of its written language, leading to the creation of Simplified Chinese. The characters, however, are still quite complex, and their sheer number can make computer usage tedious. Sue Ellen Reager, CEO of @International Services, a global translation and localization services and solutions provider, says that when typing in Chinese, "If you type honn, you hit a key, and boom, up on your screen appears 100 characters, and you have to wander through them looking for the one you want."

Complexity in Chinese extends to the spoken word as well. "Because of the interaction of tone and intonation [in Chinese], the Chinese pitch model is much more complex than in other languages," explains ZhiWei Shuang, a researcher at the Speech Technology Group for the IBM China Research Laboratory. For example, an English speaker would hear ma as the same word, regardless of the pitch or slide of a person’s voice. However, depending on the height and contour of the pitch, a Mandarin speaker will understand this syllable to mean "mother," "horse," or "scold."
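To make the tone example concrete, here is a minimal sketch of how a lookup keyed on both syllable and tone separates the three meanings of ma. The table, the function name, and the use of pinyin tone numbers are illustrative assumptions, not a description of any vendor's system.

```python
# Toy table (an assumption for illustration): the syllable "ma" maps to
# different words depending on its Mandarin tone. Tone numbers follow the
# usual pinyin convention: 1 = high level, 3 = falling-rising, 4 = falling.
WORDS_BY_TONE = {
    ("ma", 1): ("妈", "mother"),
    ("ma", 3): ("马", "horse"),
    ("ma", 4): ("骂", "scold"),
}


def gloss(syllable: str, tone: int) -> str:
    """Return 'character (meaning)' for a syllable and tone, or 'unknown'."""
    entry = WORDS_BY_TONE.get((syllable, tone))
    if entry is None:
        return "unknown"
    character, meaning = entry
    return f"{character} ({meaning})"


if __name__ == "__main__":
    for tone in (1, 3, 4):
        print(f"ma{tone}:", gloss("ma", tone))
```

A synthesizer has the reverse problem: given the written character, it has to produce the right pitch contour so that listeners hear the intended word.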

Perhaps the most daunting task for the Chinese TTS creator is that Chinese is not one language, but a group of thousands of dialects. Many of these dialects are arguably separate languages that are almost completely mutually unintelligible. Therefore, a true Chinese TTS system must incorporate hundreds of languages. Dan Burnett, director of speech technology at Voxeo, says this creates problems for speech synthesis. "You cannot tell from the writing system which dialect is to be pronounced," he says. "We have similar problems in English. Obviously there are words that are spoken differently in the southern part of the U.S. than in New England, but this problem is much, much worse in China."

The final hurdle is in the structure of written Chinese. "With English it’s really obvious where one word begins and one word ends," Reager says. "There’s a space in between and punctuation marks. In Chinese, it appears as one long, never-ending group of characters." Therefore a TTS system may group characters incorrectly, creating nonsensical sentences. For example, "Michael Phelps" written in simplified Chinese reads one way. However, the first character alone translates to "fragrant."
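The grouping problem can be sketched with forward maximum matching, a classic dictionary-based segmentation technique. It is used here only as an illustration and is not necessarily what iFlytek uses. The lexicon, the function name, and the sample sentence are assumptions; 菲尔普斯 is a common transliteration of the surname Phelps, and its first character, 菲, can mean "fragrant" when it stands alone.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching over an unspaced string.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    sentence = "北京欢迎菲尔普斯"  # "Beijing welcomes Phelps", with no spaces
    with_name = {"北京", "欢迎", "菲尔普斯"}
    without_name = {"北京", "欢迎"}
    print(segment(sentence, with_name))
    print(segment(sentence, without_name))
```

With the name in the lexicon, it stays together as one word. Without it, the segmenter falls back to four separate characters, and a synthesizer could then read each one as an independent word.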

With no word boundaries, the technology can easily become confused. Because of this issue, proper names in other languages often remain written in the Latin alphabet in Chinese sentences. Any TTS software used, therefore, must understand this alphabet as well and determine how to pronounce these names so that Chinese speakers will understand.

During the past 20 years, iFlytek has overcome these obstacles through extensive research, data collection, testing, and platform improvements. "As we have become bigger, we’ve improved our TTS, and when our TTS improves, we can become more successful and have more data, and can then come back and tune up our service again," Bai says. Working with a number of Chinese language experts and creating a giant speech database of more than 3,000 sentences, iFlytek has combatted issues of language size and tonality, and has been able to develop 1,000 separate systems for the disparate dialects of Chinese.

To address the issue of word boundaries, iFlytek has added word elements to its systems to allow authors to use word-separation markers to show where each word begins and ends. CSSML allows for phrase, sentence, and paragraph elements to ensure proper separation along these lines. Additionally, iFlytek has created Interphonic CE, a multilingual TTS system that can read a mixture of English and Chinese.
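As a rough illustration of what author-supplied word markers might look like, the sketch below wraps pre-segmented words in paragraph, sentence, and word elements. The article does not give CSSML's actual element names, so the tags used here (speak, p, s, w) are borrowed from the W3C's Speech Synthesis Markup Language as stand-ins, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def mark_up(paragraphs):
    """Build markup from nested lists: paragraphs > sentences > words.

    Tag names are stand-ins borrowed from W3C SSML, not confirmed CSSML.
    """
    root = ET.Element("speak")
    for sentences in paragraphs:
        p = ET.SubElement(root, "p")
        for words in sentences:
            s = ET.SubElement(p, "s")
            for word in words:
                # Each word gets its own element, so the boundary is explicit.
                ET.SubElement(s, "w").text = word
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # One paragraph, one sentence, three words: "Beijing welcomes you."
    print(mark_up([[["北京", "欢迎", "你"]]]))
```

Because the author marks the boundaries, the synthesizer no longer has to guess where one word ends and the next begins.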

With its superior TTS system in place, iFlytek delivered text-to-speech systems that worked seamlessly with the infrastructures required to support the Olympic Games. Due to extensive research and development put into the creation of iFlytek’s speech synthesis, thousands of visitors and staff were able to gather audible and pertinent information through a number of channels starting at the opening ceremonies. For the Chinese TTS used in Beijing, iFlytek chose a standard, official-sounding Mandarin, estimated by Reager to be understandable by roughly 70 percent of China’s population.

To the surprise of many, the speech synthesis services were embraced during the games. Bai says that at first he was hesitant to believe that citizens would use the technology. Traditionally, "Chinese people don’t have the habit of using automated services," he says. "I have an answering machine in my office, and during my three years in that office at iFlytek, only one meaningful message was left."

Although he could not offer concrete numbers to show how many people used the TTS system, "It turns out that there were many more people [willing to use the services] than we thought," Bai says. "Our partners and we were very, very pleased."

April Fong, a student at Shanghai’s School of International Relations and Public Affairs and an attendee at the games, agrees, noting that the volunteers were especially knowledgeable and well-trained. "Everywhere I turned there were always Olympic volunteers. I think Beijing did almost flawlessly for being a host city." (For more spectator perspectives, see "I Was There.")

Looking Forward
Although the Olympic Games concluded on August 24, iFlytek’s work continues, as does the proliferation of TTS technology in China and across the globe. In 2010, China will host the World Expo in Shanghai. The motto for the expo is "Better City, Better Life," and Shanghai will rely on iFlytek to support this objective.

Bolstered by the widespread acceptance of automated technology in the Beijing games, iFlytek is currently developing new TTS projects to prepare Shanghai. Although the TTS deployed in Beijing was in Chinese and English only, Bai says that for the World Expo, the city will have auditory information available in a multilingual format, for which iFlytek is preparing to partner with other speech synthesis providers worldwide.

iFlytek will also provide the technology for online language assessments created by the Shanghai government. Because the people of Shanghai speak Shanghainese (a dialect of Wu), the Chinese government has created a program to encourage them to learn Mandarin as well as English. "We are making this Internet-accessible, so a citizen on his home computer can do a self-assessment," Bai says. "The logic is not to test, but to encourage them to learn the languages."

The Olympics and upcoming Shanghai World Expo demonstrate only a few ways TTS is becoming a presence in our day-to-day lives. Bai sees the largest growth area for TTS in cell phone use. iFlytek is currently partnering with China Mobile to create a service in which a user would say the name of a song into her phone, and the phone would play it back instantaneously. "Services similar to this show the growth and power of speech technology," he says.

Reager sees the power of TTS extending even further to break down language boundaries internationally, making it possible for airports, public transit, hotels, tourism departments, and taxicabs to have information delivered audibly in a multitude of languages by removing the cost of translation and recording. "TTS should be every place in the whole world. It’s not tomorrow—but it’s not very far away," she says. "Language should no longer be a barrier. Technology is coming together to remove it completely."

Now that is truly something to dance about.

You Say Potato, I Say...

Despite Beijing’s efforts to make visitors feel comfortable in China, traveling to a foreign country can be unsettling for anyone who does not speak the language. It is with this principle in mind that Interactive Systems Laboratories (InACT)—a developer of multimodal and speech-based user interfaces that improve human-computer and human-human communication, with locations at Carnegie Mellon University in Pittsburgh and the University of Karlsruhe in Germany—created the Digital Olympics Speech-to-Speech Translation System. The handheld device, which can translate Chinese to English and English to Chinese, was first presented at the Beijing-supported Special Programme for Construction of Digital Olympics in 2005.

Alan Black, an associate research professor at Carnegie Mellon’s Language Technologies Institute, was one of the people who worked on this project. "One of the biggest issues for Western people going to China is that they can read nothing; they have no idea about the language," he says.

A device that renders a Chinese speaker’s words in an automated English voice would overcome this obstacle. To develop this translation service, InACT took special care to ensure that the synthetic voices sounded friendly and appropriate, and that the device would recognize all pertinent proper names, such as hospitals, streets, and hotels. Black says that this technology extends to support any traveler for whom language is an issue, noting, "Being able to communicate just makes it better."

I Was There

>>> April Fong, student, School of International Relations and Public Affairs, Shanghai: "I was brought up with Cantonese, and I’m learning Mandarin, but I wouldn’t say I’m fluent. I think for the average person it was very easy to get around."

>>> Ryan Sullivan, graduate student: "Overall I was very impressed with the Beijing Olympics. Everyone was extremely friendly and proud to do their part for the games. I left my temporary residence card at one of the events, and one of the volunteers found it and called our hotel to return it the next day, which really amazed me."
