July 1, 2010
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

Adding a Voice to Tweets

Much has been written about the advantages and disadvantages of the various types of online social media and how to use them. Businesses use social media to establish and maintain good relationships with customers, promote their goods, and provide supporting services. Customers use social media to express their needs, evaluate products, and critique services. Social media sites enable businesses and users to achieve their goals more easily and quickly, and with a wider audience, than was previously possible.

Traditionally, businesses and consumers pursued these goals through spoken conversation, whereas social media software relies primarily on text. Except for the occasional exclamation point or question mark, text messages carry none of the intonation and emotion of speech.

Verbal messages have several advantages over text: When users’ eyes and/or hands are busy, they can speak and listen rather than type and read. Verbal messages can be created and reviewed by sight- or hearing-impaired individuals. More important, verbal messages can contain verbal intonations and emotional nuances—urgency, seriousness, and humor, for example—that are missing from the textual version of information typically exchanged by businesses and customers.

How can we provide the missing emotional nuances to online social media systems? At least two approaches exist, both of which involve audio:

Send and receive audio messages: Users can speak and listen to Twitter messages (“tweets”) while driving, without reading or typing on a mobile device.
Record audio messages into a database to replay later to recipients.

Speech technologies will be able to convert audio messages to and from textual formats. Automatic speech recognition algorithms convert speech to text, which can be integrated with other text messages and then searched, sorted, and filtered using database technologies. Conversely, any text message can be converted to speech using speech synthesis algorithms. Thus, users can access their messages using either text or speech. These conversion capabilities let drivers create and listen to verbal messages or tweets while driving, and read and type text when not driving.
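To make the round trip concrete, here is a minimal Python sketch of a message store that keeps every message in text form, so typed and spoken messages can be searched together and played back as either text or audio. Everything in it is hypothetical: `MessageStore`, `toy_recognize`, and `toy_synthesize` are illustrative names, and the two toy functions merely stand in for a real speech recognizer and speech synthesizer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional

# Assumed interfaces, not a real library: a recognizer maps audio bytes to
# text, and a synthesizer maps text to audio bytes.
Recognizer = Callable[[bytes], str]
Synthesizer = Callable[[str], bytes]


@dataclass
class Message:
    sender: str
    text: str                      # every message is kept in text form
    sent_at: datetime
    audio: Optional[bytes] = None  # original recording, if it was spoken


class MessageStore:
    """Holds typed and spoken messages side by side."""

    def __init__(self, recognize: Recognizer, synthesize: Synthesizer) -> None:
        self._recognize = recognize
        self._synthesize = synthesize
        self._messages: List[Message] = []

    def add_text(self, sender: str, text: str, sent_at: datetime) -> Message:
        msg = Message(sender, text, sent_at)
        self._messages.append(msg)
        return msg

    def add_audio(self, sender: str, audio: bytes, sent_at: datetime) -> Message:
        # Speech recognition turns the recording into searchable text.
        msg = Message(sender, self._recognize(audio), sent_at, audio)
        self._messages.append(msg)
        return msg

    def search(self, keyword: str) -> List[Message]:
        # Case-insensitive keyword filter, sorted oldest first.
        needle = keyword.lower()
        hits = [m for m in self._messages if needle in m.text.lower()]
        return sorted(hits, key=lambda m: m.sent_at)

    def listen(self, msg: Message) -> bytes:
        # Replay the original recording if there is one; otherwise synthesize.
        return msg.audio if msg.audio is not None else self._synthesize(msg.text)


def toy_recognize(audio: bytes) -> str:
    # Toy stand-in for a recognizer: pretends the audio bytes are UTF-8 text.
    return audio.decode("utf-8")


def toy_synthesize(text: str) -> bytes:
    # Toy stand-in for a synthesizer: pretends UTF-8 bytes are audio.
    return text.encode("utf-8")


if __name__ == "__main__":
    store = MessageStore(toy_recognize, toy_synthesize)
    store.add_text("ana", "Running late, start without me", datetime(2010, 7, 1, 9, 30))
    store.add_audio("ben", b"I am late too", datetime(2010, 7, 1, 9, 0))
    for m in store.search("late"):
        print(m.sent_at.time(), m.sender, m.text)
```

Passing the recognizer and synthesizer in as plain functions keeps the sketch independent of any particular speech engine; a real system would substitute its own.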

But don’t we lose the emotional nuances when converting spoken words to text? Advanced speech recognition systems will recognize nonverbal information, such as speaker identification, intonations, and emotion. Speaker identification systems are now widely used to identify users and verify they are who they claim to be. Speaker intonations and emotions can be detected and represented using EmotionML, a language under development by the World Wide Web Consortium’s Multimodal Interaction Working Group.
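As a rough illustration of what such a representation might look like, the Python sketch below wraps a recognized phrase in an EmotionML-style annotation. The element and attribute names, the namespace, and the "big six" category vocabulary follow my understanding of the EmotionML 1.0 specification as the W3C later published it and should be checked against that document. The `annotate` function is a hypothetical helper, and the emotion label and confidence are assumed to come from some upstream detector that is not shown.

```python
import xml.etree.ElementTree as ET

# Assumed from the EmotionML 1.0 specification; verify before relying on them.
EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"
BIG6 = "http://www.w3.org/TR/emotion-voc/xml#big6"
BIG6_NAMES = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}


def annotate(text: str, emotion: str, confidence: float) -> str:
    """Wrap recognized text in an EmotionML-style annotation (hypothetical helper)."""
    if emotion not in BIG6_NAMES:
        raise ValueError(f"unknown emotion category: {emotion}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    # Serialize with EmotionML as the default namespace (no prefix).
    ET.register_namespace("", EMOTIONML_NS)
    root = ET.Element(
        f"{{{EMOTIONML_NS}}}emotionml",
        {"version": "1.0", "category-set": BIG6},
    )
    emo = ET.SubElement(root, f"{{{EMOTIONML_NS}}}emotion")
    cat = ET.SubElement(
        emo,
        f"{{{EMOTIONML_NS}}}category",
        {"name": emotion, "confidence": f"{confidence:.2f}"},
    )
    # The annotated words follow the category element inside <emotion>.
    cat.tail = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(annotate("My order still has not arrived.", "anger", 0.8))
```

A speech synthesizer that understands the annotation could then read the category back and adjust its delivery, which is the step the next paragraph describes.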

Advanced speech synthesis systems will be able to use the speaker's identity to produce a voice similar to that of the user who originally spoke the phrase. For example, after losing his voice to cancer, film critic Roger Ebert began to use a personalized, synthesized voice based on his own voice patterns. Advanced speech synthesis systems will also use information encoded in EmotionML to insert intonations and emotions into synthesized speech. Avatars will make synthesized speech more interesting and personal.

The big advantages of using speech recognition and speech synthesis in social media include the following:

Users can create messages by speaking or typing and review messages by reading or listening.
Users can store messages in text form so they can be searched, sorted, and filtered using database technology.
Users can receive messages at a convenient time. Sometimes users might receive and create messages in real time, such as in chat spaces or telephone calls. Other times users might receive and create messages when it is convenient, as in voicemail and email.
Users who listen to verbal messages can hear nonverbal information, such as intonation and emotion, expressed in a voice similar to that of the original speaker.

Online social media systems will be greatly enhanced by capturing and presenting users’ voices. The additional emotional nuances available with voice, but missing from text, will increase the usefulness of social media software for business transactions.

Jim Larson, Ph.D., is a speech applications consultant, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and program chair of SpeechTEK 2010 and SpeechTEK Europe 2010. He can be reached at jim@larson-tech.com.
