The Gift of Speech


Remember the days when your iPod only played music and your Xbox was just for video games? Well, the days of single-mode devices are coming to a close as more multimodal devices hit the market, some just in time for the holiday season. Surfing the Web on an iPod Touch and listening to music on an Android mobile phone are just a couple of ways users can merge leisure and business activities on the same device. And, of course, speech has become part of most, if not all, of these devices, emerging as the natural interface that will change the way people interact with them.

As devices become even smaller, their keyboards, physical or touchscreen, are shrinking too. The new iPod Nano, for example, is barely the size of a Post-it note.

“It’s hard enough on a BlackBerry to type a long, elaborate email, but it’s really hard on these new touchscreen devices,” says Matt Revis, vice president of product management at Nuance Communications. “Speech allows you to create content a lot faster. This is something that professionals really value. They are the ones who are heavy messaging users, text messaging with their colleagues and, even more so, emailing.”

Revis points out that Nuance’s Dragon NaturallySpeaking software for high-end BlackBerrys allows users to dictate email. Also, the new T-Mobile myTouch 3G Slide allows users to send text messages using a natural form of speech. “[It has] very advanced speech capabilities, which allow you to interact with the device in a much more unconstrained, free-form way,” Revis asserts. “You can say something like, Send text to Rebecca, I’m going to be 15 minutes late to the meeting.”

The myTouch 3G Slide also lets users search the Web in one shot, which Revis says makes searching easier and more efficient. He adds that the smartphone has an application that lets users receive, listen to, read, and respond to texts by voice. “Business travelers are on the go. They’re driving around. They’re in airports or hurrying from meeting to meeting, and the ability to create messages, not just more quickly, but in a variety of settings and states of movement, is also really important to that ‘prosumer’ [professional consumer] segment,” Revis says.

Grant Shirk, director of product management at Microsoft/Tellme, notes that speech will continue to play an important role as a natural user interface, which is evident in the recently launched Windows Phone 7. In the past, speech has felt a bit like a “bolt-on” experience, but he says that’s going to change. “Microsoft’s approach to speech as a natural interface is to integrate it deep into the product experience, to make sure it is as seamless and useful as possible for our customers,” Shirk says.

Pressing the Start button on the device brings up the speech user interface (UI), and the user can then speak commands into the device. “We have focused on three scenarios for mobile users: making calls, navigating the UI, and completing tasks with search,” Shirk says.

Xbox for Business

Shirk also discussed a number of other new Microsoft products that will not only make good gifts but could have some surprising business applications, too. For example, Kinect for Xbox 360, perhaps one of the most talked-about new products, combines speech with gesture in a controller-free gaming experience. Shirk says this interface will engage new users because they don’t need to know how to use a controller to play. Kinect will include games such as “Kinect Adventures,” “Kinectimals,” and “Dance Central.” Beyond games, users can access ESPN, Zune, and other sources without a remote and use voice commands to play, pause, and fast forward.

The bundle to be released for the holiday season will have a redesigned Xbox 360S, the Kinect device, and the “Kinect Adventures” game, but Kinect can also be added to an existing Xbox 360 console, Shirk adds. “It gets you out of your seat,” he says. “What other system can teach you how to dance?” 

While most users wouldn’t connect the Xbox to business, the system has some unexpected business uses. Kinect comes with a built-in videoconferencing capability, plus a camera and Internet connection. “The future of walking into the family room and waving to your television and starting a face-to-face conversation with someone across the country or across the world is huge; it’s huge for person-to-person communication, and it’s huge for the way businesses need to think about new channels that they have to communicate with their customers,” Shirk explains.

But Xbox Kinect isn’t the only device to use speech and gestures, according to Mazin Gilbert, executive director of research at AT&T. AT&T’s new business application, Speak4it, combines speech and gesture for more customized searches. “What’s cool about this application is that we’ve very recently applied multimodal,” says Gilbert, who explains that if a user goes to the map and asks for something while gesturing at the same time, the application can find something very specific. “I can say, Look at a map and say, Dentist and draw a street on this map, and it will find dentists along that street. I can ask for plumbers, and I can circle a small area in this map and it will find plumbers only in that area.” 

Gilbert says this app differs from Kinect on the Xbox because, “to the best of my knowledge, that’s the first application ever to be deployed and to have multiple modalities to be used simultaneously.”

Like Shirk, Gilbert also says he sees a future where voice and gesture could be used seamlessly. “Once I can combine gesture and speech I can do a lot of stuff,” Gilbert muses. “I can be sitting in front of my TV with my voice remote and I can draw stuff on the TV, and now I can use speech and gesture to draw simultaneously in ways you couldn’t do before. Before you either spoke or you clicked. That is very constraining.”

In addition, though they’re not gesture-enabled, various voice-controlled remotes are on the market. For example, RCA has a voice-controlled remote that can perform a variety of tasks, from changing channels to responding to the words play and pause. It seems almost any task that can be done on a traditional remote can be done on RCA’s voice remote as well. It even comes with built-in verbal directions and the ability to customize commands. It supports up to six devices, including a TV, DVD player, and satellite TV receiver, and it also has traditional buttons.

Though these devices have been on the market for a while, they might be getting even more advanced very soon. For example, according to Gilbert, AT&T has come up with some interesting prototypes, though he couldn’t comment on when the services might be available. “I am hoping it will be soon that you will see this service on the market,” he says. 

Over the past two years, the prototypes have changed radically, Gilbert says. “We have a network-based solution, where the intelligence of the voice recognition is in our clouds and not in the device.” Gilbert says the cloud-based option would allow AT&T to scale the service at a low cost for customers, as well as integrate the technology with live programs and the Internet.

Furthermore, Gilbert notes that other changes would make remote technology even more advanced. “We have advanced natural language technology to enable users to say phrases like, Rerun Project Runway tomorrow night,” he says.

Gilbert also points out that combining speech recognition and natural language understanding will become another way people can search the Internet and television, and the prototypes have capabilities that move beyond the actual remote control. “[They are] extended to mobile phones and emerging devices like iPads, iPhones, etc. This makes the technology available to a large market of users with existing hardware devices,” he states.

Generating Buzz

What Gilbert calls emerging devices, such as Amazon’s Kindle (and other e-readers) and Apple’s iPad, have become some of the most desirable products on the market. After much controversy over whether a text-to-speech (TTS) option would violate publishers’ rights, Amazon has made TTS available on titles where publishers have allowed it.

While some have criticized the iPad, saying it might not be a necessary device, it’s hard not to be seduced by its lightweight, sleek, and vivid design, as well as the array of tasks that can be performed on it. Indeed, it does look like something imagined for the future. With an iPad users can play games, check email, read books and magazines, and listen to music. They also can have books and other text read to them using apps like Speak4it. 

However, users don’t necessarily have to buy an app to get TTS on the iPad, because Apple has included a built-in feature called VoiceOver. Meant to assist the visually impaired, VoiceOver reads aloud anything that is tapped on the screen: not only icons but any text, a feature that appeals to many users who aren’t visually impaired. Apple’s other speech capabilities, though, can be obtained only by purchasing applications from the iTunes Store.

According to Ryan Joe, an analyst at Ovum, Apple doesn’t have its own speech recognition engine, which is why third parties produce most of its speech capabilities.

If Apple wants to incorporate more complex features, then it will have to get its own speech recognition engine, Joe contends. “Right now Apple seems to treat speech interfaces as a value-add, while on the iPhone 3GS and iPhone 4, to pick two products, the features are pretty basic: voice dialing and music controls. However, the purchase of Siri gives Apple a very good virtual assistant product with a voice interface. I anticipate Apple integrating the Siri solution into future iterations of the iPhone and iPad.”

As far as phones go, Google’s Android is also a popular platform that is fun to use but has business benefits as well. Mike Cohen, who manages speech efforts at Google, contends that speech will become more ubiquitous over time. “What we really want, ultimately, is to have a situation where end users, any time they’re in a usage scenario where they [would] prefer to speak [rather] than type, [that would be] a possibility,” he says.

Google’s Android applications cover a range of actions end users can accomplish by voice, and according to Cohen, voice search is available on a range of phones: iPhone, BlackBerry, Nokia, etc. More recently, Cohen says, alongside Voice Actions there has been a breakthrough in integrating speech with applications: speech can now be used in them without any work from the developer. “All of a sudden the developer didn’t have to do anything,” Cohen says. “[You] bring the keypad up and, lo and behold, people can talk to it. To me, that was one of the most exciting developments in a very long time in the speech technology field.”

Voice Actions lets a user do with speech nearly anything that could be done on a keypad. “You can say things like, Search for Chinese food in Palo Alto or Give directions to Mandarin Gourmet,” Cohen says. Users can also find music, call friends, go to specific Web sites, or send text messages. In other words, a wide range of actions, for business or pleasure, can be performed using the service.

Giving Direction

Users can also integrate their devices into the car. TomTom’s GPS app for the iPhone will be available with updated maps, more detailed traffic information, and advanced lane guidance that shows drivers exactly how they will be turning. The app also integrates phone calls so that drivers can answer them without distraction.

While drivers can get GPS capabilities on smartphones by downloading apps, more cars than ever are arriving already equipped with eyes-free/hands-free speech capabilities. Beyond the familiar hands-free, eyes-free benefits of speech in the car, speech can also simplify access to in-car entertainment. “Who doesn’t want a new car for the holidays?” Shirk says, adding that the Kia UVO will let drivers plug in any media device and upload music to the car’s media drive, which holds up to 1 gigabyte of data. Users can then access any track simply by asking for it. In the past, Shirk points out, users had to specify the source, whether it was radio, Zune, or an MP3 player. “Now it doesn’t matter where it lives. You have the ability to say, Play Radiohead, and...it will play Radiohead,” he continues. “It is both a really cool speech engineering [design], but also a really good user-centered design.”

Should users want something portable that’s not a smartphone app, Garmin’s Nuvi 550 isn’t bound to the road. Designed to guide users anywhere, this speech-enabled GPS device can be taken on off-road journeys by land (on foot or bicycle) or even by sea. The Nuvi 550 has a waterproof battery compartment to further protect it on boating excursions. Like other Nuvi models, the 550 also features turn-by-turn directions, voice response, and the “Where Am I?” feature that tells users exactly where they are.

Fun Devices and Toys Make Use of Speech, Too

Not all of the speech-enabled devices you might find under the tree this year have business applications. Some of these new gadgets, gizmos, and doohickeys offer speech in new, fun, and exciting ways that go beyond the conference room or office.   

>>> The folks at Hammacher Schlemmer offer an interactive personal assistant called iTalk, which lets users record reminders in their own voice. iTalk asks, Can I help you? and responds to users’ answers. It also responds to certain commands: record reminder, play reminder, today’s reminder, reminder off. Those who are too lazy to turn over in bed can ask iTalk what time or even what day it is. Though this device has some clear business benefits (it’s so easy to lose track of meetings or dates), the downside is that it doesn’t look very portable.

>>> Even a digital thermometer seems outdated next to the Grill Alert talking remote meat thermometer from Brookstone. A wireless transmitter verbally reports the temperature from the integrated grill thermometer. The device clips to a belt and lets the cook be up to 300 feet away from whatever is on the grill.

>>> The Toy Story movies never seem to get old, and each sequel brings a new toy with more juiced-up features. This year’s Buzz Lightyear doll not only talks to you but also uses voice recognition to respond, moving his head and talking back.

>>> Hasbro’s FurReal Friends Biscuit My Lovin’ Pup is a virtual pet. Priced at $179.99, Biscuit is a bit more expensive than the usual plush toy, but it is still a cheaper virtual pet than Paro the robot seal (as seen in the Overheard/Underheard section of our September/October 2010 issue), which sells for $6,000. But Biscuit isn’t meant to be therapeutic. Instead, Biscuit is an interactive toy designed to mimic a real dog. For example, he answers your voice commands to sit up, lie down, or speak. Enabled with speech recognition, Biscuit will also give his paw when prompted by voice command. Unfortunately, the kitty version of this toy doesn’t do much but purr and, alternately, hiss at you.

While these toys are a bit on the pricey side, they are a lot cooler (and less creepy) than the Teddy Ruxpins of the 1980s.
