Microsoft Puts Speech in the Spotlight


With the release of Xbox 360 Kinect, Windows Phone 7, and the revamped Ford Sync, Microsoft has plenty of speech news to talk about.

The software giant held a press briefing in New York December 1 to demonstrate these products and make some bold predictions about where speech technologies will be headed in years to come. Ilya Bukshteyn, senior director of marketing at Microsoft Speech Business, called the speech capabilities a natural user interface (NUI), which he likened to other technologies that have become ubiquitous and essential. Users will start to expect a voice-enabled interface from many of the products they use every day, he said.

“Our kids will not reach for a controller, keyboard, or mouse. They’ll point, gesture, touch, speak, and interact with technology naturally,” Bukshteyn asserted.

Bukshteyn also expects speech recognition to get smarter with each interaction as services move to the cloud. Microsoft’s cloud platform handles 11 billion utterances a year. “That’s 350 utterances every second. When that’s in the cloud, that’s 350 opportunities per second for our service to get better,” he said.

Along the same lines, Bukshteyn added that one of Microsoft’s goals is for users to be able to interact with Microsoft products without having to learn complex commands; instead, a system should train itself to the user’s needs, he said.

But beyond speech, devices are becoming increasingly multimodal, allowing for more than one interface or method of interacting with the devices and services connected to them. As an example, Pearson Cummings, senior marketing communications manager, demonstrated the voice capabilities in a Windows Phone 7 smartphone, which also incorporates a touch element: When Cummings searched for Chinese restaurants in the area, he could interact with the map using touch. “You can use voice to get to it, and then you can take action with your fingers,” he stated.

Bukshteyn likewise moved back and forth between gestures and speech to navigate through a game of Kinectimals on the Xbox. When watching movies on the Xbox, users can verbally command the console to rewind, fast-forward, pause, or play.

Moving ahead, Bukshteyn maintained that technology will become so smart that it will also understand user intent. “If I say, Find a Chinese restaurant, I’m probably looking to make a reservation,” he explained. “The system could say, When do you want to go and with whom? I say, I’d like to have dinner with Pearson, and it should be able to tell me when Pearson and I are available. Then [it would ask], Would you like me to make a reservation? It’s not just hearing, but using the brain in the cloud to understand your intent and help you accomplish that task faster. That’s where we see this going in the future.”

This brain also applies to the type of speech that is recognized. For example, the Xbox 360 uses an open microphone: During the demo, the people in the film were having a conversation, Cummings and Bukshteyn were having a conversation, and the microphone was wide open, yet the console took no unintended action. “In the past you’d have something that would say, Excuse me, I didn’t understand what you said, and the person would be throwing the controller at the screen,” Cummings quipped. “In this case, we’ve created the capability to really only understand what you’re trying to do when you’re trying to do it and filter everything else out.”

Cloud connectivity and an NUI that leverages speech also will allow users to access all kinds of services, such as music, directions, and phone calls, while driving. Speech as an interface makes a lot of sense in this context because it can help minimize driver distraction and, since it doesn’t run on a data plan, anyone can access services as long as they have a phone with Bluetooth connectivity.

Much like the Windows Phone 7 device Cummings had demoed earlier, the phone connected in the car had the ability to disambiguate commands, asking if he wanted to call someone on a cell or at home.

Today, with systems like the Ford Sync, drivers can get everything from traffic reports to horoscopes and sports scores in the car. “I wouldn’t use it for a horoscope,” Cummings joked, saying it’s not an area he’s interested in. “But I use it for sports.” And because information is presented in real time, users can navigate around traffic, detours, or accidents.

Ultimately, Cummings said Ford has reported that Sync-enabled vehicles are selling at a much higher rate than those without the technology. “Ford doesn’t just differentiate on the engine and transmission, four tires, and brakes. They differentiate with what they can do in the car, and we’re excited to partner with them. We think it’s interesting that voice is really the front end,” Cummings said.
