
Speech Technology Without Speech Recognition

Perhaps the greatest benefit for speech technology isn't in the customer-service sector, but in helping individuals with disabilities. Vocal Joystick, a software engine developed by faculty and graduate students at the University of Washington, has already been adopted to help individuals without the use of their hands or arms browse the Web, control a screen, or play a video game. Earlier today, the University of Washington announced that it has adapted the system to interface with a robotic arm, an aspect of Vocal Joystick it started working on this year.

The idea of a robotic arm for people who can't manipulate objects around the house isn't new. Integrating it with voice, however, is, and the benefits are obvious. For instance, one way of controlling a robotic arm, according to Jeffrey Bilmes, a UW associate professor of electrical engineering, is through a brain-computer interface. "But the issue with brain-computer interfaces is that they're invasive," he says. "You have to implant some electrodes or sensors in the body. Or if they're not invasive, they're noisy and difficult to control." The Vocal Joystick engine uses the precision of voice to give its users greater control.

As it currently stands, the UW's robotic arm is a research prototype, designed and built solely by UW faculty and students. The ultimate goal is to make the source code available for download. However, there are still kinks developers want to iron out, such as minimizing the delay between a spoken command and the system's response. So far, researchers have knocked it down to 60 milliseconds, but given the practical applications for the Vocal Joystick engine, response time needs to be near-immediate.

"Depending on the speech recognition application, you’re tolerant of delay," Bilmes says. "Think about voice mail transcription. That’s something that can take a long time. It doesn’t need to run in real time. But here you’ve got something that needs to happen immediately. What we want is whenever you make a subtle change to your voice, it has to change. There’s no room for delay. The algorithms we’ve developed are such that they’re very accurate, but also low latency. So it’s put a lot of constraints on our design specs."

The Vocal Joystick engine is unique in that it doesn't rely on speech recognition. Rather, it's controlled through vowel sounds, not words. So instead of uttering phrases like "scroll down" or "click," users simply make sounds to dictate action.
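As a rough illustration of that idea only, a controller can translate a classified vowel into a direction and let loudness set the speed. The specific vowels, directions, and gain below are assumptions made for the example, not Vocal Joystick's actual mapping.

    # Sketch of vowel-to-cursor control: each recognized vowel maps to a
    # direction, and loudness scales the speed. The vowel labels, directions,
    # and MAX_SPEED are illustrative assumptions.
    VOWEL_DIRECTIONS = {
        "aa": (0, 1),    # hypothetical: "aa" moves the cursor up
        "uw": (0, -1),   # "oo" moves it down
        "iy": (-1, 0),   # "ee" moves it left
        "ae": (1, 0),    # "ae" moves it right
    }

    MAX_SPEED = 400.0    # pixels per second at full loudness (assumed)

    def cursor_velocity(vowel, loudness):
        """Map a classified vowel plus a loudness value in [0, 1] to an
        (x, y) cursor velocity; unrecognized sounds leave the cursor still."""
        dx, dy = VOWEL_DIRECTIONS.get(vowel, (0, 0))
        speed = MAX_SPEED * max(0.0, min(1.0, loudness))
        return (dx * speed, dy * speed)

    # Example: a moderately loud "ee" drifts the cursor left at half speed.
    print(cursor_velocity("iy", 0.5))   # -> (-200.0, 0.0)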

Ultimately, Bilmes thinks the Vocal Joystick system will work in conjunction with a conventional speech recognition engine. If, for instance, a user browses the Web with an application powered by Vocal Joystick, he can use vowel sounds to manipulate the cursor, then switch to a verbal interface to fill out a text field.
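A hedged sketch of how such a hybrid might be wired together follows; the mode names, trigger event, and handlers are invented for illustration and are not part of Vocal Joystick or any particular speech recognizer.

    # Illustrative mode switch between continuous vowel control and word-level
    # dictation. The "switch" trigger and both handlers are stand-ins.
    class HybridVoiceUI:
        def __init__(self):
            self.mode = "cursor"   # start in vowel-driven cursor mode

        def handle(self, event):
            if event == "switch":  # assumed discrete sound that flips modes
                self.mode = "dictation" if self.mode == "cursor" else "cursor"
                return f"mode -> {self.mode}"
            if self.mode == "cursor":
                return f"move cursor using vowel '{event}'"
            return f"type text: '{event}'"   # hand the event to a recognizer

    ui = HybridVoiceUI()
    print(ui.handle("iy"))        # vowel steers the cursor
    print(ui.handle("switch"))    # discrete sound flips to dictation
    print(ui.handle("hello"))     # recognized word fills a text field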

"There are many things the voice can do," Bilmes says. "While I’m happy with what we’ve done, I think we can do a lot more. This is open-ended research."
 
