Stanford Model Teaches Turn-Taking to Virtual Assistants
During normal human-to-human interactions, when one person stops to think, the other can usually interpret the silence as a need for more thinking time. That’s because humans recognize this thinking pause, along with changes in vocal patterns such as pitch or intonation, visual cues, and other signals, and use them to know when to keep talking or when to stop.
Most virtual assistants, though, interpret such moments of silence as a signal that the person is finished talking and it is now their turn.
With the goal of creating a more natural conversational flow, a team of researchers at Stanford University replaced the traditional classification approach with a continuous one and incorporated prosodic features from voice input. In doing so, they created models that take turns more like humans do when conversing in real life.
The new models ask “In how many seconds can I speak?” as opposed to “Can I speak in the next 300 milliseconds?” This continuous approach ultimately predicts more natural points at which voice assistants can initiate speech, allowing for more humanlike conversation.
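To make the difference between the two questions concrete, here is a minimal Python sketch of how a training label could be derived under each framing. It is an illustration only, not the researchers' code: the function names, the frame times, and the 2.0-second initiation point are hypothetical, and only the 300-millisecond window comes from the description above.

```python
# Hypothetical illustration of the two ways to frame turn-taking.
# Only the 300 ms window is taken from the article; everything else is assumed.

WINDOW_S = 0.3  # "Can I speak in the next 300 milliseconds?"


def classification_label(now_s: float, initiation_s: float) -> bool:
    """Yes/no framing: does the next speaking point fall inside the window?"""
    return 0.0 <= initiation_s - now_s <= WINDOW_S


def continuous_label(now_s: float, initiation_s: float) -> float:
    """Continuous framing: how many seconds until the assistant can speak?"""
    return max(0.0, initiation_s - now_s)


# Assumed example: the assistant may start speaking 2.0 s into the audio.
initiation_s = 2.0
for now_s in [0.0, 1.0, 1.5, 1.8]:
    print(
        f"t={now_s:.1f}s",
        "can speak soon:", classification_label(now_s, initiation_s),
        "| seconds to wait:", round(continuous_label(now_s, initiation_s), 2),
    )
```

Under the yes/no framing, every moment more than 300 milliseconds before the speaking point looks identical. The continuous label still distinguishes "one second away" from "two seconds away," which is the kind of information a system would need to prepare a response ahead of time.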
Part of the problem is that most existing systems first convert the speech they hear into text, which is then processed by a dialogue agent that retrieves or generates a text response. This text response is then converted to speech for output. Because the processing involves only text, the nuances of a verbal conversation are lost.
The new model developed at Stanford continuously analyzes the voice input and uses intonation changes to predict, before the moment of silence arrives, whether the person speaking is mid-sentence. With this information, the system can signal the dialogue system to prepare a response in advance and reduce the gap between turns.
The researchers used a combination of GPT-2, wav2vec, and Gaussian Mixture Models to predict multiple future points at which the digital agent could initiate speech. Utterance duration, not content, was the main lever they considered in this research. Going forward, they anticipate also incorporating social guidelines for politeness, empathy, and other factors into the training process.
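The article does not describe how the Gaussian mixture is used in detail, so the following Python sketch shows only the general idea of a mixture over "seconds until I can speak," where each component represents one candidate point. The component values, the `min_weight` threshold, and all names are assumptions made for illustration, not parameters from the Stanford models.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """One Gaussian in a mixture over time-until-speech (hypothetical values)."""
    weight: float  # mixture weight; weights are assumed to sum to 1
    mean_s: float  # most likely wait, in seconds, for this candidate point
    std_s: float   # uncertainty around that wait, in seconds


def mixture_pdf(t_s: float, components: list[Component]) -> float:
    """Density of the mixture at a wait of t_s seconds."""
    return sum(
        c.weight
        * math.exp(-0.5 * ((t_s - c.mean_s) / c.std_s) ** 2)
        / (c.std_s * math.sqrt(2.0 * math.pi))
        for c in components
    )


def candidate_points(components: list[Component], min_weight: float = 0.1) -> list[float]:
    """Means of the components that carry enough weight, ordered by time."""
    return sorted(c.mean_s for c in components if c.weight >= min_weight)


# Assumed model output for one moment in the audio: two plausible points
# to start speaking and one unlikely one.
predicted = [
    Component(weight=0.60, mean_s=0.4, std_s=0.1),
    Component(weight=0.35, mean_s=1.5, std_s=0.2),
    Component(weight=0.05, mean_s=3.0, std_s=0.5),
]

print(candidate_points(predicted))
print(mixture_pdf(0.4, predicted) > mixture_pdf(1.0, predicted))
```

A single Gaussian could express only one expected wait. A mixture lets the prediction keep several plausible speaking points at once, which matches the article's description of predicting multiple future points.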
Ashwin Paranjape, one of the Stanford researchers and a recent Ph.D. graduate, says future voice assistants will “not be a straight text-to-speech and automatic speech recognition system with pause detection. The hope is that the next phase is more seamless and will take the nuances of voice into account rather than just converting to text.”