The Assistant Paradigm Is Hindering the Use of Voice User Interfaces
The main use case for voice user interfaces, at least measured by the number of monthly active users, is voice assistants such as Alexa, Google Assistant, and Siri. These assistants help users complete simple tasks, such as setting alarms, playing media, or answering questions, hands-free and with the most intuitive user interface of all: voice.
However, after years of hype around voice assistants, only a small number of people use them for anything more complex. David Pierce writes in an article on voice assistants in The Wall Street Journal: "I can't imagine booking a flight or managing my budget through a voice assistant or tracking my diet by shouting ingredients at my speaker." Why is that? Booking a flight by voice should be fairly simple; after all, before computers, every flight was booked by talking with a travel agent.
Lack of Feedback
The main reason assistants fail at more complex tasks is the lack of feedback. While the user is speaking, the assistant is completely passive; while the assistant is giving output, it cannot take input. This makes the experience strictly sequential and turn-based.
This is very unlike a human-to-human conversation, where the parties give non-verbal cues, such as gestures and facial expressions, to signal their understanding or lack of it. Sometimes these signals are verbal: a short "a-ha" or "eh?"
Because voice assistants lack these signals, the experience consists of giving voice input and then waiting, hoping that the assistant replies in the expected way.
If the assistant misunderstands, the user must start over from scratch. Let's consider the flight-booking task again in the assistant paradigm. The user says something like "I'd like to get two tickets from Berlin to New York in business class." The assistant answers "There are two flights available from Beirut to New York. Option one departs at 7:52 a.m. and option two at 2:33 p.m. Which one would you like to book?" and starts waiting for user input. Because of the mistake, the user needs to reset the conversation and begin again. Most critically, the assistant has already wasted several seconds speaking entirely incorrect and irrelevant information.
In a human-to-human conversation, the salesperson would reply with something like "So Beirut to New York, let's see," the customer could immediately correct them with "Sorry, I mean from Berlin to New York," and the conversation would continue naturally.
The big difference between the experiences is the feedback loop. A human conversation has a quick, natural feedback loop and mishearings or misunderstandings are easy to correct. How could this be replicated in a voice user interface?
A Better Alternative
The key to a better user experience with voice user interfaces is to get rid of natural language responses and replace them with real-time visual feedback. When users give input to a computer by touch, mouse, or keyboard, they see in real time how that input affects the graphical user interface. The same should happen when using voice input.
This real-time visual feedback encourages users to go on with more complex utterances and lets them correct themselves naturally when an error occurs. Voice platforms compete on the accuracy of their speech recognition software, but the key to a good user experience is not perfect accuracy but easy correction. Think of keyboards: we make typos all the time, yet nobody concludes that keyboard technology isn't mature enough for real use.
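To make this concrete, here is a minimal sketch of the pattern, assuming a streaming speech recognizer that emits growing partial transcripts (simulated below as plain strings). The city list, slot names, and regular expression are hypothetical stand-ins for a real recognizer and intent parser; the point is only that each partial result updates the on-screen form immediately, so a later correction simply overwrites the earlier value instead of restarting the dialogue.

```python
import re

# Hypothetical vocabulary and slot pattern for a flight-booking form.
CITIES = ["Berlin", "Beirut", "New York"]
ROUTE = re.compile(r"from (%s) to (%s)" % ("|".join(CITIES), "|".join(CITIES)))

def update_form(form, partial_transcript):
    """Apply the latest partial transcript to the form state.

    Later matches win, so a correction like "sorry, I mean from Berlin
    to New York" overwrites the misrecognized origin on screen the
    moment it is spoken.
    """
    for origin, destination in ROUTE.findall(partial_transcript):
        form["origin"] = origin
        form["destination"] = destination
    return form

# Simulated stream of partial recognition results for one utterance.
partials = [
    "two tickets from Beirut",  # early misrecognition of "Berlin"
    "two tickets from Beirut to New York",
    "two tickets from Beirut to New York sorry I mean from Berlin to New York",
]

form = {"origin": None, "destination": None}
for p in partials:
    update_form(form, p)
    print(form)  # stand-in for re-rendering the GUI on every partial
```

Because the form re-renders on every partial result, the user sees "Beirut" appear, notices the error while still speaking, and fixes it in the same breath; no turn is wasted on an incorrect spoken reply.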
While voice assistants have brought voice user interfaces into the mainstream, it's time to ditch the conversational assistant paradigm and start treating voice as another modality alongside touch and vision. Voice should not be thought of as a replacement for our current user interfaces but as a complement that enhances them.
When voice is used as a complementary modality, the graphical user interface gives people cues about what they can do, eliminating the skill-discovery problem of voice assistants. The graphical user interface reacts to user input in real time, fixing the lack of feedback. This enables more complex user tasks and can finally turn voice into what it can be: the most natural and efficient input modality.