Does Your Intelligent Assistant Really Understand You?
Intelligent assistants are ubiquitous, whether in smart speakers, mobile apps, or chatbots. One of their most heavily promoted features is the ability to understand natural language and, ultimately, to give users the same kind of experience they would get from talking with a person. In fact, you often hear that talking to them is just like talking to a person.
Actually, aside from the use of speech, talking to an intelligent assistant is not really like talking to a person, or at least not an adult. It's more like talking to a small child, as you grope around for words they know and simplify your language to match their ability to understand. And even most children understand natural language better than an intelligent assistant does. It's not that we can't talk this way, but it's tiring, and it doesn't need to be that way.
As natural language understanding technologists with many years of experience, we wondered if we could quantify the actual natural language capabilities of these assistants by testing them with some questions.
We especially wanted to ask questions that probe for real understanding and not just the ability to look up questions and get the answers from a preprogrammed list.
Sometimes you'll see examples of assistants making snappy comebacks to smart-alecky questions (for examples, see "Siri answers 60+ funny questions" on freemake.com). These are fun, but they don't demonstrate any real ability to understand language. You can easily confirm this by asking a gotcha question with a slight rephrasing. For example, if you ask Siri "What does Siri mean?", it will respond with a cute answer: "Siri is just the name that they gave me when I got the job." But ask the question in a slightly different way, "Does Siri have any particular meaning?", and it will respond "OK, I found this on the web for does Siri have any particular meaning. Check it out." The snappy answers that work have been individually programmed ahead of time; the assistants are simply matching what users say against previously stored lists. When it comes across a slightly different question, the virtual assistant falls apart. It demonstrates not much more real understanding of your questions than a Magic 8 Ball.
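The brittleness of this stored-list approach is easy to illustrate. Below is a minimal sketch, not the implementation of any real assistant: the canned question/answer pairs and the fallback message are made up for illustration, but the behavior mirrors what we observed, where an exact match gets a cute reply and any rephrasing falls through to a web search.

```python
# Illustrative sketch of exact-match canned responses (hypothetical data).
CANNED_RESPONSES = {
    "what does siri mean": "Siri is just the name that they gave me when I got the job.",
}

def answer(utterance: str) -> str:
    """Return a canned answer on an exact match, else fall back to a web search."""
    # Normalize lightly: lowercase and drop a trailing question mark.
    key = utterance.lower().strip("?").strip()
    if key in CANNED_RESPONSES:
        return CANNED_RESPONSES[key]
    # No match: the "assistant" gives up and offers a web search instead.
    return f"OK, I found this on the web for {utterance.strip('?')}. Check it out."

answer("What does Siri mean?")                    # canned reply
answer("Does Siri have any particular meaning?")  # falls through to web search
```

The exact-match lookup has no notion that the two phrasings ask the same thing, which is why a one-word change defeats it.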
To answer the question of how much real understanding is going on here, we put together a list of queries on different topics, with different levels of complexity, that would be appropriate to ask an intelligent assistant. We then posed them to five of the top readily accessible assistants.
We looked at five popular virtual assistant systems in our test. Because we want to focus on the general state of the art rather than on individual systems, the systems are anonymized.
Our test follows an earlier version done in 2016, which was reported at the 2016 Conversational Interaction Conference. Because of the earlier test, we can compare the changing performance of the five assistants over time on the same utterances. We prepared a set of 189 utterances, including 158 utterances from the earlier test. We included examples of basic everyday questions like "What will the weather be like tomorrow?" but we also included more complex sentences like "If I have milk on my shopping list, take it off," or "Why does cotton candy melt?"
Here are a few examples from our test, where we got some very strange responses to seemingly easy questions.
- Q: what is the nearest star besides the sun?
A: No exact results matching The Star were found. Here are several pubs instead sorted by distance excluding the Sun. [The Star and The Sun are both pubs]
- Q: Is it Tuesday or Wednesday?
A: According to wikipedia.org, Tuesday is the day between Monday and Wednesday…
- Q: Is it Tuesday or Wednesday? (asked of a different assistant)
A: No, Wednesday is not Tuesday. Tuesday is on July 21, 2020.
We first identified a set of seven everyday topics that reflect some of the most common uses of virtual assistants: arithmetic conversions, everyday knowledge, local business search, reminders, time and date, timers and alarms, and weather. We added an eighth category, unknowable, to see how good the assistants were at knowing the limits of their own abilities. The complete set of test questions used for this evaluation, along with our scoring definitions, is freely available at https://github.com/Virtual-Assistant-Tech. We encourage readers to look at our categories and examples and come up with their own versions. This website includes a wiki where you can post anything interesting you find.
The answers to our questions were not just right or wrong. Sometimes an answer was not really wrong, but it wasn't that great, either; these intermediate answers required some additional work on the user's part to arrive at the answer, so we put them in a Usable category. The final set of categories was Good, Usable, and Poor:
Good includes correct, complete answers, as well as reasonable non-answers, for example, a request for a clarification about the user's intent or a request for more information.
Usable describes responses where the answer can be found in or inferred from the system's response, or where the system admits defeat, and states that it cannot answer the question. We called admitting defeat Usable because at least the system knows that it can't provide the answer. Saying "I don't know" is much better than giving a wrong answer! The most common usable answer was simply pointing the user to a web search.
Poor includes wrong or partially wrong answers and cases where the system jumped to a conclusion about the question when it should have asked for clarification. An example of this would be not asking which Portland the user meant when answering a question about Portland, since Portland could refer to Portland, Oregon, or Portland, Maine.
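The three-way rubric above lends itself to a simple tabulation. Here is a minimal sketch of how responses labeled with these categories can be turned into the percentages reported later; the labeled responses in the example are made up for illustration, not our actual data.

```python
# Sketch of tallying rubric labels into percentages (example data only).
from collections import Counter

CATEGORIES = ("Good", "Usable", "Poor")

def tally(labels):
    """Return the percentage of responses falling in each rubric category."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: round(100 * counts[cat] / total, 1) for cat in CATEGORIES}

# Hypothetical labels for eight responses from one system.
labels = ["Good", "Good", "Usable", "Poor", "Good", "Usable", "Poor", "Good"]
tally(labels)  # {'Good': 50.0, 'Usable': 25.0, 'Poor': 25.0}
```

Tabulating per system, and separately for the basic-question subset, yields the figures discussed in the results.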
For this study, we were only interested in what the assistants could do once they heard the request. We were not testing the accuracy of speech recognition. We tested one system at a time, speaking each test sentence in turn.
We found some clear differences among the assistants, as shown in the chart below. System A and System D had the best performance: Their results were quite similar, and the difference between them was not statistically significant. Systems E and B were in the middle, and System C was significantly worse than the top three systems.
Overall, the performance leaves a lot to be desired, with only 46 percent of responses classified as good, averaged across systems.
Are our questions too hard? To check, we designated a class of questions as basic and tabulated the responses to them separately. These questions are as simple as they come: "What time is it?", "Where is the nearest pizza place?" and "Convert 5 pounds to kilograms."
Even some of the basic questions weren't handled very well. Across all five assistants, we found that only 63 percent of basic questions received a good response, 15 percent were usable, and 21 percent were poor.
A typical example of a basic question that most of the systems got wrong was "Is this Friday?", when asked on a Tuesday.
In one case we got the answer, "This Friday is Friday, July 17, 2020." Changing one word to ask "Is today Friday?" produces a perfectly natural response.
Changes since 2016
How has the state of the art progressed since 2016? The assistants have improved a little, but not much. In our testing, the average share of good responses increased only from 38.3 percent to 46.1 percent in four years, as the chart below shows. Some systems were better than others, but even the best 2020 system achieved an overall level of only 57 percent good responses. Clearly, there is still plenty of room for improvement.
What can we conclude from this investigation?
- There have been improvements in natural language processing since 2016, but there's still work to be done.
- Not all assistants improved. Developers of virtual assistants need to keep working, or they will be left behind.
- And there is a bigger question to ask: Is the current approach to development, in which more and more data is shoveled into a machine learning system, like building a ladder to the moon? Do we need fundamentally new techniques to expand system capabilities to handle harder questions?
We hope that the companies that make virtual assistants will try to address the question of how to handle these more complex utterances. It could be a big step toward systems that really are like talking to a person.
Deborah Dahl is principal at Conversational Technologies. Christy Doran is principal at Clockwork Language.