Video: Benchmarking Voice Assistants, Pt. 1: Method and Calibration

Learn more about intelligent assistants at the next SpeechTEK conference.

Read the complete transcript of this clip:

Kathleen Walch: So about a year ago, like Ron said, we introduced our first voice assistant benchmark. We benchmarked four different platforms. We did Alexa, Siri, Google Home, and Cortana. We wanted to make this very transparent, so we published all the results online as well as recorded with video all of the questions and answers so that in case there were any questions from any of the vendors or people who wanted to recreate these, they could go on and check it out. And actually, I think we'll get into one.

Ronald Schmelzer: That's right.

Kathleen Walch: There was a question, because Cortana actually provided a very interesting answer to a pretty basic question.

Ronald Schmelzer: Right, I think we got that for you. And so basically we had this series of questions that we asked: a bank of 10 categories of questions with 10 questions in each category, plus a set of 10 calibration questions just to make sure that we got our test right. And these questions got increasingly harder and harder, and you'll kind of see what they are, and some of them actually are even difficult for humans in some ways, but these are the kinds of things that you'd want these voice assistants to be able to do. And these categories, I will show each of these categories here, and the responses, we basically graded the response. So a Category 0 response was when a voice assistant came back and said, "I don't understand your question," or "I can't help you," or "Please rephrase that," or something like that. A Category 1 response is when it came back with a completely unexpected answer, so basically they just got it wrong. And then Category 2 was basically the word salad. It basically goes, "Well, here's a really, really, really long answer, so hopefully you, human, can figure out what the right answer is among this." And of course, Category 3 is just straight-up got it correct, and as Kathleen mentioned we used a voice to prevent any issues of

Kathleen Walch: Accents

Ronald Schmelzer: Scottish accents

Kathleen Walch: Yeah, we had a few of our interns do it, and not all of them were native English speakers, so we said, "You know what, we're gonna take that language barrier out of the equation and keep it consistent."
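The four-level grading scale described above could be recorded along these lines. This is only an illustrative sketch in Python: the category names, the tally helper, and the sample grades are assumptions made up for the example, not the speakers' actual tooling or results.

```python
from collections import Counter
from enum import IntEnum


class ResponseCategory(IntEnum):
    """Grading scale as described in the talk; the member names are illustrative."""
    NO_ANSWER = 0    # "I don't understand your question" / "Please rephrase that"
    WRONG = 1        # a completely unexpected answer
    WORD_SALAD = 2   # a very long answer the listener has to dig through
    CORRECT = 3      # straight-up correct


def tally(grades):
    """Count how many responses fell into each category."""
    counts = Counter(grades)
    return {category: counts.get(category, 0) for category in ResponseCategory}


# Hypothetical grades for one assistant on one 10-question category.
example_grades = [ResponseCategory.CORRECT] * 6 + [
    ResponseCategory.WORD_SALAD,
    ResponseCategory.WRONG,
    ResponseCategory.NO_ANSWER,
    ResponseCategory.NO_ANSWER,
]

summary = tally(example_grades)
for category, count in summary.items():
    print(f"Category {category.value} ({category.name}): {count}")
```

Keeping the grade as an integer from 0 to 3 matches the numbering used in the talk, so per-category counts can be compared directly across assistants.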

Ronald Schmelzer: Right, so this is sort of calibration, we figured everybody should get these wrong. We basically asked it a bunch of questions that every assistant should get right, and we also asked it a bunch of questions we thought every assistant should get wrong, just so that we make sure our test was calibrated. And so here is Cortana answering the question, "What is 10 plus 10?"

[Voice Interviewer] Alexa, what is 10 + 10?

[Alexa] 10 + 10 = 20.

Ronald Schmelzer: That's the Show device if you haven't seen that one, or is it the Spot?

[Voice Interviewer] Hey Cortana, what is 10 + 10?

[Cortana] According to Wikipedia.org, 10:10 is a charity that enables people to take practical action on climate change

Ronald Schmelzer: That doesn't sound right.

[Cortana] And combines these local actions to inspire a more ambitious approach to the issue at every level of society.

[Voice Interviewer] Hey Cortana, what is 10 + 10?

[Cortana] 10 + 10 = 20.

Ronald Schmelzer: All right

Kathleen Walch: So we used a different voice, and it was able to pick it up on that voice.

[Voice Interviewer] Hey Cortana,

Kathleen Walch: But we shouldn't have had to do that.

[Voice Interviewer] What is 10 + 10?

Ronald Schmelzer: You'll see it fails on that one too, so it fails on half of the Polly voices, which are supposed to be standard. So what does that tell, for AI, machine learning, data science people in the audience? What does that tell you? That tells you that they trained their speech recognition model on a much more limited set of voices than the Alexa. We also tested Google Home and Siri, and they both did not fail here, so we sent this to, we tweeted this, and what happened when we tweeted this?

Kathleen Walch: Yeah, so we sent this to Microsoft, and of course they got back to us and said, "We don't believe you." So, that's why we had it on video, and we sent it to them and they said that their engineers were not recreating the same answers that we got, but we clearly had it on video.

Ronald Schmelzer: That's why we have it on video.
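The calibration check discussed in this clip, asking the same easy question in several synthesized voices and seeing which ones an assistant misses, might be sketched as follows. The function name, the assistant and voice labels, and the answers are all made-up placeholders; none of this data comes from the benchmark itself.

```python
def calibration_failures(results, expected):
    """Return, per assistant, the voices whose answer did not match the expected one.

    results maps assistant name -> {voice name -> transcribed answer}.
    Assistants that answered correctly for every voice are omitted.
    """
    failures = {}
    for assistant, by_voice in results.items():
        missed = sorted(voice for voice, answer in by_voice.items() if answer != expected)
        if missed:
            failures[assistant] = missed
    return failures


# Made-up answers to the calibration question "What is 10 + 10?"
results = {
    "assistant_x": {"voice_a": "20", "voice_b": "20"},
    "assistant_y": {"voice_a": "20", "voice_b": "a charity"},
}

failures = calibration_failures(results, expected="20")
print(failures)
```

A question that every assistant is expected to get right makes a per-voice failure stand out as a speech recognition problem rather than a knowledge problem.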

Video: Benchmarking Voice Assistants, Pt. 8: Conclusions

Video: Benchmarking Voice Assistants, Pt. 7: Idiomatic Speech

Video: Benchmarking Voice Assistants, Pt. 6: Emotional IQ & Common Sense

Video: Benchmarking Voice Assistants, Pt. 5: Reasoning & Logic

Video: Benchmarking Voice Assistants, Pt. 4: Understanding Cause & Effect

Video: Benchmarking Voice Assistants, Pt. 3: Comparison Questions

Video: Benchmarking Voice Assistants, Pt. 2: Basic Concepts
