Q&A: Designing Natural Language Home Assistants Using Amazon Alexa


David Attwater has over 17 years of experience in speech user interface design and testing. He has a specific focus on customer usability testing and is an acknowledged expert in the design and testing of natural language solutions for customer service, including digital assistants. Attwater will present the SpeechTEK University course, “Designing Natural Language Home Assistants Using Amazon Alexa,” on Sunday, April 8, at the SpeechTEK 2018 Conference. SpeechTEK co-chair James Larson interviewed Attwater in advance of his conference session.

Q: How are home automation apps such as Alexa and Google Home different from other speech applications such as telephone automation?

A: Home automation devices and telephone automation are technically very similar. However, there are some differences. Home automation apps use cloud-based speech recognition with a high-quality microphone array. Phone-based speech recognition is limited by the bandwidth of the telephone system and by phone microphones that can be of poor quality. As a result, speech recognition on home devices is much more accurate than over a telephone network. Set against this, however, home automation products use far-field microphones (they listen from across the room) and are often competing against background noise. The key vendors have done a great job keeping them robust and accurate in such a challenging environment.

Users of Amazon Echo (“Alexa”) and Google Home (“Google Assistant”) are in a completely different setting to users of telephone automation. This affects how motivated they are to explore the user interface and how forgiving they may be of errors. Users of telephone automation are generally placing a call with the hope or expectation of speaking to a person. They are often unwilling to engage with a machine and may already be annoyed about something that has happened. This context is sometimes termed “victim automation.” Users of Amazon Echo and Google Home, by contrast, have so far willingly chosen to engage the automated persona and are motivated to modify their behavior to accommodate the machine. This context is sometimes termed “volunteer automation.”

Having said this, if these devices ever became the mechanism by which users started their journey toward customer support, this context might change.

Q: What is necessary to build an Alexa speech experience?

A: Very similar skill-sets are required to design the user interface for a home automation solution or an automated telephone system.

The simplest home automation dialogs are one-shot (i.e. the user says something and then the machine responds and the interaction is over). Most interactions require more than one turn and may take a number of paths. An experienced conversational interaction designer will be required for all but the simplest solutions.
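The difference between a one-shot and a multi-turn dialog can be sketched in a few lines of Python. This is a minimal illustration only, not the Amazon or Google API: the `Dialog` class, the slot names, and the prompts are all hypothetical, and the train-times scenario is borrowed from the example later in this interview. The sketch assumes that speech recognition has already turned each utterance into a dictionary of slot values.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical slots for a train-times request; the names are illustrative only.
REQUIRED_SLOTS = ("origin", "destination")

PROMPTS = {
    "origin": "Where are you travelling from?",
    "destination": "Where are you travelling to?",
}


@dataclass
class Dialog:
    """Keeps what the user has said so far across turns."""
    slots: Dict[str, str] = field(default_factory=dict)

    def next_turn(self, new_slots: Dict[str, str]) -> Tuple[str, bool]:
        """Merge the latest slot values and return (response, finished)."""
        self.slots.update(new_slots)
        for name in REQUIRED_SLOTS:
            if name not in self.slots:
                # Something is still missing: ask for it and keep the session open.
                return PROMPTS[name], False
        answer = (
            f"Looking up trains from {self.slots['origin']} "
            f"to {self.slots['destination']}."
        )
        return answer, True


# One-shot: the user supplies everything in a single utterance.
one_shot = Dialog()
print(one_shot.next_turn({"origin": "Leeds", "destination": "York"}))

# Multi-turn: the machine has to ask a follow-up question.
multi = Dialog()
print(multi.next_turn({"destination": "York"}))
print(multi.next_turn({"origin": "Leeds"}))
```

Even this toy version has to decide what to ask next and when the conversation is over. Real dialogs add error recovery, confirmation, and interruptions on top, which is where design experience starts to matter.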

The tools that are supplied by Amazon and Google do make some aspects of the development very easy. For example, it is fairly easy for a novice to build an entry-level interaction, and the ability to build the speech recognition models (grammars) using example sentences is very powerful. The Google design environment also provides some modest support for building multi-step natural dialogs. Both environments still require good design and coding skills as soon as you try to do anything a little more challenging, and building stable, engaging dialogs with more than one question/answer step still requires a skilled designer.
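To give a feel for what “building a model from example sentences” means, here is a deliberately simple Python sketch. It is not how the Amazon or Google tools work internally; those train statistical models from the samples. This toy version only measures word overlap between an utterance and each sample. The intent names, sample sentences, `classify` function, and threshold are all invented for illustration.

```python
import re

# Hypothetical intents and sample utterances, invented for illustration.
EXAMPLES = {
    "GetTrainTimes": [
        "when is the next train to york",
        "what time does the train leave",
    ],
    "SetTimer": [
        "set a timer for ten minutes",
        "start a timer",
    ],
}


def tokens(text: str) -> set:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify(utterance: str, threshold: float = 0.3):
    """Return the intent whose best sample overlaps most with the utterance.

    Overlap is the Jaccard score (shared words / all words). Returns None
    when no sample reaches the threshold.
    """
    words = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, samples in EXAMPLES.items():
        for sample in samples:
            sample_words = tokens(sample)
            union = words | sample_words
            score = len(words & sample_words) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


print(classify("When is the next train to Leeds?"))
print(classify("Play some jazz"))
```

The designer's job is the same in the toy and in the real tools: choose sample sentences that cover how people actually phrase a request, and decide what should happen when nothing matches.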

Q: How much artificial intelligence is there actually in the Amazon (and other) home automation systems?

A: Artificial intelligence is a very hot topic at the moment but often ill-defined. The leading home automation products are undoubtedly examples of artificial intelligence but a huge amount of that “intelligence” is still designed and coded by a person. We still have to “fake it” through good design and careful analysis of user needs and expectations.

There are moments when both Amazon Echo and Google Home offer glimpses of where machine intelligence may be heading. For example, Google Home can recognize you by your voice, knows what your name is, and what your relationships are with other family members. Try for example asking it “Hey Google, what’s my name?” There is, however, a huge gap between these small glimpses of intelligence and true artificial intelligence. Alexa may whimsically refuse to “open the pod bay doors” but she really doesn’t have any idea what pod bay doors actually do or why refusing to open them might be chilling or amusing.

There is little visible evidence of sustained learning in the current offerings. A truly intelligent machine would learn from your interactions with it. For the most part, Amazon Echo and Google Home behave like serial amnesiacs. Each interaction happens as if the past had never existed.

Having said this, speech recognition is definitely making big strides forward by learning from past interactions. The underlying speech models are almost certainly being adapted and improved by learning from the vast amounts of data flowing into the host companies every day. You only have to witness someone successfully calling to a home automation product from an adjacent noisy room to recognize that the industry has worked very hard at improving the accuracy of speech recognition.

The underlying tool sets are beginning to open up machine learning capabilities to the developer communities but, as a speech interaction designer, I am still waiting for a fundamental breakthrough in artificial intelligence that will help manage the multi-step decisions in a conversation between a person and a machine.

Q: What problems and difficulties could appear when developing an Alexa speech experience?

A: Amazon Echo speech interactions are vulnerable to the same practical limitations as other speech solutions. Spoken interactions are still hand-designed and hand-crafted by user interface designers. At the top of the list of problems and difficulties would be wrong dialog decisions due to recognition errors; user confusion about what they can say or do at any given moment; and turn-taking difficulties caused by a lack of any intelligent management of interruptions.

We are also seeing a new problem – difficulty in managing the narrative voice between different “skills” (third party applications). As users interact with certain “skills” it is not always easy for them to know which “entity” they are talking to at any given moment. For example, a user may be discussing train times with Alexa and experience a failure due to background noise or distraction. This will cause them to leave the “skill” without realizing they have done so. They may then continue to talk with Alexa about train times and discover that she has mysteriously forgotten everything she knew about trains and has no idea what they are asking her. This is a specific example of the general problem of discoverability of the user interface.

Amazon and Google may need to think harder about whether the devices are portals into multiple “agents” who have no knowledge of each other, or a single persona managing access to various information services. Both have opted for the latter model, but the host persona does not keep track of these interactions or distinguish herself (or himself) from the agents she is mediating between. As a result, users are not always sure “who” they are talking to.
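One way to picture a host persona that does keep track is the following Python sketch. It is purely hypothetical and does not describe how Alexa or Google Assistant behave today: the `Host` class, its method names, and the wording of the prompts are assumptions made for illustration. The point is simply that the host remembers which skill is active and tells the user when a skill has closed.

```python
class Host:
    """Hypothetical host persona that always knows which skill is active."""

    def __init__(self, name: str = "Host"):
        self.name = name
        self.active_skill = None  # None means the user is talking to the host

    def open_skill(self, skill: str) -> str:
        """Record the hand-over and announce it to the user."""
        self.active_skill = skill
        return f"{self.name}: Handing you over to {skill}."

    def close_skill(self, reason: str) -> str:
        """Return control to the host and say so explicitly."""
        skill, self.active_skill = self.active_skill, None
        if skill is None:
            return f"{self.name}: You are already talking to me."
        return (
            f"{self.name}: {skill} has closed ({reason}). "
            "You are talking to me again."
        )

    def who_am_i_talking_to(self) -> str:
        """Name the entity that will handle the next utterance."""
        return self.active_skill or self.name


host = Host("Assistant")
print(host.open_skill("Train Times"))
print(host.who_am_i_talking_to())
print(host.close_skill("I did not catch that"))
print(host.who_am_i_talking_to())
```

In the train-times example above, an explicit hand-back like this would let the user know that the next question about trains needs to start by reopening the skill.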

Q: Where might these tools be extended and enhanced in the next few years?

A: There are many areas in which the tools could be extended in the coming years. Here are a few suggestions:

  • Shared re-usable user interface devices – examples might include objects that handle list navigation, calendar entries, address capture etc. The current toolsets provide only modest reusable grammar entities. We would anticipate that a winning toolset would provide mechanisms for developers and designers to share fragments of the user interface to weave together into their final solutions. The need for such devices will become even more pressing as multi-modal devices such as Echo Show become more prevalent.
  • Enhanced support for shared information, language models, and episodic memory. As a community, we really need to tackle the problem of how intelligent assistants relate to each other and share models of the user and the world. A human assistant would know you as a person, keep track of your interactions and transactions, and understand how they thread together. They would also help manage issues of trust and identity on your behalf. For a machine assistant to do this will require steps forward in how intelligent agents share information whilst maintaining user privacy and confidentiality.
  • Enhanced support for natural language. The current tools do not come with any underlying knowledge of language, meaning, or inter-relationships between words and concepts. Each app designer currently has to develop their own complete dialog and meaning models. For example, we may start to see tool support for rich reusable intents and concepts including an innate understanding of synonyms and management of anaphora (reference back to things mentioned previously) etc.
  • Incorporation of machine learning. Tools that allow apps to learn from previous experiences and incorporate learning from related data sources to help with decisions. For example, enabling the assistant to learn more about the user by learning from their emails, or learning from the way users interact and allowing mixed initiative between skills based on prior knowledge of how users interact with their home automation device.

In addition to more advanced tools, we might also expect to see the emergence of new cultures surrounding the way that people interact with talking machines. It is a misunderstanding to think that people want to interact with machines in the same way that they interact with people. On the contrary, people clearly understand the distinction and modify their behavior heavily when talking with a machine. Professor Roger Moore from Sheffield University in the UK poses the very interesting question of whether there is such a thing as “half a language.” Consider the way that people speak with pets or children. They use normal language but adjust their vocabulary and tone of voice to what they believe the child or animal can understand. It is an excellent question to ask and we should consider whether a lingua-machine may emerge for similar reasons. If it does, what will it look like?
