VUI Review Testing - Is It Part of Your Speech Best Practices?
One best practice that is far too often abbreviated or overlooked altogether is Voice User Interface (VUI) Review Testing (VRT).
What Is VRT?
VRT provides a holistic, experiential review of a speech application. The tester (typically a VUI designer) plays the role of a caller and tests the fully developed application by acting out pre-determined use-case scenarios. The objectives are to ensure prompts sound and flow naturally and the functionality conforms to likely caller expectations. The goal is to verify that user-experience quality is achieved before exposing the application to your client or external callers.
As Figure 1 suggests, VRT should be done in the testing phase, just after Dialog Traversal Testing (DTT) and prior to User Acceptance Testing (UAT) by the client and exposure to external callers during Recruited Usability Testing, pilot, or full deployments.
How It Differs From Other Testing
Dialog Traversal Testing
Some project methodologies try to achieve the goal and objectives of VRT through DTT, but these should be treated as separate activities.
The goal of DTT is to certify the application is fully functioning per the design specifications. The objectives are to ensure a bug-free application by verifying that every possible pathway and all prompts are exhaustively tested and can be navigated as designed.
In contrast, VRT tests a limited set of use-case scenarios that a real caller might experience, and evaluates the quality of the VUI based on those experiences. It provides ample, yet not exhaustive, coverage of the system. It does not touch on every prompt or path.
DTT is more mechanical and exhaustive; VRT is more holistic and experiential.
It is important that VRT follow after successful completion of DTT. VRT can only accurately evaluate user experience when the application is fully functional. The tester needs to put themselves in the mindset of a real caller by role-playing scenarios. If there are functional bugs still lurking, it breaks this illusion and the application cannot be evaluated holistically. So, if VRT identifies residual bugs, VRT should be halted and DTT should resume.
Peer Design Reviews
It is customary during the design phase to get feedback on early designs by having two people act out portions of the dialog. This is essentially a rudimentary Wizard of Oz test with one person playing the system and the other the role of a caller. This approach should also be included in the methodology because it too gives an early indication of caller experience and improves conversational flow.
However, it's not the same as interacting with a fully developed application that has all prompts recorded by the voice talent, recognition grammars, and backend integration (which may be based on test data). VRT gives a more realistic perspective because it's at this point the application "comes to life." For the first time, the VUI designer can interact with the actual application, which provides a unique opportunity to assess the design from a caller perspective.
VRT may seem similar to usability testing because the tester role-plays the part of a caller and attempts to pre-judge what callers' impressions might be in those scenarios.
However, VRT is not a substitute for usability testing. Usability is subjective by nature; only real callers can provide valid usability assessments. All applications inevitably undergo usability evaluation (post-deployment by real callers), so it's best if some type of testing with real users is intentionally included earlier within the project. Usability data can come from a variety of sources, such as focus groups, Wizard of Oz testing, recruited usability studies, whole-call recordings, recognition analysis, caller surveys and interviews, application logging, etc.
Why VRT Is Important
If VRT is similar to other types of speech application testing, why does it warrant its own line item on the project schedule? Why is it so important?
Because it makes subsequent usability feedback more informative.
VRT leverages the VUI designer's heuristic knowledge of what has or hasn't worked in the past, based on prior usability feedback. If VRT is skipped, there may be known issues still lurking that could have been caught through VRT. Since these issues have come up previously, they are likely significant enough to attract attention. If the application were exposed externally at this point, these glaring issues may eclipse other feedback users may have. In other words, if VRT is not done, you lose an opportunity for your users to tell you something you don't already know. This negates much of the benefit of subsequent usability testing, and can be costly.
What VRT Involves
VRT testing can begin once:
- DTT has verified that the developed application is "bug-free" and functioning as designed.
- Modifications from earlier testing have been implemented and tested.
- Application and design documentation are up-to-date.
- VRT test plan, environment, and test data are ready.
VRT is completed once:
- All use-case scenarios are passing (where "passing" means that, based on the tester's heuristic judgment, the application seems likely to meet callers' expectations for that scenario).
- Modifications identified through VRT have been implemented and re-tested.
- Application and design documentation have been brought in sync.
While DTT testers may be developers or QA testers, VRT testers are typically VUI designers, human factors specialists, and/or speech consultants. VRT requires knowledge of usability issues based on previous speech deployments and a trained ear on how the timing and delivery of prompts can affect usability and caller perceptions.
To get a realistic sense of a caller's initial impression, it's important to have a "fresh ear." If the tester was also the designer of that dialog or has tested that scenario repeatedly, it's more difficult to evaluate what users' initial impressions might be. So it can be useful to rotate use-cases amongst different testers.
An advantage of VRT is that its test plan leverages materials already created earlier. In the analysis phase, the VUI designer will have gathered requirements and observed live customer calls to get a sense of who callers are (Caller profiles) and what they're trying to do (Use-case scenarios) (see sidebar and Figure 2).
Caller profiles serve as input for use-case scenarios, are helpful in selecting the system's persona, and contribute to other VUI requirements. Caller profiles also help VRT testers get into the role of a typical caller when acting out scenarios.
A Caller Profile Typically Addresses:
- Who are the callers?
  - Caller demographics (note, this may not be the same as customer demographics)
  - Different caller types
  - Speaking styles
  - State of mind (distracted? upset?)
- What are they trying to do?
  - Tasks callers want to accomplish
  - Which have highest call volume? (apply the 80-20 Rule)
- When do they call?
  - After hours treatment
  - Seasonal usage peaks
- Where are they calling from?
  - Cell/speaker phones
  - Noisy environment
  - Hands-free (e.g., in vehicle)
- Why are they calling?
  - What are their expectations?
  - What's their goal?
  - Are callers motivated to use self-service?
- How will they get their job done?
  - What's their mental model of how the interaction should unfold?
  - What can be partially or fully automated and what may require human intervention?
  - What information or terminology do they know or have when they call?
  - Can they accomplish this task on the Web and is the information consistent?
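The sidebar's who/what/when/where/why/how questions can be captured in a lightweight structure so that profiles feed directly into scenario planning and role-play. The following sketch is purely illustrative; the field names and example values are hypothetical, not taken from any particular project:

```python
from dataclasses import dataclass, field

@dataclass
class CallerProfile:
    """Illustrative container for the sidebar's caller-profile questions."""
    caller_type: str                  # Who: e.g., "first-time residential customer"
    state_of_mind: str                # Who: distracted? upset?
    tasks: list = field(default_factory=list)  # What they're trying to accomplish
    call_times: str = "business hours"         # When they call
    environment: str = "quiet landline"        # Where they're calling from
    goal: str = ""                             # Why they're calling
    mental_model: str = ""                     # How they expect it to unfold

# Example: a profile a VRT tester might use to role-play a noisy-environment scenario
noisy_caller = CallerProfile(
    caller_type="repeat caller",
    state_of_mind="hurried",
    tasks=["check order status", "update shipping address"],
    environment="cell phone in a noisy airport",
    goal="confirm an order shipped",
)
```

A tester stepping into a scenario can read the profile back as a role-play brief, which helps maintain the "fresh ear" mindset the article recommends.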
Use-case scenarios ground the project requirements by tying them to specific situations the automated system is intended to address. They help the designer envision how the design will sound in a real-life, linear conversation as opposed to focusing at a conceptual, modular level (which is the sort of thinking call-flow diagrams and application coding tend to facilitate).
A typical project might define between five and 20 use-case scenarios. The majority should focus on tasks with the largest call volumes. Some scenarios should attempt more than one task per call, as this sheds light on transitions between modules (see Figure 2, use-cases 1 and 2). Some should elicit secondary pathways, such as a noisy environment triggering recognition error handling (use-case 2). This is important because error conditions often aren't sufficiently tested. From the caller's perspective, they're already frustrated. If the system responds strangely, it will further tarnish their perceptions and they'll be less likely to use the system again. At least one scenario should be an out-of-scope task not included in the application's functionality (use-case 3). There will always be customers who mistakenly call the wrong phone number or are calling about a topic that didn't have high enough call volume to justify automating. It's important to get a sense of the caller experience in these situations and determine if the caller can still reach the information they need (perhaps by transferring).
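The scenario mix described above, and the heuristic "passing" criterion from the exit conditions, can be sketched as a simple test-plan structure. The scenario names, IDs, and judgment mapping here are hypothetical placeholders:

```python
# Hypothetical VRT test plan: 5-20 scenarios, weighted toward high-volume
# tasks, including a multi-task call, an error-path call, and an
# out-of-scope call (mirroring use-cases 1-3 in Figure 2).
scenarios = [
    {"id": 1, "tasks": ["order status", "update address"], "condition": "quiet line"},
    {"id": 2, "tasks": ["order status"], "condition": "noisy cell phone"},   # error handling
    {"id": 3, "tasks": ["out-of-scope request"], "condition": "wrong number"},  # expect transfer
]

def passes(scenario, judgments):
    """'Passing' is heuristic: in the tester's judgment, does the
    application seem likely to meet caller expectations here?"""
    return judgments.get(scenario["id"], False)

# Tester's recorded judgments after role-playing each scenario
judgments = {1: True, 2: True, 3: False}  # scenario 3 needs a cleaner transfer

vrt_complete = all(passes(s, judgments) for s in scenarios)
```

Here `vrt_complete` stays `False` until every scenario passes, matching the article's exit criterion that all use-case scenarios must pass before VRT is done.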
The environment and test data for each use-case (as in Figure 2) need to be ready. The development team should "sanity test" to ensure all call flows are navigable before exposing the application to the tester. The environment ideally should mirror deployment to get an accurate sense of backend interactions and potential latency. Since VRT is often conducted by testers dialing in remotely (which can be helpful in simulating realistic calling environments), testers need to be informed if test data are changed or if the environment goes down. A process for tracking, reviewing, and prioritizing bugs and enhancements discovered through VRT should be established in advance. Good communication is key; otherwise, valuable time can be lost.
Typical Improvements Found Through VRT
Prior to VRT, DTT should have already caught:
- Missing, incorrect, or repeating prompts.
- Obvious quality issues in recorded audio or text-to-speech (TTS).
- Data retrieval, backend transaction, or telephony errors, including long latency.
- Speech/DTMF not properly implemented or barge-in incorrectly disabled.
- Inaccessible call flow due to missing test data or bugs.
- Inconsistencies in design documentation.
Having laid this groundwork, VRT can then focus on:
- Potential usability issues not obvious during earlier design reviews.
- Additional prompt quality issues:
- Recorded audio doesn't match surrounding prompts or context.
- Voice talent's delivery doesn't match persona.
- Prompts don't sound believable, natural, or conversational.
- Adding pauses between prompts where necessary:
- Brief pauses (200 milliseconds) can actually make long prompting seem shorter because they break concepts up into easily digestible parts for the listener.
- A longer pause (2.5 to 3 seconds) can create a pseudo recognition window, giving expert users a chance to barge in.
- Pauses can make the persona sound friendlier and more natural, less hyper and impatient.
- If backend interactions are extremely fast, additional pauses can make data retrieval or transactions sound more realistic. This gives the impression that the persona is actively working on the caller's behalf (e.g., "Let me check your order status. [one second] Ok, your order was shipped on…").
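In implementation terms, pauses like these are typically expressed with SSML's `<break>` element between prompt segments. The helper below is a minimal sketch, assuming prompts are assembled as SSML strings; the function name and segment text are illustrative, with timings taken from the article:

```python
def with_pauses(segments, pause_ms=200):
    """Join prompt segments with SSML <break> tags so long prompts
    arrive in digestible chunks. 200 ms suits concept breaks; ~1000 ms
    can make a backend lookup sound like real work being done."""
    br = f'<break time="{pause_ms}ms"/>'
    return br.join(segments)

# The article's order-status example, with a one-second "working" pause:
prompt = with_pauses(
    ["Let me check your order status.", "Ok, your order was shipped on..."],
    pause_ms=1000,
)

# A ~2500 ms break after a menu prompt can serve as a pseudo recognition
# window, giving expert users a moment to barge in:
menu = "Say orders, billing, or agent." + '<break time="2500ms"/>'
```

How a platform honors `<break>` during an open recognition window varies by vendor, so the pseudo-recognition-window effect should itself be verified during VRT.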
Level of Effort
The time allotted for VRT depends on the size of the application. As a rule of thumb, estimate about one hour per dialog state. So for 40 to 50 dialog states, budget about one person-week. This includes not simply running through each use-case, but also time to review, re-design, implement, and re-test changes.
VRT provides a rigorous assessment of the Voice User Interface and is integral to user-centric design. It serves as a phase gate to evaluate if an application is ready for external exposure. It ensures a quality user experience and maximizes the benefits of subsequent usability testing. The effort it requires is minimal; and its potential impact on customer satisfaction, user adoption rates, and ultimately ROI is considerable. It is worth establishing VUI Review Testing as part of your speech best practices.
Lizanne Kaiser is a senior principal consultant at Genesys Telecommunications Laboratories, Inc., specializing in Voice User Interface design and usability testing, focusing on the strategic use of speech solutions in contact center environments.
 Robby Kilgore (personal communication, 2004, SpeechTEK, New York) has also described this with the acronym "A.E.I.O.U" = Artifacts, Environment, Interactions, Outcomes, and Users.
 Neither DTT nor VRT is intended to verify recognition coverage and accuracy, which are addressed through Recognition Tuning based on real caller utterances during pilot/full deployment. However, DTT needs to test an example of each command to ensure all pathways are functioning.