The Need for Continuous Speech Recognition Testing
From traditional interactive voice response systems, still very common in customer service, to voice assistants like Alexa or Google Assistant, voice technology is now part of everyday life, and it will continue to gain importance in the future. The component stack of a typical voice assistant includes speech recognition and speech synthesis in addition to the usual conversational components: natural language understanding, dialogue management, and natural language generation.
To ensure that your voice assistant works in every situation, testing is crucial. Tools like Botium Box enable companies to implement a holistic test strategy for voice assistants on all levels of the component stack.
The big cloud service providers Google, Amazon, Microsoft, and IBM all provide high-quality speech services with some of the best recognition rates on the market. Some of these providers also let you add your own optimizations by uploading additional training data, which is often used to improve recognition rates for domain-specific vocabulary. Apart from the big cloud service providers, there are also a number of free software packages available, like Kaldi (for speech recognition) or MaryTTS (for speech synthesis), which companies install, train, and operate on their own infrastructure.
Continuous speech recognition testing is most beneficial for voice assistants that use an optimized cloud speech service or completely self-trained language models. As part of a voice assistant test strategy, the quality of the speech recognition should be verified continuously, along with all the other stack components.
Voice assistants should function flawlessly, even when the audio is not perfectly recorded. The intention is not to test the system behavior in isolated situations but to simulate real-world use of the voice assistant. A stable response under different circumstances is a matter of delivering quality and consistency.
We don’t only use these technologies in a completely quiet environment, but also when our children are playing in the background (which usually involves screaming) or when we are driving through a tunnel. Voice assistants have to answer appropriately in all of these circumstances, even if speakers have different voice pitches, accents, or tones. Performance in such real-life scenarios will differentiate your voice assistant from average chatbots with low understanding.
At Botium, we use the term humanification to describe the application of algorithms to introduce noise into the test data. For text-based testing, this means considering typical human behavior patterns and typical human flaws, like typographical errors, inconsistent capitalization, extra or missing whitespace, emojis, and others. For voice-based testing, this means adding some environment-specific background noise.
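The text side of this idea can be illustrated with a minimal Python sketch. This is not Botium's implementation; the function names `swap_adjacent_chars` and `humanify_text` and the particular set of variants are assumptions made for illustration only.

```python
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Simulate a transposition typo by swapping two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def humanify_text(utterance: str, seed: int = 0) -> dict[str, str]:
    """Return noisy variants of one test utterance (hypothetical helper).

    A fixed seed keeps the generated typo reproducible between test runs.
    """
    rng = random.Random(seed)
    words = utterance.split()
    return {
        "lowercase": utterance.lower(),
        "uppercase": utterance.upper(),
        "no_whitespace": "".join(words),
        "extra_whitespace": "  " + "  ".join(words) + " ",
        "typo": swap_adjacent_chars(utterance, rng),
        "emoji": utterance + " 🙂",
    }


if __name__ == "__main__":
    for name, variant in humanify_text("Book a table for two").items():
        print(f"{name:>16}: {variant!r}")
```

Each variant would then be sent to the assistant in place of the clean utterance, with the expectation that the recognized intent stays the same.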
In the Voice Effects list in Botium Box, you can configure your pipeline of additional noise layers to apply to an audio file. There are several common audio effects available to simulate real-life environments, including the following:
- Adding background noise;
- Making it sound like a low-bandwidth GSM phone call;
- Simulating a slightly interrupted phone line by adding breaks.
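To make two of these effects concrete, here is a rough Python sketch that works on plain lists of float samples. It is not how Botium Box implements its voice effects: the function names, the signal-to-noise parameter `snr_db`, and the `period`/`gap` parameters are assumptions for illustration, and the GSM effect is left out because it requires a real codec.

```python
import math


def rms(samples: list[float]) -> float:
    """Root mean square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_with_noise(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add background noise so that the speech-to-noise ratio equals snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must not be empty")
    # Repeat the noise clip until it covers the whole speech clip.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    noise_level = rms(tiled)
    if noise_level == 0:
        return list(speech)  # silent noise clip: nothing to add
    # SNR in dB is 20 * log10(speech_rms / noise_rms); solve for the noise gain.
    target_noise_level = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_level / noise_level
    return [s + gain * n for s, n in zip(speech, tiled)]


def add_breaks(samples: list[float], period: int, gap: int) -> list[float]:
    """Silence the last `gap` samples of every `period` samples,
    imitating a line that drops out at regular intervals."""
    if period <= 0 or not 0 <= gap <= period:
        raise ValueError("need period > 0 and 0 <= gap <= period")
    return [0.0 if i % period >= period - gap else s
            for i, s in enumerate(samples)]
```

Lower `snr_db` values produce louder noise relative to the speech, so a test suite could run the same utterance at several levels to find where recognition starts to degrade.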
In many cases, you might not be interested in the exact transcription of the audio file but in whether the transcription meets certain quality criteria. This is where the word error rate comes into play. It measures the share of words in a transcription that were recognized incorrectly: the number of substituted, deleted, and inserted words divided by the number of words in the reference label. For a perfect transcription fully matching the label, it is 0. The value usually lies between 0 and 1, although it can exceed 1 when the transcription contains many inserted words. Depending on your requirements, you might consider a word error rate of 0.1 (one error for every 10 reference words) to be OK.
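The word error rate is commonly computed as the word-level edit distance between the label and the transcription, divided by the number of words in the label. The following Python sketch shows that calculation. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for illustration, and real test tools may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    label = "please turn on the light in the kitchen right now"
    transcript = "please turn on the light in the chicken right now"
    print(word_error_rate(label, transcript))  # one substitution in ten words: 0.1
```

In a continuous test run, an assertion such as `word_error_rate(label, transcript) <= 0.1` could serve as the pass criterion instead of an exact string match.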
Once your assistant is deployed and exposed to the real world, it will most likely process user inputs that it has not seen in the training data. Continuous speech recognition testing is the first step in determining whether your voice assistant understands the user correctly, since that is the precondition for giving an accurate answer and completing the task at hand.