Speech Recognition in Education: Unexploited Opportunities
Approximately 98 Percent-plus Accuracy…
Almost everybody in the speech industry has heard vendor claims of 95 to 98 percent-plus speech recognition accuracy. These claims, if slightly qualified, are undeniably true. In fact, using a good-quality microphone in a quiet test environment, I have repeatedly obtained 100 percent speech recognition accuracy with several of the major ASR engines.
Detractors of speech recognition technologies might object, however, that under real-world telephony conditions recognition accuracy hovers between 84 and 88 percent. That is less impressive than 98 percent-plus, but in a well-designed application, recognition rates in the mid-80s are usually acceptable.
What's the Problem?
What contributes to the slip in recognition accuracy? To list a few: cell phones, speakerphones, large grammars, and prompts that elicit unanticipated out-of-grammar utterances. These factors, among others, can have detrimental effects on real-world telephony-based speech recognition accuracy rates. VUI designers accept these things as matters of fact and typically seek to compensate for any recognition shortcomings by using industry best practices and special design tactics.
An Interesting Opportunity
Shortcomings notwithstanding, current speech recognition technologies are capable of supporting tremendous advances in computer-based training (CBT) applications. Speech recognition affords particularly interesting opportunities because it can provide the means to interactively evaluate the utterances of a learner on several educational dimensions (more on this later). Heretofore, this level of intelligent feedback necessitated human intervention.
Another item of interest is that most of the factors that degrade speech recognition accuracy would not be present in an educational environment. In fact, it is easy to imagine a non-telephony-based speech recognition teaching-machine application that always prompts for very specific utterances and never has more than a few items in its grammars at a given time. Such an environment would be devoid of cell phone and speakerphone problems, non-specific prompts and the problems associated with large and highly tolerant grammars. This basic approach would be particularly effective in learning tasks that require memorization.
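To make the idea of a tightly constrained grammar concrete, here is a minimal sketch in Python. It does not call any real ASR engine: the names Card, normalize, in_grammar and DECK are hypothetical, and the recognizer's output is assumed to arrive as a plain text hypothesis. The point is only that each prompt activates a grammar of one or two expected utterances, so anything else is out of grammar by construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """One flashcard: a prompt plus the tiny grammar active while it is shown."""
    prompt: str           # what the learner sees, e.g. the English word
    answers: frozenset    # the only utterances accepted for this prompt


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are not format-sensitive."""
    return " ".join(text.lower().split())


def in_grammar(card: Card, hypothesis: str) -> bool:
    """Return True if the recognizer's text hypothesis is in the card's grammar."""
    return normalize(hypothesis) in card.answers


# Illustrative deck (assumed content): each grammar holds only a couple of items.
DECK = [
    Card("dog", frozenset({"il cane", "cane"})),
    Card("cat", frozenset({"il gatto", "gatto"})),
]

if __name__ == "__main__":
    card = DECK[0]
    for heard in ["Il cane", "il gatto"]:
        verdict = "in grammar" if in_grammar(card, heard) else "out of grammar"
        print(f"{card.prompt!r}: heard {heard!r} -> {verdict}")
```

Because the active grammar changes with every card, the application never has to discriminate among more than a handful of candidate utterances at once.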
An Italian Lesson
Let's explore a specific example. An important, if tedious, part of learning a foreign language is rote memorization. Historically, students have used flashcard-type methods to present things like vocabulary words. Each word basically serves as a stimulus for the desired response. For instance, a student studying Italian might be presented with the written English word "dog" and be expected to respond with the Italian equivalent, "il cane" (pronounced, "eel con-nay").
Learning to associate text in one language with text in another language is fine, but that ability has relatively little to do with conversational competence in the target language. Among many other things, language learners must master basic pronunciation. After all, what is the use of associating the text "dog" with the text "il cane" if you think "il cane" should sound like "ill Cain"?
The Human Touch?
In order to teach proper pronunciation, you need the ability to compare a learner's utterance with that of a native or model speaker. Let's say that a human tutor asks a student, "How do you say 'dog' in Italian?" and the student, making his best effort, replies, "ill Cain." The student's response is associatively correct. Its form or pronunciation is, however, unacceptable. Seeking to encourage the student's best effort, a good tutor might nod his approval, while simultaneously repeating the answer in its appropriate pronunciation, "eel con-nay." The student would then note how his pronunciation differs from the tutor's, repeat the correct pronunciation, and move on.
When a Disadvantage Becomes an Advantage
I strongly suspect that the role of the tutor above could be automated in a well-designed instructional application. Modeling the basic dialog and turn-taking should not pose any particular challenges, and I suspect that dynamically adjusting the ASR engine's certainty factor value (the confidence an utterance must reach before it is accepted) could be used to shape near-native pronunciations.
Normally, as the ASR engine's certainty factor setting increases, so does the risk of falsely rejecting an in-grammar utterance. But this apparent disadvantage could actually be a powerful advantage in an application designed to teach proper pronunciation to a student of foreign languages.
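One way this shaping might work is sketched below in Python. Everything here is an assumption made for illustration: the class name PronunciationShaper, the step sizes, and the idea that the engine reports a confidence score between 0.0 and 1.0 for each in-grammar utterance. Real engines expose confidence differently, so treat this as a sketch of the logic rather than a working integration.

```python
from dataclasses import dataclass


@dataclass
class PronunciationShaper:
    """Raises the acceptance threshold as the learner succeeds (hypothetical design)."""
    threshold: float = 0.50   # assumed starting certainty factor, on a 0.0-1.0 scale
    step_up: float = 0.05     # tighten after each accepted attempt
    step_down: float = 0.02   # relax slightly after a rejection so the learner does not stall
    floor: float = 0.30
    ceiling: float = 0.90

    def judge(self, confidence: float) -> bool:
        """Accept or reject one attempt, then adjust the threshold for the next one."""
        accepted = confidence >= self.threshold
        if accepted:
            self.threshold = min(self.ceiling, self.threshold + self.step_up)
        else:
            self.threshold = max(self.floor, self.threshold - self.step_down)
        return accepted


if __name__ == "__main__":
    shaper = PronunciationShaper()
    # Simulated confidence scores for four successive attempts at "il cane".
    for score in [0.55, 0.58, 0.58, 0.70]:
        before = shaper.threshold
        result = "accepted" if shaper.judge(score) else "rejected, model answer replayed"
        print(f"score {score:.2f} vs threshold {before:.2f}: {result}")
```

In this simulated run the third attempt is rejected even though an identical score was accepted one turn earlier. That is the deliberate "false rejection" at work: the bar has moved, and the learner must pronounce the phrase more like the model to keep advancing.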
While research has already been conducted in this basic area (the FLUENCY project at the Language Technologies Institute at Carnegie Mellon University, for example), the potential for using speech technologies in educational applications remains a tremendous and unexploited opportunity.