Speech Technology Magazine

 

The Alpha Bail

By Walter Rolandi - Posted Apr 30, 1997

A Little Bit of Energy Can Make a Big Difference
Usually, speech recognition is the preferred modality in telephony applications that require non-numeric input.   Imagine asking users to type in something like the name of a movie or a restaurant or a street name using a telephone keypad.  That would be a cruel usability joke.  When entering information that cannot be otherwise conveyed using telephone keypad numbers, speech recognition, as a rule, provides a far superior solution.

But that's not to say that all word-level data entries are without problems. 

Letters of the English alphabet provide a case in point.  Speech recognition engines have trouble recognizing individual letters uttered in isolation for at least two reasons.  One is that every letter except W is monosyllabic, so each utterance gives the recognizer very little acoustic material to work with.  The other is that so many of the letters in the English alphabet sound essentially the same.  Consider the following "confusion classes":

  • A H J K
  • B C D E G P T V Z
  • F S X
  • I Y
  • M N

Of the 26 letters in the English alphabet, a full 20 letters, or around 77 percent, can cause problems for a recognizer.  Think about the one-phoneme difference between a C and an E.  As with most of the letters in each of the confusion classes, the difference between a C and an E comes down to only a very small amount of phonetic energy.
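The count behind that figure can be checked with a short sketch.  The class lists are taken directly from the bullets above; the function names are hypothetical and chosen only for illustration.

```python
# Confusion classes exactly as listed in the article.
CONFUSION_CLASSES = ["AHJK", "BCDEGPTVZ", "FSX", "IY", "MN"]


def confusable_letters(classes=CONFUSION_CLASSES):
    """Return the set of letters that belong to any confusion class."""
    return {letter for group in classes for letter in group}


def confusable_share(classes=CONFUSION_CLASSES, alphabet_size=26):
    """Return the fraction of the alphabet that falls in a confusion class."""
    return len(confusable_letters(classes)) / alphabet_size


def same_class(a, b, classes=CONFUSION_CLASSES):
    """Return True when two letters share a confusion class."""
    a, b = a.upper(), b.upper()
    return any(a in group and b in group for group in classes)


if __name__ == "__main__":
    print(len(confusable_letters()))          # number of confusable letters
    print(round(confusable_share() * 100))    # share of the alphabet, in percent
    print(same_class("C", "E"))               # C and E are in the same class
```

Twenty of the 26 letters appear in some class, which is roughly 77 percent of the alphabet.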

Compensatory Practices
VUI designers have evolved a number of ways to overcome the challenge that alphabetic information presents.   One method that works very well is to support the use of the NATO (or similar) Phonetic Alphabet.   Note that I am not advocating requiring users to learn a phonetic alphabet.  I am, however, suggesting that supporting such a system can all but eliminate speech recognition errors when users are obliged to speak letters of the alphabet.

The NATO Phonetic Alphabet was created for use in radio transmissions under combat conditions.  Phonetically distinct words were chosen to represent each letter of the alphabet.  Most of the words convert the letter into a two- or three-syllable utterance.  Because the letter-words are so distinct from one another, speech engines virtually always recognize them correctly.  Each is listed below.

A - ALFA
B - BRAVO
C - CHARLIE
D - DELTA
E - ECHO
F - FOXTROT    
G - GOLF
H - HOTEL
I - INDIA
J - JULIETT
K - KILO
L - LIMA
M - MIKE
N - NOVEMBER
O - OSCAR
P - PAPA
Q - QUEBEC
R - ROMEO
S - SIERRA
T - TANGO
U - UNIFORM
V - VICTOR
W - WHISKEY
X - XRAY
Y - YANKEE
Z - ZULU  

That's Alfa-Oscar-Kilo by Me
Of course, the use of the NATO system in VUI applications that are used by military and police personnel is quite natural and appropriate, but again, I would never suggest that users be forced to adopt it.  However, I have found that using the NATO system in confirmation prompts can be a particularly effective way to suggest its use by the user.  For example:

System: State your 3-letter personal identification code.
User:  ABC.
System: Alfa, Bravo, Charlie.  Is that correct?
User:  Yes.
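A confirmation prompt of this kind could be generated along the following lines.  This is only a sketch: the function name and the prompt wording are assumptions, the word list follows the spellings in the table above, and a real application would hand the resulting string to its own text-to-speech or prompt-playback layer.

```python
# NATO words, spelled as in the article's list (title-cased for prompts).
NATO_WORDS = (
    "Alfa Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo Lima "
    "Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey "
    "Xray Yankee Zulu"
).split()

# Map each letter to its word, e.g. "A" -> "Alfa".
LETTER_TO_WORD = {word[0]: word for word in NATO_WORDS}


def confirmation_prompt(code):
    """Build a confirmation prompt that reads a letter code back in NATO words.

    Hypothetical helper: raises ValueError for anything that is not A-Z.
    """
    words = []
    for ch in code.upper():
        if ch not in LETTER_TO_WORD:
            raise ValueError(f"not a letter: {ch!r}")
        words.append(LETTER_TO_WORD[ch])
    return ", ".join(words) + ". Is that correct?"


if __name__ == "__main__":
    print(confirmation_prompt("ABC"))
```

Reading the code back in NATO words costs the user nothing, yet every confirmation doubles as a model of the form the recognizer handles best.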

This method has a way of passively getting the user to map the letters he needs to say onto their NATO equivalents.  It also simultaneously models the utterance for the user.  This has the ultimate effect of teaching the user "the system's way" to say the utterance.  Interestingly, users will often spontaneously adopt the practice and use the NATO equivalents, particularly after a recognition failure.

To use another example, let's say that a user has had the above experience many times before.   He states his 3-letter personal identification code, "ABC," to which the system responds in confirmation, "Alfa, Bravo, Charlie.  Is that correct?"  The user thus learns to expect the system to respond, "Alfa, Bravo, Charlie" each time he says "ABC." 

Consider the following, however:

System: State your 3-letter personal identification code.
User:  ABC.
System: Alfa, Bravo, Tango.  Is that correct?
User:  No.  It's Alfa, Bravo, Charlie. 
System: Alfa, Bravo, Charlie.  Is that correct?
User:  Yes.

The technique is neither foolproof nor universally effective, but many users will spontaneously adopt these codes.  In fact, this practice is likely to lead to the following subsequent interaction:

System: State your 3-letter personal identification code.
User:  Alfa, Bravo, Charlie. 
System: Alfa, Bravo, Charlie.  Is that correct?
User:  Yes.
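Supporting the phonetic alphabet without requiring it means the application has to accept either form, or a mix of both.  The sketch below assumes the recognizer returns a list of text tokens, which is a simplification; the function name and the token format are hypothetical, and a production system would more likely express this in its recognizer's grammar.

```python
# NATO words, spelled as in the article's list.
NATO_WORDS = (
    "Alfa Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo Lima "
    "Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey "
    "Xray Yankee Zulu"
).split()

# Map each upper-cased word back to its letter, e.g. "ALFA" -> "A".
WORD_TO_LETTER = {word.upper(): word[0] for word in NATO_WORDS}


def normalize_code(tokens):
    """Convert recognized tokens (bare letters or NATO words) to a letter code.

    Hypothetical helper: raises ValueError for tokens it cannot interpret.
    """
    letters = []
    for token in tokens:
        key = token.strip(" ,.").upper()
        if key in WORD_TO_LETTER:
            letters.append(WORD_TO_LETTER[key])   # a NATO word
        elif len(key) == 1 and "A" <= key <= "Z":
            letters.append(key)                   # a bare letter
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return "".join(letters)


if __name__ == "__main__":
    print(normalize_code(["Alfa", "Bravo", "Charlie"]))
    print(normalize_code(["A", "bravo,", "C"]))
```

Both calls yield the same code, so the user who says "ABC" and the user who says "Alfa, Bravo, Charlie" follow the same path through the dialog.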


Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Rolandi provides consultative services in the design, development and evaluation of telephony-based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at wrolandi@wrolandi.com.