Recognition Means More Than Just Getting the Words Right

Little attention has been given to the readability of automatic speech-generated text beyond the accuracy of its words. We judge its quality entirely on the basis of the percentage of spoken words accurately reproduced in text.
This was sufficient when most speech recognition was discrete, requiring distinct pauses between words. However, with the emergence of continuous speech recognition, and its extension into natural language and dialogue, word accuracy is no longer the only measure of readability.
My own interest in the readability of automatic speech-to-text stems from a longtime interest in applications of real-time speech-to-text for interacting with people who are deaf or severely hard of hearing. Successful applications of real-time stenographic translation in classroom environments have added to current interest in automatic speech recognition for the same purpose.
Vis-à-vis the present state of the art, attempting to introduce ASR for communicating with deaf and severely hard of hearing students in a college classroom in which most or all of their classmates are hearing is pushing the envelope a bit, particularly under conditions that include spontaneous speech and multiple users. The text must be both accurate and readable. We are still dealing with the question of feasibility.

Word accuracy and readability

For automatic speech-to-text, word accuracy can be considered the first of several components of the text’s readability. But while word accuracy is essential for readability, for many present and future applications it is not enough. To word accuracy I suggest we add sentence markers. And as speech recognition adds conversation and dialogue to its repertoire of applications, we should add indicators of change in speakers.
Let’s see an example of how accuracy, sentence markers, and identification of speaker changes can alter the readability of text. This example is drawn from the transcript of an actual lecture. ASR was then used to reproduce that transcript verbatim into the text you see here, with 100% word accuracy.

why do you think we might look at the history of the family history tends to dictate the future okay so there is some connection you’re saying what else evolution evolution you’re on the right track which changes faster technology or social systems technology

But in spite of perfect word accuracy, the readability of this text remains quite low. Imagine, if you will, a classroom context where a student is expected to read text of this kind in real time for an hour or more, at a rate of 150 words per minute, and without the ability to control when to turn the page or freeze the screen. It’s a daunting task.

Adding sentence markers

There are a number of things we can do with the above text to improve its readability. One is to add sentence markers, i.e., (for the present) to use the voice commands for period, question mark, or exclamation point to close each sentence (and automatically add capitalization to the first word in the sentence that follows).

Why do you think we might look at the history of the family? History tends to dictate the future. Okay. So there is some connection you’re saying. What else? Evolution. Evolution. You’re on the right track. Which changes faster technology or social systems? Technology.

It is evident that the addition of sentence markers to the text substantially improves its readability. Nevertheless, it still doesn’t make total sense. As you may have already discerned, there were numerous changes in speakers in this passage, in fact five. These are reflected as you read the same passage below, altered to indicate changes in speakers.

Indicating speaker changes

Instructor: Why do you think we might look at the history of the family?
Student: History tends to dictate the future.
Instructor: Okay. So there is some connection you’re saying. What else?
Student: Evolution.
Instructor: Evolution. You’re on the right track. Which changes faster technology or social systems?
Student: Technology.

With the identification of the changes in speakers, the text has become quite readable. Is this all it usually takes? Of course not. For this example I deliberately selected text that already had 100% word accuracy, exhibited excellent grammaticality, and contained no speech disfluencies.
The point is that a high level of word accuracy in text is no guarantee of its readability. I have suggested that in the recognition of text originating from a single speaker, at least two components contribute to readability: (a) word accuracy, and (b) the use of punctuation to distinguish sentences. In the recognition of dialogue, e.g., conversation, there is a third component: (c) the indication of speaker changes.

Assessing Readability
Word accuracy

There are two imperatives for accurate assessment. The first is that you work from an audio record of the speech. Refer back to what was actually said (including disfluencies, false starts, etc.) as you score the uncorrected text. Recall is a poor substitute for listening to an audio playback, particularly when it comes to recapturing spontaneous speech.
The second imperative is that you count the words spoken carefully and consistently. How we derive our word count can have a considerable influence on how we represent accuracy. On the surface, deriving the word count seems to be a simple matter of doing just that – counting all the words. Not so. Six suggested rules follow:

A word count is derived from words as spoken (either spontaneously or as read aloud from text), not as ASR-transcribed or edited.
The word count should include the speaker’s disfluencies and variations in grammatical structure. (If correctly transcribed, these will be scored as correct.)
Numbers are counted as single words, regardless of how many numerals or words they contain, e.g., fifty four/54, one point five/1.5.
Words that are spelled aloud are counted as single words.
Hyphenated words are counted as two words.
Excluded from the word count are:
- Voice (hands-off) commands, e.g., for punctuation, repositioning the cursor
- Identification of a new speaker (in the case of dialogue)
- Extraneous noises (not speech disfluencies), e.g., phone ring, cough, printer.
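
As a rough illustration of these counting rules, here is a minimal Python sketch. It assumes the scorer has already tagged each spoken unit by hand while listening to the audio record. The tag names ("word", "number", "spelled", "command", "speaker_id", "noise") and the function name are hypothetical conveniences, not part of the procedure itself.

```python
# Hypothetical tags assigned by the scorer while listening to the audio record.
EXCLUDED = {"command", "speaker_id", "noise"}  # left out of the word count
SINGLE = {"number", "spelled"}                 # always counted as one word


def count_spoken_words(units):
    """Count words as spoken from a list of (text, kind) pairs."""
    total = 0
    for text, kind in units:
        if kind in EXCLUDED:
            continue
        if kind in SINGLE:
            total += 1
        else:
            # Ordinary words, including disfluencies. A hyphenated pair
            # such as "real-time" is counted as two words.
            total += len(text.replace("-", " ").split())
    return total


units = [
    ("uh", "word"),                # disfluency, counted
    ("the", "word"),
    ("real-time", "word"),         # hyphenated: two words
    ("rate", "word"),
    ("was", "word"),
    ("one point five", "number"),  # number: one word
    ("period", "command"),         # voice command, excluded
    ("cough", "noise"),            # extraneous noise, excluded
]
print(count_spoken_words(units))  # 7
```

The tagging itself remains a human judgment made from the audio playback; the sketch only applies the arithmetic consistently.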

There is remarkably little literature on procedures for assessing word accuracy. Dating from early reports on the accuracy of discrete-word products, most accuracy rates continue to be reported as the percentage of spoken words correctly transcribed into text, typically with little or no explanation of how it was derived.
In preparation for developing the scoring rules which follow, I used both NaturallySpeaking and ViaVoice as I read aloud a variety of materials including the texts of lectures and the scripts of scenes from several contemporary plays. (Plays were of particular interest to me because of their use of dialogues and their simulation of spontaneous speech.)
A classification of error types was derived from an analysis and categorization of almost 2,000 recognition errors that were identified from these readings. This classification was modified several times, leading to the scoring procedure for assessing word accuracy that is discussed here.

General Rules for identifying word errors

Six general rules are used in identifying word errors.

Word error counts are always based on discrepancies from words actually spoken, not on words appearing in the text, e.g.,

spoken words:	transcribed into:	errors:
hanging around	had a route	(2 errors because two spoken words omitted)
into the sea	embassy	(3 errors because three spoken words omitted)

Numbers are counted and scored as single words, regardless of length, character representation, or mathematical symbols, e.g.,

as spoken (counted as one word)	as transcribed numerically (or as spoken)
twenty three	23
nineteen ninety five	1995
three and a half	three in a half (1 error)
ninety eight point six	98.6

Errors (including word deletions) are counted every time they occur.
Speech disfluencies (fillers, repetitions, repairs, false starts) are included in both the total word and word error counts, and classified on the basis of their transcription accuracy like all other words.
Additions to text produced by the following are excluded from the total word and word error counts. Confirm these additions by listening to your audio recording of your speech.
- voice commands
- identifiers of changes in speakers
- extraneous sounds (not speech disfluencies)
The presence or absence of the following is not counted as an error.
- punctuation
- hyphenation of/between words
- capitalization
- contraction or separation of two words if consistent with spoken intent

Word Errors and Scoring Protocol

Word errors can involve a single word or a string of multiple words, and how such errors are scored can have a big impact on the resulting accuracy score.

Single word errors
Substitution. Single spoken word transcribed into another single word or recognized as voice command. 1 error.
Examples. and/in, traffic/terrific, know/now, how's/house, left/enough, pad/.
Exceptions. Common slang if synonymous, e.g., yeah/yes, goodbye/bye.
Homonym. Single spoken word transcribed as homonym. 1 error.
Examples. for/four, no/know, right/write, sales/sails.
Ending. Single spoken word transcribed with change in ending (tense, possessive, number, contraction of two words, part of speech). 1 error.
Examples. walk/walked, banker/banker's, prairie/prairies, we/we've, instruct/instruction.
Addition. Single spoken word transcribed as two or more words. 1 error.
Examples. environment/it acquired but, ashore/as short.
Omission. Single spoken word omitted in text. 1 error.
Disfluencies. Single disfluent word or filler not recognized or recognized incorrectly in text. 1 error.
Examples. uh/the, that they (pause) they argued/that they argued.

Multiple word errors
Multiple word substitution. String of spoken words transcribed as different word string or single word appearing in text. 1 error for each word in the spoken word string that is omitted from the text.
Note: Associated word strings are usually identified by phonetic resemblance; typically but not necessarily same number of phonemes; otherwise treat as single word substitution errors (see above).
Examples. the men/demand, 2 errors; two days/today's, 2 errors; come around/had a route, 2 errors; are you sick/our use it, 3 errors; now close your eyes/no closure eyes, 3 errors.
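
The arithmetic behind these examples can be sketched in a few lines of Python. This is a simplification under stated assumptions, not the full protocol: it assumes the scorer has already paired each spoken string with the text it produced, and it simply counts the spoken words that do not reappear in that text. The function names are hypothetical. The sketch does not handle the slang exceptions, numbers transcribed as numerals, or contractions consistent with spoken intent, all of which still require the scorer's judgment.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, treat hyphens as spaces, and drop punctuation other than
    apostrophes, since none of these differences count as errors."""
    text = text.lower().replace("-", " ").replace("’", "'")
    return re.sub(r"[^a-z0-9' ]+", " ", text).split()


def segment_errors(spoken, transcribed):
    """Errors for one scorer-aligned segment: the number of spoken words
    that are missing from the transcribed text."""
    missing = Counter(normalize(spoken)) - Counter(normalize(transcribed))
    return sum(missing.values())


examples = [
    ("the men", "demand"),                      # 2 errors
    ("two days", "today's"),                    # 2 errors
    ("are you sick", "our use it"),             # 3 errors
    ("now close your eyes", "no closure eyes"), # 3 errors
    ("environment", "it acquired but"),         # 1 error (addition)
    ("walk", "walked"),                         # 1 error (ending)
]
for spoken, transcribed in examples:
    print(f"{spoken} / {transcribed}: {segment_errors(spoken, transcribed)}")
```

Counting against the spoken words, rather than the words in the text, is what keeps a one-word error such as environment/it acquired but from being scored as three.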

Calculation of word accuracy score. To assess the word accuracy of a given text,

(1) Add the total number of words spoken by the speaker, less voice commands, extraneous sounds, and identifiers of speaker changes if any (see Speaker changes).
(2) Using the General rules and the Word errors and scoring protocol, identify and add all word errors.
(3) Subtract (2) from (1).
(4) Divide (3) by (1) to obtain word accuracy percentage or score.
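
The four steps above amount to a single line of arithmetic. A minimal Python sketch follows; the function name and the sample figures (300 words spoken, 12 word errors) are made up for illustration.

```python
def word_accuracy(words_spoken, word_errors):
    """Word accuracy as a percentage:
    (words spoken - word errors) / words spoken * 100."""
    if words_spoken <= 0:
        raise ValueError("words_spoken must be positive")
    return 100.0 * (words_spoken - word_errors) / words_spoken


# Illustrative figures only: 300 words spoken, 12 word errors.
print(f"{word_accuracy(300, 12):.1f}%")  # 96.0%
```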

Sentence markers
If the ASR literature is scarce on the topic of assessing accuracy, it may be non-existent with respect to sentence-ending punctuation. In the earlier example, we saw the effect of sentence markers on the readability of text.
As the ASR user already knows, sentence markers, i.e., punctuation and capitalization, are easily added to text with the use of voice commands. The addition of a punctuation criterion to word accuracy offers us the option of adding a readability score to the existing word accuracy score.
The following rules apply to sentence identification.

Each sentence as spoken is counted to ascertain the total number of sentences spoken. This can be intuitively derived, based on grammar, voice inflexion, etc. A single word is allowable as a sentence.
Each omission of sentence boundaries is scored as a sentence identification error.

Readability based on word accuracy and sentence identification. Unless the speech being transcribed involves dialogue, i.e., two or more speakers, the following procedure can be followed to determine the readability of the text.

(1) Add the total words spoken by the speaker (see Calculation of word accuracy score) and the count of sentences as spoken.
(2) Add the total word errors in the text (see Calculation of word accuracy score) and the number of times sentence boundaries were omitted in the text.
(3) Subtract (2) from (1).
(4) Divide (3) by (1) to obtain readability percentage or score.
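
To see how sentence markers move the score, here is a minimal Python sketch of this two-component calculation. The function name and all of the figures (300 words, 12 word errors, 20 sentences) are invented for illustration; only the formula comes from the steps above.

```python
def readability(words_spoken, word_errors, sentences_spoken, sentence_omissions):
    """Single-speaker readability as a percentage, combining word accuracy
    with sentence identification."""
    total = words_spoken + sentences_spoken
    errors = word_errors + sentence_omissions
    if total <= 0:
        raise ValueError("nothing was spoken")
    return 100.0 * (total - errors) / total


# Same hypothetical text (300 words, 12 word errors, 20 sentences),
# scored with every sentence boundary marked and then with none marked.
print(f"{readability(300, 12, 20, 0):.2f}%")   # 96.25%
print(f"{readability(300, 12, 20, 20):.2f}%")  # 90.00%
```

In this made-up case the word accuracy is 96% either way; only the readability score registers the missing sentence markers.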

Speaker changes (applies only to dialogue).
We have just discussed how measures of word accuracy and of sentence markers can combine to produce a readability score. However, in the earlier example, which included multiple speakers (an instructor and students), we also saw how readability differed depending on whether or not the text signified speaker changes.
The following rules apply to indication of speaker changes (detectable from audio playback).

Each change in speaker is counted as a point in the total number of speaker changes. Note that while two speakers may each speak three times, this represents only five changes.
Each omission of notification of change in speaker is counted as an error.

Readability based on word accuracy, sentence identification, and indication of speaker changes. You are reminded that this procedure applies only when two or more speakers are involved. To combine word accuracy, sentence identification, and identification of speaker changes as a measure of readability,

(1) Add the number of speaker changes as spoken, total words spoken, and total sentences spoken.
(2) Add the number of times the identification of speaker changes was omitted from the text, total word errors, and the number of times sentence identification was omitted.
(3) Subtract (2) from (1).
(4) Divide (3) by (1) to obtain the readability percentage or score.
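
As a worked illustration, the lecture passage used earlier in this article can be scored in its three versions. The Python sketch below is only a sketch: the function name is hypothetical, the word count comes from splitting the passage on spaces, the sentence count comes from counting the terminal punctuation in the punctuated version, and the five speaker changes are taken from the discussion of that passage.

```python
import re

RAW = ("why do you think we might look at the history of the family history "
       "tends to dictate the future okay so there is some connection you're "
       "saying what else evolution evolution you're on the right track which "
       "changes faster technology or social systems technology")

PUNCTUATED = ("Why do you think we might look at the history of the family? "
              "History tends to dictate the future. Okay. So there is some "
              "connection you're saying. What else? Evolution. Evolution. "
              "You're on the right track. Which changes faster technology or "
              "social systems? Technology.")

SPEAKER_CHANGES = 5  # as noted for this passage


def readability(words, word_errors, sentences, sentence_omissions,
                speaker_changes=0, speaker_omissions=0):
    """Readability percentage combining words, sentences, and speaker changes."""
    total = words + sentences + speaker_changes
    errors = word_errors + sentence_omissions + speaker_omissions
    return 100.0 * (total - errors) / total


words = len(RAW.split())                           # 44 words as spoken
sentences = len(re.findall(r"[.?!]", PUNCTUATED))  # 10 sentences as spoken

# Word accuracy is 100% in all three versions, so word_errors is 0 throughout.
versions = {
    "no markers": readability(
        words, 0, sentences, sentences, SPEAKER_CHANGES, SPEAKER_CHANGES),
    "sentence markers only": readability(
        words, 0, sentences, 0, SPEAKER_CHANGES, SPEAKER_CHANGES),
    "sentence and speaker markers": readability(
        words, 0, sentences, 0, SPEAKER_CHANGES, 0),
}
for name, score in versions.items():
    print(f"{name}: {score:.1f}%")
# no markers: 74.6%
# sentence markers only: 91.5%
# sentence and speaker markers: 100.0%
```

With perfect word accuracy throughout, the score rises only as sentence markers and speaker identifications are added.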

Inter-scorer reliability
Under the assumption that the primary application of this test, at least in the immediate future, will be to assess word accuracy, reliability is reported here only for that component. Five texts, varying between 300 and 344 words, were read aloud and re-transcribed using ASR. This produced one version of each text as spoken, and one version as uncorrected ASR-generated text.
Eleven scorers, ranging from novices to experts in their familiarity with ASR, were given the two versions of each of the five texts and asked to score the uncorrected ASR-generated versions, using the scoring procedures outlined in this article. The errors they detected in each text were converted into word accuracy percentage scores as described earlier.
There were considerable differences in word accuracy scores across the five texts, the most accurate text producing a mean word accuracy score of 97.6%, and the least accurate text a mean word accuracy score of 80.7%. However, within each text the range of scores across the 11 scorers was quite narrow, varying from 0.9% on one text to 2.4% on another.
Cronbach's Coefficient Alpha was selected to estimate the inter-scorer reliability of the procedure for assessing word accuracy, producing a reliability coefficient of .9995. A coefficient of this magnitude (1 being maximum) clearly shows high agreement among scorers who follow the recommended scoring procedures as described in this article.
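
For readers unfamiliar with the statistic, the following Python sketch shows one common way to compute Cronbach's alpha. It rests on an assumption: each scorer is treated as an "item" and each text as a "case." The function name is hypothetical, and the scores in the example are invented for illustration; they are not the data from the study reported here.

```python
from statistics import variance


def cronbach_alpha(scores_by_scorer):
    """Cronbach's alpha, treating each scorer as an item.

    scores_by_scorer holds one list of scores per scorer, with every
    scorer rating the same texts in the same order.
    """
    k = len(scores_by_scorer)
    if k < 2:
        raise ValueError("need at least two scorers")
    # Sum of each scorer's own variance across the texts.
    item_variance_sum = sum(variance(scores) for scores in scores_by_scorer)
    # Variance of the per-text totals across all scorers.
    totals = [sum(column) for column in zip(*scores_by_scorer)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Invented word accuracy scores (%): 3 scorers by 5 texts. Not the study's data.
scores = [
    [97.5, 93.0, 88.2, 84.9, 80.5],
    [97.8, 92.4, 88.9, 85.5, 81.0],
    [97.2, 93.3, 87.8, 84.1, 80.2],
]
print(round(cronbach_alpha(scores), 4))
```

When scorers rank and rate the texts almost identically, as in this made-up table, the summed individual variances are small relative to the variance of the totals, and alpha approaches 1.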
In conclusion, for those of us who use ASR primarily for dictation, a perusal of the text may be enough to give us a feel for its accuracy and readability. However, formal assessment, using reliable, well-documented measures, becomes essential for ASR product development and evaluation, and for testing new applications.
Also, applications of ASR that are intended to culminate in text should be readable to the intended reader, just as speech should be intelligible to the intended listener. Speaking of which, Ray Kurzweil, author of The Age of Spiritual Machines, predicts that by the year 2009, "deaf persons - or anyone with a hearing impairment [will] commonly use portable speech-to-text listening machines, which display a real-time transcription of what people say."
For that to happen, the transcripts to be produced by these machines must be accurate, and more than that, they must be readable.

Ross Stuckless is a professor and research associate at the National Technical Institute for the Deaf, Rochester Institute of Technology, in Rochester, New York. He can be reached at (716) 475 6449 or by e-mail at ersnvd@rit.edu.
