From Harmless Threat to Industry Competition in 14 Weeks: Amateur Designers Leave the Experts Speechless

It's 7:00 p.m. and a crowd of industry leaders and university deans swarms around the doors outside the Tufts University amphitheatre. For the past several hours the members of my human factors class have been in the dim theatre making last-minute edits to their presentations. They're rehearsing transitions, checking audio and video, and honing text so they can present lucidly and dramatically the culmination of what they've learned about speech technology and voice user interface (VUI) design.

The tension is high: for the students it's a final test of their mettle before graduation, and for the industry professionals it's a chance to witness first-hand how a group of students can be turned into their toughest competition in just 14 weeks.

These are competitors who only weeks earlier didn't know that the speech industry existed, may never have programmed before, and had no idea how to deliver a professional-grade presentation. In the months following these presentations, a third of them will have received job offers from some of the biggest names in the field.

Three years ago, my co-teacher Joe Lemay and I decided to create a class in speech-recognition design using best practices learned from years of experience deploying systems. Today's students take the Internet as a given, much as we find it difficult to imagine a world before television. They SMS while reading their RSS and IM while they email. The question was: can we, in the short time span of a college course, help these students harness their innate aptitude for using technology to build working speech applications that demonstrate innovative and commercially viable designs?

The answer: Yes.  The approach: Painfully simple.

Rule #1: Don't work from the inside out; work from the outside in.

Where the typical college syllabus would begin with a study of VoiceXML, retry prompts, and project management methodologies, we begin by looking at real-world problems that speech can solve. The students learn by experiencing the end-state and imagine how they might want to create something that would solve a similar problem.  Consider how a young ballet dancer is neither told about the limitations of the human body nor the theoretical max of how many pirouettes can be done from a single preparation. What actually occurs is that she hears the music and mimics what the teacher does long before she ever learns to understand the complexities of épaulement and positions of the feet.  

It's just like working in a consulting practice, when Joe Salesguy shows up at our desks with a sheepish grin and a signed contract that promises to deliver the unprecedented on a tight schedule and a fixed budget, to clients who don't fully understand their own business. So we synthesize what Joe tells us, take a look at the customer's Web site, and attempt to understand how they really make their money, what they care about, who their customers are, and what the underlying problem is that we really need to solve. Then we apply our domain knowledge and expertise to intuit an answer to the problem long before we even know if and how we're actually going to solve it.

The parallels in my class come on the first day when the students try several speech-recognition systems and then are immediately told to design one - before knowing the slightest thing about barge-ins or mixed initiatives.  Several will start out of the gate with a system that asks, modestly, "Please say your address."  By starting in this way the students then discover limitations and viscerally experience how complex it is to write simple and elegant prompts.  The goal is to not quash their drive to solve a particular problem - but instead allow each of them to stumble blindly into brick walls and dark corners, and only then show them the principles they'll need to learn to solve the problem. Once they start learning the basic structures of VoiceXML and other aspects of recognition technology, they come to understand how certain seemingly simple problems are often quite intractable. By keeping their basic objectives in mind they quickly discover that there are other directions from which to approach the problem. Can we use their caller-ID and look up their number in an address book? The Tufts white-pages? What's a Targus database? These students learn about designing a timeout or a retry prompt only after defining the core of the caller experience.

The worst consulting practices give a designer a set of requirements and limitations on the available technology - an emotional death for the creative mind.  The best teams bring in design representation during the sales cycle when ideas are fresh and possibilities are endless. They also brainstorm design concepts even before requirements are nailed down and they support and defend the designer's plans to go beyond the requirements when the user experience requires it.

Rule #2: Inspiration happens at all times of the day and only creativity allows us to solve real problems.

Traditional curricula begin with teaching us formulae like F=ma and p=mv, and then feed us a delimited set of problems to show the applicability of those formulae. However, in our daily lives, the problems are open-ended and there are no canonical rules for solving them. We face a real customer, a budget, a business problem, a schedule and a set of unknowns - now what? At work, we determine if we have, can buy, and can integrate the technology to overcome a barrier. Will the design work for their customers or for their marketing department? Will it fit into their brand? How long will it take to produce the design and can we meet the deadline? Nothing we were taught in school prepares us to solve those problems.

This is where the typical education process strands us: at a place where we can't apply a theory or abstract methodology, where only innovation and creativity can solve the larger problem at hand; where we must acquire a new skill without knowing which skill is actually required.

In class, the students hit the same barriers we hit in our consulting practice and, as in the real world, their inspiration is as likely to happen at 1:00 p.m. as it is to happen at 1:00 a.m. And while a viable but mundane speech system can be produced by any designer, only a designer who's open to inspiration and willing to use that inspiration as a leaping-off point will create a great system. So the real question is: where do we find inspiration?

As Samuel Johnson said, "When a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully." And while many clients would like to hang their designers, an impending deadline usually works as well. It requires creativity to solve problems when encountering the limits of time and knowledge. The ability and, more importantly, the willingness to be creative are learned through experience, and it is for this reason you can't teach someone to be creative.

Rule #3: Don't bore your audience: connect with them on an intellectual and emotional level.

It was 2:00 a.m. when a student instant-messaged me, pleading, "There is no way I can explain this to your mom!" He was referring, of course, to "Blade's Other Rule of Presentations": 

"My mom will be in the audience, and she must understand, or be able to intuit, everything you talk about."

He was wrestling with an architecture diagram that corresponded to over 4,000 lines of code. His problem wasn't to understand the 4,000 lines of code he himself wrote. Rather, it was to understand what was truly essential to communicate, and what was simply implementation detail that could be left out. Once he figured that out, he could talk about a black box that carried out a specific compartmentalized process. He saw how he could animate boxes and arrows, time the actions with his descriptions, and explain the core problem without veering into the weeds. (My mom understood it perfectly.)

Can you even count the number of times you've suffered through a slide presentation with an illogical order, where you were wondering if you should stop the whole thing and just ask something basic, like "What are you talking about?"  I often feel at presentations as if the presenter were using someone else's slides and attempting to tell a story with them but leaving out the important facets and connective tissue that tie all the slides together.  An idea can't be expressed well unless it is understood well. And an idea that isn't expressed well gains no traction in today's companies, where even the most technical people need to understand how to tie their concepts to the bottom line of the business. 

One of the unusual aspects of the class is that the final presentations had shockingly few requirements: the students had to present all the key features of their solution in a traditional business format using PowerPoint, and they couldn't bore their audience.

"How the Projects Came to Life"

The structure of the class:

  • In the first third of the class, each student had to learn the basics of both coding VoiceXML and speech interface design, and practice it several times before starting the final project.  
  • The second third was spent on producing a speech system that performed a small specific task of their choice, then presenting that project in class.
  • The final third was spent on large-scale group projects - in groups of two or three students.

The group project: 

  • Students had to choose their own team members, selecting their new co-workers for skill-balance, work-style, and personality (just as we all have to do in business) and then define the roles of each member. 
  • Each group chose a problem to solve that would have real business applicability. They then had to produce design documents and VoiceXML code, cast and direct professional voice talents, and create their own PowerPoint presentation and slide backgrounds for an audience of techies from the speech industry, deans and students of the university, and interested parents and friends.

The standout group created an application called iRing, which allows callers to access their iTunes music library, select a song in any number of ways (song name, artist, play count, at random, and more), sample a section of that song, add fade-ins and fade-outs as desired, and turn that sample into a ring tone that can be sent to a phone. That summation, however, doesn't capture the stunning brilliance and complexity of iRing. Every aspect of using it has been beautifully thought out.

  • For example, the system uses a "magic-word" recognizer when it plays back the song for the user to select the sample start point so that coughs and sing-alongs don't prematurely stop the playback.  
  • Another cool feature is that the random selection mode is perceptually random. Rather than giving every song the same chance, it re-weights the selection to offset the number of times an artist appears in the list, evening out the odds of getting any particular artist. For example, if a third of your songs were by a single artist, it wouldn't select a song by that artist a third of the time.
  • Not only can you send a ring tone to your friend, but you also can record a spoken message to be delivered before the ring tone to provide context and information about what they're receiving.  
  • The application was large (8,000 lines of code) because it had to do a tremendous amount of audio and data manipulation to both convert files from one format to another and perform live cuts and edits.  In addition, the system scrubs the iTunes .xml file at the beginning of each call to create clean grammars that allow the user to select from 15,000 songs, 900 artists, 124 genres and all playlists in under a second.
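The students' code isn't reproduced here, but the idea behind "perceptually random" selection can be sketched in a few lines. The Python below is my own minimal illustration, not iRing's implementation: it assumes the simplest possible re-weighting, in which the system first picks an artist with equal odds and only then picks one of that artist's songs. The function name `pick_song` and the sample library are hypothetical.

```python
import random
from collections import defaultdict


def pick_song(library, rng=random):
    """Pick one (artist, title) pair so that every artist is equally likely.

    `library` is a list of (artist, title) pairs. This is an assumed
    re-weighting scheme for illustration, not the one iRing used.
    """
    # Group the titles under their artist.
    by_artist = defaultdict(list)
    for artist, title in library:
        by_artist[artist].append(title)

    # Step 1: choose an artist uniformly, however many songs each one has.
    artist = rng.choice(sorted(by_artist))
    # Step 2: choose uniformly among that artist's songs.
    title = rng.choice(by_artist[artist])
    return artist, title


# Hypothetical library: one artist owns 10 of the 12 songs.
demo_library = [("Prolific Artist", f"Track {i}") for i in range(1, 11)]
demo_library += [("Second Artist", "Only Song"), ("Third Artist", "Only Song")]

print(pick_song(demo_library))
```

With a plain uniform pick over songs, the prolific artist in this made-up library would come up about five times out of six. With the two-step pick, each of the three artists comes up about a third of the time, which is closer to what a listener expects "random" to sound like.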

I can't wait for the IPO.

Instead of being required to fulfill a long check-list of requirements created by an external entity, the students were forced to consider exactly what the audience would need to hear and then create their own method for presenting the story of their application. They were free to do it without regard to traditional limits. Why not dress in costume if it gets the point across better? Or have a student start in the back of the room telling a joke? Use a magic trick to create an analogy? Or use only one word on each slide - or no words at all?

Notice the obvious connection: designing the flow of a presentation, where a live audience will be critiquing with questioning eyes or even laughter, resembles the creation of a speech system, which has to adhere to the same principles of good presentation technique. What this really did was ask the most from the students' creativity and flexibility, which is exactly what the workplace demands. So, having been thrown into the deep end, how did they do?

That night, each group was as amazing as the next: a design to track a shuttle bus, another for a hospital system, a fully-working virtual fitness trainer, a working system that lets you call into your own iTunes library and download any of your songs as a ring tone (with fade-in and fade-out at the points you select).   What's really amazing is that each of these applications was a group's final project, completed in the four-week period ending the semester.

Every day the Intervoice Design Collaborative encounters problems that have yet to be fully defined. We have to harness our own inspiration and creativity so that we can realize the right design, one that fulfills the intellectual, emotional, and business-related needs of our clients and their customers. Companies that understand the importance of channeling creativity in the context of unknowns are the ones that innovate best, deliver the most, and solve the truly difficult problems in beautiful ways.

Blade Kotelly is director, Intervoice Design and Usability Collaborative, Worldwide. He is the author of "The Art and Business of Speech Recognition." Kotelly can be reached at blade@intervoice.com.
