From Conception to Adolescence, the Speech Industry Is Developing a New Breed of Designers/Developers

At the Tufts University campus, the Experimental College (the oldest organization of its kind in the United States) serves as a center for educational innovation, undergraduate curriculum expansion and faculty-student collaboration in the arts and sciences. There, Blade Kotelly of Edify and Joe Lemay, a student at the MIT Sloan School of Management, are turning out voice user-interface designers.

Kotelly and Lemay co-instruct a class called “Talking to Computers: Designing and Producing Your Own Speech Recognition System.” The class teaches students everything from designing and prototyping a speech recognition system to programming VoiceXML and casting and directing voice talent. The class grew out of two ideas. First, there is more opportunity for students to enter this field as it continues to grow; for example, a student from last year’s course now works as a designer/developer, designing and coding her own VoiceXML applications. Second, the class creates an environment where students can quickly pick up the skills they need to build VoiceXML applications that work. The class also helps each student express ideas in a manner appropriate for a professional environment. Historically, it would take speech developers/designers months to get up to speed and figure out where to drop the first box to ask the first question, but with new standards emerging and free tools in the hands of any developer, a fresh group of designers is coming out of this class.

Speech Applications Like No Other

By Blade Kotelly

Innovation is what happens when the mind is freed from the shackles of routine execution. Since students taking the “Designing and Making a Speech Recognition System” class at Tufts University aren’t bound by these shackles, they’re able to innovate like no others. Here’s a sample of working proofs of concept, on the edge of commercial viability, from the spring 2005 class:

Fit Tracker – The Personal Workout and Diet Assistant
David Donatelli (junior, computer science major) takes his fitness seriously (he can bench-press over 250 pounds), so when it came time to create his first speech-recognition application from scratch, he made a system that lets athletes conveniently enter meal and workout information. The system also provides real-time feedback to encourage users to stay and work a bit harder if they’re slacking.

Serious athletes spend copious time writing down the details of their workouts on paper, where the details cannot be electronically manipulated and offer no immediate behavioral encouragement. With David’s solution, an athlete can pull out a cell phone, call the assistant and simply indicate whether the workout and/or diet was ‘on target,’ ‘below’ or ‘above’ target; with a little fuzzy logic, the system reports fairly accurate results. A more type-A athlete (like the typical Tufts student) can indicate, at a finer grain, the resistance, sets and reps for particular body parts and specific caloric intake.
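
The article does not describe how Fit Tracker turns a coarse answer into a number, so the following is only a rough sketch of the idea. It uses a plain lookup table rather than true fuzzy logic, and the multipliers, function names and feedback wording are all assumptions made for illustration.

```python
# Hypothetical multipliers: how far from the target each coarse answer
# is assumed to be. These values are illustrative, not from the article.
COARSE_FACTORS = {"below": 0.85, "on target": 1.0, "above": 1.15}


def estimate_from_answer(answer: str, target: float) -> float:
    """Turn a spoken 'below' / 'on target' / 'above' into an estimate."""
    key = answer.strip().lower()
    if key not in COARSE_FACTORS:
        raise ValueError(f"unrecognized answer: {answer!r}")
    return target * COARSE_FACTORS[key]


def feedback(answer: str) -> str:
    """Return a short encouragement prompt (wording is hypothetical)."""
    key = answer.strip().lower()
    if key == "below":
        return "You're a little short today. Can you stay for one more set?"
    return "Nice work. Your log has been updated."
```

For example, with a 2,500-calorie target, `estimate_from_answer("above", 2500)` would log roughly 2,875 calories under these assumed multipliers.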

The Runners Conversion Line
Nate Brigham (senior, English major), a competitive, school record-holding distance runner, tapped into the esoteric world of runners’ obsessions with times and distances. When competitive runners complete a one-mile run, they often need to use a complex formula to convert the time of that run to a hypothetical time for a 1500-meter run. Why? These conversion formulas can be the basis for qualifying a runner for another race. Perhaps Puma or Nike could use this system to connect deeply with their serious customers?
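
The article does not say which conversion formula the system used. As one illustration, the sketch below applies Riegel's endurance formula, T2 = T1 × (D2 / D1)^1.06, a commonly cited way to project a time from one distance to another. The choice of formula and the function names are assumptions, and a governing body's official conversion tables may differ.

```python
MILE_METERS = 1609.344      # one statute mile in meters
RIEGEL_EXPONENT = 1.06      # fatigue exponent from Riegel's formula


def convert_time(seconds: float, from_meters: float, to_meters: float) -> float:
    """Project a race time to another distance using Riegel's formula."""
    if seconds <= 0 or from_meters <= 0 or to_meters <= 0:
        raise ValueError("time and distances must be positive")
    return seconds * (to_meters / from_meters) ** RIEGEL_EXPONENT


def mile_to_1500(seconds: float) -> float:
    """Hypothetical 1500-meter time for a given one-mile time."""
    return convert_time(seconds, MILE_METERS, 1500.0)


def format_time(seconds: float) -> str:
    """Render seconds as M:SS.ss, the way a prompt might read it back."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)}:{rest:05.2f}"
```

Under this formula a 4:00 mile projects to a 1500-meter time slightly faster than simple distance scaling would give, because the exponent above 1 accounts for the shorter race being run at a quicker pace.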

VillaVoice - "Make yourself at home, anywhere."
Andy Bonventre (junior, computer science major) remembered his father’s desire to control what is going on at home while he is away on business. Andy has given homeowners the ability to turn on the heat while they are heading home from work, switch lights on or off, and lock appliances like a child’s video game system when it’s past the child’s bedtime. Using X10 devices, an open source API, and Python wrapped together with an elegant speech interface, he produced this:

Andy's Dad: "House status?"
VillaVoice: "It's 11:59 pm. The living room and front porch lights are on. It's 74 degrees downstairs and 68 upstairs, and motion sensors show movement in the dining room and the living room within the hour.
The lights are off in the kitchen but the coffee-maker has been on for the past six hours. Should I turn it off?"
Andy's Dad: "Yes."
VillaVoice: "Done. You can say..."
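
A status prompt like the one in this dialog could be assembled from the home's current state along the following lines. This is a minimal sketch, not VillaVoice's actual code: the function name, the data layout and the four-hour threshold are hypothetical, and no real X10 calls are shown.

```python
from typing import Dict, Optional, Tuple


def build_status_prompt(
    time_text: str,
    lights_on: Dict[str, bool],
    temps_f: Dict[str, int],
    appliance_hours_on: Dict[str, float],
    max_hours: float = 4.0,  # assumed threshold for offering to switch off
) -> Tuple[str, Optional[str]]:
    """Return the spoken status and the appliance awaiting a yes/no, if any."""
    sentences = [f"It's {time_text}."]

    # Report only the rooms whose lights are currently on.
    lit = [room for room, is_on in lights_on.items() if is_on]
    if lit:
        noun = "light is" if len(lit) == 1 else "lights are"
        sentences.append(f"The {' and '.join(lit)} {noun} on.")

    for area, degrees in temps_f.items():
        sentences.append(f"It's {degrees} degrees {area}.")

    # Flag the first appliance that has been running too long.
    flagged = next(
        (name for name, hours in appliance_hours_on.items() if hours >= max_hours),
        None,
    )
    if flagged is not None:
        hours = appliance_hours_on[flagged]
        sentences.append(
            f"The {flagged} has been on for {hours:g} hours. Should I turn it off?"
        )
    return " ".join(sentences), flagged
```

Returning the flagged appliance alongside the prompt lets the dialog layer know what a follow-up "Yes" refers to.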

Now if Andy’s system could only clean my place while I’m overseas, my life would be complete. Wait – I think he’s adding that feature on his summer vacation.

Standards ensure that people have access to transportable technologies, and the Web is a great parallel for understanding this. The Web provided an environment where, regardless of which browser is being used, you can code up content and it remains the same content, delivered in virtually the same way. This is why computer-savvy students were making Web pages at an alarming rate in the late 1990s. Now standards such as SALT and VoiceXML are working along these same lines, enabling more people to cross-pollinate their abilities as developers and designers. Contrast what’s happening in the industry now with what was happening in the mid-1990s.

In the 1990s, speech technologies were good enough and fast enough, and designs capable enough, to make public-facing systems that worked. At that time, designers and developers were very different people working in completely separate fields, with very little overlap in their roles. The designer would articulate a design generic enough that a developer with strong coding skills and knowledge of a particular provider’s platform could reinterpret it and execute it there. Familiarity with one platform did not transfer easily to another, and because designers did not reside within the platform companies, they gained little familiarity with any particular platform.

Companies, especially their marketing organizations, are now realizing that the phone is, aside from advertising, one of the most public-facing experiences they offer customers. The big difference is that the phone system, like the product being sold and like the company’s Web site, is intertwined with the company and represents the company itself; advertising does not represent the company, it is only intended to be about the company.

Even though developers spend so much of their time on the technology, many companies still don’t have great phone systems that users could love. Touchtone is incapable of offering users a really visceral experience, whereas speech interactions give users a direct, deep-felt connection with the company.  This is where human factors come into play.  Human factors are defined as “the scientific discipline concerned with the understanding of interactions among humans and other elements of a system, and the profession that applies theory, principles, data and other methods to design in order to optimize human well-being and overall system performance.4” Once speech technologies came into play, curriculums focusing on human factors were made available to students.

The Experimental College at which Kotelly and Lemay teach offers general education courses aimed at involving students in current important issues and/or interdisciplinary subject areas that traditional departments do not offer. The courses are designed to be discussion-based and participatory in nature, so that students can learn through experience. In Kotelly and Lemay’s class, students who don’t necessarily have a computer background code their first simple, tiny applications in the very first session.

They devote half of the class time to making these systems work. The other half is spent learning that the design of every aspect of a system matters to the caller: solving real business problems, learning about and interpreting a company’s brand, and using a speech system to establish a psychological affiliation with a caller by drawing on social psychological research. The students also spend time working on the sound of the system, recording their own voices and friends’ voices, then re-recording so the prompts sound better and make more sense. Finally, they work with a professional voice talent, usability-test the system and present their findings in a report and in PowerPoint slides with embedded audio, or sometimes with a live demo. It is this kind of thought process that designers and developers need in order to execute good design, which is what enterprises are seeking.

Enterprises large and small want designers who do more than just design. In the past, companies would hire programmers and designers separately. Today, they are looking for designers who think like pragmatic engineers, taking into account the comfort of doing something over an extended period of time. In other words, they want human factors experts who are also really good at the technical work: reverting the call flow, making a prompt that sounds good, and producing a system that can be pitched to someone fairly quickly. Corporations of all sizes are looking for people with more than one talent, who can design, implement and clearly express their ideas while being sensitive to both the business aspects and the practitioner aspects. So the class teaches students a multitude of abilities, which makes them more attractive hires and makes the case that, given the tools and technologies, they can create applications, software packages and platforms with very portable skill sets.

These students are the future of the speech industry and will feed off of the success of speech technologies, just as the technologies will feed off of the ideas of the students.  The speech industry appears to have reached its adolescence and as it continues to grow, the job market for students such as these will continue to expand.  Driven by market innovation and compelling product offerings, enterprises will be seeking individuals with multiple talents/skills to provide their customers with solutions that outperform anything that they have seen before.  Now, while speech is still growing and developing, is the best time to apply the young minds of these students to help speech reach its fullest potential.

Please visit www.artofspeechrecognition.com to learn more about the speech recognition classes at Tufts University.

*Special thanks to Blade Kotelly, chief VUI designer and director of the Edify Design Collaborative Worldwide, for all of his help in coordinating this article.
Stephanie Owens is the associate editor for Speech Technology Magazine. She can be reached at stephanie@amcommpublications.com .
