January 1, 2003
Q & A

Peter Lawless, Vice President of Sales, and David Horowitz, Chief Scientist, Vox Generation

Judith: First of all, congratulations on the funding. It's quite impressive for a small company to be leading such a large project.
David: Yes, it's very unusual for a small or medium-size company in this market to have a hard-core speech science team. I like to tell people that we are to conversational speech what SpeechWorks was to the core engine in 1996.
Judith: Tell me about FASiL. What is it?
Peter: FASiL is an acronym for Flexible and Adaptive Spoken Language and Multimodal Interface. Its goal is to produce a truly conversational language engine, building on existing Vox Generation and SpeechWorks technology. Unlike many of today's commercial speech systems, where the user has to adapt to the computer's way of doing things, FASiL technology will let the user drive the interaction with the computer. Users can also interact with the system through text and images, allowing them to select the most effective way to access information and services.
Judith: What is your role in FASiL?
David: Vox Generation is the lead organization and I'm FASiL's principal investigator.
Judith: Why did Vox Generation get involved with a project like this?
David: What motivated us was the strong, positive response to AT&T's Say Anything, which was developed for a call-center application that had 15 topics or states. The idea was to develop a next-generation speech recognition technology that would extend conversational interfaces to much more complex domains, such as email reading and personal information management, that might have 1000 or more states. We knew that basic N-gram techniques for language modeling are too data-intensive to handle those problems. You'd have to collect corpora for 10 years. So we developed a new approach to language modeling, called unified and hierarchical language modeling, which goes one step further than what AT&T did. Its basic goal is to make conversational interfaces amenable to rapid application development.
Judith: Who are the participants and what are their responsibilities?
David: SpeechWorks is collaborating with us on language-modeling techniques. We are working with Roberto Pieraccini's group there. The University of Sheffield is doing active information management, that is, integrating information across the user's applications. For example, my calendar says that I am going to talk with you today. Using FASiL, any email I receive from you will automatically be given a higher priority today. We are working with Yorick Wilks, who is an authority in natural language.
Media Lab Europe leads the FASiL work package for the user interface group. Because we have a multimodal architecture, they are responsible for the fusion of modalities on the input side and the fission of modalities on the output side. Media Lab Europe are proponents of technology for developing countries and of computer literacy. They bring that perspective to the project.
Cap Gemini-Sweden leads the software architecture. We are developing and testing applications for the 3G mobile networks. We've done a 160-page architecture study of the scalability of the FASiL architecture, and we are in the second phase of software architecture design. Portugal Telecom is in charge of the corpora collection for our three languages: European Portuguese, Swedish and UK English. They also play a strong role in the software architecture. When we pilot the three languages we will use them as the Telco.
Vox Generation is developing software and is the principal of the language modeling work package and the new techniques work package. New techniques include things like machine learning for dialog management. We also have two associations involved. The Royal National Institute of the Blind is leading the work package on user needs assessment and user needs analysis.
We are also working to dovetail with The Royal National Institute of the Deaf in a project called SymFace, which is a talking avatar for lipreading/speech reading.
Peter: I just want to add that the innovation of FASiL is to pilot a full multimodal voice portal application that is 3G mobile-network ready, along with tools for rapid development of new applications.
Judith: Why did you include the Institutes for the blind and deaf in this project? Most research projects don't have those types of groups in the beginning -- even though I think that would be a good idea.
David: It's something that is important to me. I ran a laboratory for disabled people at Tufts School of Medicine for five years. My master's degree is in hearing, acoustics and electrical engineering from MIT. FASiL is intended to be a commercial conversational system, so I wanted to make sure it would be accessible to deaf and blind people. I believe that you enhance your design principles by meeting the needs of expert users who rely principally on one modality for access. Consider a non-impaired user who can't use a VUI in a lot of noise. The system might call a GUI-only interface that people who are deaf could benefit from as well. Another person might deal with a VUI that could be built so it's suitable for someone who is blind.
Judith: I see you are working in three languages. Are you also looking at multi-lingual communications?
David: Yes. Our collaborator at Cap Gemini gets multilingual emails. One question that has already surfaced in the project is how to detect the language of an email and switch in the proper TTS for the listener. It's a hard problem, but it's not a core research component. Where possible, we exploit known techniques. This is one of those areas because Yorick Wilks says he has some technology for language identification.
Peter: The project will also use existing multimodal technologies where needed and do research to develop technologies that are not yet mature, such as advanced language understanding and dialogue models.
Judith: You mentioned that new techniques, such as machine learning, will be a part of FASiL. Please explain more about that.
David: Cap Gemini finds this project an exciting, futuristic application of artificial intelligence and speech recognition. We feel we can get some leverage from incorporating simple machine learning applied to adaptation of dialog. For example, a user may prefer to interact in a particular way when sending an email. If there are three different ways of sending email, we can adapt the interface to prioritize them and, hopefully, reduce task completion time. For our initial dialogue management version we are using DARPA Communicator architectures that have been shown to work. We hope FASiL will be the next step beyond DARPA Communicator. We also have summarization and keyword spotting, and we will do some semantic analysis. We want to incorporate speaker verification where possible, but it's not a key research component.
Judith: There appears to be quite a bit of interest in multimodality in Europe.
David: Yes, there is. Multimodal is one of the key priorities of FASiL, as is multi-lingual. FASiL is particularly interesting to the European Union because it's one of the last grants under the Fifth Framework, and it's transitioning into the Sixth Framework, whose projects involve multiple objectives and very large consortia and get much larger grants. FASiL is a nice case study for the European Union because they are going to have another round of competition for investment in this area. The European Union will monitor FASiL for six months to a year before launching their new multimodal grants. We're not sure whether we will participate in that competition.

Dr. Judith Markowitz is the associate editor of Speech Technology Magazine and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or jmarkowitz@pobox.com.
