The Need for Explainable AI in Speech Technology


Modern automatic speech recognition (ASR) systems rely on deep learning, or artificial neural networks. The use of these techniques has significantly reduced word error rates and improved performance, paving the way for the advanced speech recognition applications that we see around us today.

But deep learning applications are viewed as black boxes. The underlying neural network architectures may work well, but because of their complexity, even those who build these systems do not always know why they work the way they do. This black box nature has a few concerning implications, particularly when a small change in inputs leads to a large change in outputs, which sounds technical and dry until you consider that “outputs” can mean “outcomes” for real people. Adversarial attacks can also be mounted on such AI systems by altering the inputs. Beyond posing a security risk by threatening the applications’ technical robustness, such attacks undermine their trustworthiness, reliability, and regulatory compliance.

Trust in any technology is a prerequisite for its adoption. That’s why as AI capabilities accelerate, there’s been a growing emphasis on ethical artificial intelligence, commonly referred to as responsible AI. The core principles of responsible AI are fairness, accountability, and transparency. Fairness and accountability are intrinsically linked to transparency because if AI is opaque, we cannot address or adjudicate any claims about the lack of fairness or where to affix the accountability when things go wrong.

But there’s a trade-off of sorts between the higher performance of complex neural networks (when compared to simpler machine learning methods) and their lack of transparency. In many application domains, a lack of transparency has led to AI pilot projects not moving to production, slowing down or even halting the adoption of AI because companies preferred tools that use simpler statistical techniques.

That leads us to explainable AI (XAI).

Explainability refers to the ability to understand or make sense of how a deep learning system arrives at a particular prediction or generates an output given a particular input. One nuance here is to ask, “Explainable to whom?” Is the explanation geared toward the end user, the developer of the system, or some other stakeholder? Often, “interpretability” refers to understanding the internal workings of the system, and is hence intended for developers and technical stakeholders, while “explainability” refers to making sense of the end results of the AI system. Explainable AI also broadly refers to the growing field that focuses on making AI more transparent. To be clear, explainability is not a silver bullet but an important tool to advance responsible AI.

Explainability improves user trust in AI, and increasingly companies are making it an important criterion in their purchase decisions. Interpretability and explainability enable developers to debug and refine the AI they’re building. Explainability helps identify and correct for the biases (errors) in AI. It also shines a light on the limitations and blind spots of AI and helps formulate usage guidelines: where autonomous operation is acceptable, for example, and where human oversight (or human-in-the-loop protocols) might be needed. On top of that, laws and regulations in some jurisdictions, such as the European Union, require that end users have a “right to explanation” when AI systems are used, particularly for high-stakes use cases, such as speech applications in healthcare or voice biometrics used for authentication and security.

The field of XAI is actively accumulating a set of tools, techniques, and best practices, particularly for image recognition and other non-speech AI applications. While deep learning is widely adopted for speech technology applications such as ASR, awareness and usage of explainable techniques are still in a nascent stage.

The already high level of interest in AI has skyrocketed with the release of ChatGPT and other such generative AI tools. As generative AI offers greater support for multi-modal inputs (that is, not just text-based prompts but also voice-based inputs), the need for speech XAI is going to be felt even more. Explainable AI, then, represents a big industrywide opportunity. 

Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.
