Standards for Evaluating Generative AI

On Oct. 30, 2023, the White House issued the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. This groundbreaking document lays out many important concepts related to societal concerns about AI. One of the points it makes: “Artificial Intelligence must be safe and secure. Meeting this goal requires robust, reliable, repeatable, and standardized evaluations of AI systems, as well as policies, institutions, and, as appropriate, other mechanisms to test, understand, and mitigate risks from these systems.”

What are the prospects for standardizing ways to evaluate AI systems, in particular natural language processing (NLP) systems, as the executive order mandates?

Most people are familiar with popular generative AI (genAI) systems based on large language models, like ChatGPT and Bard. We’ve seen them come up with surprisingly intelligent and helpful responses, and also terribly wrong answers, commonly known as hallucinations. However, these observations are only qualitative impressions; they are not standardized, and in many cases they aren’t repeatable.

Researchers have been evaluating NLP systems for many years, so it’s natural to ask if we can apply traditional NLP evaluation techniques to genAI systems. The answer is a qualified “yes,” but there’s clearly still work to do.

All NLP evaluations start with a standard set of test inputs (a corpus or dataset) that can be used across systems, because if we want to compare two systems, or even the same system at two different times, we need to give them the same inputs. Existing standard test corpora, however, aren’t always satisfactory for genAI systems, which can handle them so easily that the results tell us little.
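To make the "same inputs" requirement concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two-prompt corpus, the `run_corpus` helper, and the stand-in functions `system_a` and `system_b`, which simply echo text where a real test would call the systems being evaluated.

```python
from typing import Callable, Dict, List

# Hypothetical fixed test corpus: every system receives exactly these prompts.
CORPUS: List[str] = [
    "Which planets don't have more than 10 moons?",
    "What is the tallest mountain in Africa?",
]

def run_corpus(system: Callable[[str], str], corpus: List[str]) -> Dict[str, str]:
    """Send each prompt in the corpus to one system and record its response."""
    return {prompt: system(prompt) for prompt in corpus}

# Stand-in systems (assumptions): real ones would call a model or an assistant.
def system_a(prompt: str) -> str:
    return "A says: " + prompt.lower()

def system_b(prompt: str) -> str:
    return "B says: " + prompt.upper()

# Because both runs use the same corpus, the responses can be compared
# prompt by prompt, across systems or across versions of one system.
results = {
    "system_a": run_corpus(system_a, CORPUS),
    "system_b": run_corpus(system_b, CORPUS),
}

for prompt in CORPUS:
    print(prompt)
    for name, responses in results.items():
        print("  ", name, "->", responses[prompt])
```

Storing the corpus separately from the systems is the point of the sketch: the prompts stay fixed while the systems under test change.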

The next step, once we have a test corpus, is to give the inputs to the systems we’re testing and see if they do the right thing. But “doing the right thing” can be hard to determine. The Verge used one outside-the-box evaluation metric to compare two genAI systems: It asked each system to come up with a chocolate cake recipe, then baked the cakes and compared the results. It was a creative, if not very practical, approach. (In case you’re curious, neither cake was very good.)

Unlike with most traditional NLP tasks, there’s no single correct answer with genAI system responses, just like there’s no single correct cake recipe. We can enlist human judges, which is time-consuming and expensive, or we can try to devise automatic ways to compare the systems’ outputs with verified outputs prepared by humans. This second strategy is commonly used to evaluate translation systems, but it often fails to recognize correct translations and can also evaluate incorrect translations as correct.
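A small sketch shows how an automatic comparison against a human-prepared reference can go wrong in both directions. The `token_f1` function below is a deliberately simple word-overlap score chosen for illustration, not the metric any particular translation evaluation uses, and the three sentences are invented examples.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between a system output and a human reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Count words shared by both, respecting how often each word occurs.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented examples (assumptions, not data from any evaluation).
reference = "the meeting starts at noon"
correct_paraphrase = "we begin at twelve o'clock"      # right meaning, different words
wrong_near_copy = "the meeting starts at midnight"     # wrong meaning, similar words

print(round(token_f1(correct_paraphrase, reference), 2))
print(round(token_f1(wrong_near_copy, reference), 2))
```

The correct paraphrase shares only one word with the reference, while the incorrect sentence shares four of five, so the overlap score ranks the wrong answer higher. Metrics used in practice are more sophisticated, but the same kind of mismatch with human judgment can still occur.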

So we have a choice between cheap but not very intuitive automatic evaluation metrics and expensive but intuitive scoring like that provided by human judges. Research can refine automatic approaches so that they better correspond to our ideas about correct and incorrect answers.

In the meantime, we can proceed with the first step: creating standard evaluation corpora.

A couple of years ago, my colleague Christy Doran (principal at Clockwork Language) and I started working on an evaluation corpus for intelligent assistants such as Alexa, Siri, and Google Assistant. We found many gaps in their abilities, which we reported in Speech Technology. More recently, Christy, Shyamala Prayaga (a senior software product manager at NVIDIA and an adjunct professor at Carnegie Mellon University), and I extended this work by organizing an online “Digital Assistant Throwdown,” where we invited participants to suggest challenging questions for conversational assistants and recorded the responses.

We have consistently found that it is easy to ask questions that get surprisingly nonsensical answers. An example:

Prompt: Which planets don’t have more than 10 moons?

ChatGPT: Mercury and Venus do not have any natural moons. Earth has one natural moon (the Moon), Mars has two moons (Phobos and Deimos), Jupiter has over 80 known moons, Saturn has over 80 known moons, Uranus has 27 known moons, and Neptune has 14 known moons.

[So far so good: The correct answer is Mercury, Venus, Earth, and Mars. But ChatGPT doesn’t stop there.]

So, Mercury and Venus are the planets in our solar system that don’t have more than 10 moons.

In other words, its conclusion leaves out Earth and Mars, as if they each had more than 10 moons!
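For a question like this one, where the facts can be tabulated, the inconsistency can be caught mechanically. The sketch below is one hypothetical way to do that: the moon counts are copied from the ChatGPT response above (with 80 standing in for "over 80"), and `check_conclusion` is an illustrative helper, not part of any existing tool.

```python
# Moon counts as stated in the response itself; 80 is a placeholder
# lower bound for "over 80" (an assumption made for this sketch).
moons_in_response = {
    "Mercury": 0, "Venus": 0, "Earth": 1, "Mars": 2,
    "Jupiter": 80, "Saturn": 80, "Uranus": 27, "Neptune": 14,
}

# The planets named in the response's final sentence.
stated_conclusion = {"Mercury", "Venus"}

def check_conclusion(counts, conclusion, limit=10):
    """Compare a stated answer with the answer implied by the listed counts."""
    expected = {planet for planet, n in counts.items() if n <= limit}
    return {
        "missing": sorted(expected - conclusion),  # should be listed but aren't
        "extra": sorted(conclusion - expected),    # listed but shouldn't be
    }

report = check_conclusion(moons_in_response, stated_conclusion)
print(report)  # {'missing': ['Earth', 'Mars'], 'extra': []}
```

A check like this works only when the correct answer can be derived from structured facts, which is not the case for most genAI responses.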

We need to keep collecting this kind of challenging data—it doesn’t solve the problem of evaluating genAI systems, but it’s a good first step. We’ll release the hundreds of examples we collected through the Digital Assistant Throwdown in the near future. We hope that other researchers will be inspired to add to this data and to help make progress toward evaluation procedures for generative AI systems. 

Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversationaltechnologies.com.
