Evaluating AI Language Models Just Got More Effective and Efficient
As new AI language models arrive with increasing frequency, each iteration is often accompanied by claims of enhanced capabilities. Yet proving that a model actually outperforms its predecessor is a difficult and costly undertaking. Stanford researchers have proposed a novel method to make these assessments more effective and efficient.
To demonstrate progress, developers typically subject new AI models to extensive benchmarking. This process relies on vast banks of benchmark questions, potentially numbering in the hundreds of thousands, and each answer must be human-reviewed, adding significantly to both time and cost. Because practical constraints prevent running every model on every question, developers select a subset, which risks overestimating improvements if the chosen questions happen to be less challenging. A paper published at the International Conference on Machine Learning by Stanford researchers introduces a cost-effective approach to these evaluations.
“The key observation we make is that you must also account for how hard the questions are,” explained Sanmi Koyejo, an assistant professor of computer science at Stanford’s School of Engineering who led the research. “Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons.”
Sang Truong, a co-author and doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL), emphasized the evaluation process’s resource demands, noting, “This evaluation process can often cost as much or more than the training itself. We’ve built an infrastructure that allows us to adaptively select subsets of questions based on difficulty. It levels the playing field.”
To realize their goal, Koyejo, Truong, and their colleagues turned to a longstanding concept from educational testing known as Item Response Theory (IRT), which accounts for question difficulty when scoring a test. The approach is comparable to the adaptive strategies used in standardized tests such as the SAT, where each response informs which questions come next.
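In the two-parameter logistic form of IRT, for instance, the probability that a test taker of a given ability answers an item of a given difficulty correctly follows a logistic curve, and the most informative next item is the one that is neither too easy nor too hard at the current ability estimate. The sketch below illustrates that general idea in Python; the function names and parameters are illustrative and are not drawn from the Stanford paper.

```python
import numpy as np

def p_correct(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) IRT model: probability that a test
    taker with ability `theta` answers this item correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

def pick_next_item(theta_estimate, difficulties, discriminations, asked):
    """Adaptive step: choose the unasked item with the highest Fisher
    information at the current ability estimate (roughly, the item whose
    chance of being answered correctly is closest to 50/50)."""
    best_item, best_info = None, -np.inf
    for i, (b, a) in enumerate(zip(difficulties, discriminations)):
        if i in asked:
            continue
        p = p_correct(theta_estimate, b, a)
        info = a ** 2 * p * (1.0 - p)  # Fisher information of a 2PL item
        if info > best_info:
            best_item, best_info = i, info
    return best_item
```

Weighting and selecting questions by difficulty is what counteracts the “luck of the draw” effect Koyejo describes: a model is neither flattered by a run of easy questions nor penalized by an unlucky batch of hard ones.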
The team uses language models to analyze and score questions by difficulty, potentially cutting evaluation costs in half or more. These difficulty scores are essential for accurately comparing the relative performance of different models.
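Concretely, once each question carries a difficulty score, a model’s performance can be summarized by the latent ability that best explains its pattern of right and wrong answers, rather than by raw accuracy on whatever subset it happened to receive. The sketch below is a minimal illustration of that idea, assuming the 2PL model above rather than the paper’s exact procedure; it fits the ability with a simple grid-search maximum-likelihood estimate.

```python
import numpy as np

def estimate_ability(responses, difficulties, discriminations,
                     grid=np.linspace(-4.0, 4.0, 401)):
    """Grid-search maximum-likelihood estimate of ability.

    responses:       1 for a correct answer, 0 for an incorrect one
    difficulties:    IRT difficulty of each answered question
    discriminations: IRT discrimination of each answered question
    """
    r = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    a = np.asarray(discriminations, dtype=float)

    best_theta, best_ll = grid[0], -np.inf
    for theta in grid:
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # avoid log(0)
        ll = np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Two models that answered different subsets of questions still land on
# the same ability scale, because difficulty is modeled explicitly.
model_a = estimate_ability([1, 1, 0, 1], difficulties=[-1.0, 0.2, 1.5, 0.8],
                           discriminations=[1.0, 1.2, 0.9, 1.1])
model_b = estimate_ability([1, 0, 0], difficulties=[-0.5, 0.3, 1.0],
                           discriminations=[1.0, 1.0, 1.0])
print(model_a, model_b)
```

Because both models are placed on the same ability scale, they can be compared fairly even though they answered different questions of different difficulty.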
To build a comprehensive, diverse, and cost-effective question bank, the researchers harness AI’s generative power to create fine-tuned question generators. This both automates replenishing the question bank and helps filter “contaminated” questions out of the database.
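The article does not spell out how contaminated questions are detected, but a common heuristic when assembling benchmarks is to flag candidates whose text overlaps heavily with existing test sets or presumed training data. The sketch below shows that kind of n-gram overlap filter purely as an illustrative assumption; `is_contaminated` and its threshold are hypothetical, not the researchers’ method.

```python
def ngrams(text, n=4):
    """Lowercased word n-grams of a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, reference_texts, n=4, threshold=0.5):
    """Flag a candidate question if a large fraction of its n-grams
    already appear in the reference corpus (e.g. an existing benchmark)."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    ref_grams = set()
    for text in reference_texts:
        ref_grams |= ngrams(text, n)
    overlap = len(q_grams & ref_grams) / len(q_grams)
    return overlap >= threshold

candidates = [
    "What is the capital of France and when was the Eiffel Tower completed?",
    "A freshly generated question that does not appear in any existing benchmark.",
]
existing_benchmark = [
    "What is the capital of France and when was the Eiffel Tower completed?",
]
clean = [q for q in candidates if not is_contaminated(q, existing_benchmark)]
print(clean)  # only the freshly generated question survives the filter
```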
The authors assert that, with more thoughtfully designed questions, others in the AI field can conduct more precise performance evaluations using a smaller subset of queries. The approach is designed to be faster, fairer, and less costly.
The new evaluation method applies across knowledge domains, including medicine, mathematics, and law. Koyejo’s team tested the system on 22 datasets and 172 language models, and the approach adapts to both new models and new questions. It also charted subtle shifts in the performance of GPT-3.5, particularly on safety metrics, which reflect a model’s robustness to data manipulation and adversarial attacks.
Previously, reliably evaluating language models was a costly endeavor fraught with inconsistencies. The Item Response Theory approach introduces rigorous, scalable, and adaptive evaluation protocols, offering significantly improved diagnostics and performance evaluations for developers and fairer, more transparent assessments for users.
“And, for everyone else,” Koyejo noted, “it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence.”