AI Models’ Reliability Falters Despite Advancements, New Study Finds

A comprehensive new study highlights the increasing unreliability of advanced AI language models, exposing significant mismatches between how difficult humans expect tasks to be and how the models actually perform. The researchers emphasize the need for fundamental changes in AI design and development.

The study, spearheaded by researchers from the Valencian Institute for Research in Artificial Intelligence (VRAIN) at the Polytechnic University of Valencia (UPV), the Valencian Graduate School and Research Network in Artificial Intelligence (ValgrAI) and the University of Cambridge, unveils startling findings about the reliability of large language models.

Recent advancements in AI, including models like OpenAI’s GPT, Meta’s LLaMA and BigScience’s BLOOM, have captivated the world with their enhanced problem-solving abilities. However, the study’s results indicate that these models often stumble on simpler tasks despite their proficiency with more complex ones.

“Models can solve certain complex tasks in line with human abilities, but at the same time, they fail on simple tasks in the same domain. For example, they can solve several PhD-level mathematical problems. Still, they can get a simple addition wrong,” José Hernández-Orallo, a researcher at VRAIN UPV and ValgrAI, said in a news release.

The study investigated three critical aspects affecting the reliability of these models.

Task Difficulty Mismatch

The research, published in the journal Nature, revealed a significant discordance between how difficult humans judge a task to be and how well the models actually perform on it: errors crop up even on tasks people consider easy.

“[T]here is no ‘safe zone’ in which models can be trusted to work perfectly,” added Yael Moros Daval, a researcher at VRAIN UPV, emphasizing the inconsistency.

Propensity for Incorrect Answers

Recent models are more inclined to give a wrong answer than to abstain when they are uncertain, a stark contrast to human behavior.

“This puts the onus on users to detect faults during all their interactions with models,” added Lexin Zhou, a researcher at VRAIN UPV.

Sensitivity to Problem Statements

Effective question formulation remains challenging: the models are sensitive to how a problem is phrased, and prompts that succeed on complex tasks may still fail on simpler ones.

“Users can be influenced by prompts that work well in complex tasks but, at the same time, get incorrect answers in simple tasks,” added co-author Cèsar Ferri, a researcher at VRAIN UPV and ValgrAI.

Implications

The implications of these findings are profound, especially for general-purpose AI deployed in high-risk applications. The researchers argue that human supervision cannot fully compensate for these inherent reliability issues, because users tend to place too much confidence in the models’ output.

“Our results suggest a fundamental change is needed in the design and development of general-purpose AI,” concluded Wout Schellaert, a researcher at VRAIN UPV.

This call to action resonates as the use of AI continues to expand into critical areas like health care, finance and autonomous systems.