Despite their impressive outputs, generative AI models, including GPT-4, fail to understand real-world complexities, a new study reveals. Researchers have found significant flaws in the internal world models used by these AI systems, demonstrating that they can falter under changing conditions and emphasizing the need for more robust evaluation metrics.
A new study reveals that generative AI, despite its impressive capabilities, lacks a coherent understanding of the world. Researchers have uncovered that these models can complete tasks such as providing precise turn-by-turn driving directions in New York City but falter dramatically when faced with new variables, such as detours or road closures.
Generative AI models, particularly large language models (LLMs) like GPT-4, are primarily trained to predict the next word in a text sequence. Their phenomenal performance, often mimicking human-like understanding, has led many to believe that these models might possess an implicit comprehension of the real world. However, a new study challenges this notion, highlighting significant flaws in the internal models that AI uses to navigate tasks.
“We needed test beds where we know what the world model is. Now, we can rigorously think about what it means to recover that world model,” lead author Keyon Vafa, a postdoc at Harvard University, said in a news release.
Concerning Findings on AI World Models
The research team focused their examination on a specific type of generative AI known as transformers, the backbone of many prominent LLMs. These models were tested on deterministic finite automata (DFAs), a class of problems defined by a set of states and fixed rules for moving between them, such as navigating city streets or playing a board game like Othello.
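To make the DFA framing concrete, here is a minimal sketch in Python of how a navigation task can be cast as a deterministic finite automaton: each state is an intersection, each input symbol is a turn, and every turn maps deterministically to a next state. The state names and transitions below are invented for illustration; the study itself used real New York City street maps and Othello games.

```python
# Minimal sketch of a deterministic finite automaton (DFA) for a toy
# navigation task. States and transitions are invented for illustration.

class DFA:
    def __init__(self, transitions, start):
        self.transitions = transitions  # {(state, symbol): next_state}
        self.state = start

    def step(self, symbol):
        """Apply one move; raise if the move is invalid from the current state."""
        key = (self.state, symbol)
        if key not in self.transitions:
            raise ValueError(f"invalid move {symbol!r} from {self.state!r}")
        self.state = self.transitions[key]
        return self.state

# Toy street grid: intersections are states, turns are input symbols.
transitions = {
    ("A", "left"): "B",
    ("A", "right"): "C",
    ("B", "right"): "D",
    ("C", "left"): "D",
}

dfa = DFA(transitions, start="A")
for move in ["left", "right"]:
    print(dfa.step(move))  # prints B, then D
```

Because every valid sequence of moves corresponds to exactly one path through such an automaton, the researchers could check a model's outputs against a fully known ground truth.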
Their findings were striking. Transformers could accurately generate directions and valid game moves, yet when elements of the task environment were altered, their performance deteriorated quickly.
“I was surprised by how quickly the performance deteriorated as soon as we added a detour. If we close just 1% of the possible streets, accuracy immediately plummets from nearly 100% to just 67%,” added Vafa.
When the team reviewed the AI-generated maps of New York City, the maps bore little resemblance to the real city: they were overlaid with nonexistent streets and flyovers, which helps explain why the models' navigation collapsed once real-world conditions changed.
New Metrics for Evaluating AI Understanding
To address these inconsistencies, the researchers developed two new metrics for evaluating whether a transformer has formed a coherent world model: sequence distinction and sequence compression.
Sequence distinction assesses whether the AI can differentiate between distinct states, such as two different Othello boards.
Sequence compression evaluates whether the AI recognizes that two sequences ending in the same state share the same set of possible next steps.
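As a rough illustration (not the authors' implementation), both checks can be sketched against a known DFA: distinction asks whether prefixes that end in different true states are treated differently, and compression asks whether prefixes that end in the same true state are treated as interchangeable. The function names and the `dfa_state` and `model_valid_next` helpers below are hypothetical stand-ins.

```python
# Hypothetical sketch of the two world-model checks against a known DFA.
# `dfa_state(prefix)` would return the true state reached by a move sequence;
# `model_valid_next(prefix)` would return the set of moves the generative
# model considers valid after that prefix.

def sequence_distinction(prefix_a, prefix_b, dfa_state, model_valid_next):
    """If two prefixes end in *different* true states, a coherent model
    should assign them different sets of valid next moves."""
    if dfa_state(prefix_a) == dfa_state(prefix_b):
        return None  # check does not apply to this pair
    return model_valid_next(prefix_a) != model_valid_next(prefix_b)

def sequence_compression(prefix_a, prefix_b, dfa_state, model_valid_next):
    """If two prefixes end in the *same* true state, a coherent model
    should assign them the same set of valid next moves."""
    if dfa_state(prefix_a) != dfa_state(prefix_b):
        return None  # check does not apply to this pair
    return model_valid_next(prefix_a) == model_valid_next(prefix_b)
```

Aggregating the results of many such pairwise checks gives a score for how closely the model's implicit map of states matches the true automaton.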
The study found that transformers trained on randomly generated data formed somewhat more accurate world models than those trained on strategy-based data. However, none of the tested AI models demonstrated a coherent world model for tasks involving wayfinding.
“One hope is that, because LLMs can accomplish all these amazing things in language, maybe we could use these same tools in other parts of science, as well. But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries,” senior author Ashesh Rambachan, an assistant professor of economics at MIT and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS), said in the news release.
Implications and Future Directions
These findings underscore a pivotal limitation in current generative AI models, which could have serious implications for their deployment in real-world scenarios. As LLMs become integrated into various sectors, from autonomous vehicles to scientific research, understanding their capabilities and limitations is crucial.
Future research aims to address these challenges by tackling more diverse problems, including those with partially known rules, and applying these evaluation metrics to real-world scientific issues.
“Often, we see these models do impressive things and think they must have understood something about the world. I hope we can convince people that this is a question to think very carefully about, and we don’t have to rely on our own intuitions to answer it,” Rambachan added.