MIT Researchers Introduce Technique That Boosts Reasoning Skills of AI Language Models

Researchers from MIT have introduced Natural Language Embedded Programs (NLEPs), a technique that enhances the reasoning abilities of large language models. The technique combines programming with natural language, achieves over 90% accuracy on a range of reasoning tasks, and makes AI systems more transparent and trustworthy.

Large language models, like the ones powering ChatGPT, have demonstrated impressive capabilities such as drafting legal briefs, analyzing customer sentiment and translating documents. Yet, these models often stumble when tackling tasks that require intricate numerical or symbolic reasoning.

Enter Natural Language Embedded Programs (NLEPs), a revolutionary technique proposed by researchers from MIT and other institutions. This method prompts language models to generate and execute Python programs to solve complex queries, and then to articulate the solutions in plain language. The approach drastically improves the models’ accuracy on a diverse range of reasoning tasks, making the process more transparent and versatile.

“We want AI to perform complex reasoning in a way that is transparent and trustworthy. There is still a long way to go, but we have shown that combining the capabilities of programming and natural language in large language models is a very good potential first step toward a future where people can fully understand and trust what is going on inside their AI model,” Hongyin Luo, a postdoctoral associate at MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the study, said in a statement.

NLEPs operate by having the model follow a structured, four-step process. The model first imports the necessary packages and functions, then embeds relevant natural language knowledge as structured data (such as a list of U.S. presidents’ birthdays), implements a function that performs the required calculation and, finally, presents the result in natural language. This structured approach improves reliability and makes troubleshooting straightforward, as users can inspect and fix the Python code directly if errors occur.
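To make the four steps concrete, here is a minimal, hypothetical sketch of the kind of program an NLEP prompt might elicit for a query like "Which of these U.S. presidents were born in February?" The query, the subset of presidents, and the function name are illustrative choices, not taken from the paper:

```python
# Hypothetical NLEP-style program for the query:
# "Which of these U.S. presidents were born in February?"

# Step 1: import necessary packages
from datetime import date

# Step 2: embed relevant natural language knowledge as structured data
# (an illustrative subset; the dates are well-known public facts)
president_birthdays = {
    "George Washington": date(1732, 2, 22),
    "Abraham Lincoln": date(1809, 2, 12),
    "Thomas Jefferson": date(1743, 4, 13),
    "Ronald Reagan": date(1911, 2, 6),
}

# Step 3: implement the calculation as a function
def born_in_month(birthdays, month):
    return sorted(name for name, d in birthdays.items() if d.month == month)

# Step 4: present the result in natural language
result = born_in_month(president_birthdays, 2)
print(f"Presidents born in February: {', '.join(result)}")
```

Because the answer is produced by executing code rather than by free-form text generation, a wrong answer points to a specific, inspectable line of the program.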

“It is like a digital calculator that always gives you the correct computation result as long as the program is correct,” Luo added.

The team tested the NLEPs on various tasks, including symbolic reasoning, instruction-following and text classification, achieving over 90% accuracy. Remarkably, NLEPs performed 30% better than other task-specific prompting methods, even when using less sophisticated open-source models.

Alongside improving accuracy, NLEPs offer benefits in terms of data privacy and efficiency. Since programs are run locally, user data does not need to be transmitted to external entities like OpenAI or Google for processing. Additionally, by generating one core program for similar questions, users can replace variables without re-running the model, saving significant computational resources.
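The reuse claim can be sketched as follows: the model is called once to generate a parameterized "core" program, and follow-up questions of the same type only change the variables passed in, with no further model calls. The function below is an illustrative stand-in, not code from the study:

```python
from datetime import date

# Hypothetical "core" program, generated once by the model for
# questions of the form "How many days between date A and date B?"
def days_between(date1, date2):
    y1, m1, d1 = date1
    y2, m2, d2 = date2
    return abs((date(y2, m2, d2) - date(y1, m1, d1)).days)

# Similar follow-up questions reuse the same code with new variables,
# avoiding a fresh (and costly) model invocation each time:
print(days_between((2024, 1, 1), (2024, 3, 1)))  # 60 (leap year)
print(days_between((2023, 1, 1), (2023, 3, 1)))  # 59
```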

“Having language models reason with code unlocks many opportunities for tool use, output validation, more structured understanding into model’s capabilities and way of thinking, and more,” said Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.

However, one limitation is that smaller models may not perform as well with NLEPs because of their more limited training data. The researchers therefore plan to explore how to enhance smaller language models’ code-generation capabilities and to make NLEPs more robust to variations in prompts.

“There is no magic here. We do not have a more expensive or fancy language model. All we do is use program generation instead of natural language generation, and we can make it perform significantly better,” Luo said.

This research, which promises a transparent and efficient future for AI, will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. It signals a significant step towards building AI systems that are both powerful and easily understandable by humans.