Researchers from MIT and NVIDIA have created HART, a novel AI tool that combines autoregressive and diffusion models to generate high-quality images up to nine times faster than current methods. This breakthrough has significant implications for various industries, including self-driving cars and video game design.
Researchers from MIT and NVIDIA have unveiled a revolutionary artificial intelligence tool that swiftly generates high-quality images, merging the best attributes of two popular generative AI models. Dubbed HART (Hybrid Autoregressive Transformer), this innovative approach promises to accelerate image generation without sacrificing quality, marking a significant advancement in AI technology.
The demand for rapid, realistic image generation is growing, particularly in areas like autonomous vehicle training, where simulations need to mirror real-world complexities to enhance safety. Traditional diffusion models are renowned for producing detailed and lifelike images but are often too slow and resource-intensive. Autoregressive models, while faster, tend to fall short in image quality, producing pictures marred by inaccuracies.
The team that developed HART to bridge this gap was led by Haotian Tang, a doctoral candidate at MIT at the time and now a research scientist in the generative AI organization at Google DeepMind, and Yecheng Wu, an undergraduate student at Tsinghua University, along with senior author Song Han, an associate professor in MIT's Department of Electrical Engineering and Computer Science, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA.
By leveraging the speed of autoregressive models and the refinement capabilities of diffusion models, HART can generate images of comparable or superior quality up to nine times faster than state-of-the-art diffusion models.
The process involves an autoregressive model rapidly capturing the overall structure of the image, followed by a diffusion model that refines the fine details. This two-step process significantly reduces computational requirements, allowing HART to run efficiently on everyday devices like laptops and smartphones.
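The division of labor described above can be sketched in a toy example. This is not HART's actual architecture: quantization to a small codebook stands in for the autoregressive transformer's discrete-token stage, and a simple iterative loop that converges to the leftover residual stands in for the learned diffusion refiner. All specifics here (a 64-value latent "image", a 16-entry codebook, 8 refinement steps) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical continuous image latents the generator should produce.
target = rng.normal(size=64)

# --- Stage 1: coarse structure via discrete tokens ---
# Quantize each latent to the nearest entry of a small codebook.
# This mimics the coarse, discrete-token output of the fast
# autoregressive stage: it captures the big picture but throws
# away fine detail (the quantization error).
codebook = np.linspace(-2.0, 2.0, 16)
tokens = np.argmin(np.abs(target[:, None] - codebook[None, :]), axis=1)
coarse = codebook[tokens]

# --- Stage 2: fine detail via diffusion-style refinement ---
# The residual (what quantization discarded) is continuous and
# low-magnitude, so a small model refining it over a few denoising
# steps is much cheaper than diffusing the whole image. Here the
# learned denoiser is faked by a loop that starts from pure noise
# and is pulled toward the true residual a little more each step.
residual = target - coarse
estimate = rng.normal(0.0, 1.0, size=residual.shape)
for step in range(8, 0, -1):
    estimate = estimate + (residual - estimate) / step

refined = coarse + estimate  # coarse structure + recovered detail
```

In this toy setup the final step recovers the residual exactly, so `refined` matches the target while `coarse` alone does not; in HART the same decomposition is what lets a lightweight diffusion model finish the job in a handful of steps.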
“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART,” Tang said in a news release.
HART’s potential applications are vast, from training robots for complex real-world tasks to aiding designers in creating immersive video game environments. This hybrid model’s efficiency and adaptability also position it well for integration with emerging vision-language generative models, paving the way for more sophisticated AI interactions.
Large language models (LLMs) serve as versatile interfaces for various AI models, including multimodal systems that can understand and generate both text and visual content.
“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities,” Han added.
Looking ahead, the researchers aim to expand HART’s capabilities into video generation and audio prediction, leveraging its scalable and generalizable framework. Their ultimate goal is to build advanced vision-language models atop the HART architecture, pushing the boundaries of what AI can achieve.
The research was funded by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program and the National Science Foundation, with NVIDIA providing the GPU infrastructure.