The notoriously high computing and power requirements of AI, especially for tasks like media generation, are among the technology's biggest obstacles. On mobile phones, only a handful of expensive devices with powerful silicon can run such features natively. Even when the work is done at scale in the cloud, it remains expensive.
Nvidia has quietly addressed this challenge in partnership with researchers at the Massachusetts Institute of Technology (MIT) and Tsinghua University. The team created HART, a hybrid autoregressive transformer that combines two of AI's most popular image generation techniques. The result is a tool that is lightning fast with dramatically lower computing requirements.
To give you an idea of how fast it works, I asked it to create an image of a parrot playing bass guitar. In just a few seconds, it returned the image below; I could hardly follow the progress bar. When I ran the same prompt in Gemini, backed by Google's Imagen 3 model, it took approximately 9-10 seconds over a 200 Mbps connection.
When AI images first made waves, the diffusion method was at the heart of it, powering products like OpenAI's DALL-E image generator, Google Imagen, and Stable Diffusion. This method produces images with a high level of detail, but it builds each image over many successive steps, which makes it slow and computationally costly.
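To make the cost concrete, here is a minimal, illustrative sketch of a diffusion loop. The names (denoise_step, NUM_STEPS) and the toy math are my own placeholders for a real neural network, not code from any of the products above:

```python
import numpy as np

# Illustrative only: in a real diffusion model, denoise_step would be a
# full neural-network pass, and running dozens of them back to back is
# exactly what makes diffusion slow and computationally expensive.
NUM_STEPS = 30  # typical samplers run dozens of steps

def denoise_step(image, step):
    # Stand-in for a network that predicts and removes a bit of noise.
    return image * 0.9 + np.random.randn(*image.shape) * 0.01

image = np.random.randn(64, 64, 3)  # start from pure noise
for step in range(NUM_STEPS):
    image = denoise_step(image, step)  # one expensive model call per step
```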
Autoregressive models are a second approach that is gaining popularity. They work much like chatbots, generating an image piece by piece with next-token prediction techniques. It is a faster method, but it is also more error-prone.
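Again as a hedged, toy sketch (predict_next_token stands in for a real transformer, and all sizes are assumptions), the autoregressive approach looks roughly like this:

```python
import numpy as np

# Toy sketch: the image is flattened into a sequence of discrete tokens,
# each predicted from the ones before it, like a chatbot predicting the
# next word. All names and sizes here are illustrative.
VOCAB_SIZE = 1024   # size of the token codebook
SEQ_LEN = 16 * 16   # a 16x16 grid of image tokens

def predict_next_token(prior_tokens):
    # Stand-in for a transformer conditioned on all earlier tokens.
    return int(np.random.randint(VOCAB_SIZE))

tokens = []
for _ in range(SEQ_LEN):
    tokens.append(predict_next_token(tokens))  # one token at a time
# A decoder would then map the token grid back to pixels. Early mistakes
# can't be revisited, which is where the errors creep in.
```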
Demo for HART on-device: Efficient visual generation with hybrid autoregressive transformer
A team at MIT merged both methods into a package called HART. It uses an autoregressive model to predict a compressed image as discrete tokens, while a diffusion model compensates for the resulting quality loss. The overall approach cuts the process from over two dozen steps down to eight.
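Based on that description, the hybrid pipeline can be sketched roughly as below. This is my own illustrative pseudocode, not HART's actual implementation; every function name and shape is a placeholder:

```python
import numpy as np

def autoregressive_stage():
    # Fast, one-pass stage: predict the compressed image as a grid of
    # discrete tokens (placeholder for HART's autoregressive model).
    return np.random.randint(1024, size=(16, 16))

def decode_tokens(token_grid):
    # Map discrete tokens back to a rough image (stand-in decoder).
    return np.random.randn(256, 256, 3)

def diffusion_refine(image, steps=8):
    # Lightweight diffusion stage: per the team, the hybrid cuts the
    # process from over two dozen steps to eight, recovering fine
    # detail that the discrete tokens lost.
    for _ in range(steps):
        image = image * 0.95 + np.random.randn(*image.shape) * 0.01
    return image

coarse = decode_tokens(autoregressive_stage())  # coarse structure, fast
final = diffusion_refine(coarse, steps=8)       # cheap residual cleanup
```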
According to the experts behind HART, it can “generate pictures that match or surpass the quality of state-of-the-art diffusion models but do so nine times faster.” HART combines an autoregressive model of roughly 700 million parameters with a diffusion model that handles 37 million parameters.
Solving the computing-cost crisis
Interestingly, this hybrid tool was able to create images matching the quality of top-shelf models with 2 billion parameters. Most importantly, HART achieved that milestone while generating images nine times faster and requiring 31% less computational resources.
According to the team, the low-compute approach allows HART to run locally on phones and laptops, which is a huge win. So far, the most popular mass-market products such as ChatGPT and Gemini have required an internet connection for image generation because the computing happens on cloud servers.
In the test video, the team showcased HART running natively on an MSI laptop with an Intel Core series processor and an Nvidia GeForce RTX graphics card. That's a combination you can find in the majority of gaming laptops out there without spending a fortune.
HART is capable of producing 1:1 aspect ratio images at a respectable 1024 x 1024 pixel resolution. The level of detail in these images is impressive, and so is the stylistic variation and scenery accuracy. During their tests, the team noted that the hybrid AI tool was anywhere from three to six times faster and offered over seven times higher throughput.
The future potential is exciting, especially the prospect of integrating HART's image capabilities with language models. “In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture,” says the team at MIT.
They are already exploring that idea, and even plan to test the HART approach on audio and video generation. You can try it out on MIT's web dashboard.
Some rough edges
Before we dive into the quality debate, keep in mind that HART is very much a research project in its early stages. On the technical side, the team highlights a few hurdles, such as overhead during inference and training.
Those challenges can likely be fixed, or tolerated, because they are minor in the bigger scheme of things. And given the sheer benefits HART delivers in computing efficiency, speed, and latency, they may well persist without causing any major performance issues.
In my brief time prompt-testing HART, I was astonished by the pace of image generation. I rarely ran into a scenario where the free web tool took more than two seconds to create an image. Even with prompts spanning three paragraphs (roughly 200 words), HART created images that adhered tightly to the description.
Aside from descriptive accuracy, there was plenty of detail in the images. However, HART suffers from the typical failings of AI image generators. It struggles with digits, basic depictions such as a person eating food, character consistency, and capturing perspective.
Photorealism in a human context is one area where I noticed glaring failures. On a few occasions, it simply got the concept of basic objects wrong, like confusing a ring with a necklace. But overall, those errors were few and far between, and fundamentally expected; plenty of AI tools still can't get these right, despite having been around for a while now.
Overall, I am particularly excited by the immense potential of HART. It would be interesting to see whether MIT and Nvidia create a product out of it, or simply adopt the hybrid AI image generation approach in an existing product. Either way, it’s a glimpse into a very promising future.