Ai2’s MolmoAct AI model ‘thinks 3D’ and will challenge Nvidia in robotics AI

Physical AI is a growing area, with companies such as Nvidia, Google and Meta releasing research that combines large language models with robots.

The Allen Institute for AI’s (Ai2) new research aims to compete with Nvidia and Google on physical AI. It has released MolmoAct 7B, a new open-source model that allows robots to “reason in space.” MolmoAct is based on Ai2’s open-source Molmo and “thinks” in three dimensions. The company will also release its training data. Ai2 licenses the model under Apache 2.0, while the datasets fall under CC BY 4.0.

Ai2 classes MolmoAct as an Action Reasoning Model, a category of foundation models that can reason about actions within 3D physical space.

This means that MolmoAct uses its reasoning capabilities to understand the physical world and plan how it occupies space before taking an action.


“MolmoAct has reasoning in 3D space capabilities versus traditional vision-language-action (VLA) models,” Ai2 told VentureBeat in an email. In other words, MolmoAct is a VLA that can think and reason in space, which the company says makes it more efficient and generalizable.

Physical understanding

Ai2 claims MolmoAct can help robots better understand their environment and make better decisions about how to interact with it. MolmoAct can be used anywhere a machine needs to understand its physical environment, according to the company. “We are primarily thinking about it in a home environment because that is where robotics faces the biggest challenge, as things are constantly changing and irregular. But MolmoAct can work anywhere.”

MolmoAct is able to understand the physical world through “spatially-based perception tokens,” pretrained tokens extracted with a vector-quantized variational autoencoder (VQ-VAE), a model that converts data inputs such as video into discrete tokens. These tokens differ from the ones used by traditional VLAs in that they are not text inputs.

These tokens give MolmoAct spatial understanding and allow it to encode geometric structure, which the model uses to estimate the distances between objects.
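As a rough illustration of how a VQ-VAE turns visual input into discrete tokens that a language-model-style backbone can consume, here is a minimal sketch of the quantization step. The encoder, codebook size and tensor shapes below are assumptions for illustration, not MolmoAct’s actual architecture.

```python
# Minimal sketch of VQ-VAE-style tokenization (illustrative only, not MolmoAct's code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_tokens: int = 1024, dim: int = 64):
        super().__init__()
        # Codebook: each row is the embedding of one discrete "perception token".
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: encoder features of shape (batch, height*width, dim).
        # Snap every spatial feature vector to its nearest codebook entry.
        distances = torch.cdist(z, self.codebook.weight.expand(z.size(0), -1, -1))
        return distances.argmin(dim=-1)  # (batch, height*width) discrete token ids

# A placeholder convolutional encoder maps an image to a grid of feature vectors,
# which the quantizer converts into a sequence of discrete token ids.
encoder = nn.Conv2d(3, 64, kernel_size=8, stride=8)
image = torch.randn(1, 3, 224, 224)
features = encoder(image).flatten(2).transpose(1, 2)  # (1, 28*28, 64)
tokens = VectorQuantizer()(features)
print(tokens.shape)  # torch.Size([1, 784])
```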

After estimating the distances, MolmoAct predicts a series of “image-space waypoints,” points in the image that lay out a path. Only then does the model output specific actions, such as lowering an arm a few inches or stretching out.
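To make that flow concrete, below is a small, hypothetical sketch of the plan-then-act stages chained together. Only the ordering (perception tokens, then image-space waypoints, then low-level actions) comes from Ai2’s description; every function name, waypoint value and action schema here is invented for illustration.

```python
# Hypothetical sketch of the reported MolmoAct flow: perception tokens ->
# image-space waypoints -> low-level actions. All names and values are illustrative.
from dataclasses import dataclass

@dataclass
class Action:
    joint: str       # e.g. "arm_lower"
    delta_cm: float  # signed displacement for that joint

def plan_waypoints(perception_tokens: list[int], instruction: str) -> list[tuple[int, int]]:
    """Stage 2: predict image-space waypoints (pixel coordinates) tracing a path.
    A real model would decode these autoregressively; this stub returns a fixed path."""
    return [(120, 300), (160, 260), (200, 220)]

def decode_actions(waypoints: list[tuple[int, int]]) -> list[Action]:
    """Stage 3: map the waypoint path to embodiment-specific motor commands.
    Here each pixel step is crudely converted into a small arm displacement."""
    actions = []
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        actions.append(Action("arm_extend", (x1 - x0) * 0.1))
        actions.append(Action("arm_lower", (y1 - y0) * 0.1))  # negative value = lift
    return actions

# Stage 1 (the perception tokens sketched earlier) would feed the planner; the
# resulting waypoints are then decoded into concrete actions for the robot.
tokens = [17, 342, 901, 55]  # placeholder token ids
for action in decode_actions(plan_waypoints(tokens, "pick up the mug")):
    print(action)
```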

Ai2 researchers claimed that, “with minimal fine-tuning,” they were able to get the model to adapt to different embodiments, such as a mechanical hand or a humanoid robot.

Benchmarking tests conducted by Ai2 showed that MolmoAct 7B achieved a task success rate of 72.1%, beating models from Google, Microsoft and Nvidia.

A small step forward

Ai2’s research is the latest to take advantage of the unique benefits of LLMs and VLMs, especially as the pace of innovation in generative AI continues to accelerate. Experts in the field see Ai2’s work, and that of other tech companies, as building blocks.

Alan Fern, a professor at the Oregon State University College of Engineering, told VentureBeat that Ai2’s work “represents an important step in the development of 3D physical reasoning models that are more capable.”

“While I wouldn’t call it revolutionary, this is a significant step forward in developing more capable 3D reasoning models,” Fern added. “Their focus is on 3D scene understanding, as opposed to 2D models. This marks a significant shift in the right direction.” He noted that while the benchmarks have improved over previous models, they still fall short of capturing real-world complexity and remain relatively controlled and toy-like in nature.

Gather AI praised the openness and transparency of the data, noting that this is “great news” because developing and training such models is expensive, and the release gives other academic laboratories and even dedicated hobbyists a solid foundation to build upon and fine-tune.

Growing interest in physical AI

For many developers and computer scientists, creating intelligent robots, or at least ones that are spatially aware, has long been a dream.

But building robots that can quickly process what they “see,” then move and react with ease, is difficult. Before LLMs, scientists had to code every single movement, which meant much more work and a limited range of robotic actions. Now, LLM-based methods let robots determine which actions to take based on the objects they are interacting with.

Google Research’s SayCan helps a robot reason about tasks using an LLM, allowing it to determine the sequence of movements required to reach a goal. OK-Robot, a collaboration between Meta and New York University, uses visual-language models for movement planning and object manipulation.

Hugging Face has released a $299 desktop robot as part of its effort to democratize robotics development. Nvidia has released several models for robot training, including Cosmos-Transfer1, and has proclaimed physical AI to be the future. Fern, the OSU professor, said that despite the so-far limited demos of physical AI, interest is growing, and the quest to achieve general intelligence in robots is becoming easier.

The landscape is more challenging now, with less low-hanging fruit, he said, but large physical intelligence models are still in their infancy and ripe for rapid advances.
