
Meta Llama Benchmarking Confusion


Meta Llama models Maverick & Scout are now available, but they may not be the best models in the market.

Katelyn Chedraoui, Writer

Katelyn is a writer with CNET covering social media, AI and online services. She graduated from the University of North Carolina at Chapel Hill with a degree in media and journalism. You can often find her with a novel and an iced coffee during her time off.

Meta has released two new AI models, adding to an already wide range of AI options. Llama 4 is the company's latest family of generative AI models, and the Llama 4 models can be tested now on Meta AI's site. Llama 4 will also soon power many Meta AI features in the company's Instagram and Messenger services.

The rivalry between Meta and other AI firms is intensifying. Companies are racing to build and release AI models that can perform more complex tasks and handle advanced reasoning without requiring enormous amounts of computing power or cash. Meta hopes its newest models can help it surpass competitors like ChatGPT and Gemini.

Llama benchmarking explained

There's been some confusion over how the new Llama model compares to other models. LMArena is a crowdsourced AI platform created by UC Berkeley SkyLab that allows users to test chatbots. In Meta's announcement, the company claimed that its Maverick model outperformed OpenAI's GPT-4o.

However, the model Meta submitted to LMArena is not the one that is currently available. The model submitted for LMArena testing is "llama-4-maverick-03-26-experimental." Meta clarifies, in tiny font at the end of a chart on Llama's website (not in the announcement), that the model was "optimized for conversationality."

Check out the final footnote at the bottom. (Meta/Screenshot by Katelyn Chedraoui)

LMArena put out a statement on X/Twitter on Monday saying that Meta's interpretation of its policy did not match LMArena's expectations, and that Meta should have been clearer that the submitted model was a "customized model to optimize for human preference." In other words, it's possible that Meta submitted a better, more human-friendly model to try to juice its scores. One way to do that could be training a model on test sets, which are sets of data and tests typically run during the post-training evaluation process, not before.

Meta's VP of generative AI, Ahmad Al Dahle, tweeted that claims the company trained on test sets are "simply not true." He said the differences in performance people are seeing were "due to the need to stabilize implementations." As of publication, Meta's experimental Llama Maverick model (the one originally submitted) is ranked in second place on LMArena, tied with GPT-4o and a preview of Grok 3. Google's Gemini 2.5 Pro comes in first.

Meta didn't immediately respond to a request for comment.

Meet Scout & Maverick.

The Llama 4 family has two models available: Scout and Maverick. Both are open-weight, multimodal models, which means they can generate code, text and images. Meta's open models allow developers to gain some insight into how the models are built: because Llama 4 is an open-weights model, you can see the model's connections and how certain traits are given more weight over time. OpenAI announced this month that it is developing an open-weights model for the first time. According to a video uploaded by CEO Mark Zuckerberg,

