Businesses need to know whether the models powering their agents and applications hold up in real-world situations. That kind of evaluation is difficult, because it's hard to predict the specific scenarios a model will face. The RewardBench benchmark has been updated to give organizations a better idea of how models will actually perform for them.
The Allen Institute for AI (Ai2) has launched RewardBench 2, an updated version of its reward model benchmark, RewardBench. The company claims the new version provides a more comprehensive view of model performance and assesses how well models align with an enterprise's goals and standards.
Ai2 built RewardBench on classification tasks that measure correlations through inference-time compute and downstream training. RewardBench is primarily concerned with reward models (RMs), which act as judges that evaluate LLM outputs. RMs assign a score, or "reward," that guides reinforcement learning from human feedback (RLHF).
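As a rough illustration of that pattern (not Ai2's own tooling), the sketch below scores a prompt/response pair with an off-the-shelf reward model published on Hugging Face. The checkpoint name, the example conversation, and the assumption that the RM is exposed as a sequence-classification head that emits a single scalar are all illustrative choices, not details from the article.

```python
# Minimal sketch: scoring a candidate answer with a reward model.
# The checkpoint name and scoring convention are assumptions for illustration;
# many reward models expose their score differently.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "Skywork/Skywork-Reward-Llama-3.1-8B"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)

conversation = [
    {"role": "user", "content": "Summarize our refund policy in one sentence."},
    {"role": "assistant", "content": "Refunds are issued within 30 days of purchase."},
]

# The RM reads the full prompt/response pair and emits a scalar score (the "reward").
input_ids = tokenizer.apply_chat_template(
    conversation, tokenize=True, return_tensors="pt"
)
with torch.no_grad():
    score = reward_model(input_ids).logits[0][0].item()

print(f"reward score: {score:.3f}")  # higher = preferred
```

During RLHF, scores like this are what the policy model is optimized against; at inference time, the same scores can be used to rank several candidate answers (best-of-N).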
RewardBench 2 is here! We learned a lot from our first reward-model evaluation tool. This one is much harder and more correlated with both downstream RLHF scaling and inference-time scaling. pic.twitter.com/NGetvNrOQV
– Ai2 (@allen_ai) June 2, 2025 https://twitter.com/allen_ai/status/1929576050352111909?ref_src=twsrc%5Etfw

Nathan Lambert, senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched, but the model environment and its benchmarks have evolved rapidly. "As reward models advanced and use-cases became more nuanced, we quickly realized with the community that the original version didn't capture the complexity of human preferences in real life," he said.
Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation–incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, has a more challenging scoring setup and new domains.
Evaluating the models that evaluate
While reward models test how well models perform, it's also important that the RMs themselves align with company values; otherwise, the fine-tuning and reinforcement learning process could reinforce bad behaviors, such as hallucinations, and reduce generalization. RewardBench 2 covers six domains: factuality, precise instruction following, math, safety, focus, and ties.
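To make "accuracy-based" concrete: for each prompt, this style of benchmark checks whether a reward model scores the correct completion above the alternatives, then aggregates results per domain. The sketch below is a hypothetical, simplified scorer; the record layout and field names are assumptions for illustration, not RewardBench 2's actual data schema.

```python
# Hypothetical sketch of per-domain, accuracy-based RM evaluation.
# Field names ("domain", "chosen_score", "rejected_scores") are assumed
# for illustration and do not reflect RewardBench 2's real format.
from collections import defaultdict

def per_domain_accuracy(records):
    """A prompt counts as correct when the RM scores the chosen (correct)
    completion strictly higher than every rejected alternative."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        if r["chosen_score"] > max(r["rejected_scores"]):
            correct[r["domain"]] += 1
    return {d: correct[d] / total[d] for d in total}

records = [
    {"domain": "factuality", "chosen_score": 1.9, "rejected_scores": [0.4, 1.2, -0.3]},
    {"domain": "safety", "chosen_score": 0.2, "rejected_scores": [0.8, -1.1, 0.1]},
]
print(per_domain_accuracy(records))  # e.g. {'factuality': 1.0, 'safety': 0.0}
```

Reporting accuracy by domain, rather than a single aggregate number, is what lets a team pick the RM that is strongest on the dimensions that matter most for its application.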
Enterprises should use RewardBench 2 differently depending on their application. If they're performing RLHF on their own, they should adopt the best practices and datasets of leading models in their own pipelines, because reward models require on-policy training recipes (i.e., reward models similar to the model they want to train with RL). "RewardBench 2 shows that users can select the model they want for their domain and see correlated performance," Lambert said.
Lambert said that benchmarks such as RewardBench allow users to evaluate models based on “dimensions that are most important to them” rather than relying solely on a single score. Human preferences are also becoming more nuanced.
Ai2 released the first version of RewardBench in March 2024, claiming it was the first benchmark leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta's FAIR released reWordBench, and DeepSeek released a new technique called Self-Principled Critique Tuning for smarter, scalable RMs.
We are super excited to announce our second reward model evaluation. It's significantly harder, much cleaner, and well correlated with downstream PPO/BoN sampling.
Because RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continued to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama 3.1, as well as datasets and models like Qwen, Skywork, and Ai2's own Tulu.
According to the company, larger reward models perform better on the benchmark because their base models are stronger. Overall, the best-performing models are variants of Llama 3.1 Instruct. Skywork data is "particularly helpful" for safety and focus, and Tulu did well on factuality.
Ai2 said that while it believes RewardBench 2 "is a step forward in broad, multi-domain accuracy-based evaluation" for reward models, it cautioned that model evaluation should be used mainly as a guide to pick the models that work best for an enterprise's needs.