Hugging Face shows that test-time scaling can help small language models to punch above their weight




Hugging Face researchers have demonstrated in a new case study how small language models can be configured to outperform much larger models. Their findings show that a Llama 3.2 model with just 3B parameters can beat the 70B version on complex math problems.

Hugging Face has fully documented the process and provides a roadmap for enterprises that want to create their own customized reasoning models.

Image source: Hugging Face

Scaling test-time computation

This work is inspired by OpenAI o1, a model that uses extra “thinking” to solve complex math, coding, and reasoning problems.

A key idea behind models such as o1 is scaling “test-time compute,” which means using more compute cycles during inference. This allows the model to test and verify different responses and reasoning paths before producing the final answer. Scaling test-time compute is particularly useful when there isn’t enough memory to run a large model.

Since OpenAI has not revealed the inner workings of o1, researchers have speculated about its operation and tried to reverse engineer it. There are several open alternatives to the o1 model.

The Hugging Face work is built on a DeepMind report released in August that examines the tradeoffs of inference-time versus pre-training computation. The study offers comprehensive guidelines for balancing training and inference computation to achieve the best results within a budget.

The success of this technique hinges on two key components beyond the extra inference-time computation: a reward model that evaluates the small language model’s (SLM’s) responses, and a search algorithm that optimizes the path it takes to refine its answers.

Image source: Hugging Face

Different reasoning algorithms

“Majority voting” is the simplest way to use test-time scaling. The same prompt is sent to the model multiple times and the most frequent answer is selected. Majority voting is useful for simple problems, but its benefits quickly diminish on complex reasoning problems or tasks where the model makes consistent errors across generations.
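As a minimal sketch, majority voting reduces to counting identical answers across samples and returning the most common one (the `samples` list below is a hypothetical set of model outputs, not real generations):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer among N sampled generations."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical example: five samples for the same prompt.
samples = ["42", "42", "41", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Note that this only works when correct generations agree verbatim; in practice answers are first normalized (e.g., extracting the final number) before counting.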

“Best-of-N” is a more advanced technique. Here, the SLM generates several answers, but instead of using majority voting, a reward model evaluates them and selects the best one. “Weighted Best-of-N,” a more nuanced variant, factors in consistency to choose answers that are both confident and more frequent than others.
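A minimal sketch of Weighted Best-of-N, assuming a reward model has already assigned a score to each sampled answer (the scores below are made up for illustration): identical answers pool their scores, so an answer that is both frequent and highly rated wins over a single high-scoring outlier.

```python
from collections import defaultdict

def weighted_best_of_n(answers, scores):
    """Weighted Best-of-N: sum reward scores over identical answers,
    so frequency and confidence both count toward selection."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

answers = ["12", "15", "12", "12", "15"]
scores  = [0.6,  0.9,  0.5,  0.7,  0.8]   # hypothetical reward-model scores
print(weighted_best_of_n(answers, scores))  # "12" wins: 1.8 vs. 1.7
```

Plain Best-of-N would pick “15” here (highest single score, 0.9); the weighted variant picks “12” because three moderately confident generations agree on it.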

The researchers used a “process reward model” (PRM) that scores the SLM’s response not only on the final answer, but also on the multiple stages it goes through to get there. Their experiments showed that Weighted Best-of-N with a PRM brought Llama 3.2 1B to the level of Llama 3.1 8B on the difficult MATH-500 benchmark.
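Because a PRM scores each intermediate step, the per-step scores must be aggregated into a single solution-level score. One common choice is the product of step scores, which penalizes any weak intermediate step (other aggregations, such as the minimum or the last step’s score, are also used; the scores below are illustrative, not from the study):

```python
import math

def prm_score(step_scores):
    """Aggregate per-step PRM scores into one solution-level score.
    Using the product means a single weak step sinks the whole chain."""
    return math.prod(step_scores)

# Hypothetical per-step scores for two candidate solutions.
solution_a = [0.9, 0.95, 0.9]   # consistently strong steps
solution_b = [0.99, 0.3, 0.99]  # one weak intermediate step
print(prm_score(solution_a) > prm_score(solution_b))  # True
```

This is what distinguishes a process reward model from an outcome reward model, which would only see the final answer.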

Image source: Hugging Face

Addition of search

To further improve the model’s performance, the researchers added search algorithms to its reasoning process. Instead of generating the answer in a single pass, they used “beam search,” an algorithm that guides the model’s answer generation step by step.

At each step, the SLM generates multiple partial answers. The search algorithm uses the reward model to evaluate them and selects a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches the right answer. This way, the inference budget is narrowed to focus on the most promising solutions.
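The loop described above can be sketched as a generic reward-guided beam search. In this toy version, `expand` stands in for the SLM proposing continuations of a partial answer and `score` stands in for the reward model; both are hypothetical placeholders, not Hugging Face’s implementation:

```python
def beam_search(expand, score, start, beam_width=2, max_steps=3):
    """Reward-guided beam search: at each step, expand every partial
    answer, score the candidates, and keep only the top beam_width."""
    beams = [start]
    for _ in range(max_steps):
        candidates = [c for b in beams for c in expand(b)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]  # prune to the most promising
    return beams[0]

# Toy example: "generate" binary strings, reward counts the 1s.
expand = lambda s: [s + "0", s + "1"]
score = lambda s: s.count("1")
print(beam_search(expand, score, ""))  # -> 111
```

Here `beam_width * max_steps` plays the role of the inference budget: pruning at every step keeps the total number of scored candidates fixed regardless of how large the full search tree would be.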

The researchers found that while beam search improved the model’s performance on complex problems, it tended to underperform on simple ones. To address this, they added two more elements to their inference strategy.

The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck in false reasoning pathways and diversifies its response branches. Second, they developed a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
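The compute-optimal idea can be sketched as a simple dispatcher that routes a problem to a strategy based on an estimated difficulty score. The thresholds and budget split below are invented for illustration; the actual strategy in the study is tuned empirically per difficulty bin:

```python
def compute_optimal_strategy(difficulty, budget):
    """Difficulty-aware strategy selection (illustrative thresholds):
    easy problems get cheap Best-of-N sampling, harder ones get
    search-based strategies that spend the full budget."""
    if difficulty < 0.3:
        return ("best_of_n", min(budget, 4))   # easy: few samples suffice
    if difficulty < 0.7:
        return ("beam_search", budget)         # medium: guided search
    return ("dvts", budget)                    # hard: diverse search

print(compute_optimal_strategy(0.2, 16))  # ('best_of_n', 4)
print(compute_optimal_strategy(0.9, 16))  # ('dvts', 16)
```

The point of the dispatcher is to avoid beam search’s weakness on simple problems noted above: easy inputs never reach the search-based branches at all.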

Combining these techniques allowed Llama 3.2 1B to outperform the 8B model by a significant margin. The researchers also found the strategy to be scalable: when applied to Llama 3.2 3B, it outperformed the much larger 70B model.

It’s not a perfect solution

But it is a good start. Scaling test-time compute changes the dynamics of model costs: enterprises can now choose where to allocate their compute resources. If you have limited memory or can tolerate slower response times, you can use a smaller model and spend more on inference-time cycles to generate more accurate results.

Test-time scaling also has its limitations. In their experiments, the Hugging Face researchers used a specially trained Llama 3.1 8B model as the PRM, which meant running two models in parallel (though this was still far more resource-efficient than running the 70B model). The researchers note that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answers instead of relying on an external verifier. This remains an open research area.

Moreover, the test-time scaling technique presented in this study is limited to problems whose answers can be evaluated clearly, such as math and coding. Creating reward models and verifiers for subjective tasks such as creative writing and product design will require further research.

What is clear is that test-time scaling has generated a great deal of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would do well to keep a close eye on how the landscape develops.
