Study shows AI models can generalize better on their own with less supervision



When left to their own devices, language models can generalize more effectively, a new study from Hong Kong University and the University of California, Berkeley shows. The findings, which apply to both large language models (LLMs) and vision language models (VLMs), challenge one of the most common beliefs in the LLM community: that models must be trained on hand-labeled examples. The researchers show that relying too heavily on hand-crafted examples can actually hurt a model's ability to generalize to unseen data.

SFT vs. RL for model training

Supervised fine-tuning (SFT) has long been the gold standard for training LLMs and VLMs. Once a model has been pre-trained on raw text and image data, companies and AI labs post-train it on a large dataset of hand-crafted question/answer or request/response examples. After SFT, a model can go through additional training stages, such as reinforcement learning from human feedback (RLHF), in which the model learns implicit human preferences from signals like answer rankings or approval/disapproval of its responses.
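For a concrete picture, here is a minimal, illustrative SFT step in Python, assuming the Hugging Face transformers library. The model name and the single hand-labeled example are placeholders, not the study's actual setup; the point is only that SFT minimizes next-token loss on human-written answers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One hand-crafted example: the kind of labeled data SFT depends on.
prompt = "Q: What is 7 * 6?\nA:"
response = " 42"

# Tokenize prompt + response together; labels mirror the inputs so the
# model is trained to reproduce the human-written answer token by token.
inputs = tokenizer(prompt + response, return_tensors="pt")
labels = inputs["input_ids"].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

# Standard next-token cross-entropy loss on the labeled example.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```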

SFT can steer a model toward the tasks its creators intend, but collecting that data is slow and expensive, which makes it a bottleneck for many companies and labs.

Recent advances in LLMs have created interest in pure reinforcement learning (RL) approaches, in which the model is given a problem and left to learn the task on its own, without hand-crafted examples. The most notable example is DeepSeek-R1, an OpenAI o1 rival that used reinforcement learning to master complex reasoning tasks.

Generalization vs. memorization

Overfitting is a key problem in machine learning, where a model performs well on its training data but fails to generalize to unseen examples. The model may give the false impression that it has learned the task when in fact it has merely memorized its training examples. In large AI models, it can be difficult to distinguish memorization from genuine generalization.

The new study focuses on the generalization capabilities of RL and SFT training in textual and visual reasoning tasks. In textual reasoning, an LLM trained on one set of rules should be able to generalize to variants of those rules. In visual reasoning, a VLM should perform consistently despite changes to visual inputs such as color or spatial layout.

In their experiments, the researchers used two representative tasks. The first was GeneralPoints, a benchmark that evaluates a model's arithmetic reasoning capabilities. The model is given four cards, as textual descriptions or images, and is asked to combine them to reach a target number. To study rule-based generalization, the researchers trained the model on one set of rules and evaluated it on a different rule. For visual generalization, they trained the model on cards of one color and tested its performance on cards of other colors and numbering schemes.
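To make the task concrete, here is a hedged Python sketch of a GeneralPoints-style checker, not the benchmark's actual code: it tests whether four card values can be combined with basic arithmetic to reach a target such as 24. For brevity it only tries left-to-right groupings; a full solver would also try other parenthesizations. A verifier like this is what allows correct answers to be checked automatically, without a human-written solution.

```python
from itertools import permutations, product

def reaches_target(cards, target=24, eps=1e-6):
    # The four basic operations; division guards against dividing by zero.
    ops = [
        lambda a, b: a + b,
        lambda a, b: a - b,
        lambda a, b: a * b,
        lambda a, b: a / b if abs(b) > eps else None,
    ]
    for a, b, c, d in permutations(cards):
        for f, g, h in product(ops, repeat=3):
            # Only tries the left-to-right grouping ((a op b) op c) op d.
            x = f(a, b)
            if x is None:
                continue
            y = g(x, c)
            if y is None:
                continue
            z = h(y, d)
            if z is not None and abs(z - target) < eps:
                return True
    return False

print(reaches_target([1, 2, 3, 4]))  # True: 1 * 2 * 3 * 4 = 24
print(reaches_target([1, 1, 1, 1]))  # False: no combination reaches 24
```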

The second task, V-IRL, tests a model's spatial reasoning abilities in an open-world navigation setting with realistic visual input. It also comes in pure-language and vision-language versions. The researchers evaluated generalization by varying the kinds of instructions and visuals the model was trained on and tested with.

They ran their tests on Llama-3.2-Vision-11B, warming the model up by training it on a small SFT dataset, then creating separate versions of it for each task and training paradigm. For each task, they separately scaled up training with RL and with SFT. The SFT process trains the model on additional hand-crafted solutions, while RL lets the model generate many solutions for each problem, evaluate the results, and train itself on the correct answers.
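The paper uses its own RL recipe; the sketch below only illustrates the general loop described above (sample, verify, reinforce), closer in spirit to a rejection-sampling-and-fine-tune loop. The model name, prompt, and eval-based verifier are placeholder assumptions, not the authors' implementation; the point is that reward comes from checking the answer rather than from hand-written solutions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def is_correct(answer_text: str, target: int = 24) -> bool:
    """Hypothetical verifier: does the proposed expression evaluate to target?"""
    try:
        return abs(eval(answer_text, {"__builtins__": {}}) - target) < 1e-6
    except Exception:
        return False

prompt = "Combine 1, 2, 3, 4 with + - * / to make 24. Expression:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several candidate solutions for the same problem.
samples = model.generate(**inputs, do_sample=True, num_return_sequences=8,
                         max_new_tokens=16, temperature=1.0)

model.train()
for seq in samples:
    # Strip the prompt tokens and keep only the generated answer text.
    answer = tokenizer.decode(seq[inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True).strip()
    if is_correct(answer):
        # Reinforce only the generations the verifier marks correct.
        full = tokenizer(prompt + " " + answer, return_tensors="pt")
        loss = model(**full, labels=full["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```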

The findings show that reinforcement learning consistently improves performance on examples that are drastically different from training data. On the other hand, SFT seems to memorize the training rules and doesn’t generalize to out-of-distribution (OOD) examples. These observations apply to both text-only and multimodal settings.

SFT-trained models perform well on training examples (in-distribution) while showing poor performance on unseen examples (out-of-distribution) (source: arXiv)

Implications for real-world applications

Although their experiments showed that RL generalizes better than SFT, the researchers also found that SFT helps stabilize the model's output format, which is crucial for RL to achieve its performance gains. Without the initial SFT stage, they found, RL training did not produce desirable results.

These results differ somewhat from those of DeepSeek-R1-Zero, which was post-trained with pure RL. The researchers suggest the discrepancy may be due to the different backbone model used in their experiments.

There is clearly more potential to be tapped in RL-heavy methods. For use cases with verifiable outcomes, letting models learn on their own can often yield unanticipated solutions that humans could not have crafted themselves. This could be especially useful in settings where creating examples by hand is slow and expensive.


