Nvidia researchers boost LLMs’ reasoning skills by getting them to ‘think’ during pre-training


    Scientists at Nvidia have introduced a method that rethinks how large language models (LLMs) develop reasoning capabilities.

    This novel technique, termed Reinforcement Learning Pretraining (RLP), incorporates reinforcement learning directly into the early stages of model training, rather than applying it only after initial pretraining.

    According to the research team, this strategy promotes autonomous reasoning before the model attempts to predict subsequent tokens, effectively instilling independent cognitive processes earlier in the training pipeline.

    By enabling models to reason over raw text without relying on external validation tools, RLP produces models that show marked gains on intricate downstream reasoning tasks, signaling a future where AI systems are more versatile and adept at handling complex real-world challenges.

    Conventional Training Paradigm for Large Language Models

    Traditionally, LLMs undergo a two-phase training regimen. Initially, they are pretrained on extensive text corpora using a next-token prediction objective, where the model learns to anticipate the following word or token in a sequence. This phase primarily equips the model with linguistic structure, factual knowledge, and basic pattern recognition.
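The next-token objective described above can be sketched in a few lines of Python. The toy vocabulary, the probability values, and the `next_token_loss` helper below are illustrative assumptions for this article, not code from Nvidia's paper:

```python
import math

def next_token_loss(probs, target_id):
    """Standard pretraining objective: the negative log-likelihood
    of the true next token under the model's predicted distribution.
    Lower is better; training minimizes this across the corpus."""
    return -math.log(probs[target_id])

# Toy distribution over a 4-token vocabulary given some context;
# suppose token 2 is the true next token.
model_probs = [0.1, 0.2, 0.6, 0.1]
loss = next_token_loss(model_probs, target_id=2)
print(round(loss, 4))  # → 0.5108 (= -ln 0.6)
```

In practice this is computed over every position in vast text corpora, which is what gives the model its linguistic and factual grounding.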

    Subsequently, in the fine-tuning stage, models acquire advanced reasoning skills such as Chain-of-Thought (CoT) reasoning, which involves articulating intermediate reasoning steps. This phase often employs supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), both of which depend on carefully curated datasets.

    The researchers highlight a fundamental mismatch between this sequential training approach and human cognition, which integrates new information with existing knowledge in a parallel, rather than linear, manner. Current pretraining techniques lack mechanisms to foster this integrated reasoning early on, limiting the model’s ability to develop deep reasoning skills from the outset.

    Mechanics of Reinforcement Learning Pretraining (RLP)

    RLP reimagines the training process by treating the generation of reasoning chains (CoT) as an explicit action preceding token prediction. At each step, the model first produces an internal “thought” or reasoning sequence, which it then uses alongside the original context to predict the next token.

    The model’s internal reasoning is rewarded based on how much it improves the accuracy of the next-token prediction compared to a baseline model that predicts tokens without generating thoughts. This reward is computed automatically by measuring the change in prediction probability, eliminating the need for human annotations or external evaluators.
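As described, that reward amounts to the change in (log-)probability of the true next token when the model conditions on its generated thought versus the no-thought baseline. A minimal sketch, where `rlp_reward` and the probability values are illustrative assumptions rather than the paper's exact estimator:

```python
import math

def rlp_reward(p_with_thought, p_without_thought):
    """Information-gain-style reward: how much the generated thought
    raised the log-probability of the actual next token relative to
    a baseline that predicts without thinking. Positive means the
    thought helped; negative means it hurt."""
    return math.log(p_with_thought) - math.log(p_without_thought)

# A thought that raises the true token's probability from 0.2 to 0.5
# earns a positive reward and is reinforced...
print(rlp_reward(0.5, 0.2) > 0)   # True
# ...while a thought that lowers it to 0.1 is discouraged.
print(rlp_reward(0.1, 0.2) < 0)   # True
```

Because the reward is just a difference of log-probabilities the model already computes, it comes "for free" from ordinary pretraining data, with no human labels or external verifier in the loop.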

    Only reasoning steps that enhance prediction accuracy receive positive reinforcement, effectively teaching the model to generate useful, goal-directed thoughts using the same vast, unstructured datasets employed in standard pretraining.

    This iterative feedback loop enables the model to discern when straightforward prediction suffices and when deeper reasoning is necessary. As the authors describe, “RLP shapes the model’s internal thinking by rewarding only those reasoning processes that demonstrably aid next-token prediction.”

    Importantly, RLP is designed to augment rather than replace subsequent fine-tuning stages. Bryan Catanzaro, Nvidia’s VP of Applied Deep Learning Research and co-author of the study, emphasizes that while RLP provides a strong foundational boost, traditional supervised fine-tuning and RLHF remain essential for refining model behavior. RLP essentially gives models a “head start” in developing reasoning skills.

    Empirical Validation and Practical Implications of RLP

    In rigorous testing on models such as Qwen3-1.7B and Nemotron-Nano-12B, Nvidia’s team evaluated RLP across diverse benchmarks in mathematics and scientific reasoning. The findings reveal that RLP-enhanced models consistently outperform their conventionally trained peers, especially in tasks demanding complex reasoning.

    For industries, this advancement could translate into more dependable AI outputs in multi-step processes like financial forecasting, legal contract analysis, or scientific data interpretation.

    Catanzaro notes, “By encouraging the model to deliberate before predicting, RLP helps internalize a coherent reasoning style, reducing subtle logical errors that often arise in extended workflows.”

    While RLP-trained models still require standard safeguards such as verification mechanisms, human oversight, and consistency checks, the approach establishes a more robust baseline for reasoning capabilities.

    Crucially, the benefits of RLP persist and even amplify through later fine-tuning phases, addressing the common issue of catastrophic forgetting where models lose previously acquired skills. After identical post-training, RLP models scored 7-8% higher than baseline models, demonstrating that RLP builds durable reasoning foundations that complement downstream alignment.

    Efficiency is another standout feature. On the Qwen3-1.7B model, RLP boosted performance by 17% compared to traditional continuous pretraining and outperformed a related method called Reinforcement Pretraining via prefix-matching rewards (RPT). This advantage held even when the baseline was trained with 35 times more data, confirming that the gains stem from the method itself rather than increased computational resources.

    Moreover, RLP exhibits remarkable scalability and adaptability, successfully extracting reasoning signals from broad, general-purpose web data rather than relying solely on specialized datasets. When applied to the hybrid Mamba-Transformer architecture in Nemotron-Nano-12B, RLP achieved a 35% relative improvement over a heavily trained baseline using only a fraction of the data.

    While these results suggest a more resource-efficient path to building powerful models, Catanzaro frames RLP as a conceptual breakthrough in the learning process rather than a direct solution to the high costs of large-scale training. “This approach transforms how models assimilate information during pretraining, fostering smarter learning. It complements, rather than replaces, large-scale pretraining,” he explains.

    RLP: Paving the Way for Smarter AI Foundations

    Ultimately, RLP signals a shift away from viewing pretraining as a monolithic next-token prediction task. Instead, future LLMs may be trained with hybrid objectives that encourage robust reasoning from the very beginning.

    Catanzaro offers a compelling metaphor: “Next-token prediction teaches a model to recognize the world’s patterns; reinforcement learning objectives like RLP teach it how to reason about those patterns. Combining these approaches can cultivate deeper, more structured thinking earlier in training, making learning more dynamic, inquisitive, and efficient.”

    Although much remains to be explored about the interplay of reinforcement learning during pretraining, it is clear that introducing exploratory reasoning early opens new dimensions for scaling AI, not just in size but in cognitive sophistication.
