QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100—While Improving Exploration

Imagine running Reinforcement Learning (RL) post-training on a 32-billion-parameter language model (LLM) in 4-bit NVFP4 precision, on a single NVIDIA H100 GPU, with BF16-level accuracy and 1.2 to 1.5 times faster step speeds. Researchers from NVIDIA, in collaboration with MIT, HKU, and Tsinghua, have unveiled QeRL (Quantization-enhanced Reinforcement Learning), a training framework that enables RL post-training in 4-bit NVFP4 precision. The approach keeps gradient computations in higher precision through LoRA, delivering over 1.5× speedups during rollout and approximately 1.8× faster end-to-end training than QLoRA in certain scenarios. Notably, QeRL marks the first reported RL training of a 32B policy model on a single H100-80GB GPU.

Revolutionizing the RL Training Cycle with QeRL

In typical RLHF, GRPO, or DAPO workflows, the majority of training time is consumed by rollouts, the token-generation phase. QeRL attacks this bottleneck by converting the policy weights to NVFP4 (FP4) format with dual-level scaling, while preserving logits and gradients in higher precision through LoRA. This hybrid-precision strategy keeps backpropagation stable and leverages hardware-optimized FP4×BF16 kernels (Marlin) for accelerated sampling. As a result, rollouts and prefill become significantly faster without maintaining a separate full-precision policy model.
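The hybrid-precision idea can be sketched in a few lines. The snippet below simulates weight-only FP4 quantization on a simplified symmetric E2M1 grid with per-block scaling, then runs a forward pass through the frozen quantized base plus a high-precision LoRA pair; the function names, block size, and grid handling are illustrative, not QeRL's actual Marlin kernels:

```python
import numpy as np

# E2M1 magnitude grid used by FP4 (the sign is handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_dequantize_fp4(w, block=16):
    """Simulate weight-only FP4 quantization with per-block scales."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0
    scaled = flat / scale
    # snap each magnitude to the nearest FP4 grid point, keep the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(w.shape)

def lora_forward(x, w_fp4, lora_a, lora_b, alpha=1.0):
    """Frozen quantized base weight plus a high-precision low-rank update."""
    return x @ w_fp4.T + alpha * (x @ lora_a.T) @ lora_b.T
```

Only `lora_a` and `lora_b` receive gradients; the quantized base stays fixed, which is what lets the sampling path run entirely on FP4-weight kernels.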

Technically, QeRL integrates Marlin FP4 kernels into both rollout and prefill operations, while LoRA restricts the number of trainable parameters. This targeted optimization addresses the most time-consuming and resource-intensive phase of RL training, especially for tasks involving extended reasoning sequences.
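How much LoRA restricts the trainable-parameter count is easy to quantify. The sketch below compares a full weight matrix against a rank-r LoRA pair for one hypothetical 4096×4096 projection; the dimensions and rank are illustrative choices, not values from the paper:

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full matrix vs. a rank-`rank` LoRA pair."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A: (rank, d_in), B: (d_out, rank)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=32)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")  # -> 64x fewer
```

Shrinking the trainable set this way also shrinks optimizer state, which compounds the memory savings from FP4 weights.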

Quantization as a Catalyst for Enhanced Exploration

A pivotal discovery in QeRL’s development is that deterministic FP4 quantization increases policy entropy, effectively flattening token probability distributions early in training. This fosters better exploration than 16-bit LoRA and NF4-based QLoRA baselines. To regulate the effect dynamically, QeRL introduces Adaptive Quantization Noise (AQN), which applies channel-wise Gaussian noise to LayerNorm scale parameters on an exponential decay schedule. The design preserves kernel fusion efficiency, since no additional weight tensors are introduced, while smoothly transitioning the model from exploration to exploitation.
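A minimal sketch of the AQN idea follows, assuming a multiplicative noise formulation and an exponential interpolation between a start and end noise scale; the schedule constants and the multiplicative form are assumptions for illustration, not QeRL's published hyperparameters:

```python
import numpy as np

def aqn_sigma(step, total_steps, sigma_start=1e-2, sigma_end=1e-4):
    """Exponentially decay the noise scale from sigma_start to sigma_end."""
    frac = step / total_steps
    return sigma_start * (sigma_end / sigma_start) ** frac

def noisy_layernorm_gain(gamma, step, total_steps, rng):
    """Fold channel-wise Gaussian noise into the LayerNorm gain.

    Because the noise lives inside the existing gain vector, no extra
    weight tensor is introduced and kernel fusion is preserved.
    """
    sigma = aqn_sigma(step, total_steps)
    return gamma * (1.0 + rng.normal(0.0, sigma, size=gamma.shape))
```

Early in training the noise scale is large, flattening the policy further and encouraging exploration; by the final steps the gain is effectively noise-free, shifting the model toward exploitation.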

Experimental ablations demonstrate that QeRL achieves accelerated reward acquisition and higher ultimate performance on complex math reasoning benchmarks under both GRPO and DAPO algorithms. These results support the hypothesis that structured noise in parameter space can serve as a beneficial exploration mechanism in RL, contrasting with its typically adverse effects in supervised fine-tuning.
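The entropy mechanism is easy to illustrate numerically: a flattened token distribution, of the kind quantization error tends to produce early in training, carries higher entropy and therefore samples a wider range of tokens. The logits and flattening factor below are purely illustrative, not measured QeRL values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

logits = np.array([4.0, 1.0, 0.5, 0.2, 0.1])
sharp = entropy(softmax(logits))        # peaked distribution
flat = entropy(softmax(logits / 2.0))   # flattened, as quantization tends to do
```

Here `flat > sharp`: the flattened policy spreads probability mass over more tokens, which is exactly the exploration signal the ablations attribute to FP4.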

Performance Highlights and Benchmarks

Using the Qwen2.5 model as a backbone, QeRL’s NVFP4+LoRA approach surpasses traditional LoRA and QLoRA in both rollout throughput and total training duration. Specifically, it achieves more than 2× rollout throughput improvements on 14B and 32B models compared to QLoRA, and approximately 1.8× faster end-to-end training in representative experiments. This efficiency gain enables training a 32B policy model with GRPO on a single H100-80GB GPU, a feat previously unattainable due to memory constraints.
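The gap between the >2× rollout speedup and the ~1.8× end-to-end figure follows Amdahl's law: only the rollout fraction of a training step gets faster. The sketch below uses an illustrative rollout fraction of 90% of step time; the actual fraction varies with task and sequence length:

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's law: only the rollout fraction of a step is accelerated."""
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)

# if rollouts are ~90% of a step and get 2x faster:
print(f"{end_to_end_speedup(0.9, 2.0):.2f}x")  # -> 1.82x
```

This is consistent with the article's numbers: the longer the reasoning sequences, the larger the rollout fraction, and the closer the end-to-end gain gets to the raw kernel speedup.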

In terms of accuracy, QeRL remains competitive with higher-precision methods. For a 7B parameter model, it attains 90.8% accuracy on GSM8K and 77.4% on MATH500, outperforming 16-bit LoRA and QLoRA baselines and matching full-parameter fine-tuning results. Across broader mathematical reasoning datasets such as BigMath, QeRL maintains parity or superiority while converging more rapidly, thanks to its enhanced exploration capabilities.

Understanding QeRL’s Scope and Limitations

It is important to clarify that QeRL employs weight-only FP4 quantization combined with LoRA updates, without claiming FP4 precision for logits or gradients. The primary advantages lie in increased rollout and prefill throughput and reduced memory consumption. Empirical evidence suggests that the entropy introduced by quantization acts as a beneficial exploration signal during RL training, modulated effectively by AQN. However, the generalizability of these benefits to other domains beyond math reasoning or to RL tasks involving safety constraints and tool use depends heavily on reward function design and sequence length considerations.

Summary of Key Insights

  • QeRL leverages NVFP4 4-bit weight quantization alongside LoRA to accelerate rollout phases and reduce memory usage, enabling RL training of 32B LLMs on a single H100-80GB GPU.
  • Quantization serves as an exploration enhancer: FP4 increases policy entropy, while Adaptive Quantization Noise (AQN) applies scheduled channel-wise noise through LayerNorm scaling.
  • Efficiency gains include over 1.5× rollout speedups compared to 16-bit LoRA and approximately 1.8× faster end-to-end training relative to QLoRA; rollout throughput exceeds 2× on 14B and 32B models versus QLoRA.
  • Accuracy remains robust, with Qwen2.5-7B achieving 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning benchmarks.
  • NVFP4 is a hardware-optimized 4-bit floating-point format featuring two-level scaling (FP8 E4M3 block scalers combined with FP32 tensor scales), enabling efficient Marlin kernel execution.
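The single-GPU memory claim in the first bullet can be checked with back-of-envelope arithmetic, assuming 16-element NVFP4 blocks with one FP8 E4M3 scale each (the per-tensor FP32 scales are negligible, and activations, KV cache, and LoRA optimizer state are left out of this sketch):

```python
PARAMS = 32e9  # 32B policy parameters

bf16_gb = PARAMS * 2 / 2**30            # 2 bytes per weight
fp4_gb = PARAMS * 0.5 / 2**30           # 4 bits per weight
scales_gb = (PARAMS / 16) * 1 / 2**30   # one FP8 scale per 16-element block

print(f"BF16: {bf16_gb:.1f} GiB, NVFP4: {fp4_gb + scales_gb:.1f} GiB")
```

Roughly 60 GiB of BF16 weights barely fit on an H100-80GB before anything else is allocated, while the ~17 GiB NVFP4 footprint leaves ample headroom for rollouts and LoRA training state.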

Final Thoughts

QeRL represents a significant advancement in RL training efficiency by quantizing weights to NVFP4 precision while maintaining high-precision updates and logits through LoRA. Its introduction of Adaptive Quantization Noise provides a novel mechanism to harness quantization-induced entropy as a controlled exploration signal. Demonstrated primarily on math reasoning tasks with GRPO and DAPO, QeRL’s success hinges on the availability of NVFP4-optimized kernels like Marlin. This breakthrough opens new avenues for scaling RL training on large language models with constrained hardware resources.
