NVIDIA AI Released Jet-Nemotron: 53x Faster Hybrid-Architecture Language Model Series that Translates to a 98% Cost Reduction for Inference at Scale

Researchers at NVIDIA have addressed a major bottleneck in large language model (LLM) inference efficiency with Jet-Nemotron, a series of models available in 2-billion- and 4-billion-parameter sizes. These models achieve up to 53.6× faster generation throughput than top-tier full-attention LLMs while matching or even improving on their accuracy. Crucially, this advance does not come from training new models from scratch, but from retrofitting existing pre-trained models through an innovative method called Post Neural Architecture Search (PostNAS). This development promises to change the way businesses, developers, and researchers approach LLM deployment and optimization.

Why Speed Matters in Contemporary LLMs

Modern leading LLMs such as Qwen3, Llama3.2, and Gemma3 have pushed the boundaries of accuracy and versatility. However, their reliance on the O(n²) self-attention mechanism results in significant computational and memory demands, especially when handling long-context inputs. This complexity inflates operational costs and limits deployment options, particularly on edge devices or systems with limited memory. Previous attempts to replace full-attention Transformers with more efficient architectures, such as Mamba2, GLA, and RWKV, have struggled to match the accuracy of full-attention models, until the emergence of Jet-Nemotron.

PostNAS: A Targeted, Cost-Effective Model Transformation

The heart of this breakthrough lies in PostNAS, a neural architecture search framework tailored to retrofit pre-trained models efficiently. The process unfolds as follows:

  • Preserving Learned Intelligence: Starting with a state-of-the-art full-attention model (e.g., Qwen2.5), the multilayer perceptron (MLP) layers are frozen to retain the model’s existing knowledge and minimize retraining expenses.
  • Precision Replacement: The computationally intensive full-attention layers are substituted with JetBlock, a novel linear attention module optimized for NVIDIA’s latest GPU architectures.
  • Hybrid and Hardware-Conscious Optimization: Through super-network training combined with beam search, PostNAS identifies the minimal and most effective placement of full-attention layers required to maintain high accuracy across diverse tasks such as retrieval, mathematics, multi-modal understanding, and coding. This search is customized for specific tasks and hardware, prioritizing throughput on target devices rather than merely reducing parameter counts.
  • Scaling and Deployment: The outcome is a hybrid-architecture LLM that leverages the original model’s intelligence while drastically reducing latency and memory usage.
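The search step above can be sketched as a small beam search over which layer indices keep full attention, with the rest converted to linear-attention blocks. This is a toy illustration, not NVIDIA's actual implementation: the `proxy_score` function is a hypothetical stand-in for evaluating a candidate sub-network on downstream tasks, and all names and numbers are assumptions.

```python
# Toy sketch of a PostNAS-style placement search: pick which of n_layers
# keep full attention (the rest become linear-attention layers), using
# beam search under a budget. proxy_score is a hypothetical stand-in for
# scoring a candidate hybrid network on retrieval/math/coding tasks.

def proxy_score(full_attn_layers, n_layers=12):
    # Hypothetical proxy: pretend the middle layers matter most for
    # accuracy, with a small penalty per retained full-attention layer.
    important = {4, 5, 6}
    return len(important & set(full_attn_layers)) - 0.05 * len(full_attn_layers)

def beam_search_placement(n_layers=12, budget=3, beam_width=4):
    """Grow sets of full-attention layer indices, keeping the best beam."""
    beam = [frozenset()]
    for _ in range(budget):
        candidates = set()
        for placement in beam:
            for layer in range(n_layers):
                if layer not in placement:
                    candidates.add(placement | {layer})
        # Keep only the top-scoring partial placements.
        beam = sorted(candidates, key=proxy_score, reverse=True)[:beam_width]
    return max(beam, key=proxy_score)

best = beam_search_placement()
print(sorted(best))  # [4, 5, 6] under this toy proxy
```

In the real system, scoring a placement means training and evaluating a shared super-network on target tasks and measuring throughput on target hardware, but the search structure is the same: expand candidates, score, and keep a narrow beam.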

JetBlock stands out by employing dynamic causal convolution kernels that adapt based on input, unlike previous static kernel designs in linear attention blocks. It also eliminates unnecessary convolutions to enhance efficiency. Coupled with hardware-aware hyperparameter tuning, JetBlock not only matches but often surpasses prior linear attention models in both speed and accuracy.
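The core idea of a dynamic (input-conditioned) causal convolution can be illustrated in a few lines of NumPy. This is a minimal sketch of the concept only; the shapes, the per-token kernel generator, and the function names are assumptions for illustration, not the published JetBlock design.

```python
import numpy as np

def dynamic_causal_conv(x, w_gen, kernel_size=4):
    """Depthwise causal convolution whose kernel is predicted from the input.

    x:     (seq_len, dim) input features
    w_gen: (dim, dim * kernel_size) weights of a linear layer mapping the
           current token to one convolution kernel per channel
    """
    seq_len, dim = x.shape
    # Causal left-padding so position t only sees positions <= t.
    padded = np.vstack([np.zeros((kernel_size - 1, dim)), x])
    out = np.zeros_like(x)
    for t in range(seq_len):
        # Data-dependent kernel generated from the current token; a static
        # design would instead use one fixed learned kernel for all t.
        kernel = (x[t] @ w_gen).reshape(dim, kernel_size)
        window = padded[t : t + kernel_size]          # (kernel_size, dim)
        # Depthwise: channel d mixes only its own past values.
        out[t] = np.einsum("kd,dk->d", window, kernel)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w = rng.standard_normal((16, 16 * 4)) * 0.1
y = dynamic_causal_conv(x, w)
print(y.shape)  # (8, 16)
```

Because each position's kernel depends only on that position's input and the window looks only backward, perturbing a later token leaves all earlier outputs unchanged, which is the causality property the block must preserve for autoregressive decoding.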

Jet-Nemotron’s Impressive Performance Metrics

NVIDIA’s technical evaluation reveals remarkable results:

| Model | MMLU-Pro Accuracy | Generation Throughput (tokens/s on H100) | KV Cache Size (MB, 64K context) | Remarks |
|---|---|---|---|---|
| Qwen3-1.7B-Base | 37.8% | 61 | 7,168 | Full-attention baseline |
| Jet-Nemotron-2B | 39.0% | 2,885 | 154 | 47× throughput, 47× smaller cache |
| Jet-Nemotron-4B | 44.2% | 1,271 | 258 | 21× throughput, state-of-the-art accuracy |
| Mamba2-2.7B | 8.6% | 2,507 | 80 | All-linear, significantly lower accuracy |
| RWKV7-1.5B | 13.4% | 3,050 | 24 | All-linear, much lower accuracy |
| DeepSeek-V3-Small (MoE) | — | — | — | 2.2B activated of 15B total parameters, lower accuracy |
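The KV cache figures follow from the standard cache-size formula (2× for keys and values, per layer, per KV head, per position). As a sanity check, the baseline's 7,168 MB at 64K context is reproduced below, assuming Qwen3-1.7B-Base uses 28 layers, 8 KV heads, a head dimension of 128, and fp16 storage; these hyperparameters are assumptions for illustration.

```python
def kv_cache_mb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for keys and values, stored per layer, per KV head, per position.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / (1024 ** 2)

# Assumed Qwen3-1.7B-Base-like configuration, fp16, 64K-token context:
print(kv_cache_mb(28, 8, 128, 64 * 1024))  # 7168.0
```

Because the cache grows linearly with context length, the 47× reduction reported for Jet-Nemotron-2B comes from keeping only a handful of full-attention layers; linear-attention layers carry constant-size state instead of a per-token cache.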

Jet-Nemotron-2B not only matches but often surpasses Qwen3-1.7B-Base across key benchmarks, including mathematics, commonsense reasoning, coding, retrieval, and long-context tasks, while delivering a 47-fold increase in generation throughput.

This translates to a 53.6× acceleration in decoding speed at a 256K token context length, resulting in a staggering 98% reduction in inference costs for equivalent token volumes. Additionally, prefill operations are accelerated by over six times at the same context length.
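The 98% figure follows directly from the speedup arithmetic: if the same hardware generates tokens 53.6× faster, the cost per token falls to 1/53.6 of the baseline.

```python
speedup = 53.6
cost_per_token_ratio = 1 / speedup        # fraction of the baseline cost
cost_reduction = 1 - cost_per_token_ratio
print(f"{cost_reduction:.1%}")  # 98.1%
```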

Memory consumption is dramatically reduced by a factor of 47 (154MB cache versus 7,168MB for Qwen3-1.7B-Base), making Jet-Nemotron-2B a game-changer for edge computing. It runs 8.84× faster on Jetson Orin and 6.5× faster on RTX 3090 compared to Qwen2.5-1.5B.

Practical Implications Across Industries

Business Executives: Maximizing Return on Investment

  • Cost-effective large-scale inference: With a 53× throughput improvement, organizations can serve exponentially more users for the same cost or drastically reduce hosting expenses by up to 98%.
  • Enhanced operational efficiency: Reduced latency, increased batch sizes, and alleviated memory constraints enable cloud providers to deliver cutting-edge AI services at commodity prices.
  • New business opportunities: Previously cost-prohibitive applications such as real-time document analysis, extended-context conversational agents, and on-device AI copilots become feasible.

Developers and AI Practitioners: State-of-the-Art Performance on Edge Devices

  • No need for accuracy-compromising techniques: Jet-Nemotron’s compact KV cache (154MB) and 2B parameter size fit comfortably on devices like Jetson Orin, RTX 3090, and even some mobile processors, eliminating reliance on cloud offloading.
  • Seamless integration: Existing checkpoints from models like Qwen, Llama, or Gemma can be retrofitted without retraining or modifying data pipelines, preserving accuracy.
  • Instant, scalable AI services: Applications such as search engines, AI copilots, summarization tools, and code generation become more responsive and scalable.

Researchers: Accelerating Innovation with Lower Barriers

  • Reduced costs for architecture experimentation: PostNAS enables architecture search on frozen backbone models, cutting down time and financial investment compared to full pre-training cycles.
  • Hardware-aware optimization as a new standard: By factoring in KV cache size alongside parameter count, Jet-Nemotron introduces a paradigm shift in evaluating and optimizing model efficiency for real-world deployment.
  • Faster iteration cycles: PostNAS serves as a rapid testing platform-new attention mechanisms can be validated quickly before committing to expensive pre-training.

Conclusion

The release of Jet-Nemotron and its core component JetBlock (with open-source code available) empowers the AI community to retrofit existing models for unprecedented efficiency gains. PostNAS is more than a one-time innovation; it represents a versatile framework capable of accelerating any Transformer-based model, significantly lowering the barriers to future breakthroughs in large language model research and deployment.
