AI21’s Jamba Reasoning 3B redefines what ‘small’ means in LLMs: 250K context on a laptop


    AI21 Labs has introduced a new player in the enterprise-focused small model arena, aiming to alleviate data center congestion by enabling AI models to operate directly on user devices.

    Introducing Jamba Reasoning 3B: Compact Yet Powerful

    The company’s latest offering, Jamba Reasoning 3B, is an open-source, lightweight model capable of extended reasoning, code generation, and fact-based responses. This model can process over 250,000 tokens in a single inference and is optimized to run efficiently on edge devices such as laptops and smartphones.
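
    For readers who want to try it, a minimal local-inference sketch using the Hugging Face transformers library follows. The repository id is an assumption based on AI21’s naming conventions, so check the official model card before running it.

    ```python
    # Minimal sketch: run Jamba Reasoning 3B locally with transformers.
    # The repo id is an assumption; confirm it on AI21's Hugging Face page.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ai21labs/AI21-Jamba-Reasoning-3B"  # assumed repository id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # bf16/fp16 where the hardware supports it
        device_map="auto",    # falls back to CPU on GPU-less laptops
    )

    prompt = "Draft a three-item agenda for a 30-minute project sync."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```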

    According to Ori Goshen, AI21’s co-CEO, the shift toward deploying AI inference on local devices is driven by economic factors. “The current industry landscape faces costly data center expansions, and the revenue generated does not justify the rapid depreciation of hardware investments,” Goshen explained. He envisions a hybrid future where AI computations are split between on-device processing and GPU-powered data centers, balancing efficiency and performance.

    Performance and Architecture: Tested on Everyday Hardware

    Jamba Reasoning 3B pairs Mamba state-space layers with Transformer attention in a hybrid design, enabling it to handle a 250,000-token context window on consumer-grade hardware. The hybrid accelerates inference by 2 to 4 times compared with pure-Transformer models of similar size, and it consumes less memory, lowering overall computational demands.
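
    The memory claim is easiest to see in the key-value cache: attention layers must cache keys and values for every token in the window, so cache size grows linearly with context length, while Mamba-style state-space layers carry a fixed-size state. The back-of-the-envelope sketch below uses illustrative layer counts and dimensions, not Jamba’s published configuration.

    ```python
    # Back-of-the-envelope KV-cache arithmetic. All sizes are illustrative
    # assumptions, not Jamba Reasoning 3B's published configuration.
    BYTES_PER_VALUE = 2      # fp16/bf16
    CONTEXT = 250_000        # tokens in the window
    N_LAYERS = 32
    N_KV_HEADS = 8
    HEAD_DIM = 128

    # Pure-attention model: K and V cached for every token in every layer.
    full_cache = CONTEXT * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE
    print(f"attention-only KV cache: {full_cache / 1e9:.1f} GB")

    # Hybrid: suppose only 1 layer in 8 uses attention; the Mamba layers keep
    # a small constant-size state no matter how long the context grows.
    attn_layers = N_LAYERS // 8
    hybrid_cache = CONTEXT * attn_layers * N_KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE
    print(f"hybrid KV cache: {hybrid_cache / 1e9:.1f} GB")
    ```

    Under these toy numbers the cache shrinks from roughly 33 GB to about 4 GB, the kind of reduction that makes a 250,000-token window plausible on a laptop.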

    In practical tests on a standard MacBook Pro, the model achieved a processing rate of 35 tokens per second. Goshen highlighted that Jamba Reasoning 3B excels in tasks such as function calling, policy-driven content generation, and routing to external tools. For example, routine queries like generating meeting agendas or retrieving event details can be efficiently handled on-device, while more complex reasoning tasks can be offloaded to GPU clusters.
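
    In practice, the hybrid split Goshen describes boils down to a per-request routing decision. The sketch below is hypothetical: the task labels, the length threshold, and both destinations are invented for illustration, standing in for whatever classifier a real deployment would use.

    ```python
    # Hypothetical on-device vs. cloud router. Task labels and the threshold
    # are illustrative placeholders, not part of any AI21 API.
    LOCAL_TASKS = {"meeting_agenda", "event_lookup", "function_call", "tool_routing"}

    def route(task: str, prompt: str) -> str:
        """Return 'device' for routine work, 'gpu_cluster' for heavy reasoning."""
        if task in LOCAL_TASKS and len(prompt) < 4_000:
            return "device"       # served by the local 3B model
        return "gpu_cluster"      # offloaded to a larger hosted model

    print(route("meeting_agenda", "Draft an agenda for tomorrow's sync"))   # device
    print(route("multi_step_analysis", "Compare these 40 contracts ..."))   # gpu_cluster
    ```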

    Enterprise Adoption of Compact AI Models

    Businesses are increasingly adopting a blend of small, specialized models tailored to their industry needs alongside streamlined versions of large language models (LLMs). For instance, recent releases include models ranging from 140 million to 950 million parameters, designed specifically for domains like mathematics, coding, and scientific analysis rather than conversational AI. These models are optimized to run on devices with limited computational resources.

    Earlier entrants like Gemma pioneered the concept of portable AI models for laptops and mobile devices, setting the stage for newer innovations. Financial services companies have also developed niche models, such as FICO’s Focused Language and Focused Sequence models, which are fine-tuned to answer finance-specific queries with precision.

    What sets Jamba Reasoning 3B apart, according to Goshen, is its remarkably small size combined with the ability to perform complex reasoning tasks without compromising speed or efficiency.

    Benchmark Results and Privacy Advantages

    In comparative evaluations, Jamba Reasoning 3B outperformed other compact models such as Llama 3.2 3B and Phi-4-Mini. It led the pack on the IFBench and Humanity’s Last Exam benchmarks, though it ranked just behind Qwen 4 on the MMLU-Pro test.

    Beyond performance, small models like Jamba Reasoning 3B offer enhanced privacy benefits for enterprises by enabling local inference, thereby eliminating the need to transmit sensitive data to external servers. Goshen emphasized, “Optimizing AI experiences for end-users increasingly involves keeping models on devices, which aligns with privacy and customization priorities.”

    Looking Ahead: The Hybrid AI Ecosystem

    As AI continues to evolve, the balance between on-device processing and cloud-based computation will become crucial. Models like Jamba Reasoning 3B exemplify this trend by delivering powerful, efficient AI capabilities directly on everyday devices, reducing reliance on expensive data center infrastructure while maintaining high performance and privacy standards.
