Attention ISN’T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique

Since its groundbreaking introduction in 2017, the transformer architecture has become a foundational pillar in the evolution of artificial intelligence. This design, first unveiled in a landmark Google publication, revolutionized how models process information by leveraging the attention mechanism: a mathematical technique that lets models dynamically weigh the relevance of every part of their input.

Today, nearly all prominent large language models (LLMs), including OpenAI’s GPT series, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama, are built upon variations of this attention-based framework. However, despite its transformative impact, the attention mechanism is increasingly revealing critical limitations. Its computational and memory demands grow quadratically with the length of the input sequence, making it prohibitively expensive for tasks requiring analysis of extensive documents, lengthy codebases, or prolonged video streams. This scalability bottleneck threatens to constrain both academic research and industrial applications as AI models strive to handle ever-larger contexts.

Introducing Power Retention: A New Paradigm Beyond Attention

On October 28, 2025, Manifest AI, a relatively obscure startup, unveiled a novel approach that challenges the dominance of attention-based transformers. Their new model, Brumby-14B-Base, is a retrained adaptation of the open-source transformer Qwen3-14B-Base, but with a crucial difference: it completely eliminates the attention layers.

Instead, Brumby employs an innovative mechanism called Power Retention. This recurrent, hardware-friendly architecture maintains and updates information across arbitrarily long sequences without incurring the quadratic memory growth typical of attention. By compressing past context into a fixed-size latent state, Power Retention enables efficient processing of extremely long inputs while preserving the model’s expressive capabilities.

Remarkably, Brumby-14B-Base, with its 14 billion parameters, was trained at a modest cost of approximately $4,000 (an order of magnitude cheaper than conventional transformer training budgets) and achieves performance comparable to leading transformer models such as Qwen3-14B and GLM-4.5-Air on a variety of reasoning and comprehension benchmarks.

Architectural Innovation: From Quadratic Attention to Linear Retention

Traditional transformers rely on computing queries (Q), keys (K), and values (V) for each token, then performing pairwise similarity calculations across the entire input sequence. This full attention operation, while powerful, scales quadratically with sequence length, causing computational and memory costs to skyrocket as inputs grow longer.
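As a concrete illustration, here is a minimal, dependency-free sketch of scaled dot-product attention. Note that every query is scored against every key, producing an n × n score matrix; that pairwise comparison is where the quadratic cost comes from.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence.
    Each of the n queries is compared to all n keys, so the score
    matrix is n x n and cost grows quadratically with length n."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out

# Two identical keys -> uniform weights -> the output is the mean of the values.
print(full_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]],
                     [[2.0, 0.0], [4.0, 0.0]]))  # [[3.0, 0.0]]
```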

Power Retention retains the Q, K, and V inputs but replaces the global attention matrix with a recurrent memory matrix, denoted as S. At each time step, S is updated using the incoming key, value, and a learned gating mechanism, effectively summarizing past information into a fixed-size state. This approach resembles recurrent neural networks (RNNs) more than transformers, enabling constant per-token computational cost regardless of sequence length.
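The recurrence described above can be sketched as a gated linear update. This toy version uses a fixed scalar gate and a plain outer-product write; Manifest AI's actual Power Retention layer uses a learned gating mechanism and a richer update, so treat this only as an illustration of the fixed-size-state idea.

```python
def retention_step(S, k, v, g):
    """Decay the old state by gate g, then write the outer product k v^T.
    The state S stays d x d no matter how long the sequence grows."""
    d = len(k)
    return [[g * S[i][j] + k[i] * v[j] for j in range(d)] for i in range(d)]

def retention_forward(Q, K, V, g=0.9):
    """Left-to-right recurrence: per-token cost is O(d^2), independent
    of how many tokens came before (contrast with full attention)."""
    d = len(Q[0])
    S = [[0.0] * d for _ in range(d)]
    out = []
    for q, k, v in zip(Q, K, V):
        S = retention_step(S, k, v, g)
        out.append([sum(q[i] * S[i][j] for i in range(d)) for j in range(d)])
    return out
```

With the gate set to zero the state holds only the current token, so the read-out reduces to (q · k) v, which makes the mechanics easy to check by hand.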

Because the recurrence involves tensor powers of the input (hence the term “power retention”), the model can capture complex, higher-order relationships between tokens over long distances. This design theoretically allows indefinite retention of long-term dependencies while maintaining the efficiency of RNNs and the representational power of transformers.
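To get a feel for what tensor powers contribute, here is a toy degree-p feature map that expands a vector into all degree-p products of its coordinates. A state accumulated over such features can encode pairwise and higher-order interactions; the construction is purely illustrative and is not Manifest AI's formulation.

```python
from itertools import product

def power_features(x, p=2):
    """Degree-p tensor power of x: every product of p coordinates.
    A toy stand-in for the higher-order terms 'power retention' refers to."""
    feats = []
    for idx in product(range(len(x)), repeat=p):
        term = 1.0
        for i in idx:
            term *= x[i]
        feats.append(term)
    return feats  # length len(x) ** p
```

A state built from these features grows as d^p in the feature dimension, but it remains fixed with respect to sequence length, which is the property that matters for long contexts.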

Efficient Retraining: Leveraging Existing Models for Rapid Adaptation

One of the most striking achievements of Brumby-14B is its training efficiency. Manifest AI retrained the model in just 60 hours on 32 Nvidia H100 GPUs at a cost of roughly $4,000, less than 2% of the typical expense of training a model of similar size from scratch.

However, this cost efficiency hinges on starting from a pretrained transformer checkpoint. As Jacob Buckman, Manifest AI’s founder, explained, “Training Brumby from scratch at this price point is not feasible. The low cost is possible because we build upon the weights of an existing transformer.”

This retraining process involved removing the attention layers from Qwen3-14B-Base and substituting them with Power Retention modules. Since the original weights were optimized for attention dynamics, the model initially struggled to utilize its learned knowledge effectively. Approximately 3,000 additional training steps served as a “relearning” phase, recalibrating the weights to function within the new architecture.

An apt analogy is retraining a virtuoso pianist to play guitar: while the underlying musical understanding remains, the physical execution requires adaptation. After this brief retraining, Brumby regained performance parity with the original transformer model, demonstrating that attention-free architectures can inherit and adapt prior knowledge efficiently.

Performance Benchmarks: Matching Transformers on Key Tasks

Brumby-14B-Base delivers results broadly comparable to its transformer counterparts across a suite of standard benchmarks:

Task             | Brumby-14B | Qwen3-14B | GLM-4.5-Air | Nemotron Nano (12B)
ARC              | 0.89       | 0.94      | 0.92        | 0.93
GSM8K            | 0.88       | 0.84      | 0.83        | 0.84
GSM8K (Platinum) | 0.87       | 0.88      | 0.85        | 0.87
HellaSwag        | 0.77       | 0.81      | 0.85        | 0.82
MATH             | 0.62       | 0.54      | 0.47        | 0.26
MBPP             | 0.57       | 0.75      | 0.73        | 0.71
MMLU             | 0.71       | 0.78      | 0.77        | 0.78
MMLU (Pro)       | 0.36       | 0.55      | 0.51        | 0.53

While Brumby trails its transformer counterparts on knowledge-intensive benchmarks such as MMLU-Pro and on code generation (MBPP), it leads on mathematical reasoning tasks like MATH and GSM8K, and its recurrent design targets long-context workloads where attention-based models often struggle. This suggests that retention-based architectures may offer structural advantages for reasoning over extended sequences and complex dependencies.

Hardware Advantages and Inference Speed

Beyond algorithmic efficiency, Power Retention’s design significantly enhances hardware utilization. Because the recurrent state updates involve only localized matrix operations, inference scales linearly with input length.
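A back-of-the-envelope FLOP count makes the scaling difference concrete. Under the simplifying assumption that full attention costs about 2·n²·d operations (scores plus weighted sum) and a recurrent d × d state update with read-out costs about 2·n·d² operations, the ratio is simply n/d, so the gap widens linearly with sequence length:

```python
def attention_flops(n, d):
    """Rough cost of full attention: n x n scores plus weighted sum, ~2 n^2 d."""
    return 2 * n * n * d

def retention_flops(n, d):
    """Rough cost of a d x d state update plus read-out per token, ~2 n d^2."""
    return 2 * n * d * d

# With a head/state dimension of 128, the advantage grows with n:
for n in (1_000, 100_000, 1_000_000):
    print(n, attention_flops(n, 128) // retention_flops(n, 128))
# prints: 1000 7 / 100000 781 / 1000000 7812
```

These are idealized operation counts, not measured wall-clock speedups; real kernels add constant factors on both sides.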

Manifest AI’s custom CUDA kernels, developed within their Vidrial framework, reportedly achieve hardware utilization rates of 80-85%, surpassing FlashAttention2’s 70-75% and Mamba’s 50-60%. (Mamba is another emerging post-transformer architecture that replaces attention with a state-space model, offering linear complexity but lower hardware efficiency in early tests.)

These improvements translate into speedups of up to 100× on very long sequences compared to traditional attention mechanisms, although Manifest AI notes that production-scale stress testing is ongoing.

Economic Impact and Scalability

The $4,000 training cost for a 14-billion-parameter model represents a dramatic reduction in foundation model development expenses. Buckman highlighted that retraining efficiency improves with model size, with larger models requiring fewer steps to adapt successfully.

While Manifest AI has yet to validate retraining costs for models in the 700-billion-parameter range, preliminary estimates suggest expenses between $10,000 and $20,000-still substantially below typical transformer training budgets. This cost-effectiveness could democratize access to large-scale AI experimentation, enabling smaller organizations and research groups to innovate without prohibitive compute investments.

Seamless Integration and Future Deployment

Manifest AI designed the Power Retention approach for easy adoption. According to Buckman, integrating it into existing workflows requires minimal code changes-often just a single line modification and resuming training from a checkpoint.
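In spirit, such a swap amounts to replacing each block's sequence-mixing module and then resuming training from the existing checkpoint. The class and function names below are hypothetical stand-ins to illustrate the shape of the change, not Manifest AI's actual API:

```python
# Hypothetical sketch: all names here are illustrative, not Manifest AI's API.

class Attention:
    kind = "attention"

class PowerRetention:
    kind = "power_retention"

class Block:
    """Stand-in for one transformer block with a swappable sequence mixer."""
    def __init__(self, mixer):
        self.mixer = mixer

def convert_to_retention(blocks):
    """Conceptually the 'one-line' change: point each block at a retention
    mixer; the relearning phase then recalibrates the inherited weights."""
    for block in blocks:
        block.mixer = PowerRetention()
    return blocks

model = [Block(Attention()) for _ in range(4)]
convert_to_retention(model)
print({b.mixer.kind for b in model})  # {'power_retention'}
```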

The architecture supports faster training and inference on long contexts, with kernels compatible across NVIDIA and AMD GPUs via Triton, alongside specialized CUDA implementations. While integration with popular inference engines like vLLM is underway, early results indicate that distributed inference and multi-user GPU partitioning are more straightforward with this recurrent-state design.

Vision for the Future: Modeling Intelligence at Scale

Manifest AI’s broader ambition extends beyond incremental improvements. Their mission is to develop neural networks capable of modeling the full spectrum of human output-not just the static artifacts of intelligence but the dynamic cognitive processes that generate them.

Power Retention represents an initial step toward architectures that can continuously and efficiently simulate thought processes over extended periods, potentially transforming how AI systems understand and interact with complex information.

Community Response and Industry Perspectives

The announcement of Brumby-14B sparked lively debate within the AI research community. Some critics argued that the $4,000 training cost was misleading, emphasizing that the figure reflects retraining from pretrained transformer weights rather than training a foundation model from scratch.

In response, Buckman clarified that the cost claim was accurate within the context of retraining and that the initial public messaging was part of a broader explanation thread. He acknowledged that while the transformer era is not over, innovations like Brumby mark the beginning of a transition toward new modeling paradigms.

Conclusion: A New Chapter in AI Architecture

Brumby-14B-Base’s release signals a potential turning point in AI development. By substituting attention with Power Retention, Manifest AI has demonstrated that it is possible to achieve transformer-level performance with drastically reduced computational overhead and without specialized hardware.

This breakthrough could reshape the economics of training and deploying large models, lowering barriers for open research and smaller enterprises. Moreover, it may catalyze renewed architectural diversity in AI, ending the half-decade dominance of transformers and inspiring fresh theoretical and practical exploration.

As Buckman aptly summarized, “The transformer era is not finished, but the journey toward its successor has begun.”
