
Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving


Revolutionizing AI Infrastructure for Large-Scale Language Models

The rapid evolution of large language models (LLMs) is marked by exponential growth in parameter counts, the adoption of sparse mixture-of-experts (MoE) architectures, and ever-longer context windows. Cutting-edge models such as DeepSeek-R1, LLaMA-4, and Qwen-3 now reach hundreds of billions to over a trillion parameters, pushing the limits of compute, memory bandwidth, and inter-chip communication. While MoE architectures improve computational efficiency by activating only a subset of experts per token, they introduce complexities in routing and load balancing. Context lengths extending beyond one million tokens place heavy demands on attention computation and key-value (KV) cache storage, which must scale with both sequence length and user concurrency. Real-world deployments face further hurdles from unpredictable input patterns, uneven expert utilization, and bursty traffic, necessitating a redesign of AI infrastructure around hardware-software co-design, adaptive orchestration, and elastic resource allocation.
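The selective activation at the heart of MoE can be illustrated with a minimal top-k gating sketch. This is a generic toy routine, not CloudMatrix's or DeepSeek's actual router; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def topk_gate(logits, k=2):
    """Select the top-k experts per token and softmax-normalize their gate weights.

    logits: (tokens, num_experts) router scores.
    Returns (idx, gates): chosen expert ids and their mixing weights.
    """
    idx = np.argsort(logits, axis=-1)[:, -k:]            # top-k expert ids per token
    gates = np.take_along_axis(logits, idx, axis=-1)     # scores of the chosen experts
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)           # softmax over the k selected
    return idx, gates

# 4 tokens routed among 8 experts, 2 active per token
rng = np.random.default_rng(0)
idx, gates = topk_gate(rng.normal(size=(4, 8)), k=2)
```

Because each token lands on a different subset of experts, per-expert load is data-dependent, which is exactly the load-balancing problem described above.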

Key Trends Driving LLM Development

Three pivotal trends define the current LLM landscape: relentless growth in parameter counts, the rise of sparse MoE architectures, and the expansion of context windows to support long-form reasoning. Models such as LLaMA 4 and DeepSeek-V3 have scaled toward the trillion-parameter regime, using MoE to activate only the relevant experts for each token and thereby balance model capacity against computational load. Meanwhile, context windows now span hundreds of thousands to over a million tokens, enabling complex multi-turn dialogue and document-level understanding. These gains place extraordinary pressure on data center resources, demanding more compute, memory capacity, and bandwidth, and they complicate parallelization, workload heterogeneity, data synchronization, and storage throughput.
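Why long contexts strain memory becomes concrete with a back-of-the-envelope KV cache estimate. The model dimensions below are illustrative placeholders, not the actual configuration of any model named above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch, dtype_bytes=2):
    """Total KV cache size: two tensors (K and V) per layer,
    scaled by context length and the number of concurrent requests."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * dtype_bytes

# Illustrative figures: 61 layers, 8 KV heads of dim 128,
# a 1M-token context, 16 concurrent requests, FP16 storage.
gib = kv_cache_bytes(61, 8, 128, 1_000_000, 16) / 2**30  # roughly 3.7 TiB
```

Even with grouped KV heads, a million-token context at modest concurrency runs into terabytes of cache, which is why the KV cache must be pooled and scaled independently of compute.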

Introducing CloudMatrix: A Paradigm Shift in AI Data Center Architecture

To address these escalating demands, Huawei researchers have developed CloudMatrix, an innovative AI data center architecture tailored for large-scale LLM workloads. The inaugural deployment, CloudMatrix384, integrates 384 Ascend 910C Neural Processing Units (NPUs) alongside 192 Kunpeng CPUs, interconnected via a high-bandwidth, low-latency Unified Bus. This architecture enables seamless peer-to-peer communication across all nodes, facilitating flexible pooling and dynamic scaling of compute, memory, and network resources. Such a design is particularly advantageous for MoE models, where expert parallelism and distributed KV cache access require intensive inter-node communication. By eliminating traditional hierarchical bottlenecks, CloudMatrix384 offers a unified system that excels in communication-heavy AI tasks.
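The communication pattern that benefits most from this flat, peer-to-peer topology is all-to-all exchange, as used when dispatching tokens to experts across nodes. The toy model below captures only the data movement, not the Unified Bus transport itself; the function name is an assumption.

```python
def all_to_all(buffers):
    """Toy all-to-all exchange among n peers.

    buffers[src][dst] is the shard rank `src` holds destined for rank `dst`.
    In a peer-to-peer fabric every (src, dst) pair transfers directly in one
    hop; the result is the transposed layout, where each rank holds the
    shards addressed to it.
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]

# 3 peers, each holding one shard for every peer (including itself)
shards = [[f"s{i}{j}" for j in range(3)] for i in range(3)]
gathered = all_to_all(shards)
```

In a hierarchical topology the same exchange would funnel through shared uplinks; a uniform-bandwidth peer-to-peer fabric lets every pair transfer concurrently, which is the property the article attributes to the Unified Bus.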

CloudMatrix-Infer: Optimized Serving Framework for Scalable LLMs

Building on the hardware foundation, CloudMatrix-Infer is a specialized serving framework engineered to maximize the potential of the CloudMatrix architecture. It partitions workloads into distinct pools for prefill, decoding, and caching, enabling efficient resource utilization and parallelism at scale. The framework supports extensive expert parallelism and incorporates hardware-conscious optimizations such as pipelining and INT8 quantization, which reduce computational overhead without sacrificing model accuracy. Benchmarking with the DeepSeek-R1 model on the CloudMatrix384 supernode demonstrated remarkable performance: a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second per NPU, while keeping time-per-output-token (TPOT) below 50 milliseconds. Even under a stringent 15-millisecond TPOT constraint, the system sustained 538 tokens per second during decoding. Notably, INT8 quantization preserved accuracy across 16 diverse benchmarks, confirming that the efficiency gains do not compromise output quality.
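To see why INT8 can cut memory traffic roughly in half with little accuracy loss, here is a minimal symmetric per-channel weight quantization sketch. This is a generic textbook scheme, not the specific recipe CloudMatrix-Infer uses.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    Each row gets its own scale so that its largest magnitude maps to 127.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # (rows, 1)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float matrix from INT8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by half a quantization step
```

Per-channel scales keep the rounding error proportional to each channel's own range, which is why accuracy holds up on downstream benchmarks.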

Performance Highlights and Comparative Analysis

Compared with leading systems such as SGLang running on NVIDIA H100 GPUs and DeepSeek's serving stack on H800 hardware, CloudMatrix-Infer consistently delivers stronger throughput, latency, and scalability results. The peer-to-peer Unified Bus enables direct all-to-all communication, which is critical for the dense data flows of MoE routing and long-context processing. This design mitigates the communication bottlenecks typical of hierarchical network topologies, yielding smoother workload distribution and higher sustained utilization. The integration of 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode exemplifies a new class of disaggregated yet tightly coupled AI data center systems.

Future Outlook: Scaling AI with CloudMatrix

CloudMatrix represents a forward-looking approach to AI infrastructure, designed to transcend the limitations of conventional cluster architectures. By harmonizing high-bandwidth interconnects, resource disaggregation, and intelligent orchestration, it lays the groundwork for next-generation LLM deployments that demand both scale and efficiency. The success of CloudMatrix384 and CloudMatrix-Infer in handling trillion-parameter models with massive context windows underscores the architecture’s potential to support increasingly sophisticated AI applications, from real-time language translation to complex multi-document analysis. As AI models continue to grow in size and complexity, architectures like CloudMatrix will be essential to meet the computational and operational challenges ahead.

CloudMatrix AI Data Center Architecture
