
A newly released 14-page technical paper from the team behind DeepSeek-V3, with DeepSeek CEO Wenfeng Liang as a co-author, examines the “Scaling Challenges and Reflections on Hardware for AI Architectures.” This follow-up to their initial technical report delves into the intricate relationship between large language model (LLM) development, training, and the underlying hardware infrastructure. The paper moves beyond the architectural specifics of DeepSeek-V3 to explore how hardware-aware model co-design can address the limitations of current hardware and enable cost-efficient large-scale training and inference.
The rapid scaling of LLMs has exposed critical bottlenecks in current hardware architectures, particularly concerning memory capacity, computational efficiency, and interconnect bandwidth. DeepSeek-V3, trained on a cluster of 2048 NVIDIA H800 GPUs, serves as a compelling case study demonstrating how a synergistic approach between model design and hardware considerations can overcome these limitations. This research focuses on the interplay between hardware architecture and model design in achieving economical large-scale training and inference, aiming to provide actionable insights for efficiently scaling LLMs without compromising performance or accessibility.
Key areas of focus in the paper include:
- Hardware-Driven Model Design: Analyzing how hardware characteristics, such as FP8 low-precision computation and scale-up/scale-out network properties, influence architectural choices within DeepSeek-V3.
- Hardware-Model Interdependencies: Investigating how hardware capabilities shape model innovation and how the evolving demands of LLMs drive requirements for next-generation hardware.
- Future Directions for Hardware Development: Drawing practical insights from DeepSeek-V3 to guide the co-design of future hardware and model architectures for scalable and cost-effective AI systems.
DeepSeek-V3’s Design Principles: Addressing Core Scaling Challenges
DeepSeek-V3 incorporates several key architectural innovations, as illustrated in Figure 1 of the paper, including the DeepSeekMoE architecture and Multi-head Latent Attention (MLA). These designs directly tackle the core challenges of scaling LLMs: memory efficiency, cost-effectiveness, and inference speed.
Memory Efficiency: MLA and KV Cache Optimization
LLMs exhibit exponential growth in memory demands, outpacing the slower growth of high-speed memory like HBM. While multi-node parallelism offers a solution, optimizing memory usage at the source remains crucial. DeepSeek addresses this bottleneck with Multi-head Latent Attention (MLA), which employs projection matrices to compress the key-value (KV) representations of all attention heads into a smaller latent vector, trained jointly with the model. During inference, only this compressed latent vector needs to be cached, significantly reducing memory consumption compared to storing full KV caches for each head.
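To make the mechanics concrete, here is a minimal NumPy sketch of the idea. The dimensions, matrix names, and initialization are hypothetical (not DeepSeek-V3’s actual configuration); the point is only that a single small latent vector is cached per token, while per-head keys and values are re-derived from it on the fly.

```python
import numpy as np

# Minimal sketch of MLA-style KV compression (hypothetical dimensions, not
# DeepSeek's implementation). Per token, only the small latent vector c_kv is
# cached; per-head keys/values are re-projected from it during attention.

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # compress to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> values

def cache_token(h):
    """Cache only the compressed latent for a token's hidden state h."""
    return h @ W_down                                  # shape: (d_latent,)

def expand_kv(c_kv):
    """Reconstruct per-head keys and values from the cached latent."""
    k = (c_kv @ W_up_k).reshape(n_heads, d_head)
    v = (c_kv @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.standard_normal(d_model)
c_kv = cache_token(h)
k, v = expand_kv(c_kv)
print("reconstructed K/V per head:", k.shape)

full_cache = n_heads * d_head * 2   # floats per token with a full per-head KV cache
mla_cache = d_latent                # floats per token with MLA
print(f"full KV floats/token: {full_cache}, MLA floats/token: {mla_cache}")
```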
Beyond MLA, DeepSeek highlights other valuable techniques for KV cache size reduction, providing inspiration for future advancements in memory-efficient attention mechanisms:
- Shared KV (Grouped-Query Attention, GQA; Multi-Query Attention, MQA): Multiple attention heads share a single set of key-value pairs, drastically compressing storage.
- Window KV: Limiting the context window for KV caching.
- Quantization Compression: Reducing the precision of stored KV values.
Table 1 in the paper compares the per-token KV cache memory footprint of DeepSeek-V3, Qwen-2.5 72B, and LLaMA-3.1 405B. DeepSeek-V3 achieves a remarkable reduction, requiring only 70 KB per token, significantly lower than LLaMA-3.1 405B’s 516 KB and Qwen-2.5 72B’s 327 KB.
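As a back-of-the-envelope check on the 70 KB figure, the snippet below reproduces the arithmetic from DeepSeek-V3’s published configuration (61 layers, a 512-dimensional compressed KV latent plus a 64-dimensional decoupled RoPE key per token per layer, cached in BF16); treat it as an illustration of how the footprint is computed rather than the paper’s official breakdown.

```python
# Back-of-the-envelope per-token KV cache estimate for MLA (illustrative;
# based on DeepSeek-V3's published config: 61 layers, 512-dim KV latent,
# 64-dim decoupled RoPE key, BF16 storage).
layers = 61
latent_dim = 512        # compressed KV latent per token per layer
rope_key_dim = 64       # decoupled rotary key per token per layer
bytes_per_value = 2     # BF16

per_token_bytes = layers * (latent_dim + rope_key_dim) * bytes_per_value
print(f"~{per_token_bytes / 1024:.1f} KB per token")  # ~68.6 KB, i.e. the ~70 KB in Table 1
```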
Cost-Effectiveness: DeepSeekMoE for Sparse Computation
For sparse computation, DeepSeek developed DeepSeekMoE, an advanced Mixture-of-Experts (MoE) architecture (Figure 1, bottom right). MoE models offer two key advantages in terms of cost-effectiveness:
- Reduced Training Compute: By selectively activating a subset of expert parameters per token, MoE architectures allow a substantial increase in the total number of parameters while keeping computational demands manageable (a minimal routing sketch follows this list). For instance, DeepSeek-V3 has 671B parameters, nearly three times that of its predecessor V2 (236B), yet activates only 37B parameters per token. In contrast, dense models like Qwen-2.5 72B and LLaMA-3.1 405B require all parameters to be active during training. Table 2 shows that DeepSeek-V3 achieves comparable or superior performance to these dense models at a fraction of the per-token training compute: around 250 GFLOPs per token, versus 394 GFLOPs for the 72B dense model and 2,448 GFLOPs for the 405B dense model.
- Advantages for Personal Use and Local Deployment: The selective activation of parameters in MoE models translates to significantly lower memory and compute requirements during single-request inference. DeepSeek-V2 (236B parameters), for example, only activates 21B parameters during inference, enabling near or above 20 tokens per second (TPS) on AI SoC-equipped personal computers — a capability far exceeding that of similarly sized dense models on comparable hardware. This opens possibilities for personalized LLM agents running locally.
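The routing sketch below (plain NumPy, with deliberately small hypothetical sizes rather than DeepSeekMoE’s actual 256-expert configuration, and without the shared experts, bias terms, and load-balancing machinery of the real model) shows why activated compute scales with the number of selected experts rather than the total expert count.

```python
import numpy as np

# Minimal sketch of MoE top-k routing (hypothetical sizes, not the DeepSeekMoE
# implementation): each token activates only k experts, so per-token compute
# scales with k, not with the total expert count.

n_experts, top_k, d_model = 64, 8, 512

rng = np.random.default_rng(0)
router = rng.standard_normal((d_model, n_experts)) * 0.02

def route(token):
    """Return the indices and normalized weights of the k selected experts."""
    logits = token @ router
    topk_idx = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[topk_idx])
    return topk_idx, weights / weights.sum()

token = rng.standard_normal(d_model)
idx, w = route(token)
print("activated experts:", sorted(idx.tolist()))
print(f"fraction of experts used per token: {top_k / n_experts:.1%}")
```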
Enhanced Inference Speed: Overlapping Computation and Communication
DeepSeek prioritizes both system-level maximum throughput and single-request latency for inference speed. To maximize throughput, the serving architecture was built around dual micro-batch overlapping from the outset, deliberately hiding communication latency behind computation.
Furthermore, DeepSeek decouples the computation of MLA and MoE into distinct stages. While one micro-batch performs part of the MLA or MoE computation, the other concurrently executes the corresponding dispatch (all-to-all) communication. Conversely, during the second micro-batch’s computation phase, the first micro-batch carries out the combine communication step. This pipelined approach overlaps all-to-all communication with continuous computation, keeping the GPUs fully utilized. In production, DeepSeek uses a prefill-decode separation architecture, assigning large-batch prefill requests and latency-sensitive decode requests to expert-parallel groups of different sizes, maximizing system throughput under real-world serving conditions.
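The toy scheduler below illustrates the ping-pong pattern, with Python threads and sleeps standing in for GPU kernels and all-to-all transfers; it is a scheduling sketch of the idea described above, not DeepSeek’s kernel-level implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of dual micro-batch overlapping: while micro-batch A is in
# an MLA/MoE compute stage, micro-batch B performs its all-to-all
# dispatch/combine communication, and then the roles swap.

def compute(mb, stage):
    time.sleep(0.05)                      # stand-in for MLA/MoE compute
    print(f"micro-batch {mb}: compute {stage} done")

def communicate(mb, stage):
    time.sleep(0.05)                      # stand-in for all-to-all dispatch/combine
    print(f"micro-batch {mb}: comm {stage} done")

with ThreadPoolExecutor(max_workers=2) as pool:
    for stage in ("attention", "moe"):
        # A computes while B communicates ...
        f1 = pool.submit(compute, "A", stage)
        f2 = pool.submit(communicate, "B", stage)
        f1.result(); f2.result()
        # ... then A communicates while B computes.
        f3 = pool.submit(communicate, "A", stage)
        f4 = pool.submit(compute, "B", stage)
        f3.result(); f4.result()
```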
The paper also touches upon the importance of test-time scaling for reasoning models and highlights the critical role of high token output speed in reinforcement learning workflows and for reducing user-perceived latency in long inference sequences. Optimizing inference speed through hardware-software co-innovation is therefore paramount for the efficiency of reasoning models.
Low-Precision Driven Design: FP8 Training and LogFMT
FP8 Mixed-Precision Training
While quantization techniques like GPTQ and AWQ have significantly reduced memory requirements, they are applied primarily at inference time. DeepSeek instead pioneered FP8 mixed-precision training for a large-scale MoE model. Although NVIDIA’s Transformer Engine has supported FP8 for some time, DeepSeek-V3 is the first publicly known large model to leverage FP8 for training. This achievement, the result of close collaboration between the infrastructure and algorithm teams along with extensive experimentation, significantly reduces computational costs while maintaining model quality, making large-scale training more feasible. Figure 1 illustrates the FP8 precision used in the forward and backward passes during training.
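A minimal sketch of the fine-grained, block-wise scaling idea behind FP8 training is shown below. Real training casts to hardware FP8 (E4M3) via the tensor cores; here the cast is only emulated with a 448 clamp and roughly three mantissa bits, so the snippet should be read as an illustration of block-wise scaling rather than the actual recipe. The 1x128 tile size follows DeepSeek-V3’s published quantization of activations.

```python
import numpy as np

# Sketch of fine-grained (block-wise) FP8-style quantization (illustrative
# only: real training casts to hardware FP8 E4M3; here the cast is emulated
# with ~3 mantissa bits and a 448 clamp, ignoring subnormals and corner cases).

FP8_E4M3_MAX = 448.0
BLOCK = 128

def fake_fp8_e4m3(x):
    """Round to roughly E4M3 precision: 3 mantissa bits, clamp to +/-448."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mag = np.abs(x)
    exp = np.floor(np.log2(np.where(mag == 0, 1.0, mag)))
    step = np.exp2(exp - 3)                      # spacing given 3 mantissa bits
    return np.where(mag == 0, 0.0, np.round(x / step) * step)

def quantize_blockwise(x):
    """Scale each 1x128 block so its max magnitude uses the full FP8 range."""
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)
    return fake_fp8_e4m3(blocks / scales), scales

def dequantize_blockwise(q, scales, shape):
    return (q * scales).reshape(shape)

x = np.random.default_rng(0).standard_normal((4, 512)).astype(np.float32)
q, s = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s, x.shape)
print("mean relative error:", float(np.mean(np.abs(x - x_hat)) / np.mean(np.abs(x))))
```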
LogFMT for Efficient Communication
DeepSeek also employs low-precision compression for network communication within the DeepSeek-V3 architecture. During expert-parallel (EP) dispatch, tokens are sent using fine-grained FP8 quantization, cutting communication volume by 50% compared to BF16 and thereby significantly shortening communication time.
Beyond traditional floating-point formats, DeepSeek also experimented with a novel data type called LogFMT-nBit (Logarithmic Floating-Point Formats), which maps activation values into the logarithmic domain and quantizes them uniformly there, offering higher precision than FP8 at the same bit width.
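The snippet below sketches a generic logarithmic quantizer of this kind: one sign bit, with the remaining code space spent uniformly on log-magnitudes between the minimum and maximum log values of the tensor being sent. It illustrates the general idea only; the exact encoding, scaling, and rounding rules of LogFMT-nBit are not reproduced here.

```python
import numpy as np

# Sketch of a LogFMT-style logarithmic quantizer (an illustration of the
# general idea, not DeepSeek's exact format): one sign bit, one code reserved
# for zero, and the remaining codes quantize log(|x|) uniformly.

def logfmt_encode(x, n_bits=8):
    levels = 2 ** (n_bits - 1) - 2             # sign bit + zero code reserved
    sign = np.signbit(x)
    mag = np.abs(x)
    nonzero = mag > 0
    logs = np.log(mag[nonzero])
    lo, hi = logs.min(), logs.max()
    hi = max(hi, lo + 1e-9)                    # guard against a constant tensor
    codes = np.zeros(x.shape, dtype=np.int32)  # code 0 encodes exact zero
    codes[nonzero] = np.round((logs - lo) / (hi - lo) * levels).astype(np.int32) + 1
    return sign, codes, (lo, hi, levels)

def logfmt_decode(sign, codes, meta):
    lo, hi, levels = meta
    out = np.zeros(codes.shape)
    nz = codes > 0
    out[nz] = np.exp(lo + (codes[nz] - 1) / levels * (hi - lo))
    return np.where(sign, -out, out)

x = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
sign, codes, meta = logfmt_encode(x, n_bits=8)
x_hat = logfmt_decode(sign, codes, meta)
print("mean relative error:", float(np.mean(np.abs(x - x_hat) / np.abs(x))))
```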
Interconnect-Driven Design: Addressing Hardware Limitations
Current Hardware Architecture and its Constraints
DeepSeek currently utilizes the NVIDIA H800 GPU SXM architecture (Figure 2), which, while based on the Hopper architecture similar to the H100, features reduced FP64 compute performance and NVLink bandwidth (400 GB/s down from 900 GB/s in H100) due to regulatory requirements. This significant reduction in intra-node scaling bandwidth poses challenges for high-performance workloads. To compensate, each node is equipped with eight 400G Infiniband (IB) CX7 network interface cards (NICs) to enhance inter-node scaling capabilities.
Hardware-Aware Parallelization and Model Co-design
To navigate the limitations of the H800 architecture, the DeepSeek-V3 model incorporates hardware-aware design considerations for parallelization, including: avoiding Tensor Parallelism (TP), enhancing Pipeline Parallelism (PP), and accelerating Expert Parallelism (EP). Specific details of these strategies are available in the original paper.
A key aspect of model co-design is “node-aware routing” for the TopK expert selection strategy in the MoE architecture. Given the approximately 4:1 bandwidth difference between intra-node (NVLink, ~160 GB/s effective) and inter-node (IB, ~40 GB/s effective per NIC) communication, DeepSeek designed the routing to leverage the higher intra-node bandwidth. By grouping the 256 routing experts (4 per GPU in an 8-node, 64-GPU setup) into 8 groups of 32 experts, each residing on a single node, and algorithmically ensuring that each token is routed to at most 4 nodes, DeepSeek mitigates the IB communication bottleneck and improves effective communication bandwidth during training. Tokens destined for experts on the same node can be sent via IB once and then forwarded via NVLink, reducing redundant IB traffic.
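The following sketch reimplements the node-limited selection logic in NumPy using the sizes quoted above (256 experts, 8 nodes of 32 experts, 8 experts per token, at most 4 nodes per token). The way node scores are formed here (the sum of each node's top-2 expert scores) is one plausible choice for illustration, not necessarily the exact rule used in DeepSeek-V3.

```python
import numpy as np

# Sketch of node-limited ("node-aware") TopK routing: pick at most MAX_NODES
# nodes per token, then select the TOP_K experts only within those nodes.

N_EXPERTS, N_NODES, TOP_K, MAX_NODES = 256, 8, 8, 4
EXPERTS_PER_NODE = N_EXPERTS // N_NODES        # 32 experts per node

def node_aware_topk(scores):
    """Select TOP_K experts while touching at most MAX_NODES nodes."""
    per_node = scores.reshape(N_NODES, EXPERTS_PER_NODE)
    # Rank nodes by the sum of their best two expert scores (one common choice).
    node_scores = np.sort(per_node, axis=1)[:, -2:].sum(axis=1)
    allowed_nodes = np.argsort(node_scores)[-MAX_NODES:]
    masked = np.full(scores.shape, -np.inf)
    for n in allowed_nodes:
        start = n * EXPERTS_PER_NODE
        masked[start:start + EXPERTS_PER_NODE] = scores[start:start + EXPERTS_PER_NODE]
    chosen = np.argsort(masked)[-TOP_K:]
    nodes_touched = sorted(set(int(e) // EXPERTS_PER_NODE for e in chosen))
    return sorted(chosen.tolist()), nodes_touched

scores = np.random.default_rng(0).standard_normal(N_EXPERTS)
experts, nodes = node_aware_topk(scores)
print("experts:", experts)
print("nodes touched:", nodes, "(at most 4)")
```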
Scale-Up and Scale-Out Convergence: Future Hardware Directions
While node-aware routing reduces bandwidth demands, the bandwidth disparity between NVLink and IB complicates the implementation of communication-intensive kernels. Currently, GPU Streaming Multiprocessors (SMs) handle both network message processing and data forwarding via NVLink, consuming significant compute resources. DeepSeek advocates for integrating intra-node (scale-up) and inter-node (scale-out) communication into a unified framework.
Integrating dedicated co-processors for network traffic management and seamless forwarding between NVLink and IB domains could reduce software complexity and maximize bandwidth utilization. Hardware support for dynamic traffic deduplication could further optimize strategies like DeepSeek-V3’s node-aware routing. DeepSeek also explores emerging interconnect protocols like Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UALink), noting the Unified Bus (UB) as a recent approach to converging scale-up and scale-out. The paper details methods for achieving this convergence at the programming framework level, including unified network adapters, dedicated communication co-processors, flexible forwarding and broadcast/reduce mechanisms, and hardware synchronization primitives.
Bandwidth Contention and Latency
Another limitation of current hardware is the lack of flexibility in dynamically allocating bandwidth between different traffic types on NVLink and PCIe. For instance, transferring KV cache data from CPU memory to GPUs during inference can saturate PCIe bandwidth, leading to contention with inter-GPU EP communication via IB, potentially degrading overall performance and causing latency spikes. DeepSeek suggests solutions including dynamic NVLink/PCIe traffic prioritization, I/O chiplet integration, and CPU-GPU interconnect within the scale-up domain.
Large-Scale Network-Driven Design: Multi-Plane Fat-Tree
Network Co-design: Multi-Plane Fat-Tree
For DeepSeek-V3 training, a Multi-Plane Fat-Tree (MPFT) scale-out network was deployed (Figure 3). Each node, equipped with 8 GPUs and 8 IB NICs, assigns each GPU-NIC pair to a different network plane. Additionally, each node has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS distributed file system. The scale-out network utilizes 64-port 400G IB switches, theoretically supporting up to 16,384 GPUs while retaining the cost and latency advantages of a two-layer network. However, due to policy and regulatory constraints, the actual deployment involved over two thousand GPUs.
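The 16,384-GPU figure follows from the standard two-layer fat-tree construction, as the short calculation below shows (assuming half of each leaf switch's 64 ports face the NICs, the other half face the spine layer, and each GPU attaches to exactly one of the eight planes).

```python
# Quick arithmetic behind the "up to 16,384 GPUs" figure for a two-layer
# fat-tree built from 64-port switches with 8 network planes (illustrative).
ports = 64
planes = 8

leaf_down = ports // 2                    # half the leaf ports face the NICs
leaf_up = ports - leaf_down               # the other half face the spine layer
max_leaves = ports                        # each 64-port spine can reach 64 leaves
gpus_per_plane = max_leaves * leaf_down   # 64 * 32 = 2048 endpoints per plane
total_gpus = planes * gpus_per_plane      # 8 planes x 2048 = 16,384 GPUs
print(total_gpus)
```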
The deployed MPFT network did not fully realize its intended architecture due to current limitations of the IB ConnectX-7. Ideally (Figure 4), each NIC would have multiple physical ports, each connected to a separate network plane but presented to the user as a single logical interface via port bonding. This would allow a single Queue Pair (QP) to seamlessly send and receive messages across all available ports, similar to packet spraying. Native out-of-order layout support within the NIC would be necessary to ensure message consistency and correct ordering semantics, as packets from the same QP might traverse different network paths and arrive out of order. InfiniBand ConnectX-8 natively supports four planes, and future NICs with full support for advanced multi-plane capabilities will significantly benefit the scalability of two-layer fat-tree networks for large AI clusters. Overall, multi-plane architectures offer significant advantages in fault isolation, robustness, load balancing, and scalability for large systems.
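The toy simulation below illustrates what the missing NIC features would have to provide: packets of a single message are sprayed across planes, arrive in arbitrary order, and are placed back into position by sequence number at the receiver. It models only the concept, not ConnectX hardware or any real transport protocol.

```python
import random

# Toy simulation of multi-plane packet spraying with out-of-order arrival:
# packets of one message are sprayed across planes, may arrive in any order,
# and are reassembled by sequence number at the receiver.

N_PLANES = 4
MESSAGE = [f"chunk-{i}" for i in range(16)]

random.seed(0)

# Sender: tag each packet with a sequence number and spray across planes.
packets = [(seq, seq % N_PLANES, payload) for seq, payload in enumerate(MESSAGE)]

# Network: each plane delivers independently, so global order is not preserved.
random.shuffle(packets)

# Receiver: out-of-order placement by sequence number restores the message.
reassembled = [None] * len(MESSAGE)
for seq, plane, payload in packets:
    reassembled[seq] = payload

assert reassembled == MESSAGE
print("message reassembled correctly despite out-of-order arrival")
```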
DeepSeek highlights several advantages of MPFT: because it is a subset of the Multi-Rail Fat-Tree (MRFT) topology, existing NVIDIA and NCCL optimizations for MRFT networks carry over seamlessly, and the design is cost-effective while providing traffic isolation, reduced latency, and robustness. Performance analysis comparing MPFT and MRFT (Figures 5 and 6, Table 4) revealed that the all-to-all performance of multi-plane networks is very close to that of single-plane multi-rail networks, and the two were nearly identical when training the V3 model on 2048 GPUs.
Low-Latency Networking
In DeepSeek’s model inference, large-scale EP heavily relies on all-to-all communication, which is sensitive to both bandwidth and latency. Even microsecond-level inherent network latency can significantly impact system performance.
DeepSeek analyzes the latency characteristics of IB and RoCE (Table 5), noting IB’s consistently lower latency, making it preferable for latency-sensitive workloads like distributed training and inference. While RoCE offers a potentially cost-effective alternative, its current latency and scalability limitations prevent it from fully meeting the demands of large-scale AI systems. DeepSeek proposes specific improvements for RoCE, including dedicated low-latency RoCE switches, optimized routing policies, and enhanced traffic isolation or congestion control mechanisms.
To further reduce network communication latency, DeepSeek utilizes InfiniBand GPUDirect Async (IBGDA). Traditionally, network communication involves CPU proxy threads, introducing additional overhead. IBGDA allows GPUs to directly populate Work Request (WR) content and write to RDMA doorbell MMIO addresses, eliminating the significant latency associated with GPU-CPU communication. By managing the entire control plane within the GPU, IBGDA avoids CPU bottlenecks, especially when sending numerous small packets, as the GPU’s parallel threads can distribute the workload. DeepSeek’s DeepEP and other works have demonstrated significant performance gains using IBGDA, leading DeepSeek to advocate for broad support of such features across various accelerator devices.
Discussion and Insights for Future Hardware Architecture Design
Building upon the identified hardware limitations and proposed solutions in specific application contexts, the paper broadens the discussion to offer forward-looking directions for future hardware architecture design:
- Robustness Challenges: Addressing hardware failures and silent data corruption through advanced error detection and correction mechanisms for building non-stop AI infrastructure.
- CPU Bottlenecks and Interconnect Limitations: Optimizing CPU-accelerator collaboration, particularly breaking the limitations of traditional interfaces like PCIe for high-speed, bottleneck-free intra-node communication.
- Intelligent Networks for AI: Creating low-latency and intelligent networks with technologies like co-packaged optics, lossless mechanisms, and adaptive routing to handle complex communication demands.
- Memory Semantic Communication and Ordering: Resolving data consistency and ordering challenges in current memory semantic communication, exploring hardware-level built-in guarantees for improved communication efficiency.
- Computation and Compression in the Network: Offloading computation and compression capabilities into the network, especially for specific workloads like EP, to unlock network bandwidth potential.
- Memory-Centric Architecture Innovations: Addressing the memory bandwidth crisis driven by exponential model scaling, exploring cutting-edge technologies like DRAM stacking and wafer-scale integration.
The paper delves into each of these areas with specific insights and recommendations, highlighting the need for a holistic co-design approach between hardware and software to enable the continued advancement and accessibility of large-scale AI.
In conclusion, this technical report provides valuable insights into the challenges and solutions encountered during the development and training of DeepSeek-V3. By meticulously analyzing the interplay between model architecture and hardware limitations, DeepSeek offers a compelling vision for the future of AI infrastructure, emphasizing the critical role of hardware-aware co-design in achieving cost-efficient and scalable large language models. The paper’s detailed exploration of techniques like MLA, DeepSeekMoE, FP8 training, LogFMT, and the MPFT network, coupled with its forward-looking recommendations for hardware development, serves as a significant contribution to the field of large-scale AI research and engineering.