DeepEP
DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, also known as MoE dispatch and combine. The library also supports low-precision operations, including FP8.
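Conceptually, dispatch routes each token to the experts selected by the gate, and combine gathers the expert outputs back into a weighted sum per token. The single-device reference below illustrates these semantics only; it is not DeepEP's API and performs no communication:

```python
import torch

def reference_moe_dispatch_combine(x: torch.Tensor, gate_logits: torch.Tensor,
                                   experts, top_k: int = 8) -> torch.Tensor:
    """Single-device reference of what MoE dispatch + combine compute (no communication)."""
    # Routing decision: top-k experts and their gating weights per token
    topk_weights, topk_idx = torch.topk(gate_logits.softmax(dim=-1), top_k, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # "Dispatch": gather the tokens routed to expert `e`
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        # "Combine": weight each expert output by its gate value and sum back per token
        out.index_add_(0, token_ids, expert(x[token_ids]) * topk_weights[token_ids, slot].unsqueeze(-1))
    return out
```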
To align with the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP offers a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for both training and inference prefilling tasks. Additionally, they support SM (Streaming Multiprocessors) number control.
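For example, the number of SMs used by the normal kernels can be capped so that the remaining SMs stay available for computation. This is a minimal sketch: `Buffer.set_num_sms` is assumed to be the relevant setter and 24 is an arbitrary value, so please check the interface in the repository.

```python
from deep_ep import Buffer

# Assumed setter: cap the normal (high-throughput) kernels at 24 SMs so the
# rest of the GPU stays free for computation; tune the value for your workload.
Buffer.set_num_sms(24)
```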
For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels with pure RDMA to minimize delays. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM resource.
Notice: the implementation in this library may have some slight differences from the DeepSeek-V3 paper.
Performance
Normal kernels with NVLink and RDMA forwarding
We test normal kernels on H800 GPUs (~160 GB/s maximum NVLink bandwidth), each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).
Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
---|---|---|---|---|
Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
Internode | 32 | 44 GB/s (RDMA) | 32 | 47 GB/s (RDMA) |
Internode | 64 | 46 GB/s (RDMA) | 64 | 45 GB/s (RDMA) |
Low-latency kernels with pure RDMA
We test low-latency kernels on H800 GPUs, each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).
Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
---|---|---|---|---|---|
8 | 163 us | 46 GB/s | 8 | 318 us | 46 GB/s |
16 | 173 us | 43 GB/s | 16 | 329 us | 44 GB/s |
32 | 182 us | 41 GB/s | 32 | 350 us | 41 GB/s |
64 | 186 us | 40 GB/s | 64 | 353 us | 41 GB/s |
128 | 192 us | 39 GB/s | 128 | 369 us | 39 GB/s |
256 | 194 us | 39 GB/s | 256 | 360 us | 40 GB/s |
Quick start
Requirements
- Hopper GPUs (may support more architectures or devices later)
- Python 3.8 and above
- CUDA 12.3 and above
- PyTorch 2.1 and above
- NVLink for intranode communication
- RDMA network for internode communication
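An optional sanity check for these requirements might look like the following sketch (not part of DeepEP):

```python
import torch

# Rough environment check for the requirements listed above (sketch only).
torch_major, torch_minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
assert (torch_major, torch_minor) >= (2, 1), "PyTorch 2.1 or above is required"

assert torch.version.cuda is not None, "a CUDA build of PyTorch is required"
cuda_major, cuda_minor = (int(v) for v in torch.version.cuda.split(".")[:2])
assert (cuda_major, cuda_minor) >= (12, 3), "CUDA 12.3 or above is required"

sm_major, sm_minor = torch.cuda.get_device_capability()
assert (sm_major, sm_minor) >= (9, 0), "Hopper (SM90) or newer GPUs are expected"
```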
Download and install NVSHMEM dependency
DeepEP also depends on our modified NVSHMEM. Please refer to our NVSHMEM Installation Guide for instructions.
Development
```bash
# Build and make symbolic links for SO files
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py build
# You may modify the specific SO names according to your own platform
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run test cases
# NOTES: you may modify the `init_dist` function in `tests/utils.py`
# according to your own cluster settings, and launch into multiple nodes
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py
```
Installation
```bash
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
```
Then, import `deep_ep` in your Python project, and enjoy!
Network configurations
DeepEP is fully tested with InfiniBand networks. However, it is theoretically compatible with RDMA over Converged Ethernet (RoCE) as well.
Traffic isolation
Traffic isolation is supported by InfiniBand through Virtual Lanes (VL).
To prevent interference between different types of traffic, we recommend segregating workloads across different virtual lanes as follows:
- workloads using normal kernels
- workloads using low-latency kernels
- other workloads
For DeepEP, you can control the virtual lane assignment by setting the `NVSHMEM_IB_SL` environment variable.
Adaptive routing
Adaptive routing is an advanced routing feature provided by InfiniBand switches that can evenly distribute traffic across multiple paths. Currently, low-latency kernels support adaptive routing, while normal kernels do not (support may be added soon). Enabling adaptive routing for normal internode kernels may lead to deadlocks or data corruption issues.
For low-latency kernels, enabling adaptive routing can completely eliminate network congestion caused by routing conflicts, but it also introduces additional latency. We recommend the following configuration for optimal performance:
- enable adaptive routing in environments with heavy network loads
- use static routing in environments with light network loads
Congestion control
Congestion control is disabled as we have not observed significant congestion in our production environment.
Interfaces and examples
Example use in model training or inference prefilling
The normal kernels can be used in model training or the inference prefilling phase (without the backward part), as the example below shows.
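The following is a minimal sketch of the intended call pattern, mirroring the low-latency example in the next subsection. The names `Buffer.get_dispatch_config`, `Buffer.get_combine_config`, the `*_buffer_size_hint` helpers, `get_dispatch_layout`, `dispatch`, and `combine` are assumptions about the normal-kernel API and should be verified against the repository; `run_experts` stands in for the model's own expert computation.

```python
import torch
import torch.distributed as dist
from typing import Optional
from deep_ep import Buffer

# Communication buffer for the normal (high-throughput) kernels
_buffer: Optional[Buffer] = None

def get_buffer(group: dist.ProcessGroup, hidden_bytes: int) -> Buffer:
    global _buffer
    # Size the NVLink/RDMA buffers from the library's size hints
    # (`get_dispatch_config` / `get_combine_config` and the `*_size_hint` names are assumptions)
    num_nvl_bytes, num_rdma_bytes = 0, 0
    for config in (Buffer.get_dispatch_config(group.size()), Buffer.get_combine_config(group.size())):
        num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
        num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)
    # Allocate (or re-allocate) the buffer if it does not exist or is too small
    if _buffer is None or _buffer.group != group or _buffer.num_nvl_bytes < num_nvl_bytes or _buffer.num_rdma_bytes < num_rdma_bytes:
        _buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
    return _buffer

def moe_forward(x: torch.Tensor, topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                num_experts: int, run_experts) -> torch.Tensor:
    # 1) Compute the dispatch layout for the current routing decision
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, _ = \
        _buffer.get_dispatch_layout(topk_idx, num_experts)
    # 2) All-to-all dispatch: send each token to the ranks hosting its top-k experts
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, _ = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank,
                         num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank,
                         num_tokens_per_expert=num_tokens_per_expert)
    # 3) Local expert computation (`run_experts` is a hypothetical model-side callable)
    expert_out = run_experts(recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list)
    # 4) All-to-all combine: route the expert outputs back and reduce them per token
    combined_x, _, _ = _buffer.combine(expert_out, handle)
    return combined_x
```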
Example use in inference decoding
The low-latency kernels can be used in the inference decoding phase, as the example below shows.
```python
import torch
import torch.distributed as dist
from typing import Tuple, Optional
from deep_ep import Buffer

# Communication buffer (will allocate at runtime)
# NOTES: there is no SM control API for the low-latency kernels
_buffer: Optional[Buffer] = None

# You may call this function at the framework initialization
def get_buffer(group: dist.ProcessGroup, num_max_dispatch_tokens_per_rank: int, hidden: int, num_experts: int) -> Buffer:
    # NOTES: the low-latency mode will consume much more space than the normal mode
    # So we recommend that `num_max_dispatch_tokens_per_rank` (the actual batch size in the decoding engine) should be less than 256
    global _buffer
    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)

    # Allocate a buffer if not existed or not enough buffer size
    if _buffer is None or _buffer.group != group or not _buffer.low_latency_mode or _buffer.num_rdma_bytes < num_rdma_bytes:
        # NOTES: for best performance, the QP number **must** be equal to the number of the local experts
        assert num_experts % group.size() == 0
        _buffer = Buffer(group, 0, num_rdma_bytes, low_latency_mode=True, num_qps_per_rank=num_experts // group.size())
    return _buffer

def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor, num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (but you may restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the actual tensor will not be received until you call `hook()`;
    # this is useful for double-batch overlapping, but **without any SM occupation**
    # If you don't want to overlap, please set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook

def low_latency_combine(hidden_states: torch.Tensor, topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (but you may restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
```
For two micro-batch overlapping, you can refer to the following figure. With our receiving hook interface, the RDMA network traffic happens in the background without consuming any GPU SMs from the computation part. Note that the overlapped parts can be adjusted, i.e. the four stages of attention/dispatch/MoE/combine may not have exactly the same execution time; you may adjust the stage settings according to your workload.
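As a rough illustration (a simplified schedule, not the exact one in the figure), the wrappers defined above can be interleaved across two micro-batches. Here `attention` and `run_experts` are hypothetical stand-ins for the model's own computation; only `low_latency_dispatch` and `low_latency_combine` from the example above are taken as given.

```python
def overlapped_decode_step(t0, idx0, w0, t1, idx1, w1, attention, run_experts,
                           num_max_dispatch_tokens_per_rank, num_experts):
    # Micro-batch 0: attention, then issue dispatch (RDMA traffic runs in the background, no SMs used)
    h0 = attention(t0)
    recv0, count0, handle0, _, recv_hook0 = low_latency_dispatch(
        h0, idx0, num_max_dispatch_tokens_per_rank, num_experts)

    # Overlap: micro-batch 1's attention and dispatch run while batch 0's tokens are in flight
    h1 = attention(t1)
    recv1, count1, handle1, _, recv_hook1 = low_latency_dispatch(
        h1, idx1, num_max_dispatch_tokens_per_rank, num_experts)

    # Micro-batch 0: experts + combine, overlapped with batch 1's in-flight dispatch
    recv_hook0()                                   # batch 0's received tokens become visible
    out0, _, combine_hook0 = low_latency_combine(run_experts(recv0, count0), idx0, w0, handle0)

    # Micro-batch 1: experts + combine, overlapped with batch 0's in-flight combine
    recv_hook1()
    out1, _, combine_hook1 = low_latency_combine(run_experts(recv1, count1), idx1, w1, handle1)

    combine_hook0(); combine_hook1()               # wait for the combined results to land
    return out0, out1
```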
Notices
- For extreme performance, we discover and use an undocumented PTX instruction: `ld.global.nc.L1::no_allocate.L2::256B`. This instruction leads to undefined behavior: accessing volatile GPU memory with the non-coherent read-only PTX modifier `.nc`. But the correctness is tested to be guaranteed with `.L1::no_allocate` on Hopper architectures, and performance is much better. If you find kernels not working on some other platforms, you may add `DISABLE_AGGRESSIVE_PTX_INSTRS=1` to `setup.py` to disable this, or file an issue.
- For better performance on your cluster, we recommend running all the tests and using the best auto-tuned configuration. The default configurations are optimized for DeepSeek's internal cluster.
License
This code repository is released under the MIT License, except for code that references NVSHMEM (including `csrc/kernels/ibgda_device.cuh` and `third-party/nvshmem.patch`), which is subject to the NVSHMEM SLA.
Citation
If you use this codebase, or otherwise find our work valuable, please cite:
```bibtex
@misc{deepep2025,
      title={DeepEP: an efficient expert-parallel communication library},
      author={Chenggang Zhao and Shangyan Zhou and Liyue Zhang and Chengqi Deng and Zhean Xu and Yuxuan Liu and Kuai Yu and Jiashi Li and Liang Zhao},
      year={2025},
      publisher={GitHub},
      howpublished={\url{https://github.com/deepseek-ai/DeepEP}},
}
```