
Huawei releases an open-weight model trained on Huawei Ascend NPUs


[Submitted on 27 May 2025 (v1), last revised 28 May 2025 (this version, v2)]

Authors: Yehui Tan, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng

It is common to observe that some experts are activated far more often than others, which leads to system inefficiency when experts run in parallel on different devices. Mixture of Grouped Experts (MoGE) groups the experts during selection and balances the expert workload better than conventional MoE: it constrains each token to activate an equal number of experts within each predefined expert group. When a model is distributed across multiple devices, this architectural design balances the computational load between devices and significantly improves throughput, particularly during the inference phase. We also build Pangu Pro MoE on Ascend NPUs: a sparse MoGE model with 72 billion parameters in total, of which 16 billion are activated for each token. The configuration of Pangu Pro MoE is optimized through extensive system simulation for the Ascend 800I A2 and 300I Duo. Our experiments show that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. Pangu Pro MoE achieves 1148 tokens/s in inference, which can be further improved to 1528 tokens/s with speculative acceleration, outperforming comparable 32B and 72B dense models. We also achieve an excellent cost-to-performance ratio for model inference on the Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization, making it a leading model within the sub-100B total parameter class. It also outperforms prominent open-source models such as GLM-Z1-32B and Qwen3-32B.
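For readers unfamiliar with grouped routing, the sketch below illustrates the core idea in PyTorch. It is a minimal illustration under assumptions, not the paper's implementation: the function name, its arguments, and the score normalization are hypothetical. Instead of selecting the top-k experts globally, each token selects a fixed number of experts inside every predefined group, so every group receives the same number of activations per token.

import torch

def grouped_topk_routing(router_logits, num_groups, k_per_group):
    # Hypothetical MoGE-style router sketch (not the paper's code).
    # router_logits: [num_tokens, num_experts]; experts are assumed to be
    # laid out as num_groups contiguous groups of equal size.
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_groups

    # View scores as [tokens, groups, experts_per_group] and take top-k
    # inside each group rather than over all experts, so every group
    # contributes exactly k_per_group experts per token.
    grouped = router_logits.view(num_tokens, num_groups, experts_per_group)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)

    # Map per-group indices back to global expert indices.
    offsets = torch.arange(num_groups).view(1, num_groups, 1) * experts_per_group
    expert_idx = (topk_idx + offsets).reshape(num_tokens, -1)

    # Normalize the selected scores into routing weights (one possible choice).
    weights = torch.softmax(topk_vals.reshape(num_tokens, -1), dim=-1)
    return expert_idx, weights

# Toy usage: 16 experts in 4 groups, 2 experts activated per group.
logits = torch.randn(8, 16)
idx, w = grouped_topk_routing(logits, num_groups=4, k_per_group=2)
print(idx.shape, w.shape)  # torch.Size([8, 8]) torch.Size([8, 8])

Because each group of experts can be mapped to one device, this per-group constraint is what keeps the computational load even when the model is sharded across multiple Ascend NPUs.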

Submission History

By: Hang Zhou
[v1] Tue, 27 May 2025 16:40 UTC (710 KB)
[v2] Wed, 28 May 2025 10:42:15 UTC (710 KB)

