Comparing Nvidia’s top-tier GPUs with Huawei’s rack scale boogeyman

Analysis. Nvidia has been cleared to resume shipments of its H20 accelerators to China. But while the chip may soon be plentiful again, bit-barn operators in the region now have more capable alternatives available.

Among the most promising are Huawei’s CloudMatrix rack systems, which the company teased this week at the World Artificial Intelligence Conference in Shanghai.

The system is powered by Huawei’s latest Ascend NPU, the 910C, which, if you can get one, promises twice the floating point performance of the H20 along with more, albeit slower, memory. But Huawei’s CloudMatrix is clearly aiming higher than Nvidia’s sanctions-compliant silicon: the company’s biggest iron promises roughly 60 percent higher performance than Nvidia’s Blackwell-based GB200 NVL72 rack systems, along with roughly twice the memory bandwidth and just over 3.5x the HBM capacity.

How does a company that has been effectively blacklisted from Western chip tech manage that? Simple: brute force. The CloudMatrix is huge, with more than 5x as many accelerators as Nvidia’s NVL72 and 16x the footprint.
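Those headline ratios fall straight out of the per-chip specs. As a rough sanity check, here’s a minimal Python sketch; the Ascend 910C figures come from this article, while the GB200 numbers (roughly 2.5 petaFLOPS dense FP16, 192GB of HBM3e, and 8TB/s per GPU) are Nvidia’s published specs, so treat any small gaps against Huawei’s marketing ratios as rounding:

```python
# Back-of-the-envelope system comparison: CloudMatrix 384 vs GB200 NVL72.
# Ascend 910C figures are from this article; GB200 figures are Nvidia's
# published specs (~2.5 PFLOPS dense FP16, 192GB HBM3e, 8TB/s per GPU).
cloudmatrix = {"chips": 384, "tflops": 752, "hbm_gb": 128, "bw_tbs": 3.2}
nvl72 = {"chips": 72, "tflops": 2500, "hbm_gb": 192, "bw_tbs": 8.0}

def system_totals(s):
    """Multiply per-chip specs by the accelerator count."""
    n = s["chips"]
    return n * s["tflops"], n * s["hbm_gb"], n * s["bw_tbs"]

cm_flops, cm_mem, cm_bw = system_totals(cloudmatrix)
nv_flops, nv_mem, nv_bw = system_totals(nvl72)

print(f"compute:   {cm_flops / nv_flops:.2f}x")  # ~1.60x, i.e. "60 percent faster"
print(f"HBM:       {cm_mem / nv_mem:.2f}x")      # ~3.56x, "just over 3.5x"
print(f"bandwidth: {cm_bw / nv_bw:.2f}x")        # ~2.13x, "roughly twice"
```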

Ascend 910C dissected

At the heart of the CloudMatrix 384 is Huawei’s Ascend 910C NPU. Each accelerator packs two compute dies stitched together by a high-speed chip-to-chip interconnect capable of shuttling data between them at 540GB/s, or 270GB/s in each direction.

Together, the dies deliver 752 teraFLOPS of dense FP16/BF16 performance. Feeding all that compute are eight stacks of high-bandwidth memory totaling 128GB, with each die getting 1.6TB/s of memory bandwidth for a combined 3.2TB/s.
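For the curious, those per-chip numbers decompose cleanly across the two dies and eight HBM stacks, as a trivial sketch shows:

```python
# How the Ascend 910C's headline specs split across its internals.
TOTAL_TFLOPS = 752   # dense FP16/BF16
TOTAL_HBM_GB = 128   # spread across eight stacks
TOTAL_BW_TBS = 3.2
DIES = 2
HBM_STACKS = 8

print(TOTAL_TFLOPS / DIES, "TFLOPS per die")      # 376.0
print(TOTAL_HBM_GB / HBM_STACKS, "GB per stack")  # 16.0
print(TOTAL_BW_TBS / DIES, "TB/s per die")        # 1.6, as stated above
```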

[Diagram: a breakdown of Huawei’s latest NPU, the Ascend 910C]

For anyone who has been following AI chip development in 2025, those specs aren’t exactly what you’d call competitive. Nvidia’s now two-year-old H200 boasts 83 teraFLOPS more FP16 floating point performance, along with 13GB more HBM and 1.6TB/s higher memory bandwidth.

You can’t legally buy an H200 in China, though, so the better comparison is the H20, which Nvidia is expected to resume shipping any day now. The H20 holds a slight edge in memory bandwidth, but the Ascend 910C offers more HBM (128GB versus 96GB) and more than twice the floating-point performance.

While the 910C doesn’t support FP8, Huawei claims its INT8 performance is nearly as good, at least for inference.

That makes the 910C a potent alternative to Nvidia’s China-spec accelerators, even if it’s no match for Nvidia’s latest Blackwell chips.

NPUs are stronger together

However, most cutting-edge large language models aren’t trained or run on a single processor. They can’t be: no single chip has enough memory or bandwidth. That makes individual chip performance less important than how efficiently you can scale up and out, and Huawei’s latest NPUs are designed to do just that.

Like NVLink, Huawei’s Ascend 910C features a scale-up interconnect, called the unified bus (UB), which lets Huawei stitch multiple accelerators together into one big one. This is exactly what Nvidia does with its HGX servers and rack systems.

Each 910C features 14 UB links (seven per compute die) running at 28GB/s apiece. These connect to seven UB switch ASICs baked into each node, forming a non-blocking mesh between eight NPUs and four Kunpeng CPUs.

Unlike Nvidia’s H20 or B200 platforms, Huawei’s UB switches have a bunch of spare ports that connect up to a secondary tier of UB spine switches. This allows Huawei to scale up from eight NPUs in a box to 32 per rack or 384 per “supernode” – hence the name CloudMatrix 384.

Nvidia’s rack systems, by contrast, top out at a 72-GPU compute domain, less than a fifth the size of Huawei’s. That’s how the Chinese IT giant can claim higher system performance than its Western rival, at least on paper.

With only 32 NPUs per rack, the CloudMatrix is, as you might expect, much larger than Nvidia’s NVL72. Huawei’s biggest AI iron spans 16 racks: 12 for compute and four for networking.
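The physical layout follows directly from those counts. A quick sketch of the arithmetic, using only the figures reported above:

```python
# Scale-up fabric math for Huawei's UB topology, per the figures above.
LINK_GBS = 28           # per UB link
LINKS_PER_NPU = 14      # seven per compute die, two dies per 910C
NPUS_PER_NODE = 8
NPUS_PER_RACK = 32
NPUS_PER_SUPERNODE = 384
NETWORK_RACKS = 4

# Nominal aggregate scale-up bandwidth available to each accelerator
print(LINKS_PER_NPU * LINK_GBS, "GB/s of UB bandwidth per NPU")  # 392

nodes_per_rack = NPUS_PER_RACK // NPUS_PER_NODE        # 4 nodes per rack
compute_racks = NPUS_PER_SUPERNODE // NPUS_PER_RACK    # 12 compute racks
print(nodes_per_rack, "nodes per rack")
print(f"{compute_racks} compute + {NETWORK_RACKS} network = "
      f"{compute_racks + NETWORK_RACKS} racks total")   # 16 racks
```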

On paper, Nvidia’s NVLink technology can support scale-up systems with as many as 576 GPUs, but we have yet to see such a system in the wild.

For deployments that need more than 384 NPUs, Huawei’s CloudMatrix also offers 400Gbps scale-out networking, which the company claims supports training clusters of up to 165,000 NPUs.

Inference performance

These large-scale compute fabrics offer a few advantages for inference, especially for the massive mixture of experts (MoE) models that have lately been coming out of China.

With more chips at their disposal, operators can better leverage techniques such as tensor, expert, or data parallelism to boost inference throughput and drive down overall token costs.

In the CloudMatrix 384, a mixture of experts model like DeepSeek R1 can be configured with a single expert running on each NPU die, Huawei explained in a paper published last month.
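To make the placement concrete, here’s a minimal sketch assuming DeepSeek R1’s published figure of 256 routed experts per MoE layer; the redundancy arithmetic at the end is illustrative rather than Huawei’s exact scheme:

```python
# Expert-parallel placement on CloudMatrix 384: one expert per NPU die.
NPUS = 384
DIES_PER_NPU = 2
ROUTED_EXPERTS = 256   # DeepSeek R1's published per-layer expert count

dies = NPUS * DIES_PER_NPU                 # 768 dies in the supernode
spare = dies - ROUTED_EXPERTS              # dies left over after placement
print(f"{dies} dies for {ROUTED_EXPERTS} experts -> {spare} spare")

# Spare dies could host duplicate copies of frequently routed ("hot")
# experts, smoothing out load imbalance between tokens.
replicas = dies // ROUTED_EXPERTS
print(f"room for up to {replicas} copies of every expert")   # 3
```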

To pull this off, Huawei developed a platform called CloudMatrix-Infer, which disaggregates the prefill, decode, and caching stages of inference. “Unlike existing KV cache-centric architectures, this design enables high-bandwidth, uniform access to cached data via the UB network, thus reducing data locality constraints, simplifying task scheduling, and improving cache efficiency,” the researchers wrote. If any of this sounds familiar, that’s because Nvidia announced a similar system for its GPUs, called Dynamo, at GTC, which we examined in depth back in March.

Huawei demonstrated CloudMatrix-Infer on DeepSeek R1, where it dramatically boosted performance, with each NPU processing prompts at up to 6,688 tokens per second and generating tokens at a rate as high as 1,943 tokens per second.

If that sounds unbelievably fast, it’s worth noting that this aggregate throughput was achieved at a batch size of 96; individual performance was closer to 50ms per output token, or about 20 tokens per second. Push individual performance to around 66 tokens per second by dropping to a batch size of eight, and the NPU’s throughput falls to 538 tokens per second. That distinction matters for thinking models such as R1, which churn out reams of tokens before arriving at an answer.
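The relationship between those aggregate and per-user figures is simple division, as a quick sketch shows:

```python
# Aggregate vs per-user throughput: divide by the batch size.
def per_user(aggregate_tps: float, batch_size: int) -> float:
    return aggregate_tps / batch_size

print(per_user(1943, 96))  # ~20 tokens/s per user at batch 96
print(per_user(538, 8))    # ~67 tokens/s per user at batch 8
```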

Under ideal conditions, Huawei claims it achieved a prompt processing efficiency of 4.5 tokens per second for every teraFLOPS of compute, putting it just ahead of Nvidia’s H800 at 3.96 tokens per second per teraFLOPS. Huawei showed similar results in the decode phase, where the rack system managed a roughly 10 percent edge over the H800. As always, vendor claims like these should be taken with a grain or two of salt, as inference performance depends heavily on the workload in question.
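For what it’s worth, that prefill efficiency figure appears to be normalized against INT8 throughput rather than the chip’s FP16 rating. Assuming, and this is our assumption, that the 910C’s INT8 rate is double its dense FP16 figure, the arithmetic lines up with Huawei’s claim:

```python
# Prefill efficiency: tokens per second per teraFLOPS of compute.
PREFILL_TPS_PER_NPU = 6688    # from Huawei's paper
FP16_TFLOPS = 752
INT8_TOPS = FP16_TFLOPS * 2   # assumption: INT8 at 2x dense FP16

print(PREFILL_TPS_PER_NPU / INT8_TOPS)  # ~4.45, matching the ~4.5 claim
```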

Cost, power, and density

While tokens per second per teraFLOPS offers some insight into overall efficiency, the more important metric is what the tokens coming out of the system actually cost, usually measured in tokens per dollar or tokens per watt.

The CloudMatrix’s sheer size may let it match or even surpass the performance of Nvidia’s far more powerful Blackwell systems, but that won’t count for much if it costs more to deploy and operate.

Official power ratings for Huawei’s CloudMatrix system are hard to pin down, but SemiAnalysis has speculated the entire system would likely pull somewhere around 600 kilowatts, versus roughly 120kW for a GB200 NVL72.

If those estimates are accurate, Nvidia’s NVL72 would be not only several times more compute dense, but also more than three times as energy-efficient, at roughly 1,500 gigaFLOPS per watt versus Huawei’s 465 gigaFLOPS per watt.
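Those figures are just each system’s aggregate dense FP16 compute divided by its estimated power draw. A sketch using this article’s numbers (and the same assumed 2.5 petaFLOPS per GB200 as earlier), which lands close to, but not exactly on, the quoted 465:

```python
# Energy efficiency: aggregate dense FP16 compute divided by power draw.
def gflops_per_watt(chips: int, tflops_per_chip: float, watts: float) -> float:
    return chips * tflops_per_chip * 1000 / watts  # TFLOPS -> GFLOPS

print(gflops_per_watt(72, 2500, 120_000))  # NVL72:       1500.0 GFLOPS/W
print(gflops_per_watt(384, 752, 600_000))  # CloudMatrix: ~481 GFLOPS/W
# The ~481 here vs the quoted 465 comes down to differing power estimates.
```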

While access to cheap energy is a major sticking point in the West, it’s less of an issue in China. In recent years, Beijing has poured investment into its national grid, including the construction of large numbers of solar farms and nuclear power plants to reduce its dependence on coal-fired plants.

Infrastructure costs may be the bigger issue. Huawei’s CloudMatrix will reportedly retail for around $8.2 million, while Nvidia’s NVL72 rack systems are estimated to cost around $3.5 million apiece.
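Put those price tags over each system’s aggregate compute and the price gap more than erases the CloudMatrix’s raw performance lead. A rough sketch, again assuming about 2.5 petaFLOPS of dense FP16 per GB200:

```python
# Hardware cost per unit of dense FP16 compute, using reported prices.
def usd_per_tflops(price_usd: float, chips: int, tflops_per_chip: float) -> float:
    return price_usd / (chips * tflops_per_chip)

print(usd_per_tflops(8_200_000, 384, 752))   # CloudMatrix: ~$28 per TFLOPS
print(usd_per_tflops(3_500_000, 72, 2500))   # NVL72:       ~$19 per TFLOPS
```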

Not that any of this matters much if you’re a Chinese model developer, as Nvidia’s rack systems aren’t an option for you anyway. Thanks to Uncle Sam’s export restrictions, Huawei faces little competition in the rack-scale AI accelerator arena. Its only bottleneck may be how many 910Cs China’s foundry champion SMIC can actually churn out.

US lawmakers remain convinced SMIC can’t produce chips of this complexity in high volumes. Then again, not long ago industry watchers believed SMIC lacked the technology to manufacture chips at 7nm and smaller process nodes, and that turned out not to be the case.

While it’s unclear in what volume Huawei will be able to produce the CloudMatrix, Nvidia CEO Jensen Huang seems happy to fill Chinese datacenters with all the H20s they can handle. Nvidia has reportedly ordered another 300,000 H20 chips from TSMC to meet strong demand from Chinese customers. The Register contacted Huawei for comment but had not heard back by the time of publication. ®
