AMD’s poor software optimization is letting Nvidia keep a stranglehold on AI chips

December 31, 2024

It’s the Software, Stupid. As the year draws to a close, AMD had hoped that its powerful new MI300X chips would finally allow it to gain ground on Nvidia. However, an extensive investigation by SemiAnalysis suggests that the company’s software issues are allowing Nvidia to maintain its comfortable lead. SemiAnalysis compared AMD’s Instinct MI300X, a CDNA 3 GPU accelerator designed for high-performance computing and AI workloads, to Nvidia’s H100 and H200 chips, noting several differences.

On paper, AMD’s numbers look impressive: the chip boasts 1,307 teraflops of FP16 compute and a massive 192GB of HBM3 memory, beating rival offerings from both Nvidia and Intel. AMD’s solutions also promise a lower cost of ownership than Nvidia’s expensive chips and InfiniBand networking.

However, as the SemiAnalysis team discovered after five months of intensive testing, raw specs do not tell the whole story. Despite the MI300X’s impressive hardware, AMD’s software ecosystem proved difficult to use effectively. SemiAnalysis relied heavily on AMD engineers to fix bugs and resolve issues throughout its benchmarking.

This is a far cry from Nvidia’s hardware and software, which, the analysts noted, tend to work smoothly out of the box with no handholding needed from Nvidia staff.

Moreover, the software woes weren’t limited to SemiAnalysis’s testing – AMD’s customers were feeling the pain too. For instance, AMD’s largest cloud provider, Tensorwave, had to give AMD engineers access to the same MI300X chips that Tensorwave had purchased, just so AMD could debug the software.

Also read: Not just the hardware: How deep is Nvidia’s software moat?

The troubles don’t end there. From integration problems with PyTorch to subpar scaling across multiple chips, AMD’s software consistently fell short of Nvidia’s proven CUDA ecosystem. SemiAnalysis also noted that many of AMD’s AI libraries are essentially forks of Nvidia’s AI libraries, which leads to suboptimal outcomes and compatibility issues.

“The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience. As fast as AMD tries to fill in the CUDA moat, Nvidia engineers are working overtime to deepen said moat with new features, libraries, and performance updates,” the analysts wrote.

However, the analysts did find a glimmer of hope in pre-release BF16 software development branches, which showed much better performance. By the time this code is released, though, Nvidia’s next-generation Blackwell chips will be available (although Nvidia has reportedly been experiencing some growing pains in its rollout).

Taking these issues into consideration, SemiAnalysis listed a number of recommendations for AMD, beginning with giving Team Red engineers more compute and engineering resources to fix and improve the ecosystem.

She took our recommendations seriously
and asked us and her team a lot questions
Several changes are already in motion!
Excited to see improvements coming https://t.co/38aAwwIdEI

– Dylan Patel (@dylan522p) December 23, 2024

SemiAnalysis founder Dylan Patel even met with AMD CEO Lisa Su. He wrote on X that Su understands the need to improve AMD’s software stack and that many changes are already in the works.

It’s a steep climb after years of neglecting this crucial component. As much as analysts want AMD to be competitive with Nvidia, the “CUDA moat” appears to be keeping Nvidia in the lead.