Monday, December 23rd 2024

AMD's Pain Point is ROCm Software, NVIDIA's CUDA Software is Still Superior for AI Development: Report

The battle for AI acceleration in the data center is, as most readers are aware, intensely competitive, with NVIDIA offering a top-tier software stack. AMD has nevertheless tried in recent years to capture a share of the revenue that hyperscalers and OEMs are willing to spend, with its Instinct MI300X accelerator lineup for AI and HPC. Despite having decent hardware, the company is nowhere close to bridging the software gap with its competitor, NVIDIA. According to the latest report from SemiAnalysis, a research and consultancy firm, which ran a five-month experiment using the Instinct MI300X for training and benchmark runs, the findings were surprising: even with better hardware on paper, AMD's software stack, including ROCm, massively degraded the MI300X's real-world performance.

"When comparing NVIDIA's GPUs to AMD's MI300X, we found that the potential on paper advantage of the MI300X was not realized due to a lack within AMD public release software stack and the lack of testing from AMD," noted SemiAnalysis, breaking down arguments in the report further, adding that "AMD's software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience."
NVIDIA has a massive advantage in that its software is fully functional. "As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates," noted the SemiAnalysis report. Tinygrad, the developer of the Tinybox and Tinybox Pro, which has itself run into massive issues with AMD software in the past, has confirmed this multiple times on its X profile.

When comparing the AMD Instinct MI300X against NVIDIA's H100/H200 chips from 2023, the MI300X emerges as a clear winner on paper. It reaches 1,307 TFLOP/s for FP16 calculations, surpassing NVIDIA's H100, which delivers 989 TFLOP/s. The MI300X has 192 GB of HBM3 memory and a memory bandwidth of 5.3 TB/s. These specifications compare favourably even to NVIDIA's H200, which offers 141 GB of HBM3e memory and 4.8 TB/s of memory bandwidth. The AMD chip also promises a lower total cost of ownership, with networking alone around 40% cheaper. On paper, the AMD chip looks superior to NVIDIA's Hopper offerings; in reality, not so much.
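The on-paper gap quoted above reduces to a few simple ratios. A minimal sketch using only the figures cited in the article (vendor paper specs, not measured results):

```python
# Paper specs as quoted in the article (not measured performance).
MI300X_FP16_TFLOPS = 1307
H100_FP16_TFLOPS = 989
MI300X_HBM_GB, MI300X_BW_TBS = 192, 5.3
H200_HBM_GB, H200_BW_TBS = 141, 4.8

# Relative advantage of the MI300X over the named NVIDIA chip.
fp16_advantage = MI300X_FP16_TFLOPS / H100_FP16_TFLOPS - 1  # vs H100
mem_advantage = MI300X_HBM_GB / H200_HBM_GB - 1             # vs H200
bw_advantage = MI300X_BW_TBS / H200_BW_TBS - 1              # vs H200

print(f"FP16 throughput vs H100:  +{fp16_advantage:.0%}")   # roughly +32%
print(f"HBM capacity vs H200:     +{mem_advantage:.0%}")    # roughly +36%
print(f"Memory bandwidth vs H200: +{bw_advantage:.0%}")     # roughly +10%
```

On those numbers the MI300X leads in every column, which is why the report frames the shortfall as a software problem rather than a hardware one.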

AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack. Tensorwave, which is among the largest cloud providers of AMD GPUs, gave AMD engineers on-demand access to its own GPU boxes, free of charge, just so the software could be fixed. In other words, Tensorwave paid AMD for GPUs and then rented them back to AMD for nothing. Finally, SemiAnalysis notes that the AMD software stack has since improved based on its suggestions, but there is still a long way to go before the company reaches the stability and performance of NVIDIA's CUDA. For the detailed analysis, visit the SemiAnalysis report.
Source: SemiAnalysis

11 Comments on AMD's Pain Point is ROCm Software, NVIDIA's CUDA Software is Still Superior for AI Development: Report

#1
Daven
Not retaining your own hardware for internal development is a big mistake. My company does the same thing, selling everything we produce. This severely limits any opportunity to take market share from competitors.

Finally, all hardware companies need to become software companies. Engineers and black box management are stuck in the past.

Edit: oh and the article didn’t say moat enough…moat.
#2
dj-electric
Idk why my brain works like this, but upon reading the title I thought AMD has an SOC called Pain Point.
#3
Timbaloo
dj-electric: Idk why my brain works like this, but upon reading the title I thought AMD has an SOC called Pain Point.
So I was not the only one
#4
john_
So, they still haven't learned.
Maybe if their share price drops down to $50?
#5
phanbuey
they need to do what they did with Xilinx and partner->acquire an AI software company.
#6
hsew
AleksandarK: Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took their own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. This is all while Tensorwave paid for AMD GPUs, renting their own GPUs back to AMD free of charge.
Brutal. Meanwhile nVidia has Jetson Dev kits that anyone can buy for under $300. How does AMD justify this?
#7
TPUnique
Damn, that's pretty bad. I want to start dipping my toes into ML projects starting next year, and was looking forward to potentially getting a Strix Halo platform. Guess I'll put this plan on hold. And get an Intel build as an interim product, since I really don't want to support nVidia's practice of giving as little VRAM as possible for as much as they can possibly charge.
#8
Neo_Morpheus
AleksandarK: AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack. Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took their own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. This is all while Tensorwave paid for AMD GPUs, renting their own GPUs back to AMD free of charge.
Man, if true (not doubting, but these days lots of media love to distort things and the new normal is to only publish anti-AMD articles and news), this is beyond f*cked up on AMD's part.

But I will admit, it sounds way too crazy to be real.

Funny enough, I bumped into this:

blog.runpod.io/amd-mi300x-vs-nvidia-h100-sxm-performance-comparison-on-mixtral-8x7b-inference/

And someone who does work with MI300 hardware and ROCm posted this:



Mere coincidence that both use the appropriate name Ngreedia. :D
#9
bug
The "hey, we're the good guys because OSS" argument doesn't hold when there's $$$ at stake, it would seem.

Funny enough, at some point I believe AMD hardware was actually superior when it came to compute. However, what matters is the complete stack.
#10
ymdhis
dj-electric: Idk why my brain works like this, but upon reading the title I thought AMD has an SOC called Pain Point.
It would be funny if one company decided to use that theme for codenames. Pain Point, followed by Torture Point, followed by Suffering Point, followed by Guillotine Point followed by Homicide Point followed by Genocide point, etc...
#11
R-T-B
bug: Funny enough, at some point I believe AMD hardware was actually superior when it came to compute.
Was for a bit, for crypto compute mainly. Because everyone and their dog wrote up cheap mining programs in OpenCL...