Monday, December 23rd 2024

AMD's Pain Point is ROCm Software, NVIDIA's CUDA Software is Still Superior for AI Development: Report

The battle for AI acceleration in the data center is, as most readers are aware, insanely competitive, with NVIDIA offering a top-tier software stack. In recent years, AMD has tried to capture part of the revenue that hyperscalers and OEMs are willing to spend, using its Instinct MI300X accelerator lineup for AI and HPC. Despite having decent hardware, the company is not close to bridging the software gap with its competitor, NVIDIA. According to its latest report, SemiAnalysis, a research and consultancy firm, ran a five-month experiment using the Instinct MI300X for training and benchmark runs. The findings were surprising: even with better hardware on paper, AMD's software stack, including ROCm, massively degraded the MI300X's performance.

"When comparing NVIDIA's GPUs to AMD's MI300X, we found that the potential on paper advantage of the MI300X was not realized due to a lack within AMD public release software stack and the lack of testing from AMD," noted SemiAnalysis, breaking down arguments in the report further, adding that "AMD's software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience."
NVIDIA has a massive advantage in that its software is fully functional. "As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates," noted the SemiAnalysis report. Tinygrad, the developer of the Tinybox and Tinybox Pro, which itself had massive issues with AMD software in the past, has also confirmed this multiple times on its X profile.

On specifications alone, the AMD Instinct MI300X emerges as a clear winner over NVIDIA's H100/H200 chips from 2023. It reaches 1,307 TFLOP/s for FP16 calculations, surpassing NVIDIA's H100, which delivers 989 TFLOP/s. The MI300X has 192 GB of HBM3 memory and a memory bandwidth of 5.3 TB/s. These specifications compare favourably even with NVIDIA's H200, which offers 141 GB of HBM3e memory and 4.8 TB/s of memory bandwidth. The AMD chip also has a lower total cost of ownership, with networking alone being 40% cheaper. On paper, the AMD chip looks superior to NVIDIA's Hopper offerings, but in reality, not so much.
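To put the paper advantage in perspective, the short Python sketch below turns the figures quoted above into simple ratios. It uses only the peak numbers cited in this article; the dictionary layout and the `advantage` helper are illustrative names made up for this example, and the results say nothing about real-world training throughput.

```python
# Peak figures as quoted in the article (vendor specifications, not measurements).
SPECS = {
    "MI300X": {"fp16_tflops": 1307, "memory_gb": 192, "bandwidth_tbs": 5.3},
    "H100": {"fp16_tflops": 989},
    "H200": {"memory_gb": 141, "bandwidth_tbs": 4.8},
}


def advantage(metric, baseline, chip="MI300X"):
    """Return how many times larger the chip's figure is than the baseline's."""
    return SPECS[chip][metric] / SPECS[baseline][metric]


fp16_vs_h100 = advantage("fp16_tflops", "H100")
memory_vs_h200 = advantage("memory_gb", "H200")
bandwidth_vs_h200 = advantage("bandwidth_tbs", "H200")

print(f"FP16 throughput vs H100:  {fp16_vs_h100:.2f}x")
print(f"Memory capacity vs H200:  {memory_vs_h200:.2f}x")
print(f"Memory bandwidth vs H200: {bandwidth_vs_h200:.2f}x")
```

On these numbers, the MI300X leads by roughly a third in FP16 throughput and memory capacity, and by about a tenth in memory bandwidth, which is the theoretical gap that the software stack fails to deliver according to the report.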

AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack. Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took its own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. In other words, Tensorwave paid AMD for the GPUs and then lent them back to AMD at no cost. Finally, SemiAnalysis has noted that the AMD software stack has been improved based on its suggestions. Still, there is a long way to go before the company reaches NVIDIA's CUDA level of stability and performance. For detailed analysis, see the SemiAnalysis report.
Source: SemiAnalysis

33 Comments on AMD's Pain Point is ROCm Software, NVIDIA's CUDA Software is Still Superior for AI Development: Report

#26
Patriot
Visible Noise said: I don’t get into pointless arguments with apologists of any brand.

Later.
I am sorry that people with relevant knowledge and expertise intimidate you, that is a sad way to live.
#27
Visible Noise
Patriot said: I am sorry that people with relevant knowledge and expertise intimidate you, that is a sad way to live.
Nah, I just don’t deal with people that pull an appeal to authority, especially when they claim they are the authority and their post history shows it’s evident they are on a team.

Merry Christmas!
#28
Neo_Morpheus
Not sure if related but bumped into this today on X:


#29
igormp
Patriot said: Negative. The article focused only on training models using predefined containers, not pretrained models for inference.
It is super super easy to run models on mi100/250x/300x 7900gre/xt/xtx

I read as far as the paywall goes.
I also... have used ROCm since vega64/mi25
I also... have used Cuda since K80/GTX690

I currently run a hive of mi100s, and a sxm v100 box.

When it comes to inference, MI300x gets day0 support. Training is very lacking and Nvidia's deep bench of software engineers shows.
I expect part 2 of the article to be a bit different.

I am fully aware of the shortcomings of AMD's ecosystem, but I am also aware of its strengths.
And the ability to just grab containers and go exists... hugging face is full of native containers for ROCm, Hipify can convert most* things that are cuda native, albeit at a performance penalty.
But when it comes to inference AMD is not a 2nd class citizen. It has full support with triton, and flash attention...
And Llama 405b fp16 launched exclusively on mi300x, most likely due to the ram requirements.
Once it was quantized down to fp8, it could fit on 8x h100 80gb, but at launch, Meta and AMD announced together that all Meta 405b live instances were run on mi300x.
If that is still true or was just a limited exclusivity while it was quantized down... idk...

But claiming things like a mi300x can't run OOB models is just... ignorant af, and not even what the article claims.
It claims bad training performance and strange bugs and as a user of the ecosystem... yup. AMD has strange bugs.
They have known lockups for multi gpu instances... and the solution is to run additional grub parameters, perfectly stable with iommu=pt, randomly hangs without.
But all this information is in the tuning guides. The install process is easy, and hugging face is full of models to run.
I mean, FA2 only got supported on AMD GPUs recently. Even though pytorch does include ROCm support OOB nowadays, you often face issues not found with CUDA.
ROCm's performance is way subpar still, achieving like a fraction of its theoretical performance (both in terms of memory bandwidth and also FLOPs).

It is still clearly a second class citizen, but it's the second class citizen. As soon as something comes out (defaulting to CUDA, of course), people immediately get their hands on it and try to port it to ROCm.
The strides it has made in the past years are really impressive. I remember trying it out with an rx480 back then, and immediately buying a 1050ti to replace it. Nowadays it's not 100% (nor that close), but you sure can get your hands dirty and at least get something working out of it.

As for lockups and hangs, eh, I've heard this quite a lot from some folks that do work with many AMD GPUs, but it's also not that uncommon in the Nvidia world either (albeit to a lesser degree). Just get a GH200 (lambdalabs even has those with a discount for now) and have some fun locking up your machine trying to use their so called "unified" memory haha
#30
Visible Noise
x.com/dylan522p/status/1871287937268383867?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1871287937268383867%7Ctwgr%5E23a72326cfeed5bd11d70d926a3958201c955173%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Fforum.beyond3d.com%2Fthreads%2FnvidiaE28099s-machine-learning-ecosystem-advantage.63893%2F


You get an hour and a half with the CEO. Then she spends the next hour and a half tearing someone a new one for getting surprised by the media.

Heads need to roll in AMDs software group.
#31
tommo1982
Visible Noise said: x.com/dylan522p/status/1871287937268383867?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1871287937268383867%7Ctwgr%5E23a72326cfeed5bd11d70d926a3958201c955173%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Fforum.beyond3d.com%2Fthreads%2FnvidiaE28099s-machine-learning-ecosystem-advantage.63893%2F


You get an hour and a half with the CEO. Then she spends the next hour and a half tearing someone a new one for getting surprised by the media.

Heads need to roll in AMDs software group.
Do you have a link? I don't have an X or Facebook account.
#32
Visible Noise
tommo1982 said: Do you have a link? I don't have an X or Facebook account.
#33
Patriot
I have mixed feelings on Geohotz. They announced, built, and started shipping something before testing it... And then demanded that AMD open-source the FW on desktop cards that they were shipping for enterprise workloads. They weren't even using the workstation cards. Looks like they also demanded $1M in test boxes.
Kinda nuts demanding enterprise support for desktop cards. That said, I would like AMD to follow through on promises. All they have done so far is the first half of what they promised and handed out a guide on how to talk to the fw better.