
AMD's Pain Point is ROCm Software, NVIDIA's CUDA Software is Still Superior for AI Development: Report

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,657 (0.99/day)
The battle of AI acceleration in the data center is, as most readers are aware, insanely competitive, with NVIDIA offering a top-tier software stack. However, AMD has tried in recent years to capture a part of the revenue that hyperscalers and OEMs are willing to spend with its Instinct MI300X accelerator lineup for AI and HPC. Despite having decent hardware, the company is not close to bridging the gap software-wise with its competitor, NVIDIA. SemiAnalysis, a research and consultancy firm, reports that it ran a five-month experiment using the Instinct MI300X for training and benchmark runs. The findings were surprising: even with better hardware, AMD's software stack, including ROCm, massively degraded AMD's performance.

"When comparing NVIDIA's GPUs to AMD's MI300X, we found that the potential on paper advantage of the MI300X was not realized due to a lack within AMD public release software stack and the lack of testing from AMD," noted SemiAnalysis, breaking down arguments in the report further, adding that "AMD's software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience."



NVIDIA has a massive advantage in that its software is fully functional. "As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates," noted the SemiAnalysis report. Tinygrad, the developer of Tinybox and Tinybox Pro, which had massive issues with AMD software in the past, has also confirmed this multiple times on its X profile.

Comparing the AMD Instinct MI300X with NVIDIA's H100/H200 chips from 2023, the MI300X emerges as a clear winner performance-wise. It reaches 1,307 TFLOP/s for FP16 calculations, surpassing NVIDIA's H100, which delivers 989 TFLOP/s. The MI300X has 192 GB of HBM3 memory and a memory bandwidth of 5.3 TB/s. These specifications compare favourably even with NVIDIA's H200, which offers 141 GB of HBM3e memory and 4.8 TB/s of memory bandwidth. The AMD chip also has a lower total cost of ownership, with networking alone being 40% cheaper. On paper, the AMD chip looks superior to NVIDIA's Hopper offerings, but in reality, not so much.
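To put the on-paper gap in perspective, the spec figures quoted above can be turned into percentage leads. This is only a rough sketch: the dictionary keys and the `advantage` helper are names made up for illustration, and the numbers are peak specifications from the article, not measured training throughput.

```python
# Peak specs as quoted in the article (not measured performance).
mi300x = {"fp16_tflops": 1307, "memory_gb": 192, "bandwidth_tbs": 5.3}
h100_fp16_tflops = 989
h200 = {"memory_gb": 141, "bandwidth_tbs": 4.8}


def advantage(amd: float, nvidia: float) -> float:
    """Return AMD's on-paper lead over NVIDIA as a percentage."""
    return (amd / nvidia - 1) * 100


fp16_lead = advantage(mi300x["fp16_tflops"], h100_fp16_tflops)
memory_lead = advantage(mi300x["memory_gb"], h200["memory_gb"])
bandwidth_lead = advantage(mi300x["bandwidth_tbs"], h200["bandwidth_tbs"])

print(f"FP16 vs H100:      +{fp16_lead:.0f}%")
print(f"Memory vs H200:    +{memory_lead:.0f}%")
print(f"Bandwidth vs H200: +{bandwidth_lead:.0f}%")
```

Roughly a third more FP16 throughput and memory, and about a tenth more bandwidth, is the advantage the report says was not realized in practice.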

AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack. Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took its own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. In other words, Tensorwave paid for AMD GPUs and then lent those same GPUs back to AMD at no cost. Finally, SemiAnalysis noted that the AMD software stack has improved based on its suggestions. Still, there is a long way to go before the company reaches NVIDIA's CUDA level of stability and performance. For the detailed analysis, see the SemiAnalysis report.

 
Joined
Dec 12, 2016
Messages
1,955 (0.67/day)
Not retaining your own hardware for internal development is a big mistake. My company does the same thing, selling everything we produce. This severely limits any opportunity to take market share from competitors.

Finally, all hardware companies need to become software companies. Engineers and black box management are stuck in the past.

Edit: oh and the article didn’t say moat enough…moat.
 
Joined
Sep 6, 2013
Messages
3,393 (0.82/day)
Location
Athens, Greece
System Name 3 desktop systems: Gaming / Internet / HTPC
Processor Ryzen 5 7600 / Ryzen 5 4600G / Ryzen 5 5500
Motherboard X670E Gaming Plus WiFi / MSI X470 Gaming Plus Max (1) / MSI X470 Gaming Plus Max (2)
Cooling Aigo ICE 400SE / Segotep T4 / Νoctua U12S
Memory Kingston FURY Beast 32GB DDR5 6000 / 16GB JUHOR / 32GB G.Skill RIPJAWS 3600 + Aegis 3200
Video Card(s) ASRock RX 6600 + GT 710 (PhysX) / Vega 7 integrated / Radeon RX 580
Storage NVMes, ONLY NVMes / NVMes, SATA Storage / NVMe, SATA, external storage
Display(s) Philips 43PUS8857/12 UHD TV (120Hz, HDR, FreeSync Premium) / 19'' HP monitor + BlitzWolf BW-V5
Case Sharkoon Rebel 12 / CoolerMaster Elite 361 / Xigmatek Midguard
Audio Device(s) onboard
Power Supply Chieftec 850W / Silver Power 400W / Sharkoon 650W
Mouse CoolerMaster Devastator III Plus / CoolerMaster Devastator / Logitech
Keyboard CoolerMaster Devastator III Plus / CoolerMaster Devastator / Logitech
Software Windows 10 / Windows 10&Windows 11 / Windows 10
So, they still haven't learned.
Maybe if their share price drops down to $50?
 
Joined
Nov 13, 2007
Messages
10,847 (1.74/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6400 1:1 CL30-36-36-76 FCLK 2200
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
they need to do what they did with Xilinx and partner->acquire an AI software company.
 
Joined
May 19, 2011
Messages
113 (0.02/day)
Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took their own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. This is all while Tensorwave paid for AMD GPUs, renting their own GPUs back to AMD free of charge.

Brutal. Meanwhile nVidia has Jetson Dev kits that anyone can buy for under $300. How does AMD justify this?
 

TPUnique

New Member
Joined
Dec 17, 2024
Messages
3 (0.43/day)
Damn, that's pretty bad. I want to start dipping my toes into ML projects starting from next year, and was looking forward to potentially getting a Strix Halo platform. Guess I'll put this plan on hold and get an Intel build as an interim product, since I really don't want to support nVidia's practice of giving as little VRAM as possible for as much as they can possibly charge.
 
Joined
Dec 6, 2022
Messages
464 (0.62/day)
Location
NYC
System Name GameStation
Processor AMD R5 5600X
Motherboard Gigabyte B550
Cooling Artic Freezer II 120
Memory 16 GB
Video Card(s) Sapphire Pulse 7900 XTX
Storage 2 TB SSD
Case Cooler Master Elite 120
AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack. Tensorwave, which is among the largest providers of AMD GPUs in the cloud, took their own GPU boxes and gave AMD engineers the hardware on demand, free of charge, just so the software could be fixed. This is all while Tensorwave paid for AMD GPUs, renting their own GPUs back to AMD free of charge.
Man, if true (not doubting, but these days lots of media love to distort things, and the new normal is to only publish anti-AMD articles and news), this is beyond f*cked up on AMD's part.

But I will admit, it sounds way too crazy to be real.

Funny enough, I bumped into this:


And someone who does work with MI300 hardware and ROCm posted this:

mi300.png


Mere coincidence that both use the appropriate name Ngreedia. :D
 

bug

Joined
May 22, 2015
Messages
13,844 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
The "hey, we're the good guys because OSS" argument doesn't hold when there's $$$ at stake, it would seem.

Funny enough, at some point I believe AMD hardware was actually superior when it came to compute. However, what matters is the complete stack.
 
Joined
Jul 29, 2022
Messages
533 (0.61/day)
Idk why my brain works like this, but upon reading the title I thought AMD has an SOC called Pain Point.
It would be funny if one company decided to use that theme for codenames. Pain Point, followed by Torture Point, followed by Suffering Point, followed by Guillotine Point followed by Homicide Point followed by Genocide point, etc...
 
Joined
Aug 20, 2007
Messages
21,544 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Funny enough, at some point I believe AMD hardware was actually superior when it came to compute.
Was for a bit, for crypto compute mainly. Because everyone and their dog wrote up cheap mining programs in OpenCL...
 
Joined
Oct 27, 2009
Messages
1,191 (0.22/day)
Location
Republic of Texas
System Name [H]arbringer
Processor 4x 61XX ES @3.5Ghz (48cores)
Motherboard SM GL
Cooling 3x xspc rx360, rx240, 4x DT G34 snipers, D5 pump.
Memory 16x gskill DDR3 1600 cas6 2gb
Video Card(s) blah bigadv folder no gfx needed
Storage 32GB Sammy SSD
Display(s) headless
Case Xigmatek Elysium (whats left of it)
Audio Device(s) yawn
Power Supply Antec 1200w HCP
Software Ubuntu 10.10
Benchmark Scores http://valid.canardpc.com/show_oc.php?id=1780855 http://www.hwbot.org/submission/2158678 http://ww
I think there is a reason SemiAnalysis focused on training, as AMD has focused on inference performance. Meta trains on H100/H200 and runs the models (inference) exclusively on MI300X.
This both backs up the analysis and shows distortion by not giving the full picture.
AMD needs to work on software to become competitive in training, and there may be architectural limitations that cap its overall training performance (xGMI interconnect architecture).
They definitely need better regression testing and testing in general. They have acquired several AI software companies this year, which may help with this.

So the current reality is...
If you are using off-the-shelf models, the MI300X excels; if you fine-tune those models, AMD excels; if you train from scratch... AMD kinda sucks.
The analysis also fails to grasp the reality of availability... sometimes it's better to have something not as good than nothing.
 
Joined
Nov 6, 2016
Messages
1,777 (0.60/day)
Location
NH, USA
System Name Lightbringer
Processor Ryzen 7 2700X
Motherboard Asus ROG Strix X470-F Gaming
Cooling Enermax Liqmax Iii 360mm AIO
Memory G.Skill Trident Z RGB 32GB (8GBx4) 3200Mhz CL 14
Video Card(s) Sapphire RX 5700XT Nitro+
Storage Hp EX950 2TB NVMe M.2, HP EX950 1TB NVMe M.2, Samsung 860 EVO 2TB
Display(s) LG 34BK95U-W 34" 5120 x 2160
Case Lian Li PC-O11 Dynamic (White)
Power Supply BeQuiet Straight Power 11 850w Gold Rated PSU
Mouse Glorious Model O (Matte White)
Keyboard Royal Kludge RK71
Software Windows 10
Brutal. Meanwhile nVidia has Jetson Dev kits that anyone can buy for under $300. How does AMD justify this?
This insinuates that AMD is PURELY limited by willpower. Is this what mostly everyone believes here? That AMD has access to the same resources that Nvidia does and the only thing limiting AMD is simply "not wanting to do better"? I'm seriously asking....

We all agree Lisa Su is competent, correct? Do any of us actually believe that people are telling her: "We need to do better with our software" and she's like "Ahhh, screw it"?

So what is it then? I imagine it's difficult for them to get hold of and retain talent; Nvidia and Intel can afford to pay them more, and both competitors have far larger R&D budgets. Is that the problem? Is it a workplace "culture" problem? It'd be amazing to hear from someone who has worked there to see if that's the case... If anyone has some educated and informed guesses, I'd love to hear them, because it surely cannot be that AMD is just being "stupid" or something.....but there definitely is a problem or problems
 
Joined
Jan 2, 2019
Messages
155 (0.07/day)
This insinuates that AMD is PURELY limited by willpower. Is this what mostly everyone believes here? That AMD has access to the same resources that Nvidia does and the only thing limiting AMD is simply "not wanting to do better"? I'm seriously asking....

We all agree Lisa Su is competent, correct? Do any of us actually believe that people are telling her: "We need to do better with our software" and she's like "Ahhh, screw it"?

So what is it then? I imagine it's difficult for them to get hold of and retain talent; Nvidia and Intel can afford to pay them more, and both competitors have far larger R&D budgets. Is that the problem? Is it a workplace "culture" problem? It'd be amazing to hear from someone who has worked there to see if that's the case... If anyone has some educated and informed guesses, I'd love to hear them, because it surely cannot be that AMD is just being "stupid" or something.....but there definitely is a problem or problems

Here are my comments as a C/C++ Software Engineer who worked for AMD.

>>...Is it a workplace "culture" problem? It'd be amazing to hear from someone who has worked there to see if that's the case...

I worked for AMD as a contractor. I have very good memories of just a couple of fellow developers, and no good memories of AMD's management overall.

The Environment inside of AMD is Very Toxic.

>>...AMD's software stack, including ROCm, has massively degraded AMD's performance...

Worked with ROCm a lot and I would rate ROCm as A-Piece-of-Over-Complicated-Software-Crap.

>>...MI300X was not realized due to a lack within AMD public release software stack and the lack of testing from AMD...

Not true based on my experience; however, it is possible things have changed since my contract ended.

>>...AMD's software experience is riddled with bugs rendering out of the box training with AMD is impossible...

Partially true since I was able to see how a lot of bugs were Not fixed at all.

>>...We were hopeful that AMD could emerge as a strong competitor to NVIDIA

Not possible due to internal problems with retaining very experienced C/C++ software engineers.

>>...AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience...

Very surprised to read about it since QA was Very Strong when I was working for AMD. It is possible things have changed after my contract was over.

>>...AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack...

Absolutely surprised to read about it. Once again, it is possible things have changed...
 
Joined
May 13, 2010
Messages
6,084 (1.14/day)
System Name RemixedBeast-NX
Processor Intel Xeon E5-2690 @ 2.9Ghz (8C/16T)
Motherboard Dell Inc. 08HPGT (CPU 1)
Cooling Dell Standard
Memory 24GB ECC
Video Card(s) Gigabyte Nvidia RTX2060 6GB
Storage 2TB Samsung 860 EVO SSD//2TB WD Black HDD
Display(s) Samsung SyncMaster P2350 23in @ 1920x1080 + Dell E2013H 20 in @1600x900
Case Dell Precision T3600 Chassis
Audio Device(s) Beyerdynamic DT770 Pro 80 // Fiio E7 Amp/DAC
Power Supply 630w Dell T3600 PSU
Mouse Logitech G700s/G502
Keyboard Logitech K740
Software Linux Mint 20
Benchmark Scores Network: APs: Cisco Meraki MR32, Ubiquiti Unifi AP-AC-LR and Lite Router/Sw:Meraki MX64 MS220-8P
AMD's internal teams have little access to GPU boxes to develop and refine the ROCm software stack.
Stingy af!! And this is why they can't get ahead
 
Joined
Mar 26, 2009
Messages
177 (0.03/day)
I think there is a reason SemiAnalysis focused on training, as AMD has focused on inference performance. Meta trains on H100/H200 and runs the models (inference) exclusively on MI300X.
This both backs up the analysis and shows distortion by not giving the full picture.
AMD needs to work on software to become competitive in training, and there may be architectural limitations that cap its overall training performance (xGMI interconnect architecture).
They definitely need better regression testing and testing in general. They have acquired several AI software companies this year, which may help with this.
Part 2 of the article is going to focus on inference, but the story is not that different there. Companies still prefer NVIDIA for inference for a reason, and no, Meta is not running inference exclusively on MI300.

So the current reality is...
If you are using off-the-shelf models, the MI300X excels; if you fine-tune those models, AMD excels; if you train from scratch... AMD kinda sucks.
Wrong, the article used off-the-shelf models, and the MI300X sucked hard.
 