AMD MI300X Accelerators are Competitive with NVIDIA H100, Crunch MLPerf Inference v4.1

btarunr · Aug 29, 2024

The MLCommons consortium on Wednesday posted MLPerf Inference v4.1 benchmark results for popular AI inferencing accelerators on the market, across brands that include NVIDIA, AMD, and Intel. AMD's Instinct MI300X accelerators emerged competitive with NVIDIA's "Hopper" H100 series AI GPUs. AMD also used the opportunity to showcase the kind of AI inferencing performance uplifts customers can expect from its next-generation EPYC "Turin" server processors powering these MI300X machines. "Turin" features "Zen 5" CPU cores with a 512-bit FPU datapath and improved performance in AI-relevant 512-bit SIMD instruction sets such as AVX-512 and VNNI. The MI300X, on the other hand, banks on the strengths of its memory sub-system, FP8 data format support, and efficient KV cache management.

The MLPerf Inference v4.1 results covered here focus on the 70 billion-parameter LLaMA2-70B model. AMD's submissions included machines featuring the Instinct MI300X, powered by the current EPYC "Genoa" (Zen 4) and the next-gen EPYC "Turin" (Zen 5). The GPUs are backed by AMD's ROCm open-source software stack. The benchmark evaluated inference performance using 24,576 Q&A samples from the OpenOrca dataset, with each sample containing up to 1024 input and output tokens. Two scenarios were assessed: the offline scenario, which focuses on batch processing to maximize throughput in tokens per second, and the server scenario, which simulates real-time queries with strict latency limits (TTFT ≤ 2 seconds, TPOT ≤ 200 ms). Together, the two scenarios show how the hardware handles both high-throughput and low-latency workloads.
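
To make the two server-scenario limits concrete, here is a minimal Python sketch that checks a single query against them. TTFT is taken as the time from sending the query to the first output token, and TPOT as the average gap between output tokens after the first. The `QueryTrace` record and its fields are hypothetical and not part of the MLPerf harness, which applies its limits across the whole query population rather than to one query.

```python
from dataclasses import dataclass
from typing import List

TTFT_LIMIT_S = 2.0    # time-to-first-token limit quoted in the article
TPOT_LIMIT_S = 0.200  # time-per-output-token limit quoted in the article


@dataclass
class QueryTrace:
    """Hypothetical record of one query (all times in seconds)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token


def ttft(trace: QueryTrace) -> float:
    """Delay between sending the query and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: QueryTrace) -> float:
    """Average gap between output tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def meets_limits(trace: QueryTrace) -> bool:
    """True if this single query satisfies both latency limits."""
    return ttft(trace) <= TTFT_LIMIT_S and tpot(trace) <= TPOT_LIMIT_S


# Made-up traces for illustration only.
fast = QueryTrace(sent_at=0.0, token_times=[1.5, 1.6, 1.7, 1.8])
slow = QueryTrace(sent_at=0.0, token_times=[2.5, 2.6, 2.7, 2.8])
print(meets_limits(fast), meets_limits(slow))
```

The second trace fails only because its first token arrives after the 2-second mark, even though its per-token pace is identical to the first.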

AMD's first submission (4.1-0002) is a server featuring 2P EPYC 9374F "Genoa" processors and 8x Instinct MI300X accelerators. Here, the machine clocks 21,028 tokens/sec in the server test, compared to 21,605 tokens/sec scored by an NVIDIA DGX H100 machine combining 8x H100 GPUs with Intel Xeon processors. In the offline test, the AMD machine scores 23,514 tokens/sec compared to 24,525 tokens/sec for the NVIDIA+Intel machine. AMD also tested the 8x MI300X with a pair of EPYC "Turin" (Zen 5) processors of comparable core-counts, and gained on NVIDIA, edging ahead in the server test with 22,021 tokens/sec and narrowing the offline gap with 24,110 tokens/sec. AMD claims it is achieving near-linear scaling in performance between 1x MI300X and 8x MI300X, which speaks for AMD's platform I/O and memory management chops.
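
The relative standings are easier to read as percentages. The short Python sketch below uses only the tokens/sec figures quoted above; the dictionary keys are shorthand labels chosen for this example, not official submission names.

```python
# Tokens/sec figures as quoted in the article.
results = {
    "NVIDIA + Xeon": {"server": 21605, "offline": 24525},
    "MI300X + Genoa": {"server": 21028, "offline": 23514},
    "MI300X + Turin": {"server": 22021, "offline": 24110},
}


def relative_to(baseline: str, system: str, scenario: str) -> float:
    """Throughput of `system` as a percentage of `baseline` in one scenario."""
    return 100.0 * results[system][scenario] / results[baseline][scenario]


for system in ("MI300X + Genoa", "MI300X + Turin"):
    for scenario in ("server", "offline"):
        pct = relative_to("NVIDIA + Xeon", system, scenario)
        print(f"{system:15s} {scenario:8s} {pct:6.1f}% of NVIDIA + Xeon")

# CPU-only uplift: same GPUs, Turin instead of Genoa.
for scenario in ("server", "offline"):
    uplift = relative_to("MI300X + Genoa", "MI300X + Turin", scenario) - 100.0
    print(f"Turin over Genoa, {scenario:8s} {uplift:+.1f}%")
```

By these numbers the Genoa machine lands within a few percent of the NVIDIA system in both scenarios, and the Turin machine is slightly ahead in the server test while still trailing offline.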

AMD's results bode well for newer versions of the model, such as LLaMA 3.1 with its gargantuan 405 billion parameters. Here, the MI300X's 192 GB of HBM3 with 5.3 TB/s of memory bandwidth comes in really handy. This earned AMD a partnership with Meta to power LLaMA 3.1 405B. An 8x MI300X blade packs 1.5 TB of memory with over 42 TB/s of memory bandwidth, with Infinity Fabric handling the interconnectivity. A single server is able to accommodate the entire LLaMA 3.1 405B model using the FP16 data type.
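
The single-server claim can be checked with back-of-the-envelope arithmetic. The Python sketch below counts model weights only, at 2 bytes per FP16 parameter, and uses decimal gigabytes; the KV cache, activations, and runtime overhead need additional memory that is not modeled here.

```python
PARAMS = 405e9            # LLaMA 3.1 405B parameter count
BYTES_PER_PARAM_FP16 = 2  # FP16 stores each parameter in 2 bytes
HBM_PER_GPU_GB = 192      # per-MI300X capacity quoted in the article
BW_PER_GPU_TBS = 5.3      # per-MI300X bandwidth quoted in the article
GPUS = 8

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
total_hbm_gb = HBM_PER_GPU_GB * GPUS
total_bw_tbs = BW_PER_GPU_TBS * GPUS
headroom_gb = total_hbm_gb - weights_gb

print(f"FP16 weights:        {weights_gb:.0f} GB")
print(f"HBM across 8 GPUs:   {total_hbm_gb} GB")
print(f"Aggregate bandwidth: {total_bw_tbs:.1f} TB/s")
print(f"Left for KV cache and activations: {headroom_gb:.0f} GB")
```

The weights come to roughly 810 GB against 1,536 GB of combined HBM, which is consistent with the article's 1.5 TB and 42 TB/s figures.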

yfn_ratchet · Aug 29, 2024

I think the main selling point here is going to be deployment + running costs. If this can consistently be cheaper to deploy and run than Nvidia proportionally, then there's definitely something here. If not, they're still chasing coattails as far as I'm concerned.

Prima.Vera · Aug 29, 2024

Good. nGreedia's monopoly must be challenged.

las · Aug 29, 2024

How does it fare vs Blackwell B200 tho? H100 is old news at this point

john_ · Aug 29, 2024

las said:
How does it fare vs Blackwell B200 tho? H100 is old news at this point

From what I can understand, B200's advantage is FP4 support.
I have no idea about compute tasks, but I think this is the equivalent advantage to DLSS in gaming. I was reading that Nvidia says that their FP4 is very accurate thanks to their software.

W1zzard · Aug 29, 2024

john_ said:
I was reading that Nvidia says that their FP4 is very accurate thanks to their software.

I don't think any FP4 is better than the other one? but still, having hardware support for it can be useful.

In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"

mb194dc · Aug 29, 2024

Nvidia probably already sold enough ML hardware for the next 10 years or even longer. Given the lack of really decent use cases and fundamental flaws with the technology.

By the time AMD get them on the market it won't be there anymore?

Assimilator · Aug 29, 2024

Weird AMD, why didn't you show us H100 running with AMD CPUs? And why did you test with H100 when B200 is available? It's almost like you're trying to skew this to make you look better... AGAIN.

W1zzard said:
In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"

Oh boy...

TheToi · Aug 29, 2024

I wonder why they use llama 2 on their benchmark, llama 3 was released a moment ago already and since a month we are at llama 3.1

ncrs · Aug 29, 2024

TheToi said:
I wonder why they use llama 2 on their benchmark, llama 3 was released a moment ago already and since a month we are at llama 3.1

It's because they are using the MLPerf Inference benchmark suite which specifies certain models at locked versions for reproducibility.

AnotherReader · Aug 29, 2024

W1zzard said:
I don't think any FP4 is better than the other one? but still, having hardware support for it can be useful.

In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"

The ISA reference for MI300 includes instructions that operate on BF16 data.

evernessince · Aug 29, 2024

mb194dc said:
Nvidia probably already sold enough ML hardware for the next 10 years or even longer. Given the lack of really decent use cases and fundamental flaws with the technology.

By the time AMD get them on the market it won't be there anymore?

AI is used in the engineering, medical, and artistic fields and is already indispensable to them. TSMC and its customers themselves use AI to improve photo-lithography masks, and chip design is aided by AI.

The AI bubble may "pop" at some point similar to the dotcom bubble but what's left behind will still be significant just the same as the dotcom bubble.

Tomorrow · Aug 29, 2024

las said:
How does it fare vs Blackwell B200 tho? H100 is old news at this point

From Nvidia's own benchmarks the difference is like 20k vs 30k but B200 also uses 1000W compared to 700W for H100 (that MI300X matches according to Nvidia's slides) and 750W for MI300X itself.

Assimilator said:
Weird AMD, why didn't you show us H100 running with AMD CPUs? And why did you test with H100 when B200 is available? It's almost like you're trying to skew this to make you look better... AGAIN.

But, but why did Nvidia in their benchmarks use Xeon and not Epyc?
Could it be that they're NOT obliged to use competitors hardware, just like AMD?
It makes sense for AMD to test with their own CPU if they have the solution.
It's the same reason B200 has ARM and Nvidia fused together. Not Xeon and Nvidia.

W1zzard · Aug 29, 2024

AnotherReader said:
The ISA reference for MI300 includes instructions that operate on BF16 data.

Thanks! However, BF16 == BFLOAT16 != Block Float 16. Some additional context here: https://www.techpowerup.com/review/amd-zen-5-technical-deep-dive/7.html
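
For readers following the format discussion, here is a rough Python sketch of the difference. bfloat16 keeps a private exponent for every value (1 sign, 8 exponent, 7 mantissa bits), while a block floating-point format stores one shared exponent for a whole block plus a small integer mantissa per value. The 8-bit mantissa width and the block contents are illustrative assumptions and not the parameters of any vendor's Block FP16 format; the bfloat16 conversion here truncates instead of rounding, for simplicity.

```python
import math
import struct
from typing import List, Tuple


def to_bfloat16(x: float) -> float:
    """Keep only the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_block_float(values: List[float], mantissa_bits: int = 8) -> Tuple[int, List[int]]:
    """Quantize a block to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Pick the exponent so the largest magnitude fits the mantissa range.
    shared_exp = math.frexp(max_abs)[1] - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas


def from_block_float(shared_exp: int, mantissas: List[int]) -> List[float]:
    """Reconstruct approximate values from the shared exponent and mantissas."""
    return [m * 2.0 ** shared_exp for m in mantissas]


block = [1.0, 0.5, 0.001]
exp, mants = to_block_float(block)
print("bfloat16:   ", [to_bfloat16(v) for v in block])
print("block float:", from_block_float(exp, mants))
```

With a shared exponent, the smallest value in the block is flushed to zero because the exponent is set by the largest value, whereas bfloat16 keeps it to within its mantissa precision. The upside of the block format is storage: one exponent is amortized over the whole block.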

AnotherReader · Aug 29, 2024

W1zzard said:
Thanks! However, BF16 == BFLOAT16 != Block Float 16. Some additional context here: https://www.techpowerup.com/review/amd-zen-5-technical-deep-dive/7.html

Thanks for explaining the difference between these three formats. I believe you're correct about block float 16 being unsupported; there are no references to it in the MI300's ISA documentation.

Patriot · Aug 29, 2024

las said:
How does it fare vs Blackwell B200 tho? H100 is old news at this point

B100/200 should be faster than mi300... but this is old news to old news. Mi300 has been being deployed into el Capitan since june'23. Mi325x will be going against the b100/200 and should both show up this fall. I still expect b100/200 to win on fp4 inference workloads but mi325x will still be competitive overall given how much faster the mi300 was. Also Nvidia essentially gave up competing on FP64 workloads.

W1zzard · Aug 29, 2024

AnotherReader said:
Thanks for explaining the difference between these three formats. I believe you're correct about block float 16 being unsupported; there are no references to it in the MI300's ISA documentation.

It is quite exotic, but has interesting properties, and it was also an opportunity for AMD to talk more about formats, relevance, maybe some other innovations they've added .. but nope

Minus Infinity · Aug 30, 2024

las said:
How does it fare vs Blackwell B200 tho? H100 is old news at this point

AMD will have MI350 to compete against those soon enough. Performance to price ratio is far higher though for AMD and Intel.

las · Aug 30, 2024

Minus Infinity said:
AMD will have MI350 to compete against those soon enough. Performance to price ratio is far higher though for AMD and Intel.

Except that companies need a complete solution like Nvidia is providing, not just a GPU that performs well in a cherrypicked benchmark.

This is why AMD bought up ZT Systems for 5 billions, they want to provide a complete solution, right now they are just providing a GPU.

And this is why Nvidia is king of AI. Lets see if AMD gets on the train before it leaves.

Patriot · Aug 30, 2024

las said:
Except that companies need a complete solution like Nvidia is providing, not just a GPU that performs well in a cherrypicked benchmark.

This is why AMD bought up ZT Systems for 5 billions, they want to provide a complete solution, right now they are just providing a GPU.

And this is why Nvidia is king of AI. Lets see if AMD gets on the train before it leaves.

El Capitan my Capitan. Yes nvidia leads in software development and pushing new non-industry standards that lock you into their ecosystem. Thankfully AMD has been fighting back with consortiums and uses standards like OAM so that you can use their gpus' in future systems they develop to compete against the DGX or in any partner system that uses OAM. Ala HPE, Dell, Supermicro... etc etc.

ML Perf is a bit cherry picked, it heavily favors nvidia as they have hundreds of engineers tuning for it, most workloads do not use FP8 or FP4 yet that is what Nvidia pushes. Blackwell decimates these mi300x results and will allegedly be shipping by year's end. But again, supertuned. The mi325x will not win in throughput, it is expected to bring a 20-30% perf uplift but has a memory density advantage which will allow it to run more on single gpus and at higher precisions. 288GB HBM3e. Mi350x may be out by year's end but is more likely shipping next year, and will bring FP4 support to AMD. I don't see how their claim of 35x inference improvement over mi300 will be true but I am guessing it has to do with memory constrained models.

Nvidia is king because they have a trapped ecosystem, but the industry is rebelling. There is very little that you cannot run on AMD mi300x's natively from hugging face. Almost all new models can be run natively without a hipify conversion. The memory advantage AMD has is pretty extreme, to the point that Meta has worked with AMD for day zero support of their insane model sizes.

So, why build a server that supports SXM when NVidia wants to take your customers and sell them DGX's ?
When you can build an OAM server that supports... Intels Gaudi and Max gpus, or AMD gpus or all the banned Chinese accelerators lol.
AMD is on the train, the limit is TSMC fab time. For everyone really.

las · Aug 30, 2024

Patriot said:
El Capitan my Capitan. Yes nvidia leads in software development and pushing new non-industry standards that lock you into their ecosystem. Thankfully AMD has been fighting back with consortiums and uses standards like OAM so that you can use their gpus' in future systems they develop to compete against the DGX or in any partner system that uses OAM. Ala HPE, Dell, Supermicro... etc etc.

ML Perf is a bit cherry picked, it heavily favors nvidia as they have hundreds of engineers tuning for it, most workloads do not use FP8 or FP4 yet that is what Nvidia pushes. Blackwell decimates these mi300x results and will allegedly be shipping by year's end. But again, supertuned. The mi325x will not win in throughput, it is expected to bring a 20-30% perf uplift but has a memory density advantage which will allow it to run more on single gpus and at higher precisions. 288GB HBM3e. Mi350x may be out by year's end but is more likely shipping next year, and will bring FP4 support to AMD. I don't see how their claim of 35x inference improvement over mi300 will be true but I am guessing it has to do with memory constrained models.

Nvidia is king because they have a trapped ecosystem, but the industry is rebelling. There is very little that you cannot run on AMD mi300x's natively from hugging face. Almost all new models can be run natively without a hipify conversion. The memory advantage AMD has is pretty extreme, to the point that Meta has worked with AMD for day zero support of their insane model sizes.

So, why build a server that supports SXM when NVidia wants to take your customers and sell them DGX's ?
When you can build an OAM server that supports... Intels Gaudi and Max gpus, or AMD gpus or all the banned Chinese accelerators lol.
AMD is on the train, the limit is TSMC fab time. For everyone really.

Yeah AMD likes to play the good guy, till they don't.

Nvidia is king because they deliver what companies actually look for. AMD don't, they just provide a GPU, with no CUDA support as Nvidia invented that. AMD has AI GPUs on paper but in reality, Nvidia stands for 90% of AI GPU shipments.

If AMD were actually competitive in AI, their valuation would have exploded like Nvidias.

Patriot · Aug 30, 2024

las said:
Yeah AMD likes to play the good guy, till they don't.

Nvidia is king because they deliver what companies actually look for. AMD don't, they just provide a GPU, with no CUDA support as Nvidia invented that. AMD has AI GPUs on paper but in reality, Nvidia stands for 90% of AI GPU shipments.

If AMD were actually competitive in AI, their valuation would have exploded like Nvidias.

This may shock you, but you don't need cuda to run workloads on a gpu. I snuck a little joke in the first line and it cleared the treetops it was so far over your head. El-capitan is set to be the first 2+ exaflop supercomputer running on mi300A apus. The current top supercomputer is frontier on mi250x's. AMD is selling as many as they can make, the limit is TSMC not demand. In the past few years there has been a shift to hardware agnostic software, rather than cuda first, for those that still put cuda first, hipify exists to convert the code.

las · Sep 2, 2024

Patriot said:
This may shock you, but you don't need cuda to run workloads on a gpu. I snuck a little joke in the first line and it cleared the treetops it was so far over your head. El-capitan is set to be the first 2+ exaflop supercomputer running on mi300A apus. The current top supercomputer is frontier on mi250x's. AMD is selling as many as they can make, the limit is TSMC not demand. In the past few years there has been a shift to hardware agnostic software, rather than cuda first, for those that still put cuda first, hipify exists to convert the code.

Keep believing that, meanwhile Nvidia sits on 98% of the AI market

Lets see if AMD releases something good before AI hype dies out

If AMD actually had something truly competitive in the AI and Enterprise market, their stock value would reflect it - Hint: Look at Nvidia stock

AMD admits its Instinct MI300X AI accelerator still can't quite beat Nvidia's H100 Hopper

On Wednesday, AMD released benchmarks comparing the performance of its MI300X with Nvidia's H100 GPU to showcase its Gen AI inference capabilities. For the LLama2-70B model, a...

www.techspot.com

Even AMD know they are way behind and H100 is old news
