Thursday, August 29th 2024

AMD MI300X Accelerators are Competitive with NVIDIA H100, Crunch MLPerf Inference v4.1

The MLCommons consortium on Wednesday posted MLPerf Inference v4.1 benchmark results for popular AI inferencing accelerators available in the market, across brands that include NVIDIA, AMD, and Intel. AMD's Instinct MI300X accelerators emerged competitive with NVIDIA's "Hopper" H100 series AI GPUs. AMD also used the opportunity to showcase the kind of AI inferencing performance uplifts customers can expect from its next-generation EPYC "Turin" server processors powering these MI300X machines. "Turin" features "Zen 5" CPU cores, sporting a 512-bit FPU datapath and improved performance in AI-relevant 512-bit SIMD instruction sets such as AVX-512 and VNNI. The MI300X, on the other hand, banks on the strengths of its memory sub-system, FP8 data format support, and efficient KV cache management.

The MLPerf Inference v4.1 benchmark focused on the 70 billion-parameter LLaMA2-70B model. AMD's submissions included machines featuring the Instinct MI300X, powered by the current EPYC "Genoa" (Zen 4), and next-gen EPYC "Turin" (Zen 5). The GPUs are backed by AMD's ROCm open-source software stack. The benchmark evaluated inference performance using 24,576 Q&A samples from the OpenORCA dataset, with each sample containing up to 1024 input and output tokens. Two scenarios were assessed: the offline scenario, focusing on batch processing to maximize throughput in tokens per second, and the server scenario, which simulates real-time queries with strict latency limits (TTFT ≤ 2 seconds, TPOT ≤ 200 ms). This lets you see the chip's mettle in both high-throughput and low-latency queries.
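To make the server-scenario limits concrete, here is a minimal Python sketch that checks a single query against the TTFT and TPOT limits quoted above. The `QueryTiming` fields and the exact TPOT definition (average time per token after the first) are assumptions made for illustration; the benchmark's own harness applies its limits across a whole run, so treat this as a sketch of the idea rather than the official rule.

```python
from dataclasses import dataclass

# Latency limits quoted above for the server scenario.
TTFT_LIMIT_S = 2.0    # time to first token, seconds
TPOT_LIMIT_S = 0.200  # time per output token, seconds


@dataclass
class QueryTiming:
    """Timing of one query. The field names are hypothetical."""
    first_token_s: float  # seconds from request to first output token
    total_s: float        # seconds from request to last output token
    output_tokens: int    # number of tokens generated


def tpot(q: QueryTiming) -> float:
    """Average time per output token after the first (assumed definition)."""
    if q.output_tokens <= 1:
        return 0.0
    return (q.total_s - q.first_token_s) / (q.output_tokens - 1)


def meets_server_limits(q: QueryTiming) -> bool:
    """True if this single query stays within both latency limits."""
    return q.first_token_s <= TTFT_LIMIT_S and tpot(q) <= TPOT_LIMIT_S


def tokens_per_second(queries: list[QueryTiming], wall_clock_s: float) -> float:
    """Throughput as output tokens generated per second of wall-clock time."""
    return sum(q.output_tokens for q in queries) / wall_clock_s


# Made-up timings, not benchmark data.
fast = QueryTiming(first_token_s=1.5, total_s=21.4, output_tokens=200)
slow = QueryTiming(first_token_s=2.5, total_s=10.0, output_tokens=100)
print(meets_server_limits(fast), meets_server_limits(slow))
```

The second query fails only because its first token arrives after the 2-second limit, even though its per-token pace is well within the 200 ms limit.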
AMD's first submission (4.1-0002) is a server featuring 2P EPYC 9374F "Genoa" processors and 8x Instinct MI300X accelerators. Here, the machine clocks 21,028 tokens/sec in the server test, compared to 21,605 tokens/sec scored by an NVIDIA machine combining 8x H100 GPUs with a Xeon processor. In the offline test, the AMD machine scores 23,514 tokens/sec compared to 24,525 tokens/sec for the NVIDIA+Intel machine. AMD also tested the 8x MI300X with a pair of EPYC "Turin" (Zen 5) processors of comparable core-counts, and gained on NVIDIA, with 22,021 server tokens/sec and 24,110 offline tokens/sec: ahead of the NVIDIA machine in the server test, and still slightly behind it in the offline test. AMD claims that it is achieving near-linear scaling in performance between 1x MI300X and 8x MI300X, which speaks to AMD's platform I/O and memory management chops.
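The gaps between these systems are easier to judge as percentages. The short Python sketch below uses only the tokens/sec figures quoted above; the dictionary keys and the `percent_vs` helper are names made up for this example.

```python
# Tokens/sec figures as quoted in the article; labels are illustrative.
RESULTS = {
    "mi300x_genoa": {"server": 21028, "offline": 23514},
    "mi300x_turin": {"server": 22021, "offline": 24110},
    "nvidia_xeon": {"server": 21605, "offline": 24525},
}


def percent_vs(system: str, baseline: str, scenario: str) -> float:
    """Percentage difference of `system` relative to `baseline`."""
    ratio = RESULTS[system][scenario] / RESULTS[baseline][scenario]
    return (ratio - 1.0) * 100.0


for scenario in ("server", "offline"):
    genoa = percent_vs("mi300x_genoa", "nvidia_xeon", scenario)
    turin = percent_vs("mi300x_turin", "nvidia_xeon", scenario)
    uplift = percent_vs("mi300x_turin", "mi300x_genoa", scenario)
    print(f"{scenario}: Genoa {genoa:+.2f}%, Turin {turin:+.2f}% vs NVIDIA; "
          f"Turin over Genoa {uplift:+.2f}%")
```

On these figures, the "Genoa" machine trails the NVIDIA system by roughly 2.7% (server) and 4.1% (offline), while the "Turin" machine leads by about 1.9% in the server test and trails by about 1.7% offline.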

AMD's results bode well for newer versions of the model, such as LLaMA 3.1 with its gargantuan 405 billion parameters. Here, each MI300X's 192 GB of HBM3 with 5.3 TB/s of memory bandwidth comes in really handy. This earned AMD a partnership with Meta to power LLaMA 3.1 405B. An 8x MI300X blade packs 1.5 TB of memory with over 42 TB/s of memory bandwidth, with Infinity Fabric handling the interconnectivity. A single server is able to accommodate the entire LLaMA 3.1 405B model using the FP16 data type.
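The capacity claim can be checked with back-of-the-envelope arithmetic. The Python sketch below assumes 2 bytes per parameter for FP16 and counts model weights only; KV cache, activations, and runtime overhead are deliberately ignored, so the headroom it reports is an upper bound rather than a deployment guide.

```python
GPUS_PER_SERVER = 8
HBM_PER_GPU_GB = 192         # per MI300X, as quoted in the article
BANDWIDTH_PER_GPU_TBS = 5.3  # per MI300X, as quoted in the article
PARAMS = 405e9               # LLaMA 3.1 405B parameter count
BYTES_PER_PARAM_FP16 = 2     # FP16 stores each parameter in 2 bytes


def weights_gb(params: float, bytes_per_param: int) -> float:
    """Size of the model weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


total_hbm_gb = GPUS_PER_SERVER * HBM_PER_GPU_GB
total_bandwidth_tbs = GPUS_PER_SERVER * BANDWIDTH_PER_GPU_TBS
fp16_weights_gb = weights_gb(PARAMS, BYTES_PER_PARAM_FP16)

print(f"Server HBM: {total_hbm_gb} GB, bandwidth: {total_bandwidth_tbs:.1f} TB/s")
print(f"FP16 weights: {fp16_weights_gb:.0f} GB")
print(f"Fits in one server: {fp16_weights_gb <= total_hbm_gb}")
print(f"Left for KV cache and activations: {total_hbm_gb - fp16_weights_gb:.0f} GB")
```

Eight accelerators give 1,536 GB of HBM3 (the article's 1.5 TB), against roughly 810 GB of FP16 weights, which is why the model fits in one 8x MI300X server with room to spare.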

22 Comments on AMD MI300X Accelerators are Competitive with NVIDIA H100, Crunch MLPerf Inference v4.1

#1
yfn_ratchet
I think the main selling point here is going to be deployment + running costs. If this can consistently be cheaper to deploy and run than Nvidia proportionally, then there's definitely something here. If not, they're still chasing coattails as far as I'm concerned.
#2
Prima.Vera
Good. nGreedia's monopoly must be challenged.
#3
las
How does it fare vs Blackwell B200 tho? H100 is old news at this point
#4
john_
las wrote: How does it fare vs Blackwell B200 tho? H100 is old news at this point
From what I can understand, B200's advantage is FP4 support.
I have no idea about compute tasks, but I think this is the equivalent advantage to DLSS in gaming. I was reading that Nvidia says that their FP4 is very accurate thanks to their software.
#5
W1zzard
john_ wrote: I was reading that Nvidia says that their FP4 is very accurate thanks to their software.
I don't think any FP4 is better than the other one? but still, having hardware support for it can be useful.

In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"
#6
mb194dc
Nvidia probably already sold enough ML hardware for the next 10 years or even longer. Given the lack of really decent use cases and fundamental flaws with the technology.

By the time AMD get them on the market it won't be there anymore?
#7
Assimilator
Weird AMD, why didn't you show us H100 running with AMD CPUs? And why did you test with H100 when B200 is available? It's almost like you're trying to skew this to make you look better... AGAIN.
W1zzard wrote: In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"
Oh boy...
#8
TheToi
I wonder why they use llama 2 on their benchmark, llama 3 was released a moment ago already and since a month we are at llama 3.1
#9
ncrs
TheToi wrote: I wonder why they use llama 2 on their benchmark, llama 3 was released a moment ago already and since a month we are at llama 3.1
It's because they are using the MLPerf Inference benchmark suite which specifies certain models at locked versions for reproducibility.
#10
AnotherReader
W1zzard wrote: I don't think any FP4 is better than the other one? but still, having hardware support for it can be useful.

In the press call for this news I asked AMD about Block Float 16 support on MI300X, but they acted like I asked for something else and answered it in a general way, which to me seems they evaded the question, which means "not supported"
The ISA reference for MI300 includes instructions that operate on BF16 data.

#11
evernessince
mb194dc wrote: Nvidia probably already sold enough ML hardware for the next 10 years or even longer. Given the lack of really decent use cases and fundamental flaws with the technology.

By the time AMD get them on the market it won't be there anymore?
AI is used in the engineering, medical, and artistic fields and is already indispensable to them. TSMC and its customers themselves use AI to improve photo-lithography masks, and chip design is aided by AI.

The AI bubble may "pop" at some point similar to the dotcom bubble but what's left behind will still be significant just the same as the dotcom bubble.
#12
Tomorrow
las wrote: How does it fare vs Blackwell B200 tho? H100 is old news at this point
From Nvidia's own benchmarks the difference is like 20k vs 30k but B200 also uses 1000W compared to 700W for H100 (that MI300X matches according to Nvidia's slides) and 750W for MI300X itself.
Assimilator wrote: Weird AMD, why didn't you show us H100 running with AMD CPUs? And why did you test with H100 when B200 is available? It's almost like you're trying to skew this to make you look better... AGAIN.
But, but why did Nvidia in their benchmarks use Xeon and not Epyc?
Could it be that they're NOT obliged to use competitors hardware, just like AMD?
It makes sense for AMD to test with their own CPU if they have the solution.
It's the same reason B200 has ARM and Nvidia fused together. Not Xeon and Nvidia.
#15
Patriot
las wrote: How does it fare vs Blackwell B200 tho? H100 is old news at this point
B100/200 should be faster than mi300... but this is old news to old news. Mi300 has been being deployed into el Capitan since june'23. Mi325x will be going against the b100/200 and should both show up this fall. I still expect b100/200 to win on fp4 inference workloads but mi325x will still be competitive overall given how much faster the mi300 was. Also Nvidia essentially gave up competing on FP64 workloads.
#16
W1zzard
AnotherReader wrote: Thanks for explaining the difference between these three formats. I believe you're correct about block float 16 being unsupported; there are no references to it in the MI300's ISA documentation.
It is quite exotic, but has interesting properties, and it was also an opportunity for AMD to talk more about formats, relevance, maybe some other innovations they've added .. but nope
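For readers following the format discussion, a minimal Python sketch of how the two ideas differ may help. bfloat16 (BF16) gives every value its own 8-bit exponent and keeps the top 16 bits of a float32, while a block floating point format stores one shared exponent per block of values plus a small integer mantissa for each. The block contents, the 8-bit mantissa width, and the rounding choices below are illustrative assumptions and do not describe the exact layout of any AMD or NVIDIA hardware format.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (1 sign, 8 exponent,
    7 mantissa bits). This truncates; real hardware usually rounds."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def block_float_quantize(values, mantissa_bits=8):
    """Quantize a block to one shared exponent plus signed integer mantissas.
    mantissa_bits=8 is an assumption chosen for illustration."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    _, exp = math.frexp(max_abs)  # max_abs = m * 2**exp with 0.5 <= m < 1
    shared_exp = exp - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas


def block_float_dequantize(shared_exp, mantissas):
    """Reconstruct approximate values from the shared exponent and mantissas."""
    return [m * 2.0 ** shared_exp for m in mantissas]


print(to_bfloat16(3.14159))  # per-value exponent, short mantissa
shared_exp, mantissas = block_float_quantize([1.0, 0.5, 0.25, 0.004])
print(shared_exp, mantissas)
print(block_float_dequantize(shared_exp, mantissas))
```

In this sketch the smallest value in the block (0.004) quantizes to zero because the shared exponent is set by the largest value. That is the basic trade-off of block formats: cheaper storage and arithmetic, at the cost of precision for small values that share a block with large ones.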
#17
Minus Infinity
las wrote: How does it fare vs Blackwell B200 tho? H100 is old news at this point
AMD will have MI350 to compete against those soon enough. Performance to price ratio is far higher though for AMD and Intel.
#18
las
Minus Infinity wrote: AMD will have MI350 to compete against those soon enough. Performance to price ratio is far higher though for AMD and Intel.
Except that companies need a complete solution like Nvidia is providing, not just a GPU that performs well in a cherrypicked benchmark.

This is why AMD bought up ZT Systems for 5 billions, they want to provide a complete solution, right now they are just providing a GPU.

And this is why Nvidia is king of AI. Lets see if AMD gets on the train before it leaves.
#19
Patriot
las wrote: Except that companies need a complete solution like Nvidia is providing, not just a GPU that performs well in a cherrypicked benchmark.

This is why AMD bought up ZT Systems for 5 billions, they want to provide a complete solution, right now they are just providing a GPU.

And this is why Nvidia is king of AI. Lets see if AMD gets on the train before it leaves.
El Capitan my Capitan. Yes nvidia leads in software development and pushing new non-industry standards that lock you into their ecosystem. Thankfully AMD has been fighting back with consortiums and uses standards like OAM so that you can use their gpus' in future systems they develop to compete against the DGX or in any partner system that uses OAM. Ala HPE, Dell, Supermicro... etc etc.

ML Perf is a bit cherry picked, it heavily favors nvidia as they have hundreds of engineers tuning for it, most workloads do not use FP8 or FP4 yet that is what Nvidia pushes. Blackwell decimates these mi300x results and will allegedly be shipping by years end. But again, supertuned. The mi325x will not win in throughput, it is expected to bring a 20-30% perf uplift but has a memory density advantage which will allow it to run more on single gpus and at higher precisions. 288GB HBM3e. Mi350x may be out by years end but is more likely shipping next year, and will bring FP4 support to AMD. I don't see how 'their claim of 35x inference improvement over mi300 will be true but I am guessing it has to do with memory constrained models.

Nvidia is king because they have a trapped ecosystem, but the industry is rebelling. There is very little that you cannot run on AMD mi300x's natively from hugging face. Almost all new models can be run natively without a hipify conversion. The memory advantage AMD has is pretty extreme, to the point that Meta has worked with AMD for day zero support of their insane model sizes.

So, why build a server that supports SXM when NVidia wants to take your customers and sell them DGX's ?
When you can build an OAM server that supports... Intels Gaudi and Max gpus, or AMD gpus or all the banned Chinese accelerators lol.
AMD is on the train, the limit is TSMC fab time. For everyone really.
#20
las
Patriot wrote: El Capitan my Capitan. Yes nvidia leads in software development and pushing new non-industry standards that lock you into their ecosystem. Thankfully AMD has been fighting back with consortiums and uses standards like OAM so that you can use their gpus' in future systems they develop to compete against the DGX or in any partner system that uses OAM. Ala HPE, Dell, Supermicro... etc etc.

ML Perf is a bit cherry picked, it heavily favors nvidia as they have hundreds of engineers tuning for it, most workloads do not use FP8 or FP4 yet that is what Nvidia pushes. Blackwell decimates these mi300x results and will allegedly be shipping by years end. But again, supertuned. The mi325x will not win in throughput, it is expected to bring a 20-30% perf uplift but has a memory density advantage which will allow it to run more on single gpus and at higher precisions. 288GB HBM3e. Mi350x may be out by years end but is more likely shipping next year, and will bring FP4 support to AMD. I don't see how 'their claim of 35x inference improvement over mi300 will be true but I am guessing it has to do with memory constrained models.

Nvidia is king because they have a trapped ecosystem, but the industry is rebelling. There is very little that you cannot run on AMD mi300x's natively from hugging face. Almost all new models can be run natively without a hipify conversion. The memory advantage AMD has is pretty extreme, to the point that Meta has worked with AMD for day zero support of their insane model sizes.

So, why build a server that supports SXM when NVidia wants to take your customers and sell them DGX's ?
When you can build an OAM server that supports... Intels Gaudi and Max gpus, or AMD gpus or all the banned Chinese accelerators lol.
AMD is on the train, the limit is TSMC fab time. For everyone really.
Yeah AMD likes to play the good guy, till they don't.

Nvidia is king because they deliver what companies actually look for. AMD don't, they just provide a GPU, with no CUDA support as Nvidia invented that. AMD has AI GPUs on paper but in reality, Nvidia stands for 90% of AI GPU shipments.

If AMD were actually competitive in AI, their valuation would have exploded like Nvidias.
#21
Patriot
las wrote: Yeah AMD likes to play the good guy, till they don't.

Nvidia is king because they deliver what companies actually look for. AMD don't, they just provide a GPU, with no CUDA support as Nvidia invented that. AMD has AI GPUs on paper but in reality, Nvidia stands for 90% of AI GPU shipments.

If AMD were actually competitive in AI, their valuation would have exploded like Nvidias.
This may shock you, but you don't need cuda to run workloads on a gpu. I snuck a little joke in the first line and it cleared the treetops it was so far over your head. El-capitan is set to be the first 2+ exaflop supercomputer running on mi300A apus. The current top supercomputer is frontier on mi250x's. AMD is selling as many as they can make, the limit is TSMC not demand. In the past few years there has been a shift to hardware agnostic software, rather than cuda first; for those that still put cuda first, hipify exists to convert the code.
#22
las
Patriot wrote: This may shock you, but you don't need cuda to run workloads on a gpu. I snuck a little joke in the first line and it cleared the treetops it was so far over your head. El-capitan is set to be the first 2+ exaflop supercomputer running on mi300A apus. The current top supercomputer is frontier on mi250x's. AMD is selling as many as they can make, the limit is TSMC not demand. In the past few years there has been a shift to hardware agnostic software, rather than cuda first; for those that still put cuda first, hipify exists to convert the code.
Keep believing that, meanwhile Nvidia sits on 98% of the AI market

Lets see if AMD releases something good before AI hype dies out

If AMD actually had something truly competitive in the AI and Enterprise market, their stock value would reflect it - Hint: Look at Nvidia stock

www.techspot.com/news/104505-amd-admits-instinct-mi300x-ai-accelerator-cant-beat.html

Even AMD know they are way behind and H100 is old news