Thursday, August 29th 2024
AMD MI300X Accelerators are Competitive with NVIDIA H100, Crunch MLPerf Inference v4.1
The MLCommons consortium on Wednesday posted MLPerf Inference v4.1 benchmark results for popular AI inferencing accelerators available in the market, across brands that include NVIDIA, AMD, and Intel. AMD's Instinct MI300X accelerators emerged competitive with NVIDIA's "Hopper" H100 series AI GPUs. AMD also used the opportunity to showcase the kind of AI inferencing performance uplifts customers can expect from its next-generation EPYC "Turin" server processors powering these MI300X machines. "Turin" features "Zen 5" CPU cores sporting a 512-bit FPU datapath, and improved performance in AI-relevant 512-bit SIMD instruction sets such as AVX-512 and VNNI. The MI300X, on the other hand, banks on the strengths of its memory subsystem, FP8 data format support, and efficient KV cache management.
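As a rough illustration of why FP8 support and the memory subsystem matter for inference, here is a back-of-the-envelope KV cache sizing sketch in Python. The model shape parameters follow the published LLaMA2-70B architecture; the batch size and sequence length are arbitrary assumptions for illustration only.

```python
# Back-of-the-envelope KV cache sizing for LLaMA2-70B-style inference.
# Shape parameters follow the published LLaMA2-70B architecture;
# batch size and sequence length are illustrative assumptions.
N_LAYERS = 80      # transformer layers
N_KV_HEADS = 8     # grouped-query attention KV heads
HEAD_DIM = 128     # per-head dimension
SEQ_LEN = 2048     # assumed input + output tokens per sequence
BATCH = 64         # assumed concurrent sequences

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # 2x accounts for the separate K and V tensors in every layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * SEQ_LEN * BATCH * bytes_per_elem

for name, width in [("FP16", 2), ("FP8", 1)]:
    print(f"{name} KV cache: {kv_cache_bytes(width) / 2**30:.0f} GiB")
# FP16 KV cache: 40 GiB
# FP8 KV cache: 20 GiB -- halving the cache frees HBM for weights and larger batches
```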
The MLPerf Inference v4.1 benchmark focused on the 70 billion-parameter LLaMA2-70B model. AMD's submissions included machines featuring the Instinct MI300X, powered by the current EPYC "Genoa" (Zen 4) and next-gen EPYC "Turin" (Zen 5) processors. The GPUs are backed by AMD's ROCm open-source software stack. The benchmark evaluated inference performance using 24,576 Q&A samples from the OpenORCA dataset, with each sample containing up to 1,024 input and output tokens. Two scenarios were assessed: the offline scenario, which focuses on batch processing to maximize throughput in tokens per second, and the server scenario, which simulates real-time queries with strict latency limits (TTFT ≤ 2 seconds, TPOT ≤ 200 ms). This shows the chip's mettle in both high-throughput and low-latency workloads.

AMD's first submission (4.1-0002) is a server featuring 2P EPYC 9374F "Genoa" processors and 8x Instinct MI300X accelerators. Here, the machine clocks 21,028 tokens/sec in the server test, compared to 21,605 tokens/sec scored by an NVIDIA machine combining 8x H100 accelerators with an Intel Xeon processor. In the offline test, the AMD machine scores 23,514 tokens/sec against 24,525 tokens/sec for the NVIDIA+Intel machine. AMD also tested the 8x MI300X with a pair of EPYC "Turin" (Zen 5) processors of comparable core counts, and gained on NVIDIA, with 22,021 server tokens/sec and 24,110 offline tokens/sec. AMD claims it is achieving near-linear performance scaling from 1x MI300X to 8x MI300X, which speaks to AMD's platform I/O and memory management chops.
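For perspective, those figures put the MI300X within a few percent of the H100 system. A quick Python sketch using the tokens/sec numbers quoted above:

```python
# Relative throughput of the AMD submissions vs the NVIDIA H100 result,
# using the tokens/sec figures quoted in the article.
results = {
    # (server tok/s, offline tok/s)
    "8x MI300X + Genoa (4.1-0002)": (21_028, 23_514),
    "8x MI300X + Turin":            (22_021, 24_110),
}
nvidia = (21_605, 24_525)  # 8x H100 + Xeon

for system, (server, offline) in results.items():
    print(f"{system}: "
          f"server {server / nvidia[0] - 1:+.1%}, "
          f"offline {offline / nvidia[1] - 1:+.1%}")
# 8x MI300X + Genoa (4.1-0002): server -2.7%, offline -4.1%
# 8x MI300X + Turin:            server +1.9%, offline -1.7%
```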
AMD's results bode well for future versions of the model, such as LLaMA 3.1 with its gargantuan 405 billion parameters. Here, the MI300X's 192 GB of HBM3 and 5.3 TB/s of memory bandwidth come in really handy, and earned AMD a partnership with Meta to power LLaMA 3.1 405B. An 8x MI300X blade packs 1.5 TB of memory with over 42 TB/s of aggregate memory bandwidth, with Infinity Fabric handling the interconnectivity. A single server is able to accommodate the entire LLaMA 3.1 405B model using the FP16 data type.
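A quick sanity check of that claim in Python, counting weights only (KV cache and activations need additional headroom) and using the capacity figures quoted above:

```python
# Does LLaMA 3.1 405B fit in one 8x MI300X node at FP16? Weights only;
# KV cache and activations need additional headroom on top of this.
PARAMS = 405e9
BYTES_FP16 = 2
NODE_HBM_GB = 8 * 192  # 8 accelerators x 192 GB HBM3 each

weights_gb = PARAMS * BYTES_FP16 / 1e9
print(f"weights: {weights_gb:.0f} GB of {NODE_HBM_GB} GB "
      f"({weights_gb / NODE_HBM_GB:.0%} of node HBM)")
# weights: 810 GB of 1536 GB (53% of node HBM)
```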
22 Comments on AMD MI300X Accelerators are Competitive with NVIDIA H100, Crunch MLPerf Inference v4.1
I have no idea about compute tasks, but I think this is the equivalent of the DLSS advantage in gaming. I was reading that Nvidia says their FP4 is very accurate thanks to their software.
In the press call for this news I asked AMD about Block Float 16 support on the MI300X, but they acted like I had asked something else and answered in general terms. To me that reads as evading the question, which means "not supported".
By the time AMD gets them on the market, it won't be there anymore?
The AI bubble may "pop" at some point, like the dotcom bubble did, but what's left behind will still be significant, just as it was after the dotcom crash.
Could it be that they're NOT obliged to use competitors' hardware, just like AMD?
It makes sense for AMD to test with their own CPU if they have the solution.
It's the same reason the GB200 has an Arm CPU and Nvidia GPUs fused together. Not a Xeon and Nvidia.
This is why AMD bought up ZT Systems for $5 billion: they want to provide a complete solution, whereas right now they are just providing a GPU.
And this is why Nvidia is king of AI. Let's see if AMD gets on the train before it leaves.
MLPerf is a bit cherry-picked; it heavily favors Nvidia, as they have hundreds of engineers tuning for it, and most workloads don't yet use FP8 or FP4, which is what Nvidia pushes. Blackwell decimates these MI300X results and will allegedly be shipping by year's end. But again, supertuned. The MI325X will not win on throughput; it is expected to bring a 20-30% perf uplift, but it has a memory density advantage (288 GB of HBM3e) that will let it run more on a single GPU, and at higher precisions. The MI350X may be out by year's end but is more likely shipping next year, and will bring FP4 support to AMD. I don't see how their claim of a 35x inference improvement over MI300 will be true, but I am guessing it has to do with memory-constrained models.
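A quick back-of-the-envelope on that memory density point, counting weights only and rounding model sizes; purely illustrative:

```python
# Which precisions let a model's weights fit on one 288 GB accelerator?
# Weights only, ignoring KV cache; purely illustrative arithmetic.
HBM_GB = 288
for params_b in (70, 405):
    for name, bytes_per in [("FP16", 2), ("FP8", 1), ("FP4", 0.5)]:
        gb = params_b * bytes_per
        verdict = "fits" if gb <= HBM_GB else "does not fit"
        print(f"{params_b}B @ {name}: {gb:g} GB weights -> {verdict}")
# e.g. 405B @ FP16: 810 GB -> does not fit; 405B @ FP4: 202.5 GB -> fits
```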
Nvidia is king because they have a trapped ecosystem, but the industry is rebelling. There is very little from Hugging Face that you cannot run natively on AMD MI300Xs; almost all new models run natively without a hipify conversion. The memory advantage AMD has is pretty extreme, to the point that Meta has worked with AMD for day-zero support of their insane model sizes.
So, why build a server that supports SXM when Nvidia wants to take your customers and sell them DGXs?
When you can build an OAM server that supports... Intel's Gaudi and Max GPUs, or AMD GPUs, or all the banned Chinese accelerators, lol.
AMD is on the train; the limit is TSMC fab time. For everyone, really.
Nvidia is king because they deliver what companies actually look for. AMD doesn't; they just provide a GPU, with no CUDA support, since Nvidia invented that. AMD has AI GPUs on paper, but in reality Nvidia accounts for 90% of AI GPU shipments.
If AMD were actually competitive in AI, their valuation would have exploded like Nvidia's.
Let's see if AMD releases something good before the AI hype dies out.
If AMD actually had something truly competitive in the AI and enterprise market, their stock value would reflect it. Hint: look at Nvidia's stock.
www.techspot.com/news/104505-amd-admits-instinct-mi300x-ai-accelerator-cant-beat.html
Even AMD knows they are way behind, and the H100 is old news.