Thursday, November 14th 2024
NVIDIA B200 "Blackwell" Records 2.2x Performance Improvement Over its "Hopper" Predecessor
We know that NVIDIA's latest "Blackwell" GPUs are fast, but how much faster are they than the previous-generation "Hopper"? According to the latest MLPerf Training v4.1 results, NVIDIA's HGX B200 Blackwell platform delivers up to 2.2x the performance per GPU of its HGX H200 Hopper predecessor. The results, verified by MLCommons, cover large language model (LLM) training. The Blackwell architecture, featuring HBM3e high-bandwidth memory and fifth-generation NVLink interconnect technology, achieved double the performance per GPU for GPT-3 pre-training and a 2.2x boost for Llama 2 70B fine-tuning compared to the previous Hopper generation. Each benchmark system incorporated eight Blackwell GPUs operating at a 1,000 W TDP, connected via NVLink Switch for scale-up.
The network infrastructure used NVIDIA ConnectX-7 SuperNICs and Quantum-2 InfiniBand switches, enabling high-speed node-to-node communication for distributed training workloads. While previous Hopper-based systems required 256 GPUs to optimize performance for the GPT-3 175B benchmark, Blackwell accomplished the same task with just 64 GPUs, leveraging its larger HBM3e memory capacity and bandwidth. One thing to look out for is the upcoming GB200 NVL72 system, which promises even larger gains beyond the 2.2x. It features expanded NVLink domains, higher memory bandwidth, and tight integration with NVIDIA Grace CPUs, complemented by ConnectX-8 SuperNIC and Quantum-X800 switch technologies. With faster switching and better data movement from the Grace-Blackwell integration, we could also see further software optimization from NVIDIA to push the performance envelope.
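For readers wondering how a "per GPU" figure can be derived from time-to-train results, here is a minimal Python sketch. It assumes per-GPU performance is simply the inverse of GPU-minutes (training time multiplied by GPU count), which ignores scaling efficiency across different cluster sizes. The function name and the timings are hypothetical placeholders, not MLPerf submissions.

```python
def per_gpu_speedup(base_minutes, base_gpus, new_minutes, new_gpus):
    """Ratio of per-GPU throughput, treating throughput as 1 / (time * GPU count).

    This is a simplifying assumption: it does not account for how well a
    workload scales when the GPU count changes.
    """
    base_gpu_minutes = base_minutes * base_gpus
    new_gpu_minutes = new_minutes * new_gpus
    return base_gpu_minutes / new_gpu_minutes


# Hypothetical timings for illustration only (not MLPerf results):
# the same job on 8 GPUs finishing in half the time.
speedup = per_gpu_speedup(base_minutes=100.0, base_gpus=8,
                          new_minutes=50.0, new_gpus=8)
print(f"{speedup:.2f}x per GPU")
```

Normalizing by GPU-minutes is what allows runs at different scales to be compared at all, but the comparison is cleanest when both systems use the same number of GPUs.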
Sources:
MLCommons, via NVIDIA
18 Comments on NVIDIA B200 "Blackwell" Records 2.2x Performance Improvement Over its "Hopper" Predecessor
I bet the MT65002 will be great!
Everything is going to be about the GPUs when it comes to Nvidia. It's the nature of the thing, when you're a GPU company. In this instance there's about 2 degrees to cover. Let's make it like that other game involving Kevin Bacon.
1) Nvidia produces a new Blackwell based A.I. accelerator.
2) The A.I. accelerator is run on the same lines as their other products.
3) The production of the A.I. accelerator is higher margin, and thus will decrease the amount of GPUs on the market.
Two leaps to get from an announced (presumably commercial or educational use) product to its direct impact on the cost of consumer GPUs. Oh, and scalping is a thing...right now the countries around China are scalping for them...and you know if scalpers get caught there is a penalty, though the up-side from scalping is huge profits and an artificially inflated cost for things that are knock-on or related. In this case scalping the Nvidia A.I. accelerators will drive people who cannot afford them to buy GPUs...which will price out consumers. Cool. That might be one jump.
In short, the price of tea in China does influence the price of tea in India. It's impossible to cross fingers and wish away that the things are linked, despite theoretically being in separate realms.
Aside from that, I'm curious to see how much of a die cut the 5090 is going to be from the full GB202. I expect something similar to what we saw with the 4090 w.r.t. the full AD102.
As for pricing/performance, a 3090 is almost as fast as a 4090 for LLM tasks (albeit way less efficient), given that its memory speed is pretty much the same, hence why it's priced similarly to the 4090 in many places.
The 3090(ti) also allows one to use NVLink still, offsetting the bottleneck in PCIe speeds for training models with layer-parallel approaches.
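To put rough numbers on the memory-speed point, here is a small Python sketch. It assumes single-stream token generation is purely memory-bandwidth bound, so every generated token requires reading all model weights once; that is a simplification that ignores compute, KV cache traffic, and batching. The bandwidth figures are the published specs (936 GB/s for the RTX 3090, 1008 GB/s for the RTX 4090), while the 7B-parameter FP16 model and the function name are hypothetical choices for illustration.

```python
GB = 1e9  # bytes per gigabyte (decimal, as used in GPU bandwidth specs)


def decode_tokens_per_second_bound(bandwidth_gb_s, weight_bytes):
    """Upper bound on tokens/s if each token needs one full read of the weights."""
    return bandwidth_gb_s * GB / weight_bytes


# Published memory bandwidth specs in GB/s.
RTX_3090_BANDWIDTH = 936.0
RTX_4090_BANDWIDTH = 1008.0

# Hypothetical model: 7 billion parameters stored as 2-byte FP16 values.
weights = 7e9 * 2

bound_3090 = decode_tokens_per_second_bound(RTX_3090_BANDWIDTH, weights)
bound_4090 = decode_tokens_per_second_bound(RTX_4090_BANDWIDTH, weights)

print(f"3090 bound: {bound_3090:.1f} tokens/s")
print(f"4090 bound: {bound_4090:.1f} tokens/s")
print(f"4090 advantage: {bound_4090 / bound_3090:.2f}x")
```

Under this assumption the gap between the two cards is set by the bandwidth ratio alone, regardless of model size, which is under 8% here. Compute-heavy work such as prompt processing or training would favor the 4090 by more.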
Ya, I'll pass. Sorry Nvidia, you care nothing about gamers and all about ripping people off.
I would pay maybe up to 300 dollars for a video card, but not 2500.00.