Tuesday, March 19th 2024

Unwrapping the NVIDIA B200 and GB200 AI GPU Announcements

NVIDIA on Monday, at its 2024 GTC conference, unveiled the "Blackwell" B200 and GB200 AI GPUs. These are designed to offer an incredible 5X the AI inferencing performance of the current-gen "Hopper" H100, and come with four times the on-package memory. The B200 "Blackwell" is the largest chip physically possible with existing foundry tech, according to its makers. The package packs an astonishing 208 billion transistors, and is made up of two chiplets that are themselves the largest chips the process can produce.

Each chiplet is built on the TSMC N4P foundry node, the most advanced 4 nm-class node from the Taiwanese foundry, and packs 104 billion transistors. The two chiplets are tightly connected to each other through a 10 TB/s custom interconnect, which offers enough bandwidth, at low enough latency, for the two to maintain cache coherence (i.e., address each other's memory as if it were their own). Each of the two "Blackwell" chiplets has a 4096-bit memory bus and is wired to 96 GB of HBM3E spread across four 24 GB stacks, for a total of 192 GB on the B200 package. The GPU has a staggering 8 TB/s of memory bandwidth on tap. The B200 package also features a 1.8 TB/s NVLink interface for connectivity to the host and to another B200 chip.
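As a quick sanity check of those memory figures, here is the arithmetic in Python (the per-stack bandwidth is our own derived estimate, not a number NVIDIA has stated):

# Back-of-the-envelope check of the B200 memory configuration described above.
stacks_per_chiplet = 4        # HBM3E stacks per "Blackwell" chiplet
gb_per_stack = 24             # capacity of each stack
chiplets = 2                  # chiplets per B200 package
total_bandwidth_tb_s = 8.0    # package-level figure from NVIDIA

capacity_per_chiplet = stacks_per_chiplet * gb_per_stack               # 96 GB
capacity_per_package = capacity_per_chiplet * chiplets                 # 192 GB
bw_per_stack = total_bandwidth_tb_s / (stacks_per_chiplet * chiplets)  # ~1 TB/s, estimated
print(capacity_per_chiplet, capacity_per_package, bw_per_stack)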
NVIDIA also announced the Grace-Blackwell GB200 Superchip. This is a module that wires two B200 GPUs to an NVIDIA Grace CPU, which offers better serial processing performance than x86-64-based CPUs from Intel or AMD, and an ISA that's highly optimized for NVIDIA's AI GPUs. The biggest advantage of the Grace CPU over an Intel Xeon Scalable or AMD EPYC is its higher-bandwidth NVLink interconnect to the GPUs, compared to the PCIe connections used by x86-64 hosts. NVIDIA appears to be carrying over the Grace CPU from the GH200 Grace-Hopper Superchip.
NVIDIA did not disclose counts for the various SIMD components, such as streaming multiprocessors per chiplet, CUDA cores, or Tensor cores, nor on-die cache sizes, but it did make performance claims. Each B200 chip provides 20 PFLOPs (that's 20,000 TFLOPs) of AI inferencing performance. "Blackwell" introduces NVIDIA's 2nd-generation Transformer Engine and 5th-generation Tensor cores, which add support for the FP4 and FP6 data formats. The 5th-generation NVLink interface not only scales up within a node, but also scales out across up to 576 GPUs. Among NVIDIA's performance claims for the GB200 are 20 PFLOPs FP4 Tensor (dense) and 40 PFLOPs FP4 Tensor (sparse); 10 PFLOPs FP8 Tensor (dense) and 20 PFLOPs FP8 Tensor (sparse); 5 PFLOPs FP16/BF16 Tensor (dense) and 10 PFLOPs (sparse); and 2.5 PFLOPs TF32 Tensor (dense) and 5 PFLOPs (sparse). As a high-precision compute accelerator (FP64), the B200 provides 90 TFLOPs, a 3x increase over the GH200 "Hopper."
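Those Tensor throughput claims follow a simple pattern: each step down in precision doubles throughput, and structured sparsity doubles it again. A quick illustrative calculation in Python (our own arithmetic restating the figures above, not additional NVIDIA data):

# Illustrative only: dense/sparse Tensor throughput implied by the quoted figures,
# assuming 2x per precision step down and 2x for structured sparsity.
dense_pflops = 2.5  # TF32 Tensor, dense
for precision in ("TF32", "FP16/BF16", "FP8", "FP4"):
    print(f"{precision:9s}  dense {dense_pflops:5.1f} PFLOPs   sparse {dense_pflops * 2:5.1f} PFLOPs")
    dense_pflops *= 2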
NVIDIA is expected to ship the B100, B200, and GB200, along with first-party derivatives such as SuperPODs, later this year.
Source: Tom's Hardware

27 Comments on Unwrapping the NVIDIA B200 and GB200 AI GPU Announcements

#26
JoeTheDestroyer
lemonadesoda said: FP4! I dread to think about the type 1 and type 2 errors that can occur with ultra-low precision nibble Artificial Inference. It is such a blunt tool. If it's a nail, it will work. If it's a screw, it won't. And will the "users" of the AI output have any clue
4-bit inference is old hat at this point; it's commonly used to shrink the parameter sets of LLMs small enough to run on client GPUs. The networks are fine-tuned (i.e., re-trained) to operate at this precision precisely to minimize the additional error.
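For illustration, here's a minimal sketch of per-tensor 4-bit weight quantization in Python/NumPy (simple absmax scaling; real schemes such as GPTQ or NF4 are group-wise and calibrated, so treat this as a toy example):

import numpy as np

def quantize_int4(weights):
    """Absmax-quantize a float weight tensor to signed 4-bit integers (-8..7)."""
    scale = np.abs(weights).max() / 7.0  # map the largest magnitude onto 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize_int4(q, s)).max())  # small, but nonzero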

A recent paper making waves proposes "1.58-bit" inference (i.e., single-digit ternary arithmetic).
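That's presumably the BitNet b1.58 work, where each weight is constrained to {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A rough sketch of that style of weight rounding (simplified absmean scaling only, not the paper's full training recipe):

import numpy as np

def ternarize(weights):
    """Round weights to {-1, 0, +1} with a per-tensor absmean scale (BitNet b1.58 style)."""
    scale = np.abs(weights).mean() + 1e-8
    q = np.clip(np.round(weights / scale), -1, 1)
    return q, scale

w = np.random.randn(3, 3)
q, s = ternarize(w)
print(q)      # entries are only -1, 0, or +1
print(q * s)  # dequantized approximation of w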
Posted on Reply
#27
Tropick
Tomorrow said: The biggest surprise is the use of the N4P node. I thought for sure Nvidia was going to use 3 nm by now, at least for these chips costing 20k+.
This does not bode well for the RTX 5000 series. I very much doubt those will use 3 nm either.
I don't know, I think they could still get some sizable gains out of reworking the architecture alone. The 780 Ti and 980 Ti shared the same TSMC 28 nm node, but GM200 easily bested GK110B. This launch could still be disappointing; however, there is precedent for Jensen's team pulling a rabbit out of their collective hat while using the same lithography.

[images: 780 Ti and 980 Ti]
Posted on Reply