Thursday, July 1st 2021
AMD CDNA2 "Aldebaran" MI200 HPC Accelerator with 256 CU (16,384 cores) Imagined
AMD Instinct MI200 will be an important product for the company in the HPC and AI supercomputing market. It debuts the CDNA2 compute architecture, and is based on a multi-chip module (MCM) codenamed "Aldebaran." PC enthusiast Locuza, who conjures highly detailed architecture based on public information, imagined what "Aldebaran" could look like. The MCM contains two logic dies, and eight HBM2E stacks. Each of the two dies has a 4096-bit HBM2E interface, which talks to 64 GB of memory (128 GB per package). A silicon interposer provides microscopic wiring among the ten dies.
Each of the two logic dies, or chiplets, has sixteen shader engines that have 16 compute units (CU), each. The CDNA2 compute unit is capable of full-rate FP64, packed FP32 math, and Matrix Engines V2 (fixed function hardware for matrix multiplication, accelerating DNN building, training, and AI inference). With 128 CUs per chiplet, assuming the CDNA2 CU has 64 stream processors, one arrives at 8,192 SP. Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon. Each die further features its independent PCIe interface, and XGMI (AMD's rival to CXL), an interconnect designed for high-density HPC scenarios. A rudimentary VCN (Video CoreNext) component is also present. It's important to note here, that the CDNA2 CU, as well as the "Aldebaran" MCM itself, doesn't have a dual-use as a GPU, since it lacks much of the hardware needed for graphics processing. The MI200 is expected to launch later this year.
Source:
Locuza_ (Twitter)
Each of the two logic dies, or chiplets, has sixteen shader engines that have 16 compute units (CU), each. The CDNA2 compute unit is capable of full-rate FP64, packed FP32 math, and Matrix Engines V2 (fixed function hardware for matrix multiplication, accelerating DNN building, training, and AI inference). With 128 CUs per chiplet, assuming the CDNA2 CU has 64 stream processors, one arrives at 8,192 SP. Two such dies add up to a whopping 16,384, more than three times that of the "Navi 21" RDNA2 silicon. Each die further features its independent PCIe interface, and XGMI (AMD's rival to CXL), an interconnect designed for high-density HPC scenarios. A rudimentary VCN (Video CoreNext) component is also present. It's important to note here, that the CDNA2 CU, as well as the "Aldebaran" MCM itself, doesn't have a dual-use as a GPU, since it lacks much of the hardware needed for graphics processing. The MI200 is expected to launch later this year.
9 Comments on AMD CDNA2 "Aldebaran" MI200 HPC Accelerator with 256 CU (16,384 cores) Imagined
RDNA had extremely major changes: 32-wide native x 4 ALUs per WGP which executes 128 threads every 1 clock tick. In RDNA terms, they call the WGP a "dual-compute unit", because 128-threads per RDNA clock tick is kinda-sorta like 2x256-threads every 4 CDNA clock ticks.
--------
RDNA2 also has 1024 x 32-bit registers per ALU. CDNA only has 256 x 32bit registers per hardware thread (but given the 4x clock ticks for 4x different threads: its kinda-sorta like having 1024 registers across 4 different threads). There are similarities between the two because they're both made by AMD, but... the differences are quite striking and will probably lead to major performance differences between the two platforms.
RDNA2 quite possibly is faster in some scenarios, while CDNA is faster in other scenarios. Its really difficult to compare the two on any microarchitectural level. AMD really did make a huge number of changes.
www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/
www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf
between the two sources the two state some things are clearly essentially the same as classical GCN but every element has been considered cut or improved and they are vague on some areas, I am only suggesting some minor updates could have been done with them having to retouch everything anyway.
cheers though more info is always nice
CDNA 1.0 has had its ISA released late last year. Its clear that AMD believes that GCN (16 SIMD-lanes x 4 clock ticks x 4 per compute unit) is a worthwhile architecture (even if RDNA / Graphics require something with lower latency). The ISA doc only has basic information on performance, and the ISA itself is almost identical to GCN documents from the past.
CDNA has new matrix multiplication instructions, and that's about it. Otherwise, its programming model is much more like GCN than RDNA.
MCM is somewhat confirmed, but I don't think any of the other specs are confirmed.