
AMD Releases its CDNA2 MI250X "Aldebaran" HPC GPU Block Diagram

btarunr

Editor & Senior Moderator
Staff member
AMD in its Hot Chips 34 (2022) presentation released a block diagram of its biggest AI-HPC processor, the Instinct MI250X. Based on the CDNA2 compute architecture, at the heart of the MI250X is the "Aldebaran" MCM (multi-chip module). The MCM contains two logic (GPU) dies and eight HBM2E stacks, four per GPU die. The two GPU dies are connected by a 400 GB/s Infinity Fabric link. Each die also has up to 500 GB/s of external Infinity Fabric bandwidth for inter-socket communication, plus PCI-Express 4.0 x16 as the host system bus in add-in-card form factors. The two GPU dies together comprise 58 billion transistors and are fabricated on the TSMC N6 (6 nm) node.

The component hierarchy of each GPU die sees eight Shader Engines sharing a last-level L2 cache. The eight Shader Engines total 112 Compute Units, or 14 CUs per engine. Each CDNA2 compute unit contains 64 stream processors making up the shader core, plus four Matrix Core Units, which are specialized hardware for matrix/tensor math operations. That works out to 7,168 stream processors per GPU die, or 14,336 per package. AMD claims a 100% increase in double-precision compute performance over CDNA (MI100), attributing it to higher frequencies, more efficient data paths, extensive operand reuse and forwarding, and power optimizations that enable those higher clocks. The MI200 is already powering the Frontier supercomputer, and AMD is working toward more design wins in the HPC space. The company also dropped a major hint that the MI300, based on CDNA3, will be an APU: it will incorporate GPU dies, core logic, and CPU CCDs onto a single package, a rival solution to NVIDIA's Grace Hopper Superchip.
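The shader-core arithmetic in the article can be checked directly; the counts below are taken from the text, while the variable names are mine:

```python
# CDNA2 "Aldebaran" shader-core counts, per the article
shader_engines_per_die = 8
cus_per_engine = 14          # 112 CUs / 8 Shader Engines
sps_per_cu = 64              # stream processors per Compute Unit
gpu_dies_per_package = 2

cus_per_die = shader_engines_per_die * cus_per_engine   # 112
sps_per_die = cus_per_die * sps_per_cu                  # 7,168
sps_per_package = sps_per_die * gpu_dies_per_package    # 14,336

print(cus_per_die, sps_per_die, sps_per_package)  # 112 7168 14336
```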



 
AMD: We are better
Nvidia: No we are!!
:nutkick::laugh:
 
In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
 
In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
That 400 GB/s link could be tuned for lower power/latency. But yes, that is strange. It's probably also why the package is still recognized as two independent chips and not a single chip (among other things, like the scheduler).

Also, some features enabled on that 400 GB/s link could require additional bandwidth for control. Still, they will have to improve that in the future, because Apple and IBM have far better die-to-die interfaces than AMD right now.

Double that bandwidth (or half the HBM bandwidth per die) would have made more sense. From initial benchmarks lying around the internet, these chips are super fast when your code can run independently on each tile, but performance starts to collapse if you need die-to-die access.
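That collapse can be sketched with a toy harmonic-mean model (my own illustration, not AMD data; the per-die HBM figure is an assumed round number): because time per byte mixes local and remote costs, even a modest fraction of die-to-die traffic drags effective bandwidth down toward the slower link.

```python
def effective_bandwidth(local_gbs, remote_gbs, remote_fraction):
    """Toy model: time per byte is a weighted sum of local and
    remote costs, so effective bandwidth is their harmonic mix."""
    time_per_byte = (1 - remote_fraction) / local_gbs + remote_fraction / remote_gbs
    return 1 / time_per_byte

LOCAL = 1600.0   # assumed per-die HBM2E bandwidth, GB/s (illustrative)
REMOTE = 400.0   # in-package Infinity Fabric link, GB/s (from the article)

for f in (0.0, 0.1, 0.5):
    print(f, round(effective_bandwidth(LOCAL, REMOTE, f), 1))
# 0.0 -> 1600.0, 0.1 -> 1230.8, 0.5 -> 640.0
```

With just 10% remote accesses the effective rate already drops by roughly a quarter, which matches the "fast per tile, slow across tiles" behavior described above.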
 
Pretty cool to see Nvidia and AMD going at it in this space (and AMD actually getting some wins), pursuing roughly the same overall design but approaching it from opposite areas of expertise.
 
Looks promising, but those odd IF bandwidth numbers might point to some continuing inter-die latency issues, with higher external fabric speeds to compensate. Either way, very nice to see team red get serious about HPC.
 
Bodes well for RDNA3 which is also MCP and TSMC 6nm

In package Infinity Fabric slower than external? I expected the in package fabric to be faster than the HBM, but then again, it's compute.
HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric's.
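The "wide but slow" point can be made concrete: peak bandwidth is bus width times per-pin data rate, so a very wide bus compensates for modest pin speeds. The per-pin rates below are typical published figures used for illustration, not taken from the article.

```python
def bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak bandwidth in GB/s = pins * per-pin rate / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8

# One HBM2E stack: very wide (1024-bit) bus, modest ~3.2 Gbps pins
hbm2e_stack = bandwidth_gbs(1024, 3.2)   # 409.6 GB/s per stack
# One GDDR6 chip: narrow 32-bit bus, fast 16 Gbps pins
gddr6_chip = bandwidth_gbs(32, 16.0)     # 64.0 GB/s per chip

print(hbm2e_stack, gddr6_chip)  # 409.6 64.0
```

Eight such stacks on the MI250X package would land in the multi-TB/s range, which is why the 400 GB/s die-to-die link looks narrow by comparison.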
 
Bodes well for RDNA3 which is also MCP and TSMC 6nm


HBM2 is exceptionally wide but quite slow, so whilst the bandwidth HBM2 offers is very good, that bandwidth comes mostly from the bus width, meaning that latencies will likely be order(s) of magnitude higher than Infinity Fabric.
RDNA3 GCDs (graphics core dies) are 5 nm, while the cache dies and the IOD are 6 nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs, as many initially believed). Navi 33 is monolithic and on 6 nm...at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250 mm^2 if I remember correctly.
 
RDNA3 GCDs (graphics core dies) are 5 nm, while the cache dies and the IOD are 6 nm, at least on Navi 31 and 32 (which each have their own unique GCD; in other words, Navi 31 is NOT just two Navi 32 GCDs, as many initially believed). Navi 33 is monolithic and on 6 nm...at least according to the most recent, agreed-upon leaks.

The tile structure should allow RDNA3 to be much cheaper to manufacture than Nvidia's monolithic Lovelace. In the latest leaks, the RDNA3 GCDs for Navi 31 and 32 are really small, less than 250 mm^2 if I remember correctly.
RDNA3 for desktop is already much cheaper for AMD to produce than Lovelace is for Nvidia. AMD won't be under any pressure on price; Nvidia will need to slash margins to compete.
 