Sunday, October 6th 2024
AMD Granite Ridge "Zen 5" Processor Annotated
High-resolution die-shots of the AMD "Zen 5" 8-core CCD were released and annotated by Nemez, Fritzchens Fritz, and HighYieldYT. These provide a detailed view of how the silicon and its various components appear, particularly the new "Zen 5" CPU core with its 512-bit FPU. The "Granite Ridge" package looks similar to "Raphael," with up to two 8-core CPU complex dies (CCDs) depending on the processor model, and a centrally located client I/O die (cIOD). The cIOD is carried over from "Raphael," which minimizes product development costs for AMD, at least for the uncore portion of the processor. The "Zen 5" CCD is built on the TSMC N4P (4 nm) foundry node.
The "Granite Ridge" package sees the up to two "Zen 5" CCDs snuck up closer to each other than the "Zen 4" CCDs on "Raphael." In the picture above, you can see the pad of the absent CCD behind the solder mask of the fiberglass substrate, close to the present CCD. The CCD contains 8 full-sized "Zen 5" CPU cores, each with 1 MB of L2 cache, and a centrally located 32 MB L3 cache that's shared among all eight cores. The only other components are an SMU (system management unit), and the Infinity Fabric over Package (IFoP) PHYs, which connect the CCD to the cIOD.Each "Zen 5" CPU core is physically larger than the "Zen 4" core (built on the TSMC N5 process), due to its 512-bit floating point data-path. The core's Vector Engine is pushed to the very edge of the core. On the CCD, these should be the edges of the die. FPUs tend to be the hottest components on a CPU core, so this makes sense. The innermost component (facing the shared L3 cache) is the 1 MB L2 cache. AMD has doubled the bandwidth and associativity of this 1 MB L2 cache compared to the one on the "Zen 4" core.
The central region of the "Zen 5" core houses the 32 KB L1I cache, the 48 KB L1D cache, the Integer Execution engine, and the all-important front-end of the processor, with its Instruction Fetch & Decode, Branch Prediction unit, micro-op cache, and Scheduler.
The 32 MB on-die L3 cache has rows of TSVs (through-silicon vias) that act as provision for stacked 3D V-cache. The 64 MB L3D (L3 cache die) connects with the CCD's ringbus using these TSVs, making the 64 MB 3D V-cache contiguous with the 32 MB on-die L3 cache.
Lastly, there's the client I/O die (cIOD). There's nothing new to report here; the chip is carried over from "Raphael" and is built on the TSMC N6 (6 nm) node. Nearly a third of the die area is taken up by the iGPU and its allied components, such as the media acceleration engine and the display engine. The iGPU is based on the RDNA 2 graphics architecture and has just one workgroup processor (WGP), for two compute units (CU), or 128 stream processors. Other key components on the cIOD are the 28-lane PCIe Gen 5 interface, the two IFoP ports for the CCDs, a fairly large SoC I/O block consisting of USB 3.x and legacy connectivity, and the all-important DDR5 memory controller with its dual-channel (four sub-channel) memory interface.
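For quick reference, here is a minimal sketch (in Python, with purely illustrative names and structure) summarizing the "Granite Ridge" package topology as described above; all figures come from the article itself.
```python
# Illustrative summary of the "Granite Ridge" package as described in the article.
# The dictionary layout and names are hypothetical; only the figures are from the text.
GRANITE_RIDGE = {
    "ccd": {                      # up to two per package, depending on the model
        "process": "TSMC N4P (4 nm)",
        "cores": 8,               # full-sized "Zen 5" cores
        "l2_per_core_mb": 1,
        "shared_l3_mb": 32,       # TSV rows allow stacking a 64 MB L3D (3D V-Cache)
        "other_blocks": ["SMU", "IFoP PHYs"],
    },
    "ciod": {                     # carried over from "Raphael"
        "process": "TSMC N6 (6 nm)",
        "igpu": {"arch": "RDNA 2", "wgp": 1, "cu": 2, "stream_processors": 128},
        "pcie_gen5_lanes": 28,
        "ifop_ports": 2,
        "memory": "dual-channel DDR5 (four sub-channels)",
    },
}

def total_l3_mb(vcache_stacked: bool) -> int:
    """Total L3 visible to one CCD, with or without the 64 MB V-Cache die."""
    return GRANITE_RIDGE["ccd"]["shared_l3_mb"] + (64 if vcache_stacked else 0)

print(total_l3_mb(vcache_stacked=True))   # 96 MB when the V-Cache die is stacked
```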
Sources:
Nemez (GNR overview), Nemez (annotations), Fritzchens Fritz (die-shots), High Yield (YouTube)
The "Granite Ridge" package sees the up to two "Zen 5" CCDs snuck up closer to each other than the "Zen 4" CCDs on "Raphael." In the picture above, you can see the pad of the absent CCD behind the solder mask of the fiberglass substrate, close to the present CCD. The CCD contains 8 full-sized "Zen 5" CPU cores, each with 1 MB of L2 cache, and a centrally located 32 MB L3 cache that's shared among all eight cores. The only other components are an SMU (system management unit), and the Infinity Fabric over Package (IFoP) PHYs, which connect the CCD to the cIOD.Each "Zen 5" CPU core is physically larger than the "Zen 4" core (built on the TSMC N5 process), due to its 512-bit floating point data-path. The core's Vector Engine is pushed to the very edge of the core. On the CCD, these should be the edges of the die. FPUs tend to be the hottest components on a CPU core, so this makes sense. The innermost component (facing the shared L3 cache) is the 1 MB L2 cache. AMD has doubled the bandwidth and associativity of this 1 MB L2 cache compared to the one on the "Zen 4" core.
The central region of the "Zen 5" core has the 32 KB L1I cache, 48 KB L1D cache, the Integer Execution engine, and the all important front-end of the processor, with its Instruction Fetch & Decode, the Branch Prediction unit, micro-op cache, and Scheduler.
The 32 MB on-die L3 cache has rows of TSVs (through-silicon vias) that act as provision for stacked 3D V-cache. The 64 MB L3D (L3 cache die) connects with the CCD's ringbus using these TSVs, making the 64 MB 3D V-cache contiguous with the 32 MB on-die L3 cache.
Lastly, there's the client I/O die (cIOD). There's nothing new to report here, the chip is carried over from "Raphael." It is built on the TSMC N6 (6 nm) node. Nearly 1/3rd of the die-area is taken up by the iGPU and its allied components, such as the media acceleration engine, and display engine. The iGPU is based on the RDNA 2 graphics architecture, and has just one workgroup processor (WGP), for two compute units (CU), or 128 stream processors. Other key components on the cIOD are the 28-lane PCIe Gen 5 interface, the two IFoP ports for the CCDs, a fairly large SoC I/O consisting of USB 3.x and legacy connectivity, and the all important DDR5 memory controller with its dual-channel (four sub-channel) memory interface.
43 Comments on AMD Granite Ridge "Zen 5" Processor Annotated
They could well implement some IBM-like evict-to-other-chiplet virtual-L4-cache scheme, if they could do significantly better than memory latency with that. DRAM latency is only going to get worse.
Zen 6 will get a new IOD which will help the IO bottlenecks.
For me it does not make sense. I don't think I've ever seen a 7-core SKU, and even if they eventually release one, it will for sure be an 8-core CCD with one core disabled.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review). Hopefully. That should be high on AMD's priority list. Another possibility for improvement would be a direct CCD to CCD connection in addition to the existing ones.
As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
A single (not wide) IFOP (GMI) interface in the Zen 4 architecture has:
- a 32-bit wide bus in each direction (single-ended, 1 wire per bit)
- 3000 MHz default clock in Ryzen CPUs (2250 MHz in Epyc CPUs)
- quad data rate transfers, which works out to...
- 12 GT/s in Ryzen (9 GT/s in Epyc)
- 48 GB/s per direction in Ryzen (36 GB/s in Epyc); the arithmetic is sketched below this list
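To make that arithmetic explicit, here is a minimal sketch (Python, illustrative function name) that reproduces the per-direction figures from the bus width, clock, and quad data rate listed above.
```python
# Per-direction bandwidth of a single Zen 4 IFOP/GMI link, using the figures above.
# The function name and structure are illustrative; only the inputs come from the post.
def gmi_bandwidth_gbs(bus_width_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Raw per-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    transfer_rate_gts = clock_mhz * transfers_per_clock / 1000   # GT/s
    return transfer_rate_gts * bus_width_bits / 8                # GB/s

# Ryzen: 32-bit bus at 3000 MHz, quad data rate -> 12 GT/s -> 48 GB/s per direction
print(gmi_bandwidth_gbs(32, 3000, 4))   # 48.0
# Epyc: same bus at 2250 MHz -> 9 GT/s -> 36 GB/s per direction
print(gmi_bandwidth_gbs(32, 2250, 4))   # 36.0
```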
- first of all, it looks like High Yield provided an old image of Zen 2 communication lanes, overlaying the old lane layer onto the Zen 5 photo
- we can clearly see this at the CCD level, as the GMIs were placed in the middle of the CCD, between the two 4-core CCXs, on Zen 2
- so the left image from the video is not a genuine Zen 5 communication diagram; the real Zen 5 lanes must be positioned differently
- with that out of the way, let's move on
- what do you mean by "wide GMI"? Both GMI ports?
- each CCD has two GMI ports, and the IOD also has the same two GMI ports. Each GMI port is '9-wide', which means each GMI PHY has nine logic areas.
- each of the nine logic areas within a GMI port translates into a PHY that could get one or two communication lanes. This is visible in the die-shots.
- what we do not know is whether all those IF lanes are wired to one GMI port only; this is not visible in the image, and topology documentation is scarce
- it'd be great to see the EPYC topology of IF lanes on CPUs that have 4 or fewer CCDs; this would bring us closer to the answer
- Zen 4 is 32B/cycle read, 16B/cycle write (worked out in the sketch after this list)
- more on this here: chipsandcheese.com/p/amds-zen-4-part-3-system-level-stuff-and-igpu
- Gigabyte leak: chipsandcheese.com/p/details-on-the-gigabyte-leak
- read speeds are much faster than write speeds over IF
- read bandwidth does not increase much when two CCDs operate
- write bandwidth almost doubles with two CCDs
(taken from Tom's)
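A quick back-of-the-envelope check of those per-cycle widths (a sketch: the FCLK value is my assumption, roughly 2000 MHz as commonly paired with DDR5-6000, and is not stated in the post):
```python
# Per-CCD Infinity Fabric bandwidth from the per-cycle widths quoted above.
# ASSUMPTION: FCLK of about 2000 MHz (typical with DDR5-6000); not stated in the post.
FCLK_MHZ = 2000

def fabric_bw_gbs(bytes_per_cycle: int, fclk_mhz: float = FCLK_MHZ) -> float:
    """Bandwidth in GB/s for a given per-cycle width at the fabric clock."""
    return bytes_per_cycle * fclk_mhz / 1000

print(fabric_bw_gbs(32))      # ~64 GB/s read per CCD
print(fabric_bw_gbs(16))      # ~32 GB/s write per CCD -> matches reads being much faster than writes
print(2 * fabric_bw_gbs(16))  # ~64 GB/s write with two CCDs -> matches write bandwidth nearly doubling
```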
Also, you've mentioned 36 GB/s and 72 GB/s before, and 9 bits here. It's obvious that the 9 bits include a parity bit, but I don't understand what numbers AMD took to calculate 36 and 72 GB/s - unless that includes parity too.
I am looking for those detailed die shots where we can see physical wiring, or a diagram that shows them all.
This would allow us to see how they wire two GMI ports from each CCD.
When AMD says "GMI-Narrow", do they mean one GMI port only? And "GMI-Wide" means both GMI ports? It would make sense.
The next question is whether they wire all nine logic parts of a single GMI port to Infinity Fabric. If so, how many single wires, how many double ones?
I do not know.
Of course, AMD does not want to give up some secrets about their IF sauce and wiring, such as the speed of the fabric itself and how they wire it to CCDs and IOD. This is beyond my pay grade.
What we know from AMD:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link
The answer we need here is: how many IF links does one GMI port provide? Is it 9? There are 9 pieces of logic on the die per GMI port. Are they all used at the PHY level? If 9 links are used, the throughput would be 9x36 Gbps = 324 Gbps = 40.5 GB/s for one CCD, and 648 Gbps = 81 GB/s for "GMI-Wide".
Chips&Cheese testing of IF on Ryzen:
3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
This shows the speed of 504 Gbps for one CCD and 616 Gbps for two CCDs.
- if only one GMI port is used on Ryzen and we assume there are 9 links in each GMI port, this gives us 9x36 Gbps = 324 Gbps = 40.5 GB/s
- the throughput measured by C&C was 63 GB/s on reads, so more links would be needed on one CCD to achieve that
- it seems physically impossible to use both GMI ports on a Ryzen CCD and connect them to one GMI port on the IOD
- therefore, it could be the case that IF was configured to run faster on Ryzen CPUs (a rough check of that hypothesis is sketched below)
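As a rough check of that hypothesis (a sketch: the 9-links-per-port count is the assumption made above, not a confirmed figure), the measured read throughput implies a per-link speed well above the EPYC figure:
```python
# Implied per-link speed on Ryzen if only one GMI port with 9 links carried the traffic.
# ASSUMPTION: 9 links per GMI port, as hypothesized above; 63 GB/s is the C&C measurement.
measured_read_gbs = 63                      # GB/s, one CCD, Chips&Cheese
links_assumed = 9                           # hypothetical link count of a single GMI port

per_link_gbps = measured_read_gbs * 8 / links_assumed
print(per_link_gbps)                        # 56.0 Gbps per link, vs. the 36 Gbps EPYC figure
```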
How does this sit?
"INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using a total of 16 36 Gb/s Infinity Fabric links."
Going back to our previous considerations. What we know from AMD about theoretical bandwidth:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
- 16 links x 36 Gbps = 576 Gbps = 72 GB/s
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link
- 16 links x 72 Gbps = 1152 Gbps = 144 GB/s
3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
So yes, it looks like one GMI port is used.
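Putting the quoted 16-link figure next to the measured numbers (a sketch; only the inputs are taken from the thread, the comparison is plain arithmetic):
```python
# Compare AMD's quoted per-CCD link configuration with the measured throughput above.
links_per_ccd = 16                  # "16 36 Gb/s Infinity Fabric links" per CPU die (AMD)
link_speed_gbps = 36

narrow_gbs = links_per_ccd * link_speed_gbps / 8
print(narrow_gbs)                   # 72.0 GB/s per CCD ("GMI-Narrow")
print(2 * narrow_gbs)               # 144.0 GB/s if both GMI ports were ganged ("GMI-Wide")

measured_one_ccd_gbs = 63           # Chips&Cheese, Ryzen Zen 4 with DDR5-6000
# The measured 63 GB/s fits under the 72 GB/s of a single port, so one GMI port suffices.
print(measured_one_ccd_gbs <= narrow_gbs)   # True
```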