Sunday, October 6th 2024

AMD Granite Ridge "Zen 5" Processor Annotated

High-resolution die-shots of the AMD "Zen 5" 8-core CCD were released and annotated by Nemez, Fritzchens Fritz, and HighYieldYT. These provide a detailed view of how the silicon and its various components appear, particularly the new "Zen 5" CPU core with its 512-bit FPU. The "Granite Ridge" package looks similar to "Raphael," with up to two 8-core CPU complex dies (CCDs) depending on the processor model, and a centrally located client I/O die (cIOD). This cIOD is carried over from "Raphael," which minimizes product development costs for AMD, at least for the uncore portion of the processor. The "Zen 5" CCD is built on the TSMC N4P (4 nm) foundry node.

The "Granite Ridge" package places its up to two "Zen 5" CCDs closer to each other than the "Zen 4" CCDs sit on "Raphael." In the picture above, you can see the pad for the absent CCD behind the solder mask of the fiberglass substrate, right next to the present CCD. The CCD contains 8 full-sized "Zen 5" CPU cores, each with 1 MB of L2 cache, and a centrally located 32 MB L3 cache that's shared among all eight cores. The only other components are an SMU (system management unit) and the Infinity Fabric over Package (IFoP) PHYs, which connect the CCD to the cIOD.
Each "Zen 5" CPU core is physically larger than the "Zen 4" core (built on the TSMC N5 process), due to its 512-bit floating-point data path. The core's Vector Engine is pushed to the very edge of the core, which on the CCD corresponds to the edge of the die. FPUs tend to be the hottest components of a CPU core, so this placement makes sense. The innermost component (facing the shared L3 cache) is the 1 MB L2 cache. AMD has doubled the bandwidth and associativity of this 1 MB L2 cache compared to the one on the "Zen 4" core.

The central region of the "Zen 5" core has the 32 KB L1I cache, 48 KB L1D cache, the Integer Execution engine, and the all important front-end of the processor, with its Instruction Fetch & Decode, the Branch Prediction unit, micro-op cache, and Scheduler.

The 32 MB on-die L3 cache has rows of TSVs (through-silicon vias) that act as provision for stacked 3D V-cache. The 64 MB L3D (L3 cache die) connects with the CCD's ringbus using these TSVs, making the 64 MB 3D V-cache contiguous with the 32 MB on-die L3 cache.

Lastly, there's the client I/O die (cIOD). There's nothing new to report here; the chip is carried over from "Raphael." It is built on the TSMC N6 (6 nm) node. Nearly a third of the die area is taken up by the iGPU and its allied components, such as the media acceleration engine and display engine. The iGPU is based on the RDNA 2 graphics architecture, and has just one workgroup processor (WGP), for two compute units (CU), or 128 stream processors. Other key components on the cIOD are the 28-lane PCIe Gen 5 interface, the two IFoP ports for the CCDs, a fairly large SoC I/O block consisting of USB 3.x and legacy connectivity, and the all-important DDR5 memory controller with its dual-channel (four sub-channel) memory interface.
Sources: Nemez (GNR overview), Nemez (annotations), Fritzchens Fritz (die-shots), High Yield (YouTube)

43 Comments on AMD Granite Ridge "Zen 5" Processor Annotated

#26
Minus Infinity
btarunrThey created the cIOD so it spares them development costs for the uncore for at least 2 generations (worked for Ryzen 3000 and Ryzen 5000).

So, if they stick with AM5 for Zen 6, they might develop a new cIOD. Maybe switch to N5, give it an RDNA 3.5 iGPU, faster memory controllers, and maybe even an NPU.
Strix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do the same. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.
Posted on Reply
#27
JWNoctis
Minus InfinityStrix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do the same. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.
Strix Halo is arguably closer to a respectably capable GPU that happens to have CPU chiplets hanging off the bus than to a regular CPU. Zen 5 CCD-to-CCD latency is said to have been restored to Zen 4 levels in the latest firmware, though it remains to be seen whether Zen 6 will do better without trade-offs elsewhere.

They could well implement some IBM-like evict-to-other-chiplet virtual-L4-cache scheme, if they could do significantly better than memory latency with that. DRAM latency is only going to get worse.
Posted on Reply
#28
mkppo
Minus InfinityStrix Halo is getting a new cIOD made on 3 nm. One would expect Zen 6 to do the same. Zen 6 is about fixing all the failings of the current design, including high CCD-to-CCD core latency.
CCD-to-CCD latency doesn't really matter, though, and is something you want to avoid anyway; CCD-to-IOD latency does matter. The former is already fixed and back to Zen 4 numbers; the regression turned out to be a simple power-saving behavior, which they have since turned off.

Zen 6 will get a new IOD which will help the IO bottlenecks.
Posted on Reply
#29
Igb
In the image of the article I count 7 cores annotated. Is that… correct?

To me it does not make sense. I don't think I've ever seen a 7-core SKU, and even if they eventually release one, it will surely be an 8-core CCD with one core disabled.
Posted on Reply
#30
AusWolf
IgbIn the image of the article I count 7 cores annotated. Is that… correct?

To me it does not make sense. I don't think I've ever seen a 7-core SKU, and even if they eventually release one, it will surely be an 8-core CCD with one core disabled.
It's 8 cores, with the first core annotated. Here you go:
Posted on Reply
#31
Wirko
mkppoCCD to CCD latency doesn't really matter though and is always something you wish to avoid anyway, but CCD to IOD does matter. The former is already fixed and back to Zen 4 numbers and was a simple power saving thing they turned off.
CCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).
mkppoZen 6 will get a new IOD which will help the IO bottlenecks.
Hopefully. That should be high on AMD's priority list. Another possibility for improvement would be a direct CCD to CCD connection in addition to the existing ones.
Posted on Reply
#32
AusWolf
WirkoCCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).
I assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
Posted on Reply
#33
Wirko
AusWolfI assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
There are no direct CCD-to-CCD links in any of the AMD processors.
Posted on Reply
#34
AusWolf
WirkoThere are no direct CCD-to-CCD links in any of the AMD processors.
So what's this?



As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
Posted on Reply
#35
Wirko
Tek-CheckEPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. On the image below, both wide GMI3 links on both chiplets connect to two GMI ports on IOD, two links (wide GMI) from chiplet 1 to GMI3 port 0 and another two links (wide GMI) from chiplet 2 to GMI port 1 on IOD. We can see four clusters of links.
That IFOP or GMI is still a bit of a mystery, with too little documentation available (some is here). May I ask you to do a tek-check of the data I compiled, calculated and listed here below?

A single (not wide) IFOP (GMI) interface in the Zen 4 architecture has:

- a 32-bit wide bus in each direction (single-ended, 1 wire per bit)
- 3000 MHz default clock in Ryzen CPUs (2250 MHz in Epyc CPUs)
- quad data rate transfers, which calculates to...
- 12 GT/s in Ryzen (9 GT/s in Epyc)
- 48 GB/s per direction in Ryzen (36 GB/s in Epyc)
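The figures in that list are internally consistent; a quick back-of-the-envelope check (a sketch using only the numbers quoted above, nothing from AMD documentation):

```python
# Sanity check of the IFOP/GMI numbers listed in the post above:
# 32-bit bus per direction, quad data rate, 3000 MHz (Ryzen) / 2250 MHz (Epyc).
BUS_WIDTH_BITS = 32      # per direction, single-ended, 1 wire per bit
TRANSFERS_PER_CLOCK = 4  # quad data rate

for cpu, clock_mhz in {"Ryzen": 3000, "Epyc": 2250}.items():
    gt_s = clock_mhz * TRANSFERS_PER_CLOCK / 1000  # transfer rate in GT/s
    gb_s = gt_s * BUS_WIDTH_BITS / 8               # GB/s per direction
    print(f"{cpu}: {gt_s:g} GT/s, {gb_s:g} GB/s per direction")
# Ryzen: 12 GT/s, 48 GB/s per direction
# Epyc: 9 GT/s, 36 GB/s per direction
```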
Posted on Reply
#36
L'Eliminateur
AusWolfSo what's this?



As I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
Indeed it is so; AFAIK client Ryzen does not use wide GMI, only one IFOP per die. I haven't seen any documentation that says it uses wide GMI.
Tek-CheckThere are no direct chiplet-to-chiplet interconnects, that is correct. Everything goes through IF/IOD. I should have been more explicit in wording, but replied quickly, on-the-go.

EPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. On the image below, both wide GMI3 links on both chiplets connect to two GMI ports on IOD, two links (wide GMI) from chiplet 1 to GMI3 port 0 and another two links (wide GMI) from chiplet 2 to GMI port 1 on IOD. We can see four clusters of links.

We do not have a shot of a single chiplet CPU that exposes GMI link, but the principle should be the same, aka IF bandwidth should be 72 GB/s, like on EPYCs with four and fewer chiplets, and not 36 GB/s.


* from page 11
INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using 36 Gb/s Infinity Fabric links. (This is known internally as the Global Memory Interface [GMI] and is labeled this way in many figures.) In EPYC 9004 and 8004 Series processors with four or fewer CPU dies, two links connect to each CPU die for up to 72 Gb/s of connectivity
I don't think client Ryzen uses wide GMI; that's reserved for certain EPYC models.
Posted on Reply
#37
Igb
AusWolfIt's 8 cores, with the first core annotated. Here you go:
Failed to see that. Should not check this with no coffee. Thanks!
Posted on Reply
#38
kapone32
WirkoCCD to CCD latency is fixed but still huge. It's really hard to understand where those ~80 ns come from. There must be some very complex switching logic and cache coherency logic on the IOD, or something.
CCD to IOD latency can't be observed directly but CCD to RAM latency seems fine, no issues here (68 ns in AIDA64 on launch day review).

Hopefully. That should be high on AMD's priority list. Another possibility for improvement would be a direct CCD to CCD connection in addition to the existing ones.
How long is 80 nanoseconds? How many nanoseconds are in one second?
Posted on Reply
#39
AusWolf
kapone32How long is 80 nanoseconds? How many nanos are 1 second?
It's 80 billionths of a second; there are a billion nanoseconds in one second. It's apparently enough time for some folks to make breakfast or something.
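The conversion, as a trivial one-liner:

```python
# One second is a billion (10**9) nanoseconds.
NS_PER_SECOND = 1_000_000_000
latency_ns = 80
print(latency_ns / NS_PER_SECOND)  # 8e-08 seconds
```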
Posted on Reply
#40
Tek-Check
AusWolfI assume that the second interconnect on the CCD is only used on Epyc, but not on Ryzen. So, inter-CCD communication on Ryzen is still done via the IO die.
AusWolfAs I understand it, there's one "IFOP PHY" responsible for communication with the IO die, and the other one is inactive on Ryzen. Or is it some kind of double-wide communication bus with the Epyc IO die?
L'Eliminateurindeed it is so, AFAIK client Ryzen does not use wide GMI, only one IFOP per die, i haven't seen any documentation that says they use wide GMI
L'Eliminateuri don't think client ryzen uses wide GMI, that's only reserved for special EPYC
Let's try to get to the bottom of this by focusing on what we know, what is visible on Ryzen die and what could be inferred.




Zen 5 image
Zen 2 image

- first of all, it looks like High Yield provided an old image of Zen 2 communication lanes by overlaying the layer of old lanes onto the Zen 5 photo
- we can clearly see this at the CCD level, as the GMIs were placed in the middle of the CCD, between the two 4-core CCXs, on Zen 2
- so, the left image is not a genuine Zen 5 communication diagram from the video; the real one must have the lanes positioned differently
- with that out of the way, let's move on

- what do you mean by "wide GMI"? Both GMI ports?
- each CCD has two GMI ports, and the IOD has the same two GMI ports. Each GMI port is '9-wide', which means each GMI PHY has nine logic areas
- each of the nine logic areas within a GMI port translates into a PHY that could get one or two communication lanes. This is visible
- what we do not know is whether all those IF lanes are wired to one GMI port only; that is not visible on the image, and topology documentation is scarce
- it'd be great to see the EPYC topology of IF lanes on CPUs that have 4 or fewer CCDs; this would bring us closer to the answer
WirkoThat IFOP or GMI is still a bit of a mystery, with too little documentation available (some is here). May I ask you to do a tek-check of the data I compiled, calculated and listed here below? A single (not wide) IFOP (GMI) interface in the Zen 4 architecture has:

- a 32-bit wide bus in each direction (single-ended, 1 wire per bit)
- 3000 MHz default clock in Ryzen CPUs (2250 MHz in Epyc CPUs)
- quad data rate transfers, which calculates to...
- 12 GT/s in Ryzen (9 GT/s in Epyc)
- 48 GB/s per direction in Ryzen (36 GB/s in Epyc)
- Zen 4 is 32B/cycle read, 16B/cycle write
- more on this here: chipsandcheese.com/p/amds-zen-4-part-3-system-level-stuff-and-igpu
- Gigabyte leak: chipsandcheese.com/p/details-on-the-gigabyte-leak
- read speeds are much faster than write speeds over IF
- read bandwidth does not increase much when two CCDs operate
- write bandwidth almost doubles with two CCDs



Posted on Reply
#41
Wirko
Tek-Checkit'd be great to see the EPYC topology of IF lanes on CPUs that have 4 or fewer CCDs; this would bring us closer to the answer
You mean this slide, or are you looking for something more detailed?


(taken from Tom's)

Also, you've mentioned 36 GB/s and 72 GB/s before, and 9 bits here. It's obvious that the 9 bits include a parity bit, but I don't understand what numbers AMD used to calculate 36 and 72 GB/s - unless those include parity too.
Posted on Reply
#42
Tek-Check
WirkoYou mean this slide, or are you looking for something more detailed?
I have this slide and the entire presentation as a PDF. An ideal image or diagram would be of an EPYC with 4 or fewer CCDs.
I am looking for detailed die shots where we can see the physical wiring, or a diagram that shows it all.
This would allow us to see how they wire the two GMI ports from each CCD.
When AMD says "GMI-Narrow", do they mean one GMI port only? And "GMI-Wide" means both GMI ports? It would make sense.
The next question is whether they wire all nine logic parts of a single GMI port to Infinity Fabric. If so, how many single wires, how many double ones?
I do not know.

Of course, AMD does not want to give up some secrets about their IF sauce and wiring, such as the speed of the fabric itself and how they wire it to CCDs and IOD. This is beyond my pay grade.

What we know from AMD:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link

The answer we need here is how many IF links does one GMI port provide? Is it 9? There are 9 pieces of logic on die per GMI port. Are they all used on PHY level? If 9 links are used, the throughput would be 9x36 Gbps = 324 Gbps = 40.5 GB/s for one CCD, and 648 Gbps = 81 GB/s for "GMI-wide"

Chips&Cheese testing of IF on Ryzen:
3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
This shows the speed of 504 Gbps for one CCD and 616 Gbps for two CCDs.

- if only one GMI port is used on Ryzen and we assumed there are 9 links in each GMI port, this gives us 9x36 Gbps = 324 Gbps = 40.5 GB/s.
- the measured throughput by C&C was 63 GB/s on read speed, so more links would be needed on one CCD to achieve this throughput
- it seems physically impossible to use both GMI ports on Ryzen CCD and connect those to one GMI port on IOD.
- therefore, it could be the case that IF was configured to run faster on Ryzen CPUs
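The arithmetic above can be laid out in a few lines; the 9-links-per-port figure is the open assumption here, not a confirmed spec:

```python
# Sketch of the throughput arithmetic in this post.
# ASSUMPTION: each GMI port carries 9 links (one per visible logic area).
GBPS_PER_LINK = 36
LINKS_PER_PORT = 9

narrow_gbps = LINKS_PER_PORT * GBPS_PER_LINK  # one GMI port ("GMI-Narrow")
wide_gbps = 2 * narrow_gbps                   # both ports ("GMI-Wide")
print(narrow_gbps, narrow_gbps / 8)           # 324 Gbps = 40.5 GB/s
print(wide_gbps, wide_gbps / 8)               # 648 Gbps = 81.0 GB/s

# Chips & Cheese measured ~63 GB/s read throughput on a one-CCD Ryzen:
print(63 * 8)                                 # 504 Gbps
# 63 GB/s exceeds the 40.5 GB/s a single 9-link port would allow, so under
# this assumption either more links are in play or IF runs faster on Ryzen.
```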

How does this sit?
Posted on Reply
#43
Tek-Check
WirkoYou mean this slide, or are you looking for something more detailed?
We now have a confirmation from EPYC Zen 5 file:
"INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using a total of 16 36 Gb/s Infinity Fabric links."

Going back to our previous considerations. What we know from AMD about theoretical bandwidth:
1. CCD on >32C EPYC Zen4 was configured for 36 Gbps throughput per link on one GMI port to IF ("GMI-Narrow"). 36 Gbps = 4.5 GB/s per link
- 16 links x 36 Gbps = 576 Gbps = 72 GB/s
2. CCD on ≤32C EPYC Zen4 was configured for 36x2 Gbps throughput per link on two GMI ports to IF ("GMI-Wide"). 72 Gbps = 9 GB/s per dual link
- 16 links x 72 Gbps = 1152 Gbps = 144 GB/s

3. one CCD on Ryzen Zen4 has throughput of ~63 GB/s towards DDR5-6000 memory via IF; two CCDs ~77 GB/s
So yes, it looks like one GMI port is used.
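With the 16-links-per-CCD figure from the EPYC Zen 5 document, the theoretical numbers above check out as follows (a sketch of the same arithmetic):

```python
# 16 Infinity Fabric links per CCD, per the EPYC Zen 5 document.
LINKS = 16
narrow_gbs = LINKS * 36 / 8  # GMI-Narrow: 36 Gbps per link
wide_gbs = LINKS * 72 / 8    # GMI-Wide: effectively 72 Gbps per link
print(narrow_gbs)            # 72.0 GB/s
print(wide_gbs)              # 144.0 GB/s
# The ~63 GB/s measured on a one-CCD Ryzen fits under the 72 GB/s narrow
# figure, consistent with a single GMI port being in use.
```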
Posted on Reply