Sunday, October 6th 2024

AMD Granite Ridge "Zen 5" Processor Annotated

High-resolution die-shots of the AMD "Zen 5" 8-core CCD were released and annotated by Nemez, Fritzchens Fritz, and HighYieldYT. They provide a detailed view of the silicon and its various components, particularly the new "Zen 5" CPU core with its 512-bit FPU. The "Granite Ridge" package looks similar to "Raphael," with up to two 8-core CPU complex dies (CCDs) depending on the processor model, and a centrally located client I/O die (cIOD). The cIOD is carried over from "Raphael," which minimizes product development costs for AMD, at least for the uncore portion of the processor. The "Zen 5" CCD is built on the TSMC N4P (4 nm) foundry node.

The "Granite Ridge" package sees the up to two "Zen 5" CCDs snuck up closer to each other than the "Zen 4" CCDs on "Raphael." In the picture above, you can see the pad of the absent CCD behind the solder mask of the fiberglass substrate, close to the present CCD. The CCD contains 8 full-sized "Zen 5" CPU cores, each with 1 MB of L2 cache, and a centrally located 32 MB L3 cache that's shared among all eight cores. The only other components are an SMU (system management unit), and the Infinity Fabric over Package (IFoP) PHYs, which connect the CCD to the cIOD.
Each "Zen 5" CPU core is physically larger than the "Zen 4" core (built on the TSMC N5 process), due to its 512-bit floating point data-path. The core's Vector Engine is pushed to the very edge of the core. On the CCD, these should be the edges of the die. FPUs tend to be the hottest components on a CPU core, so this makes sense. The innermost component (facing the shared L3 cache) is the 1 MB L2 cache. AMD has doubled the bandwidth and associativity of this 1 MB L2 cache compared to the one on the "Zen 4" core.

The central region of the "Zen 5" core holds the 32 KB L1I cache, the 48 KB L1D cache, the Integer Execution engine, and the all-important front end of the processor, with its Instruction Fetch & Decode, Branch Prediction unit, micro-op cache, and Scheduler.

The 32 MB on-die L3 cache has rows of TSVs (through-silicon vias) that serve as provisions for stacked 3D V-Cache. The 64 MB L3D (L3 cache die) connects to the CCD's ringbus through these TSVs, making the 64 MB 3D V-Cache contiguous with the 32 MB on-die L3 cache.
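A minimal sketch of what that stacking implies for capacity:

```python
# L3 capacity a "Zen 5" CCD presents once a 64 MB V-Cache die is stacked on top.
ON_DIE_L3_MB = 32
VCACHE_L3_MB = 64     # one stacked L3D, as described above
CORES = 8

total_mb = ON_DIE_L3_MB + VCACHE_L3_MB
print(f"{ON_DIE_L3_MB} MB on-die + {VCACHE_L3_MB} MB stacked = {total_mb} MB shared L3")
print(f"That is {total_mb / CORES:.0f} MB of shared L3 per core on an 8-core CCD")
```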

Lastly, there's the client I/O die (cIOD). There's nothing new to report here; the chip is carried over from "Raphael." It is built on the TSMC N6 (6 nm) node. Nearly a third of the die area is taken up by the iGPU and its allied components, such as the media acceleration engine and display engine. The iGPU is based on the RDNA 2 graphics architecture and has just one workgroup processor (WGP), amounting to two compute units (CU), or 128 stream processors. Other key components on the cIOD are the 28-lane PCIe Gen 5 interface, the two IFoP ports for the CCDs, a fairly large SoC I/O block consisting of USB 3.x and legacy connectivity, and the all-important DDR5 memory controller with its dual-channel (four sub-channel) memory interface.
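For a quick sanity check of those cIOD figures, here is a small sketch; the DDR5-5600 transfer rate is an assumed example, while the WGP-to-CU-to-stream-processor ratios are standard RDNA 2 bookkeeping.

```python
# Sanity-checking the cIOD figures above. The DDR5 speed is an assumed example.
wgps, cus_per_wgp, sp_per_cu = 1, 2, 64
print(f"iGPU: {wgps * cus_per_wgp * sp_per_cu} stream processors")        # 128

mt_per_s = 5600                       # assumed DDR5 transfer rate (MT/s)
channels, bytes_per_channel = 2, 8    # two 64-bit channels (four 32-bit sub-channels)
bandwidth_gb_s = mt_per_s * channels * bytes_per_channel / 1000
print(f"DDR5-{mt_per_s}: {bandwidth_gb_s:.1f} GB/s theoretical bandwidth")  # 89.6
```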
Sources: Nemez (GNR overview), Nemez (annotations), Fritzchens Fritz (die-shots), High Yield (YouTube)

43 Comments on AMD Granite Ridge "Zen 5" Processor Annotated

#1
TumbleGeorge
It's time for a new cIOD, a new IMC, new Infinity Fabric, and a faster, wider connection between the cIOD and the south bridge (motherboard chipset).
Posted on Reply
#2
btarunr
Editor & Senior Moderator
TumbleGeorgeIt's time for a new cIOD, a new IMC, new Infinity Fabric, and a faster, wider connection between the cIOD and the south bridge (motherboard chipset).
They created the cIOD so it spares them uncore development costs for at least two generations (it worked for Ryzen 3000 and Ryzen 5000).

So, if they stick with AM5 for Zen 6, they might develop a new cIOD. Maybe switch to N5, give it an RDNA 3.5 iGPU, faster memory controllers, and maybe even an NPU.
Posted on Reply
#3
JWNoctis
Interesting how, according to these, the CCD already has two IFoP PHYs. Presumably enough to saturate the theoretical bandwidth of dual-channel/quad-sub-channel DDR5-8000, if both are run at the current sweet-spot IF frequency.

Though if things keep going like this, Zen 6 desktop might well end up getting more than two memory channels if it gets another socket, as long as nature abhors mobile chips that are significantly more powerful than desktop ones in the same segment the way it abhors a vacuum. That is a silver lining of the AI boom and mania.

A Zen 6 on AM5 that scales up to DDR5-8000 and faster would do just fine too. So would a new chipset that runs off PCIe 5.0.
Posted on Reply
#4
AusWolf
"Magical SRAM of mystery"... Love it! :roll:
Posted on Reply
#5
Quake
The video below shows how AMD optimized Zen 5 and that there could be improvements to X3D processors.
Posted on Reply
#6
LittleBro
Are there really 3x PCIe 5.0 x4? That would mean a Gen 5 CPU-chipset interconnect is already in place on the CPU side.
Posted on Reply
#7
Wirko
LittleBroAre there really 3x PCIe 5.0 x4? That would mean a Gen 5 CPU-chipset interconnect is already in place on the CPU side.
Yes, that has been the case since the beginning. The ports on the IOD are all Gen 5. But at the other end of the wires there's the 14nm/12nm chipset, and we have seen how great this node is at handling Gen 5. (Think of the Phison E26 SSD controller.)
Posted on Reply
#8
demu
I think that the original 3D V-Cache supported up to 5 layers.
To my knowledge it has never been implemented, maybe because of the increased costs and minor performance uplift.
Posted on Reply
#9
JWNoctis
WirkoYes, that has been the case since the beginning. The ports on the IOD are all Gen 5. But at the other end of the wires there's the 14nm/12nm chipset, and we have seen how great this node is at handling Gen 5. (Think of the Phison E26 SSD controller.)
Is that really a node problem or an (admittedly node-dictated) thermal problem, though? To be fair, that x16 PEG bus would mostly never be used to capacity until the next generation of video cards comes out, either.

I do agree that there really should be a new chipset next generation.
demuI think that the original 3D V-Cache supported up to 5 layers.
To my knowledge it has never been implemented, maybe because of the increased costs and minor performance uplift.
Larger cache also means more latency, both at the cycle level and in the maximum-clock reduction this sort of setup imposes. I think anything more would probably be a net penalty for most workloads.
Posted on Reply
#10
Wirko
I'm amazed to see how much information these annotators dig up from who knows where. Sure it's possible to recognise repeating structures such as the L3, and count cores and PCIe PHY logic, and even estimate the number of transistors. But how do you identify the "Scaler Unit" or the "L2 iTLB", or even larger units like "Load/Store" without a lot of inside info? I think there's quite a bit of speculation necessary here (not that it hurts anyone).
JWNoctisInteresting how according to these, the CCD already has two IFoP PHY.
This has been inherited from the Zen 4 CCD too. It's for servers. The server I/O die does not have enough IFOP connections though, so a compromise had to be made:
The I/O die used in all 4th Gen AMD EPYC processors has 12 Infinity Fabric connections to CPU dies. Our CPU dies can support one or two connections to the I/O die. In processor models with four CPU dies, two connections can be used to optimize bandwidth to each CPU die. This is the case for some EPYC 9004 Series CPUs and all EPYC 8004 Series CPUs. In processor models with more than four CPU dies, such as in the EPYC 9004 Series, one Infinity Fabric connection ties each CPU die to the I/O die. - source
demuI think that the original 3D V-cache supported up to 5 layers.
It has (to my knowledge) never been implemented, maybe because the increased costs and minor performance uplift.
TSMC mentioned somewhere (I used to learn that sort of thing from AnandTech, now no more) that their glue, the hybrid bonding technology, could be used to stack more than two dies. Memory manufacturers are planning to use it for HBM4 or maybe even HBM3.
Posted on Reply
#11
LittleBro
JWNoctisIs that really a node problem or an (admittedly node-dictated) thermal problem, though? To be fair, that 16x PEG bus would mostly never be used to capacity until the next generation of video cards come out, either.

I do agree that there really should be a new chipset next generation.
6/7 nm is much better in terms of efficiency and thus also thermals. Do you recall the X570 with active cooling? That was AMD's first PCIe 4.0 x4 chipset, made on 14 nm with around a 12 W TDP. So I believe it's true that a chipset supporting PCIe Gen 5.0 might run into thermal difficulties. Just look at those chunks of metal put onto the X670(E)/B650(E) chipsets to cool 14 W passively. From this point of view, it's better that they didn't release a new chipset for X870(E)/B800 boards. From another point of view, they had the 7 nm process at their disposal and two years to come up with a chipset that supports PCIe Gen 5.0, at least for the CPU-chipset interconnect. Yet across three generations of AMD chipsets (X570, X670(E), X870(E)), we haven't moved anywhere in terms of this interconnect's capabilities. On top of that, we have moved literally nowhere between X670(E) and X870(E).

As for the PEG bus, those 16 PCIe Gen 5.0 lanes are not strictly intended for GPU use only. Other expansion cards benefit from them too, e.g. x8 NVMe RAID cards. Or you could run two GPUs with effectively unconstrained bandwidth (bus-wise) for the next few years. Although, having two GPUs is a luxury nowadays, especially in terms of power (PSU) and space (case) requirements.
JWNoctisLarger cache also means more latency, both at cycle-level and on maximum clock reduction for this sort of setup. I think anything more would probably be a net penalty for most workloads.
Not so much. Have a look at the 7800X3D or 5800X3D. Their biggest penalty is not latency but lower clocks (compared to their regular non-X3D counterparts). While those few hundred MHz don't matter much in games, they have a noticeable impact in applications.
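A quick sketch of how large that clock deficit actually is, using commonly cited maximum boost clocks (treat them as assumptions and double-check against the official spec sheets):

```python
# Rough size of the X3D clock penalty, using commonly cited max boost clocks.
parts = {
    "5800X3D vs 5800X": (4.5, 4.7),   # assumed boost clocks in GHz
    "7800X3D vs 7700X": (5.0, 5.4),
}
for name, (x3d_ghz, regular_ghz) in parts.items():
    deficit_pct = (1 - x3d_ghz / regular_ghz) * 100
    print(f"{name}: {x3d_ghz} vs {regular_ghz} GHz -> ~{deficit_pct:.1f}% lower boost")
# Roughly a 4-8% clock deficit: barely visible in cache-bound games,
# but directly visible in clock-bound applications.
```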
Posted on Reply
#12
demu
WirkoTSMC mentioned somewhere (I used to learn that sort of thing from AnandTech, now no more) that their glue, the hybrid bonding technology, could be used to stack more than two dies. Memory manufacturers are planning to use it for HBM4 or maybe even HBM3.
www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming
Posted on Reply
#13
Tek-Check
JWNoctisInteresting how, according to these, the CCD already has two IFoP PHYs. Presumably enough to saturate the theoretical bandwidth of dual-channel/quad-sub-channel DDR5-8000, if both are run at the current sweet-spot IF frequency.

Though if things keep going like this, Zen 6 desktop might well end up getting more than two memory channels if it gets another socket, as long as nature abhors mobile chips that are significantly more powerful than desktop ones in the same segment the way it abhors a vacuum. That is a silver lining of the AI boom and mania.

A Zen 6 on AM5 that scales up to DDR5-8000 and faster would do just fine too. So would a new chipset that runs off PCIe 5.0.
The second IF port is more for inter-chiplet communication.
Four channels are very unlikely for desktop.
They can introduce a new IOD and an upgraded chipset, and most probably will.

EDIT: To be more precise in wording, the second GMI link increases the bandwidth from 36 GB/s to 72 GB/s and thus allows more data to flow between chiplets via IF.
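Putting rough numbers on that, as a sketch: the GMI figures are the 36/72 GB/s quoted in this thread, and the DDR5 speeds are assumed examples.

```python
# Back-of-the-envelope comparison of per-CCD IF bandwidth against theoretical
# DRAM bandwidth, using the 36/72 GB/s GMI figures mentioned in this thread.
GMI_NARROW_GB_S = 36   # one GMI link
GMI_WIDE_GB_S = 72     # two links ("wide GMI")

def ddr5_gb_s(mt_per_s, channels=2, bytes_per_channel=8):
    """Theoretical bandwidth of a dual-channel (2 x 64-bit) DDR5 configuration."""
    return mt_per_s * channels * bytes_per_channel / 1000

for speed in (6000, 8000):    # assumed memory speeds for illustration
    print(f"DDR5-{speed}: {ddr5_gb_s(speed):.0f} GB/s DRAM vs "
          f"{GMI_NARROW_GB_S}/{GMI_WIDE_GB_S} GB/s narrow/wide GMI")
# By these figures even wide GMI sits below dual-channel DDR5-6000's 96 GB/s,
# so a single CCD cannot see the full theoretical DRAM bandwidth over one narrow link.
```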
Posted on Reply
#14
LittleBro
Tek-CheckThe second IF port is for inter-chiplet communication.
Then, what is that 3rd PCIe Gen 5.0 x4 used for? Two are used for M.2 NVMe, that's pretty clear.

EDIT: Sorry, my bad, I read it as "for inter chipset communication". Everything clear now.
Posted on Reply
#15
Wirko
Tek-CheckThe second IF port is for inter-chiplet communication.
Are you sure about that? I remember one detail from Zen 4 Epyc block diagrams: there are no CCD-to-CCD interconnects in 8- and 12-CCD processors. One port from each CCD remains unused. I was wondering why AMD didn't use the remaining ports for what you're implying. I had to assume the CCDs don't include the switching logic to make use of that.
Posted on Reply
#16
NC37
Great, but at this point AMD has a problem, because consumers are starting to get wise to their 3D cache release cadence. People aren't as interested in the initial release because they know the 3D cache version is coming, which will blow it out of the water. They've been pumping a lot into trying to make the 9000 series seem interesting, but the core issue is still there in people's minds.
Posted on Reply
#17
AusWolf
LittleBroThen, what is that 3rd PCIe Gen 5.0 x4 used for? Two are used for M.2 NVMe, that's pretty clear.
For communication with the chipset.
Posted on Reply
#18
L'Eliminateur
WirkoAre you sure about that? I remember one detail from Zen 4 Epyc block diagrams: there are no CCD-to-CCD interconnects in 8- and 12-CCD processors. One port from each CCD remains unused. I was wondering why AMD didn't use the remaining ports for what you're implying. I had to assume the CCDs don't include the switching logic to make use of that.
Indeed it is so; AMD does NOT have inter-chiplet comms, everything passes through the IOD and IF.

And yes, I've always said that the client processors not using wide GMI3 is a waste of performance where it's most needed (AMD is very sensitive to RAM bandwidth). Especially on single-die models they could use both IFoP links on the IOD and the CCD, and only one per CCD on the dual-die models; after all, the links are already there. BUT it would require a different substrate for each model...
Posted on Reply
#19
CosmicWanderer
I don't know why, but it's still so weird to me that the GPU is in the I/O die and separate from the CCD.

Brilliant stuff.
Posted on Reply
#20
Daven
It looks like changes to the TSVs will allow more and stacked cache. It is possible that AMD will move to an L3-cacheless CCD design with all of it coming from the stacked cache. If they can fit the TSVs into the dense versions, we are looking at a lot of freed up real estate on a dense CCD chiplet. AMD might be abandoning clock speed increases (even resetting them to below 5 GHz much like when Pentium M reset clocks after the Netburst era) and going for high core counts, large/stacked L3 cache sizes and continued IPC increases while maintaining power budgets at the same/current level.

I welcome this approach if that's what happens.
Posted on Reply
#21
L'Eliminateur
DavenIt looks like changes to the TSVs will allow more and stacked cache. It is possible that AMD will move to an L3-cacheless CCD design with all of it coming from the stacked cache. If they can fit the TSVs into the dense versions, we are looking at a lot of freed up real estate on a dense CCD chiplet. AMD might be abandoning clock speed increases (even resetting them to below 5 GHz much like when Pentium M reset clocks after the Netburst era) and going for high core counts, large/stacked L3 cache sizes and continued IPC increases while maintaining power budgets at the same/current level.

I welcome this approach if that's what happens.
Something I've also always wondered about with the chiplet design: they could make what is essentially a full CCD-sized SRAM (or super-fast DRAM) die and place it there as a monster L4/system cache. You could easily fit 512 MB+ in that size. The problem is that it would be connected through IF, which is "slow", and it would need extra IF ports on the IOD...

food for thought
Posted on Reply
#22
thegnome
Very interesting; it seems like they optimized the dies very well this time. The only problem is the aged and slow way they are connected. Why are the two dies so far from each other when it would seem faster and more efficient to place them close together... On TR/EPYC it's acceptable because of the heat and the much more capable I/O die.
Posted on Reply
#23
Tek-Check
WirkoAre you sure about that? I remember one detail from Zen 4 Epyc block diagrams: there are no CCD-to-CCD interconnects in 8- and 12-CCD processors. One port from each CCD remains unused. I was wondering why AMD didn't use the remaining ports for what you're implying. I had to assume the CCDs don't include the switching logic to make use of that.
L'EliminateurIndeed it is so; AMD does NOT have inter-chiplet comms, everything passes through the IOD and IF.

And yes, I've always said that the client processors not using wide GMI3 is a waste of performance where it's most needed (AMD is very sensitive to RAM bandwidth). Especially on single-die models they could use both IFoP links on the IOD and the CCD, and only one per CCD on the dual-die models; after all, the links are already there. BUT it would require a different substrate for each model...
There are no direct chiplet-to-chiplet interconnects, that is correct. Everything goes through IF/IOD. I should have been more explicit in wording, but replied quickly, on-the-go.

EPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. In the image below, the wide GMI3 links of both chiplets connect to the two GMI ports on the IOD: two links (wide GMI) from chiplet 1 to GMI3 port 0, and another two links (wide GMI) from chiplet 2 to GMI port 1. We can see four clusters of links.

We do not have a shot of a single-chiplet CPU that exposes the GMI links, but the principle should be the same, i.e. IF bandwidth should be 72 GB/s, as on EPYCs with four or fewer chiplets, and not 36 GB/s.


* from page 11
INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using 36 Gb/s Infinity Fabric links. (This is known internally as the Global Memory Interface [GMI] and is labeled this way in many figures.) In EPYC 9004 and 8004 Series processors with four or fewer CPU dies, two links connect to each CPU die for up to 72 Gb/s of connectivity
Posted on Reply
#24
tfp
CosmicWandererI don't know why, but it's still so weird to me that the GPU is in the I/O die and separate from the CCD.

Brilliant stuff.
Reminds me of the old north bridges. What is old is new again.
Posted on Reply
#25
Wirko
thegnomeVery interesting; it seems like they optimized the dies very well this time. The only problem is the aged and slow way they are connected. Why are the two dies so far from each other when it would seem faster and more efficient to place them close together... On TR/EPYC it's acceptable because of the heat and the much more capable I/O die.
Density of wires. You can see part of the complexity in the pic that @Tek-Check attached. Here you see one layer, but the wires are spread across several layers.

That's also the probable reason why AMD couldn't move the IOD farther toward the edge and the CCDs closer to the centre. There are just too many signal wires running from the IOD to the contacts on the other side of the substrate. The 28 PCIe lanes take four wires each, for example.
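A tiny illustrative sketch of that wire count, looking at the PCIe lanes alone:

```python
# Crude count of just the PCIe signal wires routed from the cIOD across the
# substrate (power, ground, DDR5 and sideband excluded).
pcie_lanes = 28
wires_per_lane = 4   # one TX differential pair + one RX differential pair
print(f"{pcie_lanes} lanes x {wires_per_lane} = {pcie_lanes * wires_per_lane} signal wires")
# 112 wires for PCIe alone, before counting the DDR5 bus, USB and display outputs.
```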
Posted on Reply