Sunday, October 6th 2024
AMD Granite Ridge "Zen 5" Processor Annotated
High-resolution die-shots of the AMD "Zen 5" 8-core CCD were released and annotated by Nemez, Fitzchens Fitz, and HighYieldYT. These provide a detailed view of the silicon and its various components, particularly the new "Zen 5" CPU core with its 512-bit FPU. The "Granite Ridge" package looks similar to "Raphael," with up to two 8-core CPU complex dies (CCDs) depending on the processor model, and a centrally located client I/O die (cIOD). This cIOD is carried over from "Raphael," which minimizes product development costs for AMD, at least for the uncore portion of the processor. The "Zen 5" CCD is built on the TSMC N4P (4 nm) foundry node.
The "Granite Ridge" package places its up to two "Zen 5" CCDs closer together than the "Zen 4" CCDs sit on "Raphael." In the picture above, you can see the pad of the absent CCD behind the solder mask of the fiberglass substrate, close to the present CCD. The CCD contains 8 full-sized "Zen 5" CPU cores, each with 1 MB of L2 cache, and a centrally located 32 MB L3 cache that's shared among all eight cores. The only other components are an SMU (system management unit) and the Infinity Fabric over Package (IFoP) PHYs, which connect the CCD to the cIOD.
Each "Zen 5" CPU core is physically larger than the "Zen 4" core (built on the TSMC N5 process), due to its 512-bit floating point data-path. The core's Vector Engine is pushed to the very edge of the core, which on the CCD corresponds to the edge of the die. FPUs tend to be the hottest components on a CPU core, so this placement makes sense. The innermost component (facing the shared L3 cache) is the 1 MB L2 cache. AMD has doubled the bandwidth and associativity of this L2 cache compared to the one on the "Zen 4" core.
The central region of the "Zen 5" core has the 32 KB L1I cache, 48 KB L1D cache, the Integer Execution engine, and the all important front-end of the processor, with its Instruction Fetch & Decode, the Branch Prediction unit, micro-op cache, and Scheduler.
The 32 MB on-die L3 cache has rows of TSVs (through-silicon vias) that act as provision for stacked 3D V-cache. The 64 MB L3D (L3 cache die) connects with the CCD's ringbus using these TSVs, making the 64 MB 3D V-cache contiguous with the 32 MB on-die L3 cache.
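As a back-of-the-envelope check of the figures above, a minimal sketch of the combined L3 pool a V-Cache-equipped CCD would present (values taken from the article; the per-core share is simply the shared pool divided by eight cores):

```python
# Illustrative arithmetic only, using the cache sizes quoted above.
ON_DIE_L3_MB = 32   # on-die L3, shared by the 8-core CCD
L3D_MB = 64         # stacked 3D V-Cache die (L3D)
CORES = 8

total_l3_mb = ON_DIE_L3_MB + L3D_MB       # contiguous pool seen by the cores
per_core_share_mb = total_l3_mb / CORES   # average share if split evenly

print(total_l3_mb)        # 96
print(per_core_share_mb)  # 12.0
```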
Lastly, there's the client I/O die (cIOD). There's nothing new to report here; the chip is carried over from "Raphael." It is built on the TSMC N6 (6 nm) node. Nearly a third of the die area is taken up by the iGPU and its allied components, such as the media acceleration engine and display engine. The iGPU is based on the RDNA 2 graphics architecture, and has just one workgroup processor (WGP), for two compute units (CU), or 128 stream processors. Other key components on the cIOD are the 28-lane PCIe Gen 5 interface, the two IFoP ports for the CCDs, a fairly large SoC I/O block consisting of USB 3.x and legacy connectivity, and the all-important DDR5 memory controller with its dual-channel (four sub-channel) memory interface.
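The "128 stream processors" figure follows from RDNA 2's shader hierarchy; a quick sketch of the arithmetic (1 WGP = 2 CUs, and 64 stream processors per CU in RDNA 2):

```python
# RDNA 2 shader-count arithmetic behind the iGPU figures above.
WGPS = 1          # workgroup processors on the cIOD iGPU
CUS_PER_WGP = 2   # each RDNA 2 WGP contains two compute units
SPS_PER_CU = 64   # stream processors per CU in RDNA 2

cus = WGPS * CUS_PER_WGP
stream_processors = cus * SPS_PER_CU

print(cus)                # 2
print(stream_processors)  # 128
```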
Sources:
Nemez (GNR overview), Nemez (annotations), Fitzchens Fitz (die-shots), High Yield (YouTube)
43 Comments on AMD Granite Ridge "Zen 5" Processor Annotated
So, if they stick with AM5 for Zen 6, they might develop a new cIOD. Maybe switch to N5, give it an RDNA 3.5 iGPU, faster memory controllers, and maybe even an NPU.
Though if things keep going like this, Zen 6 desktop might well end up getting more than two memory channels if it moves to another socket; nature abhors mobile chips that are significantly more powerful than desktop ones in the same segment, much as it abhors a vacuum. That is a silver lining of the AI boom and mania.
A Zen 6 on AM5 that scales up to DDR5-8000 and faster would do just fine too. So would a new chipset that runs off PCIe 5.0.
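To put the DDR5-8000 suggestion in perspective, a rough sketch of the theoretical peak bandwidth on AM5's 128-bit (dual-channel, four sub-channel) bus; the bus width and decimal-GB convention are my assumptions for illustration:

```python
# Theoretical peak bandwidth for dual-channel DDR5 on AM5.
# Illustrative only: assumes a 128-bit total bus and decimal GB.
TRANSFERS_PER_S_M = 8000   # DDR5-8000 -> 8000 MT/s
BUS_WIDTH_BITS = 128       # 2 x 64-bit channels (4 x 32-bit sub-channels)

bytes_per_transfer = BUS_WIDTH_BITS // 8
peak_gb_s = TRANSFERS_PER_S_M * bytes_per_transfer / 1000

print(peak_gb_s)  # 128.0
```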
It has (to my knowledge) never been implemented, maybe because of the increased costs and the minor performance uplift.
I do agree that there really should be a new chipset next generation. A larger cache also means more latency, both at the cycle level and in the maximum-clock reduction this sort of stacked setup requires. I think anything more would probably be a net penalty for most workloads.
As for the PEG bus, those 16 PCIe Gen 5.0 lanes are not strictly intended for GPU usage only. Other expansion cards benefit from them too, e.g. x8 NVMe RAID cards. Or you may run 2 GPUs without any bus-side bandwidth limit, even for the upcoming few years. Although having 2 GPUs is a luxury nowadays, especially in terms of power (PSU) and space (case) requirements. Not so much. Have a look at the 7800X3D or 5800X3D. Their biggest penalty is not latency but lower clocks (compared to their regular non-X3D counterparts). While those few hundred MHz don't matter much in games, they have a noticeable impact in applications.
Four channels are very unlikely for desktop.
They can introduce a new IOD and upgraded chipset, most probably will.
EDIT: To be more precise in wording, a second GMI link increases the bandwidth from 36 GB/s to 72 GB/s and thus allows more data to flow between the chiplets via IF.
EDIT: Sorry, my bad, I read it as "for inter chipset communication". Everything clear now.
And yes, I've always said that the client processors not using wide GMI3 is a waste of performance where it's most needed (AMD is very sensitive to RAM bandwidth). Especially on single-die models they could use both IFoP links on the IOD and the CCD, and only one per CCD for the dual-die models; after all, the links are already there, BUT it would require a different substrate for each model...
Brilliant stuff.
I welcome this approach if that's what happens.
food for thought
EPYC processors with 4 or fewer chiplets use both GMI links (wide GMI) to increase the bandwidth from 36 GB/s to 72 GB/s (page 11 of the file attached). By analogy, that is the case for Ryzen processors too. On the image below, both wide GMI3 links on both chiplets connect to two GMI ports on IOD, two links (wide GMI) from chiplet 1 to GMI3 port 0 and another two links (wide GMI) from chiplet 2 to GMI port 1 on IOD. We can see four clusters of links.
We do not have a shot of a single-chiplet CPU that exposes the GMI links, but the principle should be the same, i.e. IF bandwidth should be 72 GB/s, like on EPYCs with four or fewer chiplets, and not 36 GB/s.
* from page 11
INTERNAL INFINITY FABRIC INTERFACES connect the I/O die with each CPU die using 36 Gb/s Infinity Fabric links. (This is known internally as the Global Memory Interface [GMI] and is labeled this way in many figures.) In EPYC 9004 and 8004 Series processors with four or fewer CPU dies, two links connect to each CPU die for up to 72 Gb/s of connectivity
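The wide-GMI arithmetic the quote describes can be sketched directly; the function name is mine, and the per-link figure is the one quoted above:

```python
# Sketch of the wide-GMI bandwidth scaling discussed above: with four
# or fewer CCDs, two GMI links per CCD double the IOD<->CCD bandwidth.
GMI_LINK_GB_S = 36  # per-link Infinity Fabric bandwidth quoted above

def ccd_bandwidth(links_per_ccd: int) -> int:
    """Aggregate IF bandwidth for one CCD, in GB/s."""
    return links_per_ccd * GMI_LINK_GB_S

print(ccd_bandwidth(1))  # 36 -- narrow GMI (many-chiplet configs)
print(ccd_bandwidth(2))  # 72 -- wide GMI (four or fewer chiplets)
```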
That's also the probable reason why AMD couldn't move the IOD farther to the edge, and CCDs closer to the centre. There are just too many wires for signals running from the IOD to the contacts at the other side of the substrate. The 28 PCIe lanes take four wires each, for example.
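A quick tally of the signal wires implied by that last point (each PCIe lane is two differential pairs, one TX and one RX, so four data wires per lane; clocks, power, and ground are not counted):

```python
# Substrate data-wire count for the cIOD's PCIe lanes, per the
# "four wires each" figure above. Illustrative tally only.
PCIE_LANES = 28
WIRES_PER_LANE = 4  # TX+/TX- and RX+/RX- differential pairs

data_wires = PCIE_LANES * WIRES_PER_LANE
print(data_wires)  # 112
```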