Wednesday, July 24th 2024
AMD Strix Point SoC Reintroduces Dual-CCX CPU, Other Interesting Silicon Details Revealed
Since its reveal last week, we have received a slightly more technical deep-dive from AMD on its two upcoming processors: the "Strix Point" silicon powering its Ryzen AI 300 series mobile processors, and the "Granite Ridge" chiplet MCM powering its Ryzen 9000 series desktop processors. In this article, we take a closer look at the "Strix Point" SoC. It turns out that "Strix Point" takes a significantly different approach to heterogeneous multicore than "Phoenix 2," and AMD gave us a close look at how this works. AMD builds the monolithic "Strix Point" silicon on the TSMC N4P foundry node, with a die area of around 232 mm².
The "Strix Point" silicon sees the company's Infinity Fabric interconnect as its omnipresent ether. This is a point-to-point interconnect, unlike the ringbus on some Intel processors. The main compute machinery on the "Strix Point" SoC are its two CPU compute complexes (CCX), each with a 32b (read)/16b (write) per cycle data-path to the fabric. The concept of CCX makes a comeback with "Strix Point" after nearly two generations of "Zen." The first CCX contains the chip's four full-sized "Zen 5" CPU cores, which share a 16 MB L3 cache among themselves. The second CCX contains the chip's eight "Zen 5c" cores that share a smaller 8 MB L3 cache. Each of the 12 cores has a 1 MB dedicated L2 cache.This approach to heterogeneous multicore is significantly different from "Phoenix 2," where the two "Zen 4" and four "Zen 4c" cores were part of a common CCX, with a common 16 MB L3 cache accessible to all six cores.
The "Zen 5" cores on "Strix Point" will be able to sustain high boost frequencies, in excess of 5.00 GHz, and should benefit from the larger 16 MB L3 cache that's shared among just four cores (similar L3 cache per core to "Granite Ridge"). The "Zen 5c" cores, on the other hand, operate at lower base- and boost frequencies than the "Zen 5" cores, and have lesser amounts of available L3 caches. For threads to migrate between the two core types, they will have to go through the fabric, and in some cases, even incur a round-trip to the main memory.
The Zen 5c core is about 25% smaller in die-area than the Zen 5 core; for reference, the Zen 4c core is about 35% smaller than a regular Zen 4 core. AMD has worked to slightly improve the maximum boost frequencies of the Zen 5c core compared to its predecessor, so the frequency band of the Zen 5c cores sits a little closer to that of the Zen 5 cores. The lower maximum voltages and maximum boost frequencies of the Zen 5c cores give them a significant power-efficiency advantage over the Zen 5 cores. AMD continues to rely on a software-based scheduling solution to ensure that the right kind of workload goes to the right kind of core. The company says the software-based approach lets it correct "scheduling mistakes" over time.
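Applications that don't want to wait for the scheduler to learn can also steer themselves. The hedged sketch below pins a latency-sensitive thread to the first CCX using Linux's affinity API; the assumption that logical CPUs 0–7 map to the four SMT-enabled "Zen 5" cores is hypothetical and should be verified (for example, with the sysfs script above) before relying on it.

```python
# Hedged sketch: AMD's driver and the OS normally handle core selection, but
# an application can pin hot threads itself. The CPU numbering below is an
# assumption, not a guaranteed enumeration on Strix Point systems.
import os
import threading

ZEN5_CCX_CPUS = set(range(8))  # assumed: logical CPUs 0-7 = Zen 5 CCX (4 cores, SMT)

def latency_sensitive_work():
    os.sched_setaffinity(0, ZEN5_CCX_CPUS)  # pid 0 = the calling thread (Linux-only)
    # ... hot loop goes here ...

t = threading.Thread(target=latency_sensitive_work)
t.start()
t.join()
```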
The iGPU is the most bandwidth-hungry device on the fabric, and gets the widest data-path on the chip, at 4x 32B/cycle. It is based on the RDNA 3.5 graphics architecture, which retains the SIMD engine and IPC of RDNA 3 but brings several improvements to performance per Watt. The iGPU features 8 workgroup processors (WGPs), compared to 6 on the current "Phoenix" silicon, which works out to 16 CU, or 1,024 stream processors. It also packs 4 Render Backends+ (RB+), which work out to 16 ROPs.
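For readers keeping score, the shader math works out as follows, using RDNA's usual ratios of two CUs per WGP, 64 stream processors per CU, and four ROPs per RB+.

```python
# The shader-count arithmetic from the paragraph above.
wgps, cus_per_wgp, sps_per_cu = 8, 2, 64
rb_plus, rops_per_rb = 4, 4

print("Compute units:    ", wgps * cus_per_wgp)               # 16
print("Stream processors:", wgps * cus_per_wgp * sps_per_cu)  # 1024
print("ROPs:             ", rb_plus * rops_per_rb)            # 16
```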
The third most bandwidth-hungry device is the XDNA 2 NPU, with a 32B/cycle data-path of comparable bandwidth to a CCX. The NPU features 32 AI engine tiles, organized into four blocks of eight, for 50 TOPS of AI inferencing throughput, and it can even be overclocked. It also supports the Block FP16 data format (not to be confused with bfloat16), which offers precision close to FP16 at throughput comparable to 8-bit formats.
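To illustrate the block floating-point idea behind Block FP16, here is a generic toy sketch, not AMD's exact format: each block of values stores a single shared exponent plus an 8-bit mantissa per element, so per-element storage and arithmetic look like an 8-bit format while the dynamic range stays close to FP16.

```python
# Toy block floating-point quantizer (illustrative only, not AMD's format):
# one shared exponent per block, 8-bit signed mantissas per element.
import math

def quantize_block(values, mantissa_bits=8):
    shared_exp = max((math.frexp(v)[1] for v in values if v != 0.0), default=0)
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    qmax = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=8):
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

block = [0.5, -1.25, 3.0, 0.0078125]
exp, q = quantize_block(block)
print(dequantize_block(exp, q))  # large values survive; very small ones lose precision
```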
Besides these three logic-heavy components, there are other accelerators that are fairly demanding on bandwidth: the Video CoreNext engine that accelerates video encoding and decoding; the audio coprocessor that runs the audio stack even when the system is "powered down," so it can respond to voice commands; the display controller that handles display I/O, including Display Stream Compression if called for; and the SMU, Microsoft Pluton, TPM, and other manageability hardware.
The I/O interfaces of the "Strix Point" SoC include a memory controller that supports 128-bit LPDDR5/LPDDR5X and dual-channel DDR5 (160-bit). The PCI-Express root complex is slightly truncated compared to the one "Phoenix" comes with, at a total of 16 PCIe Gen 4 lanes. All 16 should be usable in notebooks, which lack a discrete FCH chipset, but the usable lane count should drop to 12 when AMD eventually adapts this silicon to Socket AM5 for desktop APUs. On gaming notebooks that use Ryzen AI 300 series HX or H processors, discrete GPUs should get a Gen 4 x8 connection. USB connectivity includes one 40 Gbps USB4 port (or two 20 Gbps USB 3.2 Gen 2x2 ports), two additional 10 Gbps USB 3.2 Gen 2 ports, and three classic USB 2.0 ports.
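As a sanity check on what a 128-bit controller implies for bandwidth, the quick calculation below uses the LPDDR5X-7500 and DDR5-5600 transfer rates AMD lists for Ryzen AI 300 parts; those rates come from AMD's product pages, not from this deep-dive.

```python
# Peak theoretical memory bandwidth for a 128-bit data bus.
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits=128):
    return transfer_rate_mts * bus_width_bits / 8 / 1000  # MT/s * bytes -> GB/s

print(f"LPDDR5X-7500: {peak_bandwidth_gbs(7500):.0f} GB/s")  # 120 GB/s
print(f"DDR5-5600:    {peak_bandwidth_gbs(5600):.1f} GB/s")  # 89.6 GB/s
```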
The "Strix Point" silicon sees the company's Infinity Fabric interconnect as its omnipresent ether. This is a point-to-point interconnect, unlike the ringbus on some Intel processors. The main compute machinery on the "Strix Point" SoC are its two CPU compute complexes (CCX), each with a 32b (read)/16b (write) per cycle data-path to the fabric. The concept of CCX makes a comeback with "Strix Point" after nearly two generations of "Zen." The first CCX contains the chip's four full-sized "Zen 5" CPU cores, which share a 16 MB L3 cache among themselves. The second CCX contains the chip's eight "Zen 5c" cores that share a smaller 8 MB L3 cache. Each of the 12 cores has a 1 MB dedicated L2 cache.This approach to heterogeneous multicore is significantly different from "Phoenix 2," where the two "Zen 4" and four "Zen 4c" cores were part of a common CCX, with a common 16 MB L3 cache accessible to all six cores.
The "Zen 5" cores on "Strix Point" will be able to sustain high boost frequencies, in excess of 5.00 GHz, and should benefit from the larger 16 MB L3 cache that's shared among just four cores (similar L3 cache per core to "Granite Ridge"). The "Zen 5c" cores, on the other hand, operate at lower base- and boost frequencies than the "Zen 5" cores, and have lesser amounts of available L3 caches. For threads to migrate between the two core types, they will have to go through the fabric, and in some cases, even incur a round-trip to the main memory.
The Zen 5c core is about 25% smaller in die-area than the Zen 5 core. For reference, the Zen 4c core is about 35% smaller than a regular Zen 4 core. AMD has worked to slightly improve the maximum boost frequencies of the Zen 5c core compared to its predecessor, so the frequency band of the Zen 5c cores are a tiny bit closer. The lower maximum voltages and maximum boost frequencies of Zen 5c cores put them at a significant power efficiency advantage over the Zen 5 cores. AMD is continuing to rely on a software based scheduling solution that ensures the right kind of processing workload goes to the right kind of core. The company says that the software based solution lets it correct "scheduling mistakes" over time.
The iGPU is the most bandwidth-hungry device on the fabric, and gets its widest data-path—4x 32B/cycle. Based on the RDNA 3.5 graphics architecture, which retains the SIMD engine and IPC of RDNA 3, but with several improvements to the performance/Watt, this iGPU also features 8 workgroup processors (WGPs), compared to the 6 on the current "Phoenix" silicon. This works out to 16 CU, or 1,024 stream processors. The iGPU also features 4 render backends+, which work out to 16 ROPs.
The third most bandwidth-hungry device is the XDNA 2 NPU, with a 32B/cycle data-path that's of a comparable bandwidth to a CCX. The NPU features four blocks of 8 XDNA 2 arrays, and 32 AI engine tiles; for 50 TOPS of AI inferencing throughput, and can be overclocked. It also supports the Block FP16 data format (not to be confused with bfloat16), which offers the precision of FP16, with the performance of FP8.
Besides the three logic-heavy components, there are other accelerators that are fairly demanding on the bandwidth, such as the Video CoreNext engine that accelerates encoding and decoding; the audio coprocessor that processes the audio stack when the system is "powered down," so it can respond to voice commands; the display controller that handles the display I/O, including display stream compression, if called for; the SMU, Microsoft Pluton, TPM, and other manageability hardware.
The I/O interfaces of the "Strix Point" SoC include a memory controller that supports 128-bit LPDDR5, LPDDR5x, and dual-channel DDR5 (160-bit). The PCI-Express root complex is slightly truncated compared to the one "Phoenix" comes with. There are a total of 16 PCIe Gen 4 lanes. All 16 should be usable in notebooks that lack a discrete FCH chipset, but the usable lane count should drop to 12 when AMD eventually adapts this silicon to Socket AM5 for desktop APUs. On gaming notebooks that use Ryzen AI HX or H 300 series processors, discrete GPUs should have a Gen 4 x8 connection. USB connectivity includes a 40 Gbps USB4, or two 20 Gbps USB 3.2 Gen 2x2, two additional 10 Gbps USB 3.2 Gen 2, and three classic USB 2.0.
16 Comments on AMD Strix Point SoC Reintroduces Dual-CCX CPU, Other Interesting Silicon Details Revealed
The green block 'CPU core' for the classic cores is transistor-for-transistor the same as the green block 'CPU core' for the compact core. AMD has reduced the spacing between transistors, which limits the max clock frequency but maintains IPC.
We now have pictures of both chip arrangements for the Turin EPYCs:
Turin 128 'Classic' Cores (256 threads)
Turin 192 'Compact' Cores (384 threads)
It will be funny if, after Zen 4c being way stronger than Gracemont E, Zen 5c turns out to be weaker than Skymont E, a strong possibility given that Intel says Skymont E is as strong as Raptor Cove P cores and that they clock up to 4.7 GHz.
As for Zen 5c vs Skymont E, SMT will allow Zen 5c to keep up with the latter in many workloads even at lower clocks. Of course, we don't know yet what clocks Zen 5c will reach.
NPU = wasted die space... Fricken' M$ and their co-pilot BS. That die space would be far more useful as more PCIe lanes.