Wednesday, June 14th 2023
AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC
AMD on Tuesday (June 13) launched the EPYC 9004 "Bergamo" 128-core/256-thread high-density compute server processor, and with it debuted the new "Zen 4c" CPU microarchitecture. Much had been made of Zen 4c in the run-up to yesterday's launch, including rumors that it is a Zen 4 "lite" core with less number-crunching muscle, and hence lower IPC, and that Zen 4c is AMD's answer to Intel's E-core architectures, such as "Gracemont" and "Crestmont." It turns out to be neither a lite version of Zen 4 nor an E-core, but a physically compacted version of the Zen 4 core with identical number-crunching machinery.
First things first: Zen 4c has exactly the same IPC as Zen 4 (that's performance at a given clock speed). This is because its front-end, execution stage, load/store component, and internal cache hierarchy are exactly the same. It has the same 88-deep load queue, the same 64-deep store queue, the same 6.75K-entry µop cache, the exact same INT+FP issue width of 10+6, the same INT register file, the same scheduler, and the same cache latencies. The L1I and L1D caches are the same 32 KB in size as on "Zen 4," and so is the dedicated L2 cache, at 1 MB.
The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD. While the regular 8-core "Zen 4" CCD has eight "Zen 4" cores sharing a 32 MB L3 cache, the new 16-core "Zen 4c" CCD that AMD introduced with "Bergamo" packs two 8-core CCXs (CPU core complexes), each with 16 MB of L3 cache shared among the 8 cores of the CCX. In this respect, the last-level cache and CPU core organization of the "Zen 4c" CCD has some similarities to the "Zen 2" CCD (which used two 4-core CCXs).
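The per-core L3 figures follow directly from the CCX layout described above. A minimal Python sketch of that arithmetic, using only the cache sizes and core counts quoted in this article (the `CCD` class and its field names are illustrative assumptions, not anything from AMD):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CCD:
    """Hypothetical container for the chiplet figures quoted in the article."""
    name: str
    ccx_count: int       # core complexes per chiplet
    cores_per_ccx: int   # cores sharing one L3 slice
    l3_per_ccx_mb: int   # L3 shared within one CCX, in MB

    @property
    def cores(self) -> int:
        return self.ccx_count * self.cores_per_ccx

    @property
    def l3_total_mb(self) -> int:
        return self.ccx_count * self.l3_per_ccx_mb

    @property
    def l3_per_core_mb(self) -> float:
        # "Effective" L3 per core: the CCX's L3 divided by the cores sharing it.
        return self.l3_per_ccx_mb / self.cores_per_ccx


ZEN4_CCD = CCD("Zen 4", ccx_count=1, cores_per_ccx=8, l3_per_ccx_mb=32)
ZEN4C_CCD = CCD("Zen 4c", ccx_count=2, cores_per_ccx=8, l3_per_ccx_mb=16)

for ccd in (ZEN4_CCD, ZEN4C_CCD):
    print(f"{ccd.name}: {ccd.cores} cores, {ccd.l3_total_mb} MB L3 total, "
          f"{ccd.l3_per_core_mb:.0f} MB L3 per core")
```

By these numbers both chiplets carry 32 MB of L3 in total; the Zen 4c CCD simply spreads it over twice as many cores, and splits it into two pools that each serve only their own CCX.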
What's interesting is that the 16-core "Zen 4c" CCD isn't AMD's first product from this generation with lower last-level cache per core. The "Phoenix" APU silicon used in Ryzen 7040 series mobile processors sees eight "Zen 4" cores share a 16 MB L3 cache. For math-heavy compute workloads with a smaller memory footprint, "Zen 4c" offers identical performance to "Zen 4." The smaller L3 cache should, however, impact performance in bandwidth-sensitive workloads with large data-sets.
The Zen 4c CCD is built on the exact same TSMC 5 nm EUV foundry node that the company makes its regular 8-core Zen 4 CCD on. The Zen 4c CPU core is nonetheless 35% smaller than the Zen 4 core, with a per-core die area of just 2.48 mm², compared to 3.84 mm². The die-size savings probably come from AMD "compacting" the various core components without reducing their form or function in any way. As we said earlier, the counts of the various core components remain the same, as do the sizes of the µop, L1, and L2 caches. EPYC 9004 "Bergamo" achieves its core-count of 128 using eight of these 16-core Zen 4c CCDs. In comparison, the regular "Genoa" processor achieves 96 cores over twelve 8-core Zen 4 CCDs.
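The 35% figure and the two core counts can be checked from the numbers quoted above. A short Python sketch of that arithmetic (the constant and function names are illustrative; the inputs are the article's figures only):

```python
# Per-core die areas quoted in the article, in mm^2.
ZEN4_CORE_MM2 = 3.84
ZEN4C_CORE_MM2 = 2.48


def area_reduction(baseline_mm2: float, compact_mm2: float) -> float:
    """Fraction by which the compact core is smaller than the baseline core."""
    return 1.0 - compact_mm2 / baseline_mm2


reduction = area_reduction(ZEN4_CORE_MM2, ZEN4C_CORE_MM2)

# Package core counts: number of CCDs times cores per CCD.
bergamo_cores = 8 * 16   # eight 16-core Zen 4c CCDs
genoa_cores = 12 * 8     # twelve 8-core Zen 4 CCDs

print(f"Zen 4c core is {reduction:.1%} smaller than Zen 4")  # 35.4%
print(f"Bergamo: {bergamo_cores} cores, Genoa: {genoa_cores} cores")
```

Note that this compares core area only. It says nothing about total CCD area, which also includes L3 cache and other logic that the article does not break down.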
153 Comments on AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC
Of course, the emulator was completely different to the one we have today, so you won't have the same success replicating that with a modern build of PCSX2. I suppose we shouldn't expect to see Zen 4c adopted on desktop, but you may see it in low-power, cost-efficient mobile SoCs eventually (such as a Mendocino/Zen 2 ULV replacement).
Despite taking approximately equal amounts of die space, the L3 temps are approximately 30% lower (indicating proportionally lower power consumption). There are probably benchmarks of chips with v-cache out there against chips without it running at the same clocks and core voltages that would show minimal differences between the two, but I'm too lazy to look. We could also just look at the stock performance results of the 5800X3D and see that, despite having approximately 30% more silicon due to the slice of v-cache, it actually consumes considerably less power than the 5800X while only running at a 100-200 MHz deficit.
"lite" implies less performance - these are designed to be 100% the same performance, in memory intensive tasks where the cache doesnt help
It's very clear their goal is to have balanced CPUs, but then also CPUs with a mix of 3D and C cores - 8 3D cores will provide top-tier gaming performance, while they can suddenly fit more C-cores in the same space (which makes them cheaper to produce) and have an 8 3D + 12 C setup out fairly easily with their chiplet designs. The 5800X3D is the example, where the cache runs hotter and they have to be clocked down
Remember that by the time we home users run a test, we're already running them optimised - we can't run a 5800X3D at 5.05GHz and compare to a boosted 5800X
They managed to find a way to force Windows to allow them to match Intel with async CPU designs with their drivers, so now they can get fancy and have 3D, normal and C cores and mix and match a dozen products from 3 parts
Voltage does nothing, it's amps that's the problem - and the problem is the heat from high amps kills the 3D cache, at lower temps than the CPUs can safely run at
That would mean that I expect it to have similar heat/power/thermal constraints as any other logic-chip. Because SRAM IS logic.
AMD Quietly Introduces Ryzen 3 5100 Quad-Core Processor For AM4
source?
AM4!
I mean, I said new CPUs were coming to AM4 a while back but this isn't what I had in mind.
My guess is AM4 is their budget platform now - they want to keep selling A520 and B550 boards (and the chipsets) with some AM4 CPUs to the low end market, while AM5 matures and gets cheaper over time
It makes sense for them to sell one generation old + the new at the same time, so they can keep two production lines running at any given time.
That way their budget stuff isn't fighting for fab space of the high end parts
it spreads the risk out, and lets them get more products to market against all those seemingly random launch shortages we've suffered
That video explains what's been niggling in my mind with Intel's un-Efficiency cores: The AMD cores are using 1.5-2W each on those 96-128 core chips. Those aren't even the new C cores.
~100W to 6 3D gaming P cores and 35W over 16 C cores?
Yeah, that would work wonders. Imagine office PCs and laptops at that power level, if they aren't boosting them out of their efficiency curves.
It is a bit of a never-ending conundrum with chips - desired optimization points. As a manufacturer, do you want to, or can you, sell efficiency as the main point? CPUs are a little trickier at that but GPUs might be an easier example - would you want an RTX 4090 at a 300W power limit? How about 150W? Given that everything that would go into such a product remains the same, the cost would also be the same.
There is always the possibility of limiting the larger CPU (or GPU) to the desired spot. AMD even has ECO mode. Both AMD and Intel (and Nvidia) have configurable power limits and, depending on the specific product and needs, also frequency limits. Basically, take a 7800X3D, limit its frequency to 3GHz and set the power limit at 24W and see where it leaves you and whether you would be willing to pay the cost for the results you get. Would be an interesting test, to be honest.
power efficiency is the key here - 16 core 32 thread laptops in a 45W power limit is entirely plausible, and that would shake up the market a lot
These are different in that even power limited and optimised they're well under the wattage you can achieve on anything else - played with a Zen 3+ DDR5 laptop and it was 3-4W per core in MT and 6W peak ST, 2-3x higher than these EPYC cores, which again are not based on the C core design.
Don't you think the limited release 5600X3D seems like the perfect thing to pair up with a bunch of C-cores? Memory light tasks get the 3D cache, core/thread heavy tasks get the C-cores.
Big big deal here is that the C-cores are also physically smaller, they can get more in the same physical space and produce more per wafer. That helps out the bottom line a lot.