Wednesday, June 14th 2023

AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC

AMD on Tuesday (June 13) launched the EPYC 9004 "Bergamo" 128-core/256-thread high-density compute server processor, and with it, debuted the new "Zen 4c" CPU microarchitecture. Much had been made of Zen 4c in the run-up to yesterday's launch, including rumors that it is a Zen 4 "lite" core with less number-crunching muscle, and hence lower IPC, and that it is AMD's answer to Intel's E-core architectures, such as "Gracemont" and "Crestmont." It turns out to be neither a lite version of Zen 4 nor an E-core, but a physically compacted version of the Zen 4 core, with identical number-crunching machinery.

First things first: Zen 4c has exactly the same IPC as Zen 4 (that is, performance at a given clock speed). This is because its front-end, execution stage, load/store component, and internal cache hierarchy are exactly the same. It has the same 88-deep load queue, the same 64-deep store queue, the same 6.75K µop cache, the same INT+FP issue width of 10+6, the same INT register file, the same scheduler, and the same cache latencies. The L1I and L1D caches are the same 32 KB in size as "Zen 4," and so is the dedicated L2 cache, at 1 MB.
The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD. While the regular 8-core "Zen 4" CCD has eight "Zen 4" cores sharing a 32 MB L3 cache, the new 16-core "Zen 4c" CCD AMD introduced with "Bergamo" packs two 8-core CCXs (CPU core complexes), each with 16 MB of L3 cache shared among the 8 cores of the CCX. In this respect, the last-level cache and CPU core organization of the "Zen 4c" CCD has some similarities to the "Zen 2" CCD (which used two 4-core CCXs).
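
As a quick check of the per-core L3 figures above, the arithmetic can be sketched in a few lines of Python. The function and variable names are illustrative only; the capacities and core counts are the ones quoted in this article.

```python
def l3_per_core_mb(l3_mb: float, cores: int) -> float:
    """Effective L3 per core when `cores` cores share `l3_mb` MB of L3."""
    return l3_mb / cores


# 8-core Zen 4 CCD: eight cores share one 32 MB L3.
zen4_per_core = l3_per_core_mb(32, 8)

# 16-core Zen 4c CCD: two 8-core CCXs, each with its own 16 MB L3.
zen4c_per_core = l3_per_core_mb(16, 8)
zen4c_ccd_l3_mb = 2 * 16  # total L3 on the Zen 4c CCD, split across two CCXs

print(zen4_per_core, zen4c_per_core, zen4c_ccd_l3_mb)
```

Note that the total L3 per CCD stays at 32 MB; it is the doubling of the core count and the split into two CCXs that halves the share available to each core.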

What's interesting is that the 16-core "Zen 4c" CCD isn't AMD's first product from this generation with lower last-level cache per core. The "Phoenix" APU silicon used in Ryzen 7040 series mobile processors sees eight "Zen 4" cores share a 16 MB L3 cache. For math-heavy compute workloads with a smaller memory footprint, "Zen 4c" offers identical performance to "Zen 4"; however, the smaller L3 cache should impact performance in bandwidth-sensitive workloads with large datasets.
The Zen 4c CCD is built on the same TSMC 5 nm EUV foundry node that the company uses for its regular 8-core Zen 4 CCD; however, the Zen 4c CPU core is 35% smaller than the Zen 4 core, with a per-core die area of just 2.48 mm², compared to 3.84 mm². The die-size savings probably come from AMD "compacting" the various core components without reducing their form or function in any way. As we said earlier, the counts of the various core components remain the same, as do the sizes of the µop, L1, and L2 caches. EPYC 9004 "Bergamo" achieves its core count of 128 using eight of these 16-core Zen 4c CCDs. In comparison, the regular "Genoa" processor achieves 96 cores over twelve 8-core Zen 4 CCDs.
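
The 35% figure and the core counts follow directly from the numbers in the paragraph above. A minimal Python sketch of that arithmetic (constant names are made up for this example):

```python
# Per-core die areas quoted in the article, in mm^2.
ZEN4_CORE_MM2 = 3.84
ZEN4C_CORE_MM2 = 2.48

# Fractional area saving of a Zen 4c core relative to a Zen 4 core.
area_reduction = 1 - ZEN4C_CORE_MM2 / ZEN4_CORE_MM2

# Core counts: CCDs per package times cores per CCD.
bergamo_cores = 8 * 16   # eight 16-core Zen 4c CCDs
genoa_cores = 12 * 8     # twelve 8-core Zen 4 CCDs

print(f"Zen 4c core is {area_reduction:.1%} smaller")
print(f"Bergamo: {bergamo_cores} cores, Genoa: {genoa_cores} cores")
```

The saving works out to a little over 35%, which matches the headline figure.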

153 Comments on AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC

#126
londiste
From everything we know about AMD C-cores they should not have a power efficiency benefit. From what AMD says, they are very much the same core - with lower L3 cache and no way to add 3D V-Cache - but that was their choice. They have a die area benefit for a clock speed boost penalty. Since their optimization point of choice is at low frequencies the penalty really does not come into play.

AMD already has a 16-core mobile CPU in the 7945HX - www.amd.com/en/product/13016 - with a slightly higher TDP range though. The lower end of the range at 55W should match pretty well to your 45W idea given the inherent efficiency handicap from using chiplets. Granted, the laptops with it usually run something like an 88W power limit but surely that is configurable.

Regarding running stuff at more optimized settings - my post might have come out more critical than intended; the same line of thought has been on my mind quite a number of times. I am currently running a 5800X3D limited to 76W and with a -30 curve optimizer negative offset.
#127
Unregistered
londiste said:
From everything we know about AMD C-cores they should not have a power efficiency benefit. From what AMD says, they are very much the same core - with lower L3 cache and no way to add 3D V-Cache - but that was their choice. They have a die area benefit for a clock speed boost penalty. Since their optimization point of choice is at low frequencies the penalty really does not come into play.
we already have data on that

www.phoronix.com/review/amd-epyc-9754-bergamo



Across all of these benchmarks carried out, the EPYC 9754 2P on average had a 385 Watt power draw... In comparison the EPYC 9654 2P had a 447 Watt average and the EPYC 9684X 2P had a 464 Watt average. And need we mention the Xeon Platinum 8490H 60-core processor consuming even more power with a 568 Watt average. The EPYC 9754 power consumption results surpassed my expectations in frankly not expecting Zen 4C to deliver such power efficiency improvements while still performing so well.
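
To put the averages quoted above side by side, here is a small Python sketch that expresses each system's average draw relative to the EPYC 9754 2P. The wattages are the ones in the Phoronix quote; the dictionary and function names are just for illustration. It only compares power, so on its own it says nothing about work done per watt.

```python
# Average power draw in watts, as quoted from the Phoronix review above.
avg_power_w = {
    "EPYC 9754 2P": 385,
    "EPYC 9654 2P": 447,
    "EPYC 9684X 2P": 464,
    "Xeon Platinum 8490H": 568,
}


def extra_power_fraction(other_w: float, base_w: float) -> float:
    """How much more power `other_w` is than `base_w`, as a fraction."""
    return other_w / base_w - 1


baseline = avg_power_w["EPYC 9754 2P"]
relative = {
    name: extra_power_fraction(watts, baseline)
    for name, watts in avg_power_w.items()
}

for name, frac in relative.items():
    print(f"{name}: {frac:+.1%} vs EPYC 9754 2P")
```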
#128
londiste
That Phoronix review really does not tell us much about power efficiency of Zen4c vs Zen4. Considerably more cores at lower clocks and generally well-threaded tests...
9754 is 128c/256t at 2.25/3.1/3.1 GHz
9654 is 96c/192t at 2.4/3.55/3.7 GHz
Those few hundred MHz alone make a noticeable difference in efficiency.
#129
Unregistered
londiste said:
That Phoronix review really does not tell us much about power efficiency of Zen4c vs Zen4. Considerably more cores at lower clocks and generally well-threaded tests...
9754 is 128c/256t at 2.25/3.1/3.1 GHz
9654 is 96c/192t at 2.4/3.55/3.7 GHz
Those few hundred MHz alone make a noticeable difference in efficiency.
efficiency is not being calculated in relation to the clocks, but in relation to the work done
#130
londiste
M440 said:
efficiency is not being calculated in relation to the clocks, but in relation to the work done
Efficiency calculation includes power consumption that is absolutely affected by clocks. Power consumption figures around 2-3GHz are on a quite steep slope.

More cores at lower clocks will be more efficient. The details on that are hard to see from a general result like that - plus, we do not really know the clock distribution across tests. For example, look at 9654 and 9554 in the same lineup where the former is ~20% faster at the same power draw. It is not quite the same level of difference as 9754 vs 9654 but still a noticeable efficiency difference (also, both of these are probably running at 3+GHz but we do not know exactly).

9654 is 96c/192t at 2.4/3.55/3.7 GHz
9554 is 64c/128t at 3.1/3.75/3.75 GHz

Edit:
I am not saying that Zen4c is not more efficient but this is not the data point that would show that in any clear manner, much less getting some idea how much more efficient.
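
To illustrate why clock speed alone can move perf/W, here is a toy model in Python, not a measurement. It assumes dynamic power scales as cores × V² × f, that throughput scales as cores × f at identical IPC, and a made-up linear voltage/frequency curve (`toy_voltage` and its coefficients are purely hypothetical, not AMD data). Leakage and uncore power are ignored. The core counts and all-core boost clocks are the ones listed earlier in this thread.

```python
def toy_voltage(freq_ghz: float) -> float:
    # HYPOTHETICAL V/f curve for illustration only; not AMD data.
    return 0.6 + 0.15 * freq_ghz


def dynamic_power(cores: int, freq_ghz: float) -> float:
    # Classic dynamic-power relation P ~ C * V^2 * f, in arbitrary units.
    volts = toy_voltage(freq_ghz)
    return cores * volts ** 2 * freq_ghz


def throughput(cores: int, freq_ghz: float) -> float:
    # Identical IPC assumed, so work per second ~ cores * clock.
    return cores * freq_ghz


def perf_per_watt(cores: int, freq_ghz: float) -> float:
    return throughput(cores, freq_ghz) / dynamic_power(cores, freq_ghz)


ppw_9754 = perf_per_watt(128, 3.10)  # 128 cores at 3.1 GHz all-core boost
ppw_9654 = perf_per_watt(96, 3.55)   # 96 cores at 3.55 GHz all-core boost

print(f"toy perf/W ratio, 9754 vs 9654: {ppw_9754 / ppw_9654:.2f}")
```

In this model the core count cancels out of perf/W entirely, so the whole gap comes from the lower clock and voltage. That is the point being made: an identical core run slower already looks more efficient, so the headline number cannot separate a Zen 4c effect from a clock effect.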
#131
Valentin294
Mussels said:
Zen 3D - add more cache
This is Zen 1D, less cache!

(For certain workloads the cache matters less, so the product makes sense)
I myself thought about this as well.
Since it seems as if Zen 4c cores use 20-30% less power at the same core count, adding 3D V-Cache to such a processor would result in 1.5 times the cache and better power efficiency, which would appeal to everyone. But to avoid the situation of the 7900X3D and 7950X3D, which perform worse in games than the 7800X3D, they should put it on both CCDs. I doubt it would happen with Zen 4 but rather with Zen 5, so I'll use 8000 naming in my examples.
AMD EPYC™ 9754 128c 256t 360W TDP
AMD EPYC™ 9654 96c 192t 360W TDP


Example (pure fiction; I have gone with 30% less power drawn)
  • Ryzen 5

    8600X TDP 65 W L3 32MB
    8600 TDP 65 W L3 32 MB
    8600C C-cores and 3D Vcache TDP 44 W L3 48 MB
  • Ryzen 7 (700&800)

    8700 X TDP 105 W L3 32MB
    8800X TDP 120 W L3 32MB
    8700 / 8800 TDP 65 W L3 32MB

    8700C TDP 75 W L3 48MB
    8800C TDP 85 W L3 48MB

    8800X3D TDP 120W L3 96MB
  • Ryzen 9 ( im only doing 900 here)

    8900X TDP 170 W L3 64MB
    8900 TDP 65 W L3 64MB

    8900X3d TDP 120 W L3 96 MB ->if they make the same configuration as with the 7900X3d
    8900C TDP 144 W L3 96 MB -> This one has on both CCDs 3D-V-Cache
If they scale as well as I think, then an 8900C would be a great choice for those who want great gaming performance and more cores, with the added benefit of better power efficiency, and the others just as well. In the case of the 8800 lineup, it would be a middle ground between a productivity CPU and a gaming CPU with 3D V-Cache, and so on.
#132
Unregistered
londiste said:
Efficiency calculation includes power consumption that is absolutely affected by clocks. Power consumption figures around 2-3GHz are on a quite steep slope.

More cores at lower clocks will be more efficient. The details on that are hard to see from a general result like that - plus, we do not really know the clock distribution across tests. For example, look at 9654 and 9554 in the same lineup where the former is ~20% faster at the same power draw. It is not quite the same level of difference as 9754 vs 9654 but still a noticeable efficiency difference (also, both of these are probably running at 3+GHz but we do not know exactly).

9654 is 96c/192t at 2.4/3.55/3.7 GHz
9554 is 64c/128t at 3.1/3.75/3.75 GHz

Edit:
I am not saying that Zen4c is not more efficient but this is not the data point that would show that in any clear manner, much less getting some idea how much more efficient.
efficiency is "work done" divided by wattage,
Valentin294 said:
I myself thought about this as well.
Since it seems as if Zen 4c cores use 20-30% less power at the same core count, adding 3D V-Cache to such a processor would result in 1.5 times the cache and better power efficiency, which would appeal to everyone. But to avoid the situation of the 7900X3D and 7950X3D, which perform worse in games than the 7800X3D, they should put it on both CCDs. I doubt it would happen with Zen 4 but rather with Zen 5, so I'll use 8000 naming in my examples.
AMD EPYC™ 9754 128c 256t 360W TDP
AMD EPYC™ 9654 96c 192t 360W TDP


Example (pure fiction; I have gone with 30% less power drawn)
  • Ryzen 5

    8600X TDP 65 W L3 32MB
    8600 TDP 65 W L3 32 MB
    8600C C-cores and 3D Vcache TDP 44 W L3 48 MB
  • Ryzen 7 (700&800)

    8700 X TDP 105 W L3 32MB
    8800X TDP 120 W L3 32MB
    8700 / 8800 TDP 65 W L3 32MB

    8700C TDP 75 W L3 48MB
    8800C TDP 85 W L3 48MB

    8800X3D TDP 120W L3 96MB
  • Ryzen 9 ( im only doing 900 here)

    8900X TDP 170 W L3 64MB
    8900 TDP 65 W L3 64MB

    8900X3d TDP 120 W L3 96 MB ->if they make the same configuration as with the 7900X3d
    8900C TDP 144 W L3 96 MB -> This one has on both CCDs 3D-V-Cache
If they scale as well as I think, then an 8900C would be a great choice for those who want great gaming performance and more cores, with the added benefit of better power efficiency, and the others just as well. In the case of the 8800 lineup, it would be a middle ground between a productivity CPU and a gaming CPU with 3D V-Cache, and so on.
imo compact c-cores can't run 3dcache as of now, but hope somebody else can confirm

provisioning for TSV (vertical wiring vias for connecting 3D V-Cache) has been eliminated, which saved space on the chip
#133
AnotherReader
M440 said:
efficiency is "work done" divided by wattage,



imo compact c-cores can't run 3dcache as of now, but hope somebody else can confirm

provisioning for TSV (vertical wiring vias for connecting 3D V-Cache) has been eliminated, which saved space on the chip
The Zen 4c analysis by SemiAnalysis confirms that TSVs have been eliminated to save space.
The L3 also lacks the arrays of Through-Silicon Vias (TSV) for 3D V-Cache, giving a small area saving.
#134
londiste
M440 said:
efficiency is "work done" divided by wattage
Do you want to say that neither "work done" nor wattage is affected by clocks?
#135
Mussels
Freshwater Moderator
londiste said:
Do you want to say that neither "work done" nor wattage is affected by clocks?
it's an entirely different thing, and clear distinctions need to be made

it's like saying an amps value, but refusing to state volts or watts - some metrics are a combination of others, but singular ones are often worthless without the other corresponding data


from the link @AnotherReader posted above
The trend is that performance per Watt in any given workload is the most important factor, and as such can command a significant price premium. Look no further than the AMD Milan to Genoa transition, where AMD was able to command an 80% price increase simply due to the increased deployment density and performance per watt.
performance per watt and overall efficiency is what AMD makes their money from, and we're going to see consumer products reflecting that.

laptops and OEM desktops will love the tits off this, because it means smaller lighter products that need less cooling, and that means more profits.


Looks like they already are
This means an identical IPC and ISA feature level, which simplifies integration on the client side. In fact, AMD’s is also silently swapping some Zen 4 cores with Zen 4c cores in its lower-end 4nm Ryzen 7000U “Phoenix” mobile processors. On Bergamo, Zen 4c allows AMD to increase core counts from 96 to 128 while saving on area and cost. This bifurcation in design philosophy will increase in future generations of hardware.
The further details on the current designs match something AMD said in TPU's interview recently, that they've got performance concerns passing a certain threshold without faster RAM to back it up - they've limited how many performance cores they can have with the current design (using 8 of 12 CCX links)
However, the truly stunning thing here is the die size. 16 Zen 4c cores are barely larger than 8 Zen 4 cores
This is where things will change, as the cores are individually slower they can slap in twice as many cores for an overall performance gain as well as an efficiency gain - and possibly use the unused CCX links.

256 cores in the server world is entirely plausible, and probably being worked on already.
#136
B_Bang
Does this mean they do not require Windows 11? Windows 10 should be fine then?
#137
Squared
B_Bang said:
Does this mean they do not require Windows 11? Windows 10 should be fine then?
I haven't seen benchmarks but if you're talking about the Phoenix 2 die which maxes out at 2+4 cores (Zen 4 + Zen 4c), I would expect Windows 10 to do fine. Windows 10 is aware of preferred cores and I think that's all it needs for best results, and under laptop power limits, worst-case results will only be a little worse than best case.
#138
Mussels
Freshwater Moderator
B_Bang said:
Does this mean they do not require Windows 11? Windows 10 should be fine then?
They aren't like Intel mixing two types of cores, so nothing special is needed at the OS scheduler level.
#139
AusWolf
B_Bang said:
Does this mean they do not require Windows 11? Windows 10 should be fine then?
Considering that the only major difference between Zen 4 and Zen 4c is clock speed, which is something Intel had even back on 11th gen with Turbo Boost 3.0, I'd say, absolutely.
#140
Mussels
Freshwater Moderator
AusWolf said:
Considering that the only major difference between Zen 4 and Zen 4c is clock speed, which is something Intel had even back on 11th gen with Turbo Boost 3.0, I'd say, absolutely.
Well, cache.

Per CCX cache values, so combinations exist with dual CCX CPUs

16MB (Phoenix 'G' APU)
16MB (Dionysus/Zen 4C)
32MB (per CCX) in regular Zen4 (Raphael)
96MB (x3D)

The Zen4C come across initially as being an APU without the APU, but they fit twice as many cores in the same space - mostly due to changing the SRAM used, it would seem.
In addition to the reduced core footprint, die space is further saved in the Zen 4c CCD via the use of denser 6T dual-port SRAM cells and an overall reduction of L3 cache to 16 MB per 8-core CCX. Zen 4c cores have the same sized L1 and L2 caches as Zen 4 cores but the cache die area in Zen 4c cores is lower due to using denser SRAM and slower cache
Using denser, slower L3 cache let them make it physically smaller and slap in double the cores, but since L1 and L2 are the same the basic performance matches Zen4 in general.
It's like a reversal of the x3D chips, since some tasks didnt benefit from the extra cache (rendering, extremely long workloads etc), they made a chip with less, slower cache to fit that need.


Cant wait for something with 8 3D cores and 32 C cores, that'll be the thing to blast every benchmark off the map
#141
Squared
Mussels said:
Using denser, slower L3 cache let them make it physically smaller and slap in double the cores, but since L1 and L2 are the same the basic performance matches Zen4 in general.
It's like a reversal of the x3D chips, since some tasks didnt benefit from the extra cache (rendering, extremely long workloads etc), they made a chip with less, slower cache to fit that need.
The only chip with both Zen 4 and Zen 4c cores is Phoenix 2, and in Phoenix 2 all cores share the same L3 cache. (The Zen 4c-only chip, Bergamo, does have denser and more widely shared L3 cache.)
#142
Mussels
Freshwater Moderator
Squared said:
The only chip with both Zen 4 and Zen 4c cores is Phoenix 2, and in Phoenix 2 all cores share the same L3 cache. (The Zen 4c-only chip, Bergamo, does have denser and more widely shared L3 cache.)
I assume they'll make more hybrid designs in the future, they've barely begun on it.
#143
AusWolf
Mussels said:
Well, cache.

Per CCX cache values, so combinations exist with dual CCX CPUs

16MB (Phoenix 'G' APU)
16MB (Dionysus/Zen 4C)
32MB (per CCX) in regular Zen4 (Raphael)
96MB (x3D)

The Zen4C come across initially as being an APU without the APU, but they fit twice as many cores in the same space - mostly due to changing the SRAM used, it would seem.


Using denser, slower L3 cache let them make it physically smaller and slap in double the cores, but since L1 and L2 are the same the basic performance matches Zen4 in general.
It's like a reversal of the x3D chips, since some tasks didnt benefit from the extra cache (rendering, extremely long workloads etc), they made a chip with less, slower cache to fit that need.


Cant wait for something with 8 3D cores and 32 C cores, that'll be the thing to blast every benchmark off the map
Yeah, but the L3 cache is just one single entity shared across the whole CPU, so the scheduler doesn't need to do anything special to account for it. L1 and L2 are the same across Zen 4 and Zen 4c cores.
#144
AnotherReader
Mussels said:
Well, cache.

Per CCX cache values, so combinations exist with dual CCX CPUs

16MB (Phoenix 'G' APU)
16MB (Dionysus/Zen 4C)
32MB (per CCX) in regular Zen4 (Raphael)
96MB (x3D)

The Zen4C come across initially as being an APU without the APU, but they fit twice as many cores in the same space - mostly due to changing the SRAM used, it would seem.


Using denser, slower L3 cache let them make it physically smaller and slap in double the cores, but since L1 and L2 are the same the basic performance matches Zen4 in general.
It's like a reversal of the x3D chips, since some tasks didnt benefit from the extra cache (rendering, extremely long workloads etc), they made a chip with less, slower cache to fit that need.


Cant wait for something with 8 3D cores and 32 C cores, that'll be the thing to blast every benchmark off the map
Given that 32 MB of L3 in a Zen 4c die takes about the same die space as in a Zen 4 die, it's rather unlikely that it's any denser or slower than the L3 in regular Zen 4. For large SRAM arrays, wire delay contributes significantly to the access time so a smaller array should be a little faster than a large array. Of course, larger wires can be used for the larger array to decrease wire delay. Another example is the extra cache in the 7950X3D which is denser than regular SRAM and is on a different die, but it only incurs 4 more cycles of latency.
#145
Mussels
Freshwater Moderator
AusWolf said:
Yeah, but the L3 cache is just one single entity shared across the whole CPU, so the scheduler doesn't need to do anything special to account for it. L1 and L2 are the same across Zen 4 and Zen 4c cores.
100% agreed

They used higher density cache (and less of it) which is something the OS doesnt know or care about, so all those core types appear the same.
The only thing needed is something the chipset driver already does, with a way to push games onto cores with higher cache if available
AnotherReader said:
Given that 32 MB of L3 in a Zen 4c die takes about the same die space as in a Zen 4 die, it's rather unlikely that it's any denser or slower than the L3 in regular Zen 4. For large SRAM arrays, wire delay contributes significantly to the access time so a smaller array should be a little faster than a large array. Of course, larger wires can be used for the larger array to decrease wire delay. Another example is the extra cache in the 7950X3D which is denser than regular SRAM and is on a different die, but it only incurs 4 more cycles of latency.
Zen4C fits twice as many cores in the same space - they stick 16 cores where 8 Zen4 cores fit. They are a LOT denser.
#146
Squared
Mussels said:
They used higher density cache (and less of it) which is something the OS doesnt know or care about, so all those core types appear the same.
The only thing needed is something the chipset driver already does, with a way to push games onto cores with higher cache if available
The OS certainly has reason to care about L3 cache. If one application has 2 threads which frequently communicate and share data, then it's hugely beneficial for those two threads to share the same L3 cache pool. This would be a concern in Zen and Zen 2 where every four cores has a separate L3 cache and there's a long latency penalty between blocks. And I'm not aware of the chipset playing a role in this. The OS chooses which core a thread will run on.
Mussels said:
Zen4C fits twice as many cores in the same space - they stick 16 cores where 8 Zen4 cores fit. They are a LOT denser.
The article says that the cores themselves are 35% denser. The rest of the density increase comes from using half the L3 cache per core (but twice as many cores) with a denser cache design. That's for Bergamo.

Where Zen 4 and Zen 4c are used together is Phoenix 2, and in Phoenix 2 both types of cores share the same pool of L3 cache, so in Phoenix 2 there is literally no difference between the cores with respect to cache.
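
For anyone who wants to check which cores actually share an L3 on their own machine, here is a minimal Python sketch for Linux. It assumes the kernel exposes the usual sysfs cache topology files (`/sys/devices/system/cpu/cpuN/cache/indexM/level` and `shared_cpu_list`); the function names are made up for this example, and it only reads the topology, it is not how any OS scheduler is implemented.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as '0-7,64-71' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l3_groups(sysfs_root: str = "/sys/devices/system/cpu") -> list[list[int]]:
    """Return the distinct sets of logical CPUs that share one L3 cache."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted here
    groups: set[frozenset[int]] = set()
    for level_file in root.glob("cpu[0-9]*/cache/index*/level"):
        try:
            if level_file.read_text().strip() != "3":
                continue  # only interested in level-3 caches
            shared = (level_file.parent / "shared_cpu_list").read_text()
        except OSError:
            continue  # file vanished or is unreadable; skip it
        groups.add(frozenset(parse_cpu_list(shared)))
    return sorted(sorted(group) for group in groups)


if __name__ == "__main__":
    for group in l3_groups():
        print(f"{len(group)} logical CPUs share an L3: {group}")
```

On a part where every core sits behind one L3 this should print a single group; on a multi-CCX part it should print one group per CCX, which is the boundary a scheduler would want to keep communicating threads inside.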
#147
Mussels
Freshwater Moderator
Squared said:
The OS certainly has reason to care about L3 cache. If one application has 2 threads which frequently communicate and share data, then it's hugely beneficial for those two threads to share the same L3 cache pool. This would be a concern in Zen and Zen 2 where every four cores has a separate L3 cache and there's a long latency penalty between blocks. And I'm not aware of the chipset playing a role in this. The OS chooses which core a thread will run on.

The article says that the cores themselves are 35% denser. The rest of the density increase comes from using half the L3 cache per core (but twice as many cores) with a denser cache design. That's for Bergamo.

Where Zen 4 and Zen 4c are used together is Phoenix 2, and in Phoenix 2 both types of cores share the same pool of L3 cache, so in Phoenix 2 there is literally no difference between the cores with respect to cache.
Good points - there are a variety of designs there and that does make it more confusing when talking about it.

Summary: scheduling changes aren't needed at the OS level to use these CPUs. Hybrid designs need something to push programs to the best choice, but the 'worst case' wont be like on Intels E-cores where programs can outright crash or have massive performance losses.
#148
Squared
Mussels said:
Hybrid designs need something to push programs to the best choice, but the 'worst case' wont be like on Intels E-cores where programs can outright crash or have massive performance losses.
I've never heard of crashes being caused by Intel E-core scheduling. Any app should work properly even if the OS moves it to an E-core, unless the app was designed to fail specifically in this case, like anti-cheat software. But the potential performance hit from a scheduling mistake could indeed be a lot worse than for Zen 4/4c.
#149
AnotherReader
Mussels said:
100% agreed

They used higher density cache (and less of it) which is something the OS doesnt know or care about, so all those core types appear the same.
The only thing needed is something the chipset driver already does, with a way to push games onto cores with higher cache if available


Zen4C fits twice as many cores in the same space - they stick 16 cores where 8 Zen4 cores fit. They are a LOT denser.
It's the cores that are much denser. The L3 seems to be the same.
#150
AusWolf
Mussels said:
100% agreed

They used higher density cache (and less of it) which is something the OS doesnt know or care about, so all those core types appear the same.
The only thing needed is something the chipset driver already does, with a way to push games onto cores with higher cache if available
That's the thing - all cores have the same amount of cache. The only thing that differs is the density of circuits and the resulting clock speed difference. The scheduler only needs to know which cores are faster, which it already does since preferred cores were invented with Intel's 11th gen and Zen 3.