Wednesday, June 14th 2023

AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC

AMD on Tuesday (June 13) launched the EPYC 9004 "Bergamo" 128-core/256-thread high-density compute server processor, and with it debuted the new "Zen 4c" CPU microarchitecture. Much had been made of Zen 4c in the run-up to yesterday's launch, including rumors that it is a Zen 4 "lite" core with less number-crunching muscle, and hence lower IPC, and that it is AMD's answer to Intel's E-core architectures such as "Gracemont" and "Crestmont." It turns out to be neither a lite version of Zen 4 nor an E-core, but a physically compacted version of the Zen 4 core with identical number-crunching machinery.

First things first: Zen 4c has the exact same IPC as Zen 4 (that is, the same performance at a given clock speed). This is because its front-end, execution stage, load/store component, and internal cache hierarchy are exactly the same. It has the same 88-deep load queue, the same 64-deep store queue, the same 6.75K-entry µop cache, the exact same INT+FP issue width of 10+6, the same INT register file, the same scheduler, and the same cache latencies. The L1I and L1D caches are the same 32 KB in size as on "Zen 4," and so is the dedicated L2 cache, at 1 MB.
The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD. While the regular 8-core "Zen 4" CCD has eight "Zen 4" cores sharing a 32 MB L3 cache, the new 16-core "Zen 4c" CCD AMD introduced with "Bergamo" sees the chiplet pack two 8-core CCX (CPU core complexes), each with 16 MB of L3 cache shared among the 8 cores of the CCX. In this respect, the last-level cache and CPU core organization of the "Zen 4c" CCD has some similarities to the "Zen 2" CCD (which used two 4-core CCXs).
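The per-core L3 figures above follow directly from the chiplet layouts. As a rough sketch (the `Chiplet` class and its field names are made up for illustration; only the core counts and cache sizes come from the article), the arithmetic looks like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chiplet:
    """Cache/core layout of one CCD, using the figures quoted in the article.

    The class and field names are hypothetical, chosen for this sketch only.
    """
    name: str
    ccx_count: int        # core complexes (CCX) per chiplet
    cores_per_ccx: int    # cores sharing one L3 cache
    l3_per_ccx_mb: int    # L3 capacity of each CCX, in MB

    @property
    def cores(self) -> int:
        return self.ccx_count * self.cores_per_ccx

    @property
    def l3_total_mb(self) -> int:
        return self.ccx_count * self.l3_per_ccx_mb

    @property
    def l3_per_core_mb(self) -> float:
        # "Effective" L3 per core: the shared L3 divided by the cores sharing it.
        return self.l3_per_ccx_mb / self.cores_per_ccx


# Regular CCD: one 8-core CCX with 32 MB of shared L3.
ZEN4_CCD = Chiplet("Zen 4 CCD", ccx_count=1, cores_per_ccx=8, l3_per_ccx_mb=32)
# "Bergamo" CCD: two 8-core CCXs, each with 16 MB of shared L3.
ZEN4C_CCD = Chiplet("Zen 4c CCD", ccx_count=2, cores_per_ccx=8, l3_per_ccx_mb=16)

for ccd in (ZEN4_CCD, ZEN4C_CCD):
    print(f"{ccd.name}: {ccd.cores} cores, {ccd.l3_total_mb} MB L3 total, "
          f"{ccd.l3_per_core_mb:.0f} MB L3 per core")
```

Note that both chiplets carry 32 MB of L3 in total; what halves is the amount each core can reach within its own CCX.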

What's interesting is that the 16-core "Zen 4c" CCD isn't AMD's first product from this generation with lower last-level cache per core. The "Phoenix" APU silicon used in Ryzen 7040 series mobile processors sees eight "Zen 4" cores share a 16 MB L3 cache. For math-heavy compute workloads with a smaller memory footprint, "Zen 4c" offers identical performance to "Zen 4" at a given clock speed; however, the smaller L3 cache should impact performance in bandwidth-sensitive workloads with large data sets.
The Zen 4c CCD is built on the exact same TSMC 5 nm EUV foundry node that the company makes its regular 8-core Zen 4 CCD on. However, the Zen 4c CPU core is 35% smaller than the Zen 4 core, with a per-core die area of just 2.48 mm², compared to 3.84 mm². The die-size savings probably come from AMD "compacting" the various core components without reducing their form or function in any way. As we said earlier, the counts of the various core components remain the same, as do the sizes of the µop, L1, and L2 caches. EPYC 9004 "Bergamo" achieves its core count of 128 using eight of these 16-core Zen 4c CCDs. In comparison, the regular "Genoa" processor achieves 96 cores over twelve 8-core Zen 4 CCDs.
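The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only the per-core areas and CCD counts quoted above; the function and constant names are illustrative, not anything AMD publishes:

```python
# Per-core die areas as quoted in the article, in mm².
ZEN4_CORE_MM2 = 3.84
ZEN4C_CORE_MM2 = 2.48


def area_reduction(small_mm2: float, large_mm2: float) -> float:
    """Fractional area saving of the smaller core relative to the larger one."""
    return 1.0 - small_mm2 / large_mm2


def cores_per_reference_area(small_mm2: float, large_mm2: float) -> float:
    """How many of the smaller cores fit in the area of one larger core."""
    return large_mm2 / small_mm2


def package_cores(ccd_count: int, cores_per_ccd: int) -> int:
    """Total cores on a package built from identical CCDs."""
    return ccd_count * cores_per_ccd


reduction = area_reduction(ZEN4C_CORE_MM2, ZEN4_CORE_MM2)
density = cores_per_reference_area(ZEN4C_CORE_MM2, ZEN4_CORE_MM2)
bergamo_cores = package_cores(ccd_count=8, cores_per_ccd=16)
genoa_cores = package_cores(ccd_count=12, cores_per_ccd=8)

print(f"Zen 4c core is {reduction:.1%} smaller than Zen 4")
print(f"About {density:.2f} Zen 4c cores fit in the area of one Zen 4 core")
print(f"Bergamo: {bergamo_cores} cores, Genoa: {genoa_cores} cores")
```

The ratio works out to roughly one and a half Zen 4c cores per Zen 4 core of area, which is why Bergamo reaches a higher core count with fewer chiplets.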

153 Comments on AMD Zen 4c Not an E-core, 35% Smaller than Zen 4, but with Identical IPC

#26
chrcoluk
Mussels: It's all good, I just loathe Intel's E-cores because they used a name that is the exact opposite of the product to mislead people about them

They're more efficient at single threaded tasks, and then intel uses them exclusively for multi threaded tasks.
Just... Ugh.
From my own experimenting (albeit on Windows 10, which doesn't have Intel's preconfigured scheduler):

By default in Windows 10, E-cores are heavily favoured: pretty much all single-threaded tasks are loaded onto them and P-cores are parked. This happens even if parking is disabled in the power profile (Ultimate Performance). ParkControl also can't override this behaviour.

If I adjust the heterogeneous thread scheduling policy, a hidden setting in Windows, I can manipulate this behaviour. Setting it to either "all processors" or "performant" starts letting P-cores be used; the latter, however, almost blocks use of E-cores, so it's not ideal if you still want them to be used. But it would be a quick and dirty fix: e.g. if you want to fire up a single-threaded game, it would give you near certainty that it would use a P-core, without having to worry about affinity settings. You could use it with something like 'AutoPowerOptionsOk' to automate the solution. Setting it to "all processors" would likely require using something like Process Hacker to get things working in an optimal way with automation, e.g. affinity for svchost and browsers to E-cores and affinity for games to P-cores (good for security as well, as E-cores don't have HTT).

Both of these scheduling options still automatically favour the fastest two P-cores for single-threaded Cinebench, which is nice; on my Ryzen CPUs this doesn't happen. It also doesn't happen on my 9900K, which is one reason why I went to an all-core clock speed on the 9900K. But my testing on Ryzen and the 9900K was done on 1809, whilst the 13700K was on 21H2, so it's possible 1809 has no programming for "favoured cores", as that was introduced later, I think.

AMD of course have this problem as well, with some of their processors for different reasons.

I assume the improvements in Windows 11 are just a better default behaviour when specific CPUs are recognised, for a better OOB experience.
Posted on Reply
#27
tabascosauz
At least it's nice for them to finally have a real name. These Zen 4c cores are literally just APU grade Zen 4, jammed into a chiplet. Better than having to call them "reduced-cache Zen" every time to distinguish them.
TheoneandonlyMrK: A Zen4c does everything a zen4 does just probably a bit slower.
Yes, much more capable than an E-core, but in heterogeneous applications it requires no less scheduling optimization. Half-cache Zen 2 and Zen 3 perform slightly worse per clock in productivity and significantly worse in games/cache heavy workloads.
Posted on Reply
#28
Denver
It seems to me that the simplification of the design has the weakness of not reaching clocks as high as Zen4. But this is not a problem on CPUs intended for servers...

"The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD."

If that's true, why does the slide say 35% smaller when comparing just core+L2?
Posted on Reply
#29
londiste
at 35% smaller,
TheoneandonlyMrK: So all that's left unknown is the effect on max boost clocks.

I don't think the enterprise version ever needed the high frequency capability that zen has so these probably cannot run as fast.

But it's intriguing.
Unless there is a very heavy clock deficit, there should be no reason for keeping plain old Zen4 around any more.
Posted on Reply
#30
Oberon
Mussels: Did you pull that from the 35% decrease in size, and just hoped the math is the same?

Cause uh, halving the cache likely decreases those quite a bit
Cache actually doesn't consume much energy, so it doesn't have a large effect on temps. The bigger contributor to lower temps will be the reduced clockspeed.
Posted on Reply
#31
R0H1T
As long as they're making regular zen4 based chips for AM5 this will continue, also just had a look at fleabay recently awesome value on some of those previous gen EPYC's o_O
Posted on Reply
#32
AnotherReader
Denver: "The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD."

If that's true, why does the slide say 35% smaller when comparing just core+L2?
Last week, TechPowerUp reported on an analysis by SemiAnalysis that went over how AMD made Zen 4c smaller. While it's behind a paywall, the first part covering the physical design is free to read. The core sans the L2 cache is 44% smaller, i.e. nearly half the size of a Zen 4 core. It's an impressive feat of physical design.
TLDR:
  1. reducing the number of timing critical regions to just 4 from well over 10 in Zen 4 as seen in the diagram below: this sacrifices clock speed for density
  2. a new SRAM bitcell developed by TSMC for memories outside L2. As a 6T design, it saves area compared to the usual 8T designs
  3. lower clock speed target allows denser circuits
  4. The L3 also lacks the arrays of Through-Silicon Vias (TSV) for 3D V-Cache, giving a small area saving. This means that there's no possibility of a stacked L3 cache for Zen 4c.
Posted on Reply
#33
Denver
AnotherReader: Last week, TechPowerUp reported on an analysis by SemiAnalysis that went over how AMD made Zen 4c smaller. While it's behind a paywall, the first part covering the physical design is free to read. The core sans the L2 cache is 44% smaller, i.e. nearly half the size of a Zen 4 core. It's an impressive feat of physical design.
TLDR:
  1. reducing the number of timing critical regions to just 4 from well over 10 in Zen 4 as seen in the diagram below: this sacrifices clock speed for density
  2. a new SRAM bitcell developed by TSMC for memories outside L2. As a 6T design, it saves area compared to the usual 8T designs
  3. lower clock speed target allows denser circuits
  4. The L3 also lacks the arrays of Through-Silicon Vias (TSV) for 3D V-Cache, giving a small area saving. This means that there's no possibility of a stacked L3 cache for Zen 4c.
Thank you, that is a much clearer and more detailed explanation. Zen4c would be a much better efficiency core if AMD decides to beat Intel at its own game.
Posted on Reply
#34
NeuralNexus
Mussels: It's all good, I just loathe Intel's E-cores because they used a name that is the exact opposite of the product to mislead people about them

They're more efficient at single threaded tasks, and then intel uses them exclusively for multi threaded tasks.
Just... Ugh.
It's kind of funny because people have bought into the marketing fluff when it comes to their desktop product stack. Efficiency cores are just as bloated as the Performance cores, because they are using Skylake architecture for those cores.
Posted on Reply
#35
Patriot
Denver: It seems to me that the simplification of the design has the weakness of not reaching clocks as high as Zen4. But this is not a problem on CPUs intended for servers...

"The only thing that's changed is that the effective L3 cache per core has been reduced to 2 MB, from 4 MB on the 8-core "Zen 4" CCD."

If that's true, why does the slide say 35% smaller when comparing just core+L2?
Architectural change that affects performance...
It has been power and density optimized allowing for 16 cores per ccx... this means 16 cores share the same cache that...8 shared before...
Now you get 128 cores with 8 ccx/core chiplets vs 12 for 96 on zen4
Posted on Reply
#36
tfp
I thought at one point AMD had slides to show that the "c" version would just be cache-reduced, but that when it is implemented on the consumer big.LITTLE side they will combine prior-generation "c" cores with next-generation "p" cores. I might be misremembering, or it was just a rumor, because I can't find definitive information on this with a quick Google search.

If intel would need to use their server Xeon Phi atom cores they could get on "feature parity" with their "p" cores at least when it comes to hyperthreading and AVX-512. The ATOM cores would still be a heck of a lot slower.
ark.intel.com/content/www/us/en/ark/products/128694/intel-xeon-phi-processor-7235-16gb-1-3-ghz-64-core.html
Posted on Reply
#37
AnotherReader
tfp: If intel would need to use their server Xeon Phi atom cores they could get on "feature parity" with their "p" cores at least when it comes to hyperthreading and AVX-512. The ATOM cores would still be a heck of a lot slower.
ark.intel.com/content/www/us/en/ark/products/128694/intel-xeon-phi-processor-7235-16gb-1-3-ghz-64-core.html
The Xeon Phi atom cores are much slower than the Gracemont cores used along side Golden Cove and Raptor Cove. Their only saving grace is AVX-512.
Posted on Reply
#38
dyonoctis
Mussels: It's all good, I just loathe Intel's E-cores because they used a name that is the exact opposite of the product to mislead people about them

They're more efficient at single threaded tasks, and then intel uses them exclusively for multi threaded tasks.
Just... Ugh.
Looking at how Sapphire Rapids struggles against Zen 3 TR at equal core count while using more power, I'm really not surprised that they are being used in that manner. If RPL is already disgusting when it comes to power draw, a 16 P-core i9 might have been uglier to witness on consumer platforms. A 65 W locked 7950X is still faster than Golden Cove going at 200 watts. (Note that Puget is enforcing PL1 125 W and PL2 253 W on the Core i9 since those are the reference values set by Intel, and it's still faster than the Xeon)
Posted on Reply
#39
ValenOne
TheoneandonlyMrK: Apples and oranges, there is a bigger gap between the E-cores and P-cores than this.

E cores are single threaded and have fewer resources and less capability And a reduced ISA no AVX for example.

So yes Intel do smaller but they are also weaker less capable and actually require process scheduler interaction.

A Zen4c does everything a zen4 does just probably a bit slower.
Intel's E-Cores have AVX2 via three 128-bit SIMD units, hence they are closer to AMD's Zen 1.x's quad 128-bit SIMD units.

Intel's E-Cores do not have AVX-512.
Posted on Reply
#40
tfp
AnotherReader: The Xeon Phi atom cores are much slower than the Gracemont cores used along side Golden Cove and Raptor Cove. Their only saving grace is AVX-512.
The point is they could bolt on HT and AVX-512 as they have before and they have the "roadmap" on how to do it in the next version if they wanted. Being Intel they won't until they are forced to by AMD.
Posted on Reply
#41
ValenOne
dyonoctis: Looking at how Sapphire Rapids struggles against Zen 3 TR at equal core count while using more power, I'm really not surprised that they are being used in that manner. If RPL is already disgusting when it comes to power draw, a 16 P-core i9 might have been uglier to witness on consumer platforms. A 65 W locked 7950X is still faster than Golden Cove going at 200 watts. (Note that Puget is enforcing PL1 125 W and PL2 253 W on the Core i9 since those are the reference values set by Intel, and it's still faster than the Xeon)
Cinebench R23 doesn't use AVX-512.

Are you using Cinema 4D R25 or Blender 3.x?

-------------------


After 10 minute run, Intel Core i9 13900KS's scores are lower.
Posted on Reply
#42
Tomorrow
Od1sseas: Intel can pack 4 E-Cores in the same size as 1 P-Core. What about AMD? How many Zen4c cores for one Zen 4 core?
Still one as 4c is ~35% smaller. In order to pack two 4c cores in the same area 4c would need to be half the size as regular 4.

However, I could see a possible two-chiplet AM5 version where one chiplet uses 8 Zen 4 cores and another uses 16 Zen 4c cores, giving a total of 24c/48t, albeit with a reduced total L3 compared to 7950X (and 7950X3D).

Not sure there is a market for such a chip, as it would be a multi-threaded focused product that would likely suffer the same or worse problems in games as the 7950X does, and would lose to X3D parts for sure. However, there is an argument to be made that a regular 7950X could be replaced by this with a small performance hit in cache-sensitive workloads, because 7950X buyers likely care more about core counts than cache.

Also, I'm not sure if it's viable to make a model that has two chiplets with different core counts because, correct me if I'm wrong, thus far all AMD models that have used two chiplets have used the same core counts on each chiplet?
Posted on Reply
#44
dyonoctis
ValenOne: Cinebench R23 doesn't use AVX-512.

Are you using Cinema 4D R25 or Blender 3.x?

-------------------


After 10 minute run, Intel Core i9 13900KS's scores are lower.
I'm using a mix of both, but the point that I was trying to make is that the E-cores are being used for MT on the consumer platform because a 16 P-core i9 wouldn't have been competitive against Ryzen, especially with Intel 7 having to carry Intel until late 2024.
The E-cores are not just marketing; they are literally what allows Intel to stay relevant on the consumer side for people who are not just gaming. Their lacking AVX-512 isn't ideal, but it's either that or let the competition take the performance and efficiency crown across the board.
Posted on Reply
#45
chrcoluk
kondamin: So what is the catch?
Looking at the posts in the thread, lower clocks and no 3d cache.

That is absolutely fine for server type usage.
dyonoctis: I'm using a mix of both, but the point that I was trying to make is that the E-cores are being used for MT on the consumer platform because a 16 P-core i9 wouldn't have been competitive against Ryzen, especially with Intel 7 having to carry Intel until late 2024.
The E-cores are not just marketing; they are literally what allows Intel to stay relevant on the consumer side for people who are not just gaming. Their lacking AVX-512 isn't ideal, but it's either that or let the competition take the performance and efficiency crown across the board.
Yep the e-cores are keeping intel in the game on production type workloads, like software encoding, compressing, and compiling software. So absolutely used to keep multithreading competitive with AMD.

The p-cores keep them ahead on typical consumer use like gaming, office apps, web browsing, media playback.
Posted on Reply
#46
R0H1T
chrcoluk: That is absolutely fine for server type usage.
Absolutely fine for regular desktops as well, I'd rather get a 5GHz chip with 10% less ST performance than 7950x & 50-100% more cores. I bet if they decided to release a full lineup they could wipe Intel clean across lots of segments with their massive price & (MT) performance advantage! The catch for consumers though is that they make less through desktops so they won't concentrate on this for probably at least half a year.
Posted on Reply
#47
TheoneandonlyMrK
ValenOne: Intel's E-Cores have AVX2 via three 128-bit SIMD units, hence they are closer to AMD's Zen 1.x's quad 128-bit SIMD units.

Intel's E-Cores do not have AVX-512.
Didn't know that.
Posted on Reply
#48
Tek-Check
Od1sseas: Intel can pack 4 E-Cores in the same size as 1 P-Core. What about AMD? How many Zen4c cores for one Zen 4 core?
It is true that those are small enough, but the main issue with Atom E-cores is that they do not support hyper-threading or AVX512, which is one of the reasons there was a complete mess with AVX512 on Alder Lake and Raptor Lake CPUs. Hence, Intel nerfed AVX512 and owners cannot benefit from it.

Zen4 c-cores are fully capable cores with smaller L3 cache. c-cores support HT and AVX512 workloads; perfect for cloud.

So, the Sierra Forest CPU next year will have 144 Atom cores: 144C/144T. The Bergamo CPU has 128C/256T. It's a monster chip for cloud computing, trumping both Intel and ARM solutions by several times, while easily slotting into the same socket 6096. Data centre partners will not need to buy new server motherboards either.

Next year, Turin Zen5 c-cores should bring another evolution in design in 16-core chiplets, namely current two 8-core CCX will be unified into 16-core CCX/CCD. If they want to increase core count to 192 c-cores, they will have to change packaging and I/O in order to place additional two chiplets, as there is no space left on current package due to communication pathways. That's why 16-core chiplets on Bergamo are placed apart and not jointly near each other.
Posted on Reply
#49
AnotherReader
chrcoluk: Yep the e-cores are keeping intel in the game on production type workloads, like software encoding, compressing, and compiling software. So absolutely used to keep multithreading competitive with AMD.
As with the P cores, Intel has clocked the E cores too high. Clocking them closer to 3 GHz would make them true E cores: more efficient than P cores. Chips and Cheese found Gracemont to be more efficient than Golden Cove at a variety of tasks if clock speeds were kept in check. Notably, these more efficient clock speeds were lower than Intel's default for the 12900K.

Posted on Reply
#50
R0H1T
The delta doesn't seem to be that much, plus with xC chips launching, their "low power" cores are in a real tough spot. If Intel doesn't score some major wins in the server space they'd probably have to drop E cores for good; I doubt they can afford to develop two (major) uarches the way things are progressing these days. Chiplets & more/less cache is the way forward.
Posted on Reply