Tuesday, December 5th 2023

Intel "Emerald Rapids" Die Configuration Leaks, More Details Appear

Thanks to leaked slides obtained by @InstLatX64, we have more details and some performance estimates for Intel's upcoming 5th Generation Xeon "Emerald Rapids" CPUs, which promise a significant performance leap over their predecessors. Leading the Emerald Rapids family is the top-end SKU, the Xeon 8592+, which features 64 cores and 128 threads, backed by a massive 480 MB L3 cache pool. The upcoming lineup shifts from a 4-tile to a 2-tile design to minimize latency and improve performance. The design uses the "Raptor Cove" P-core architecture and promises up to 40% faster performance than the current 4th Generation "Sapphire Rapids" CPUs in AI applications that use the Intel AMX engine. Each chiplet has 35 cores, three of which are disabled, and each tile carries two DDR5-5600 MT/s memory controllers, each operating two memory channels, which translates into an eight-channel design. There are three PCIe controllers per die, for six in total.
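
As a quick back-of-the-envelope illustration (a sketch based only on the leaked channel count and data rate, assuming the conventional 64-bit width per DDR5 channel, not an official Intel figure), the described layout works out to roughly 358 GB/s of theoretical peak memory bandwidth per socket:

# Illustrative only: 2 tiles x 2 memory controllers x 2 channels = 8 channels
tiles, controllers_per_tile, channels_per_controller = 2, 2, 2
bytes_per_channel = 8            # 64-bit DDR5 channel
transfer_rate_mt_s = 5600        # DDR5-5600

channels = tiles * controllers_per_tile * channels_per_controller
peak_gb_s = channels * bytes_per_channel * transfer_rate_mt_s / 1000
print(channels, round(peak_gb_s, 1))   # 8 channels, ~358.4 GB/s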

The upcoming lineup is also backed by newer protocols and AI accelerators. The Emerald Rapids family now supports Compute Express Link (CXL) Types 1/2/3, in addition to up to 80 PCIe Gen 5 lanes and an enhanced Intel Ultra Path Interconnect (UPI). There are four UPI controllers spread across the two dies. Moreover, features like the four on-die Intel Accelerator Engines, an optimized power mode, and up to 17% higher performance in general-purpose workloads make it look like a big step up from the current generation. Much of this technology is already found on existing Sapphire Rapids SKUs, with the new generation enhancing the AI processing capability further. You can see the die configuration below. The 5th Generation Emerald Rapids lineup is expected to be made official on December 14th, just a few days away.
Sources: European Southern Observatory Presentation, via @InstLatX64 (X/Twitter)

28 Comments on Intel "Emerald Rapids" Die Configuration Leaks, More Details Appear

#1
TumbleGeorge
Hmm, 4 channels of RAM but only up to DDR5-5600. How do you feed 64 cores with that?
Posted on Reply
#2
AleksandarK
News Editor
TumbleGeorgeHmm, 4 channels of RAM but only up to DDR5-5600. How do you feed 64 cores with that?
My understanding is that each controller operates two channels, which gives 8-channel memory in total.
Posted on Reply
#3
unwind-protect
All this silicon going into subunits where I will never use it...
Posted on Reply
#4
TumbleGeorge
AleksandarKMy understanding is that each controller operates two channels each, which gives 8-channel memory in total.
The text says 2 chiplets (or "tiles") × 2 controllers.
Posted on Reply
#5
AleksandarK
News Editor
TumbleGeorgeThe text says 2 chiplets (or "tiles") × 2 controllers.
Fixed
each tile carries two DDR5-5600 MT/s memory controllers, each operating two memory channels, which translates into an eight-channel design
Posted on Reply
#6
TumbleGeorge
AleksandarKFixed
I believe that is a 256-bit bus, equal to the AMD Threadripper 7000 series. Yes, theoretically it's 8 channels, because each DDR5 module has a 2×32-bit inner bus... but... how will it compete with AMD Epyc? Epyc already has 12 channels.
Posted on Reply
#7
Daven
I wonder what went wrong with the four tile SPR configuration that made them drop down to two tiles with more cores.
Posted on Reply
#8
TumbleGeorge
DavenI wonder what went wrong with the four tile SPR configuration that made them drop down to two tiles with more cores.
AleksandarKto minimize latency
Posted on Reply
#9
DavidC1
TumbleGeorgeI believe that is a 256-bit bus, equal to the AMD Threadripper 7000 series. Yes, theoretically it's 8 channels, because each DDR5 module has a 2×32-bit inner bus... but... how will it compete with AMD Epyc? Epyc already has 12 channels.
Sigh.

Emerald Rapids is a drop-in socket replacement for Sapphire Rapids, the current-gen Xeon. Sapphire Rapids is 8-channel. Disregard the DDR5 "channel" term; it just confuses the less knowledgeable.

DDR5 confuses people. Each DIMM is always 64-bit, so eight channels mean 512-bit. They originally called it "dual channel" because back then you only needed one DIMM for maximum performance, and after that you needed two, hence "dual". So dual channel is always 128-bit. Beyond that pseudo-marketing term, refer to bit width and you get away from the DDR5 shenanigans.

I doubt DDR5's "dual" sub-channels really did anything other than try to compensate for the increased latency over DDR4. Every DDR generation has its talking points, but in the end it's only the MT/s metric that matters.

Back to the CPU.

From an engineering standpoint, it's very good work. The turnaround time is very short, they optimized the space taken up by the dies, and they improved perf/W on the same process.

From a product standpoint, it's better than Sapphire Rapids, which isn't saying much. Just like its predecessor, it'll live on Intel selling low and on people who need the accelerators. They got lucky betting on AMX, since big companies are using it for deep-learning acceleration.
Posted on Reply
#10
TumbleGeorge
DavidC1So eight channels mean 512-bit
No. That doesn't explain the two channels per tile/chiplet, or whatever the right terminology is.
Posted on Reply
#11
DavidC1
TumbleGeorgeNo. That doesn't explain the two channels per tile/chiplet, or whatever the right terminology is.
Emerald Rapids has:
-Two tiles
-Each tile has two memory controllers
-Each memory controller is a 2-channel device

2x2x2 = 8

Keep insisting, but you are wrong. EMR is 512-bit, just like the predecessor.

I'm telling you, DDR5 messed up people's minds about what "channels" mean for memory. A channel = 64-bit.
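
To make the counting explicit, here's a tiny illustrative sketch (the sub-channel figure is just the same bus counted in DDR5's 32-bit units, nothing more):

# Same physical bus, counted three ways (illustrative only)
channels = 2 * 2 * 2                 # tiles x controllers x channels per controller = 8 classic 64-bit channels
bus_width_bits = channels * 64       # 512-bit total, same as Sapphire Rapids
ddr5_subchannels = bus_width_bits // 32
print(channels, bus_width_bits, ddr5_subchannels)   # 8 512 16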
Posted on Reply
#12
TumbleGeorge
DavidC12x2x2* = 8
The inner channels in a DDR5 module are 32-bit, so if you sum up your 8 individual channels, the math says 8×32-bit.
Posted on Reply
#13
DavidC1
Intel claims a 40% gain in AI workloads over Sapphire Rapids. The core-count increase is relatively small, so low-level changes such as the reduced tile count must be contributing to it.
TumbleGeorgeThe inner channels in a DDR5 module are 32-bit, so if you sum up your 8 individual channels, the math says 8×32-bit.
No. Do the research, and then think, rather than repeating yourself. Why would Intel cut the memory channels in half for a successor? Why would Intel count DDR5 inner channels when that doesn't matter at all?

I'm talking about you when referring to DDR5 confusing people.
Posted on Reply
#14
TumbleGeorge
DavidC1Intel claims a 40% gain in AI workloads over Sapphire Rapids. The core-count increase is relatively small, so low-level changes such as the reduced tile count must be contributing to it.

No. Do the research, and then think, rather than repeating yourself. Why would Intel cut the memory channels in half for a successor? Why would Intel count DDR5 inner channels when that doesn't matter at all?

I'm talking about you when referring to DDR5 confusing people.
Looks like we'll have to wait for better leaks or official details on what the architecture will be. And why would Intel necessarily act logically from the user's point of view?... They can count on the reduced latency and the increased cache size (cache hit >> cache miss) to mask the fact that they will be selling us a line of processors crippled in terms of RAM. It may be that production is cheaper and they hope for an increased profit margin. How would I know for sure at this point?
Posted on Reply
#15
Crackong
Is the architecture still Golden Cove?

We just got a Xeon W5 server recently and found out that,
under normal TDP settings, it performs just like a second-gen EPYC, which is really, really disappointing.
Posted on Reply
#16
Squared
One RAM channel is always 64 bits, or 2 × 32 bits in the case of DDR5. So 8 channels is 8 × 2 × 32 bits, for 512 bits total.
CrackongIs the architecture still Golden Cove?

We just got a Xeon W5 server recently and found out that,
under normal TDP settings, it performs just like a second-gen EPYC, which is really, really disappointing.
That's surprising to me; Golden Cove in desktop and laptop CPUs usually outperforms Zen 3, or matches it at the same power consumption. Emerald Rapids is "Raptor Cove", which I understand to be exactly like Golden Cove but with more cache.
Posted on Reply
#17
TumbleGeorge
Yes, I read an article about Emerald Rapids from a year ago, and it explains things more understandably. 16 DIMMs per socket.
Posted on Reply
#18
Crackong
SquaredThat's surprising to me; Golden Cove in desktop and laptop CPUs usually outperforms Zen 3, or matches it at the same power consumption.
Our main use case is VM hosting, so only multithreaded performance matters.
Under server TDP limits (225 W), these CPUs just don't have the room to boost themselves to reasonable frequencies.
The cores usually sit below 2 GHz most of the time on the high-core-count models.

(edited for more details)
The model I have on hand is a Xeon w5-2465X with a 240 W TDP.
The comparison was a 3.5-year-old TR 3955WX with a 280 W TDP.
In all-core workloads the Xeon struggled to hold its base frequency of 3.1 GHz, sometimes dipping below 2.8, while the TR stays at its 3.85 GHz all the time.
Those frequency differences cancel out the architectural benefits, and they just performed almost the same.
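
As a rough illustration of how that plays out (the per-clock advantage is a hypothetical figure; only the sustained clocks come from the comparison above):

# Per-core throughput estimate: sustained clock x assumed per-clock advantage
xeon_clock, tr_clock = 3.1, 3.85      # GHz, sustained all-core clocks observed above
ipc_advantage = 1.20                  # assumed ~20% per-clock edge for the newer core
print(round(xeon_clock * ipc_advantage, 2), tr_clock)   # ~3.72 vs 3.85 -> nearly a wash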


In desktop & laptop, manufacturers usually 'cheat' the TDP limit by adding a higher PL2.
This isn't the case in servers.
Server TDP limits are strict.
Posted on Reply
#19
Squared
CrackongOur main use case is VM hosting, so only multithreaded performance matters.
Under server TDP limits (225 W), these CPUs just don't have the room to boost themselves to reasonable frequencies.
The cores usually sit below 2 GHz most of the time.

In desktop & laptop, manufacturers usually 'cheat' the TDP limit by adding a higher PL2.
This isn't the case in servers.
Server TDP limits are strict.
I wonder if Golden Cove is less efficient than Zen 3 at low clock speeds. This wouldn't really ever affect desktops and wouldn't hurt benchmarks on laptops, but it would partly explain the poor battery life of Alder Lake laptops.
Posted on Reply
#20
Crackong
SquaredI wonder if Golden Cove is less efficient than Zen 3 at low clock speeds. This wouldn't really ever affect desktops and wouldn't hurt benchmarks on laptops, but it would partly explain the poor battery life of Alder Lake laptops.
I've updated my post with some comparison details from the hardware on hand; please check.

For my use case it was two workstation CPUs with lower core counts.
I chose them specifically for their relatively higher base clocks... so as to give snappier responses in the VMs.
It isn't that Golden Cove is less efficient than Zen 3 (or Zen 2) at low clock speeds.
It's just that Golden Cove needs more juice to reach those clock speeds in the first place.

So the Zen series will always get more frequency out of a strictly limited power budget, and that frequency outweighs the architectural benefits that Golden Cove has.
Posted on Reply
#21
Squared
Crackong(edited for more details)
The model I have on hand is a Xeon w5-2465X with a 240 W TDP.
The comparison was a 3.5-year-old TR 3955WX with a 280 W TDP.
I'm not sure how the TDP conventions here compare, but you're comparing a TSMC N7 CPU to an Intel 7 CPU, and the Xeon here appears to be performing the same with a 14% lower thermal limit. That's not a bad showing for one "7 nm" CPU against another. But I get that it's disappointing that Intel can't do better 3.5 years later, when AMD can with 4th-generation Epyc.

I think it's quite likely that Emerald Rapids will do better than Sapphire Rapids, but probably not 14% better except maybe in really cache-sensitive workloads.
Posted on Reply
#22
kaamraan
I heard this one has a battle frontier and animated sprites
Posted on Reply
#23
Cipher908
kaamraanI heard this one has a battle frontier and animated sprites
They usually say Emerald is the best one; let's hope that's the case here too.
Posted on Reply
#24
DavidC1
SquaredI wonder if Golden Cove is less efficient than Zen 3 at low clock speeds. This wouldn't really ever affect desktops and wouldn't hurt benchmarks on laptops, but it would partly explain the poor battery life of Alder Lake laptops.
It is true. www.reddit.com/media?url=https://i.redd.it/7fw8a6w4qkj81.jpg

On 22 nm the curve was steeper, so Ivy Bridge lost significantly at higher frequencies. Great on the Atoms, though. A steeper curve = better frequencies at lower power, but it doesn't improve as much when you juice it up.

On 14 nm they changed it up a bit, and that went full steam on the 10 nm processes. You can see Alder Lake beats its AMD counterparts in perf/W at higher power, but not at lower power. Recent parts are even more pronounced in this regard.

Battery life is another matter, though. Alder/Raptor Lake has a difficult time keeping idle power low. It seems it can sometimes, but not as well as its predecessors. Since battery-life workloads are bursty, low idle power is what determines battery life for the most part.

50W chip being on for 1% of the time = 0.5W
1W idle for 99% = 1W
Total = ~1.5W

20W chip being on for 1% of the time = 0.2W
2W idle for 99% = 2W
Total = ~2.2W
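
The same duty-cycle math as a quick sketch (the wattages are the illustrative figures above, not measured values):

# Average power from a simple duty-cycle model (illustrative numbers only)
def avg_power(active_w, idle_w, active_fraction):
    return active_w * active_fraction + idle_w * (1 - active_fraction)

print(avg_power(50, 1, 0.01))   # fast chip, low idle  -> ~1.49 W
print(avg_power(20, 2, 0.01))   # slow chip, high idle -> ~2.18 W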

In theory, Meteor Lake should do better. The LP E-cores will keep bursty tasks off the compute tile and reduce SoC power. The Intel 4 process has a steeper curve, so while it won't do as well at higher power, it'll do quite well on the lower end. Hopefully, whatever low-level changes made Alder/Raptor Lake regress are addressed in Meteor Lake too.
SquaredI'm not sure how the TDP conventions here compare, but you're comparing a TSMC N7 CPU to an Intel 7 CPU, and the Xeon here appears to be performing the same with a 14% lower thermal limit. That's not a bad showing for one "7 nm" CPU against another. But I get that it's disappointing that Intel can't do better 3.5 years later, when AMD can with 4th-generation Epyc.

I think it's quite likely that Emerald Rapids will do better than Sapphire Rapids, but probably not 14% better except maybe in really cache-sensitive workloads.
Emerald Rapids costs more than Sapphire Rapids to produce, because its two tiles total 1490 mm² while Sapphire Rapids uses 1510 mm² across four tiles. According to SemiAnalysis, with a perfect defect density the number of CPUs that can be made per wafer is 34 for EMR vs 37 for SPR. Since EMR uses ~700 mm² dies versus ~400 mm² ones, the gap will likely be even larger in practice, because there is always some defect rate and larger dies have a higher chance of containing defects.
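
Here's a rough sketch of that dies-per-wafer comparison, using the standard geometric approximation for a 300 mm wafer, with the quoted die areas split evenly across tiles and no defect model applied; it lands close to the cited 34 vs 37 figures:

import math

# Classic dies-per-wafer approximation (300 mm wafer, defects not modeled)
def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    r = wafer_diameter_mm / 2
    return math.pi * r**2 / die_area_mm2 - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)

emr_tile = 1490 / 2   # ~745 mm2 per tile, 2 tiles per CPU
spr_tile = 1510 / 4   # ~378 mm2 per tile, 4 tiles per CPU
print(dies_per_wafer(emr_tile) / 2)   # ~35 EMR CPUs per wafer
print(dies_per_wafer(spr_tile) / 4)   # ~38 SPR CPUs per wafer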

Therefore, it's latency that Intel aims to reduce with EMR. With two tiles, there's a lot less data hopping than with four on SPR. Intel claims 17% improved performance/watt. Since it's only two tiles, they can also beef up the bandwidth between the tiles in addition to lowering latency.

www.computerbase.de/2023-09/intel-emerald-rapids-flotter-refresh-ueberall-sichtbar-aber-doch-noch-nicht-da/
CrackongThe model I have on hand is a Xeon w5-2465X with a 240 W TDP.
The comparison was a 3.5-year-old TR 3955WX with a 280 W TDP.
In all-core workloads the Xeon struggled to hold its base frequency of 3.1 GHz, sometimes dipping below 2.8, while the TR stays at its 3.85 GHz all the time.
Those frequency differences cancel out the architectural benefits, and they just performed almost the same.
Laptops can't cheat for too long either, since they're thermally constrained chassis.

Based on your numbers, it's possible that Sierra Forest might equal Sapphire Rapids even on per-thread performance, due to higher clocks.
Posted on Reply
#25
Squared
DavidC1According to SemiAnalysis, with a perfect defect density the number of CPUs that can be made per wafer is 34 for EMR vs 37 for SPR.
Does SemiAnalysis compare the cost of the EMIB dies and packaging? Sapphire Rapids maxes out at 10 EMIB interconnects whereas Emerald Rapids maxes out at 3. That brings the silicon parts down from 14 to 5. I wonder if the reduced EMIB die cost and packaging cost could more than offset the additional CPU die cost.
DavidC1In theory, Meteor Lake should do better. The LP E-cores will keep bursty tasks off the compute tile and reduce SoC power. The Intel 4 process has a steeper curve, so while it won't do as well at higher power, it'll do quite well on the lower end. Hopefully, whatever low-level changes made Alder/Raptor Lake regress are addressed in Meteor Lake too.
I was thinking that what you said about the frequency curves should mean Alder Lake's laptop efficiency problem is solved in Meteor Lake. I would think it'd actually be best if Intel 4 focused on high-frequency efficiency, because near-idle CPU usage will land on cores in the SoC tile built on TSMC N6, and those will be low-frequency optimized (to an extent), so there's not as much need for the CPU tile to be efficient at low frequencies.
Posted on Reply