Tuesday, December 5th 2023
Intel "Emerald Rapids" Die Configuration Leaks, More Details Appear
Thanks to leaked slides obtained by @InstLatX64, we have more details and some performance estimates for Intel's upcoming 5th Generation Xeon "Emerald Rapids" CPUs, which promise a significant performance leap over their predecessors. Leading the Emerald Rapids family is the top-end SKU, the Xeon 8592+, which features 64 cores and 128 threads, backed by a massive 480 MB L3 cache pool. The upcoming lineup shifts from a 4-tile to a 2-tile design to minimize latency and improve performance. The design uses the "Raptor Cove" P-core architecture and promises up to 40% faster performance than the current 4th Generation "Sapphire Rapids" CPUs in AI applications that utilize the Intel AMX engine. Each tile carries 35 cores, three of which are disabled, along with two DDR5-5600 MT/s memory controllers, each driving two memory channels, for an eight-channel design overall. There are three PCIe controllers per die, making six in total.
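Purely as a sanity check on those figures, here is a minimal Python tally of the per-tile numbers quoted in the leak (the constants simply restate the slide's claims; they are not official Intel specifications):

```python
# Rough tally of the leaked "Emerald Rapids" configuration.
# All numbers come from the leaked slides quoted above, not from Intel.

TILES = 2
CORES_PER_TILE = 35            # physical cores per tile per the slide
DISABLED_PER_TILE = 3          # fused off, leaving 32 active per tile
MEM_CONTROLLERS_PER_TILE = 2   # DDR5-5600 controllers per tile
CHANNELS_PER_CONTROLLER = 2
PCIE_CONTROLLERS_PER_TILE = 3

active_cores = TILES * (CORES_PER_TILE - DISABLED_PER_TILE)
threads = active_cores * 2     # with Hyper-Threading
ddr5_channels = TILES * MEM_CONTROLLERS_PER_TILE * CHANNELS_PER_CONTROLLER
pcie_controllers = TILES * PCIE_CONTROLLERS_PER_TILE

print(active_cores, threads, ddr5_channels, pcie_controllers)
# -> 64 128 8 6, matching the Xeon 8592+ description above
```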
The upcoming lineup is also backed by newer protocols and AI accelerators. The Emerald Rapids family now supports Compute Express Link (CXL) Types 1/2/3, in addition to up to 80 PCIe Gen 5 lanes and an enhanced Intel Ultra Path Interconnect (UPI), with four UPI controllers spread over the two dies. Moreover, features like the four on-die Intel Accelerator Engines, an optimized power mode, and up to 17% improvement in general-purpose workloads make it look like a big step up from the current generation. Much of this technology is already found on existing Sapphire Rapids SKUs, with the new generation further enhancing the AI processing capability. You can see the die configuration below. The 5th Generation Emerald Rapids lineup is expected to become official on December 14th, just a few days away.
Sources:
European Southern Observatory Presentation, via @InstLatX64 (X/Twitter)
28 Comments on Intel "Emerald Rapids" Die Configuration Leaks, More Details Appear
Emerald Rapids is a drop-in-socket replacement for Sapphire Rapids, the current-gen Xeon. Sapphire Rapids is 8-channel. Disregard the DDR5 "channel" term; it just confuses the less knowledgeable.
DDR5 confuses people. Each DIMM is always 64 bits wide, so eight channels means 512-bit. They called it "dual channel" originally because before that, one DIMM was enough for maximum performance, and afterwards you needed two, hence "dual." So dual channel is always 128-bit. Set that pseudo-marketing term aside and refer to bus width in bits instead, and you get away from the DDR5 shenanigans.
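A trivial sketch of that arithmetic, taking the commenter's 64-bit-per-channel convention at face value:

```python
# Channel count vs. aggregate memory bus width, per the comment above.
# Each DDR5 DIMM still presents 64 data bits in total (internally split
# into two 32-bit sub-channels, which is where the naming confusion starts).

BITS_PER_CHANNEL = 64

def bus_width(channels: int) -> int:
    """Aggregate memory data bus width in bits for a given channel count."""
    return channels * BITS_PER_CHANNEL

print(bus_width(2))   # "dual channel" desktop -> 128-bit
print(bus_width(8))   # Sapphire/Emerald Rapids -> 512-bit
```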
I doubt DDR5's "dual" sub-channels really did anything other than try to compensate for the increased latency over DDR4. Every DDR generation comes with its own talking points, but in the end it's only the MT/s metric that matters.
Back to the CPU.
From an engineering standpoint, it's very good work. The turnaround time is very short, they optimized the space taken up by the dies, and they improved perf/W on the same process.
From a product standpoint, it's better than Sapphire Rapids, which isn't saying much. Just like its predecessor, it'll live on Intel selling low and on people who need accelerators. They got lucky betting on AMX, since big companies are using it for deep learning acceleration.
-Two tiles
-Each tile has two memory controllers
-Each controller is a 2-channel device
2x2x2 = 8
Keep insisting, but you are wrong. EMR is 512-bit, just like the predecessor.
I'm telling you, DDR5 messed up people's minds on what "channels" mean for memory. Channels = 64-bit.
I'm talking about you when referring to DDR5 confusing people.
We recently got a Xeon W5 server and found out that
under normal TDP settings, it performs just like a second-gen EPYC, which is really, really disappointing.
So only multithreaded performance matters.
Under server TDP limits (225 W) these CPUs just don't have the room to boost to reasonable frequencies.
On the high core count models, the cores sit below 2 GHz most of the time.
(edited for more details)
The model I have on hand is a Xeon w5-2465X with a 240 W TDP.
The comparison was a 3.5-year-old Threadripper 3955WX with a 280 W TDP.
In all-core workloads the Xeon struggled to hold its 3.1 GHz base frequency, sometimes dipping below 2.8 GHz,
while the TR stayed at its 3.85 GHz the whole time.
Those frequency differences offset the architectural advantages, and the two performed almost the same.
On desktops and laptops, manufacturers usually 'cheat' the TDP limit by adding a higher PL2.
That isn't the case with servers.
Server TDP limits are strict.
For my use case it was two workstation CPUs with lower core counts.
I chose them specifically for their relatively higher base clocks, to give snappier responses in VMs.
It isn't that Golden Cove is less efficient than Zen 3 (or Zen 2) at low clock speeds.
It's that Golden Cove needs more juice to reach those clock speeds in the first place.
So the Zen parts will always extract more frequency out of a strictly limited power budget, and that frequency outweighs the architectural advantages Golden Cove has.
I think it's quite likely that Emerald Rapids will do better than Sapphire Rapids, but probably not 14% better except maybe in really cache-sensitive workloads.
On 22 nm the curve was steeper, so Ivy Bridge lost significantly at higher frequencies. It was great for the Atoms, though. A steeper curve means better frequencies at lower power, but less improvement when you juice it up.
On 14 nm they changed it up a bit, and that approach went full steam on the 10 nm processes. You can see Alder Lake beating its AMD counterparts in perf/W at higher power, but not at lower power. Recent parts are even more pronounced in this regard.
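To illustrate what "steeper curve" means here, a toy model with completely made-up numbers (these curves have nothing to do with real Intel or AMD process data; they only show the shape of the argument):

```python
# Toy illustration of the "steep vs. shallow V/F curve" point above.
# A steep curve saturates early: great frequency at low power, little
# extra frequency when you pour more watts in. Numbers are invented.
import math

def f_steep(power_w: float) -> float:
    """Hypothetical 'steep' process: frequency in GHz vs. package power."""
    return 4.0 * (1 - math.exp(-power_w / 15))

def f_shallow(power_w: float) -> float:
    """Hypothetical 'shallow' process: keeps scaling at high power."""
    return 6.0 * (1 - math.exp(-power_w / 60))

for p in (10, 35, 65, 125, 250):
    print(f"{p:>4} W  steep={f_steep(p):.2f} GHz  shallow={f_shallow(p):.2f} GHz")
# The steep curve wins at 10-35 W; the shallow one pulls ahead past ~65 W.
```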
Battery life is another matter, though. Alder/Raptor Lake has a hard time keeping idle power low. It manages sometimes, but not as well as its predecessors. Since battery-life workloads are bursty, low idle power is what determines battery life for the most part.
50W chip being on for 1% of the time = 0.5W
1W idle for 99% = 1W
Total = ~1.5W
20W chip being on for 1% of the time = 0.2W
2W idle for 99% = 2W
Total = ~2.2W
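The same duty-cycle arithmetic, written out as a tiny Python helper using the illustrative numbers above:

```python
# Duty-cycle arithmetic from the example above: average power is the
# time-weighted sum of active and idle power.

def avg_power(active_w: float, idle_w: float, active_fraction: float) -> float:
    """Average package power for a workload active a given fraction of the time."""
    return active_w * active_fraction + idle_w * (1 - active_fraction)

print(avg_power(50, 1, 0.01))   # ~1.49 W: faster chip, low idle power
print(avg_power(20, 2, 0.01))   # ~2.18 W: slower chip, higher idle power
# Even though the second chip burns far less while active, its higher idle
# power dominates because the workload is idle 99% of the time.
```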
In theory, Meteor Lake should do better. The LP E-cores will take bursty tasks off the compute tile and reduce SoC power. The Intel 4 process has a steeper curve, so while it won't do as well at higher power, it'll do quite well at the lower end. Hopefully, whatever low-level changes in Alder/Raptor Lake caused the regression are addressed in Meteor Lake too.
Emerald Rapids costs more than Sapphire Rapids to produce: its two tiles total 1490 mm² versus Sapphire Rapids' 1510 mm² over four tiles, but the individual dies are much bigger, roughly 700 mm² versus 400 mm². According to Semianalysis, even at a perfect defect density, only 34 EMR CPUs can be made per wafer versus 37 for SPR, and in practice the gap will likely be wider still, since larger dies have a higher chance of catching a defect.
Therefore, it's latency that Intel aims to reduce with EMR. With two tiles, there's a lot less data hopping around than on SPR's four. Intel claims a 17% improvement in performance per watt. And since it's only two tiles, they can also beef up the bandwidth between the tiles in addition to lowering latency.
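As a rough cross-check of those dies-per-wafer figures, here is a sketch using the common dies-per-wafer approximation; the per-tile areas are assumptions back-calculated from the totals in the comment, and defects are ignored:

```python
# Back-of-the-envelope check of the Semianalysis figures quoted above,
# using the usual approximation:
#   dies_per_wafer ~= pi*(d/2)^2 / A  -  pi*d / sqrt(2*A)
# Per-tile areas are derived from the comment (1490 mm^2 over two EMR
# tiles, 1510 mm^2 over four SPR tiles); defect rates are ignored.
import math

WAFER_D = 300.0  # mm, standard wafer diameter

def dies_per_wafer(die_area_mm2: float) -> float:
    return (math.pi * (WAFER_D / 2) ** 2 / die_area_mm2
            - math.pi * WAFER_D / math.sqrt(2 * die_area_mm2))

emr_tile = 1490 / 2   # ~745 mm^2 per tile
spr_tile = 1510 / 4   # ~378 mm^2 per tile

print(dies_per_wafer(emr_tile) / 2)   # ~35 two-tile EMR packages per wafer
print(dies_per_wafer(spr_tile) / 4)   # ~38 four-tile SPR packages per wafer
# Same ballpark as the quoted 34 vs. 37; real yields widen the gap because
# larger dies are hit harder by defects.
```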
www.computerbase.de/2023-09/intel-emerald-rapids-flotter-refresh-ueberall-sichtbar-aber-doch-noch-nicht-da/
Laptops can't cheat for too long, as the chassis is thermally constrained.
Based on your numbers, it's possible that Sierra Forest might equal Sapphire Rapids even on per thread performance due to higher clocks.