Wednesday, September 20th 2023

Intel 288 E-core Xeon "Sierra Forest" Out to Eat AMD EPYC Bergamo's Lunch

Intel at its Innovation 2023 event unveiled a 288-core extreme core-count variant of the Xeon "Sierra Forest" processor for high-density, scale-out, cloud-native server environments. It sits above the 144-core model announced earlier. "Sierra Forest" is a server processor based entirely on efficiency cores, or E-cores, using the "Sierra Glen" core microarchitecture, a server-grade derivative of "Crestmont," Intel's second-generation E-core that is making its client debut with "Meteor Lake."

Xeon "Sierra Forest" is a chiplet-based processor, much like "Meteor Lake" and the upcoming "Emerald Rapids" server processor. It features a total of five tiles—two Compute tiles, two I/O tiles, and a base tile (interposer). Each of the two Compute tiles is built on the Intel 3 foundry node, a more advanced node than Intel 4, featuring higher-density libraries, and an undisclosed performance/Watt increase. Each tile has 36 "Sierra Glen" E-core clusters, 108 MB of shared L3 cache, 6-channel (12 sub-channel) DDR5 memory controllers, and Foveros tile-to-tile interfaces.
Each "Sierra Glen" E-core cluster features four CPU cores that share a 4 MB local L2 cache, and a 3 MB segment contributing to the tile's 108 MB L3 cache. Unlike the "Meteor Lake" Compute tile that uses a ringbus to connect its E-core clusters and P-cores, the Compute tile uses a Mesh topology interconnect for the large array of 36 E-core clusters. With 144 cores per tile, in its maximum configuration with three such tiles, "Sierra Forest" achieves 288 cores. "Sierra Glen" lacks SMT, just like "Crestmont," and so the OS only has 288 logical processors to address.
Besides the two Compute tiles, the processor has two I/O tiles. Unlike the similarly named I/O tile of the client "Meteor Lake" processor, the ones on "Sierra Forest" serve the functions of both the SoC tile and the I/O PHY. With the memory controllers located on the Compute tiles, the maximum 288-core variant of "Sierra Forest" features a 12-channel DDR5 memory interface.
The I/O tile is left with the UPI interconnect for 2P servers, application-specific accelerators, a 68-lane PCI-Express Gen 5 root complex whose lanes can be flexibly configured between PCIe Gen 5 and CXL 2.0, and the I/O fabric. Despite being built on an advanced node such as Intel 3, each of the two Compute tiles measures an enormous 578 mm² of die area, while each of the two I/O tiles measures 241 mm².

The up to 12-channel memory interface of "Sierra Forest" comes with native support for ECC DDR5-6400. The accelerators are carried over from the "Granite Rapids" processor, and provide speed-ups for popular cryptography, file-streaming, and data-compression operations.
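For a sense of scale, here is a rough peak-bandwidth estimate for the 288-core configuration; these are theoretical DDR5-6400 numbers, and sustained real-world throughput will be lower.

# Theoretical peak DRAM bandwidth for the 288-core configuration.
# DDR5-6400 across 12 channels of 64 bits (two 32-bit sub-channels each).
CHANNELS = 12
TRANSFER_RATE_MT_S = 6400     # mega-transfers per second
BYTES_PER_TRANSFER = 8        # 64-bit channel width
CORES = 288

per_channel_gb_s = TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000   # 51.2 GB/s
aggregate_gb_s = per_channel_gb_s * CHANNELS                        # 614.4 GB/s
per_core_gb_s = aggregate_gb_s / CORES                              # ~2.1 GB/s

print(f"{aggregate_gb_s:.1f} GB/s aggregate, {per_core_gb_s:.2f} GB/s per core")

That works out to roughly 2 GB/s of theoretical peak bandwidth per core at the full core count, which is why memory bandwidth comes up repeatedly in the discussion below.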

When it arrives in the first half of 2024, Xeon "Sierra Forest" will square off against AMD's EPYC "Bergamo" processor. "Bergamo" is based on a slightly different philosophy than "Sierra Forest": it is a 128-core/256-thread processor built on "Zen 4c" cores, which don't quite qualify as E-cores, since they have the same IPC and ISA as regular "Zen 4" cores and retain SMT.
Source: Tom's Hardware

40 Comments on Intel 288 E-core Xeon "Sierra Forest" Out to Eat AMD EPYC Bergamo's Lunch

#26
AnotherReader
Patriot: www.servethehome.com/intel-announces-288-e-core-sierra-forest-variant-at-innovation-2023/
Never mind, looks like they have a dual-die version with 2x 144 E-cores per socket.
www.intc.com/news-events/press-releases/detail/1648/intel-innovation-2023-empowering-developers-to-bring-ai
The slide referenced at Hot Chips was accurate; Intel was keeping the 288-core dual-die part hidden, and it will be rare and probably very low clocked.

I guess now the question is... since they said >205 W/socket. If 205 W is the default 144-core wattage... how low are the clocks going to be, and how much power are two dies going to need?
So the 288 E cores will have 12 channels of DDR5 to feed them? If that's correct, then this is going to be bottlenecked by memory bandwidth for a lot of workloads, but if the clocks are low enough, then it might not matter.
#27
Blaeza
288 is still a great deal of cores. I'm interested in the Cinebench results... I know it's a stupid thing but I also am stupid. You brainiacs understanding all this can be as technical as you like lol.
#28
Patriot
Blaeza: 288 is still a great deal of cores. I'm interested in the Cinebench results... I know it's a stupid thing but I also am stupid. You brainiacs understanding all this can be as technical as you like lol.
No no I understand lol... I helped a buddy with his twin 64c epyc build and seeing all 256 threads hit load in task manager and hit turbo boost was... very satisfying. That said they were his real world workloads but... just seeing that many threads just go full tilt... mmmmm
#29
trsttte
not_my_real_name: And huge tiles, EMIB, it won't be cheap...
AMD, by using smaller chiplets, will probably have much better margins for a price war. This has already been true in previous EPYC generations, with single-socket EPYC running circles around dual-socket Xeon.
fevgatos: That's your definition of a core?? A non-Atom type? Okay buddy, great definition. I say a real core is a non-Ryzen type.
I thought you were being pedantic but it seems you're being a senseless fanboy instead... cool story

An E-core - as defined and implemented by Intel - has a reduced instruction set compared to P-cores of the same generation, lacks features like SMT, and has lower clocks, smaller dedicated L1 and L2 caches, and lower IPC. It has been mocked as not a real core by enthusiasts because the roll-out of this hybrid architecture has been a complete mess, with it often being more beneficial to disable E-cores altogether in real-world applications.

On the opposite side, AMD's Zen 4c implements the same instructions as the bigger Zen 4, has the same L1 and L2, about the same IPC, and maintains support for SMT; it just loses on max clocks and L3 (though the loss in L3 is because there are double the cores in the same chiplet).

In servers - which already prioritize lower power and more stable clocks, and where even the previous so-called "slow" Ryzen cores were already beating Intel's "regular" cores (before the E-core/P-core distinction) - AMD is set to demolish Intel's solution unless something goes terribly wrong.
#30
Wirko
trsttte: An E-core - as defined and implemented by Intel - has a reduced instruction set compared to P-cores of the same generation, lacks features like SMT, and has lower clocks, smaller dedicated L1 and L2 caches, and lower IPC. It has been mocked as not a real core by enthusiasts because the roll-out of this hybrid architecture has been a complete mess, with it often being more beneficial to disable E-cores altogether in real-world applications.

On the opposite side, AMD's Zen 4c implements the same instructions as the bigger Zen 4, has the same L1 and L2, about the same IPC, and maintains support for SMT; it just loses on max clocks and L3 (though the loss in L3 is because there are double the cores in the same chiplet).

In servers - which already prioritize lower power and more stable clocks, and where even the previous so-called "slow" Ryzen cores were already beating Intel's "regular" cores (before the E-core/P-core distinction) - AMD is set to demolish Intel's solution unless something goes terribly wrong.
Agreed on the hybrid mess (but I'd buy a P+E CPU specifically to see that mess in action, analyse MT performance, and try to get the best out of the cores). However, this news is about servers. E cores could make a lot of sense in servers, in certain applications.

As for performance, a fair comparison would be one between units that run two threads: one Zen 4 vs. one Zen 4c vs. one P-core vs. two E-cores. Two E-cores are much closer, performance-wise and area-wise, to the other three, especially if you keep in mind that SMT drags down the per-thread performance of all of them - except the E-cores.
#31
theouto
E-cores, I recall, were not good for latency and not as fast as a proper core. I expect latency galore with this, and for a server? No clue if it's what you'd want.
#32
trsttte
Wirko: Agreed on the hybrid mess (but I'd buy a P+E CPU specifically to see that mess in action, analyse MT performance, and try to get the best out of the cores). However, this news is about servers. E cores could make a lot of sense in servers, in certain applications.
I was answering a specific question ;)
Wirko: As for performance, a fair comparison would be one between units that run two threads: one Zen 4 vs. one Zen 4c vs. one P-core vs. two E-cores. Two E-cores are much closer, performance-wise and area-wise, to the other three, especially if you keep in mind that SMT drags down the per-thread performance of all of them - except the E-cores.
I think a fair comparison is a product vs. another product, whatever they decide to throw in the ring. If Intel wants to throw 288 single-threaded E-cores with a limited ISA into the ring against 128 Zen 4c cores with 256 threads and the full ISA available, that's their problem, not mine. Naturally we can only speculate at this point, but I don't see Intel winning this fight, not even close, which is the context of this news piece - "Intel Sierra Forest Out to Eat AMD EPYC Bergamo's Lunch"... hmm, I doubt it.
#33
Wirko
trsttte: I think a fair comparison is a product vs. another product, whatever they decide to throw in the ring. If Intel wants to throw 288 single-threaded E-cores with a limited ISA into the ring against 128 Zen 4c cores with 256 threads and the full ISA available, that's their problem, not mine. Naturally we can only speculate at this point, but I don't see Intel winning this fight, not even close, which is the context of this news piece - "Intel Sierra Forest Out to Eat AMD EPYC Bergamo's Lunch"... hmm, I doubt it.
Sure, what matters is the product, but we are the TPU and have an OCD on technical details such as the square root of XYZ buffer half-life times bandwidth. If that product can find or create its own market niche then fine, and if that niche is 3% of the server market, it's not automatically a failed product.
Back to technical details, I just don't think that lack of SMT is a deficiency here. Just look at how small the E core is.
#34
AnotherReader
Wirko: Sure, what matters is the product, but we are the TPU and have an OCD on technical details such as the square root of XYZ buffer half-life times bandwidth. If that product can find or create its own market niche then fine, and if that niche is 3% of the server market, it's not automatically a failed product.
Back to technical details, I just don't think that lack of SMT is a deficiency here. Just look at how small the E core is.
SMT would improve performance for many low ILP workloads, but eliminating it decreases the attack surface.
#35
trsttte
Wirko: I just don't think that lack of SMT is a deficiency here. Just look at how small the E core is.
I'd agree; the only problem is that the competition has that and more. We'll need to wait for more details, but previous benchmarks have shown E-cores to not be particularly efficient for the amount of work they're able to do. Sum it all up against a more efficient and full-featured Zen 4c (it's not just SMT, it's also AVX-512 and more cache, for example) and I don't see it ending well for Intel.
#36
Wirko
I still believe Intel didn't fail at designing the efficient core; they just factory overclocked it. Here are some numbers for comparison: the Raptor Lake desktop E-core goes ridiculously far, to 4.7 GHz, and to 3.9 GHz in notebook chips. At the other end, the Sapphire Rapids server P-core only goes up to 4.2 GHz turbo. The Bergamo Zen 4c core reaches 3.1 GHz. I expect the Sierra Forest E-core to reach somewhere between 3.1 and 3.4 GHz, and operate in a truly efficient manner. But it remains to be proven.

There are other possible bottlenecks in the architecture, memory bandwidth primarily. 288 / 12 = 24 cores per (64-bit) memory channel ... uh-huh. That MCR multiplexing scheme is quickly becoming a necessity.
#37
AnotherReader
Wirko: I still believe Intel didn't fail at designing the efficient core; they just factory overclocked it. Here are some numbers for comparison: the Raptor Lake desktop E-core goes ridiculously far, to 4.7 GHz, and to 3.9 GHz in notebook chips. At the other end, the Sapphire Rapids server P-core only goes up to 4.2 GHz turbo. The Bergamo Zen 4c core reaches 3.1 GHz. I expect the Sierra Forest E-core to reach somewhere between 3.1 and 3.4 GHz, and operate in a truly efficient manner. But it remains to be proven.

There are other possible bottlenecks in the architecture, memory bandwidth primarily. 288 / 12 = 24 cores per (64-bit) memory channel ... uh-huh. That MCR multiplexing scheme is quickly becoming a necessity.
Desktop SKUs like the 12700K clock the E-cores way out of their efficiency sweet spot. Chips and Cheese found Gracemont to be more efficient than Golden Cove at integer code when running below 3.1 GHz.
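A toy dynamic-power model shows why such a crossover exists. The voltage/frequency curve below is purely illustrative (made-up numbers, not measured Gracemont data), but it captures the general behavior: power grows roughly with V² × f, and V has to climb with frequency.

# Toy model: dynamic power ~ V^2 * f, with voltage rising with frequency.
# The V/f points are illustrative assumptions, not real Gracemont figures.

def voltage(freq_ghz):
    # Hypothetical linear V/f curve: 0.75 V at 1.0 GHz up to 1.25 V at 4.7 GHz.
    return 0.75 + (freq_ghz - 1.0) * (1.25 - 0.75) / (4.7 - 1.0)

def relative_power(freq_ghz):
    return voltage(freq_ghz) ** 2 * freq_ghz

for f in (2.5, 3.0, 3.9, 4.7):
    perf_per_watt = f / relative_power(f)   # assume performance scales with frequency
    print(f"{f:.1f} GHz: relative power {relative_power(f):.2f}, relative perf/W {perf_per_watt:.2f}")

Under any curve of that shape, the same core delivers noticeably more work per watt at around 3 GHz than at 4.7 GHz, which is the whole premise of a low-clocked, many-core E-core server part.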

#38
DavidC1
qcmadness: The total multi-thread performance of this 288-core one is probably lower than that of the 128-core EPYC.
Sierra Forest is going to be much better than some of you believe. In fact, the 144-core version should be competitive with the 128-core Bergamo in performance/watt, partly because Bergamo only has a top turbo of 3.1 GHz.

Whether it's lower performance and quite a bit lower power or similar power levels and similar performance, that's yet to be determined.

Intel themselves claim 2.4x performance/watt over Sapphire Rapids. Since it has 2.4x as many cores as the 60-core SPR, and SPR has the advantage of Hyper-Threading, which is responsible for a 20-30% gain, that's pretty impressive, especially considering the Golden Cove core in SPR is 40-50% faster than the Sierra Glen (server-grade Gracemont) E-cores in SRF.

Because of that, 2.4x the cores, if the two CPUs ran at the same clocks, would result in only maybe a 30% gain, yet SRF claims 2.4x perf/watt. This leaves a few possibilities:
-SRF at 205 W, with 144 cores performing 30-40% higher than the 350 W, 60-core SPR
-Sierra Forest at 270 W, but clocked 40% higher than SPR and nearly 90% faster than it; essentially, the clock increase makes up for the architectural differences (4.2 GHz all-core versus 2.9 GHz all-core)

According to SPEC CPU tests, the 40-50% advantage Golden Cove and Zen 4 have over Gracemont splits into 20-25% in integer and 60-65% in floating point. Golden Cove is a few percent (low single digits) faster than Zen 4, by the way.

Since Bergamo and Sierra Forest are aimed at cloud workloads, and most non-HPC server work is integer anyway, Gracemont may be far more competitive there than on PCs. Then Sierra Forest would only need 3.6 GHz to perform 90% faster than Sapphire Rapids while using only 270 W. There's even a possibility they could clock SRF all the way to 4.5 GHz, so that 2.4x perf/watt also ends up being 2.4x the performance, but at 350 W TDP.

So now the real deal. Let's assume a 350 W Sierra Forest at 4.5 GHz. Since Bergamo is also 350 W but peaks at 3.1 GHz, the reality is that its all-core turbo is probably about 3 GHz. This means at the end of the day it's a core-count battle, and SRF has a slight edge at 144 cores versus 128.
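A quick sketch of the arithmetic above; the per-core-speed and SMT factors are this post's estimates, not measured numbers.

# Rough throughput model: cores x per-core speed, with SPR getting an SMT bonus.
# All scaling factors below are estimates from this post, not benchmarks.

SPR_CORES, SRF_CORES = 60, 144
SPR_PER_CORE_ADVANTAGE = 1.45   # Golden Cove ~40-50% faster per clock than Gracemont
SPR_SMT_BONUS = 1.25            # Hyper-Threading worth ~20-30%

def srf_vs_spr(clock_ratio):
    """Relative SRF throughput vs. SPR for a given SRF-to-SPR clock ratio."""
    srf = SRF_CORES * clock_ratio
    spr = SPR_CORES * SPR_PER_CORE_ADVANTAGE * SPR_SMT_BONUS
    return srf / spr

print(f"Equal clocks:     {srf_vs_spr(1.0):.2f}x SPR")   # ~1.32x, i.e. ~30% faster
print(f"SRF clocked 1.4x: {srf_vs_spr(1.4):.2f}x SPR")   # ~1.85x, i.e. ~85-90% faster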

But the "real competitor is Turin Dense" you say. You are right, maybe. According to earlier leaked roadmaps, Bergamo was supposed to be very early Q1 of this year, like Dec-Jan. Instead, it came out June of this year. The same roadmap has Turin Dense firmly at Q2 of next year. Best case scenario is that Turin Dense comes a month after SRF, the worst case scenario is that it comes 5-6 months later, meaning some sort of a leapfrog. Hence the existence of 288-core SRF. Looks like Intel wants to be at minimum, competitive in the worst case.

Let's analyze the 288-core SRF vs. the 192-core Turin Dense.

Turin Dense:
-192 cores (1.5x)
-Zen 5 (let's say 1.2x per core)
-500 W
-~80% faster than Bergamo

288-core SRF: I speculate roughly 40 W of the 350 W is taken up by the I/O tiles, leaving 310 W for compute. Assuming the 144-core SRF runs at 4.5 GHz, then with minimal voltage reductions we can get a 3.6 GHz 288-core SRF at 500 W, or 3.4 GHz without touching the voltage.
-3.6 GHz: ~60% faster than the 144-core Sierra Forest
-3.4 GHz: ~50% faster than the 144-core Sierra Forest

Since the assumption is that the 144-core SRF is a wee bit faster than Bergamo, it looks like the 288-core part will be competitive. You can see how 5% here and 5% there will swing the favor to either party, but it's nothing like the bloodbath between Genoa and Sapphire Rapids.
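The same kind of napkin math for the 288-core SRF vs. Turin Dense match-up, normalized to Bergamo = 1; every factor here is a guess from this post, not a spec.

# Napkin math: 288-core Sierra Forest vs. 192-core Turin Dense, Bergamo = 1.0.
# All scaling factors are guesses from this post.

BERGAMO = 1.0

# Turin Dense: 1.5x Bergamo's cores, times an assumed 1.2x per-core Zen 5 uplift.
turin_dense = BERGAMO * (192 / 128) * 1.2             # ~1.8x Bergamo

# Assume the 144-core SRF at 4.5 GHz roughly matches Bergamo; the 288-core part
# doubles the cores but drops clocks to fit the power budget.
srf_144 = BERGAMO
for clock_ghz in (3.6, 3.4):
    srf_288 = srf_144 * 2 * (clock_ghz / 4.5)
    print(f"288-core SRF @ {clock_ghz} GHz: ~{srf_288:.2f}x Bergamo "
          f"(vs. Turin Dense ~{turin_dense:.2f}x)")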
#39
unwind-protect
How much is a Windows license for a 288-core system, anyway?

(yes I know they will usually run Linux but I'm curious)
#40
TumbleGeorge
unwind-protect: How much is a Windows license for a 288-core system, anyway?

(yes I know they will usually run Linux but I'm curious)

Unlimited OSEs: $110,880; limited to 2 OSEs: $19,296. The calculator is unofficial!
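Those figures are consistent with plain per-core licensing. A rough sketch, assuming approximate Windows Server 2022 list prices of about $6,155 per 16-core Datacenter base license and $1,069 per 16-core Standard base license (retail figures vary; both are assumptions here), with licenses sold in 2-core packs and a 16-core minimum per server:

# Rough per-core Windows Server licensing estimate for a 288-core box.
# The 16-core base prices below are approximate US list prices (assumptions);
# licenses are sold in 2-core packs with a 16-core minimum per server.
import math

PRICE_PER_16_CORES = {
    "Datacenter (unlimited OSEs)": 6155.0,
    "Standard (2 OSEs)": 1069.0,
}

def license_cost(cores, price_16_core):
    billable_cores = max(cores, 16)            # 16-core minimum
    packs = math.ceil(billable_cores / 2)      # sold in 2-core packs
    return packs * (price_16_core / 8)         # eight 2-core packs per base price

for edition, price in PRICE_PER_16_CORES.items():
    print(f"{edition}: ${license_cost(288, price):,.0f}")
# -> roughly $110.8k for Datacenter and $19.2k for Standard, in line with the quote above.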