Friday, December 11th 2020
Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
Today, thanks to ServeTheHome forum member "111alan", we have the first pictures of an alleged Intel Sapphire Rapids Xeon processor. Pictured is what appears to be a dual-die design, similar to the 56-core/112-thread Cascade Lake-AP parts that also spread their cores across two dies. Sapphire Rapids is a 10 nm SuperFin design that allegedly comes in this dual-die configuration as well. To host the processor, the motherboard needs the new LGA4677 socket with 4,677 contact pins. The new socket, along with the new 10 nm Sapphire Rapids Xeon processors, is set for delivery in 2021, when Intel is expected to launch the new processors and their respective platforms.
The processor pictured is clearly a dual-die design, meaning Intel is using its multi-chip packaging technology with EMIB (Embedded Multi-die Interconnect Bridge) to interconnect the two dies. As a reminder, the new 10 nm Sapphire Rapids platform is supposed to bring many new features: a DDR5 memory controller paired with Intel's Data Streaming Accelerator (DSA), the brand-new PCIe 5.0 protocol with a 32 GT/s data transfer rate, and CXL 1.1 support for next-generation accelerators. The exact configuration of this processor is unknown; however, it is an engineering sample running at a modest 2.0 GHz.
Source:
ServeTheHome Forums
83 Comments on Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
Instead of doing the full calculation the "proper" way (like the ones we learn in school):
1) For certain operations (e.g. square roots, divisions) the CPU can use a table lookup to get a close-enough answer and then use an algorithm (like a few iterations of the Newton-Raphson method) to reach the final result with enough significant digits. This is faster (potentially much faster) but requires more silicon. All of it is done in silicon (the tables and the Newton-Raphson steps).
2) There are specific instructions that do a sloppier (but good enough and very, very fast) job at lower precision, e.g. VRCP14PS, which computes approximate reciprocals of packed float values with an error of at most 2^-14, i.e. it inverts each of the 16 floats in a vector with a throughput of 2 clock cycles - that is fast: 8 float reciprocals per clock! Good luck getting that from any other CPU. This is very useful when you do not need very high precision and 2^-14 will do just fine (see the sketch just after this list for how 1) and 2) combine).
3) Various combinations of Taylor polynomial expansions, table lookups and the Newton-Raphson method are used to compute transcendental functions like e^x, log x etc. quickly (depending on the precision you want, different combinations are used) - all in silicon and very fast!
BTW, all of these techniques can be (and are) also used in software by the end programmer, by hand-tuned (assembly-coded) libraries, and sometimes even by the compiler, to improve results that do not have enough precision in the fastest way possible.
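As a rough sketch of how 1) and 2) combine (my own illustration, not from the article: it assumes AVX-512F and a compiler flag like -mavx512f, and the helper name rcp_refined is made up):

#include <immintrin.h>  /* AVX-512F intrinsics */

/* Take the ~2^-14 estimate from VRCP14PS and sharpen it with one
   Newton-Raphson step: x1 = x0*(2 - a*x0) = x0 + x0*(1 - a*x0).
   One step roughly squares the relative error, so the result ends up
   close to full single-precision accuracy while still being much
   cheaper than 16 real divides. */
static inline __m512 rcp_refined(__m512 a)
{
    __m512 x0  = _mm512_rcp14_ps(a);                            /* rough 1/a  */
    __m512 err = _mm512_fnmadd_ps(a, x0, _mm512_set1_ps(1.0f)); /* 1 - a*x0   */
    return _mm512_fmadd_ps(x0, err, x0);                        /* x0 + x0*err */
}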
Modern superscalar CPUs also use other tricks:
Each core keeps a window of in-flight instructions and sees what can be executed in parallel, i.e. sequential instructions can be executed in parallel if
1) there are enough free, useful ports (say there are two free ALU ports), AND
2) they are independent (one instruction does not depend on the result of a previous instruction that has not finished yet).
say
a = c*b;
f = c*e;
d = a*e;
The 3rd instruction (d = a*e) depends on the 1st having finished, BUT the 2nd instruction is independent of the 1st, so the 1st and 2nd can be executed in parallel if there are two free ALU ports.
Reordering:
Take the same example above but swap the last two lines:
a = c*b;
d = a*e;
f = c*e;
If you execute this sequentially, the 2nd instruction has to wait for the 1st to finish, BUT a modern core can see that the 3rd instruction coming after it is independent of 1 and 2, so it will reorder the pipeline and execute 3 ahead of 2 while 2 is waiting. This is called reordering (out-of-order execution) and is a great speed-up. It also helps the programmer, who does not need to think about reordering instructions at the micro level (if they are close together). Compilers also help do this over a bigger window of instructions and can even unroll loops to decouple dependencies, all helping the programmer.
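A small illustration of why breaking dependency chains matters (my own sketch, function names made up; compilers can perform this transformation themselves, e.g. under -ffast-math, since float addition is not strictly associative):

/* One long dependency chain: every add has to wait for the previous one. */
float sum_serial(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent chains: the out-of-order core can keep several
   add/FMA ports busy at once instead of stalling on a single chain.
   (Assumes n is a multiple of 4 to keep the sketch short.) */
float sum_ilp(const float *v, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}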
Loop cache: the core keeps a whole inner loop in a dedicated loop buffer, so it does not have to keep fetching and decoding it from the slower, lower levels of cache. If you can fit the entire inner loop in the loop cache it is extremely fast - I think some implementations of Prime95 manage this with hand-coded assembly. This way the core has full visibility of what is going on end to end and can make very good choices about reordering, predicting branches and keeping all ports as utilized as possible.
There are many more tricks of course, like read ahead, branch prediction etc.
I disagree about the importance of IPC. While SIMD is certainly very important for any heavy workload, most improvements that raise IPC also raise SIMD throughput: larger instruction windows, better prefetching, larger caches, higher cache bandwidth, larger register files, etc. If we are to get even more throughput and potentially more AVX units in the future, all of these things need to continue scaling to keep them fed.
Plus, there is the added bonus of these things helping practically all code, which is why IPC improvements generally improve "everything", make systems more responsive, etc. IPC also helps heavily multithreaded workloads scale even further, so much higher IPC is something everyone should want. Yes, many struggle to understand this.
Multiple cores are better for independent work chunks, so parallelization on a high level.
SIMD is better for repeating logic over parallel data, so parallelization on a low level.
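A minimal sketch of the two levels working together in one loop (my own example: the function name is made up, the OpenMP pragma is just one way to express it, and it assumes a build with something like -fopenmp -O3):

/* "parallel for": independent chunks of the range go to different cores
   (high-level parallelism).
   "simd": each chunk's iterations are packed into SIMD lanes
   (low-level parallelism). */
void scale(float *dst, const float *src, float k, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}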
Many applications use both, like 7-Zip, WinRAR, video encoders, Blender etc., to get the maximum performance. One approach doesn't solve everything optimally. Engineering is mostly about using the right tool for the job, not using one tool for everything.
Yes, the reason Skylake-X added more L2 cache and redesigned the L3 cache is the increased throughput of AVX-512. AVX-512 is a beast, and the cache hierarchy probably still struggles to keep it fed. Well, beyond the slow rollout of AVX-512 from Intel, this probably comes down to the generally slow adoption of all SIMD. It seems like AVX2 is finally getting traction now, right when the focus should be shifting to AVX-512.
Software is always slow at adoption. To my knowledge there is no substantial client software using AVX-512 yet, but once something like Photoshop, Blender, etc. does it, it will suddenly become a must for prosumers. Luckily, once software is vectorized, converting it to a new AVX version isn't hard.
To make things worse, Intel has been leaving AVX/FMA support out of its Celeron and Pentium CPUs, which also affects which ISA features developers will prioritize. I disagree here.
APIs and complete algorithms are generally not useful for driving adoption of AVX in software (beyond some purer math workloads for academic purposes, which is where MKL is often used).
What developers need is something better than the intrinsic macros we have right now: something that lets us write clean, readable C code while getting optimal AVX code after compilation.
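To illustrate the gap (my own sketch, function names made up; assuming gcc/clang with something like -O3 -mavx2 -mfma): the first version is the clean C we would like to write and have auto-vectorized, the second is roughly what hand-written intrinsics look like today.

#include <immintrin.h>

/* Clean C: with restrict, current compilers can usually auto-vectorize this. */
void saxpy_clean(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written AVX2/FMA intrinsics: same loop, far less readable.
   (Tail handling omitted; assumes n is a multiple of 8.) */
void saxpy_intrin(float *y, const float *x, float a, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
    }
}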
The only code I want them to optimize for us would be the C standard library, existing libraries and the OS core, which is something they have kind of already done for the "Intel Clear Linux" distro, but very little of this has been adopted by the respective software projects.
Today's high-performance x86 CPUs are huge superscalar out-of-order machines. You know, Itanium (EPIC) tried to solve this by being an in-order design with explicitly parallel instructions, and failed miserably doing so.
From the looks of it, Sapphire Rapids will continue the current trend and be significantly larger even than Sunny Cove.
Just looking at the instruction window, the trend has been:
Nehalem: 128, Sandy Bridge: 168 (+31%), Haswell: 192 (+14%), Skylake: 224 (+17%), Sunny Cove: 352 (+57%), Golden Cove: 600(?) (+70%)
There will probably be further improvements in micro-op cache, load/stores, decoding and possibly more execution ports, to squeeze out as much parallelization as possible.
No one wants to invest in revalidating production software end to end on AMD hardware, especially when the business processes are so complicated.
A smaller company has simpler business processes and uses less complex software; in fact, it may even use cloud servers instead. A big company? Nope.
To answer whether switching to the AMD platform is worth it, you need to answer several questions:
- How large is the scope involved?
- How much will it cost to revalidate the end-to-end business applications involved?
- How much downtime do we expect during the transition?
- How long is the transition period?
- What is the mitigation plan for issues during the transition?
- How much is the development timeline impacted by adapting to this new requirement?
- What is the transition strategy?
- What are the risks involved?
- How much will we lose during the transition period?
It isn't as simple as "oh, this hardware is cheaper and faster, let's retest all of our software for AMD hardware, done, happy ending." No, not at all: they need to retest everything - the integration of all software, all modules, all databases, all of it. And if we are talking about a non-IT company (bank, FMCG, etc.), they don't want to deal with this.
My entire point was that AMD is increasing market share in a space that is more than 95% controlled by Intel. AMD is in fact gaining market share - a slow and painful process - but they are, even with larger companies that will concede to taking risks on custom, complex software platforms. I mean, for AMD to be up say about 1% in this space is huge. But it's a very slow climb, because AMD can't possibly build enough capacity to feed this industry anyway. Will they ever overtake Intel in this space? Absolutely not, IMO. lol