Friday, December 11th 2020
Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
Today, thanks to ServeTheHome forum member "111alan", we have the first pictures of an alleged Intel Sapphire Rapids Xeon processor. Pictured is what appears to be a dual-die design, similar to the 56-core/112-thread Cascade Lake-AP parts that also spread their cores across two dies. Sapphire Rapids is a 10 nm SuperFin design that allegedly comes in this dual-die configuration as well. To host the processor, the motherboard needs the new LGA4677 socket with 4,677 contact pins. The new socket, along with the new 10 nm Sapphire Rapids Xeon processors, is set for delivery in 2021, when Intel is expected to launch the new processors and their respective platforms.
The processor pictured is clearly a dual-die design, meaning Intel is using its multi-chip packaging technology with EMIB (Embedded Multi-die Interconnect Bridge) to interconnect the two dies. As a reminder, the new 10 nm Sapphire Rapids platform is supposed to bring many new features: a DDR5 memory controller paired with Intel's Data Streaming Accelerator (DSA), the brand-new PCIe 5.0 protocol with a 32 GT/s data transfer rate, and CXL 1.1 support for next-generation accelerators. The exact configuration of this processor is unknown; however, it is an engineering sample running at a modest 2.0 GHz.
Source:
ServeTheHome Forums
83 Comments on Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
Instead of doing the full calculation the "proper" way (like the ones we learn in school):
1) For certain operations (e.g. square roots, divisions) the CPU can use a table lookup to get a close-enough answer and then use an algorithm (like a few iterations of the Newton-Raphson method) to reach the final result with enough significant digits. This is faster (potentially much faster) but requires more silicon. All of it is done in silicon (the tables and the Newton-Raphson steps).
2) There are specific instructions that do a sloppier (but good enough and very, very fast) job at lower precision, e.g. VRCP14PS, which computes approximate reciprocals of packed float values with an error of at most 2^-14, i.e. it inverts each of the 16 floats in a vector with a throughput of 2 clock cycles - that is fast: 8 float reciprocals per clock! Good luck getting that from any other CPU. This is very useful when you do not need very high precision and 2^-14 will do just fine (see the sketch just after this list for how 1) and 2) combine).
3) Various combinations of Taylor polynomial expansions, table lookups and the Newton-Raphson method are used to compute transcendental functions like e^x, log x etc. quickly (depending on the precision you want, different combinations are used) - all in silicon and very fast!
BTW, all of these techniques can be (and are) also used in software by the end programmer, by hand-tuned (assembly-coded) libraries, and sometimes even by the compiler, to improve results that do not have enough precision in the fastest way possible.
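As a rough sketch of how 1) and 2) combine (my own illustration, not from the article: it assumes AVX-512F and a compiler flag like -mavx512f, and the helper name rcp_refined is made up):

#include <immintrin.h>  /* AVX-512F intrinsics */

/* Take the ~2^-14 estimate from VRCP14PS and sharpen it with one
   Newton-Raphson step: x1 = x0*(2 - a*x0) = x0 + x0*(1 - a*x0).
   One step roughly squares the relative error, so the result ends up
   close to full single-precision accuracy while still being much
   cheaper than 16 real divides. */
static inline __m512 rcp_refined(__m512 a)
{
    __m512 x0  = _mm512_rcp14_ps(a);                            /* rough 1/a  */
    __m512 err = _mm512_fnmadd_ps(a, x0, _mm512_set1_ps(1.0f)); /* 1 - a*x0   */
    return _mm512_fmadd_ps(x0, err, x0);                        /* x0 + x0*err */
}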
Modern superscalar CPUs also use other tricks:
Each core keeps a window of in-flight instructions and sees what can be executed in parallel, i.e. sequential instructions can be executed in parallel if
1) there are enough free, useful ports (say there are two free ALU ports), AND
2) they are independent (one instruction does not depend on the result of a previous instruction that has not finished yet).
say
a = c*b;
f = c*e;
d = a*e;
The 3rd instruction (d = a*e) depends on the 1st having finished, BUT the 2nd instruction is independent of the 1st, so the 1st and 2nd can be executed in parallel if there are two free ALU ports.
Reordering:
Take the same example above but swap the last two lines:
a = c*b;
d = a*e;
f = c*e;
If you execute this sequentially, the 2nd instruction has to wait for the 1st to finish, BUT a modern core can see that the 3rd instruction coming after it is independent of 1 and 2, so it will reorder the pipeline and execute 3 ahead of 2 while 2 is waiting. This is called reordering (out-of-order execution) and is a great speed-up. It also helps the programmer, who does not need to think about reordering instructions at the micro level (if they are close together). Compilers also help do this over a bigger window of instructions and can even unroll loops to decouple dependencies, all helping the programmer.
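A small illustration of why breaking dependency chains matters (my own sketch, function names made up; compilers can perform this transformation themselves, e.g. under -ffast-math, since float addition is not strictly associative):

/* One long dependency chain: every add has to wait for the previous one. */
float sum_serial(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent chains: the out-of-order core can keep several
   add/FMA ports busy at once instead of stalling on a single chain.
   (Assumes n is a multiple of 4 to keep the sketch short.) */
float sum_ilp(const float *v, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}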
Loop cache: the core keeps a whole inner loop in a dedicated loop buffer, so it does not have to keep fetching and decoding it from the slower, lower levels of cache. If you can fit the entire inner loop in the loop cache it is extremely fast - I think some implementations of Prime95 manage this with hand-coded assembly. This way the core has full visibility of what is going on end to end and can make very good choices about reordering, predicting branches and keeping all ports as utilized as possible.
There are many more tricks of course, like read ahead, branch prediction etc.
I disagree about the importance of IPC. While SIMD is certainly very important for any heavy workload, most improvements that raise IPC also raise SIMD throughput: larger instruction windows, better prefetching, larger caches, higher cache bandwidth, larger register files, etc. If we are to get even more throughput and potentially more AVX units in the future, all of these things need to continue scaling to keep them fed.
Plus, there is the added bonus of these things helping practically all code, which is why IPC improvements generally improve "everything", make systems more responsive, etc. IPC also helps heavily multithreaded workloads scale even further, so much higher IPC is something everyone should want. Yes, many struggle to understand this.
Multiple cores are better for independent work chunks, so parallelization on a high level.
SIMD is better for repeating logic over parallel data, so parallelization on a low level.
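A minimal sketch of the two levels working together in one loop (my own example: the function name is made up, the OpenMP pragma is just one way to express it, and it assumes a build with something like -fopenmp -O3):

/* "parallel for": independent chunks of the range go to different cores
   (high-level parallelism).
   "simd": each chunk's iterations are packed into SIMD lanes
   (low-level parallelism). */
void scale(float *dst, const float *src, float k, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}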
Many applications use both, like 7-Zip, WinRAR, video encoders, Blender etc., to get the maximum performance. One approach doesn't solve everything optimally. Engineering is mostly about using the right tool for the job, not using one tool for everything.
Yes, the reason Skylake-X added more L2 cache and redesigned the L3 cache is the increased throughput of AVX-512. AVX-512 is a beast, and the cache hierarchy probably still struggles to keep it fed. Well, beyond the slow rollout of AVX-512 from Intel, this probably comes down to the generally slow adoption of all SIMD. It seems like AVX2 is finally getting traction now, right when the focus should be shifting to AVX-512.
Software is always slow at adoption. To my knowledge there is no substantial client software using AVX-512 yet, but once something like Photoshop, Blender, etc. does it, it will suddenly become a must for prosumers. Luckily, once software is vectorized, converting it to a new AVX version isn't hard.
To make things worse, Intel has been leaving AVX/FMA support out of its Celeron and Pentium CPUs, which also affects which ISA features developers will prioritize. I disagree here.
APIs and complete algorithms are generally not useful for driving adoption of AVX in software (beyond some purer math workloads for academic purposes, which is where MKL is often used).
What developers need is something better than the intrinsic macros we have right now: something that lets us write clean, readable C code while getting optimal AVX code after compilation.
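To illustrate the gap (my own sketch, function names made up; assuming gcc/clang with something like -O3 -mavx2 -mfma): the first version is the clean C we would like to write and have auto-vectorized, the second is roughly what hand-written intrinsics look like today.

#include <immintrin.h>

/* Clean C: with restrict, current compilers can usually auto-vectorize this. */
void saxpy_clean(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written AVX2/FMA intrinsics: same loop, far less readable.
   (Tail handling omitted; assumes n is a multiple of 8.) */
void saxpy_intrin(float *y, const float *x, float a, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
    }
}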
The only code I want them to optimize for us would be the C standard library, existing libraries and the OS core, which is something they have kind of already done for the "Intel Clear Linux" distro, but very little of this has been adopted by the respective software projects.
Today's high-performance x86 CPUs are huge superscalar out-of-order machines. You know, Itanium (EPIC) tried to solve this by being an in-order design with explicitly parallel instructions, and failed miserably doing so.
From the looks of it, Sapphire Rapids will continue the current trend and be significantly larger even than Sunny Cove.
Just looking at the instruction window, the trend has been:
Nehalem: 128, Sandy Bridge: 168 (+31%), Haswell: 192 (+14%), Skylake: 224 (+17%), Sunny Cove: 352 (+57%), Golden Cove: 600(?) (+70%)
There will probably be further improvements in micro-op cache, load/stores, decoding and possibly more execution ports, to squeeze out as much parallelization as possible.
No one wants to invest in revalidating production software end to end on AMD hardware, especially when the business processes are so complicated.
A smaller company has simpler business processes and uses less complex software; in fact, it may even use cloud servers instead. A big company? Nope.
To answer whether switching to the AMD platform is worth it, you need to answer several questions:
- How large is the scope involved?
- How much will it cost to revalidate the end-to-end business applications involved?
- How much downtime do we expect during the transition?
- How long is the transition period?
- What is the mitigation plan for issues during the transition?
- How much is the development timeline impacted by adapting to this new requirement?
- What is the transition strategy?
- What are the risks involved?
- How much will we lose during the transition period?
It isn't as simple as "oh, this hardware is cheaper and faster, let's retest all of our software for AMD hardware, done, happy ending." No, not at all: they need to retest everything - the integration of all software, all modules, all databases, all of it. And if we are talking about a non-IT company (bank, FMCG, etc.), they don't want to deal with this.
My entire point was that AMD is increasing market share in a space that is more than 95% controlled by Intel. AMD is in fact gaining market share - a slow and painful process - but they are, even with larger companies that will concede to taking risks on custom, complex software platforms. I mean, for AMD to be up say about 1% in this space is huge. But it's a very slow climb, because AMD can't possibly build enough capacity to feed this industry anyway. Will they ever overtake Intel in this space? Absolutely not, IMO. lol