Friday, December 11th 2020
Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
Today, thanks to ServeTheHome forum member "111alan", we have the first pictures of an alleged Intel Sapphire Rapids Xeon processor. Pictured is what appears to be a dual-die design, similar to the 56-core, 112-thread Cascade Lake-AP processors that also used two dies. Sapphire Rapids is built on Intel's 10 nm SuperFin node and, as these pictures suggest, will come in a dual-die configuration as well. To host the processor, a motherboard needs the new LGA-4677 socket with its 4,677 pins. The socket, along with the new 10 nm Sapphire Rapids Xeon processors, is set for delivery in 2021, when Intel is expected to launch the new processors and their respective platforms.
The processor pictured is clearly a dual-die design, meaning that Intel used its multi-chip package (MCM) technology, with EMIB (Embedded Multi-die Interconnect Bridge) silicon linking the two dies inside the package. As a reminder, the new 10 nm Sapphire Rapids platform is supposed to bring many new features: a DDR5 memory controller paired with Intel's Data Streaming Accelerator (DSA), the brand-new PCIe 5.0 protocol with a 32 GT/s data transfer rate, and CXL 1.1 support for next-generation accelerators. The exact configuration of this processor is unknown; however, it is an engineering sample with a modest clock frequency of 2.0 GHz.
Source:
ServeTheHome Forums
83 Comments on Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased
It's also good that Intel is finally getting its 10 nm CPUs out.
When that happens, and it's soon, we can realistically compare and see how 'good' AMD's 7 nm Ryzen really is.
It's just pointless to compare a 14 nm CPU against a 7 nm CPU... though sure, AMD wants it that way.
I think we'll see the final battle next year when 10 nm Alder Lake arrives; it will show where the CPU world is heading.
But before that, Intel's March 2021 release, Rocket Lake, is the answer to AMD's Vermeer CPUs. It's still 14 nm, but the last of its kind,
and it should beat Vermeer easily, I mean in gaming performance.
But then, in June 2021, comes Alder Lake, built on 10 nm and a hybrid design,
so it will finally show whether Intel can step back to the top of the CPU world.
Still, I wonder how much it matters when Zen 3 is so hard to find while Comet Lake is in plentiful supply. It's really hard to tell what quantities are shipping when it's sold out everywhere. But I saw one shop estimating the next batch in April, so it makes me wonder what's going on…
wccftech.com/amd-3rd-gen-epyc-milan-cpu-specs-benchmarks-leak-out-up-to-64-cores-280w/
I see some discussion about AVX-512, but that's not the "new thing" with Sapphire Rapids; AMX is!
In short, AMX is the Advanced Matrix Extensions. It basically gives you 8 KB of new register file space in the form of 8 separate, dynamically size-configurable, two-dimensional matrix registers (tmm0-tmm7), compared to AVX-512's 32 separate one-dimensional 64-byte vector registers (zmm0-zmm31) = 2 KB. That's 4x the register file space!
Those of us who actually work with and use AVX-512 in programming can see speedups of 2x to 4x (I've seen even higher in special cases) compared to AVX2. It does require a good grasp of linear algebra, multivariate calculus, and SIMD algorithms, because today's compilers have a really hard time automatically translating high-level programming languages into high-performing vectorized AVX-512 machine code.
To get the value out of AVX-512, the designer/programmer needs to vectorize the algorithms, not just write sequential programs as usual and hope that the compiler somehow figures it out (it won't). This requires theoretical knowledge of math and computer science, but if you learn how to do it right, the reward is great!
An 18-core consumer Skylake/Cascade Lake part like the 10980XE has 36 FMAs (2 per core, on ports 0 and 5), and if you know what you are doing you can execute 36 × 16 = 576 single-precision multiply+add FLOPs (or 32-bit integer ops) every CPU cycle (throughput). That comes in very handy if you know what you are doing and are comfortable vectorizing your algorithms.
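To make the FMA point concrete, here is a minimal sketch of the kind of inner loop that keeps those units busy, written with AVX-512 intrinsics (the function name and the alignment/multiple-of-16 assumptions are mine, for illustration):

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] += a[i] * b[i]: one _mm512_fmadd_ps does 16 multiplies and
// 16 adds in a single instruction. Assumes n is a multiple of 16 and
// the pointers are 64-byte aligned; compile with -mavx512f.
void fused_multiply_add(const float* a, const float* b, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_load_ps(a + i);
        __m512 vb = _mm512_load_ps(b + i);
        __m512 vy = _mm512_load_ps(y + i);
        vy = _mm512_fmadd_ps(va, vb, vy);  // 16-wide fused multiply-add
        _mm512_store_ps(y + i, vy);
    }
}
```

With two FMA ports per core, two independent chains of these instructions can retire per cycle per core, which is where the 576 FLOPs/cycle figure comes from.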
So moving to AMX will require even more knowledge to really capitalize on and make full use of the extra register capacity and instruction set.
A few points about the claim that AVX-512 and AMX are "only" for AI and neural nets: it is true that NN and AI benefit greatly from AVX-512, not only because they are natural to vectorize and map onto the AVX-512 registers (zmm0-zmm31), but also because AVX-512 provides versatility, generality, and richness in the instruction set you can use on the zmm## registers, and it integrates seamlessly with the rest of your x64 code.
However, that is far from the only application for AVX-512. Again, if you know what you are doing as an architect/designer/programmer and spend a bit of effort vectorizing the algorithms and the code, you can get similar benefits in the inner loops of other classes of problems/applications (image processing, sound processing, etc.), anything really that involves processing lots of data by applying floating-point or integer calculations in some sort of uniform pattern (loops). This is an advantage AVX-512-based general CPUs have over specialized GPUs (fewer but more general flops/iops vs. more but specialized flops/iops).
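As one hypothetical example of such a non-AI inner loop (function name and the multiple-of-64 assumption are mine): a saturating brightness adjustment on 8-bit pixels, processing 64 pixels per instruction with AVX-512BW.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Brighten an 8-bit grayscale image with saturation, 64 pixels at a time.
// Assumes n is a multiple of 64; compile with -mavx512bw.
void brighten(std::uint8_t* px, std::size_t n, std::uint8_t amount) {
    const __m512i add = _mm512_set1_epi8(static_cast<char>(amount));
    for (std::size_t i = 0; i < n; i += 64) {
        __m512i v = _mm512_loadu_si512(px + i);
        v = _mm512_adds_epu8(v, add);       // unsigned saturating add, no overflow wrap
        _mm512_storeu_si512(px + i, v);
    }
}
```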
A lot of weight is put on the 10-20% IPC improvement on the same code (gen over gen, AMD or Intel, it doesn't matter), which is nice, but if you know what you are doing and vectorize your algorithms you can get 200%-400% improvements with AVX-512 today, and maybe another ???% with AMX. We'll see... I have high hopes... AMX does look great on paper. See the Intel® Architecture Instruction Set Extensions Programming Reference.
Other big companies that aren't tech companies don't have that luxury. They have already paid an IT vendor to write custom software that runs on Xeon hardware: think McDonald's, Unilever, Coca-Cola, P&G, Nestlé, the banking industry, and so on. Commissioning new software for a whole new architecture, and even a new hardware vendor, is not cheap. The applications themselves aren't centralized either, so the transition to a new system brings possible downtime, lost sales, etc., not to mention they have to retest all of their smaller applications that interface with the ones running on the server.
These companies don't make their own software; they would rather pay the Intel tax and pay for additional, redundant security measures than risk downtime during a transition, or unforeseen future downtime from new software on a new architecture disrupting business processes or losing sales. For a company of that scale, even one day of lost sales can equal tons of money, so they'd rather not take the risk. Even the cost of changing software requirements and testing every software module is not cheap, and for a multinational, each country needs to do the testing as well, since business requirements may vary by country.
For now, the only way for AMD to completely take over the server market is for all of these big companies to move to the cloud, and that seems unlikely to happen very soon. I don't see it happening in the next 20 years, especially where the banking industry is involved.
The last time I witnessed Intel doing weird things because it was behind in innovation and technology was back in the legendary Athlon 64 days, when AMD took the price/performance crown.
Vectorizing the data is always a good start, but it may not be enough. Whenever you see a loop iterating over a dense array, there is some theoretical potential there, but the solution may not be obvious. Often the solution is to restructure the data with similar data grouped together (the data-oriented approach), rather than the typical "world modelling"; then the potential for SIMD will often become obvious. But as we know, a compiler can never do this; the developer still has to do the groundwork. I haven't had time to look into the details of AMX yet, but my gut feeling tells me that any 2D data should have some potential here, like images, video, etc.
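A minimal sketch of that data-oriented restructuring, using a hypothetical particle system (all names are mine): the "world modelling" layout stores one struct per object, while the data-oriented layout groups each field into its own dense array that maps directly onto SIMD lanes.

```cpp
#include <cstddef>
#include <vector>

// "World modelling" layout: array of structs (AoS). Fields of one object
// are adjacent; the same field across objects is strided in memory.
struct ParticleAoS { float x, y, z, mass; };

// Data-oriented layout: struct of arrays (SoA). Each field is a dense,
// contiguous array, so a loop over it is trivially 16-wide under AVX-512.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// With SoA, the compiler (or hand-written intrinsics) can vectorize this:
void integrate_x(ParticlesSoA& p, const std::vector<float>& vx, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += vx[i] * dt;  // contiguous loads/stores, no gather needed
}
```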
But the compilers are able to do much more, as they have more time and more context. I don't know how many tricks like these compilers actually perform, though, since many such integer tricks rest on assumptions the compiler might not be able to verify. Nevertheless, many of these operations have dedicated instructions that are branchless and faster anyway.
Still, compilers do things like this: if an integer variable is set to 0, they convert it to an xor of the register with itself, because it's faster.
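For instance (a tiny illustration; exact codegen varies by compiler and flags), a function returning zero typically compiles to the xor idiom rather than a mov of an immediate:

```cpp
// GCC/Clang at -O2 on x86-64 typically emit:
//     xor eax, eax   ; 2 bytes, and breaks dependency chains on eax
//     ret
// instead of:
//     mov eax, 0     ; 5 bytes
int zero() {
    return 0;
}
```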
Compilers like GCC offer flags like -ffast-math, which enable a lot of tricks with floats that may sacrifice some precision, or compliance with IEEE 754, in order to gain some extra performance.
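One classic case where this matters (a sketch; the flags shown are GCC-style): a float sum reduction cannot be vectorized under strict IEEE 754 semantics because floating-point addition is not associative, but -ffast-math permits the compiler to reassociate it into wide partial sums.

```cpp
#include <cstddef>

// Under strict semantics the adds must happen in order, one at a time.
// With -O2 -mavx512f -ffast-math (or just -fassociative-math), the
// compiler may split this into 16-wide partial sums and combine them
// at the end, changing the rounding slightly but running much faster.
float sum(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```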
I read more about math tricks. There are hundreds, maybe thousands. LoL! Maybe many of them would be useful to implement inside a CPU architecture. Maybe the CPU architects and programmers need help with this from highly educated mathematicians with lots of experience?
I do wonder what kind of tricks you are talking about though. Like replacing a single multiplication with a shift operation, or more like an approximation like the fast inverse square root?
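For readers who haven't seen it, the fast inverse square root mentioned here is the classic bit-trick approximation popularized by Quake III. A defined-behavior rendition looks roughly like this (today you would normally use the hardware _mm_rsqrt_ps / _mm512_rsqrt14_ps approximations instead):

```cpp
#include <cstdint>
#include <cstring>

// Classic "fast inverse square root": a bit-level initial guess refined
// by one Newton-Raphson iteration. Roughly 0.2% max relative error;
// written with memcpy to avoid the undefined-behavior pointer cast
// in the original.
float fast_rsqrt(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);         // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);             // magic-constant initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  // one Newton step sharpens it
}
```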
There are many transformations that are correct algorithmically, but may lead to errors when implemented in fixed precision. The CPU can't do any such things in real time, and never will, but the compiler can, if you ask it to.
Also keep in mind that a CPU decodes 5-6 instructions per clock cycle, so anything optimized in real time needs to be very simple, hardcoded logic. That's not to say someone can't come up with a smarter way of implementing things; it wasn't that many years ago that someone came up with a way to do multiplication with fewer transistors.
In my experience, performance does not come from a 10-20% IPC improvement executing the same stupid sequence of instructions slightly faster, shaving a fifth of a clock cycle here and there, or even adding 50% more cores. That's not where it is.
Performance comes from:
1) Understanding the nature of the problem you are trying to solve, which drives how you organize the data and which algorithms you use. Vectorizing the problem quickly becomes the best way to get a stepwise improvement, e.g. 2x or more. Vectorizing is a more efficient way to parallelize than just adding more general cores (more independent, uncoordinated cores fighting for the same memory resources); more cores work best on unrelated tasks (different data), while vectorization works best to parallelize one task over one (potentially huge) data set and get that task done fastest.
2) Understanding the instruction set of the processor you are targeting, so you can focus on the instructions that get the job done fastest (currently that is AVX-512).
3) Understanding caching and memory access, and how data should be organized and accessed to maximize data throughput and feed the CPU optimally.
If you do this with a Skylake or Cascade Lake AVX-512-capable CPU (e.g. Cascade Lake-X: Core i9-10980XE with 2 FMAs per core), the bottleneck is NOT the CPU but RAM (and the memory hierarchy). Focus shifts to how you read and write data as fast as possible, maximizing use of the different cache levels by doing as many ops as possible on the data while it's still hot in cache before you move on to the next subset of data (sized to fit the caches).
Designing the data structures and algorithms around optimal data flow becomes the way you get maximum performance.
The point is that a Cascade Lake-X Core i9-10980XE CPU is not the bottleneck, the memory subsystem is. You just can't feed this beast fast enough; it chews through everything you can throw at it, quickly saturating the full memory bandwidth, and that is as good as it gets! Assuming you vectorize data and algorithms :)
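To illustrate the cache-fitting idea (a minimal sketch; the tile size of 64 and the function name are my assumptions, to be tuned per target): a blocked matrix multiply reuses each tile of A and B many times while it is still hot in L1/L2, instead of streaming the full matrices from RAM on every pass.

```cpp
#include <cstddef>

// C += A * B on n x n row-major matrices, processed in T x T tiles so
// each tile stays resident in cache while it is reused. The innermost
// loop walks contiguous memory, so it also vectorizes cleanly.
constexpr std::size_t T = 64;  // illustrative; tune to the target cache

void matmul_blocked(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t kk = 0; kk < n; kk += T)
            for (std::size_t jj = 0; jj < n; jj += T)
                for (std::size_t i = ii; i < ii + T && i < n; ++i)
                    for (std::size_t k = kk; k < kk + T && k < n; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < jj + T && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];  // contiguous inner loop
                    }
}
```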
I'll try to make the point again using the 10980XE as an example. Due to the superscalar nature of the 10980XE, where each core can reorder (from independent instructions in the pipeline) and execute instructions in parallel (10 parallel but differently capable execution ports), each core already has incredible parallelism. BUT, most importantly for what I'm talking about here, a 10980XE has 36 vector FMAs/ALUs (2 per core). Each core effectively has two 16-float-wide (or 16 x 32-bit-integer-wide) vector ALUs/FMAs that can perform 2 SIMD instructions in parallel (if programmed correctly, and that is the big if). With proper programming you effectively have 36 FMA or ALU vector cores (depending on float or integer).
From a pure number-crunching perspective, that makes the 10980XE a 36-core, 16-float-wide number-crunching CPU (not an 18-core one)!
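Spelling out the back-of-envelope peak implied by those numbers (the 3.0 GHz sustained AVX-512 clock is my assumption; real AVX-512 clocks vary with load and license level):

```cpp
// Peak single-precision throughput for an 18-core, 2-FMA-per-core AVX-512 CPU.
constexpr double cores  = 18;
constexpr double fmas   = 2;     // ports 0 and 5
constexpr double lanes  = 16;    // 512 bits / 32-bit floats
constexpr double flops  = 2;     // an FMA counts as a multiply and an add
constexpr double ghz    = 3.0;   // assumed sustained AVX-512 clock
constexpr double gflops = cores * fmas * lanes * flops * ghz;  // = 3456 GFLOP/s
```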
So why aren't we seeing these gains reflected in most of today's apps (given that AVX-512 has been on the market for 3 years now, since the Core i9-7980XE, we should see more, right)?
Answer: developers have simply not taken the time to vectorize their code. Why is that? A few reasons I can think of:
1) They are under too much budget pressure, so they cannot go back and redesign their inefficient sequential code (commercial dev houses). This is inexcusably poor management! Why? Well, if there is a business case for faster CPUs (and that's the whole business model for AMD and Intel, so yeah, there is), then there is also a business case for faster software.
If this is the case, dev houses are basically relying on the gen-over-gen 10-20% IPC improvement to continue, giving users minuscule, imperceptible performance increases each generation. By the way, it is really hard to shave another fraction of a clock cycle off each instruction when you have already been doing that, gen over gen, for decades.
2) Incompetent software developers (no, a Python class does not make you a computer scientist).
3) Lazy developers, and NO, a compiler will NOT compensate for laziness.
4) Also Intel's fault: they could have spent time developing efficient API layers that use the new instructions in the best way (encapsulating the most common algorithms and data structures in vectorized form) and making them broadly available for free, then "selling" them to the dev houses (I mean educating them, not making money from the APIs). They did some of this, but sorry, MKL (Intel Math Kernel Library) is not enough.
This problem will be even bigger with AMX, because if AVX-512 is a software redesign, then AMX is even more of one: you really need to "vectorize/matricize" and put data into vectors and/or matrices to get the maximum performance and throughput out of Sapphire Rapids. It will be even harder for a compiler to do this (compared to AVX-512); not going to happen!
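For a taste of what that looks like, here is a hedged sketch of an AMX int8 tile multiply pieced together from the published instruction set reference (the struct layout, intrinsic names, and tile shapes follow that document, but there is no shipping hardware to test on yet, and the OS must also enable the tile state, so treat this as illustrative only):

```cpp
#include <immintrin.h>
#include <cstdint>

// Tile configuration block for LDTILECFG, 64 bytes total, as described
// in the ISA extensions reference. Compile with -mamx-tile -mamx-int8.
struct alignas(64) TileConfig {
    std::uint8_t  palette_id;   // 1 = the baseline tile palette
    std::uint8_t  start_row;
    std::uint8_t  reserved[14];
    std::uint16_t colsb[16];    // bytes per row, per tile register
    std::uint8_t  rows[16];     // rows, per tile register
};

// C(16x16 int32) += A(16x64 int8) * B(packed int8), one tile each.
void amx_tile_multiply(const std::int8_t* A, const std::int8_t* B, std::int32_t* C) {
    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: C accumulator, 16x16 int32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: A tile, 16x64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: B tile (pre-packed layout)
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, A, 64);       // load A tile, 64-byte row stride
    _tile_loadd(2, B, 64);       // load B tile, assumed already repacked
    _tile_dpbssd(0, 1, 2);       // tmm0 += int8 dot products of tmm1 and tmm2
    _tile_stored(0, C, 64);      // write back the 16x16 int32 result
    _tile_release();
}
```

Note how little this resembles an ordinary scalar loop: the data has to be pre-packed into tile-shaped blocks before the hardware can touch it, which is exactly the "matricize by design" point above.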
In summary, it will be down to the individual programmer to get 2x-4x performance out of the next generation of AVX-512/AMX Intel CPUs. If not, we will just get the usual 10-20% IPC improvement and have lots of wasted silicon in AVX-512 and AMX.
The bigger point I'm making is that design and programming have to change to "vectorize/matricize" data and algorithms by default, NOT as some sort of obscure afterthought.