Friday, December 11th 2020

Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased

Today, thanks to ServeTheHome forum member "111alan", we have the first pictures of an alleged Intel Sapphire Rapids Xeon processor. Pictured is what appears to be a dual-die design, similar to the 56-core/112-thread Cascade Lake-AP parts that also spread their cores across two dies. Sapphire Rapids is built on Intel's 10 nm SuperFin node and allegedly comes in this dual-die configuration as well. To host the processor, motherboards need the new LGA-4677 socket. The socket, along with the new 10 nm Sapphire Rapids Xeon processors, is set for delivery in 2021, when Intel is expected to launch the new processors and their respective platforms.

The processor pictured is clearly a dual-die design, meaning that Intel is using its multi-chip packaging technology, with EMIB (Embedded Multi-Die Interconnect Bridge) silicon bridges interconnecting the dies. As a reminder, the new 10 nm Sapphire Rapids platform is supposed to bring many new features: a DDR5 memory controller paired with Intel's Data Streaming Accelerator (DSA), the brand-new PCIe 5.0 protocol with a 32 GT/s data transfer rate, and CXL 1.1 support for next-generation accelerators. The exact configuration of this processor is unknown; however, it is an engineering sample running at a modest clock frequency of 2.0 GHz.
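As a quick back-of-the-envelope check on what 32 GT/s buys you (our arithmetic, not from the leak): PCIe 5.0 keeps the 128b/130b line encoding, so a x16 link works out to roughly 63 GB/s in each direction. A minimal sketch of the calculation:

```cpp
#include <cstdio>

// Illustrative PCIe 5.0 bandwidth arithmetic. PCIe 3.0 and later use
// 128b/130b encoding, so usable bandwidth = raw rate * 128/130.
int main() {
    const double rate_gt_s    = 32.0;          // PCIe 5.0, per lane
    const double encoding_eff = 128.0 / 130.0; // 128b/130b line code

    const double lane_GB_s = rate_gt_s * encoding_eff / 8.0; // 1 bit/transfer -> bytes
    const double x16_GB_s  = lane_GB_s * 16.0;

    std::printf("per lane: %.2f GB/s, x16 link: %.1f GB/s each direction\n",
                lane_GB_s, x16_GB_s); // ~3.94 and ~63.0
    return 0;
}
```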
Source: ServeTheHome Forums

83 Comments on Alleged Intel Sapphire Rapids Xeon Processor Image Leaks, Dual-Die Madness Showcased

#51
Ruru
S.T.A.R.S.
3rold
And Core 2 Quad..
#52
Vayra86
fancucker: Lot of negative energy ITT. We need an Intel resurgence to keep AMD pricing in check.
You need the negativity, or you can't see what's positive anymore either. Intel's on the shitlist until they get serious again. Times change, eh?
#53
InVasMani
HD64G: Max 96 cores per die, quite possibly.
That, or stick to 64 cores and reduce the number of CCXs in those designs to cut down on latency, leaving more headroom for higher clock speeds per core. It doesn't matter too much at this point; AMD is basically competing with themselves, though they should keep their foot on the gas, because Intel will surely find its footing eventually. I suppose they could also keep it at 64 cores, reduce the CCX count, and throw in some APU tech for good measure.
#54
Spencer LeBlanc
Just 8 more cores, Intel, and you'll be able to compete again!
#55
TumbleGeorge
Spencer LeBlanc: Just 8 more cores, Intel, and you'll be able to compete again!
Intel needs a private detective to find the cores lost in their own CPUs :D
#56
cueman
Well, wait for it....
It's also good that Intel is finally getting its 10 nm CPUs out soon.

When that happens, and it's soon, we can realistically compare and see how 'good' AMD's 7 nm Ryzen really is.

It's just useless to compare a 14 nm CPU against a 7 nm CPU... sure, AMD wants it that way..

I think we'll see the final battle next year when the 10 nm Alder Lake comes; it will show where the CPU world is going.

But before that, Intel's March 2021 release is the answer to AMD's Vermeer CPUs: Rocket Lake, still 14 nm, but the last of its kind. It should beat AMD's Vermeer easily, and I mean in gaming performance.

But then, in June 2021, comes Alder Lake, made on 10 nm tech and also a hybrid CPU, so it will finally show whether Intel can step back up to the top of the CPU world.
#57
TumbleGeorge
cueman: then, in June 2021, comes Alder Lake
Too optimistic.
#58
efikkan
cueman: But before that, Intel's March 2021 release is the answer to AMD's Vermeer CPUs: Rocket Lake, still 14 nm, but the last of its kind. It should beat AMD's Vermeer easily, and I mean in gaming performance.

But then, in June 2021, comes Alder Lake, made on 10 nm tech and also a hybrid CPU, so it will finally show whether Intel can step back up to the top of the CPU world.
I believe even the end of 2021 is a best-case scenario for Alder Lake. If Intel had these ready in volume, they would have skipped Rocket Lake, which now seems to be pushed all the way to March.

Still, I wonder how much it matters when Zen 3 is so hard to find while Comet Lake is in plentiful supply. It's really hard to tell what quantities are shipping when it's sold out everywhere. But I saw one shop estimating its next batch in April, so it makes me wonder what's going on…
#60
TumbleGeorge
128 cores come after Genoa (up to 96 cores), with the Zen 5 architecture (no name yet for the products using it). Milan tops out at 64 cores, just like Rome. The cited article clearly states that the 128-core figure is a 2-socket configuration.
#61
Xajel
HD64G: Max 96 cores per die, quite possibly.
You mean per package; that means either 12C×8D or 8C×12D. Either way, it's not happening until 5 nm.
#62
r9
Chrispy_: Intel has spent 3 years spreading FUD about AMD's glue.

The irony and shame here is fantastic.
If you can't beat them, join them...
#64
Tech00
One thing I haven't seen mentioned in this thread is the new flagship instruction set feature that Sapphire Rapids will introduce: AMX.
I see some discussion about AVX-512, but that's not the "new thing" with Sapphire Rapids; AMX is!
In short, AMX is the Advanced Matrix Extensions. It basically gives you 8 KB of new register file space in the form of 8 separate, dynamically size-configurable, two-dimensional tile registers (tmm0-tmm7), compared to AVX-512's 32 one-dimensional 64-byte vector registers (zmm0-zmm31) = 2 KB. That's four times the register file space!
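The arithmetic checks out; a minimal sketch (sizes as described above, using the 16-row × 64-byte maximum tile shape):

```cpp
// Register-file arithmetic for the figures above (illustrative only).
constexpr int zmm_regs = 32, zmm_bytes = 64;    // AVX-512: zmm0-zmm31, 512-bit each
constexpr int tmm_regs = 8,  tmm_bytes = 1024;  // AMX: tmm0-tmm7, up to 16 rows x 64 B
static_assert(zmm_regs * zmm_bytes == 2 * 1024, "AVX-512: 2 KB of vector registers");
static_assert(tmm_regs * tmm_bytes == 8 * 1024, "AMX: 8 KB of tile registers");
```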
Those of us who actually work with AVX-512 in our programming can see speedups of 2x to 4x (I've seen even higher in special cases) compared to AVX2. It does require a good grasp of linear algebra, multivariate calculus, and SIMD algorithms, because today's compilers have a really hard time automatically translating high-level programming languages into high-performing vectorized AVX-512 machine code.
To get the value out of AVX-512, the designer/programmer needs to vectorize the algorithms, not just write sequential programs as usual and hope that the compiler somehow figures it out (it won't). This requires theoretical knowledge of math and computer science, but if you learn how to do it right, the reward is great!
An 18-core consumer Skylake/Cascade Lake part like the 10980XE has 36 FMAs (2 per core, on port 0 and port 5), and if you know what you are doing you can execute 36 × 16 = 576 single-precision multiply-adds (or 32-bit integer ops) every CPU cycle (throughput) or so. That comes in very handy if you know what you are doing and are comfortable vectorizing your algorithms.
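To make that concrete, here is a minimal sketch of a vectorized multiply-add loop using AVX-512 intrinsics (function and variable names are my own, the tail handling is the simplest possible, and it assumes an AVX-512F-capable CPU):

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] += a * x[i], 16 floats per iteration through one FMA.
// A sketch, not tuned code: a real kernel would unroll to keep both
// FMA ports (0 and 5) busy and block the data for cache.
void saxpy_avx512(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);       // broadcast the scalar a
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);    // 16 floats from x
        __m512 vy = _mm512_loadu_ps(y + i);    // 16 floats from y
        vy = _mm512_fmadd_ps(va, vx, vy);      // vy = a*vx + vy, one instruction
        _mm512_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                         // scalar tail
        y[i] += a * x[i];
}
```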

So moving to AMX will require even more knowledge to really capitalize on and make full use of the extra register capacity and instruction set.

A few points on the claim that AVX-512 and AMX are "only" for AI and neural nets: it is true that NNs and AI benefit greatly from AVX-512, not only because they vectorize naturally and translate well onto the AVX-512 registers (zmm0-zmm31), but also because AVX-512 provides versatility, generality, and richness in the instructions you can use on the zmm registers, and it integrates seamlessly with the rest of your x64 code.
However, that is far from the only application for AVX-512. Again, if you know what you are doing as an architect/designer/programmer and spend a bit of effort vectorizing the algorithms and the code, you can get similar benefits in the inner loops of other classes of problems and applications (image processing, sound processing, etc.); anything, really, that involves processing lots of data with floating-point or integer calculations in some sort of uniform pattern (loops). This is an advantage AVX-512-based general CPUs have over specialized GPUs (fewer but more general flops/iops vs. more but specialized flops/iops).

A lot of weight is put on a 10-20% IPC improvement on the same code (gen over gen, AMD or Intel, doesn't matter), which is nice, but if you know what you are doing and vectorize your algorithms, you can get 200%-400% improvements with AVX-512 today, and maybe another ???% with AMX. We'll see... I have high hopes... AMX does look great on paper. See the Intel® Architecture Instruction Set Extensions Programming Reference.
#65
HansRapad
Vya Domus: Right, that's why AMD is winning high-profile contracts left and right. Apparently it's a lot easier for companies to switch than you think.

Intel hasn't updated their Xeon platform with anything noteworthy for around 2 years now; that's an eternity in the server space, and customers just won't settle for an inferior, slower, and more costly platform. These things are custom-built, with huge support teams dedicated to their maintenance; it's not like you plop in some racks of Intel hardware and they miraculously "just work" while the AMD ones don't. If that fabled Intel support and ecosystem were worth so much, they'd never switch; except they do, because it isn't.

Oh, and Intel's memory technology business is so good that apparently they're looking to sell it. Hmm.
Sadly no. So far, only the Big Tech companies with the resources to develop their own infrastructure are able to do so, mostly cloud-computing businesses like AWS, Google, and Microsoft, which all have development teams in-house.

Other big companies that aren't tech companies don't have that luxury, and they've already paid an IT vendor to write custom software that runs on Xeon hardware; think McDonald's, Unilever, Coca-Cola, P&G, Nestlé, the banking industry, and so on. Commissioning new software to work with a whole new architecture, and even a new hardware vendor, is not cheap. The applications themselves aren't even centralized, so the transition to a new system means possible downtime, lost sales, etc., not to mention they have to retest all of the smaller applications that interface with the ones running on the servers.

These companies don't write their own software; they would rather pay the Intel tax and pay for additional redundant security measures than risk downtime during a transition, or unforeseen future downtime caused by new software on a new architecture disrupting business processes or costing sales. For companies of this scale, even one day of lost sales can equal tons of money, and they'd rather not take the risk. Even the cost of changing software requirements and retesting every software module is not cheap, and for a multinational, each country needs to do its own testing as well, since business requirements may vary per country.

The only way for AMD to completely take over the server market is for all of these big companies to move to the cloud, and that seems unlikely to happen any time soon. I don't see it happening in the next 20 years, especially where the banking industry is involved.
#66
Super XP
HansRapad: In the server and datacenter market, it's not only about raw horsepower but more about the whole ecosystem.

Intel has spent so much time nurturing the Xeon platform, which is their core business, that even when its performance isn't on top, the features their customers actually use cannot be replaced by AMD.

So far I haven't seen any technology similar to Optane in the datacenter environment from AMD; while Optane is kind of janky on the desktop, it is fully utilized on the datacenter platform.
Irrelevant, when AMD's EPYC is beating anything Intel has to offer in the datacenter and server market. The upgrade cycle is a long one, and Intel has entrenched itself in that market through years of domination; I believe Intel owns over 90% of it. If AMD can capture even 10% or 15%, that is more than enough to keep AMD pumping out fast, efficient, and highly innovative processors. Even a slight increase in share is a huge win for the company; that's how much Intel controls that market. But it's slowly turning in AMD's favour, and the industry now sees AMD as a compelling and proven alternative. Things are going to look quite interesting 2-5 years from now.
Chrispy_: Intel has spent 3 years spreading FUD about AMD's glue.

The irony and shame here is fantastic.
AMD's Zen :D caught Intel with its knickers in a knot o_O, and they haven't recovered; you have to ask yourself why they created a job position named "Marketing Strategist", i.e. damage control :kookoo:
The last time I witnessed Intel doing weird things because it was behind innovatively and technologically was back in the legendary Athlon 64 days, when AMD took the price/performance crown.
#67
efikkan
Tech00: It does require a good grasp of linear algebra, multivariate calculus, and SIMD algorithms, because today's compilers have a really hard time automatically translating high-level programming languages into high-performing vectorized AVX-512 machine code.
To get the value out of AVX-512, the designer/programmer needs to vectorize the algorithms, not just write sequential programs as usual and hope that the compiler somehow figures it out (it won't). This requires theoretical knowledge of math and computer science, but if you learn how to do it right, the reward is great!
The good news is that many workloads are essentially vectorizable in nature, but unfortunately the coding practices taught in school these days tell people to ignore the real problem (the data) and instead focus on building a complex "architecture" that hides and scatters the data and state all over the place, making SIMD (or any decent performance, really) virtually impossible.

Vectorizing the data is always a good start, but it may not be enough. Whenever you see a loop iterating over a dense array, there is some theoretical potential, but the solution may not be obvious. Often the solution is to restructure the data with similar data grouped together (the data-oriented approach) rather than the typical "world modelling"; then the potential for SIMD often becomes obvious. But as we know, a compiler can never do this; the developer still has to do the groundwork.
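A minimal sketch of that restructuring (the types and names are illustrative): the array-of-structs layout on top is the "world modelling" version, and the struct-of-arrays layout below is the data-oriented one that a compiler can actually vectorize:

```cpp
#include <vector>

// Array of structs (AoS): each x is 16 bytes from the next, so a SIMD
// update of the x field wastes most of every cache line it loads.
struct Particle { float x, y, z, mass; };
void move_aos(std::vector<Particle>& ps, float dx) {
    for (auto& p : ps) p.x += dx;      // strided access, hard to vectorize well
}

// Struct of arrays (SoA): all x values are contiguous, so the loop maps
// directly onto 16-wide AVX-512 lanes and compilers auto-vectorize it.
struct Particles { std::vector<float> x, y, z, mass; };
void move_soa(Particles& ps, float dx) {
    for (auto& xi : ps.x) xi += dx;    // dense access, trivially vectorizable
}
```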
Tech00: So moving to AMX will require even more knowledge to really capitalize on and make full use of the extra register capacity and instruction set.
I haven't had time to look into the details of AMX yet, but my gut feeling tells me that any 2D data should have some potential here, like images, video, etc.
#68
TumbleGeorge
Do CPUs use math tricks to shortcut calculations? Or is this possible only for analog systems like the human brain?
#69
Vya Domus
TumbleGeorge: Do CPUs use math tricks to shortcut calculations?
Kind of. It's possible, for example, for a CPU to execute an integer multiplication by shifting bits instead of doing the actual multiplication.
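A minimal illustration of that kind of strength reduction (compilers apply it automatically; the functions are only there to show the equivalence):

```cpp
#include <cstdint>

// Multiplying by a power of two is just a left shift.
std::uint32_t times8_mul(std::uint32_t x)   { return x * 8;  } // compiles to a shift
std::uint32_t times8_shift(std::uint32_t x) { return x << 3; } // same result

// Other constants decompose into shifts and adds, e.g. x*10 = x*8 + x*2.
std::uint32_t times10(std::uint32_t x) { return (x << 3) + (x << 1); }
```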
#70
TheoneandonlyMrK
cueman: Well, wait for it....
It's also good that Intel is finally getting its 10 nm CPUs out soon.

When that happens, and it's soon, we can realistically compare and see how 'good' AMD's 7 nm Ryzen really is.

It's just useless to compare a 14 nm CPU against a 7 nm CPU... sure, AMD wants it that way..

I think we'll see the final battle next year when the 10 nm Alder Lake comes; it will show where the CPU world is going.

But before that, Intel's March 2021 release is the answer to AMD's Vermeer CPUs: Rocket Lake, still 14 nm, but the last of its kind. It should beat AMD's Vermeer easily, and I mean in gaming performance.

But then, in June 2021, comes Alder Lake, made on 10 nm tech and also a hybrid CPU, so it will finally show whether Intel can step back up to the top of the CPU world.
With recent reports claiming 50%+ yields, and given the size of one die/half of the Intel chip shown here, I'm not surprised there are shortages. Hopefully they get their leadership sorted soon and regain direction.
#71
efikkan
TumbleGeorge: Do CPUs use math tricks to shortcut calculations? Or is this possible only for analog systems like the human brain?
CPUs can do some minor optimization in real time, but this is limited to things that are guaranteed to produce identical results, probably restricted to specific patterns of instructions that have an "optimal" alternative. CPUs can also unroll some loops and eliminate some redundant instructions in real time.

But compilers are able to do much more, as they have more time and more context. I don't know how many tricks like these compilers actually apply, though, since many such integer tricks rest on assumptions the compiler may not be able to verify. Nevertheless, many of these operations have dedicated instructions that are branchless and faster anyway.
Still, compilers do things like converting an integer assignment of 0 into an xor of the register with itself, because it's faster.
Compilers like GCC offer flags like -ffast-math, which enable a lot of tricks with floats that may sacrifice some precision or compliance with IEEE 754 in order to gain some extra performance.
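Two concrete examples of what this looks like (the generated assembly described in the comments is typical gcc/clang x86-64 output, which can of course vary with compiler and flags):

```cpp
// Zeroing idiom: gcc/clang emit "xor eax, eax" instead of "mov eax, 0";
// the xor encoding is shorter, and register renaming makes it nearly free.
int zero() {
    return 0;
}

// With -ffast-math, gcc may rewrite x/d as x * (1.0/d), reassociate
// sums, and assume no NaNs or infinities: faster, but no longer
// strictly IEEE 754 compliant.
double scale(double x, double d) {
    return x / d;
}
```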
#72
TumbleGeorge
efikkan: CPUs can do some minor optimization in real time, but this is limited to things that are guaranteed to produce identical results, probably restricted to specific patterns of instructions that have an "optimal" alternative. CPUs can also unroll some loops and eliminate some redundant instructions in real time.

But compilers are able to do much more, as they have more time and more context. I don't know how many tricks like these compilers actually apply, though, since many such integer tricks rest on assumptions the compiler may not be able to verify. Nevertheless, many of these operations have dedicated instructions that are branchless and faster anyway.
Still, compilers do things like converting an integer assignment of 0 into an xor of the register with itself, because it's faster.
Compilers like GCC offer flags like -ffast-math, which enable a lot of tricks with floats that may sacrifice some precision or compliance with IEEE 754 in order to gain some extra performance.
Almost all of these transformations are ordinary algebra, not tricks like the ones I linked, which take actual shortcuts.

I read more about math tricks. There are hundreds, maybe thousands. LOL! Maybe many of them would be useful to implement inside a CPU architecture. Maybe the CPU architects and programmers need help here from highly educated mathematicians with a lot of experience?
#73
efikkan
TumbleGeorge: I read more about math tricks. There are hundreds, maybe thousands. LOL! Maybe many of them would be useful to implement inside a CPU architecture. Maybe the CPU architects and programmers need help here from highly educated mathematicians with a lot of experience?
While no one is perfect, I do believe Intel has several thousand engineers working on chip design, and math usually accounts for ~1/3 of an MS degree, so I don't think that's the problem.

I do wonder what kind of tricks you are talking about, though. Like replacing a single multiplication with a shift operation, or more like an approximation such as the fast inverse square root?
There are many transformations that are correct algebraically but may introduce errors when implemented in fixed precision. The CPU can't do any such things in real time, and never will, but the compiler can, if you ask it to.

Also keep in mind that a CPU decodes 5-6 instructions per clock cycle, so anything optimized in real time needs to be very simple, hardcoded logic. That's not to say someone can't come up with a smarter way of implementing things; it's not that many years ago that someone came up with a way to do multiplication with fewer transistors.
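For reference, the fast inverse square root mentioned above, as popularized by Quake III: a classic example of an approximation trick no compiler would ever apply on its own, because it changes the result:

```cpp
#include <cstdint>
#include <cstring>

// Approximates 1/sqrt(x) to within ~0.2%: a bit-level initial guess
// (the "magic" constant) refined by one Newton-Raphson iteration.
float fast_rsqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);    // reinterpret the float's bits safely
    i = 0x5f3759df - (i >> 1);        // initial guess via bit manipulation
    std::memcpy(&x, &i, sizeof x);
    x = x * (1.5f - half * x * x);    // one Newton-Raphson refinement step
    return x;
}
```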
#74
TumbleGeorge
efikkan: Also keep in mind that a CPU decodes 5-6 instructions per clock cycle, so anything optimized in real time needs to be very simple, hardcoded logic.
Yes, that is the problem: the implementation is limited to 0/1 logic. One would have to build a different architecture, more complex at the lowest level and not limited to the binary number system, able to work with all the major number systems directly in hardware rather than in the software layers of the operating system and applications. But that seems to be an impossible task for engineers?
#75
Tech00
efikkan: The good news is that many workloads are essentially vectorizable in nature, but unfortunately the coding practices taught in school these days tell people to ignore the real problem (the data) and instead focus on building a complex "architecture" that hides and scatters the data and state all over the place, making SIMD (or any decent performance, really) virtually impossible.

Vectorizing the data is always a good start, but it may not be enough. Whenever you see a loop iterating over a dense array, there is some theoretical potential, but the solution may not be obvious. Often the solution is to restructure the data with similar data grouped together (the data-oriented approach) rather than the typical "world modelling"; then the potential for SIMD often becomes obvious. But as we know, a compiler can never do this; the developer still has to do the groundwork.

I haven't had time to look into the details of AMX yet, but my gut feeling tells me that any 2D data should have some potential here, like images, video, etc.
Agree!
In my experience, performance does not come from a 10%-20% IPC improvement that executes the same stupid sequence of instructions slightly faster, shaving a fifth of a clock cycle off here and there, or even from adding 50% more cores; that's not where it is.
Performance comes from:
1) Understanding the nature of the problem you are trying to solve, which determines your options for organizing the data and the algorithms to use. Vectorizing the problem quickly becomes the best way to get a stepwise improvement, e.g. 2x+. Vectorization is also a more efficient way to parallelize than just adding more general cores (more independent, uncoordinated cores fighting for the same memory resources); more cores work best on unrelated tasks (different data), while vectorization works best at parallelizing one task over one (potentially huge) data set and getting that task done fastest.
2) Understanding the instruction set of the processor you are targeting, and focusing on the instructions that get the job done fastest (currently that is AVX-512).
3) Understanding caching and memory access, and how data should be organized and accessed to maximize throughput and feed the CPU optimally.

If you do this with a Skylake or Cascade Lake AVX-512-capable CPU (e.g. Cascade Lake-X: the Core i9-10980XE with 2 FMAs per core), the bottleneck is NOT the CPU but RAM (and the memory hierarchy). Focus shifts to reading and writing data as fast as possible, maximizing use of the different cache levels by doing as many operations as possible on the data while it's still hot in cache, before moving on to the next subset of data (sized to fit the caches).
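A minimal sketch of that blocking idea (the block size and the two passes are hypothetical; in practice you size the block to a cache level and fuse as much work per block as you can):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 1 << 16;  // 64K floats = 256 KB, roughly L2-sized

// Two stand-in elementwise passes over a range of floats.
static void pass_a(float* p, std::size_t n) { for (std::size_t i = 0; i < n; ++i) p[i] = p[i] * 2.0f + 1.0f; }
static void pass_b(float* p, std::size_t n) { for (std::size_t i = 0; i < n; ++i) p[i] = p[i] * p[i]; }

// Instead of streaming the whole array once per pass (memory-bound twice),
// run every pass on one cache-sized block before moving to the next block,
// so the second pass finds its data already hot in cache.
void process_blocked(std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); i += BLOCK) {
        const std::size_t n = std::min(BLOCK, data.size() - i);
        pass_a(data.data() + i, n);
        pass_b(data.data() + i, n);
    }
}
```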

Designing the data structures and algorithms around optimal data flow becomes the way you get maximum performance.

The point is that a Cascade Lake-X Core i9-10980XE is not the bottleneck; the memory subsystem is. You just can't feed this beast fast enough; it chews through everything you can throw at it, quickly saturating the full memory bandwidth, and that is as good as it gets, assuming you vectorize your data and algorithms. :)

I'll try to make the point again using the 10980XE as an example. Due to its superscalar nature, each core can reorder independent instructions in the pipeline and execute them in parallel (10 parallel but differently capable execution ports), so each core has incredible parallelism as it is. BUT, most importantly for what I'm talking about here, a 10980XE has 36 vector FMAs/ALUs (2 per core). Each core effectively has two 16-float-wide (or 16 × 32-bit-integer-wide) vector ALUs/FMAs that can perform 2 SIMD instructions in parallel, if programmed correctly, and that is the big if. With proper programming you effectively have 36 FMA or ALU vector cores (depending on float or integer).
From a pure number-crunching perspective, that makes the 10980XE a 36-core, 16-float-wide number-crunching CPU (not an 18-core one)!

So why aren't we seeing these gains reflected in most of today's apps? Given that AVX-512 has been on the market for 3 years now (since the Core i9-7980XE), we should see more, right?
Answer: developers have simply not taken the time to vectorize their code. Why is that? A few reasons I can think of:
1) They are under too much budget pressure, so they cannot go back and redesign their inefficient sequential code (commercial dev houses). This is inexcusably poor management. Why? Well, if there is a business case for faster CPUs (and that's the whole business model for AMD and Intel, so yes, there is), then there is also a business case for faster software.
If this is the case, dev houses are basically relying on the gen-over-gen 10%-20% IPC improvement to continue, giving users minuscule, imperceptible performance increases generation over generation. By the way, it is really hard to keep shaving another fraction of a clock cycle off each instruction when you have already been doing that, gen over gen, for decades.
2) Incompetent software developers (no, a Python class does not make you a computer scientist).
3) Lazy developers, and NO, a compiler will NOT compensate for laziness.
4) Also Intel's fault: they could have spent the time to develop efficient API layers that use the new instructions the best way (encapsulating the most common algorithms and data structures in vectorized form), make them broadly available for free, and "sell" them to the dev houses (I mean educate them, not make money from selling the APIs). They did some of this, but sorry, MKL (the Intel Math Kernel Library) is not enough.

This problem will be even bigger with AMX, because if AVX-512 demands a software redesign, AMX demands even more of one: you really need to "vectorize/matricize", putting your data into vectors and/or matrices, to get the maximum performance and throughput out of Sapphire Rapids. It will be even harder for a compiler to do this than it is for AVX-512; not going to happen! (See the sketch below for what that looks like in code.)
In summary, it will be down to the individual programmer to get 2x-4x performance out of the next generation of AVX-512/AMX Intel CPUs. If not, we will just get another 10%-20% IPC improvement as usual and have lots of wasted silicon on AVX-512 and AMX.
The bigger point I'm making is that design and programming have to change to "vectorize/matricize" data and algorithms by default, NOT as some sort of obscure afterthought.
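To give a flavour of what "matricizing" for AMX looks like, here is a minimal sketch based on the AMX intrinsics in Intel's programming reference (the tile shapes and function name are illustrative; it targets int8 tiles, assumes B is already rearranged into the interleaved layout the tile product expects, and omits the per-process permission request Linux requires before touching tile state):

```cpp
#include <immintrin.h>
#include <cstdint>

// Minimal AMX int8 sketch: C(16x16, int32) += A(16x64, int8) * B(16x64, int8),
// one tile per operand. Compile with -mamx-tile -mamx-int8. Real code would
// loop tiles over larger matrices and request tile-state permission first.
struct TileConfig {                 // 64-byte tile configuration (palette 1)
    std::uint8_t  palette_id;
    std::uint8_t  start_row;
    std::uint8_t  reserved[14];
    std::uint16_t colsb[16];        // bytes per row, for each tile
    std::uint8_t  rows[16];         // number of rows, for each tile
};

void amx_matmul_tile(const std::int8_t* A, const std::int8_t* B, std::int32_t* C) {
    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // tmm0: C, 16 rows x 64 B (16 int32 each)
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   // tmm1: A, 16 rows x 64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   // tmm2: B, 16 rows x 64 int8 (pre-rearranged)
    _tile_loadconfig(&cfg);

    _tile_loadd(0, C, 64);                 // load C tile (stride in bytes)
    _tile_loadd(1, A, 64);                 // load A tile
    _tile_loadd(2, B, 64);                 // load B tile
    _tile_dpbssd(0, 1, 2);                 // tmm0 += tmm1 * tmm2 (int8 dot products)
    _tile_stored(0, C, 64);                // store the accumulated result
    _tile_release();                       // free the tile state
}
```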