Thursday, June 4th 2020
Intel "Sapphire Rapids," "Alder Lake" and "Tremont" Feature CLDEMOTE Instruction
Intel's three upcoming processor microarchitectures, namely the next-generation Xeon "Sapphire Rapids," Core "Alder Lake," and the low-power "Tremont" cores found in Atom, Pentium Silver, Celeron, and even Core Hybrid processors, will feature a new instruction that aims to speed up processor cache performance, called CLDEMOTE ("cache line demote"). It gives software a means to hint to a processor core that a specific piece of cached content (a cache line) no longer needs to loiter in a lower cache level (closer to the core) and can be demoted to a higher cache level (farther from the core), without being flushed back to main memory.
There are a handful of benefits to what CLDEMOTE does. Firstly, it frees up lower cache levels such as L1 and L2, which are smaller and dedicated to a single CPU core, by pushing cache lines out to the last-level cache (usually L3). Secondly, it enables rapid data movement between cores: a cache line pushed to the L3, which is shared between multiple cores, can be picked up by a neighboring core. Dr. John McCalpin from UT Austin wrote a detailed article on CLDEMOTE.
Source:
InstLatX64 (Twitter)
16 Comments on Intel "Sapphire Rapids," "Alder Lake" and "Tremont" Feature CLDEMOTE Instruction
Unfortunately, most software is lagging more than 15 years behind in terms of ISA. Your OS (Windows or Linux), most libraries, and most applications are all compiled for x86-64/SSE2. Only a small selection of demanding applications uses more modern features. This is unfortunate, since we are missing out on a lot of free performance. The Linux distribution "Intel Clear Linux" demonstrates this: some of its core libraries and applications are optimized for modern ISA features, and it gains a good portion of performance from that.
Really? The OS scheduler lives on a completely different time scale (milliseconds), while data residing in CPU caches usually stays there for nanoseconds to microseconds. I know the article mentions this, but I don't understand where it gets that from.
This sounds like something that can help synchronize data between cores, yet it probably has a very limited use case.
If this were managed dynamically at the OS level it would be hideously slow; if it were managed in silicon it would cause some strange performance inconsistencies with no way to turn it off.
It will most likely need to be part of the OS Kernel, similar to the Intel Clear Linux @efikkan mentions. It will be useful in fringe circumstances until very widespread hardware adoption.
Intel is partly to blame for the AVX adoption rate though, as Celeron and Pentium models are still lacking support.
As hardware progresses but software fails to keep up, we are losing out on more and more performance potential. I wish that as Microsoft phases out 32-bit Windows, they would transition to having two 64-bit versions of Windows: a "current" version (e.g. a Haswell/Bulldozer or Skylake/Zen ISA level) and a "legacy" version (x86-64/SSE2). Just recompiling the kernel and core libraries would easily yield ~5-10% of free performance, and much more in certain cases by writing optimized code using intrinsics. Such improvements would benefit pretty much all applications and would help achieve better energy efficiency. For a company which employs thousands of developers on all kinds of gimmicks, this would be a much smarter initiative.
It makes sense I guess, but I'd never thought of it like that before....
The memory/cache system inside CPUs is much more sophisticated than you think. When the CPU intends to read or write an address, it fetches a whole cache line (64 bytes) from memory into L2, even if it's only going to write a single byte. There are various locks in place in case cores need to write to addresses within the same cache line, so I guess if a core can free this lock earlier, it can help in some cases. Otherwise it needs to wait until the line is discarded.
What I meant was that the Alder Lake platform (LGA 1700) is gonna be the Intel platform worth buying!
And of course, anyone looking for a new PC or simply an upgrade should and must consider the price-performance value of the products he/she plans to purchase. That's like the number one rule, no two ways about it.
Right now you can get a 3900X for the price of a 10700K; I mean, it's crazy, the reality we live in today.
And being only 7% slower at 1080p in games according to www.techpowerup.com/review/intel-core-i7-10700k/15.html, the 3900X is the clear choice due to its higher core count.
For those wanting to learn more about CPU caches and how they impact software, I highly recommend watching this lecture. Watch at least ~32:47-43:00, but watching the whole thing is well worth an hour of your life. I recommend it whether you're interested in how CPU caches work in principle or you are a programmer. The principles explained here are essential for any performant code, and it is a subject only becoming more relevant as CPUs get more powerful but more reliant on efficient cache usage.
The video explains why, for example, false sharing is a huge problem for scaling across multiple cores, and also states that there is no fix for it yet; this new CLDEMOTE instruction might just help with some of these cases. Actual sharing has similar behavior as well, and while that will always have a significant cost tied to it, this new instruction might help reduce that cost and improve scaling.
What's interesting to me is whether this new instruction is targeted only at specific edge cases, or if it is something that helps a lot of common cases of unnecessary stalls in the CPU.
That makes no sense. What matters is performance in the workloads relevant to the buyer, not that it has a higher core count.
If 3% lower performance in 1440p (and 7% in 1080p but who cares) matters is up to the end user. If you buy a $800 graphics card, then I could argue that $24 of that is "wasted" due to the CPU. The i7-10700K and i9-10900K models are quite a bit snappier in Photoshop and Premiere, on top of office work and web browsing. These are things which may be highly relevant to some buyers, and much more important than the core count.
When considering prices, the buyer must always look at their local prices. A much bigger problem for Comet Lake is availability. I have yet to see any of the K-models in stock, and many stores expect delivery in August. Of course, I can't know for sure if some are shipping or not. But if availability is practically nonexistent globally, then it's pretty much dead on arrival. It doesn't matter how good it is if you can't buy it.
According to www.techpowerup.com/review/intel-core-i7-10700k/21.html, the 3900X is 6.6% faster in applications than the 10700K, 7% slower at 1080p in games, and the performance per dollar (the TPU review says the Intel Core i7-10700K retails for around $400) is 1% higher for the 10700K, assuming the $400 price tag.
www.techpowerup.com/review/intel-core-i7-10700k/10.html
In Photoshop and Office the difference is less than 100 milliseconds. (58.1 milliseconds for Photoshop, 73.4 milliseconds for Word, 86.8 milliseconds for PowerPoint and 27.8 milliseconds for Excel)
And Premiere shows that the difference is less than 10%. (6%)
www.techpowerup.com/review/intel-core-i7-10700k/8.html
In web browser performance the difference is less than 10%. (8% for Google Octane and 6% for WebXPRT)
Actually in Mozilla Kraken, the 3900X is quite a bit snappier.
A CPU with more threads enables more multitasking, although to benefit from that, the applications in your daily usage must support multithreading, which modern apps support by default. And so the 3900X is the better option: not only does it have comparable performance and price to the 10700K, it also has, you guessed it, a higher thread count, which allows you to do more with your PC.
As for comparable prices globally: in Germany, for example, you can purchase the 10700K for 429 euros and the 3900X for 419 euros. The purchasing decision is up to the buyer, of course. Here are the links (June 7th, 2020):
geizhals.de/intel-core-i7-10700k-bx8070110700k-a2290993.html
geizhals.de/amd-ryzen-9-3900x-100-100000023box-a2064391.html
At the end of the day, everyone has their own preferences. We're not gonna force anything (or any idea) on anyone. Intel, AMD and NVIDIA are just brands; what gives them value is the performance they bring to the market. Intel enabled HT from i9 all the way down to i3 with 10th Gen Comet Lake processors, an example of bringing performance to the market.