Monday, January 3rd 2022
Intel to Disable Rudimentary AVX-512 Support on Alder Lake Processors
Intel is reportedly disabling the rudimentary AVX-512 instruction-set support on its 12th Gen Core "Alder Lake" processors using a firmware/ME update, reports Igor's Lab. Intel does not advertise AVX-512 for Alder Lake, even though the instruction set was much publicized for a couple of its past-generation client-segment chips, namely 11th Gen Rocket Lake and 10th Gen Cascade Lake-X HEDT processors. The company will likely make AVX-512 a feature that sets apart its next-gen HEDT processors derived from Sapphire Rapids, its upcoming enterprise microarchitecture.
AVX-512 is technically not advertised for Alder Lake, but software that calls for these instructions can utilize them on certain 12th Gen Core processors when paired with older versions of the Intel ME firmware. The ME version Intel releases to OEMs and motherboard vendors alongside its upcoming 65 W Core desktop processors, and the Alder Lake-P mobile processors, will prevent AVX-512 from being exposed to software. Intel's reason to deprecate what few client-relevant AVX-512 instructions it had for Core processors could have to do with energy efficiency, as much as the lukewarm reception from client software developers. The instruction set is more relevant to the HPC and cloud-computing markets.
Many thanks to TheoneandonlyMrK for the tip.
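For context on why a firmware change is enough to hide the feature: well-behaved software checks for AVX-512 at runtime before dispatching to it, so once the firmware stops reporting the capability, those code paths simply never run. A minimal sketch of such a check using the GCC/Clang feature-detection builtins (the messages are illustrative):

```c
#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_init/__builtin_cpu_supports are GCC/Clang builtins
     * that query CPUID under the hood. */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        printf("AVX-512F reported: take the 512-bit code path\n");
    else
        printf("AVX-512F not reported: fall back to AVX2/SSE\n");
    return 0;
}
```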
Source:
Igor's Lab
49 Comments on Intel to Disable Rudimentary AVX-512 Support on Alder Lake Processors
Name one program/game not for professional use that uses AVX-512; I bet there aren't many, so disabling AVX-512 on ADL is going to have very minimal impact on its performance in everyday apps anyway.
You have to buy the 4-core/8-thread Xeon for about 200€.
This is just Intel being Intel: profit winning over quality, and long-term planning overthrown by investor panic. It screams of arrogance regained. So much for that rebranding they did just now. Still Intel Inside.
Honestly, I'm staying far away from this crap; as long as I see these shenanigans, not a penny from this wallet.
Our hardware is actually quite good, if only the code for it were better and more low-level. What I am implying is that introducing yet another crutch-like instruction set to x86 won't make the end code smaller and more efficient for us desktop users. If some code monkeys start using this instruction set just for fashion, as a hipster trend, then bad things usually happen. If software triggers these instructions and feeds too much data through the long instruction pipe, the chip overheats, as peak execution lasts much longer and generates more heat. Remember the early Intel burn-test programs that introduced AVX: they were literally FurMark-class, showing temps you never ever see in daily usage. If games triggered it, FPS would tank due to the single-core frequency drop on the main render thread.
That PS3 example is just a rare exception. He's trying to hammer nails with his shoes by using AVX-512, just because it's lazier, and for what, the few people using Xeons now? There was an older instruction set the emulator could have used, but that was omitted due to HW bugs on almost every Intel CPU arch over time too, yet that wasn't bashed around in the media as much... so YOLO.
We are lucky that there are some harsh code maintainers who tame some snowflakes by introducing limitations and automatic optimization in compilers, like Aquinus said.
And compilers don't have a will of their own to decide which ISA to use; that's specified by the developer. See the sketch below.
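To illustrate that point: with GCC, the target ISA is chosen either globally via compiler flags (e.g. -mavx512f) or per function. A hedged sketch using GCC's function multi-versioning attribute (the function name is illustrative, and the attribute is GCC-specific, not portable C):

```c
#include <stddef.h>

/* GCC emits one clone of this function per listed target and generates a
 * resolver that picks the best variant the CPU supports at load time. */
__attribute__((target_clones("avx512f", "avx2", "default")))
void scale(float *x, size_t n, float k)
{
    /* Plain loop: the compiler auto-vectorizes it differently per clone. */
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}
```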
The downclocking argument is 100% BS and you know it. Even if the core runs a few hundred MHz lower, it will still churn through more data, so this is just nonsense. Most vectorized data which benefits from SIMD is larger than 512 bits (64 bytes); 512 bits is tiny.
In fact, using a vector size of 512 bits is genius, as it perfectly matches the cache-line size of current x86 implementations. I wish AMD would come up with a CPU with AVX-512 support which kicks ass, like 4x FMAs and better energy efficiency. It would be a serious powerhouse.

Actually, optimized AVX code is smaller and more cache efficient, not to mention it eliminates a lot of branching, looping, register shuffling and loads/stores. If the computational density is high enough, it offers orders of magnitude higher performance. But not all code is that computationally dense, and much of the kernel code probably is not.

First of all, SIMD is used to some degree in many applications. I'm pretty sure you use it every day. Video playback, compression, web browsing (both compression and encryption), video editing, photo editing, etc. all use AVX/AVX2 or SSE. Without it, many of these things would be dreadfully slow. When popular applications start to get good AVX-512 support, you will not want to be left behind.

What on earth makes you come up with a claim like that? Stop embarrassing yourself.
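To make the width argument concrete: one 512-bit FMA processes 16 fp32 elements per instruction, versus 8 for AVX2, and each unaligned load pulls in exactly one 64-byte cache line's worth of data. A minimal intrinsics sketch of a SAXPY-style kernel (the function name is illustrative; it needs -mavx512f and, for brevity, assumes n is a multiple of 16):

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], processing 16 floats per iteration. */
void saxpy512(float *y, const float *x, float a, size_t n)
{
    __m512 va = _mm512_set1_ps(a);             /* broadcast the scalar */
    for (size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);    /* 64-byte load         */
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);      /* vy = va * vx + vy    */
        _mm512_storeu_ps(y + i, vy);
    }
}
```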
Most AVX operations take only a few clock cycles, and the work done is equivalent to filling up the pipeline many times. That's not how throttling works at all; this is utter nonsense. FYI, AVX-512 is supported by Ice Lake, Tiger Lake, Rocket Lake, Cascade Lake-X and Skylake-X, so not just Xeons. ;)
I think most (if not all) of you have missed the biggest advantage of AVX-512. It's not just AVX2 with double the vector size; it's vastly more flexible and has a better instruction encoding scheme. It's much more than simple fp add/sub/mul/div operations; it will actually allow previously unseen efficiency when implementing dense algorithms, e.g. encoding, encryption, compression, etc., with an efficiency coming relatively close to ASICs.
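One concrete example of that flexibility: AVX-512's mask registers allow per-lane predication, so a conditional that would need branching or blend tricks in AVX2 becomes a single masked operation. A hedged sketch (the function name is illustrative; assumes -mavx512f and n a multiple of 16):

```c
#include <immintrin.h>
#include <stddef.h>

/* Clamp negative elements to zero, branch-free: the compare yields a
 * 16-bit mask, and the masked store writes only the flagged lanes. */
void relu512(float *x, size_t n)
{
    __m512 zero = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 v = _mm512_loadu_ps(x + i);
        /* mask bit i = 1 where v[i] < 0 */
        __mmask16 neg = _mm512_cmp_ps_mask(v, zero, _CMP_LT_OQ);
        _mm512_mask_storeu_ps(x + i, neg, zero);  /* zero only those lanes */
    }
}
```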
But then, neither is a headache if you know what PL1/2 or PPT means and how to change them in the BIOS.
Rocket Lake is the only exception there; the others aren't desktop platforms either, the latter being rebadged Xeons with a cut-down feature set so Intel can tax you even more. You have to pay extra for ECC support by choosing a Xeon, while there is literally nothing stopping it from working on the Skylake-X etc. parts. Intel being Intel.
Cascade Lake-X and Skylake-X exist as non-Xeons.
All S-series CPUs from Intel share dies with Xeons.
stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355
In summary:
AVX-512 instructions work on twice as much data as AVX2 instructions, and 16 times as much as scalar fp32 instructions. So even if a CPU has to drop the clock speed a little bit and there are a few scalar instructions in between the AVX operations, the total throughput is still better. These CPUs constantly scale their core clocks individually. On top of that, using vector operations reduces stress on the instruction cache and eliminates a lot of register shuffling and control-flow instructions, which also means there will be fewer scalar operations to perform. This in turn simplifies the workload for the CPU, resulting in more work completed even though fewer instructions are executed. And contrary to popular opinion, the purpose of a CPU is to execute work, not to run at the highest clock speed!
The fact that Skylake-SP throttles more than desired is an implementation issue, not an ISA issue. And it doesn't make AVX-512 a bad feature, it just reduces the advantage of it.
He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth. I know some encoders will use AVX512, but that's all I know of.
AVX512's biggest problem (as I see it) is the die area can't be justified anymore in times where everybody is fighting for fab capacity.
Look, I'm not saying AVX-512 is useless or bad. I'm just saying it's not the magic bullet you're making it out to be @efikkan. There are plenty of cases where it's not an effective strategy, and a lot of the time you're better off sticking with 256-bit AVX (AVX2) instead, because the clock penalty is very real for these heavy instructions.
And regardless, the heavy load finishing quicker means more time and cycles free for anything else.
Still, none of these are ISA issues. Ice Lake-SP is able to sustain much better clocks with heavy AVX loads, and Sapphire Rapids will do even better. That's a separate subject. And yes, pretty much non-existent in the consumer space. Really? And what kind of alternative would you propose to advance CPU throughput? The application as a whole is irrelevant; in most cases >98% of application code is not performance critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense, it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, data shuffling, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector and scalar operations, and mixes them, of course.
Let's examine what I said:
The application as a whole is irrelevant; in most cases >98% of application code is not performance critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense, it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, data shuffling, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector and scalar operations, and mixes them, of course.
In case it wasn't clear enough: it's the performance-critical code which does all the real computational work. It may be a small portion of the total code base, but it's the code that runs for the majority of the CPU time. That's why optimizing the performance-critical code is what matters. Those who know the first thing about optimizing code know that the most important optimizations are cache optimizations, divided into data cache (1) and instruction cache (2) optimizations. This is important because failing here results in lots of cache misses, and the cost of a cache miss on current x86 CPUs is ~450 clocks, which roughly means each cache miss costs you ~1000-2000+ instructions. And how do you solve this? By packing the data tight, which means it's vectorized. Then you have the instruction cache (2), which has to do with the use of function calls, data locality and computational density (avoiding bloat and extra branching is implied here too). So again, packing the data tight and packing the computational code tight is the key to performance.
So in conclusion, if your code is performant at all, the data will have to be laid out in vectors, the data will have to be traversed linearly, and the code had better have good computational density, because otherwise the CPU time will be spent on cache misses, branch mispredictions, etc. instead of doing real work. So if you can put two and two together, you'll see that this is also the groundwork for using SIMD. And any code that works on vectors >32 bytes (most of them are much larger) will benefit from using AVX-512 over AVX2. A small sketch of the layout point follows.
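To illustrate the data-layout argument: an array-of-structs forces strided access, while a struct-of-arrays layout keeps each field contiguous, so every 64-byte cache line carries 16 useful floats that map directly onto one 512-bit vector. A hedged sketch of the two layouts (all names are illustrative):

```c
#include <stddef.h>

/* Array-of-structs: x, y, z interleaved. A loop touching only .x drags
 * in three times the cache lines it needs and resists vectorization. */
struct particle_aos { float x, y, z; };

/* Struct-of-arrays: each field contiguous and traversed linearly; one
 * 64-byte cache line holds exactly 16 floats, i.e. one __m512. */
struct particles_soa {
    float *x;
    float *y;
    float *z;
    size_t n;
};

/* With SoA, this plain loop is trivially auto-vectorizable to AVX-512. */
void shift_x(struct particles_soa *p, float dx)
{
    for (size_t i = 0; i < p->n; i++)
        p->x[i] += dx;
}
```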
I can't think of any application that would benefit from higher vector-processing throughput but that wouldn't be worth implementing on a GPU instead.
Edit: Maybe a little article from CloudFlare might help show how painful this can be, even in the server setting.
blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/