Monday, January 3rd 2022

Intel to Disable Rudimentary AVX-512 Support on Alder Lake Processors

Intel is reportedly disabling the rudimentary AVX-512 instruction-set support on its 12th Gen Core "Alder Lake" processors using a firmware/ME update, reports Igor's Lab. Intel does not advertise AVX-512 for Alder Lake, even though the instruction set was much publicized for a couple of its past-generation client-segment chips, namely the 11th Gen "Rocket Lake" and 10th Gen "Cascade Lake-X" HEDT processors. The company will likely make AVX-512 a feature that sets apart its next-gen HEDT processors derived from "Sapphire Rapids," its upcoming enterprise microarchitecture.

AVX-512 is technically not advertised for Alder Lake, but software that calls for these instructions can utilize them on certain 12th Gen Core processors when paired with older versions of the Intel ME firmware. The ME version Intel releases to OEMs and motherboard vendors alongside its upcoming 65 W Core desktop processors and the Alder Lake-P mobile processors will prevent AVX-512 from being exposed to software. Intel's reason for deprecating what little client-relevant AVX-512 support it had in Core processors could have to do with energy efficiency as much as with the lukewarm reception from client software developers. The instruction set is more relevant to the HPC and cloud-computing markets.

Many Thanks to TheoneandonlyMrK for the tip.
Source: Igor's Lab

49 Comments on Intel to Disable Rudimentary AVX-512 Support on Alder Lake Processors

#26
Unregistered
Avx512 is no use for gaming anyway, so I don't give a hoot. Benches are just epeen, nothing else. The only thing AVX512 is any good for is serious apps. It seems the only reason to OC is for benches, as a high-performance CPU (5800X/5900X, ADL 12700K/12900K) is gonna be fine for most gaming needs anyway.

Name one program/game not for professional use that uses AVX512; I bet there aren't many. So disabling AVX512 on ADL is going to have very minimal impact on its performance in everyday apps anyway.
#27
Aquinus
Resident Wat-man
TiggerAvx512 is no use for gaming anyway, so i don't give a hoot.
Even beyond that, several compilers will avoid AVX-512 in favor of 256-bit AVX because of the issue with downclocking, which impacts all workloads running at the time the CPU clocks down. The usefulness of AVX-512 depends entirely on the workload and how well the task is vectorized. Even in situations that can use AVX-512, going with 256-bit or 128-bit vectors might actually net you better performance depending on the application. From that perspective, the useful situations for AVX-512 are few and far between outside of the server and HEDT ecosystems. Fun fact: GCC will actually prefer 256-bit AVX on Skylake chips that support AVX-512 for this reason; you're less likely to benefit from 512-bit than from 256-bit, which doesn't have the same downclocking issues. And depending on how well the task is vectorized, you may not need 512-bit either, so why slow down the CPU to use it?
#28
seth1911
Mhm, if anyone needs AVX-512 it's better to get a Rocket Lake than an Alder Lake; Intel seems to be really stupid these days.
IkarugaPS.: i3 processors had ECC support until 9th gen, but nobody bothered (motherboard makers dropped it because there was zero market for it). I personally do think it would be good to have, but apparently most users think otherwise (they are probably enthusiasts like me and want faster RAM, which is a lot harder to do with ECC). :)
Nope, Intel cancelled ECC on the i3 because it was 4 cores/8 threads for about 72€.
You have to buy the 4-core/8-thread Xeon for about 200€.
#29
qubit
Overclocked quantum bit
Intel are such spoilsports. AMD need to come up with a CPU that does support them, performs better and at a cheaper price than the equivalent Alder Lake to give them a good kicking over it.
#30
Aquinus
Resident Wat-man
qubitIntel are such spoilsports. AMD need to come up with a CPU that does support them, performs better and at a cheaper price than the equivalent Alder Lake to give them a good kicking over it.
AMD probably isn't implementing it because they know how much of a niche instruction set it is and how it comes with some very real drawbacks as I mentioned above. AVX-512 isn't a magical solution to everything vector-related and people need to stop treating it that way. A bit flipping once a year is probably still more often than most people are running software that'd benefit from AVX-512 over 256.
#31
mechtech
Intel..........love me a ton of market segmentation
#32
Vayra86
TheoneandonlyMrKDude, it makes CPUs hot when people actually use it/them, we can't have that :p, how can it be good. Massive sarcasm and jest I agree with your points.

I do like Intel leaning on this RUDIMENTARY (ahahaa, my arse) statement. The information gleaned from the web makes them seem disingenuous about this: the E cores don't have it, but the P cores have third-gen AVX-512, no?! And that's "rudimentary", whatever, Intel.

They're just after segregation again the gits.
Its a double win, they're also not having to deal with the gamur sentiment or ('enthusiasts') that try desperately to run AVX512 over their 5 Ghz all core OC on a measly aio. And the fallout that would generate, because I reckon some nuclear plants might explode in the process. Next thing you know they'll tell you not to OC your K-CPUs (oh wait... :D)

This is just Intel being Intel, profit winning over quality and long term planning overthrown by investor panic. It screams of arrogance regained. So much for that rebranding they did just now. Still Intel Inside.

Honestly, I'm staying far away from this crap, as long as I see these shenanigans, no penny from this wallet.
#33
Ferrum Master
IkarugaIf Torvalds wants better products than intel and avx512, then he shouldn't “hope” for the death of new instructions, he should hope for competition like the apple m1 instead, which shows intel (and nvidia) how inefficient their stuff really are. True competition is our only hope against these monsters with their prices and segregation techniques, not death wishes on instructions.
His concern is building and maintaining kernel code, which has already grown fat, and keeping it slim enough and rational. He has seen it all in the x86 branch and knows how to make it happen through his own efforts.

Our hardware is quite good, actually, if the code for it were better and more low-level. What I am implying is that introducing another crutch-like instruction set into x86 won't make the end code smaller and more efficient for us desktop users. If some code monkeys start using this instruction set just for fashion, as a hipster trend, then bad things usually happen. If code triggers these instructions and feeds too much data through the long instruction pipe, the chip heats up more, as peak execution lasts much longer. Remember the early Intel burn-test programs that introduced AVX: they were literally Furmark-class, showing temps you never ever see in your daily usage. If games triggered it, it would be bad and FPS would tank due to the single-core frequency decrease on the main render thread.

That PS3 example is just a rare exception. He tries to hammer nails with his shoe by using AVX512, just because it is lazier, and for what, the few people using Xeons now? There was an older instruction set the emulator could have used, but that was omitted due to HW bugs on almost every Intel CPU arch it appeared in over time, but that wasn't bashed around in the media as much... so YOLO.

We are lucky that there are some harsh code maintainers who tame down some snowflakes by introducing limitations, and automatic optimization in compilers, like Aquinus said.
#34
Unregistered
Vayra86Its a double win, they're also not having to deal with the gamur sentiment or ('enthusiasts') that try desperately to run AVX512 over their 5 Ghz all core OC on a measly aio. And the fallout that would generate, because I reckon some nuclear plants might explode in the process. Next thing you know they'll tell you not to OC your K-CPUs (oh wait... :D)

This is just Intel being Intel, profit winning over quality and long term planning overthrown by investor panic. It screams of arrogance regained. So much for that rebranding they did just now. Still Intel Inside.

Honestly, I'm staying far away from this crap, as long as I see these shenanigans, no penny from this wallet.
If I was loaded, every time a better CPU came out, whoever it was from, I would have one, whether it was AMD or Intel, as the only thing that matters is performance. I'm sure if you are loaded enough, you can build a rig good enough to cool even a 500 W CPU. Heat or power use should not matter to anyone, as long as you can cool it.
#35
efikkan
TiggerAvx512 is no use for gaming anyway, so i don't give a hoot. Benches are just epeen, nothing else. The only thing AVX512 is any good for are serious apps. It seems the only reason to OC is for benches, as a high performance CPU (58/900x, ADL 127/900k) is gonna be fine for most gaming needs anyway.

Name 1 program/game not for professional use that uses AVX512, bet there is not many, so disabling AVX512 on ADL is going to have very minimal impact on it's performance on everyday apps anyway.
It will probably be some time before games start to utilize AVX-512, but there are certainly games which use AVX2, so don't think SIMD isn't useful for games. It's not something that directly benefits FPS, though; if a game relies heavily on SIMD, it usually means that's a minimum requirement, or you have to sacrifice game features if you don't have it.
AquinusEven beyond that, several compilers will avoid AVX-512 in favor of AVX-256 because of the issue with downclocking, which impacts all workloads occurring at the time that the CPU clocks down.
Will they? The big three all have offered support for some time.
And compilers don't have a will of their own to decide which ISA to use, that's specified by the developer.

The downclocking argument is 100% BS and you know it. Even if the core runs a few hundred MHz lower, it will still churn through more data, so this is just nonsense.
AquinusThe usefulness of AVX-512 completely depends on the workload and how well the task is vectorized. Even for situations that can use AVX-512, going with 256 or 128 might actually net you better performance depending on the application.
Most vectorized data which benefits from SIMD is larger than 512 bits (64 bytes); 512 bits is tiny.
In fact, using a vector size of 512 bit is genius, as it perfectly matches the cache line size of current x86 implementations.
qubitIntel are such spoilsports. AMD need to come up with a CPU that does support them, performs better and at a cheaper price than the equivalent Alder Lake to give them a good kicking over it.
I wish AMD would come up with a CPU with AVX-512 support which kicks ass, like 4x FMAs and better energy efficiency. It would be a serious powerhouse.
Ferrum MasterWhat I am implying introducing another instruction set like crutch to the X86 won't make the end code smaller and more efficient for us desktop users.
Actually, optimized AVX code is smaller and more cache-efficient, not to mention it eliminates a lot of branching, looping, register shuffling and loads/stores. If the computational density is high enough, it offers orders of magnitude higher performance. But not all code is that computationally dense, and much of the kernel code probably is not.
Ferrum MasterIf some code monkeys will start to use this instruction set just for fashion as a hipster trend, then bad things usually happen.
First of all, SIMD is used to some degree in many applications. I'm pretty sure you use it every day. Video playback, compression, web browsing (both compression and encryption), video editing, photo editing, etc. all use AVX/AVX2 or SSE. Without it, many of these things would be dreadfully slow. When popular applications start to get good AVX-512 support, you will not want to be left behind.
Ferrum MasterThis time, if it would trigger the instruction and feed too many data through the long instruction pipe, it overheats as it has much longer peak execution and heats up more.
What on earth makes you come up with a claim like that? Stop embarrassing yourself.
Most AVX operations are a few clock cycles, and the work done is equivalent to filling up the pipeline many times.
Ferrum MasterRemember the early Intel burn test program introducing AVX to them, made them literately furmark class showing temps you never ever see in your daily usage. If programs would trigger it like on gaming, it would be rad and FPS would tank due to single core frequency decrease for the main render thread.
That's not how throttling works at all. This is utter nonsense.
Ferrum MasterThat's PS3 example is just a rare exception. He tries to hammer nails using his shoes by using AVX512, just because it is more lazy and for what, few people using Xeons now?
FYI, AVX-512 is supported by Ice Lake, Tiger Lake, Rocket Lake, Cascade Lake-X and Skylake-X, so not just Xeons. ;)

I think most (if not all) of you have missed the biggest advantage of AVX-512. It's not just AVX2 with double the vector size; it's vastly more flexible and has a better instruction-encoding scheme. It's much more than simple fp add/sub/mul/div operations; it will allow previously unseen efficiency when implementing dense algorithms, e.g. encoding, encryption, compression, etc., with an efficiency coming relatively close to ASICs.
#36
Aquinus
Resident Wat-man
efikkanWill they? The big three all have offered support for some time.
And compilers don't have a will of their own to decide which ISA to use, that's specified by the developer.
Stop. `gcc -march=skylake-avx512` defaults to `-mprefer-vector-width=256`. You can override it, but that's the default out of the box and there is a reason for it.
efikkanMost vectorized data which benefits from SIMD are greater than 512 bits (64 bytes). 512 bits is tiny.
In fact, using a vector size of 512 bit is genius, as it perfectly matches the cache line size of current x86 implementations.
Except it comes with the very real drawback that it slows down everything that isn't SIMD. It's great for servers and HEDT, but it sucks for consumer hardware. I'm not saying that AVX-512 is useless. I'm saying for the average user, it's useless.
#37
chrcoluk
Vya DomusI bet they want it disabled just so that people can't run AVX 512 benchmarks that would expose even more laughable power consumption figures. Other than that it speeds up the validation process and practically no consumer software needs AVX 512, so it's completely irrelevant whether it's there or not.


That doesn't really mean anything from an emulation standpoint; at the end of the day you still need to emulate more or less the same thing irrespective of the ISA. The reason you couldn't use CUDA or OpenCL is not because the CPU is RISC but because of the software that runs on those SPEs, which needs complex thread-synchronization logic that simply can't be done on a GPU. The PS3 GPU is documented; it's just some run-of-the-mill 7000GTX-series Nvidia architecture, nothing special there, so there is no point in trying to use anything other than OpenGL or any other graphics API.
I agree. I think it was only enabled for marketing wins on benchmarks that enable AVX-512, but then the realisation kicked in that the power numbers were overcoming any positive PR, hence the new marketing decision to disable it.
#38
AusWolf
TheinsanegamerNApparently you've missed my criticism of intel, AMD, AND Nvidia for pushing their parts out of the efficiency sweet spot (and eliminating OC headroom for us) for years now.

Power use can easily be limited to one number. Mobile parts, T-series parts, and even normal desktop parts have power draw limits. In fact, such limits are a thing according to Intel. Intel, however, is very mushy on the actual limits of PL2/3 power draw and time limits as well, things that should be enforced by default then turned off for OC, not the other way around. Most importantly, they need to be consistent, as right now all these board makers can be "in spec" yet have wildly different power draws and time limits.

This wasn't an issue before the boost wars; boost timing and power draw limits were pretty clear in the Nehalem/Sandy Bridge era. AMD today is still more stringent on how much juice Ryzen can pull to boost. Intel has been playing fast and loose for years, and it's a headache to keep track of.
I agree, except that AMD is just as much of a headache as Intel is, with their TDP meaning nothing, and PPT being a totally different and undisclosed value.

But then, neither is a headache if you know what PL1/2 or PPT means and how to change them in the BIOS.
#39
Ferrum Master
efikkanFYI, AVX-512 is supported by Ice Lake, Tiger Lake, Rocket Lake, Cascade Lake-X and Skylake-X, so not just Xeons. ;)
None of your points are valid; everyone here is telling you the same thing. AVX512 for consumers is not needed; its workloads produce more heat and slow down the system.

Rocket Lake is the only exception there; the others aren't desktop platforms either, the latter being rebadged Xeons with a cut-down feature set to extract even more of a tax. You have to pay extra for ECC support by choosing a Xeon, while there is literally nothing that stops it from working on the Skylake-X etc. parts. Intel being Intel.
#40
efikkan
AquinusExcept it comes with the very real drawback that it slows down everything that isn't SIMD. It's great for servers and HEDT, but it sucks for consumer hardware. I'm not saying that AVX-512 is useless. I'm saying for the average user, it's useless.
No, it does not slow down everything else. Any time a core throttles from heavy use of AVX-512, the performance gained from it will greatly outweigh the minor downclock.
Ferrum MasterAll your points are not valid, all guys here are telling the same. AVX512 for consumers is not needed, its workload produces more heat and slows down the system.
No, it does not slow down the system. This is complete nonsense.
Ferrum MasterRocket Lake is the only exception there; the others aren't desktop platforms either, the latter being rebadged Xeons with a cut-down feature set to extract even more of a tax. You have to pay extra for ECC support by choosing a Xeon, while there is literally nothing that stops it from working on the Skylake-X etc. parts. Intel being Intel.
Ice Lake-U/-Y and Tiger Lake-U/-Y/-H are not Xeons, these are high volume consumer products.
Cascade Lake-X and Skylake-X exist as non-Xeons.

All S-series CPUs from Intel share dies with Xeons.
#41
Aquinus
Resident Wat-man
efikkanNo, it does not slow down the system. This is complete nonsense.
Yes, it does. FP AVX-512 is the worst offender. This reply on StackOverflow describes what's going on pretty well. The only thing that's complete nonsense is how you're pushing on this so hard when this is a very easy thing to validate.

stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355

In summary:
Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 licenses. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.
#42
efikkan
AquinusYes, it does. FP AVX-512 is the worst offender. This reply on StackOverflow describes what's going on pretty well. The only thing that's complete nonsense is how you're pushing on this so hard when this is a very easy thing to validate.
Once again, you clearly demonstrate that you don't understand the subject.
AVX512 instructions work on twice as much data as AVX2 instructions, and 16 times as much as scalar fp32 instructions. So even if a CPU has to drop the clock speed a little and there are a few scalar instructions in between the AVX operations, the total throughput is still better. These CPUs constantly scale core clocks individually. On top of that, using vector operations reduces stress on the instruction cache and eliminates a lot of register shuffling and control-flow instructions, which also means there are fewer scalar operations to perform. This in turn simplifies the workload for the CPU, resulting in more work completed even though fewer instructions are executed. And contrary to popular opinion, the purpose of a CPU is to execute work, not to run at the highest clock speed!

The fact that Skylake-SP throttles more than desired is an implementation issue, not an ISA issue. And it doesn't make AVX-512 a bad feature, it just reduces the advantage of it.
#43
bug
@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth. I know some encoders will use AVX512, but that's all I know of.

AVX512's biggest problem (as I see it) is the die area can't be justified anymore in times where everybody is fighting for fab capacity.
#44
Aquinus
Resident Wat-man
efikkanOnce again, you clearly demonstrate that you don't understand the subject.
AVX512 instructions work on twice as much data as AVX2 instructions, and 16 times as much as scalar fp32 instructions. So even if a CPU has to drop the clock speed a little bit and there are a few scalar instructions in-between the AVX operations, the total throughput is still better. These CPUs constantly scales the core clocks individually. On top of that, using vector operations reduces stress on instruction cache and eliminates a lot of register shuffling and instructions for control flow, which also means there will be fewer scalar operations to be performed. This in turn simplifies the workload for the CPU resulting in more work completed even though fewer instructions are executed. And contrary to popular opinion, the purpose of a CPU is to execute work, not run at the highest clock speed!

The fact that Skylake-SP throttles more than desired is an implementation issue, not an ISA issue. And it doesn't make AVX-512 a bad feature, it just reduces the advantage of it.
Sure, if your workload is purely vector operations. That's not a realistic workload for most applications, even more so in the consumer space. No application contains only AVX instructions and nothing else. As that StackOverflow answer mentioned, it depends on how much you're using these vector units. Even for the L1 license there is a cost that needs to be considered, never mind the hit at L2, which is far more pronounced.
bug@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth. I know some encoders will use AVX512, but that's all I know of.

AVX512's biggest problem (as I see it) is the die area can't be justified anymore in times where everybody is fighting for fab capacity.
At least one person understands what I'm saying. Hell, even GCC opts for 256-bit AVX on chips that support 512 because the cost, most of the time, isn't worth it. If it were such a magical solution, it would be the preferred default, but it's not, for this reason.

Look, I'm not saying AVX-512 is useless or bad. I'm just saying it's not the magic bullet you're making it out to be, @efikkan. There are plenty of cases where it's not an effective strategy and you're better off sticking with something like AVX-256 instead, because the clock penalty is very real for these heavy instructions.
#45
efikkan
bug@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
That wouldn't happen unless the CPU reaches its thermal limit, and keep in mind that entails putting heavy AVX-512 loads on most if not all cores.
And regardless, the heavy load finishing quicker means more time and cycles free for anything else.
Still, none of these are ISA issues. Ice Lake-SP is able to sustain much better clocks with heavy AVX loads, and Sapphire Rapids will do it even better.
bugHe's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth.
That's a separate subject. And yes, pretty much non-existent in the consumer space.
bugAVX512's biggest problem (as I see it) is the die area can't be justified anymore in times where everybody is fighting for fab capacity.
Really? And what kind of alternative would you propose to advance CPU throughput?
AquinusSure, if your workload is purely vector operations. That's not a realistic workload for most applications, even more so in the consumer space. No application has only AVX instructions sans anything else.
The application as a whole is irrelevant, in most cases >98% of application code is not performance critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU does seamlessly switch between vector operations and scalar operations, and mix them, of course.
#46
Aquinus
Resident Wat-man
efikkanThat wouldn't happen unless the CPU reaches its thermal limit, and keep in mind that entails putting heavy AVX-512 loads no most if not all cores.
And regardless, the heavy load finishing quicker means more time and cycles free for anything else.
Still, none of these are ISA issues. Ice Lake-SP is able to sustain much better clocks with heavy AVX loads, and Sapphire Rapids will do it even better.
Clock speeds decrease as more cores use AVX-512, regardless of the CPU's thermal state. What you just described is not how Intel processors work with heavy instructions that hit the L2 license. Even with one core you have reduced clocks, but it gets a lot worse the more cores you use. In my example from StackOverflow you can see that with the Xeon Gold 5120, by the time you're at 5 cores with AVX-512 heavy instructions, you're down to 1.9 GHz. That has nothing to do with thermal throttling and everything to do with how Intel handles L1 and L2 licenses.
efikkanIn most cases where AVX is used, it's only a few lines of code in a handful functions, and the CPU does seemlessy switch between vector operations and scalar operations, and mix them of course.
That's kind of my point. The clock speed hit is very real even if it's just for a few instructions, and that impacts everything else until the CPU switches back to L0 or L1, which takes time. You need to actually read the article I sent, because it explains all of this. The only time AVX-512 is going to shine is if the majority of the work being done can be vectorized, not if it's sprinkled throughout your application. That's actually the case where AVX-256 is far more advantageous. So thank you for proving my point...
#47
efikkan
AquinusThe only time AVX-512 is going to shine is if the majority of the work being done can be vectorized, not if it's sprinkled out throughout your application. That's actually the case where AVX-256 is far more advantageous. So thank you for proving my point...
If you seriously think I proved your point, then you don't understand the subject at all :facepalm:
Let's examine what I said;
The application as a whole is irrelevant, in most cases >98% of application code is not performance critical at all.
What matters is the code in performance critical paths, and if they are computationally dense they are generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work done can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful functions, and the CPU does seamlessly switch between vector operations and scalar operations, and mix them of course.

In case it wasn't clear enough: it's the performance-critical code which does all the real computational work. It may be a small portion of the total code base, but it's the code that runs the majority of the CPU time. That's why optimizing the performance-critical code is what matters. Those who know the first thing about optimizing code know that the most important type of optimizations are cache optimizations, divided into data cache(1) and instruction cache(2) optimizations. This is important because failing to do so results in lots of cache misses, and the cost of a cache miss on current x86 CPUs is ~450 clocks, which roughly means each cache miss costs you ~1000-2000+ instructions. And how do you solve this? By packing the data tight, which means it's vectorized. Then you have the instruction cache(2), which has to do with usage of function calls, data locality and computational density (avoiding bloat and extra branching is implied here too). So again, packing the data tight and packing the computational code tight is the key to performance.
So in conclusion, if your code is performant at all, the data will have to be laid out in vectors, the data will have to be traversed linearly, and the code had better have good computational density, because otherwise the CPU time will be spent on cache misses, branch mispredictions, etc. instead of doing real work. So if you can put two and two together, you'll see that this is also the groundwork for using SIMD. And any code that works on vectors >32 bytes (most of them are much larger) will benefit from using AVX-512 over AVX2.
#48
Vya Domus
Wide SIMD support is and will always remain counterproductive and nonsensical from a practical point of view, even in the datacenter space. Whatever can be improved by a wider SIMD ISA can simply be delegated to a GPU: stuff like ML, video encoding/decoding, etc.

I can't think of any application that would benefit from higher throughput in terms of vector processing but that wouldn't be worthwhile implementing on a GPU.
#49
Aquinus
Resident Wat-man
efikkanIf you seriously think I proved your point, then you don't understand the subject at all :facepalm:
Let's examine what I said;
The application as a whole is irrelevant, in most cases >98% of application code is not performance critical at all.
What matters is the code in performance critical paths, and if they are computationally dense they are generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work done can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful functions, and the CPU does seamlessly switch between vector operations and scalar operations, and mix them of course.

In case it wasn't clear enough: it's the performance-critical code which does all the real computational work. It may be a small portion of the total code base, but it's the code that runs the majority of the CPU time. That's why optimizing the performance-critical code is what matters. Those who know the first thing about optimizing code know that the most important type of optimizations are cache optimizations, divided into data cache(1) and instruction cache(2) optimizations. This is important because failing to do so results in lots of cache misses, and the cost of a cache miss on current x86 CPUs is ~450 clocks, which roughly means each cache miss costs you ~1000-2000+ instructions. And how do you solve this? By packing the data tight, which means it's vectorized. Then you have the instruction cache(2), which has to do with usage of function calls, data locality and computational density (avoiding bloat and extra branching is implied here too). So again, packing the data tight and packing the computational code tight is the key to performance.
So in conclusion, if your code is performant at all, the data will have to be laid out in vectors, the data will have to be traversed linearly, and the code had better have good computational density, because otherwise the CPU time will be spent on cache misses, branch mispredictions, etc. instead of doing real work. So if you can put two and two together, you'll see that this is also the groundwork for using SIMD. And any code that works on vectors >32 bytes (most of them are much larger) will benefit from using AVX-512 over AVX2.
No, it won't result in more cache misses. You still need to read all of that data to populate the SIMD unit. Whether you get a cache hit or miss depends on what was done with that data beforehand, how often it's been used, etc. You're making a lot of claims here, and a lot of them are flat out incorrect. I suggest you start citing sources if you're going to play this game. I at least provided something to show that there is a cost to using heavy SIMD instructions. You're just repeating yourself incessantly. Let's just cut to the part where you provide evidence for your claims.

Edit: Maybe a little article from CloudFlare might help show how painful this can be, even in the server setting.
blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/