Sunday, July 12th 2020
Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks
"I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on." These were the words of Linux and Git creator Linus Torvalds in a mailing list, expressing his displeasure over "Alder Lake" lacking AVX-512. Torvalds also cautioned against placing too much weightage on floating-point performance benchmarks, particularly those that take advantage of exotic new instruction sets that have a fragmented and varied implementation across product lines.
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocated that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance."Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
Source: Phoronix
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocated that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance."Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
Comments
Truth be told, though, I kinda agree with Torvalds? I mean, AVX has a history of generating more heat and of introducing a performance penalty in mixed workloads (triggered either by a single instruction or by exceeding a certain number of them, depending on the specific instruction). On top of that, AVX-512 is a multitude of sub-extensions that are not necessarily all available together, even if you want them, probably due to Intel's habit of aggressively cutting off features for market segmentation.
That's the core of his rant. He's also upset that they don't just make a better FPU.
The background for this topic is the addition of Alder Lake support in GCC, which lacked AVX-512. It remains to be seen whether the cores themselves lack the feature, or only parts of the chip do. I assume that following all this noise, Intel will make some sort of statement.
I'm also disappointed with the adoption rate of AVX-512, but that doesn't make it a gimmick. It holds incredible performance potential and increased flexibility over AVX2. But what annoys me much more is Intel's complete lack of support for any AVX in their Pentium/Celeron processors, which is unnecessary fragmentation and holds back mainstream software from embracing modern features. Why do you need FP64 on GPUs? Please elaborate.
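And the practical cost of that fragmentation is that anyone who wants the fast path has to write runtime dispatch. A minimal sketch of that boilerplate is below; __builtin_cpu_supports() is a real GCC/Clang builtin, while the three branches are just placeholders for real kernels:

```c
/* Minimal sketch of the runtime dispatch that ISA fragmentation forces on
 * developers. __builtin_cpu_supports() is a GCC/Clang builtin; the three
 * code paths here are placeholders. */
#include <stdio.h>

int main(void)
{
    if (__builtin_cpu_supports("avx512f"))
        puts("using AVX-512 path");
    else if (__builtin_cpu_supports("avx2"))
        puts("using AVX2 path");
    else
        puts("using scalar fallback");  /* current Pentiums/Celerons land here */
    return 0;
}
```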
It used to be that the choice of instruction had isolated impact and predictable results. Instructions neither slowed down other code around them nor impacted code running on other cores. They were almost free to use and a benefit when used correctly.
The problem is that when mixing code and mixing running tasks, AVX-512 et al. reduce the clock speed, impacting the integer code running in the same thread AND ALL OTHER running threads on the same processor. It slows down all integer and non-AVX FP code running on ALL cores. Compilers cannot know at compile time what the performance impact will be for users at runtime. The OS cannot know the performance impact that occurs at runtime when scheduling a mixture of threads. Fairness and predictable performance go out the window. The best choice for fairness and predictable performance is to IGNORE occasional use of AVX. It may be nice for a computer or server dedicated to a single task that benefits from these instructions, but the typical general user is hurt more than helped by them. Cloud and VM users are hurt by them. Arbitrary and occasional use of them impacts all running code, so the OS should avoid using them.
It would be OK if the processor could maintain clock speed while using exotic instructions. They would have to be engineered to increase the stages/cycles required to complete the more complex work, but maintain clock speed at all costs. I would much rather have more FP units that are simpler, for greater throughput and flexibility. It's fine if you pipeline the multiply into the add and get the result slightly later than AVX-512 would deliver it, as long as it doesn't slow down the rest of the code. Just because you can use an AVX instruction doesn't mean you should.
Whether CPUs with AVX downclock is a mixture of yes and no. The clock speed impact also varies according to the CPU model and many other variables.
I agree with Linus, it shouldn't be this complicated and problematic.
AVX code is, if anything, much more predictable, since the throughput is more consistent, cache lines are used more effectively, and there is less branching. Firstly, scalar FP operations, SSE and AVX are all fed into the same vector units; the only difference is how full the vector registers are. Intel has two full AVX-512 FMA units; to match that FP32 throughput with single FPUs you would need 32 of them, and you would also need the circuitry to handle all of them writing back to the same cache lines without adding pipeline steps. Then the instruction stream would be at least 16x larger, meaning you would have to increase the instruction cache >10x and probably L2 a bit as well; the instruction window would have to increase ~10x, and the prefetcher, branch predictor, etc. would need to work much more efficiently. And even if you managed all this, you had better pray that the compiler has unrolled all loops aggressively, because otherwise there is no way you are going to feed your 32 hungry FPUs. :rolleyes:
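To make the "same vector units, just fuller registers" point concrete, here is a rough AXPY-style kernel using AVX2 FMA intrinsics (illustrative only; assumes n is a multiple of 8, compile with -mfma -mavx2). The AVX-512 version would be the same code with __m512 types and a stride of 16:

```c
#include <immintrin.h>

/* y[i] = a*x[i] + y[i], one 256-bit fused multiply-add per iteration.
 * n is assumed to be a multiple of 8 to keep the sketch short. */
void saxpy_avx2(float a, const float *x, float *y, int n)
{
    __m256 va = _mm256_set1_ps(a);            /* broadcast a into all 8 lanes */
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats */
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);     /* a*x + y in one instruction */
        _mm256_storeu_ps(y + i, vy);          /* store 8 results */
    }
}
```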
If you have a rough understanding of how CPUs work, you have probably realized by now that your suggestion was short-sighted.
Probably many things, but one I know of quite well is engineering simulations.
Thousands and thousands of engineers rely on Xeons every day to run their finite-element and finite-difference type analyses (mechanical FE, CFD, electromagnetics, etc.).
For FE specifically, you spec a machine like this -> as many AVX2/AVX-512 cores as you can get away with, and nCores × ~8 GB of ECC RAM. Turn off hyperthreading and go have fun.
It's a big market for Intel, and increasingly nVidia (new codes start to introduce GPU FP64 slowly, but typically require CUDA, so no luck for AMD).
As I alluded to in #10, there would probably be some kind of response.
Videocardz (if we can trust them) has some clarifications: link
So it may appear that the big cores offer more ISA features.
Can't say I've ever run into AVX-512 so far?
Set it (the AVX offset) to 5 and clocks have never dropped that far.
Besides, AVX-512 is found only in high-end desktop processors (Core i7 or i9) or Xeons, and for whatever reason, on some specific mobile chips.
On top of that, while there is a subset that is sort of available on every Intel CPU that "supports" AVX-512, there are some instructions that are only found on specific CPUs. Tiger Lake has not even launched yet, if I remember correctly.
AVX impact is relative, apparently, according to this
stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355
TL;DR: it seems to affect only turbo frequencies in the first place, and how much it downclocks depends on the type and number of instructions executed. AVX-512 triggers this throttling a bit more, while AVX and AVX2 do so less, or not at all.
Yep, my prior X299/7900X had it and so does my current 9940X.
Z490/10900K does not, nor does X99/5930K.
Once client applications start to utilize it, it will offer significant performance and efficiency gains, even for low-power laptops.*
*) Except Atom, Pentium and Celeron of course.
After all, faster integer performance will... let your computer send E-mail faster? Gaming uses floating-point too, so improving the power of chips for HPC applications will make them more powerful for everyone.
But maybe Linus Torvalds is at least partly right. Maybe it's time to split the processor line-up, to offer a choice between chips that have high floating-point performance, and other chips that tilt more towards integer performance, so that one can buy a processor appropriate to one's workload.
2x 256-bit multiply-and-adds, 2x 256-bit loads from L1 cache and 1x 256-bit store to L1 cache... per clock tick, with something like 5-cycle latency. Outside of going to 512 bits, how exactly do you expect Intel to improve upon that? AVX512 simply changes that to 2x 512-bit multiply-and-adds, 2x 512-bit loads and 1x 512-bit store. It's the most obvious way to improve the SIMD / FPU unit.
EDIT: Apparently 2x multiply-and-adds are supported per clock on Skylake, according to software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3508,3922,2581&techs=FMA&text=fmadd. Still, that's 16 flops per cycle. Hard to imagine how to make this 2x better aside from the "obvious" extension to 512 bits.
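Spelling out the arithmetic behind that 16: 2 FMA units × 4 double-precision lanes per 256-bit register × 2 ops per FMA (a multiply plus an add) = 16 FLOPs per cycle. Going to 512-bit registers doubles the lane count, so 2 × 8 × 2 = 32 FLOPs per cycle, which is exactly the "obvious" 2x.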
------
SIMD FPU multiply is higher performance than 64-bit integer multiply, lol. (To be fair: SIMD FP multiply is easier at only 53 bits of mantissa (double precision), but still...) But virtually everything has a "memset(blah, 0, ...)" somewhere, and this memset code is almost always compiled into SIMD in my experience (be it 128-bit SSE, 256-bit AVX, or 512-bit AVX-512 code).
GCC and Clang have surprisingly good auto-vectorizers that can turn many simple for-loops into SIMD-accelerated versions. AVX-512 has literally double the performance of 256-bit AVX2 with memset, memcmp, memcpy, strcmp, strcpy, etc. (Note: AVX does NOT support integer operations. You need AVX2 in your compile flags, as well as an AVX2 CPU.)
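As a quick illustration (a sketch, not a benchmark), both GCC and Clang will auto-vectorize a loop like this at -O3; compare the output with -mavx2 vs. -mavx512f on a compiler explorer to watch the vector width double:

```c
#include <stddef.h>

/* restrict tells the compiler the arrays don't alias, which is what
 * lets the auto-vectorizer kick in without runtime overlap checks. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* becomes vaddps on ymm or zmm registers */
}
```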
The 512-bit-wide data path extends all the way to the L2 cache... meaning the memcmp / memcpy / etc. bonus applies to a huge amount of C code automatically.
This does, however, require the front-end to be able to decode and issue micro-ops faster, have a larger instruction window, etc., and even then it runs the risk of underutilization. I do expect that we will eventually move to 3 or even 4 FMA sets in desktop CPUs, but the architectures will need to evolve a lot to facilitate that.
One interesting bit is the rumor about Zen 3 offering 50% higher FPU performance. If true, I do wonder if they added more units, or if they improved them somehow. They do [have good auto-vectorizers], and software can get a good portion of free performance simply by enabling these instructions.
But still, the huge performance gains require tailored code using intrinsics, which is unfortunately a bit too difficult for most programmers. I do hope we get to a point where compilers are able to convert somewhat more complex calculations into pretty optimal AVX, provided you have cache-optimized your data, etc.
One of the interesting things about AVX is the vast feature set, which extends far beyond just arithmetic. It also supports things like comparisons with masks, which essentially lets you do conditionals without branching logic, and the feature set of AVX-512 is almost like a new instruction set. The potential here is huge, but it's still "inaccessible" to most programmers. If we get to a point where clean C code can be compiled into decent AVX instructions, even with more complex calculations and some basic conditionals, that would be huge for the adoption of AVX. One thing that comes to mind is that the 512-bit vector size fits very well with the cache line size.
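Here's a sketch of what that masking looks like in practice, using AVX-512F intrinsics to clamp negative values to zero without a single branch (n is assumed a multiple of 16 for brevity; compile with -mavx512f):

```c
#include <immintrin.h>

/* Branchless equivalent of: if (x[i] < 0.0f) x[i] = 0.0f; */
void clamp_negatives(float *x, int n)
{
    __m512 zero = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_loadu_ps(x + i);
        /* one mask bit per lane where v < 0 */
        __mmask16 m = _mm512_cmp_ps_mask(v, zero, _CMP_LT_OS);
        /* masked lanes take zero, the rest keep v */
        v = _mm512_mask_blend_ps(m, v, zero);
        _mm512_storeu_ps(x + i, v);
    }
}
```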
Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.
AVX and AVX2 are over a decade behind GPU SIMD compute. AVX-512 finally brings CPU auto-vectorizers to parity with what GPUs have been doing since 2006. AVX-512 is actually a really well designed instruction set... but Intel is certainly messing up the business side of things, IMO.