Sunday, July 12th 2020
Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks
"I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on." These were the words of Linux and Git creator Linus Torvalds in a mailing list, expressing his displeasure over "Alder Lake" lacking AVX-512. Torvalds also cautioned against placing too much weightage on floating-point performance benchmarks, particularly those that take advantage of exotic new instruction sets that have a fragmented and varied implementation across product lines.
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocated that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance."Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
Source:
Phoronix
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocated that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance."Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
42 Comments on Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks
The only part I could agree about is some of the more application specific instructions (like "AI" stuff). I believe a standard ISA should be generic compute and logic, not application specific. So in my opinion, throw out all the AES, zip, jpeg(!) etc. acceleration instructions, and give us four 512-bit FMA-sets instead.
Consider your typical "comparison" for a sorting problem. You'd think you need an "if" statement, but in reality... you can make due with:
The max/min version of the code is branchless at the lowest level, thanks to instructions like vpmaxud. And all of a sudden, your for-loop starts to look far more auto-vectorizable and branchless.
AVX512's main issues are business related. It's locked out of mainstream Skylake chips (typical i7s), so it's not really a common compilation target. It was originally a Knights Landing (aka Xeon Phi) feature, and that platform is a dead-end. I suggest reading through this dissertation by the way: www.cs.cmu.edu/~guyb/papers/Ble90.pdf
Blelloch's dissertation from 1990 would seem out-of-date at first glance. But in reality, modern SIMD machines (both AVX512 and GPUs) are heavily based on the CM5 machine he used as the basis of his dissertation. As such, his dissertation reads amazingly close to modern machines.
Dr. Blelloch's more recent papers map more closely to modern machines: www.cs.cmu.edu/~guyb/
Just some food for thought. I wouldn't try to do the "flattened nested parallelism" from the top down in every algorithm. It's unlikely to be fast on all modern architectures. But what's interesting is that Dr. Blelloch has proven an equivalence between recursive definitions and prefix-scan operations. As such, we have a "universal gadget" to convert recursive forms of algorithms into prefix-sum, prefix-max, and similar operations.
Not that the gadget is always efficient on a modern SIMD machine. It's absolutely not... but maybe restating the problem in a prefix-sum style provides insight and gives you ideas for a more efficient algorithm.
---------
You don't have to go very far to be amazed. In as early as Chapter 1, Dr. Blelloch converts recursive quicksort (yes, quicksort) into prefix sum operations.
Some good corona-times reading :)
If I understand the concept of the AVX offset correctly, it's a setting that tells the processor how far to down-clock from its highest speed. With a setting of 5, the processor down-clocks by 500 MHz (5 multiplier bins) whenever it executes AVX instructions.
stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355
en.wikichip.org/wiki/intel/frequency_behavior
It's documented behavior that Intel processors have different frequency sets according to whatever is running on it.
The purpose of the AVX offset is to let overclockers push non-AVX workloads to a higher clock speed. The CPUs are superscalar, so technically they can execute integer instructions and vector instructions at the same time, and they often do. E.g. you have a loop with dense math: the math is AVX, but the loop bookkeeping is not. But it's not a problem, as the alternative would be executing much more code, so even if a few instructions technically run slower, the overall workload is still a lot faster.
Running the same calculations with AVX greatly reduces the instruction count and the clock cycles needed. It also lets the compiler unroll loops more aggressively, which further reduces the loop overhead and the branching associated with it. And denser code also helps data caches, instruction caches, data dependencies and branch prediction, as the logic is more compact.
My perspective is that these microarchitectural issues (i.e. downclocking and the like) will absolutely change by the next major "tick-tock" architecture from Intel. Intel's first implementation of any SIMD extension has always been crappy.
When AVX was first released, it was executed 128-bits at a time (Sandy Bridge). It was missing integer instructions: that's right, you could do 53-bit double-precision multiplies but you couldn't do 32-bit integer multiplies. All sorts of terrible. Eventually, Haswell + AVX2 came out and fixed the issues, finally making the AVX transition mostly worthwhile over SSE instructions. But all of the flamewars from the early 2010s about "is AVX worth it" look hopelessly outdated in today's environment.
I guess my point is... don't judge the AVX512 instruction set based on its current implementation (i.e. Skylake-X). Skylake-X is clearly a "bad" implementation of AVX512. We should instead judge AVX512 based on its future viability. Focusing too much on Skylake-X's performance quirks will only make our comments obsolete more quickly.
-------------
Case in point: the CNS AVX512 chip (yeah, a Via chip. Surprise!!) can support AVX512 at full clock speeds. It does this by implementing all AVX512 instructions as 256-bit operations executed over two clock ticks. No downclocking involved at all. Maybe this 2x256-bit methodology will be superior in the future, and Intel will copy it. Or maybe Intel figures out the 512-bit power issues and removes the need for downclocking.
Even as a 2x256-bit implementation, AVX512 has enough bonuses (auto-vectorization instructions, opcode masks, scatter instructions, extended register sets) that it's worthwhile to use.
And if that chart is correct, a larger subset available on more mainstream CPUs (not just top-of-the-line Extreme Edition CPUs or Xeons) could make it worthwhile for devs and programmers of all kinds of work to use it.
Via's decision to do it over two cycles probably has to do with saving die space. Zen(1) did something similar with AVX2. While those charts might look a bit intimidating, most of the common features are covered by the F and CD sets, and those also require the most die space.
BTW; you can see the massive list of instructions in the F set here.