Monday, March 1st 2021
AMD "Zen 4" Microarchitecture to Support AVX-512
The next-generation "Zen 4" CPU microarchitecture powering AMD's 4th Gen EPYC "Genoa" enterprise processors will support 512-bit AVX instruction sets, according to an alleged company slide leaked to the web on the ChipHell forums. The slide references "AVX3-512" support in addition to BFloat16 and "other ISA extensions." This would make "Zen 4" the first AMD microarchitecture to support AVX-512. It remains to be seen which specific instructions the architecture supports, and whether all of them are available to both the enterprise and client implementations of "Zen 4," or whether AMD would take an approach similar to Intel's, enabling only certain "relevant" instructions on the client parts. The slide also mentions core counts "greater than 64," corresponding with our story from earlier today.
Sources:
ChipHell Forums, via VideoCardz
43 Comments on AMD "Zen 4" Microarchitecture to Support AVX-512
AVX-512 is hard to program for effectively. Even if you manage to find a workload that benefits from it, the power scaling will more than likely ruin your day. Unless all you do on that entire system is crunching numbers for a long time.
Sources:
- blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/ (For people not too familiar with programming, OpenSSL is a very popular implementation of TLS and other encryption algorithms)
- www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-On-AVX-512
Edit:
On Intel's newest platforms, Ice Lake (2019) and up, this effect is greatly minimized. Thanks @dragontamer5788 for pointing this out.
travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
The AVX512 power-scaling problem is null-and-void as of Ice Lake / Rocket Lake. Just because Skylake-X had a poor AVX512 implementation doesn't mean that the next processors have that same issue. In fact: Intel commonly has "poor implementations" of SIMD for a generation or two. (Ex: AVX on Sandy Bridge was utter crap and not worthwhile until Haswell). The point of the early implementations is to get assembly-programmers used to the instruction set before a newer, better processor implements the ISA for real.
e.g. the latest Pentium G6600 (LGA1200).
So if you're a buyer today, you should probably not care about it unless you plan to use custom software that benefits from it, or you do application development.
Keep in mind that we are talking about Zen 4 here, which is probably 1-1.5 years away and will stay in the market for >2 years, so by then AVX-512 may be very much relevant. If Photoshop, Blender or ffmpeg suddenly starts using it, then it will matter for many, and people will not be going back.
There is also some interesting movement in the Linux ecosystem, where Red Hat has been pushing x86 feature levels to more easily compile Linux and related software for more modern ISA features. So within the next couple of years we should expect large Linux distros to ship completely compiled for e.g. x86-64-v3 (Haswell and Zen) or x86-64-v4 (e.g. Ice Lake and Zen 4). This will be huge for the adoption rate of AVX.
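To give an idea of what that buys (a minimal sketch, not any distro's actual build recipe; the file and function names are made up), even plain scalar code picks up vector instructions just from the target level:

```cpp
// Hypothetical saxpy.cpp: with GCC 11+ / Clang 12+ this can be built as
//   g++ -O2 -march=x86-64-v3 saxpy.cpp   (AVX2 + FMA baseline)
//   g++ -O2 -march=x86-64-v4 saxpy.cpp   (adds that level's AVX-512 subset)
// and the loop below is auto-vectorized accordingly, with no source changes.
#include <cstddef>

void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // compiler emits FMA / vector ops here
}
```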
AVX(1)/AVX2 is already used in numerous applications you are familiar with: Blender, ffmpeg, WinRAR, 7-Zip, Chrome, etc. So you are probably using it a lot without even knowing it. But the majority of applications, libraries, drivers and the OS itself are compiled for baseline x86-64 with SSE2 at best, so basically lagging >17 years behind. There is a huge potential here for performance, lower latency and energy efficiency.
CPU-based SIMD is suitable for anything computationally intensive. It's kind of hard to use, though; still, many pieces of software get some "free" performance gains just from enabling compiler optimizations. The really heavy workloads that use AVX are generally hand-optimized using intrinsics, which is time consuming, but luckily for most programs only a tiny fraction of the code base is performance critical.
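For a sense of what hand-optimization with intrinsics looks like, here is a minimal sketch (the function name is made up, and tail handling plus runtime dispatch are omitted):

```cpp
#include <immintrin.h>
#include <cstddef>

// Element-wise add of two float arrays, 8 lanes at a time with 256-bit AVX.
// Assumes n is a multiple of 8 to keep the sketch short; real code would
// handle the remainder and pick the ISA level at runtime.
void add_f32_avx(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```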
What AVX-512 brings to the table is obviously an increased vector width of 512 bits. But it also brings a new, more flexible instruction encoding and many more operations, which means many more algorithms can be efficiently implemented with AVX instead of application-specific instructions. Hopefully the new push in GCC and LLVM for x86 feature levels will lead to more compiler optimizations for auto-vectorization; I believe there is a huge potential here. I believe Intel screwed up with this naming scheme:
AVX(1) - Partial 128-bit and 256-bit, mostly a "small" extension of SSE 4.
AVX2 - Fully 256-bit and adds the very useful FMA. (I would argue this should have been the first AVX version)
AVX-512 - Fully 512-bit and more flexible.
I'm pretty sure Intel wants their ISA to be adopted. Right now the best argument against AVX-512 is AMD's lack of support.
AVX512 is a major change to how AVX was done. With opcode masks and other features, Intel likely was gathering data on how programmers would use those AVX512 features. Originally AVX512 was for the Xeon Phi series, but it's important to "test" and see who adopts AVX512 in the server/desktop market first.
Intel can now remove extensions that aren't being used, as well as focus on which instructions need to be improved. In particular, Intel clearly has noticed the "throttling" issue in Skylake X and has fixed it (which was probably the biggest criticism of AVX512).
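For anyone who hasn't played with the opcode masks mentioned above, a minimal sketch with real AVX-512F intrinsics (the function itself is just an illustration): the predicate rides along on the operation instead of needing a separate blend.

```cpp
#include <immintrin.h>

// dst[i] = x[i] > 0 ? x[i] + y[i] : x[i], for 16 floats, using an opcode mask.
// The comparison produces a 16-bit mask; the masked add only writes lanes
// where the mask bit is set and passes x through everywhere else.
__m512 add_where_positive(__m512 x, __m512 y) {
    __mmask16 m = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);
    return _mm512_mask_add_ps(x, m, x, y);
}
```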
-----
You really want these issues to be figured out BEFORE mainstream programmers start relying upon those instructions. Once the mainstream starts using the instructions, you pretty much can never change them again.
xkcd.com/1172/
AVX-512 did emerge from requirements from researchers and enterprises, contrary to the popular belief that it's something Intel concocted to show off in benchmarks.
I think Intel made a very smart move by making it modular, both to make it easier to implement for different segments, to make it easier for AMD to implement, and to evolve it further over time. It's very hard to figure out a balanced and well featured ISA revision without someone using it for a while.
Intel is probably at fault for the slow adoption rate of AVX-512. If Ice Lake desktop had released three years ago, we would possibly have seen some software by now.
But there are still cases where wide SIMD on a CPU is a good idea, like very large simulations that require a lot of RAM and wide SIMD. Some of these machines will be able to have 192 cores/384 threads with terabytes of memory, where even the largest GPUs are very far from that.
I think AMD is not interested in having multiple CPU SKUs. They will just create one for all desktop Ryzen (without APUs), Threadripper and EPYC.
If you really want fast and robust vector-type processing, maybe they could come up with something along the lines of a modified GPU compute unit without any of the graphics hardware, with an open, universal ISA. AMD/Intel could pull something like this off and make heterogeneous computing an actual reality. You can't scale stuff like AVX easily; it just doesn't work. Look how dog-slow AVX-512 adoption is.
If this stuff really was critical the industry would have demanded it to be implemented everywhere from day one, but they haven't.
4x parallel AES seems stupid when you first see it, but suddenly AES-GCM mode becomes popular and, lo and behold, 4x 128-bit parallel AES streams on a 512-bit instruction suddenly make sense. And while the ARM guys are stuck in 128-bit land, Intel benefits from the 4x-parallel (and AMD Zen 3 already benefits from the 2x-parallel) versions of those instructions. Turns out there's an AES mode of operation that actually benefits from that.
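In intrinsics terms that single instruction looks like this (a sketch assuming VAES on top of AVX-512F, as on Ice Lake; the function name is mine):

```cpp
#include <immintrin.h>

// One AES encryption round applied to four independent 128-bit blocks packed
// into a single 512-bit register, e.g. four counter blocks in AES-GCM/CTR.
// Requires the VAES extension on top of AVX-512F.
__m512i aes_round_x4(__m512i four_blocks, __m512i round_key) {
    return _mm512_aesenc_epi128(four_blocks, round_key);  // single VAESENC
}
```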
IMO, Intel needs to spend more time studying GPUs and implement a proper shuffle crossbar (like NVidia's and AMD's permute / bpermute instructions). Intel is kinda-sorta getting there with pshufb, but it's not quite as flexible as GPU assembly instructions yet. Even then, AMD GPUs still opt for a flurry of new modes of operation for their shuffles (the DPP instructions: butterfly shuffle, 4x shuffle, etc.). It seems like a highly optimized (but non-flexible) shuffle is still ideal. (The butterfly shuffle deserves its own optimization because of its application to the Fast Fourier Transform, and it also shows up as a pattern in sorting networks and in the map/reduce pattern.)
Getting a faster butterfly shuffle, rather than a "generic shuffle," has proven itself useful in GPU-land, to the point that it warrants its own instruction. Not that AVX512 implements a butterfly shuffle yet (at least, last time I checked; AVX512 keeps adding new instructions)... but yeah. There are actually a lot of data-movement instructions at the byte level that are very useful and are still being defined today. See NVidia's "shfl.bfly.b32" PTX instruction: that butterfly shuffle is very important!!
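For reference, on current AVX-512 you would have to fake a butterfly exchange with the generic permute, roughly like this sketch (AVX-512F intrinsics; the helper is made up, and a dedicated instruction could presumably do the same job cheaper):

```cpp
#include <immintrin.h>

// Butterfly exchange over 16 float lanes: lane i is swapped with lane i ^ stride
// (stride a power of two below 16), the access pattern used by FFT butterflies
// and bitonic sorting networks. Emulated here with a full variable permute.
__m512 butterfly_xor(__m512 v, int stride) {
    const __m512i lanes = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                           7, 6, 5, 4, 3, 2, 1, 0);
    __m512i idx = _mm512_xor_si512(lanes, _mm512_set1_epi32(stride));
    return _mm512_permutexvar_ps(idx, v);
}
```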
-------
AVX512 restarts SIMD instructions under the opcode-mask paradigm, finally allowing Intel to catch up to NVidia's 2008-era technology. It's... a bit late. But hey, I welcome AVX512's sane decisions with regard to SIMD-ISA design. Intel actually has gather/scatter instructions (though not implemented quickly yet), and has a baseline for future extensions (which I hope will serve as placeholders for important data-movement instructions like butterfly shuffles). Intel needs to catch up: they are missing important and efficient instructions that NVidia figured out uses for and has already deployed.
If anything: Intel is still missing an important set of data-movement instructions. Intel needs more instructions, not fewer instructions.
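A quick sketch of the gather side mentioned above (AVX-512F; _mm512_i32scatter_ps is the mirror image for stores; the wrapper function is made up):

```cpp
#include <immintrin.h>

// Gather 16 floats from table[idx[i]] in one instruction. Fast or not on a
// given core, the point is that the addressing pattern is now expressible in
// the ISA instead of needing 16 scalar loads plus inserts.
__m512 gather16(const float* table, __m512i idx) {
    return _mm512_i32gather_ps(idx, table, 4);  // scale = 4 bytes per float
}
```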
I would call it recovering, not catching up. Given that Intel only took the lead in the first place by paying OEMs to not carry AMD products, AMD isn't so much catching up as it is recovering from Intel's illegal market manipulation. The amount AMD got in court from Intel for cornering the market is paltry compared to the damages Intel caused to the market and to AMD.
In any case, everyone should be thankful that the market actually has competition. I remember some people actually believed Intel's marketing BS that 10% or greater IPC increases were a thing of the past and that TIM was superior to solder. Longer lasting, my behind: I've had zero Intel 2000-series CPUs that have had issues with their solder, but I've had to delid over a dozen 4000-series CPUs because the TIM lost contact with the IHS.
"stucked" ARM conquers the world. The leading x86 is dying out.
Second: That means you need EIGHT instructions, (AESE + AESMC) x 4, to do the same work as one 512-bit AESENC instruction.
Third: Intel can issue 4 instructions per clock tick. Which means in one clock tick, it can issue the 512-bit AESENC instruction AND 3 other instructions (add, multiply, whatever). In contrast, Apple's M1 needs to spend its entire 8-wide decoder on the equivalent of that single 512-bit operation.
Fourth: Intel / AMD are 4GHz processors executing these things every 0.25 nanoseconds. Apple ARM M1 is a 2.8GHz processor closer to 0.35ish ns.
It's pretty clear to me that the Intel / (future) AMD 512-bit AESENC instruction is superior and better designed for this situation, compared to ARM's AESE + AESMC pair. Now I recognize that ARM systems are optimized to macro-op fuse AESE+AESMC pairs (so those 8 instructions only need 4 pipeline cycles to execute), but I'm pretty sure they would still take up the slots in the ARM decoder.
ARM's decision is cleaner in theory: AESE is all you need for the final iteration, while x86 requires a second instruction, AESENCLAST, for the final round. But in terms of performance, the x86 approach is clearly faster and better designed IMO, though it probably uses more transistors. More and more transistors are being given to encryption these days thanks to HTTPS's popularity, so Intel's "more transistors" but more efficient design wins out over ARM's original design.
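For comparison, the ARM side of that round as a sketch with the NEON crypto intrinsics (the function name is mine; note the two ISAs' round primitives are not bit-for-bit interchangeable, since AESE adds the round key before SubBytes):

```cpp
#include <arm_neon.h>

// One AES round over four independent 16-byte blocks on ARMv8 with the crypto
// extension: four AESE+AESMC pairs, i.e. eight instructions, versus a single
// 512-bit VAESENC covering the same four blocks on the x86 side.
void aes_round_x4_neon(uint8x16_t blocks[4], uint8x16_t round_key) {
    for (int i = 0; i < 4; ++i)
        blocks[i] = vaesmcq_u8(vaeseq_u8(blocks[i], round_key));
}
```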
I'm not exactly sold on the requirement for large and wide SIMD on a CPU rather than something more appropriate for the package, but I'm happy to be proven wrong.
Two RISC instructions per AES round are irrelevant. ARM A78 can decode four, ARM X1 can take five, IBM POWER9 can chew through eight. And nobody can guarantee whether ARM will (or will not) implement a fused instruction.
3. Recent Intel CPUs can decode up to 5 instructions per clock, but the 16-byte decode window and only one complex instruction per clock turn your dream into a fart.
Do you think this might have forced Intel to make AVX-512?
4. Try to look at the situation from the point of view of energy efficiency. You can't compare architectures with different goals. If there is a real need for a high-performance ARM core, then just wait. A cheap RISC decoder vs. the "superior" x86 monster that chokes on a single 9-byte instruction. Nice. I think ARM has found a balance between decoder and ALU.
And you cannot guarantee whether ARM will (or will not) implement a fused instruction.