Monday, March 1st 2021

AMD "Zen 4" Microarchitecture to Support AVX-512

The next-generation "Zen 4" CPU microarchitecture powering AMD's 4th Gen EPYC "Genoa" enterprise processors will support 512-bit AVX instruction sets, according to an alleged company slide leaked to the web on the ChipHell forums. The slide references "AVX3-512" support in addition to BFloat16 and "other ISA extensions." This would make "Zen 4" the first AMD microarchitecture to support AVX-512. It remains to be seen which specific instructions the architecture supports, and whether all of them will be available in both the enterprise and client implementations of "Zen 4," or whether AMD will take an approach similar to Intel's, enabling only certain "relevant" instructions on the client parts. The slide also mentions core counts "greater than 64," corresponding with our story from earlier today.
Sources: ChipHell Forums, via VideoCardz

43 Comments on AMD "Zen 4" Microarchitecture to Support AVX-512

#26
Voultapher
I get why AMD wants to take away Intel's last shiny toy: money.

AVX-512 is hard to program for effectively. Even if you manage to find a workload that benefits from it, the power scaling will more than likely ruin your day, unless all that entire system does is crunch numbers for long stretches.

Sources:
- blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/ (For people not too familiar with programming, OpenSSL is a very popular implementation of TLS and other encryption algorithms)
- www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-On-AVX-512

Edit:
On Intel's newest platforms, Ice Lake (2019) and up, this effect is greatly minimized. Thanks @dragontamer5788 for pointing this out.
#27
thesmokingman
It must suck extra hard working for Intel today after reading this news.
#28
dragontamer5788
Voultapher: AVX-512 is hard to program for effectively. Even if you manage to find a workload that benefits from it, the power scaling will more than likely ruin your day, unless all that entire system does is crunch numbers for long stretches.
Your links are out of date.

travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

The AVX-512 power-scaling problem is null and void as of Ice Lake / Rocket Lake. Just because Skylake-X had a poor AVX-512 implementation doesn't mean that the next processors have the same issue. In fact, Intel commonly has "poor implementations" of SIMD for a generation or two (ex: AVX on Sandy Bridge was utter crap and not worthwhile until Haswell). The point of the early implementations is to get assembly programmers used to the instruction set before a newer, better processor implements the ISA for real.
#29
billEST
dragontamer5788: Your links are out of date.

travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

The AVX-512 power-scaling problem is null and void as of Ice Lake / Rocket Lake. Just because Skylake-X had a poor AVX-512 implementation doesn't mean that the next processors have the same issue. In fact, Intel commonly has "poor implementations" of SIMD for a generation or two (ex: AVX on Sandy Bridge was utter crap and not worthwhile until Haswell). The point of the early implementations is to get assembly programmers used to the instruction set before a newer, better processor implements the ISA for real.
And what happens when 70% of the world's processors are Intel and you're the one programming? You don't use what the newcomer puts on its processor; you don't have time to waste.
#30
mechtech
watzupken: It seems that AMD may be slowly encroaching into areas where Intel likes to toot its horn. Currently, Intel has a significant advantage with AVX-512 loads, and also in the area of AI. So I will not be surprised if AMD starts attacking these "strongholds". But I do feel AVX-512 is quite unlikely to be used by most folks.
But the Intel fanbois have been saying this is awesome and AMD sucks since they don't have it... but those were fanbois though...
#31
AlB80
Kohl Baas: My friend and I were unable to try Star Citizen a week ago, because his CPU is lacking AVX, which the game needs... Not a big deal, every CPU since 2012 has it, but he uses a first-gen i7 which unfortunately does not...
Do Intel Pentiums and Celerons not count as "every CPU"? E.g. even the latest Pentium G6600 (LGA1200) lacks AVX.
#32
efikkan
Understandably, there is much confusion about AVX-512, so some of us have to clear up a few things.
Vya Domus: Like I said many times, wide SIMD is kind of stupid in CPUs
And like I've told you numerous times, there are different levels of parallelism. The main advantage of CPU-based SIMD is the virtually zero overhead, and the ability to mix SIMD with other instructions seamlessly. This is a stark contrast to the GPU, where there is a huge latency barrier. So it depends totally on the workload whether it should be GPU-accelerated or not, or perhaps even both.
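To make that concrete, here's a minimal sketch (my own, assuming AVX2+FMA, built with e.g. -mavx2 -mfma on GCC/Clang) of how vector code, a reduction and plain scalar code mix in one function with no hand-off cost:

#include <immintrin.h>
#include <cstddef>

// Sum of squares: vector body, horizontal reduction and scalar tail,
// all interleaved freely in one function -- no kernel launch, no copies.
float sum_of_squares(const float* data, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {              // 8 floats per 256-bit register
        __m256 v = _mm256_loadu_ps(data + i);
        acc = _mm256_fmadd_ps(v, v, acc);     // acc += v * v
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);              // reduce the vector accumulator
    float sum = 0.0f;
    for (float l : lanes) sum += l;
    for (; i < n; ++i) sum += data[i] * data[i];  // seamless scalar tail
    return sum;
}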
Vya Domus: I just hope AVX512 doesn't ruin the power consumption of these chips.
Even if the core loses a few hundred MHz, it will still complete far more work compared to earlier AVX versions or non-SIMD implementations.
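To put hedged, illustrative numbers on that (mine, not a benchmark): if 512-bit code drops a core from, say, 4.2 GHz to 3.6 GHz, that is roughly 14% less frequency, but each instruction processes twice as many elements as 256-bit AVX, so vector-bound throughput still lands around 2 x 3.6/4.2 ≈ 1.7x, assuming the workload scales with vector width.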
TheinsanegamerN: Cool, but what consumer applications actually USE AVX-512? Actually, what consumer applications use AVX to begin with? Games certainly don't; they use SSE4 at most. I guess there are some production applications? But it doesn't seem to help at all, given AMD crushes Intel in every production market out there.

It'd be nice to see wider adoption. These instruction sets are great but seemingly nobody uses them...
AVX-512 is so far limited to custom enterprise software and educational institutions; I'm not aware of any mainstream consumer software benefiting from it yet.

So if you're a buyer today, you should probably not care about it unless you plan to use custom software that benefits from it, or you do application development.

Keep in mind that we are talking about Zen 4 here, which is probably 1-1.5 years away and will stay in the market for >2 years, so by then AVX-512 may be very much relevant. If Photoshop, Blender or ffmpeg suddenly start using it, then it will matter to many, and people will not be going back.

There is also some interesting movement in the Linux ecosystem, where Red Hat has been pushing x86 feature levels to more easily compile Linux and related software for more modern ISA features. So within the next couple of years we should expect large Linux distros to ship completely compiled for e.g. x86-64-v3 (Haswell and Zen) or x86-64-v4 (e.g. Ice Lake and Zen 4). This will be huge for the adoption rate of AVX.

AVX(1)/AVX2 is already used in numerous applications you are familiar with: Blender, ffmpeg, WinRAR, 7zip, Chrome, etc. So you are probably using it a lot without even knowing it. But the majority of applications, libraries, drivers and the OS itself are compiled for x86-64 with SSE2 at best, basically lagging >17 years behind. There is huge potential here for performance, lower latency and energy efficiency.

CPU-based SIMD is suitable for anything computationally intensive. It's kind of hard to use, though, but many pieces of software still get some "free" performance gains just from enabling compiler optimizations. The really heavy workloads that use AVX are generally hand-optimized using intrinsics, which is time-consuming, but luckily, for most programs only a tiny fraction of the code base is performance-critical.
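As a sketch of those "free" gains (my own trivial example): this plain scalar loop is auto-vectorized by GCC/Clang at -O3 when targeting e.g. -march=x86-64-v3 (AVX2) or -march=x86-64-v4 (AVX-512), no intrinsics required:

#include <cstddef>

// Plain C++; the compiler turns this loop into FMA over full vector
// registers when the target ISA allows it.
void saxpy(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}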

What AVX-512 brings to the table is obviously an increased vector width of 512 bits. But it also brings a new, more flexible instruction encoding and many more operations, which means many more algorithms can be efficiently implemented with AVX instead of application-specific instructions. Hopefully the new push in GCC and LLVM for x86 feature levels will lead to more compiler optimizations for auto-vectorization; I believe there is huge potential here.
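A minimal sketch of what the more flexible encoding buys (my example, assuming AVX-512F, built with e.g. -mavx512f): the awkward remainder loop disappears, because the tail runs through the same vector instructions under a per-lane mask:

#include <immintrin.h>
#include <cstddef>

// Scale an array by a constant; the final partial vector is handled by
// masked load/store instead of a separate scalar loop.
void scale(float* dst, const float* src, float a, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 16) {          // 16 floats per op
        __mmask16 m = (n - i >= 16)
                          ? (__mmask16)0xFFFF
                          : (__mmask16)((1u << (n - i)) - 1);
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);  // lanes past n read as 0
        _mm512_mask_storeu_ps(dst + i, m, _mm512_mul_ps(va, v));
    }
}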
Selaya: Possibly alluding to the fact that there is AVX(1), then AVX2 (256); AVX-512 would be AVX3 in that case.
I believe Intel screwed up with this naming scheme:
AVX(1) - Partial 128-bit and 256-bit, mostly a "small" extension of SSE4.
AVX2 - Fully 256-bit and adds the very useful FMA. (I would argue this should have been the first AVX version.)
AVX-512 - Fully 512-bit and more flexible.
thesmokingman: It must suck extra hard working for Intel today after reading this news.
I'm pretty sure Intel wants its ISA to be adopted. Right now the best argument against AVX-512 is AMD's lack of support.
#33
dragontamer5788
efikkan: AVX-512 is so far limited to custom enterprise software and educational institutions; I'm not aware of any mainstream consumer software benefiting from it yet.
To be fair, that's the right strategy.

AVX-512 is a major change to how AVX was done. With opcode masks and other features, Intel was likely gathering data on how programmers would use those AVX-512 features. Originally AVX-512 was for the Xeon Phi series, but it's important to "test" and see who adopts AVX-512 in the server/desktop market in some way first.

Intel can now remove extensions that aren't being used, as well as focus on which instructions need to be improved. In particular, Intel clearly noticed the "throttling" issue in Skylake-X and has fixed it (which was probably the biggest criticism of AVX-512).

-----

You really want these issues to be figured out BEFORE mainstream programmers start relying upon those instructions. Once the mainstream starts using the instructions, you pretty much can never change them again.

xkcd.com/1172/

#34
efikkan
dragontamer5788: To be fair, that's the right strategy.

AVX-512 is a major change to how AVX was done. With opcode masks and other features, Intel was likely gathering data on how programmers would use those AVX-512 features. Originally AVX-512 was for the Xeon Phi series, but it's important to "test" and see who adopts AVX-512 in the server/desktop market in some way first.

Intel can now remove extensions that aren't being used, as well as focus on which instructions need to be improved. In particular, Intel clearly noticed the "throttling" issue in Skylake-X and has fixed it (which was probably the biggest criticism of AVX-512).

-----

You really want these issues to be figured out BEFORE mainstream programmers start relying upon those instructions. Once the mainstream starts using the instructions, you pretty much can never change them again.
I agree.
AVX-512 did emerge from requirements from researchers and enterprises, contrary to the popular belief that it's something Intel concocted to show off in benchmarks.

I think Intel made a very smart move by making it modular: it is easier to implement for different segments, easier for AMD to implement, and easier to evolve further over time. It's very hard to figure out a balanced and well-featured ISA revision without someone using it for a while.

Intel is probably at fault for the slow adoption rate of AVX-512. If Ice Lake desktop had been released three years ago, we would possibly have seen some software by now.
#35
Punkenjoy
Vya Domus: Like I said many times, wide SIMD is kind of stupid in CPUs, I just hope AVX512 doesn't ruin the power consumption of these chips.
I don't think it will if it's not used, but it might add a lot of silicon real estate for something the end users might rarely use.

But there are still cases where wide SIMD on a CPU is a good idea, like super-large simulations that require a lot of RAM and wide SIMD. Some of these machines will be able to have 192 cores/384 threads with terabytes of memory, where the largest GPUs are very far from that.

I think AMD is not interested in having multiple CPU SKUs. They will just create one for all: desktop Ryzen without APUs, Threadripper and EPYC.
#36
Vya Domus
dragontamer5788: Honestly, there's just a ton of advantages to supporting 512-bit, especially when you consider all the possible designs AMD can do here. There's really no reason NOT to support 512-bit.
I know there are advantages, but a "ton" is a gross exaggeration. The problem I have with SIMD in modern CPUs is that it's an archaic approach to solving data parallelism in hardware, and adding wider and wider extensions with a million new instructions seems really inefficient and stupid to me. I wish there was more innovation.

If you really want fast and robust vector-type processing, maybe they could come up with something along the lines of a modified GPU compute unit without any of the graphics hardware, with an open universal ISA. AMD/Intel could pull something like this off and make heterogeneous computing an actual reality. You can't scale stuff like AVX easily; it just doesn't work. Look how dog-slow AVX-512 adoption is.

If this stuff really were critical, the industry would have demanded it be implemented everywhere from day one, but it hasn't.
#37
dragontamer5788
Vya Domus: I know there are advantages, but a "ton" is a gross exaggeration. The problem I have with SIMD in modern CPUs is that it's an archaic approach to solving data parallelism in hardware, and adding wider and wider extensions with a million new instructions seems really inefficient and stupid to me. I wish there was more innovation.
But these new instructions are, in fact, new and different.

4x parallel AES seems stupid when you first see it, but then AES-GCM mode becomes popular and, lo and behold, 4x 128-bit parallel AES streams in one 512-bit instruction suddenly make sense. And while the ARM guys are stuck in 128-bit land, Intel benefits from the 4x-parallel (and AMD's Zen 3 already benefits from the 2x-parallel) versions of those instructions. Turns out there's an AES mode of operation that actually benefits from that.
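To make that concrete, a hedged sketch (mine; assumes a VAES-capable CPU such as Ice Lake, built with e.g. -mavx512f -mvaes): one 512-bit instruction performs an AES round on four independent 128-bit blocks at once, which is exactly what counter-mode/GCM-style parallel streams want:

#include <immintrin.h>

// One AES encryption round across four independent blocks packed into
// a single 512-bit register (VAES extension).
__m512i aes_round_x4(__m512i four_blocks, __m512i four_round_keys) {
    return _mm512_aesenc_epi128(four_blocks, four_round_keys);
}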

IMO, Intel needs to spend more time studying GPUs and implement a proper shuffle crossbar (like NVidia's and AMD's permute / bpermute instructions). Intel is kinda-sorta getting there with pshufb, but it's not quite as flexible as GPU assembly instructions yet. Even then, AMD GPUs still opt for a flurry of new modes of operation for their shuffles (the DPP instructions: butterfly shuffle, 4x shuffle, etc.). It seems like a highly optimized (but non-flexible) shuffle is still ideal. (The butterfly shuffle deserves its own optimization because of its application to the Fast Fourier Transform, and it shows up as a pattern in sorting networks and in the map/reduce pattern.)

Getting a faster butterfly shuffle, rather than a "generic shuffle," has proven itself useful in GPU land, to the point that it warrants its own instruction. Not that AVX-512 implements a butterfly shuffle yet (at least, last time I checked; AVX-512 keeps adding new instructions)... but yeah. There are actually a lot of data-movement instructions at the byte level that are very useful and are still being defined today. See NVidia's "shfl.bfly.b32" PTX instruction: that butterfly shuffle is very important!
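Since AVX-512 has no dedicated butterfly shuffle, here's a rough sketch (my own, AVX-512F assumed) of emulating one FFT-style butterfly step with the generic crossbar: each lane trades with the lane whose index differs by stride (lane XOR stride), the same pattern as NVidia's shfl.bfly:

#include <immintrin.h>

// Butterfly (XOR-lane) exchange over 16 float lanes; stride should be a
// power of two below 16 (1, 2, 4 or 8).
__m512 butterfly_xor(__m512 v, int stride) {
    const __m512i lanes = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    __m512i idx = _mm512_xor_si512(lanes, _mm512_set1_epi32(stride));
    return _mm512_permutexvar_ps(idx, v);   // one generic-crossbar permute
}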

-------

AVX-512 restarts SIMD instructions under the opcode-mask paradigm, finally allowing Intel to catch up to NVidia's 2008-era technology. It's... a bit late. But hey, I welcome AVX-512's sane decisions with regard to SIMD ISA design. Intel actually has gather/scatter instructions (though not implemented quickly yet), and has a baseline for future extensions (which I hope will serve as placeholders for important movement instructions like butterfly shuffles). Intel needs to catch up: they are missing important and efficient instructions that NVidia has figured out uses for and already deployed.

If anything, Intel is still missing an important set of data-movement instructions. Intel needs more instructions, not fewer.
#38
evernessince
voltage: lmao. catching up to Intel. it sure took amd long enough. DECADES in fact, with of course the Iranian bail out (massive stock buy-in when it was 3 dollars per share) to save them.
A decade, not decades.

I would call it recovering, not catching up. Given that Intel only took the lead in the first place by paying OEMs not to carry AMD products, AMD isn't so much catching up as recovering from Intel's illegal market manipulation. The amount AMD got in court from Intel for cornering the market is paltry compared to the damage Intel caused to the market and to AMD.

In any case, everyone should be thankful that the market actually has competition. I remember some people actually believed Intel's marketing BS that 10% or greater IPC increases were a thing of the past and that TIM was superior to solder. Longer-lasting, my behind: I've had zero Intel 2000-series CPUs with solder issues, but I've had to delid over a dozen 4000-series CPUs because the TIM lost contact with the IHS.
#39
AlB80
dragontamer5788: And while the ARM guys are stuck in 128-bit land, Intel benefits from the 4x-parallel (and AMD's Zen 3 already benefits from the 2x-parallel) versions of those instructions.
Wait. But you can issue a 128-bit instruction four times, and if you have four 128-bit ALUs you get the same result. There is no need to extend an ISA if the ISA is modern and effective. ARM can issue 4 vector instructions, but x86 can't. That explains why the ancient architecture keeps getting uglier.
"Stuck" ARM is conquering the world; the leading x86 is dying out.
#40
dragontamer5788
AlB80: Wait. But you can issue a 128-bit instruction four times, and if you have four 128-bit ALUs you get the same result. There is no need to extend an ISA if the ISA is modern and effective. ARM can issue 4 vector instructions, but x86 can't. That explains why the ancient architecture keeps getting uglier.
"Stuck" ARM is conquering the world; the leading x86 is dying out.
First: ARM requires TWO instructions (AESE and AESMC) to do the same work as one x86 AESENC instruction.

Second: That means you need EIGHT instructions ((AESE + AESMC) x 4) to do the same work as one 512-bit AESENC instruction.

Third: Intel can issue 4x instructions per clock tick. Which means that in one clock tick, it can issue the 512-bit AESENC instruction AND 3 other instructions (add, multiply, whatever). In contrast, Apple's M1 needs to spend all of its 8-wide decoder on that singular 512-bit operation's worth of work.

Fourth: Intel / AMD are 4 GHz processors executing these things every 0.25 nanoseconds. Apple's ARM M1 is a 2.8 GHz processor, closer to 0.35ish ns.

It's pretty clear to me that the Intel / (future) AMD 512-bit AESENC instruction is superior and better designed for this situation, compared to ARM's AESE + AESMC pair. Now, I recognize that ARM systems are optimized to macro-op fuse AESE+AESMC pairs (so those 8 instructions only need 4 pipeline cycles to execute), but I'm pretty sure they would still take up slots in the ARM decoder.

ARM's decision is cleaner in theory: AESE is all you need for the final round, while x86 requires a second instruction, AESENCLAST, for the final round. But in terms of performance, the x86 approach is clearly faster and better designed IMO, though it probably uses more transistors. But more and more transistors are being given to encryption these days thanks to HTTPS's popularity, so Intel's "more transistors" but more efficient design is superior to ARM's original design.
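For reference, here's the x86 side of this argument as a sketch with AES-NI intrinsics (round keys assumed already expanded; the function name is mine): AES-128 is nine AESENC rounds plus one AESENCLAST, where ARM would spend an AESE+AESMC pair per round:

#include <immintrin.h>

// Encrypt one 16-byte block with AES-128 given 11 expanded round keys.
__m128i aes128_encrypt_block(__m128i block, const __m128i round_keys[11]) {
    block = _mm_xor_si128(block, round_keys[0]);         // initial whitening
    for (int r = 1; r < 10; ++r)
        block = _mm_aesenc_si128(block, round_keys[r]);  // full rounds
    return _mm_aesenclast_si128(block, round_keys[10]);  // final round, no MixColumns
}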
#41
Camm
My problem with AVX-512 is that every implementation seen to date has the rest of the chip throttling heavily whilst running an AVX workload.

I'm not exactly sold on the requirement for large and wide SIMD on a CPU rather than something more appropriate for the package, but I'm happy to be proven wrong.
#42
AlB80
dragontamer5788: First: ARM requires TWO instructions (AESE and AESMC) to do the same work as one x86 AESENC instruction.

Second: That means you need EIGHT instructions ((AESE + AESMC) x 4) to do the same work as one 512-bit AESENC instruction.

Third: Intel can issue 4x instructions per clock tick. Which means that in one clock tick, it can issue the 512-bit AESENC instruction AND 3 other instructions (add, multiply, whatever). In contrast, Apple's M1 needs to spend all of its 8-wide decoder on that singular 512-bit operation's worth of work.

Fourth: Intel / AMD are 4 GHz processors executing these things every 0.25 nanoseconds. Apple's ARM M1 is a 2.8 GHz processor, closer to 0.35ish ns.
1-2. How does the AESE/AESMC pair change the fact that recent x86 cannot issue more than one AESENC per clock?
Two RISC instructions per AES round are irrelevant: the ARM A78 can decode four, the ARM X1 can take five, the IBM POWER9 can chew through eight. And nobody can guarantee that ARM will (or will not) implement a fused instruction.

3. Recent Intel CPUs can decode up to 5 instructions per clock, but the 16-byte decode window and the limit of one complex instruction per clock turn your dream into a fart.
Do you think this might be what forced Intel to make AVX-512?

4. Try to look at the situation from the point of view of energy efficiency. You can't compare architectures with different goals. If there is a real need for a high-performance ARM core, then just wait.
dragontamer5788: It's pretty clear to me that the Intel / (future) AMD 512-bit AESENC instruction is superior and better designed for this situation, compared to ARM's AESE + AESMC pair. Now, I recognize that ARM systems are optimized to macro-op fuse AESE+AESMC pairs (so those 8 instructions only need 4 pipeline cycles to execute), but I'm pretty sure they would still take up slots in the ARM decoder.
A cheap RISC decoder vs. a "superior" x86 monster that chokes on a single 9-byte instruction. Nice.
dragontamer5788: ARM's decision is cleaner in theory: AESE is all you need for the final round, while x86 requires a second instruction, AESENCLAST, for the final round. But in terms of performance, the x86 approach is clearly faster and better designed IMO, though it probably uses more transistors. But more and more transistors are being given to encryption these days thanks to HTTPS's popularity, so Intel's "more transistors" but more efficient design is superior to ARM's original design.
I think ARM has found a balance between decoder and ALUs.
And you cannot guarantee that ARM will (or will not) implement a fused instruction.