Monday, February 20th 2023
Intel Publishes Sorting Library Powered by AVX-512, Offers 10-17x Speed Up
Intel has recently updated its open-source C++ header library for high-performance SIMD-based sorting to support the AVX-512 instruction set. Extending the existing AVX2 support, the sorting functions now use 512-bit vector extensions for greater performance. According to Phoronix, the NumPy Python library for mathematics, which underpins a great deal of software, has updated its codebase to use the AVX-512-accelerated sorting functionality, yielding a substantial performance uplift. The library uses AVX-512 to vectorize quicksort for 16-bit, 32-bit, and 64-bit data types. Benchmarked on an Intel Tiger Lake system, NumPy sorting saw a 10-17x increase in performance.
Intel engineer Raghuveer Devulapalli authored the change, which was merged into the NumPy codebase on Wednesday. Broken down by data type, the new implementation speeds up 16-bit integer sorting by 17x and 32-bit data type sorting by 12-13x, while 64-bit float sorting of random arrays sees a 10x speed-up. Built on the x86-simd-sort code, this speed-up shows the power of AVX-512 and its ability to enhance the performance of various libraries. We hope to see wider adoption of AVX-512 now that AMD has joined the party by adding AVX-512 processing elements to Zen 4.
Source: Phoronix
28 Comments on Intel Publishes Sorting Library Powered by AVX-512, Offers 10-17x Speed Up
I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long pointed out that it was worthless... until Zen 4 supported it, anyway, and then they forgot they'd held Linus Torvalds in the esteem of a god for saying it should die.
Knowing Intel, I bet this is set up to choose a non-AVX-512 code path for anything that isn't Intel, so this is going to run on... nothing that is current-gen, I guess? Lol.
Also, AMD's support apparently isn't entirely the same as what Intel is doing.
So yes, it was a smart move by AMD.
For example, no game developers would spend time right now trying to find ways to make their games run faster for just the few folks who have Tiger Lake or Zen 4 CPUs (or the very few who run Alder Lake with the E-cores disabled to get AVX-512, before Intel disabled it outright).
But that will come. We won't know how useful it is until it starts being utilized.
As for instruction support, it's a hot mess. But mostly a hot mess on Intel's side.
en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 They will probably add it again in a future generation when their E-cores (if they continue down that path) support it. Intel's implementation of AVX-512 takes up a lot of silicon area, which runs against the goal of keeping E-cores really small in die size.
One approach to emulate AVX-512 instructions is to use a combination of SSE and AVX2 instructions. This approach can achieve similar results to AVX-512 but with reduced performance. The following code snippet shows an example of how to emulate the 512-bit vector addition operation using SSE and AVX2 instructions:
#include <immintrin.h>

/* Adds two float arrays, 16 elements (512 bits) per iteration, using
   128-bit SSE adds that are recombined into 256-bit AVX vectors.
   Assumes n is a multiple of 16 and a, b, c are 32-byte aligned. */
void add_avx512_emulation(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 16) {
        /* Load 512 bits of each input as two 256-bit vectors. */
        __m256 va0 = _mm256_load_ps(&a[i]);
        __m256 va1 = _mm256_load_ps(&a[i + 8]);
        __m256 vb0 = _mm256_load_ps(&b[i]);
        __m256 vb1 = _mm256_load_ps(&b[i + 8]);

        /* Add each 128-bit half with SSE. */
        __m128 lo0 = _mm_add_ps(_mm256_castps256_ps128(va0), _mm256_castps256_ps128(vb0));
        __m128 hi0 = _mm_add_ps(_mm256_extractf128_ps(va0, 1), _mm256_extractf128_ps(vb0, 1));
        __m128 lo1 = _mm_add_ps(_mm256_castps256_ps128(va1), _mm256_castps256_ps128(vb1));
        __m128 hi1 = _mm_add_ps(_mm256_extractf128_ps(va1, 1), _mm256_extractf128_ps(vb1, 1));

        /* Recombine the halves into 256-bit results and store them. */
        _mm256_store_ps(&c[i],     _mm256_insertf128_ps(_mm256_castps128_ps256(lo0), hi0, 1));
        _mm256_store_ps(&c[i + 8], _mm256_insertf128_ps(_mm256_castps128_ps256(lo1), hi1, 1));
    }
}
In this example, each 512-bit chunk of data is handled as two 256-bit vectors, which are further decomposed into 128-bit halves. The halves are added using SSE instructions, and the results are then recombined into 256-bit vectors with AVX instructions before being stored. This emulation technique can be extended to other AVX-512 operations as well.
Note that emulating AVX-512 instructions with a combination of SSE and AVX2 instructions can be useful in scenarios where AVX-512 support is not available.
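Purely as a usage sketch (the buffer size, the aligned_alloc allocation, and the need to compile with something like -mavx are my own additions, not from the original snippet), calling the routine above could look like this, assuming it lives in the same file:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;  /* multiple of 16, as the loop above requires */

    /* 32-byte alignment satisfies the aligned AVX loads/stores above. */
    float *a = aligned_alloc(32, n * sizeof(float));
    float *b = aligned_alloc(32, n * sizeof(float));
    float *c = aligned_alloc(32, n * sizeof(float));

    for (int i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    add_avx512_emulation(a, b, c, n);
    printf("c[100] = %f\n", c[100]);  /* expect 300.0 */

    free(a); free(b); free(c);
    return 0;
}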
To increase the performance of an emulation technique that uses a combination of SSE and AVX2 instructions to emulate AVX-512, you can try the following techniques:
- Loop unrolling: Unrolling the loop reduces the number of iterations and increases the arithmetic intensity of the loop body. For example, you can unroll by a factor of two or four, depending on the available registers and the data dependencies in the loop body (a rough sketch follows after this list).
- Memory alignment: Memory alignment can significantly improve the performance by reducing the number of memory accesses and improving the cache locality. Ensure that the data is aligned to the cache line size and that the load and store operations use aligned memory addresses.
- Code optimization: Classic compiler-style optimizations such as loop-invariant code motion, common subexpression elimination, and dead-code elimination reduce the instruction count and improve instruction pipeline efficiency.
- Processor-specific optimizations: Processor-specific optimizations can improve the performance by taking advantage of the specific features of the processor. For example, some processors have specialized instructions that can improve the performance of specific operations. By using these instructions, you can improve the performance of your emulation technique.
- Data format optimization: Data format optimization can improve the performance by using data formats that are optimized for the specific operations. For example, using a packed data format can reduce the number of instructions and improve the performance of vector operations.
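As a rough illustration of the loop-unrolling bullet (and, incidentally, the code-optimization one), here is what an unrolled variant of the emulated 512-bit add might look like. This is only a sketch: the function name is mine, n is assumed to be a multiple of 32, and for brevity each chunk is handled with plain 256-bit AVX adds rather than the SSE-halves decomposition shown earlier.

#include <immintrin.h>

/* Unrolled-by-two variant of the emulated 512-bit add: 32 floats per
   iteration. Arrays must be 32-byte aligned and n a multiple of 32. */
void add_avx512_emulation_unrolled(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 32) {
        /* First "512-bit" chunk as two 256-bit adds. */
        _mm256_store_ps(&c[i],      _mm256_add_ps(_mm256_load_ps(&a[i]),      _mm256_load_ps(&b[i])));
        _mm256_store_ps(&c[i + 8],  _mm256_add_ps(_mm256_load_ps(&a[i + 8]),  _mm256_load_ps(&b[i + 8])));
        /* Second chunk, unrolled into the same iteration to cut loop overhead. */
        _mm256_store_ps(&c[i + 16], _mm256_add_ps(_mm256_load_ps(&a[i + 16]), _mm256_load_ps(&b[i + 16])));
        _mm256_store_ps(&c[i + 24], _mm256_add_ps(_mm256_load_ps(&a[i + 24]), _mm256_load_ps(&b[i + 24])));
    }
}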
Note that the above techniques can improve the performance of an emulation approach, but they may not match the performance of native AVX-512 instructions.
Users can (and should) manually set which feature level to target, or enable native optimizations, which check what the CPU supports and enable features accordingly. That's quite common, afaik. Even your own code would throw errors if you used AVX2/AVX-512 intrinsics without setting the appropriate flags in your compiler.
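For what it's worth, on GCC and Clang you can also gate the fast path at runtime instead of (or in addition to) compile-time flags, using the compilers' __builtin_cpu_supports builtin. A minimal sketch (the printed messages are just placeholders for the real code paths):

#include <stdio.h>

int main(void)
{
    /* GCC/Clang builtin that queries CPUID at runtime; "avx512f" is the
       foundation subset, and strings like "avx2" work the same way. */
    if (__builtin_cpu_supports("avx512f"))
        puts("AVX-512F available: take the AVX-512 code path");
    else if (__builtin_cpu_supports("avx2"))
        puts("AVX2 only: fall back to the AVX2 path");
    else
        puts("Scalar fallback");
    return 0;
}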
Most AVX-512 subfeatures are grouped into levels that conform to Intel's SKU generations, but that's mostly because AMD wasn't offering any support for them. Skimming the commit, I spotted a comment about a planned AVX512_ZEN4 grouping, so it's just a matter of time.
Consumer choice: AVX-512 or additional cores (or E-cores)?
AVX-512 is very powerful for edge-case workloads.
x consumer
Alder Lake processors with the old Intel logo have full support, like Xeon.
When AVX-512F was implemented in Rocket Lake (11th Gen), AMD fans frowned: "We don't need it!" Now, because AMD has it, I see the same fans abandoning Fortnite to update the NumPy Python library, a matter of life and death for a home user. It helps you turn on the vacuum cleaner, not burn the steak, not let your wife find out that you have a mistress, and much, much more.
No, I don't miss them.