
Intel Publishes Sorting Library Powered by AVX-512, Offers 10-17x Speed Up

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,651 (0.99/day)
Intel has recently updated its open-source C++ header file library for high-performance SIMD-based sorting to support the AVX-512 SIMD instruction set. Extending the capability of regular AVX2 support, the sorting functions now implement 512-bit extensions to offer greater performance. According to Phoronix, the NumPy Python library for mathematics that underpins a lot of software has updated its software base to use the AVX-512 boosted sorting functionality that yields a fantastic uplift in performance. The library uses AVX-512 to vectorize the quicksort for 16-bit and 64-bit data types using the extended instruction set. Benchmarked on an Intel Tiger Lake system, the NumPy sorting saw a 10-17x increase in performance.

Intel's engineer Raghuveer Devulapalli changed the NumPy code, which was merged into the NumPy codebase on Wednesday. Regarding individual data types, the new implementation increases 16-bit int sorting by 17x and 32-bit data type sorting by 12-13x, while float 64-bit sorting for random arrays has experienced a 10x speed up. Using the x86-simd-sort code, this speed-up shows the power of AVX-512 and its capability to enhance the performance of various libraries. We hope to see more implementations of AVX-512, as AMD has joined the party by placing AVX-512 processing elements on Zen 4.



 
Joined
Feb 18, 2005
Messages
5,847 (0.81/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) 3x AOC Q32E2N (32" 2560x1440 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G602
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
This is exactly what AVX-512 was created for. A shame it hasn't seen wider adoption.
 

Frick

Fishfaced Nincompoop
Joined
Feb 27, 2006
Messages
19,669 (2.86/day)
Location
w
System Name Black MC in Tokyo
Processor Ryzen 5 7600
Motherboard MSI X670E Gaming Plus Wifi
Cooling Be Quiet! Pure Rock 2
Memory 2 x 16GB Corsair Vengeance @ 6000Mhz
Video Card(s) XFX 6950XT Speedster MERC 319
Storage Kingston KC3000 1TB | WD Black SN750 2TB |WD Blue 1TB x 2 | Toshiba P300 2TB | Seagate Expansion 8TB
Display(s) Samsung U32J590U 4K + BenQ GL2450HT 1080p
Case Fractal Design Define R4
Audio Device(s) Plantronics 5220, Nektar SE61 keyboard
Power Supply Corsair RM850x v3
Mouse Logitech G602
Keyboard Dell SK3205
Software Windows 10 Pro
Benchmark Scores Rimworld 4K ready!
Didn't they just remove AVX512 from their consumer CPUs?
 
Joined
Dec 25, 2020
Messages
7,013 (4.81/day)
Location
São Paulo, Brazil
System Name "Icy Resurrection"
Processor 13th Gen Intel Core i9-13900KS Special Edition
Motherboard ASUS ROG Maximus Z790 Apex Encore
Cooling Noctua NH-D15S upgraded with 2x NF-F12 iPPC-3000 fans and Honeywell PTM7950 TIM
Memory 32 GB G.SKILL Trident Z5 RGB F5-6800J3445G16GX2-TZ5RK @ 7600 MT/s 36-44-44-52-96 1.4V
Video Card(s) ASUS ROG Strix GeForce RTX™ 4080 16GB GDDR6X White OC Edition
Storage 500 GB WD Black SN750 SE NVMe SSD + 4 TB WD Red Plus WD40EFPX HDD
Display(s) 55-inch LG G3 OLED
Case Pichau Mancer CV500 White Edition
Audio Device(s) Apple USB-C + Sony MDR-V7 headphones
Power Supply EVGA 1300 G2 1.3kW 80+ Gold
Mouse Microsoft Classic Intellimouse
Keyboard IBM Model M type 1391405 (distribución española)
Software Windows 11 IoT Enterprise LTSC 24H2
Benchmark Scores I pulled a Qiqi~
Alder Lake P-cores still had AVX512, but Intel purposefully disabled them.

Only on earlier samples. Newer Alder Lake CPUs have AVX-512 fused off, so most newer i9-12900K and almost all, if not all, i9-12900KS chips should be completely incapable of it. Raptor Lake doesn't have it to begin with.

I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long dismissed it as worthless... until Zen 4 supported it anyway, and then they forgot they held Linus Torvalds in the esteem of a God for saying that it should die.
 
Joined
Jan 8, 2017
Messages
9,504 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Didn't they just remove AVX512 from their consumer CPUs?
And AMD hilariously has put AVX 512 in their consumer CPUs now.

Knowing Intel, I bet this is set up such that it chooses a non-AVX-512 code path for anything that isn't Intel, so this is going to run on... nothing that is current gen, I guess? Lol.
 
Joined
Feb 11, 2009
Messages
5,570 (0.96/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
Only on earlier samples. Newer Alder Lake CPUs have AVX-512 fused off, so most newer i9-12900K and almost all, if not all, i9-12900KS chips should be completely incapable of it. Raptor Lake doesn't have it to begin with.

I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long dismissed it as worthless... until Zen 4 supported it anyway, and then they forgot they held Linus Torvalds in the esteem of a God for saying that it should die.

The only good thing about AVX-512 that I know of is PS3 emulation.....
Also, AMD's support apparently isn't entirely the same as what Intel is doing.
 
Joined
Dec 25, 2020
Messages
7,013 (4.81/day)
Location
São Paulo, Brazil
System Name "Icy Resurrection"
Processor 13th Gen Intel Core i9-13900KS Special Edition
Motherboard ASUS ROG Maximus Z790 Apex Encore
Cooling Noctua NH-D15S upgraded with 2x NF-F12 iPPC-3000 fans and Honeywell PTM7950 TIM
Memory 32 GB G.SKILL Trident Z5 RGB F5-6800J3445G16GX2-TZ5RK @ 7600 MT/s 36-44-44-52-96 1.4V
Video Card(s) ASUS ROG Strix GeForce RTX™ 4080 16GB GDDR6X White OC Edition
Storage 500 GB WD Black SN750 SE NVMe SSD + 4 TB WD Red Plus WD40EFPX HDD
Display(s) 55-inch LG G3 OLED
Case Pichau Mancer CV500 White Edition
Audio Device(s) Apple USB-C + Sony MDR-V7 headphones
Power Supply EVGA 1300 G2 1.3kW 80+ Gold
Mouse Microsoft Classic Intellimouse
Keyboard IBM Model M type 1391405 (distribución española)
Software Windows 11 IoT Enterprise LTSC 24H2
Benchmark Scores I pulled a Qiqi~
The only good thing about AVX-512 that I know of is PS3 emulation.....
Also, AMD's support apparently isn't entirely the same as what Intel is doing.

According to RPCS3 developers at least, it's a better implementation than the one on any currently available Intel CPU (except maybe Sapphire Rapids?), as it uses fewer clock cycles to issue the commands. AVX-512 itself is a very broad instruction set with several subsets, and no CPU currently supports its entire breadth of features; when we say AVX-512 colloquially, I believe that we're mostly referring to AVX-512F. Intel instead opted to backport some of the other AVX-512 extensions, such as AVX-512 VNNI (used for neural network workloads), to 256-bit vectors as AVX-VNNI, and that is a supported configuration on ADL/RPL-S CPUs. As far as I am aware, Zen 4's implementation is essentially as complete as it gets, only bested by Sapphire Rapids, which also supports the FP16 subset.
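Since "AVX-512" is really a family of subsets, it can help to see which ones a given CPU reports. The following is only a minimal sketch, assuming GCC or Clang on x86 (it relies on the compiler builtin `__builtin_cpu_supports`); the function name `avx512_subsets` and the flag names are made up for illustration, and only a handful of the many subsets are checked.

```c
/* Hypothetical flag names, one bit per AVX-512 subset checked below. */
enum {
    HAS_AVX512F  = 1u << 0,  /* Foundation */
    HAS_AVX512CD = 1u << 1,  /* Conflict Detection */
    HAS_AVX512BW = 1u << 2,  /* Byte and Word */
    HAS_AVX512DQ = 1u << 3,  /* Doubleword and Quadword */
    HAS_AVX512VL = 1u << 4   /* Vector Length (128/256-bit forms) */
};

/* Returns a bitmask of the subsets the running CPU reports.
   On non-x86 targets or other compilers it simply returns 0. */
unsigned avx512_subsets(void)
{
    unsigned mask = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* harmless if the runtime already did this */
    if (__builtin_cpu_supports("avx512f"))  mask |= HAS_AVX512F;
    if (__builtin_cpu_supports("avx512cd")) mask |= HAS_AVX512CD;
    if (__builtin_cpu_supports("avx512bw")) mask |= HAS_AVX512BW;
    if (__builtin_cpu_supports("avx512dq")) mask |= HAS_AVX512DQ;
    if (__builtin_cpu_supports("avx512vl")) mask |= HAS_AVX512VL;
#endif
    return mask;
}
```

A caller would test individual bits, for example `avx512_subsets() & HAS_AVX512F`, before taking an AVX-512 code path.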
 
Joined
Oct 6, 2021
Messages
1,605 (1.37/day)
Only on earlier samples. Newer Alder Lake CPUs have AVX-512 fused off, so most newer i9-12900K and almost all, if not all, i9-12900KS chips should be completely incapable of it. Raptor Lake doesn't have it to begin with.

I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long dismissed it as worthless... until Zen 4 supported it anyway, and then they forgot they held Linus Torvalds in the esteem of a God for saying that it should die.
AMD's version uses much less area, while still bringing most of the performance benefits of the instructions. Another point is that AVX-512 on Zen 4 does not consume much more energy or dissipate much more heat (forcing it to run at low clocks) the way it does on Intel CPUs.

So yes, it was a smart move by AMD.
 
Joined
Apr 8, 2010
Messages
1,012 (0.19/day)
Processor Intel Core i5 8400
Motherboard Gigabyte Z370N-Wifi
Cooling Silverstone AR05
Memory Micron Crucial 16GB DDR4-2400
Video Card(s) Gigabyte GTX1080 G1 Gaming 8G
Storage Micron Crucial MX300 275GB
Display(s) Dell U2415
Case Silverstone RVZ02B
Power Supply Silverstone SSR-SX550
Keyboard Ducky One Red Switch
Software Windows 10 Pro 1909
Joined
Oct 12, 2005
Messages
711 (0.10/day)
The only good thing about AVX-512 that I know of is PS3 emulation.....
Also, AMD's support apparently isn't entirely the same as what Intel is doing.

The thing with new instructions is that it takes a lot of time for their usage to become widespread. It took a few years for SSE*, and a few years for AVX*. It's going to be the same for AVX-512, even more so when you consider that Intel CPUs don't have it right now.

For example, no game developers would spend time right now trying to find ways to make their game run faster for just the few folks that have Tiger Lake and Zen 4 CPUs (or the very few that ran Alder Lake without the E-cores to get AVX-512 before Intel disabled it).

But that will come. We don't know how useful it will be until it starts being utilized.

As for instruction support, it's a hot mess. But mostly a hot mess on Intel side.


So AVX512 will be a future DLC?
They will probably add it again in a future generation when their E-cores (if they continue on that path) have it. The Intel implementation of AVX-512 takes a lot of silicon area. This is against the goal of E-cores being really small in die size.
 
Joined
Apr 16, 2013
Messages
549 (0.13/day)
Location
Bulgaria
System Name Black Knight | White Queen
Processor Intel Core i9-10940X (28 cores) | Intel Core i7-5775C (8 cores)
Motherboard ASUS ROG Rampage VI Extreme Encore X299G | ASUS Sabertooth Z97 Mark S (White)
Cooling Noctua NH-D15 chromax.black | Xigmatek Dark Knight SD-1283 Night Hawk (White)
Memory G.SKILL Trident Z RGB 4x8GB DDR4 3600MHz CL16 | Corsair Vengeance LP 4x4GB DDR3L 1600MHz CL9 (White)
Video Card(s) ASUS ROG Strix GeForce RTX 4090 OC | KFA2/Galax GeForce GTX 1080 Ti Hall of Fame Edition
Storage Samsung 990 Pro 2TB, 980 Pro 1TB, 850 Pro 256GB, 840 Pro 256GB, WD 10TB+ (incl. VelociRaptors)
Display(s) Dell Alienware AW2721D 240Hz| LG OLED evo C4 48" 144Hz
Case Corsair 7000D AIRFLOW (Black) | NZXT ??? w/ ASUS DRW-24B1ST
Audio Device(s) ASUS Xonar Essence STX | Realtek ALC1150
Power Supply Enermax Revolution 1250W 85+ | Super Flower Leadex Gold 650W (White)
Mouse Razer Basilisk Ultimate, Razer Naga Trinity | Razer Mamba 16000
Keyboard Razer Blackwidow Chroma V2 (Orange switch) | Razer Ornata Chroma
Software Windows 10 Pro 64bit
Nice, my Cascade Lake-X seems future-proof.
 
Joined
Jun 22, 2006
Messages
1,097 (0.16/day)
System Name Beaver's Build
Processor AMD Ryzen 9800X3D
Motherboard Asus TUF Gaming X670E Plus WiFi
Cooling Corsair H115i RGB PLATINUM 97 CFM Liquid
Memory G.SKILL Trident Z5 Neo DDR5-6000 CL30 RAM 32GB (2x16GB)
Video Card(s) NVIDIA GeForce RTX 4090 Founders Edition
Storage WD_BLACK 8TB SN850X NVMe
Display(s) Alienware AW3225QF 32" 4K 240 Hz OLED
Case Fractal Design Design Define R6 USB-C
Audio Device(s) Focusrite 2i4 USB Audio Interface
Power Supply SuperFlower LEADEX TITANIUM 1600W
Mouse Razer DeathAdder V2
Keyboard Corsair K70 RGB Pro
Software Microsoft Windows 11 Pro
Benchmark Scores 3dmark = https://www.3dmark.com/spy/51229598
Only on earlier samples. Newer Alder Lake CPUs have AVX-512 fused off, so most newer i9-12900K and almost all, if not all, i9-12900KS chips should be completely incapable of it. Raptor Lake doesn't have it to begin with.

I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long dismissed it as worthless... until Zen 4 supported it anyway, and then they forgot they held Linus Torvalds in the esteem of a God for saying that it should die.
Emulating AVX-512 using a combination of opcodes is possible but challenging. AVX-512 provides a wide range of vector instructions that can operate on large 512-bit vectors. Emulating such wide vector operations using 128-bit or 256-bit vectors requires decomposing the 512-bit vectors into smaller chunks and operating on them in multiple steps. This can significantly increase the code complexity…

One approach to emulate AVX-512 instructions is to use a combination of SSE and AVX2 instructions. This approach can achieve similar results to AVX-512 but with reduced performance. The following code snippet shows an example of how to emulate the 512-bit vector addition operation using SSE and AVX2 instructions:

#include <immintrin.h>

/* Emulates a 512-bit float addition (16 lanes) with four 128-bit SSE adds
   whose results are recombined into two 256-bit halves. */
void add_avx512_emulation(const float *a, const float *b, float *c, int n)
{
    int i;
    __m256 v0, v1, v2, v3;
    __m128 w0, w1, w2, w3;

    for (i = 0; i + 16 <= n; i += 16) {
        /* Unaligned loads, so a, b and c need no special alignment. */
        v0 = _mm256_loadu_ps(&a[i]);
        v1 = _mm256_loadu_ps(&a[i + 8]);
        v2 = _mm256_loadu_ps(&b[i]);
        v3 = _mm256_loadu_ps(&b[i + 8]);

        /* Step 1: add the four 128-bit quarters with SSE. */
        w0 = _mm_add_ps(_mm256_castps256_ps128(v0), _mm256_castps256_ps128(v2));
        w1 = _mm_add_ps(_mm256_extractf128_ps(v0, 1), _mm256_extractf128_ps(v2, 1));
        w2 = _mm_add_ps(_mm256_castps256_ps128(v1), _mm256_castps256_ps128(v3));
        w3 = _mm_add_ps(_mm256_extractf128_ps(v1, 1), _mm256_extractf128_ps(v3, 1));

        /* Step 2: recombine the quarters into two 256-bit halves and store. */
        v0 = _mm256_insertf128_ps(_mm256_castps128_ps256(w0), w1, 1);
        v1 = _mm256_insertf128_ps(_mm256_castps128_ps256(w2), w3, 1);
        _mm256_storeu_ps(&c[i], v0);
        _mm256_storeu_ps(&c[i + 8], v1);
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

In this example, each 512-bit vector is decomposed into four 128-bit vectors and processed in two steps. First, the 128-bit vectors are added using SSE instructions. Then, the resulting 128-bit vectors are recombined into two 256-bit halves, which together make up the 512-bit result; the 256-bit float intrinsics involved only require AVX, not AVX2. This emulation technique can be extended to other AVX-512 instructions as well.

Note that emulating AVX-512 instructions using a combination of SSE and AVX2 instructions can be useful in scenarios where AVX-512 support is not available.
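For comparison, here is a rough sketch of what the native version of the same addition could look like. It is an illustration rather than anything from the x86-simd-sort code: the function name `add_native` is made up, and it assumes the file is built with AVX-512F enabled (for example `-mavx512f` on GCC/Clang) for the 512-bit path to be compiled in; otherwise it silently falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i] for n floats. With AVX-512F enabled at compile time,
   each loop iteration handles 16 floats with a single 512-bit add. */
void add_native(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a && b && c));
#if defined(__AVX512F__)
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);   /* unaligned loads are fine */
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when AVX-512F is not enabled). */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The emulated version needs four loads, four adds, and several extract/insert steps per 16 floats, where this one needs two loads, one add and one store.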
 
Joined
Jan 3, 2021
Messages
3,605 (2.49/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
They will probably add it again in a future generation when their E-cores (if they continue on that path) have it. The Intel implementation of AVX-512 takes a lot of silicon area. This is against the goal of E-cores being really small in die size.
An E-core is capable of running code that contains AVX-512 instructions ... until it encounters one. When that happens, the core stops execution and raises an invalid-opcode exception. But it can also save its state like it does when normal task switching occurs, and then the scheduler could make the thread continue on a P-core. I'm not sure why Intel and MS haven't implemented that; maybe they will later. It's not trivial, I guess, but it's trivial compared to, for example, developing the immensely complex Intel Thread Director.
 
Joined
Jun 22, 2006
Messages
1,097 (0.16/day)
System Name Beaver's Build
Processor AMD Ryzen 9800X3D
Motherboard Asus TUF Gaming X670E Plus WiFi
Cooling Corsair H115i RGB PLATINUM 97 CFM Liquid
Memory G.SKILL Trident Z5 Neo DDR5-6000 CL30 RAM 32GB (2x16GB)
Video Card(s) NVIDIA GeForce RTX 4090 Founders Edition
Storage WD_BLACK 8TB SN850X NVMe
Display(s) Alienware AW3225QF 32" 4K 240 Hz OLED
Case Fractal Design Design Define R6 USB-C
Audio Device(s) Focusrite 2i4 USB Audio Interface
Power Supply SuperFlower LEADEX TITANIUM 1600W
Mouse Razer DeathAdder V2
Keyboard Corsair K70 RGB Pro
Software Microsoft Windows 11 Pro
Benchmark Scores 3dmark = https://www.3dmark.com/spy/51229598
@Dr. Dro

To increase the performance of an emulation technique that uses a combination of SSE and AVX2 instructions to emulate AVX-512, you can try the following techniques:

  1. Loop unrolling: Unrolling the loop can improve the performance by reducing the number of iterations and increasing the arithmetic intensity of the loop body. For example, you can unroll the loop by a factor of two or four, depending on the available registers and the data dependencies in the loop body.
  2. Memory alignment: Aligning data can improve the performance by avoiding loads and stores that straddle a cache-line boundary. Ensure that the data is aligned to the cache line size (64 bytes on current x86 CPUs) and that the load and store operations use aligned memory addresses.
  3. Code optimization: Code optimization can improve the performance by reducing the number of instructions executed. Techniques such as loop-invariant code motion, common subexpression elimination, and dead-code elimination remove redundant work and help keep the instruction pipeline busy.
  4. Processor-specific optimizations: Processor-specific optimizations can improve the performance by taking advantage of the specific features of the processor. For example, some processors have specialized instructions that can improve the performance of specific operations. By using these instructions, you can improve the performance of your emulation technique.
  5. Data format optimization: Data format optimization can improve the performance by using data formats that are optimized for the specific operations. For example, using a packed data format can reduce the number of instructions and improve the performance of vector operations.
Note that the above techniques can improve the performance of an emulation technique, but they may not be able to match the performance of the native AVX-512 instructions.
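To make the first two points a bit more concrete, here is a small sketch combining loop unrolling with cache-line alignment. The helper names `alloc_floats_aligned` and `add_unrolled` are invented for this example, it assumes a C11 library that provides `aligned_alloc`, and the 256-bit path is only compiled in when AVX is enabled (for example `-mavx`); without it the function is a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Allocates room for n floats on a 64-byte (cache-line) boundary.
   Release the buffer with free(). */
float *alloc_floats_aligned(size_t n)
{
    /* C11 aligned_alloc wants a size that is a multiple of the alignment. */
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    return aligned_alloc(64, bytes ? bytes : 64);
}

/* c[i] = a[i] + b[i]; all three buffers must be at least 32-byte aligned. */
void add_unrolled(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    assert((((uintptr_t)a | (uintptr_t)b | (uintptr_t)c) % 32) == 0);
#if defined(__AVX__)
    /* Unrolled by two: 16 floats per iteration, using aligned loads/stores. */
    for (; i + 16 <= n; i += 16) {
        __m256 lo = _mm256_add_ps(_mm256_load_ps(a + i),     _mm256_load_ps(b + i));
        __m256 hi = _mm256_add_ps(_mm256_load_ps(a + i + 8), _mm256_load_ps(b + i + 8));
        _mm256_store_ps(c + i,     lo);
        _mm256_store_ps(c + i + 8, hi);
    }
#endif
    for (; i < n; i++)  /* scalar tail */
        c[i] = a[i] + b[i];
}
```

Whether the unrolling actually helps depends on the compiler and the CPU; modern compilers often unroll such loops on their own, so it is worth measuring before and after.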
 
Joined
Dec 25, 2020
Messages
7,013 (4.81/day)
Location
São Paulo, Brazil
System Name "Icy Resurrection"
Processor 13th Gen Intel Core i9-13900KS Special Edition
Motherboard ASUS ROG Maximus Z790 Apex Encore
Cooling Noctua NH-D15S upgraded with 2x NF-F12 iPPC-3000 fans and Honeywell PTM7950 TIM
Memory 32 GB G.SKILL Trident Z5 RGB F5-6800J3445G16GX2-TZ5RK @ 7600 MT/s 36-44-44-52-96 1.4V
Video Card(s) ASUS ROG Strix GeForce RTX™ 4080 16GB GDDR6X White OC Edition
Storage 500 GB WD Black SN750 SE NVMe SSD + 4 TB WD Red Plus WD40EFPX HDD
Display(s) 55-inch LG G3 OLED
Case Pichau Mancer CV500 White Edition
Audio Device(s) Apple USB-C + Sony MDR-V7 headphones
Power Supply EVGA 1300 G2 1.3kW 80+ Gold
Mouse Microsoft Classic Intellimouse
Keyboard IBM Model M type 1391405 (distribución española)
Software Windows 11 IoT Enterprise LTSC 24H2
Benchmark Scores I pulled a Qiqi~
That's pretty cool stuff, CPUs are wonderfully programmable nowadays. I do have a concern though, wouldn't this cost a lot of CPU clock cycles vs. native support?
 
Joined
Jun 22, 2006
Messages
1,097 (0.16/day)
System Name Beaver's Build
Processor AMD Ryzen 9800X3D
Motherboard Asus TUF Gaming X670E Plus WiFi
Cooling Corsair H115i RGB PLATINUM 97 CFM Liquid
Memory G.SKILL Trident Z5 Neo DDR5-6000 CL30 RAM 32GB (2x16GB)
Video Card(s) NVIDIA GeForce RTX 4090 Founders Edition
Storage WD_BLACK 8TB SN850X NVMe
Display(s) Alienware AW3225QF 32" 4K 240 Hz OLED
Case Fractal Design Design Define R6 USB-C
Audio Device(s) Focusrite 2i4 USB Audio Interface
Power Supply SuperFlower LEADEX TITANIUM 1600W
Mouse Razer DeathAdder V2
Keyboard Corsair K70 RGB Pro
Software Microsoft Windows 11 Pro
Benchmark Scores 3dmark = https://www.3dmark.com/spy/51229598
That's pretty cool stuff, CPUs are wonderfully programmable nowadays. I do have a concern though, wouldn't this cost a lot of CPU clock cycles vs. native support?
Also, couldn't someone implement Code Morphing to address the AVX-512 feature gap like the Transmeta Crusoe did?

 
Joined
Jan 10, 2011
Messages
1,451 (0.28/day)
Location
[Formerly] Khartoum, Sudan.
System Name 192.168.1.1~192.168.1.100
Processor AMD Ryzen5 5600G.
Motherboard Gigabyte B550m DS3H.
Cooling AMD Wraith Stealth.
Memory 16GB Crucial DDR4.
Video Card(s) Gigabyte GTX 1080 OC (Underclocked, underpowered).
Storage Samsung 980 NVME 500GB && Assortment of SSDs.
Display(s) ViewSonic VA2406-MH 75Hz
Case Bitfenix Nova Midi
Audio Device(s) On-Board.
Power Supply SeaSonic CORE GM-650.
Mouse Logitech G300s
Keyboard Kingston HyperX Alloy FPS.
VR HMD A pair of OP spectacles.
Software Ubuntu 24.04 LTS.
Benchmark Scores Me no know English. What bench mean? Bench like one sit on?
I'm still not sold on AVX-512 being a necessity for consumer-segment chips. AMD fans long dismissed it as worthless... until Zen 4 supported it anyway, and then they forgot they held Linus Torvalds in the esteem of a God for saying that it should die.

I'd wager that most of contemporary academia runs consumer-grade hardware. Not all researchers/students have access to Xeon and Threadripper farms. Otoh, the amount of data to be processed is going nowhere but up.

Knowing Intel, I bet this is set up such that it chooses a non-AVX-512 code path for anything that isn't Intel, so this is going to run on... nothing that is current gen, I guess? Lol.

The libs -by default- would use non-AVX-512 implementations for all archs, Intel's included. Default baseline is way lower than that (SSE2, specifically).
Users can -and should- manually set which feature level to target (or enable native optimizations, which check what the CPU supports and enable features accordingly). That is quite common, afaik. Even your own code would throw errors if you used AVX2/512 functions without setting the appropriate flags in your compiler.

Most AVX512 subfeatures(?) are grouped into levels conforming to Intel's SKU generations, but that's mostly due to the fact that AMD wasn't offering any support for them. Skimming over the commit, I spotted a comment about a planned AVX512_ZEN4 grouping, so it's just a matter of time.
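On the point about compiler flags: GCC and Clang predefine macros such as `__SSE2__`, `__AVX2__` and `__AVX512F__` according to the flags used (`-mavx2`, `-mavx512f`, `-march=native`, ...), which is how code can tell what feature level it is being built for. A minimal sketch; the function name `compiled_simd_level` is made up, and this is not how NumPy's own build-time dispatch is implemented.

```c
/* Reports the widest x86 SIMD level this file was compiled for,
   based on macros that GCC and Clang predefine from the -m flags. */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline";
#endif
}
```

Built with default flags on x86-64 this would typically report SSE2, which matches the default baseline mentioned above; the same source built with `-march=native` on a Zen 4 or Tiger Lake machine would be expected to report AVX-512F.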
 
Joined
Dec 25, 2020
Messages
7,013 (4.81/day)
Location
São Paulo, Brazil
System Name "Icy Resurrection"
Processor 13th Gen Intel Core i9-13900KS Special Edition
Motherboard ASUS ROG Maximus Z790 Apex Encore
Cooling Noctua NH-D15S upgraded with 2x NF-F12 iPPC-3000 fans and Honeywell PTM7950 TIM
Memory 32 GB G.SKILL Trident Z5 RGB F5-6800J3445G16GX2-TZ5RK @ 7600 MT/s 36-44-44-52-96 1.4V
Video Card(s) ASUS ROG Strix GeForce RTX™ 4080 16GB GDDR6X White OC Edition
Storage 500 GB WD Black SN750 SE NVMe SSD + 4 TB WD Red Plus WD40EFPX HDD
Display(s) 55-inch LG G3 OLED
Case Pichau Mancer CV500 White Edition
Audio Device(s) Apple USB-C + Sony MDR-V7 headphones
Power Supply EVGA 1300 G2 1.3kW 80+ Gold
Mouse Microsoft Classic Intellimouse
Keyboard IBM Model M type 1391405 (distribución española)
Software Windows 11 IoT Enterprise LTSC 24H2
Benchmark Scores I pulled a Qiqi~
Also, couldn't someone implement Code Morphing to address the AVX-512 feature gap like the Transmeta Crusoe did?


Though this was many years ago, the Transmeta Crusoe and Efficeon processors weren't known for their high processing power, and eventually the Pentium III was able to close the power efficiency gap. Similar software exists today, provided by Intel: the SDE (Software Development Emulator), but its emulation tends to be extremely slow for unsupported instruction sets.
 
Joined
May 30, 2015
Messages
1,942 (0.56/day)
Location
Seattle, WA
The thing with new instructions is that it takes a lot of time for their usage to become widespread. It took a few years for SSE*, and a few years for AVX*. It's going to be the same for AVX-512.

Except AVX-512 is not new; it has already been around for several years with almost no benefit beyond ML training or emulation. The best we can hope for is that it stops being a massive waste of die space on consumer hardware as more efficient implementations are created, such as the one in Zen 4.

AVX-512 itself is a very broad instruction set with several subtypes, and no CPU currently supports its entire breadth of features

We almost had a CPU that supported it all, though Intel added a bunch of new extensions with Ice Lake that didn't make it into this chip because they didn't yet exist at the time it was designed.
 
Joined
Aug 30, 2006
Messages
7,223 (1.08/day)
System Name ICE-QUAD // ICE-CRUNCH
Processor Q6600 // 2x Xeon 5472
Memory 2GB DDR // 8GB FB-DIMM
Video Card(s) HD3850-AGP // FireGL 3400
Display(s) 2 x Samsung 204Ts = 3200x1200
Audio Device(s) Audigy 2
Software Windows Server 2003 R2 as a Workstation now migrated to W10 with regrets.
Die space optimization. Intel product strategy:

consumer: avx512 or additional cores? (Or E-cores)

avx512 is very powerful for edge case workloads

x consumer
 
Joined
Jan 14, 2023
Messages
842 (1.19/day)
System Name Asus G16
Processor i9 13980HX
Motherboard Asus motherboard
Cooling 2 fans
Memory 32gb 4800mhz
Video Card(s) 4080 laptop
Storage 16tb, x2 8tb SSD
Display(s) QHD+ 16in 16:10 (2560x1600, WQXGA) 240hz
Power Supply 330w psu
Joined
Jun 6, 2022
Messages
622 (0.67/day)
And AMD hilariously has put AVX 512 in their consumer CPUs now.
AVX 512F to be exact. Partial implementation.
Alder processors with the old Intel logo have full support, like Xeon.


The only good thing about AVX-512 that I know of is PS3 emulation.....
Also, AMD's support apparently isn't entirely the same as what Intel is doing.
When AVX 512F was implemented in Rocket Lake (11th gen), AMD fans frowned. We don't need it! Now, because AMD has it, I see the same fans abandoning Fortnite to update the NumPy Python library, a matter of life and death for a home user. It helps you turn on the vacuum cleaner, not burn the steak, keep your wife from finding out that you have a mistress, and much, much more.
 
Joined
Jun 22, 2006
Messages
1,097 (0.16/day)
System Name Beaver's Build
Processor AMD Ryzen 9800X3D
Motherboard Asus TUF Gaming X670E Plus WiFi
Cooling Corsair H115i RGB PLATINUM 97 CFM Liquid
Memory G.SKILL Trident Z5 Neo DDR5-6000 CL30 RAM 32GB (2x16GB)
Video Card(s) NVIDIA GeForce RTX 4090 Founders Edition
Storage WD_BLACK 8TB SN850X NVMe
Display(s) Alienware AW3225QF 32" 4K 240 Hz OLED
Case Fractal Design Design Define R6 USB-C
Audio Device(s) Focusrite 2i4 USB Audio Interface
Power Supply SuperFlower LEADEX TITANIUM 1600W
Mouse Razer DeathAdder V2
Keyboard Corsair K70 RGB Pro
Software Microsoft Windows 11 Pro
Benchmark Scores 3dmark = https://www.3dmark.com/spy/51229598
AVX 512F to be exact. Partial implementation.
Alder processors with the old Intel logo have full support, like Xeon.


When AVX 512F was implemented in Rocket Lake (11th gen), AMD fans frowned. We don't need it! Now, because AMD has it, I see the same fans abandoning Fortnite to update the NumPy Python library, a matter of life and death for a home user. It helps you turn on the vacuum cleaner, not burn the steak, keep your wife from finding out that you have a mistress, and much, much more.
Do you miss AVX-512?
 
Joined
Jun 6, 2022
Messages
622 (0.67/day)
Do you miss AVX-512?
The impact of these instructions is zero or negligible for home users. My 12500 supports AVX 512 but, having experience with the 11600K, I preferred the latest BIOS version (F21) and not F2 or older which unlocks these instructions.
No, I don't miss them.
 