
Intel to Disable Rudimentary AVX-512 Support on Alder Lake Processors

Joined
Jan 8, 2017
Messages
9,505 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
I bet they want it disabled just so that people can't run AVX-512 benchmarks that would expose even more laughable power consumption figures. Other than that, disabling it speeds up the validation process, and practically no consumer software needs AVX-512, so it's completely irrelevant whether it's there or not.

God, the PS3 is a RISC-based platform with an undocumented GPU. You can't just use OpenCL or CUDA to "emulate" a certain specific console and its hardware.
That doesn't really mean anything from an emulation standpoint; at the end of the day you still need to emulate more or less the same thing irrespective of the ISA. The reason you couldn't use CUDA or OpenCL is not because the CPU is RISC but because of the software that runs on those SPEs, which needs complex thread synchronization logic that simply can't be done on a GPU. The PS3 GPU is documented; it's just some run-of-the-mill GeForce 7000-series Nvidia architecture, nothing special there, so there is no point in trying to use anything other than OpenGL or any other graphics API.
 
D

Deleted member 24505

Guest
AVX-512 is no use for gaming anyway, so I don't give a hoot. Benches are just e-peen, nothing else. The only thing AVX-512 is any good for is serious apps. It seems the only reason to OC is for benches, as a high-performance CPU (5800X/5900X, ADL 12700K/12900K) is going to be fine for most gaming needs anyway.

Name one program or game not meant for professional use that uses AVX-512; I bet there aren't many, so disabling AVX-512 on ADL is going to have a very minimal impact on its performance in everyday apps anyway.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
AVX-512 is no use for gaming anyway, so I don't give a hoot.
Even beyond that, several compilers will avoid AVX-512 in favor of AVX-256 because of the issue with downclocking, which impacts all workloads occurring at the time that the CPU clocks down. The usefulness of AVX-512 completely depends on the workload and how well the task is vectorized. Even for situations that can use AVX-512, going with 256 or 128 might actually net you better performance depending on the application. From that perspective, the useful situations for AVX-512 are few and far between outside of the server and HEDT ecosystems. Fun fact: GCC will actually prefer 256-bit AVX for Skylake chips that support AVX-512, because you're less likely to benefit from 512 than from 256, which doesn't have the same downclocking issues. Also, depending on how well the task is vectorized, you may not need 512-bit either, so why slow down the CPU to use it?
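To make that concrete, here's a minimal sketch (file and function names are made up, assuming a reasonably recent GCC) of the kind of loop this affects, with the relevant flags in the comments:

```c
/* saxpy.c - hypothetical example of an auto-vectorizable loop.
 * gcc -O3 -march=skylake-avx512 -S saxpy.c
 *   -> 256-bit ymm code, because -mprefer-vector-width=256 is the default
 *      tuning for that target.
 * gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S saxpy.c
 *   -> 512-bit zmm code, if you explicitly ask for it. */
void saxpy(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* simple FMA loop the vectorizer handles well */
}
```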
 
Joined
Oct 23, 2020
Messages
671 (0.44/day)
Location
Austria
System Name nope
Processor I3 10100F
Motherboard ATM Gigabyte h410
Cooling Arctic 12 passive
Memory ATM Gskill 1x 8GB NT Series (No Heatspreader bling bling garbage, just Black DIMMS)
Video Card(s) Sapphire HD7770 and EVGA GTX 470 and Zotac GTX 960
Storage 120GB OS SSD, 240GB M2 Sata, 240GB M2 NVME, 300GB HDD, 500GB HDD
Display(s) Nec EA 241 WM
Case Coolermaster whatever
Audio Device(s) Onkyo on TV and Mi Bluetooth on Screen
Power Supply Super Flower Leadx 550W
Mouse Steelseries Rival Fnatic
Keyboard Logitech K270 Wireless
Software Deepin, BSD and 10 LTSC
Mhm, if anyone needs AVX-512, it's better to get a Rocket Lake than an Alder Lake; Intel seems to be really stupid these days.

PS: i3 processors had ECC support until 9th gen, but nobody bothered (motherboard makers dropped it because there was zero market for it). I personally do think it would be good to have, but apparently most users think otherwise (they are probably enthusiasts like me and want faster RAM, which is a lot harder to do with ECC). :)
Nope, Intel cancelled ECC on the i3 because it offered 4 cores / 8 threads for about €72.
Now you have to buy the 4-core / 8-thread Xeon for about €200 to get it.
 

qubit

Overclocked quantum bit
Joined
Dec 6, 2007
Messages
17,865 (2.87/day)
Location
Quantum Well UK
System Name Quantumville™
Processor Intel Core i7-2700K @ 4GHz
Motherboard Asus P8Z68-V PRO/GEN3
Cooling Noctua NH-D14
Memory 16GB (2 x 8GB Corsair Vengeance Black DDR3 PC3-12800 C9 1600MHz)
Video Card(s) MSI RTX 2080 SUPER Gaming X Trio
Storage Samsung 850 Pro 256GB | WD Black 4TB | WD Blue 6TB
Display(s) ASUS ROG Strix XG27UQR (4K, 144Hz, G-SYNC compatible) | Asus MG28UQ (4K, 60Hz, FreeSync compatible)
Case Cooler Master HAF 922
Audio Device(s) Creative Sound Blaster X-Fi Fatal1ty PCIe
Power Supply Corsair AX1600i
Mouse Microsoft Intellimouse Pro - Black Shadow
Keyboard Yes
Software Windows 10 Pro 64-bit
Intel are such spoilsports. AMD need to come up with a CPU that does support these instructions, performs better, and comes at a cheaper price than the equivalent Alder Lake, to give Intel a good kicking over it.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Intel are such spoilsports. AMD need to come up with a CPU that does support these instructions, performs better, and comes at a cheaper price than the equivalent Alder Lake, to give Intel a good kicking over it.
AMD probably isn't implementing it because they know how much of a niche instruction set it is and how it comes with some very real drawbacks as I mentioned above. AVX-512 isn't a magical solution to everything vector-related and people need to stop treating it that way. A bit flipping once a year is probably still more often than most people are running software that'd benefit from AVX-512 over 256.
 
  • Like
Reactions: bug
Joined
Dec 26, 2006
Messages
3,862 (0.59/day)
Location
Northern Ontario Canada
Processor Ryzen 5700x
Motherboard Gigabyte X570S Aero G R1.1 BiosF5g
Cooling Noctua NH-C12P SE14 w/ NF-A15 HS-PWM Fan 1500rpm
Memory Micron DDR4-3200 2x32GB D.S. D.R. (CT2K32G4DFD832A)
Video Card(s) AMD RX 6800 - Asus Tuf
Storage Kingston KC3000 1TB & 2TB & 4TB Corsair MP600 Pro LPX
Display(s) LG 27UL550-W (27" 4k)
Case Be Quiet Pure Base 600 (no window)
Audio Device(s) Realtek ALC1220-VB
Power Supply SuperFlower Leadex V Gold Pro 850W ATX Ver2.52
Mouse Mionix Naos Pro
Keyboard Corsair Strafe with browns
Software W10 22H2 Pro x64
Intel..........love me a ton of market segmentation
 
Joined
Sep 17, 2014
Messages
22,673 (6.05/day)
Location
The Washing Machine
System Name Tiny the White Yeti
Processor 7800X3D
Motherboard MSI MAG Mortar b650m wifi
Cooling CPU: Thermalright Peerless Assassin / Case: Phanteks T30-120 x3
Memory 32GB Corsair Vengeance 30CL6000
Video Card(s) ASRock RX7900XT Phantom Gaming
Storage Lexar NM790 4TB + Samsung 850 EVO 1TB + Samsung 980 1TB + Crucial BX100 250GB
Display(s) Gigabyte G34QWC (3440x1440)
Case Lian Li A3 mATX White
Audio Device(s) Harman Kardon AVR137 + 2.1
Power Supply EVGA Supernova G2 750W
Mouse Steelseries Aerox 5
Keyboard Lenovo Thinkpad Trackpoint II
VR HMD HD 420 - Green Edition ;)
Software W11 IoT Enterprise LTSC
Benchmark Scores Over 9000
Dude, it makes CPUs hot when people actually use it, we can't have that :p, how could it be good? Massive sarcasm and jest; I agree with your points.

I do like Intel leaning on this RUDIMENTARY (ahaha, my arse) statement. The information gleaned from the web makes them seem disingenuous about this; the E-cores had none, but the P-cores have third-gen AVX-512, no?! And that's "rudimentary"... whatever, Intel.

They're just after segmentation again, the gits.

It's a double win: they're also not having to deal with the gamur sentiment or ('enthusiasts') that try desperately to run AVX-512 on top of their 5 GHz all-core OC on a measly AIO, and the fallout that would generate, because I reckon some nuclear plants might explode in the process. Next thing you know they'll tell you not to OC your K-CPUs (oh wait... :D)

This is just Intel being Intel: profit winning over quality, and long-term planning overthrown by investor panic. It screams of arrogance regained. So much for that rebranding they did just now. Still Intel Inside.

Honestly, I'm staying far away from this crap; as long as I see these shenanigans, not a penny from this wallet.
 
Joined
Nov 18, 2010
Messages
7,595 (1.48/day)
Location
Rīga, Latvia
System Name HELLSTAR
Processor AMD RYZEN 9 5950X
Motherboard ASUS Strix X570-E
Cooling 2x 360 + 280 rads. 3x Gentle Typhoons, 3x Phanteks T30, 2x TT T140 . EK-Quantum Momentum Monoblock.
Memory 4x8GB G.SKILL Trident Z RGB F4-4133C19D-16GTZR 14-16-12-30-44
Video Card(s) Sapphire Pulse RX 7900XTX. Water block. Crossflashed.
Storage Optane 900P[Fedora] + WD BLACK SN850X 4TB + 750 EVO 500GB + 1TB 980PRO+SN560 1TB(W11)
Display(s) Philips PHL BDM3270 + Acer XV242Y
Case Lian Li O11 Dynamic EVO
Audio Device(s) SMSL RAW-MDA1 DAC
Power Supply Fractal Design Newton R3 1000W
Mouse Razer Basilisk
Keyboard Razer BlackWidow V3 - Yellow Switch
Software FEDORA 41
If Torvalds wants better products than Intel and AVX-512, then he shouldn't "hope" for the death of new instructions; he should hope for competition like the Apple M1 instead, which shows Intel (and Nvidia) how inefficient their stuff really is. True competition is our only hope against these monsters with their prices and segmentation techniques, not death wishes on instructions.

His concern is building and maintaining kernel code that has already grown fat, and keeping it slim enough and rational. He has seen it all in the x86 branch and knows how to make it happen with his own efforts.

Our hardware is actually quite good; it's the code for it that could be better and more low-level. What I am implying is that introducing yet another crutch-like instruction set to x86 won't make the end code smaller and more efficient for us desktop users. If some code monkeys start to use this instruction set just for fashion, as a hipster trend, then bad things usually happen. In this case, if software triggers these instructions and feeds too much data through the long instruction pipe, the chip heats up more because of the much longer peak execution. Remember the early Intel burn-test programs that introduced AVX: they were literally FurMark-class, showing temps you never ever see in your daily usage. If games triggered it the same way, it would be 'rad': FPS would tank due to the single-core frequency decrease for the main render thread.

That PS3 example is just a rare exception. He's trying to hammer nails with his shoe by using AVX-512, just because it's lazier, and for what, the few people using Xeons now? There was an older instruction set the emulator could use, but that was omitted due to HW bugs it accumulated on almost every Intel CPU arch over time, yet it wasn't bashed around in the media as much... so YOLO.

We are lucky that there are some harsh code maintainers who tame down some snowflakes by introducing limitations and automatic optimizations in compilers, like Aquinus said.
 
D

Deleted member 24505

Guest
It's a double win: they're also not having to deal with the gamur sentiment or ('enthusiasts') that try desperately to run AVX-512 on top of their 5 GHz all-core OC on a measly AIO, and the fallout that would generate, because I reckon some nuclear plants might explode in the process. Next thing you know they'll tell you not to OC your K-CPUs (oh wait... :D)

This is just Intel being Intel: profit winning over quality, and long-term planning overthrown by investor panic. It screams of arrogance regained. So much for that rebranding they did just now. Still Intel Inside.

Honestly, I'm staying far away from this crap; as long as I see these shenanigans, not a penny from this wallet.

If I was loaded, every time a better CPU came out, whoever it was from, I would have one, whether it was AMD or Intel, as the only thing that matters is performance. I'm sure if you are loaded enough, you can build a rig good enough to cool even a 500 W CPU. Heat or power use shouldn't matter to anyone, as long as you can cool it.
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
AVX-512 is no use for gaming anyway, so I don't give a hoot. Benches are just e-peen, nothing else. The only thing AVX-512 is any good for is serious apps. It seems the only reason to OC is for benches, as a high-performance CPU (5800X/5900X, ADL 12700K/12900K) is going to be fine for most gaming needs anyway.

Name one program or game not meant for professional use that uses AVX-512; I bet there aren't many, so disabling AVX-512 on ADL is going to have a very minimal impact on its performance in everyday apps anyway.
It will probably be some time before games start to utilize AVX-512, but there are certainly games which use AVX2, so don't think SIMD isn't useful for games. It's not something that directly benefits FPS though, so if a game relies heavily on SIMD, it usually means it's a minimum requirement, or you have to sacrifice game features if you don't have it.

Even beyond that, several compilers will avoid AVX-512 in favor of AVX-256 because of the issue with downclocking, which impacts all workloads occurring at the time that the CPU clocks down.
Will they? The big three have all offered support for some time.
And compilers don't have a will of their own to decide which ISA to use; that's specified by the developer.

The downclocking argument is 100% BS and you know it. Even if the core runs a few hundred MHz lower, it will still churn through more data, so this is just nonsense.

The usefulness of AVX-512 completely depends on the workload and how well the task is vectorized. Even for situations that can use AVX-512, going with 256 or 128 might actually net you better performance depending on the application.
Most vectorized data which benefits from SIMD is larger than 512 bits (64 bytes); 512 bits is tiny.
In fact, using a vector size of 512 bits is genius, as it perfectly matches the cache line size of current x86 implementations.
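To illustrate the cache-line point, a minimal sketch (assuming AVX-512F, a 64-byte-aligned array and a length that's a multiple of 16; names are illustrative): each 512-bit load consumes exactly one 64-byte cache line per iteration.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sums n floats, 16 at a time. With x 64-byte aligned, every _mm512_load_ps
 * reads exactly one full cache line. Compile with e.g. -O2 -mavx512f. */
float sum512(const float *x, size_t n)   /* assumes n % 16 == 0 */
{
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16)               /* 16 floats = 64 bytes */
        acc = _mm512_add_ps(acc, _mm512_load_ps(x + i));
    return _mm512_reduce_add_ps(acc);                /* horizontal sum of 16 lanes */
}
```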

Intel are such spoilsports. AMD need to come up with a CPU that does support these instructions, performs better, and comes at a cheaper price than the equivalent Alder Lake, to give Intel a good kicking over it.
I wish AMD would come up with a CPU with AVX-512 support which kicks ass, like 4x FMAs and better energy efficiency. It would be a serious powerhouse.

What I am implying is that introducing yet another crutch-like instruction set to x86 won't make the end code smaller and more efficient for us desktop users.
Actually, optimized AVX code is smaller and more cache efficient, not to mention it eliminates a lot of branching, looping, register shuffling and loads/stores. If the computational density is high enough, it offers orders of magnitude higher performance. But not all code is that computationally dense, and much of the kernel code probably isn't.
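As a rough sketch of the "less branching and looping" point (AVX2 here, illustrative names, n assumed to be a multiple of 8): the scalar version branches on every element, while the vector version processes eight elements per iteration with no branch in the loop body.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar equivalent, one compare-and-branch per element:
 *   for (size_t i = 0; i < n; i++) out[i] = (x[i] > t) ? x[i] : 0.0f;
 * Vector version: the comparison produces a per-lane mask (all-ones where
 * x[i] > t), and the AND keeps or zeroes each lane without branching.
 * Compile with e.g. -O2 -mavx2. */
void threshold8(const float *x, float *out, size_t n, float t)
{
    __m256 vt = _mm256_set1_ps(t);
    for (size_t i = 0; i < n; i += 8) {
        __m256 v    = _mm256_loadu_ps(x + i);
        __m256 keep = _mm256_cmp_ps(v, vt, _CMP_GT_OQ);
        _mm256_storeu_ps(out + i, _mm256_and_ps(v, keep));
    }
}
```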

If some code monkeys start to use this instruction set just for fashion, as a hipster trend, then bad things usually happen.
First of all, SIMD is used to some degree in many applications. I'm pretty sure you use it every day. Video playback, compression, web browsing (both compression and encryption), video editing, photo editing, etc. all use AVX/AVX2 or SSE. Without it, many of these things would be dreadfully slow. When popular applications start to get good AVX-512 support, you will not want to be left behind.

In this case, if software triggers these instructions and feeds too much data through the long instruction pipe, the chip heats up more because of the much longer peak execution.
What on earth makes you come up with a claim like that? Stop embarrassing yourself.
Most AVX operations take a few clock cycles, and the work done is equivalent to filling up the pipeline many times.

Remember the early Intel burn-test programs that introduced AVX: they were literally FurMark-class, showing temps you never ever see in your daily usage. If games triggered it the same way, it would be 'rad': FPS would tank due to the single-core frequency decrease for the main render thread.
That's not how throttling works at all. This is utter nonsense.

That PS3 example is just a rare exception. He's trying to hammer nails with his shoe by using AVX-512, just because it's lazier, and for what, the few people using Xeons now?
FYI, AVX-512 is supported by Ice Lake, Tiger Lake, Rocket Lake, Cascade Lake-X and Skylake-X, so not just Xeons. ;)

I think most (if not all) of you have missed the biggest advantage of AVX-512. It's not just AVX2 with double the vector size; it's vastly more flexible and has a better instruction encoding scheme. It's much more than just simple FP add/sub/mul/div operations; it will actually allow previously unseen efficiency when implementing dense algorithms, e.g. encoding, encryption, compression, etc., with an efficiency coming relatively close to ASICs.
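A small sketch of the flexibility point (AVX-512F assumed, names made up): the mask registers let the same instructions handle the ragged tail of an array, so there's no separate scalar clean-up loop like you'd typically write with AVX2.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scales x[0..n) in place by a. The main loop handles full 16-float chunks;
 * the leftover 1..15 elements are handled by the same multiply, just with a
 * mask so the out-of-range lanes are never loaded or stored.
 * Compile with e.g. -O2 -mavx512f. */
void scale(float *x, size_t n, float a)
{
    __m512 va = _mm512_set1_ps(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(x + i, _mm512_mul_ps(_mm512_loadu_ps(x + i), va));
    if (i < n) {
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);   /* low (n-i) lanes */
        __m512 v = _mm512_maskz_loadu_ps(m, x + i);        /* masked lanes = 0 */
        _mm512_mask_storeu_ps(x + i, m, _mm512_mul_ps(v, va));
    }
}
```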
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Will they? The big three have all offered support for some time.
And compilers don't have a will of their own to decide which ISA to use; that's specified by the developer.
Stop. `gcc -march=skylake-avx512` defaults to `-mprefer-vector-width=256`. You can override it, but that's the default out of the box and there is a reason for it.
Most vectorized data which benefits from SIMD is larger than 512 bits (64 bytes); 512 bits is tiny.
In fact, using a vector size of 512 bits is genius, as it perfectly matches the cache line size of current x86 implementations.
Except it comes with the very real drawback that it slows down everything that isn't SIMD. It's great for servers and HEDT, but it sucks for consumer hardware. I'm not saying that AVX-512 is useless. I'm saying for the average user, it's useless.
 
Joined
Feb 1, 2019
Messages
3,667 (1.70/day)
Location
UK, Midlands
System Name Main PC
Processor 13700k
Motherboard Asrock Z690 Steel Legend D4 - Bios 13.02
Cooling Noctua NH-D15S
Memory 32 Gig 3200CL14
Video Card(s) 4080 RTX SUPER FE 16G
Storage 1TB 980 PRO, 2TB SN850X, 2TB DC P4600, 1TB 860 EVO, 2x 3TB WD Red, 2x 4TB WD Red
Display(s) LG 27GL850
Case Fractal Define R4
Audio Device(s) Soundblaster AE-9
Power Supply Antec HCG 750 Gold
Software Windows 10 21H2 LTSC
I bet they want it disabled just so that people can't run AVX-512 benchmarks that would expose even more laughable power consumption figures. Other than that, disabling it speeds up the validation process, and practically no consumer software needs AVX-512, so it's completely irrelevant whether it's there or not.


That doesn't really mean anything from an emulation standpoint; at the end of the day you still need to emulate more or less the same thing irrespective of the ISA. The reason you couldn't use CUDA or OpenCL is not because the CPU is RISC but because of the software that runs on those SPEs, which needs complex thread synchronization logic that simply can't be done on a GPU. The PS3 GPU is documented; it's just some run-of-the-mill GeForce 7000-series Nvidia architecture, nothing special there, so there is no point in trying to use anything other than OpenGL or any other graphics API.
I agree. I think it was only enabled for marketing wins on benchmarks that enable AVX-512, but then the realisation kicked in that the power numbers were overcoming any positive PR, hence the new marketing decision to disable it.
 
Joined
Jan 14, 2019
Messages
12,577 (5.80/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Bazzite (Fedora Linux) KDE
Apparently you've missed my criticism of Intel, AMD, AND Nvidia for pushing their parts out of the efficiency sweet spot (and eliminating OC headroom for us) for years now.

Power use can easily be limited to one number. Mobile parts, T-series parts, and even normal desktop parts have their power draw limited. In fact, such limits are a thing according to Intel. Intel, however, is very mushy on the actual limits of PL2/3 power draw and the time limits as well, things that should be enforced by default and then turned off for OC, not the other way around. Most importantly, they need to be consistent, as right now all these board makers can be "in spec" yet have wildly different power draws and time limits.

This wasn't an issue before the boost wars; boost timing and power draw limits were pretty clear in the Nehalem/Sandy Bridge era. AMD today is still more stringent on how much juice Ryzen can pull to boost. Intel has been playing fast and loose for years, and it's a headache to keep track of.
I agree, except that AMD is just as much of a headache as Intel is, with their TDP meaning nothing, and PPT being a totally different and undisclosed value.

But then, neither is a headache if you know what PL1/2 or PPT means and how to change them in the BIOS.
 
Joined
Nov 18, 2010
Messages
7,595 (1.48/day)
Location
Rīga, Latvia
System Name HELLSTAR
Processor AMD RYZEN 9 5950X
Motherboard ASUS Strix X570-E
Cooling 2x 360 + 280 rads. 3x Gentle Typhoons, 3x Phanteks T30, 2x TT T140 . EK-Quantum Momentum Monoblock.
Memory 4x8GB G.SKILL Trident Z RGB F4-4133C19D-16GTZR 14-16-12-30-44
Video Card(s) Sapphire Pulse RX 7900XTX. Water block. Crossflashed.
Storage Optane 900P[Fedora] + WD BLACK SN850X 4TB + 750 EVO 500GB + 1TB 980PRO+SN560 1TB(W11)
Display(s) Philips PHL BDM3270 + Acer XV242Y
Case Lian Li O11 Dynamic EVO
Audio Device(s) SMSL RAW-MDA1 DAC
Power Supply Fractal Design Newton R3 1000W
Mouse Razer Basilisk
Keyboard Razer BlackWidow V3 - Yellow Switch
Software FEDORA 41
FYI, AVX-512 is supported by Ice Lake, Tiger Lake, Rocket Lake, Cascade Lake-X and Skylake-X, so not just Xeons. ;)

None of your points are valid; everyone here is saying the same thing. AVX-512 for consumers is not needed; its workload produces more heat and slows down the system.

Rocket Lake is the only exception there; the others aren't desktop platforms either, the latter being rebadged Xeons with a cut-down feature set, just to tax you even more. You have to pay extra for ECC support by choosing a Xeon, while there is literally nothing that stops it from working on the Skylake-X etc. parts. Intel being Intel.
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Except it comes with the very real drawback that it slows down everything that isn't SIMD. It's great for servers and HEDT, but it sucks for consumer hardware. I'm not saying that AVX-512 is useless. I'm saying for the average user, it's useless.
No, it does not slow down everything else. Any time a core throttles from heavy use of AVX-512, the performance gained from it will greatly outweigh the minor downclock.

None of your points are valid; everyone here is saying the same thing. AVX-512 for consumers is not needed; its workload produces more heat and slows down the system.
No, it does not slow down the system. This is complete nonsense.

Rocket Lake is the only exception there; the others aren't desktop platforms either, the latter being rebadged Xeons with a cut-down feature set, just to tax you even more. You have to pay extra for ECC support by choosing a Xeon, while there is literally nothing that stops it from working on the Skylake-X etc. parts. Intel being Intel.
Ice Lake-U/-Y and Tiger Lake-U/-Y/-H are not Xeons; these are high-volume consumer products.
Cascade Lake-X and Skylake-X exist as non-Xeons.

All S-series CPUs from Intel share dies with Xeons.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
No, it does not slow down the system. This is complete nonsense.
Yes, it does. FP AVX-512 is the worst offender. This reply on StackOverflow describes what's going on pretty well. The only thing that's complete nonsense is how you're pushing on this so hard when this is a very easy thing to validate.


In summary:
Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run into the L1 license. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Yes, it does. FP AVX-512 is the worst offender. This reply on StackOverflow describes what's going on pretty well. The only thing that's complete nonsense is how you're pushing on this so hard when this is a very easy thing to validate.
Once again, you clearly demonstrate that you don't understand the subject.
AVX-512 instructions work on twice as much data as AVX2 instructions, and 16 times as much as scalar FP32 instructions. So even if a CPU has to drop the clock speed a little bit and there are a few scalar instructions in between the AVX operations, the total throughput is still better. These CPUs constantly scale the core clocks individually. On top of that, using vector operations reduces stress on the instruction cache and eliminates a lot of register shuffling and instructions for control flow, which also means there will be fewer scalar operations to be performed. This in turn simplifies the workload for the CPU, resulting in more work completed even though fewer instructions are executed. And contrary to popular opinion, the purpose of a CPU is to execute work, not run at the highest clock speed!
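As a rough, purely illustrative calculation (the clock figures are made up just to show the shape of the trade-off): a core retiring one 8-wide AVX2 FMA per cycle at 2.4 GHz works through about 19.2 billion floats per second, while the same core retiring one 16-wide AVX-512 FMA per cycle at a throttled 1.9 GHz works through about 30.4 billion. The wider unit still comes out well ahead despite the lower clock, provided the loop actually keeps the vector unit busy.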

The fact that Skylake-SP throttles more than desired is an implementation issue, not an ISA issue. And it doesn't make AVX-512 a bad feature, it just reduces the advantage of it.
 

bug

Joined
May 22, 2015
Messages
13,843 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth. I know some encoders will use AVX512, but that's all I know of.

AVX-512's biggest problem (as I see it) is that the die area can't be justified anymore in times when everybody is fighting for fab capacity.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Once again, you clearly demonstrate that you don't understand the subject.
AVX-512 instructions work on twice as much data as AVX2 instructions, and 16 times as much as scalar FP32 instructions. So even if a CPU has to drop the clock speed a little bit and there are a few scalar instructions in between the AVX operations, the total throughput is still better. These CPUs constantly scale the core clocks individually. On top of that, using vector operations reduces stress on the instruction cache and eliminates a lot of register shuffling and instructions for control flow, which also means there will be fewer scalar operations to be performed. This in turn simplifies the workload for the CPU, resulting in more work completed even though fewer instructions are executed. And contrary to popular opinion, the purpose of a CPU is to execute work, not run at the highest clock speed!

The fact that Skylake-SP throttles more than desired is an implementation issue, not an ISA issue. And it doesn't make AVX-512 a bad feature, it just reduces the advantage of it.
Sure, if your workload is purely vector operations. That's not a realistic workload for most applications, even more so in the consumer space. No application has only AVX instructions sans anything else. As that StackOverflow article mentioned, it depends on how much you're using these vector units. Even for the L1 license there is a cost that needs to be considered, never mind the hit at L2, which is far more pronounced.
@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth. I know some encoders will use AVX512, but that's all I know of.

AVX-512's biggest problem (as I see it) is that the die area can't be justified anymore in times when everybody is fighting for fab capacity.
At least one person understands what I'm saying. Hell, even GCC opts for AVX-256 on chips that support 512, because the cost (most of the time) isn't worth it. If it was such a magical solution, it should be the preferred default, but it's not, for this reason.

Look, I'm not saying AVX-512 is useless or bad. I'm just saying it's not the magic bullet you're making it out to be, @efikkan. There are plenty of cases where it's not an effective strategy, and a lot of the time you're better off sticking with something like AVX-256 instead, because the clock penalty is very real for these heavy instructions.
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
@efikkan That's not what @Aquinus argues. He argues that when AVX512 is in use, more heat is produced and thus all cores have to downclock. Your AVX512 workload may finish faster, but everything else will be slower.
That wouldn't happen unless the CPU reaches its thermal limit, and keep in mind that entails putting heavy AVX-512 loads on most if not all cores.
And regardless, the heavy load finishing quicker means more time and cycles free for anything else.
Still, none of these are ISA issues. Ice Lake-SP is able to sustain much better clocks with heavy AVX loads, and Sapphire Rapids will do it even better.

He's technically correct, except AVX512 workloads in the consumer space are as rare as hen's teeth.
That's a separate subject. And yes, pretty much non-existent in the consumer space.

AVX-512's biggest problem (as I see it) is that the die area can't be justified anymore in times when everybody is fighting for fab capacity.
Really? And what kind of alternative would you propose to advance CPU throughput?

Sure, if your workload is purely vector operations. That's not a realistic workload for most applications, even more so in the consumer space. No application has only AVX instructions sans anything else.
The application as a whole is irrelevant; in most cases >98% of application code is not performance-critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector operations and scalar operations, and mixes them, of course.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
That wouldn't happen unless the CPU reaches its thermal limit, and keep in mind that entails putting heavy AVX-512 loads on most if not all cores.
And regardless, the heavy load finishing quicker means more time and cycles free for anything else.
Still, none of these are ISA issues. Ice Lake-SP is able to sustain much better clocks with heavy AVX loads, and Sapphire Rapids will do it even better.
Clock speeds decrease as more cores use AVX-512, regardless of the CPU's thermal state. What you just described is not how Intel processors work with heavy instructions that hit the L2 license. Even with one core you have reduced clocks, but it gets a lot worse the more cores you use. In my example from StackOverflow, you can see that with the Xeon Gold 5120, by the time you're at 5 cores running heavy AVX-512 instructions, you're down to 1.9 GHz. That has nothing to do with thermal throttling and everything to do with how Intel handles L1 and L2 licenses.
In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector operations and scalar operations, and mixes them, of course.
That's kind of my point. The clock speed hit is very real even if it's just for a few instructions, and that impacts everything else until the CPU switches back to L0 or L1, which takes time. You need to actually read that article I sent, because it explains all of this. The only time AVX-512 is going to shine is if the majority of the work being done can be vectorized, not if it's sprinkled throughout your application. That's actually the case where AVX-256 is far more advantageous. So thank you for proving my point...
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
The only time AVX-512 is going to shine is if the majority of the work being done can be vectorized, not if it's sprinkled throughout your application. That's actually the case where AVX-256 is far more advantageous. So thank you for proving my point...
If you seriously think I proved your point, then you don't understand the subject at all :facepalm:
Let's examine what I said:
The application as a whole is irrelevant; in most cases >98% of application code is not performance-critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector operations and scalar operations, and mixes them, of course.

In case it wasn't clear enough: it's the performance-critical code which does all the real computational work. It may be a small portion of the total code base, but it's the code that runs for the majority of the CPU time. That's why optimizing the performance-critical code is what matters. Those who know the first thing about optimizing code know that the most important type of optimizations are cache optimizations, divided into data cache (1) and instruction cache (2) optimizations. This is important because failing to do so results in lots of cache misses, and the cost of a cache miss on current x86 CPUs is ~450 clocks, which roughly means each cache miss costs you ~1000-2000+ instructions. And how do you solve this? By packing the data tight, which means it's vectorized. Then you have the instruction cache (2), which has to do with the use of function calls, data locality and computational density (avoiding bloat and extra branching is implied here too). So again, packing the data tight and packing the computational code tight is the key to performance.
So in conclusion, if your code is performant at all, the data will have to be laid out in vectors, the data will have to be traversed linearly, and the code had better have good computational density, because otherwise the CPU time will be spent on cache misses, branch mispredictions etc. instead of doing real work. So if you can put two and two together, you'll see that this is also the groundwork for using SIMD. And any code that works on vectors >32 bytes (most are much larger) will benefit from using AVX-512 over AVX2.
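To tie the "pack the data tight and traverse it linearly" point to something concrete, here is a purely illustrative sketch (structure and names made up, AVX-512F assumed, tail handling omitted): a structure-of-arrays layout where the hot loop walks each field contiguously, so every cache line it pulls in is fully used.

```c
#include <immintrin.h>
#include <stddef.h>

/* SoA layout: position i and velocity i live in separate, contiguous arrays,
 * so the loop below streams through memory linearly, 64 bytes at a time.
 * Compile with e.g. -O2 -mavx512f. */
struct particles {
    float *xs;   /* positions  */
    float *vs;   /* velocities */
};

void integrate(struct particles *p, size_t n, float dt)
{
    __m512 vdt = _mm512_set1_ps(dt);
    for (size_t i = 0; i + 16 <= n; i += 16) {       /* 16 floats per iteration */
        __m512 x = _mm512_loadu_ps(p->xs + i);
        __m512 v = _mm512_loadu_ps(p->vs + i);
        _mm512_storeu_ps(p->xs + i, _mm512_fmadd_ps(v, vdt, x));  /* x += v*dt */
    }
}
```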
 
Joined
Jan 8, 2017
Messages
9,505 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Wide SIMD support is and will always remain counterproductive and nonsensical from a practical point of view, even in the datacenter space. Whatever can be improved by a wider SIMD ISA can simply be delegated to a GPU: stuff like ML, video encoding/decoding, etc.

I can't think of any application that would benefit from higher throughput in terms of vector processing that wouldn't also be worthwhile to implement on a GPU.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.79/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
If you seriously think I proved your point, then you don't understand the subject at all :facepalm:
Let's examine what I said:
The application as a whole is irrelevant; in most cases >98% of application code is not performance-critical at all.
What matters is the code in performance-critical paths, and if it is computationally dense it is generally looping over some vectorized data performing some kind of operation. In such cases, most of the actual work can be done in vector operations. The rest is then mostly control flow, shuffling data, etc. In most cases where AVX is used, it's only a few lines of code in a handful of functions, and the CPU seamlessly switches between vector operations and scalar operations, and mixes them, of course.

In case it wasn't clear enough: it's the performance-critical code which does all the real computational work. It may be a small portion of the total code base, but it's the code that runs for the majority of the CPU time. That's why optimizing the performance-critical code is what matters. Those who know the first thing about optimizing code know that the most important type of optimizations are cache optimizations, divided into data cache (1) and instruction cache (2) optimizations. This is important because failing to do so results in lots of cache misses, and the cost of a cache miss on current x86 CPUs is ~450 clocks, which roughly means each cache miss costs you ~1000-2000+ instructions. And how do you solve this? By packing the data tight, which means it's vectorized. Then you have the instruction cache (2), which has to do with the use of function calls, data locality and computational density (avoiding bloat and extra branching is implied here too). So again, packing the data tight and packing the computational code tight is the key to performance.
So in conclusion, if your code is performant at all, the data will have to be laid out in vectors, the data will have to be traversed linearly, and the code had better have good computational density, because otherwise the CPU time will be spent on cache misses, branch mispredictions etc. instead of doing real work. So if you can put two and two together, you'll see that this is also the groundwork for using SIMD. And any code that works on vectors >32 bytes (most are much larger) will benefit from using AVX-512 over AVX2.
No, it won't result in more cache misses. You still need to read all of that data to populate the SIMD unit. Whether you get a cache hit or miss depends on what was done with that data beforehand, how often it's been used, etc. You're making a lot of claims here, and a lot of them are flat-out incorrect. I suggest you start citing sources if you're going to play this game. I at least provided something to show that there is a cost to using heavy SIMD instructions. You're just repeating yourself incessantly. Let's just cut to the part where you provide evidence for your claims.

Edit: Maybe a little article from CloudFlare might help show how painful this can be, even in the server setting.
 