Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,323 (7.52/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
"I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on." These were the words of Linux and Git creator Linus Torvalds in a mailing list, expressing his displeasure over "Alder Lake" lacking AVX-512. Torvalds also cautioned against placing too much weightage on floating-point performance benchmarks, particularly those that take advantage of exotic new instruction sets that have a fragmented and varied implementation across product lines.

"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocated that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance.



"Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.

 
Joined
Dec 29, 2010
Messages
3,811 (0.74/day)
Processor AMD 5900x
Motherboard Asus x570 Strix-E
Cooling Hardware Labs
Memory G.Skill 4000c17 2x16gb
Video Card(s) RTX 3090
Storage Sabrent
Display(s) Samsung G9
Case Phanteks 719
Audio Device(s) Fiio K5 Pro
Power Supply EVGA 1000 P2
Mouse Logitech G600
Keyboard Corsair K95
Joined
Dec 16, 2017
Messages
2,950 (1.15/day)
System Name System V
Processor AMD Ryzen 5 3600
Motherboard Asus Prime X570-P
Cooling Cooler Master Hyper 212 // a bunch of 120 mm Xigmatek 1500 RPM fans (2 ins, 3 outs)
Memory 2x8GB Ballistix Sport LT 3200 MHz (BLS8G4D32AESCK.M8FE) (CL16-18-18-36)
Video Card(s) Gigabyte AORUS Radeon RX 580 8 GB
Storage SHFS37A240G / DT01ACA200 / ST10000VN0008 / ST8000VN004 / SA400S37960G / SNV21000G / NM620 2TB
Display(s) LG 22MP55 IPS Display
Case NZXT Source 210
Audio Device(s) Logitech G430 Headset
Power Supply Corsair CX650M
Software Whatever build of Windows 11 is being served in Canary channel at the time.
Benchmark Scores Corona 1.3: 3120620 r/s Cinebench R20: 3355 FireStrike: 12490 TimeSpy: 4624
Kudos to Intel for the massive effort of designing instruction sets just to win benchmarks, though :laugh:

Truth be told, though, I kinda agree with Torvalds? I mean, AVX has a history of generating more heat and of introducing a performance penalty in mixed workloads (triggered either by a single instruction or by exceeding a certain number of them, depending on which specific instruction is used). On top of that, AVX-512 has a multitude of instructions that are not necessarily all available together even if you want them, probably due to Intel's habit of aggressively cutting features for market segmentation.
 
Joined
May 3, 2018
Messages
2,881 (1.18/day)
When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.
 
Joined
Mar 21, 2016
Messages
2,508 (0.78/day)
Would rather see Intel bring a return to 16KB L1 and just label it L0 while retaining the other L caches, sizes, and structure nature.
 
Joined
Aug 20, 2007
Messages
21,579 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
The issue here isn't so much that the instruction set doesn't work, it's that Intel has fragmented it to the point it will never be used.

That's the core of his rant. He also is upset that they don't just make a better FPU.
 
Joined
Mar 21, 2016
Messages
2,508 (0.78/day)
Also he should call out Nvidia for their crap FP64 performance.
Apparently AMD's newest Radeon Pros with that Infinity Fabric bridge are quite the FP64 beasts, if their benchmark diagrams can be trusted and aren't cherry-picked scenarios. That said, from what I hear FP64 isn't that useful for actual gaming tasks, though it's wonderful for compute, and I'm pretty sure there is more money to be had in compute between the two.
 
Joined
Nov 15, 2016
Messages
454 (0.15/day)
System Name Sillicon Nightmares
Processor Intel i7 9700KF 5ghz (5.1ghz 4 core load, no avx offset), 4.7ghz ring, 1.412vcore 1.3vcio 1.264vcsa
Motherboard Asus Z390 Strix F
Cooling DEEPCOOL Gamer Storm CAPTAIN 360
Memory 2x8GB G.Skill Trident Z RGB (B-Die) 3600 14-14-14-28 1t, tRFC 220 tREFI 65535, tFAW 16, 1.545vddq
Video Card(s) ASUS GTX 1060 Strix 6GB XOC, Core: 2202-2240, Vcore: 1.075v, Mem: 9818mhz (Sillicon Lottery Jackpot)
Storage Samsung 840 EVO 1TB SSD, WD Blue 1TB, Seagate 3TB, Samsung 970 Evo Plus 512GB
Display(s) BenQ XL2430 1080p 144HZ + (2) Samsung SyncMaster 913v 1280x1024 75HZ + A Shitty TV For Movies
Case Deepcool Genome ROG Edition
Audio Device(s) Bunta Sniff Speakers From The Tip Edition With Extra Kenwoods
Power Supply Corsair AX860i/Cable Mod Cables
Mouse Logitech G602 Spilled Beer Edition
Keyboard Dell KB4021
Software Windows 10 x64
Benchmark Scores 13543 Firestrike (3dmark.com/fs/22336777) 601 points CPU-Z ST 37.4ns AIDA Memory
When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.
Except GA100 has more FP64 potential than AMD does FP32 potential.
 
Joined
Jun 10, 2014
Messages
3,006 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Mr Torvalds is certainly entitled to his own opinions, but that doesn't make every single one of them gold.
The background for this topic is the addition of Alder Lake support in GCC, which lacked AVX-512. It remains to be seen if this means the core itself lacks the feature, or if parts of it do. I assume that following all this noise Intel will make some sort of statement.

I'm also disappointed with the adoption rate of AVX-512, but that doesn't make it a gimmick. It holds incredible performance potential and increased flexibility over AVX2. But what annoys me much more is Intel's complete lack of support of any AVX in their Pentium/Celeron processors, which is unnecessary fragmentation and holds back mainstream software from embracing modern features.

Also he should call out Nvidia for their crap FP64 performance.
Why do you need FP64 on GPUs? Please elaborate.
 
Joined
Mar 31, 2020
Messages
50 (0.03/day)
It shouldn't be this complicated and problematic.

It used to be the choice of instruction had isolated impact and predictable results. They neither slowed down any other code around it nor impacted code running on other cores. These were almost free to use and a benefit when used correctly.

The problem is that when mixing code and mixing running tasks, AVX512 et al. reduce the clock speed, which impacts the integer code running in the same thread AND ALL OTHER running threads on the same processor. It slows down all integer & non-AVX FP code running in ALL cores. Compilers cannot know at compile time what the potential performance impacts will be for users at runtime. The OS cannot know the potential performance impacts that occur at runtime when scheduling a mixture of threads. Fairness and predictable performance go out the window. The best choice for fairness and predictable performance is to IGNORE occasional use of AVX. It may be nice for computers/servers dedicated to a single task that benefits from these instructions, but the typical general user is hurt more than helped by them. Cloud and VM users are hurt by them. Arbitrary and occasional use of them impacts all running code, so the OS should avoid using them.

It would be OK if the processor could maintain clock speed while using exotic instructions. They would have to be engineered to increase the stages/cycles required to complete the more complex work but maintain clock speed at all costs. I would much rather have more FP units that are simpler, for greater throughput and flexibility. It would be good if you could pipeline the multiply into the add and get the result slightly later than with AVX512, without slowing down the rest of the code. Just because you can use an AVX__ instruction doesn't mean you should.

AVX support across CPUs is a mixture of yes and no. The clock speed impact also varies according to the CPU model and many other variables.
I agree with Linus, it shouldn't be this complicated and problematic.
 

Cheeseball

Not a Potato
Supporter
Joined
Jan 2, 2009
Messages
2,060 (0.35/day)
Location
Pittsburgh, PA
System Name Titan
Processor AMD Ryzen™ 7 7950X3D
Motherboard ASRock X870 Taichi Lite
Cooling Thermalright Phantom Spirit 120 EVO CPU
Memory TEAMGROUP T-Force Delta RGB 2x16GB DDR5-6000 CL30
Video Card(s) ASRock Radeon RX 7900 XTX 24 GB GDDR6 (MBA)
Storage Crucial T500 2TB x 3
Display(s) LG 32GS95UE-B, ASUS ROG Swift OLED (PG27AQDP), LG C4 42" (OLED42C4PUA)
Case Cooler Master QUBE 500 Flatpack Macaron
Audio Device(s) Kanto Audio YU2 and SUB8 Desktop Speakers and Subwoofer, Cloud Alpha Wireless
Power Supply Corsair SF1000
Mouse Logitech Pro Superlight 2 (White), G303 Shroud Edition
Keyboard Keychron K2 HE Wireless / 8BitDo Retro Mechanical Keyboard (N Edition) / NuPhy Air75 v2
VR HMD Meta Quest 3 512GB
Software Windows 11 Pro 64-bit 24H2 Build 26100.2605
except not everything is amenable to matrices

But for AI and machine learning this is advantageous

When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.

Quadros can handle FP64 fine. What's lacking is FP16.
 
Joined
Mar 6, 2017
Messages
3,358 (1.17/day)
Location
North East Ohio, USA
System Name My Ryzen 7 7700X Super Computer
Processor AMD Ryzen 7 7700X
Motherboard Gigabyte B650 Aorus Elite AX
Cooling DeepCool AK620 with Arctic Silver 5
Memory 2x16GB G.Skill Trident Z5 NEO DDR5 EXPO (CL30)
Video Card(s) XFX AMD Radeon RX 7900 GRE
Storage Samsung 980 EVO 1 TB NVMe SSD (System Drive), Samsung 970 EVO 500 GB NVMe SSD (Game Drive)
Display(s) Acer Nitro XV272U (DisplayPort) and Acer Nitro XV270U (DisplayPort)
Case Lian Li LANCOOL II MESH C
Audio Device(s) On-Board Sound / Sony WH-XB910N Bluetooth Headphones
Power Supply MSI A850GF
Mouse Logitech M705
Keyboard Steelseries
Software Windows 11 Pro 64-bit
Benchmark Scores https://valid.x86.fr/liwjs3
I'd have to agree with @tygrus here: running AVX code without good cooling (as in the majority of OEM pre-builds) is going to result in lower clock speeds due to Intel's own AVX offset. I know that some of us have tweaked our motherboard UEFIs to force the processor to run at the same speed even while running AVX code by setting the AVX offset to 0, but that's not possible on stripped-down OEM UEFIs. And even then, for those of us who have removed the limitation (because we can), you better have a damn good cooler.
 
Joined
Jun 10, 2014
Messages
3,006 (0.78/day)
The problem is that when mixing code and mixing running tasks, AVX512 et al. reduce the clock speed, which impacts the integer code running in the same thread AND ALL OTHER running threads on the same processor. It slows down all integer & non-AVX FP code running in ALL cores.
No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.

Compilers cannot know at compile time what the potential performance impacts will be for users at runtime. The OS cannot know the potential performance impacts that occur at runtime when scheduling a mixture of threads. Fairness and predictable performance go out the window. The best choice for fairness and predictable performance is to IGNORE occasional use of AVX.
I'm going to give you a chance to rephrase that, since it makes no sense.
AVX code is if anything much more predictable, since the throughput is more consistent, cache lines are more effectively used and there is less branching.

It would be OK if the processor could maintain clock speed while using exotic instructions. They would have to be engineered to increase the stages/cycles required to complete the more complex work but maintain clock speed at all costs. I would much rather have more FP units that are simpler, for greater throughput and flexibility. It would be good if you could pipeline the multiply into the add and get the result slightly later than with AVX512, without slowing down the rest of the code. Just because you can use an AVX__ instruction doesn't mean you should.
Firstly, single FP operations, SSE and AVX are all fed into the same vector units; the only difference is how filled the vector registers are. Intel has two full FMA sets of AVX-512. To compete with that in FP32 throughput using single FPUs, you would need 32 of them, and you would also need the circuitry to handle these writing back to the same cache lines without adding pipeline steps. Then the instruction stream would be at least 16x larger, meaning you would have to increase the instruction cache >10x and probably L2 a bit as well, then the instruction window would have to increase ~10x, and the prefetcher, branch predictor etc. would need to work much more efficiently. And even if you manage all this, you better pray that the compiler has unrolled all loops aggressively, because otherwise there is no way you are going to feed your 32 hungry FPUs. :rolleyes:
If you have a rough understanding of how CPUs work, you have probably understood by now that your suggestion was short-sighted.
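
The lane arithmetic behind that comparison can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function names are made up for illustration, and it counts lanes and instructions without saying anything about how a real pipeline behaves.

```python
import math

def fp_lanes(vector_bits, element_bits, units):
    """Elements processed per cycle by `units` vector units of the given width."""
    return (vector_bits // element_bits) * units

def instructions_needed(n_elements, vector_bits, element_bits):
    """Instructions required to touch n_elements once at a given vector width."""
    lanes = vector_bits // element_bits
    return math.ceil(n_elements / lanes)

# Two 512-bit FMA units working on FP32 values.
scalar_fpus_to_match = fp_lanes(512, 32, units=2)

# Instruction count for 1024 FP32 elements: scalar vs. 512-bit vectors.
scalar_instrs = instructions_needed(1024, 32, 32)
vector_instrs = instructions_needed(1024, 512, 32)

print(scalar_fpus_to_match)            # 32
print(scalar_instrs // vector_instrs)  # 16
```

That is where the "32 FPUs" and "16x" figures come from: 16 FP32 lanes per 512-bit register, times two units.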
 

TurboFEM

New Member
Joined
Nov 28, 2018
Messages
4 (0.00/day)
I see a lot of questions on why does one even need FP performance.

Probably many things, but one I know of quite well is - engineering simulations.

Thousands and thousands of engineers are relying on Xeons every day to run their finite element- and finite difference type analyses (mechanical FE, CFD, electromagnetics etc.).
For FE, specifically, you spec a machine like this -> As many AVX2/512 cores as you can get away with and nCores * ~8GB ECC RAM. Turn off hyperthreading and go have fun.

It's a big market for Intel, and increasingly nVidia (new codes start to introduce GPU FP64 slowly, but typically require CUDA, so no luck for AMD).
 
Joined
Jun 10, 2014
Messages
3,006 (0.78/day)
@btarunr
As I alluded to in #10, there would probably be some kind of response.
Videocardz (if we can trust them), have some clarifications: link
So it may appear that the big cores offer more ISA features.
 
Joined
Feb 20, 2020
Messages
9,340 (5.25/day)
Location
Louisiana
System Name Ghetto Rigs z490|x99|Acer 17 Nitro 7840hs/ 5600c40-2x16/ 4060/ 1tb acer stock m.2/ 4tb sn850x
Processor 10900k w/Optimus Foundation | 5930k w/Black Noctua D15
Motherboard z490 Maximus XII Apex | x99 Sabertooth
Cooling oCool D5 res-combo/280 GTX/ Optimus Foundation/ gpu water block | Blk D15
Memory Trident-Z Royal 4000c16 2x16gb | Trident-Z 3200c14 4x8gb
Video Card(s) Titan Xp-water | evga 980ti gaming-w/ air
Storage 970evo+500gb & sn850x 4tb | 860 pro 256gb | Acer m.2 1tb/ sn850x 4tb| Many2.5" sata's ssd 3.5hdd's
Display(s) 1-AOC G2460PG 24"G-Sync 144Hz/ 2nd 1-ASUS VG248QE 24"/ 3rd LG 43" series
Case D450 | Cherry Entertainment center on Test bench
Audio Device(s) Built in Realtek x2 with 2-Insignia 2.0 sound bars & 1-LG sound bar
Power Supply EVGA 1000P2 with APC AX1500 | 850P2 with CyberPower-GX1325U
Mouse Redragon 901 Perdition x3
Keyboard G710+x3
Software Win-7 pro x3 and win-10 & 11pro x3
Benchmark Scores Are in the benchmark section
Hi,
Can't say I've ever run into avx-512 so far ?
Set it to 5 and clocks have never dropped that far.
 
Joined
Mar 18, 2015
Messages
2,963 (0.83/day)
Location
Long Island
Ya mean... just like "more cores"? While more cores can be useful, having more than you actually need for your applications doesn't do anything for you.
 
Joined
Dec 16, 2017
Messages
2,950 (1.15/day)
Hi,
Can't say I've ever run into avx-512 so far ?
Set it to 5 and clocks have never dropped that far.

It's a relatively recent instruction set. It's rare to see it in use outside of scientific applications or others that get a real benefit out of using it.

Besides, AVX-512 is found only in high-end desktop processors (Core i7 or i9) or Xeons, and for whatever reason, on some specific mobile chips.

On top of that, while there is a subset that is sort of available on every Intel CPU that "supports" AVX-512, there are some instructions that are only found on specific CPUs. Tiger Lake has not even launched yet, if I remember correctly.
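
To make the fragmentation point concrete, here is a minimal sketch of the kind of check software ends up doing before it can use a given mix of AVX-512 subsets. The flag names follow the spelling Linux uses in /proc/cpuinfo, but the two flag lists are hypothetical examples, not dumps from real CPUs, and `missing_features` is a made-up helper.

```python
def missing_features(cpu_flags, required):
    """Return the required ISA extensions absent from a space-separated flags string."""
    return set(required) - set(cpu_flags.split())

# Hypothetical flag lists, for illustration only (not taken from real CPUs).
cpu_a = "sse4_2 avx avx2 fma avx512f avx512cd avx512bw avx512dq avx512vl"
cpu_b = "sse4_2 avx avx2 fma avx512f avx512cd"

# Subsets a hypothetical code path would need.
needed = {"avx512f", "avx512bw", "avx512vl"}

print(sorted(missing_features(cpu_a, needed)))  # []
print(sorted(missing_features(cpu_b, needed)))  # ['avx512bw', 'avx512vl']
```

Both example CPUs would "support AVX-512" in the marketing sense, yet only one of them could run that code path.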

No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.

AVX impact is relative, apparently, according to this

TLDR, it seems to affect only Turbo frequencies, in the first place, and how much it will downclock will depend on the type and number of instructions executed. AVX512 does trigger this throttling a bit more, while AVX and AVX2 do it less or don't even do so at all.
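
The trade-off being argued over can be put into a rough model. The sketch below assumes the pessimistic case where a thread's whole runtime is spent at the reduced clock once it mixes in wide vector code; the 2x vector speedup and the 85% clock ratio are made-up numbers, not measurements, and real behavior depends on the CPU model and instruction mix.

```python
def relative_runtime(vector_fraction, vector_speedup, clock_ratio):
    """Runtime relative to an all-scalar run at full clock (1.0 = no change)."""
    work = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    return work / clock_ratio

def break_even_fraction(vector_speedup, clock_ratio):
    """Fraction of work that must be vectorized before the clock drop pays off."""
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_speedup)

# Hypothetical numbers: vector code runs 2x faster, clock drops to 85%.
print(round(relative_runtime(0.10, 2.0, 0.85), 3))  # 1.118 (slower overall)
print(round(relative_runtime(0.60, 2.0, 0.85), 3))  # 0.824 (faster overall)
print(round(break_even_fraction(2.0, 0.85), 3))     # 0.3
```

Under those assumptions, occasional vector use (10% of the work) is a net loss, while heavily vectorized code still comes out ahead.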
 
Joined
Feb 20, 2020
Messages
9,340 (5.25/day)
Location
Louisiana
Hi,
Yep my prior x299/ 7900x had it and so does my current 9940x
z490/ 10900k does not nor does x99/ 5930k.
 
Joined
Jun 10, 2014
Messages
3,006 (0.78/day)
Besides, AVX-512 is found only in high-end desktop processors (Core i7 or i9) or Xeons, and for whatever reason, on some specific mobile chips.
If Ice Lake-S/-H hadn't been cancelled, the whole lineup* would have offered AVX-512 already, so this strangeness is not intentional segmentation.
Once client applications start to utilize it, it will offer significant performance and efficiency gains, even for low-power laptops.

*) Except Atom, Pentium and Celeron of course.
 
Joined
Sep 26, 2018
Messages
47 (0.02/day)
AMD's Ryzen processors lagged significantly behind Intel's chips in earlier generations, when they only supported 128-bit SIMD while Intel already had 256-bit AVX. So I've been expecting that once 10nm finally gives Intel the thermals to put AVX-512 on the desktop, Intel will take the lead back from AMD (although, as with the earlier generations of Ryzen, AMD won't be that far behind). So I was pretty shocked to hear Linus' comments!
After all, faster integer performance will... let your computer send E-mail faster? Gaming uses floating-point too, so improving the power of chips for HPC applications will make them more powerful for everyone.
But maybe Linus Torvalds is at least partly right. Maybe it's time to split the processor line-up, to offer a choice between chips that have high floating-point performance, and other chips that tilt more towards integer performance, so that one can buy a processor appropriate to one's workload.
 
Joined
Apr 24, 2020
Messages
2,741 (1.60/day)
That's the core of his rant. He also is upset that they don't just make a better FPU.

Intel Skylake (non-X) 256-bit AVX already supports 3x 2x 256-bit multiply-and-adds, 2x 256-bit loads from L1 cache and 1x 256-bit store to L1 cache... per clock tick with like 5-cycle latency.

Outside of going to 512-bits, how exactly do you expect Intel to improve upon that? AVX512 simply changes that to 3x 2x 512-bit multiply-and-adds, 2x 512-bit loads and 1x 512-bit stores. It's the most obvious way to improve the SIMD / FPU unit.

EDIT: Apparently 2x multiply-and-adds are supported per clock on Skylake, according to https://software.intel.com/sites/la...e/#expand=3508,3922,2581&techs=FMA&text=fmadd. Still, that's 16 double-precision flops per cycle. Hard to imagine how to make this 2x better aside from the "obvious" extension to 512 bits.
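
For reference, the 16 flops figure corresponds to double precision. A quick sketch of the arithmetic, assuming two FMA units per core as stated above and counting each fused multiply-add as two floating-point operations (the function name is just for illustration):

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_units):
    """Peak flops per cycle per core; one FMA counts as 2 flops per lane."""
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units

print(peak_flops_per_cycle(256, 64, 2))  # 16  (256-bit, FP64)
print(peak_flops_per_cycle(256, 32, 2))  # 32  (256-bit, FP32)
print(peak_flops_per_cycle(512, 64, 2))  # 32  (512-bit, FP64)
```

Doubling the vector width doubles the peak without adding a single execution port, which is the "obvious" route.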

------

SIMD FPU-multiply is higher performance than 64-bit integer-multiply, lol. (to be fair: SIMD FPU-multiply is easier at only 53-bits (Double precision), but still...)

except not everything is amenable to matrices

But virtually everything has a "memset(blah, 0, ...)" somewhere. And this memset code is almost always compiled into SIMD in my experience (be it 128-bit SSE, 256-bit AVX, or 512-bit AVX512 code)

GCC and Clang have surprisingly good auto-vectorizers that can change many simple for-loops into SIMD accelerated versions. AVX512 has literally double the performance with memset, memcmp, memcpy, strcmp, strcpy, etc. etc compared to 256-bit AVX2. (Note: AVX does NOT support integer operations. You need AVX2 in your compile flags, as well as an AVX2 CPU).

The 512-bit thick data-path extends all the way to L2 cache... meaning memcmp / memcpy / etc. etc. bonus applies to a huge amount of C code automatically.
 
Joined
Jun 10, 2014
Messages
3,006 (0.78/day)
Outside of going to 512-bits, how exactly do you expect Intel to improve upon that? AVX512 simply changes that to 3x 2x 512-bit multiply-and-adds, 2x 512-bit loads and 1x 512-bit stores. It's the most obvious way to improve the SIMD / FPU unit.
There is also the option of adding more execution ports and vector units.
This does however require the front-end to be able to decode and issue micro-ops faster, a larger instruction window, etc., and even then it runs the risk of underutilization. I do expect that we will eventually move to 3 or even 4 FMA sets in desktop CPUs, but the architectures will need to evolve a lot to facilitate that.

One interesting bit is the rumor about Zen 3 offering 50% higher FPU performance. If true, I do wonder if they added more units, or if they improved them somehow.

GCC and Clang have surprisingly good auto-vectorizers that can change many simple for-loops into SIMD accelerated versions…
They do, and software can get a good portion of free performance simply by enabling these instructions.
But the huge performance gains still require tailored code using intrinsics, which is unfortunately a bit too difficult for most programmers. I do hope we get to a point where the compilers are able to convert slightly more complex calculations into pretty optimal AVX, provided your code is cache optimized etc.

One of the interesting things about AVX is the vast feature set, which extends far beyond just arithmetic. It also supports things like comparisons with masks, which essentially enable you to do conditionals without branching logic, and the feature set of AVX-512 is almost like a new instruction set. The potential here is huge, but it's still "inaccessible" to most programmers. If we get to a point where clean C code can be compiled into decent AVX instructions, even with more complex calculations and some basic conditionals, that would be huge for the adoption of AVX.

The 512-bit thick data-path extends all the way to L2 cache... meaning memcmp / memcpy / etc. etc. bonus applies to a huge amount of C code automatically.
One thing that comes to mind is the 512-bit vector size fits very well with the cache line size.
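
To spell that out, assuming the 64-byte cache line that mainstream x86 CPUs use, a 512-bit register is exactly one line wide. A small sketch (the helper name is just for illustration):

```python
CACHE_LINE_BYTES = 64  # assumed: typical cache line size on x86 CPUs

def vector_ops_per_cache_line(vector_bits):
    """How many full-width loads or stores it takes to cover one cache line."""
    return CACHE_LINE_BYTES // (vector_bits // 8)

for name, bits in (("SSE", 128), ("AVX2", 256), ("AVX-512", 512)):
    print(name, vector_ops_per_cache_line(bits))
# SSE 4
# AVX2 2
# AVX-512 1
```

So an aligned 512-bit load or store moves a whole cache line in one instruction, which fits the memcpy / memset argument quoted above.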
 