• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.

AVX vs non-AVX

Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
In recent light of the Vulkan vs OpenGL comparison on why supposedly developers don't want to use Vulkan which was posted some time ago I thought it would make sense to show something similar on the CPU side of things :).

Scalar, non-AVX :

Code:
void interpolate(vector<vector<int>>& mat)
{
    for(int i=2; i<mat.size()-1; i=i+2)
        for(int j=0; j<mat[0].size(); j++)
        {
            mat[i][j] = mat[i-1][j] + 0.5f * (mat[i+1][j] - mat[i-1][j]);
        }
}

AVX :

Code:
void interpolate_avx(vector<vector<int>>& mat)
{
    for(int i=2; i<mat.size()-1; i=i+2)
        for(int j=0; j<mat[0].size(); j=j+8)
        {
            _mm256_storeu_si256((__m256i *)&mat[i][j], _mm256_cvtps_epi32(_mm256_add_ps(_mm256_mul_ps(_mm256_sub_ps(_mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i+1][j])), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j]))), _mm256_set1_ps(0.5f)), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j])))));
        }
}

That's why game developers stay away from optimizing games using AVX I guess :). In all seriousness, there is always a price to pay for better performance, the code above runs about 6.5 times faster even though it looks unintelligible. Same thing with Vulkan vs OpenGL vs DirectX or whatever, it makes no sense to compare things like this.
 
Joined
Apr 8, 2010
Messages
1,010 (0.19/day)
Processor Intel Core i5 8400
Motherboard Gigabyte Z370N-Wifi
Cooling Silverstone AR05
Memory Micron Crucial 16GB DDR4-2400
Video Card(s) Gigabyte GTX1080 G1 Gaming 8G
Storage Micron Crucial MX300 275GB
Display(s) Dell U2415
Case Silverstone RVZ02B
Power Supply Silverstone SSR-SX550
Keyboard Ducky One Red Switch
Software Windows 10 Pro 1909
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free
 
Joined
Oct 2, 2015
Messages
3,144 (0.94/day)
Location
Argentina
System Name Ciel / Akane
Processor AMD Ryzen R5 5600X / Intel Core i3 12100F
Motherboard Asus Tuf Gaming B550 Plus / Biostar H610MHP
Cooling ID-Cooling 224-XT Basic / Stock
Memory 2x 16GB Kingston Fury 3600MHz / 2x 8GB Patriot 3200MHz
Video Card(s) Gainward Ghost RTX 3060 Ti / Dell GTX 1660 SUPER
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB / NVMe WD Blue SN550 512GB
Display(s) AOC Q27G3XMN / Samsung S22F350
Case Cougar MX410 Mesh-G / Generic
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W / Gigabyte P450B
Mouse EVGA X15 / Logitech G203
Keyboard VSG Alnilam / Dell
Software Windows 11
Shouldn't the compiler handle that on its own?
 

Solaris17

Super Dainty Moderator
Staff member
Joined
Aug 16, 2005
Messages
26,967 (3.83/day)
Location
Alabama
System Name RogueOne
Processor Xeon W9-3495x
Motherboard ASUS w790E Sage SE
Cooling SilverStone XE360-4677
Memory 128gb Gskill Zeta R5 DDR5 RDIMMs
Video Card(s) MSI SUPRIM Liquid X 4090
Storage 1x 2TB WD SN850X | 2x 8TB GAMMIX S70
Display(s) 49" Philips Evnia OLED (49M2C8900)
Case Thermaltake Core P3 Pro Snow
Audio Device(s) Moondrop S8's on schitt Gunnr
Power Supply Seasonic Prime TX-1600
Mouse Lamzu Atlantis mini (White)
Keyboard Monsgeek M3 Lavender, Moondrop Luna lights
VR HMD Quest 3
Software Windows 11 Pro Workstation
Benchmark Scores I dont have time for that.
Joined
Feb 18, 2020
Messages
33 (0.02/day)
Location
Russian Federation
Processor Ryzen 7 2700 (3900MHz@1.275V, Vsoc 1.025V)
Motherboard ASUS ROG Crosshair VIII Impact (BIOS 1302, Vsb 1.04V, SenseMI Skew 273)
Cooling Noctua NH-U12A (full custom loop building in progress)
Memory 2*8GB G.Skill Trident Z B&W (3600 14-14-14-14-28-42 1T 1.45V, GDM off, ProcODT 53, CAD 24-24-24-24)
Video Card(s) Radeon RX Vega64 Frontier Edition Air 16G (Core 1667MHz@1.07625V, Mem 1330@0.9375V)
Storage SSD: 120GiB + 240GiB + 480GiB. HDD: 1TiB + 2TiB. Backup: Ultrium LTO-6 6.25TiB.
Display(s) AOC C27G1
Case Fractal Design Define Nano S
Audio Device(s) Audio-technica ATH-AD900X
Power Supply BeQuiet! SFX-L Power 600W
Mouse A4Tech XL-747H
Keyboard Zalman ZM-K500
Software Gentoo-latest amd64-hardened, Windows 10 Pro WS x64.
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free
Nope. It is all on developer's head.
This is why devs reluctantly use AVX or FMA instructions.
This instructions significantly increase performance, but requires more time to debug and optimize code.

Source: four yrs of developing code for board computer of missile engines (РД type)
 
Joined
Nov 13, 2007
Messages
10,770 (1.73/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6000 CL30-36-36-76
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
The compiler and IDE can only babysit you so much.

Challenge accepted. AI IDE/compilers incoming. :p

Seriously though that looks awful lol.
 
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Shouldn't the compiler handle that on its own?

Theoretically, practically GCC didn't vectorize this using AVX. Probably because of the implicit conversion from integer to float so it decided it shouldn't touch it.


No YMM registers in sight for the non-AVX function which means no AVX instructions.

Seriously though that looks awful lol.

It does but it doesn't necessarily need to, there are libraries that do the same thing under much more legible code. The point was it makes no sense to compare things like that, one does something, the other something else.
 
Last edited:
Joined
Apr 8, 2010
Messages
1,010 (0.19/day)
Processor Intel Core i5 8400
Motherboard Gigabyte Z370N-Wifi
Cooling Silverstone AR05
Memory Micron Crucial 16GB DDR4-2400
Video Card(s) Gigabyte GTX1080 G1 Gaming 8G
Storage Micron Crucial MX300 275GB
Display(s) Dell U2415
Case Silverstone RVZ02B
Power Supply Silverstone SSR-SX550
Keyboard Ducky One Red Switch
Software Windows 10 Pro 1909
Nope. It is all on developer's head.
This is why devs reluctantly use AVX or FMA instructions.
This instructions significantly increase performance, but requires more time to debug and optimize code.

Source: four yrs of developing code for board computer of missile engines (РД type)
That's exactly what I mean lol, maybe I should replace "more work" with "more time" to make it clearer?
 
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Here's something funny : https://godbolt.org/z/tKVQXA

With those flags it decided it can use AVX 512 of all things, so probably it figured it wasn't fast enough under AVX2 ? You can't rely on the compiler to do this sort of thing more often than not and keep in mind this computation is extremely simple. Things aren't usually like that.
 
Joined
Aug 18, 2017
Messages
344 (0.13/day)
In previous godbolt link (#7) you didn't set any optimization flags, only instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see avx+ instructions. O3 generates insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases It isn't needed to see if certain optimizations happen or not.

This code may look obvious and simple, but when it comes to code generation there are many things to be taken into consideration. If something cannot be proven, extra code has to be generated to cover such case.
Can vectors used in inner loop alias?
Do vectors contain uneven numbers of elements, so that few in the end have to be treated one by one?
Are there any language rules that make certain optimizations not possible?

All of these require certain dialogue with compiler to make sure that it can both generate optimal code, and do not generate too much unnecessary code. It takes time and effort.
 
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
In previous godbolt link (#7) you didn't set any optimization flags, only instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see avx+ instructions. O3 generates insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases It isn't needed to see if certain optimizations happen or not.

But that's the thing compilers are inconsistent and unpredictable with these things. With -march=native it generated AVX512 code, it's impractical that you'd always need to look up the assembly to see if the right instructions are used.

Edit :

I can't actually get it to generate AVX with -O2 -ftree-vectorize https://godbolt.org/z/XMNrNC
 
Last edited:
Joined
Aug 18, 2017
Messages
344 (0.13/day)
It's because you didn't specify an instruction set. Either set it explictly (-mavx...), or set the target architecture that supports it with -march=skylake (or any other).
By the way - don't use -march=native on compiler explorer, as it uses the architecture the compiler runs on. Today it may run on avx512 machine, tomorrow it may not ;)

Major compilers provide additional flags for diagnostics when vectorization happens, and if it doesn't - provide a reason why.
 
Joined
Aug 20, 2007
Messages
21,469 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Damn straight: That's the compilers job.



But I like high level languages. Go home, code ricer.

Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.
 
Joined
Oct 2, 2015
Messages
3,144 (0.94/day)
Location
Argentina
System Name Ciel / Akane
Processor AMD Ryzen R5 5600X / Intel Core i3 12100F
Motherboard Asus Tuf Gaming B550 Plus / Biostar H610MHP
Cooling ID-Cooling 224-XT Basic / Stock
Memory 2x 16GB Kingston Fury 3600MHz / 2x 8GB Patriot 3200MHz
Video Card(s) Gainward Ghost RTX 3060 Ti / Dell GTX 1660 SUPER
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB / NVMe WD Blue SN550 512GB
Display(s) AOC Q27G3XMN / Samsung S22F350
Case Cougar MX410 Mesh-G / Generic
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W / Gigabyte P450B
Mouse EVGA X15 / Logitech G203
Keyboard VSG Alnilam / Dell
Software Windows 11
Nothing some good old machine language can't solve. Bonus points for using punching cards.
 
Joined
Aug 20, 2007
Messages
21,469 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.

If it can figure out the Linux kernel inlining optimizations it can figure out most anything. I daresay it'd be more challenging to write convoluted code it couldn't figure out than to just write normal code.

And that's GCC, the shittiest of the bundle honestly.

My opinion, but of course, it's right because I have no idea what I'm talking about and in this day and age that stops nobody (this is me razzing myself, I do indeed have an idea but not much of one).

I primarily write Java and C# and only dabble in low level, remember. These languages take care of everything, heck even garbage collection. You can be as messy as you want. It's good for lazy people like me.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
This is why devs reluctantly use AVX or FMA instructions.

This instructions significantly increase performance, but requires more time to debug and optimize code.
AVX is certainly harder to use, and the syntax of the intrinsics is nasty, and quickly can get out of hand for complex stuff, like in the example in post #1. I would have to really study that line to understand it, compared to the plain C++ one.

This might seem like an "impossible" task, but for most applications only a small part of the code is really performance critical, and often only a few lines of AVX intrinsics is needed to give a huge performance boost to the entire application.

Before even considering SIMD (like AVX), a few general optimizations are needed. Firstly, the data should be dense (and this may require restructuring in the non-AVX code too), and the data should be traversed linearly, this is a part of "cache optimizing". Secondly, bloat such as branching, function calls etc. inside the critical path (especially loops) should be reduced to a minimum. These are optimizations which will help a lot even before you use AVX, it also helps with what the compiler can do for you, but it is largely incompatible with the coding paradigms used in most applications today, which is the primary reason why we don't see low-level optimizations in many applications and especially games.

There are also some pitfalls with using some AVX features like FMA; while FMA is super fast, it will give higher precision than not using FMA (since it rounds after the calculation), which may be a problem for some applications, especially if you have two versions of the software.

And since this thread is angled towards games, there are a few other things to consider as well, and these are some of the reasons why AVX is rarely used in games;
- Most games use off-the-shelf engines and the studio do no engine code themselves.
- Most games are rushed "stitched together" code bases which are "discarded" after release. No one usually cares about code quality or maintenance, it's usually just "good enough". This is the reason why most shipping games today are "broken".
- Hardware support; AVX (SB or Bulldozer and newer), but not Celeron, Pentium, Atom and various low-end CPUs (correct me if I'm wrong here)
- In most games, the rendering thread(s) are usually not bottlenecked by math in the CPU code, or the math on the CPU side is not dense enough.
Don't get me wrong, I certainly wish AVX were used extensively, squeezing out every drop of performance we can get. But games are probably one of the use cases where the benefits are the least measurable. One good argument for AVX is energy efficiency, even if it doesn't necessarily help your FPS.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.

You might be surprised.


The code you posted in the first post is clearly recognized by GCC's auto-vectorizer, and cleanly compiles into AVX2 instructions, including vaddps. Auto-vectorizers are getting surprisingly good these days, not as good as hand-crafted assembly, but good enough to learn how to use the "-mavx2 -O3" compiler flags at least.

EDIT: Wrong link the first time. It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that, there's probably something math-unsafe that in the vectorized version of the code that needs the higher-level compiler flag to ram through.
 
Last edited:
Joined
Aug 20, 2007
Messages
21,469 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*

The REAL benefit to SIMD-programming requires a paradigm shift. There are specially designed programming languages, called CUDA or OpenCL (or for CPU-programmers, ISPC), which are better high-level representations that more easily compiles into highly optimized SIMD / SSE code.

The sad truth is, C / C++ are awkward to compile into highly optimized SIMD. However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD. Intel clearly got the message, and AVX512 includes a ton of stuff that's useful to the compiler instead of the programmer, kind of putting AVX512 on similar footing as NVidia PTX or AMD GCN.

---------

With that being said, a ton of effort has been put into autovectorizing standard C / C++ code. And its going to get easier, with OpenMP extensions (which are similar to the CUDA-extensions) providing a language-extension to C/C++ to more easily play with SIMD-compilers. This is relatively state-of-the-art stuff, but feel free to play with OpenMP's "#pragma omp for simd" statement in GCC or Clang if you don't believe me. Its surprisingly easy... well... easy if you know the paradigm of OpenMP to start with.
 
Last edited:
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that, there's probably something math-unsafe that in the vectorized version of the code that needs the higher-level compiler flag to ram through.

Probably the integer-float conversion. This is a basic example though, few things are this explicit and easy to auto vectorize. Compilers also do a performance-cost analysis which might also be wrong and lead to unoptimized code. For instance they'll hardly ever touch memory I/O instructions unless they absolutely have to and by that I mean they'll never really try to optimized data movement as they can't know if they really are faster or not or safe.

However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD.

CUDA isn't really SIMD based as the vectorization isn't exposed anywhere at the software level, there is no exact concept of width or executing everything in lock-step as is the case in SIMD, the warps are just logical groups as far as the software side of things is concerned. In fact the hardware isn't organized in a SIMD fashion either, it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.
 
Last edited:
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
CUDA isn't really SIMD based as the vectorization isn't exposed anywhere at the software level, there is no exact concept of width or executing everything in lock-step as is the case in SIMD, the warps are just logical groups as far as the software side of things is concerned. In fact the hardware isn't organized in a SIMD fashion either, it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.

PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

If you don't believe me, then focus on the CPU-version of high-level SIMD. ISPC applies the lessons of CUDA except it outputs x86 AVX code instead. All the concepts of ISPC and CUDA remain compatible at a high level.

SIMD is almost identical to per-thread parallelism. That's the lesson of ISPC and CUDA. The compiler can convert per-thread parallelism into SIMD very, very easily. And OpenMP 4.5 proves that the approach works in standard C/C++ code.
 
Last edited:

bug

Joined
May 22, 2015
Messages
13,779 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free
In the world of programming, few things are worse than making the code harder to read. It is never only about the cost of writing that code, but also maintaining it. The cost snowballs.
 
Joined
Jan 8, 2017
Messages
9,436 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

PTX is not assembly, there is no physical 32 wide vector register/instruction in the hardware, it's a software abstraction, GPUs aren't SIMD machines. They used to be but not anymore, this is basically a legacy abstraction, it wouldn't surprise me if in the near feature manufactures are going to drop the concept of having a wavefront/warp entirely. It used to be that with OpenCL you had to use actual vector formats because the hardware was indeed configured like that.

SIMD is almost identical to per-thread parallelism.

It just isn't, while you can emulate SIMD architectures with SIMT with practically no performance penalty, the other way around can't be done efficiently because there are things that you simply can't do with SIMD. ISPC is basically a wrapper for SIMD, there is nothing special about it, both CUDA and ISPC are ways to express parallelism while writing essentially scalar code.

Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.

OK, let's clarify this : if there isn't a single register where one decoded instruction operates in lock-step on it until it's done then it's not SIMD.

It looks like it's you who doesn't understand what happens at the lowest level, basically SIMD means 1 instruction stream (aka thread), SIMT means multiple, that's the whole idea behind why GPUs are so fast. With SIMD when the I/O is blocked, nothing happens until the instruction finishes executing because it's part of just one thread.

With SIMT when one or more threads are blocked the SM can pause all threads within that warp and can switch to a different instruction where the data is available within a generic register space. That's right, there is a register space, not just 1 fixed register which is part of a set as it would be the case with SIMD. Moreover a SIMD EU would never proceed unless all data elements are loaded from memory, meanwhile a SIMT EU can, with limited performance, continue execution on a subset of the total threads within the warp (that's how GPUs do branching).

The fundamental difference is : SIMD depends exclusively on the data, SIMT does not.
 
Last edited:
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
PTX is not assembly

Fine. I'll talk about SASS Assembly then. SASS assembly matches 1-to-1 with the actual machine code that NVidia GPUs execute, but it changes from generation to generation. SASS is undocumented because PTX is close enough to machine code that most people don't care. But if you care, I'm more than willing to talk about SASS instead. https://arxiv.org/pdf/1804.06826.pdf

there is no physical 32 wide vector register/instruction in the hardware

This is blatantly false. All NVidia Volta and Turing machines have a 32-wide warp, even at the SASS assembly level. The SASS assembly has warp-specific barriers, read/write ordering, and other very low-level details revealed. I suggest you study Chapter 2 of the paper carefully, to see how these 32-wide warps actually execute at the SASS / Machine code level for Volta.

this is basically a legacy abstraction, it wouldn't surprise me if in the near feature manufactures are going to drop the concept of having a wavefront/warp entirely.

You are confused about independent thread scheduling. While a program counter is now tracked on a per-thread basis, execution continues to happen on a 32-wide warp basis. Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.


Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

Its a bit of a difficult read. But let me emphasize this point:

A schedule optimizer determines how to group active threads from the same warp together into SIMT units.

The SIMT units are still executing in warps, even on Volta / Turing with Independent Thread Scheduling turned on. You are misunderstanding what is happening at the lowest level.

----------

Don't be swayed by NVidia's "SIMT" marketing. SIMT is damn near identical to late 80s SIMD code, including StarLisp or StarC. NVidia certainly is pushing the envelope with some of their designs, but a huge chunk of their literature is just marketing hype. I do think NVidia is the most advanced SIMD chips available today, but AMD GCN and Intel AVX512 are surprisingly close to its capabilities.
 
Last edited:
Top