# AVX vs non-AVX



## Vya Domus (Apr 9, 2020)

In recent light of the Vulkan vs OpenGL comparison on why supposedly developers don't want to use Vulkan which was posted some time ago I thought it would make sense to show something similar on the CPU side of things .

Scalar, non-AVX :


```
void interpolate(vector<vector<int>>& mat)
{
    for(int i=2; i<mat.size()-1; i=i+2)
        for(int j=0; j<mat[0].size(); j++)
        {
            mat[i][j] = mat[i-1][j] + 0.5f * (mat[i+1][j] - mat[i-1][j]);
        }
}
```

AVX :


```
void interpolate_avx(vector<vector<int>>& mat)
{
    for(int i=2; i<mat.size()-1; i=i+2)
        for(int j=0; j<mat[0].size(); j=j+8)
        {
            _mm256_storeu_si256((__m256i *)&mat[i][j], _mm256_cvtps_epi32(_mm256_add_ps(_mm256_mul_ps(_mm256_sub_ps(_mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i+1][j])), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j]))), _mm256_set1_ps(0.5f)), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j])))));
        }
}
```

That's why game developers stay away from optimizing games using AVX I guess . In all seriousness, there is always a price to pay for better performance, the code above runs about 6.5 times faster even though it looks unintelligible. Same thing with Vulkan vs OpenGL vs DirectX or whatever, it makes no sense to compare things like this.


----------



## Flanker (Apr 9, 2020)

The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free


----------



## GoldenX (Apr 9, 2020)

Shouldn't the compiler handle that on its own?


----------



## Solaris17 (Apr 9, 2020)

GoldenX said:


> Shouldn't the compiler handle that on its own?



The compiler and IDE can only babysit you so much.


----------



## R2DSF (Apr 9, 2020)

Flanker said:


> The question comes down to:
> Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free


Nope. It is all on developer's head. 
This is why devs reluctantly use AVX or FMA instructions. 
This instructions significantly increase performance, but requires more time to debug and optimize code. 

Source: four yrs of developing code for board computer of missile engines (РД type)


----------



## phanbuey (Apr 9, 2020)

Solaris17 said:


> The compiler and IDE can only babysit you so much.



Challenge accepted.  AI IDE/compilers incoming. 

Seriously though that looks awful lol.


----------



## Vya Domus (Apr 9, 2020)

GoldenX said:


> Shouldn't the compiler handle that on its own?



Theoretically, practically GCC didn't vectorize this using AVX. Probably because of the implicit conversion from integer to float so it decided it shouldn't touch it.









						Compiler Explorer - C++ (x86-64 gcc 7.1)
					

using namespace std;  using namespace std::chrono;    void interpolate(vector<vector<int>>& mat, vector<vector<int>>& aux) {      for(int i=2; i<mat.size()-1; i=i+2)         for(int j=0; j<mat[0].size(); j++)         {             mat[i][j] =mat[i-1][j] + 0.5f * (mat[i+1][j] - mat[i-1][j])...




					godbolt.org
				




No YMM registers in sight for the non-AVX function which means no AVX instructions.



phanbuey said:


> Seriously though that looks awful lol.



It does but it doesn't necessarily need to, there are libraries that do the same thing under much more legible code. The point was it makes no sense to compare things like that, one does something, the other something else.


----------



## Flanker (Apr 9, 2020)

R2DSF said:


> Nope. It is all on developer's head.
> This is why devs reluctantly use AVX or FMA instructions.
> This instructions significantly increase performance, but requires more time to debug and optimize code.
> 
> Source: four yrs of developing code for board computer of missile engines (РД type)


That's exactly what I mean lol, maybe I should replace "more work" with "more time" to make it clearer?


----------



## Vya Domus (Apr 9, 2020)

Here's something funny : https://godbolt.org/z/tKVQXA

With those flags it decided it can use AVX 512 of all things, so probably it figured it wasn't fast enough under AVX2 ? You can't rely on the compiler to do this sort of thing more often than not and keep in mind this computation is extremely simple. Things aren't usually like that.


----------



## Flaky (Apr 9, 2020)

In previous godbolt link (#7) you didn't set any optimization flags, only instruction set  No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see avx+ instructions. O3 generates insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases It isn't needed to see if certain optimizations happen or not.

This code may look obvious and simple, but when it comes to code generation there are many things to be taken into consideration. If something cannot be proven, extra code has to be generated to cover such case.
Can vectors used in inner loop alias?
Do vectors contain uneven numbers of elements, so that few in the end have to be treated one by one?
Are there any language rules that make certain optimizations not possible?

All of these require certain dialogue with compiler to make sure that it can both generate optimal code, and do not generate too much unnecessary code. It takes time and effort.


----------



## Vya Domus (Apr 10, 2020)

Flaky said:


> In previous godbolt link (#7) you didn't set any optimization flags, only instruction set  No way it could've ended up vectorized.
> 
> -O2 with -ftree-vectorize is enough to see avx+ instructions. O3 generates insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases It isn't needed to see if certain optimizations happen or not.



But that's the thing compilers are inconsistent and unpredictable with these things. With -march=native it generated AVX512 code, it's impractical that you'd always need to look up the assembly to see if the right instructions are used.

Edit :

I can't actually get it to generate AVX with -O2 -ftree-vectorize https://godbolt.org/z/XMNrNC


----------



## Flaky (Apr 10, 2020)

It's because you didn't specify an instruction set. Either set it explictly (-mavx...), or set the target architecture that supports it with -march=skylake (or any other).
By the way - don't use -march=native on compiler explorer, as it uses the architecture the compiler runs on. Today it may run on avx512 machine, tomorrow it may not 

Major compilers provide additional flags for diagnostics when vectorization happens, and if it doesn't - provide a reason why.


----------



## R-T-B (Apr 11, 2020)

Vya Domus said:


> That's why game developers stay away from optimizing games using AVX I guess



Damn straight: That's the compilers job.



Solaris17 said:


> The compiler and IDE can only babysit you so much.



But I like high level languages.  Go home, code ricer.


----------



## Vya Domus (Apr 11, 2020)

R-T-B said:


> Damn straight: That's the compilers job.
> 
> 
> 
> But I like high level languages.  Go home, code ricer.



Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.


----------



## GoldenX (Apr 11, 2020)

Nothing some good old machine language can't solve. Bonus points for using punching cards.


----------



## R-T-B (Apr 11, 2020)

Vya Domus said:


> Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.



If it can figure out the Linux kernel inlining optimizations it can figure out most anything.  I daresay it'd be more challenging to write convoluted code it couldn't figure out than to just write normal code.

And that's GCC, the shittiest of the bundle honestly.

My opinion, but of course, it's right because I have no idea what I'm talking about and in this day and age that stops nobody (this is me razzing myself, I do indeed have an idea but not much of one).

I primarily write Java and C# and only dabble in low level, remember.  These languages take care of everything, heck even garbage collection.  You can be as messy as you want.  It's good for lazy people like me.


----------



## efikkan (Apr 11, 2020)

R2DSF said:


> This is why devs reluctantly use AVX or FMA instructions.
> 
> This instructions significantly increase performance, but requires more time to debug and optimize code.


AVX is certainly harder to use, and the syntax of the intrinsics is nasty, and quickly can get out of hand for complex stuff, like in the example in post #1. I would have to really study that line to understand it, compared to the plain C++ one.

This might seem like an "impossible" task, but for most applications only a small part of the code is really performance critical, and often only a few lines of AVX intrinsics is needed to give a huge performance boost to the entire application.

Before even considering SIMD (like AVX), a few general optimizations are needed. Firstly, the data should be dense (and this may require restructuring in the non-AVX code too), and the data should be traversed linearly, this is a part of "cache optimizing". Secondly, bloat such as branching, function calls etc. inside the critical path (especially loops) should be reduced to a minimum. These are optimizations which will help a lot even before you use AVX, it also helps with what the compiler can do for you, but it is largely incompatible with the coding paradigms used in most applications today, which is the primary reason why we don't see low-level optimizations in many applications and especially games.

There are also some pitfalls with using some AVX features like FMA; while FMA is super fast, it will give higher precision than not using FMA (since it rounds after the calculation), which may be a problem for some applications, especially if you have two versions of the software.

And since this thread is angled towards games, there are a few other things to consider as well, and these are some of the reasons why AVX is rarely used in games;
- Most games use off-the-shelf engines and the studio do no engine code themselves.
- Most games are rushed "stitched together" code bases which are "discarded" after release. No one usually cares about code quality or maintenance, it's usually just "good enough". This is the reason why most shipping games today are "broken".
- Hardware support; AVX (SB or Bulldozer and newer), but not Celeron, Pentium, Atom and various low-end CPUs (correct me if I'm wrong here)
- In most games, the rendering thread(s) are usually not bottlenecked by math in the CPU code, or the math on the CPU side is not dense enough.
Don't get me wrong, I certainly wish AVX were used extensively, squeezing out every drop of performance we can get. But games are probably one of the use cases where the benefits are the least measurable. One good argument for AVX is energy efficiency, even if it doesn't necessarily help your FPS.


----------



## dragontamer5788 (Apr 24, 2020)

Vya Domus said:


> Write sufficiently convoluted code and the compiler will do jack, not everything is easily analyzable like linear algebra.



You might be surprised.

https://godbolt.org/z/Nz7dPj
The code you posted in the first post is clearly recognized by GCC's auto-vectorizer, and cleanly compiles into AVX2 instructions, including vaddps. Auto-vectorizers are getting surprisingly good these days, not as good as hand-crafted assembly, but good enough to learn how to use the "-mavx2 -O3" compiler flags at least.

EDIT: Wrong link the first time. It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that, there's probably something math-unsafe that in the vectorized version of the code that needs the higher-level compiler flag to ram through.


----------



## R-T-B (Apr 24, 2020)

Yep.  Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*


----------



## dragontamer5788 (Apr 24, 2020)

R-T-B said:


> Yep.  Thanks for proving I did in fact have an inkling of an idea of what I was talking about.
> 
> *goes back to writing shitty java code*



The REAL benefit to SIMD-programming requires a paradigm shift. There are specially designed programming languages, called CUDA or OpenCL (or for CPU-programmers, ISPC), which are better high-level representations that more easily compiles into highly optimized SIMD / SSE code.

The sad truth is, C / C++ are awkward to compile into highly optimized SIMD. However, if you just *barely* change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD. Intel clearly got the message, and AVX512 includes a ton of stuff that's useful to the *compiler* instead of the programmer, kind of putting AVX512 on similar footing as NVidia PTX or AMD GCN.

---------

With that being said, a ton of effort has been put into autovectorizing standard C / C++ code. And its going to get easier, with OpenMP extensions (which are similar to the CUDA-extensions) providing a language-extension to C/C++ to more easily play with SIMD-compilers. This is relatively state-of-the-art stuff, but feel free to play with OpenMP's "#pragma omp for simd" statement in GCC or Clang if you don't believe me. Its surprisingly easy... well... easy if you know the paradigm of OpenMP to start with.


----------



## Vya Domus (Apr 24, 2020)

dragontamer5788 said:


> It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that, there's probably something math-unsafe that in the vectorized version of the code that needs the higher-level compiler flag to ram through.



Probably the integer-float conversion. This is a basic example though, few things are this explicit and easy to auto vectorize. Compilers also do a performance-cost analysis which might also be wrong and lead to unoptimized code. For instance they'll hardly ever touch memory I/O instructions unless they absolutely have to and by that I mean they'll never really try to optimized data movement as they can't know if they really are faster or not or safe.



dragontamer5788 said:


> However, if you just *barely* change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD.



CUDA isn't really SIMD based as the vectorization isn't exposed anywhere at the software level, there is no exact concept of width or executing everything in lock-step as is the case in SIMD, the warps are just logical groups as far as the software side of things is concerned. In fact the hardware isn't organized in a SIMD fashion either, it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.


----------



## dragontamer5788 (Apr 24, 2020)

Vya Domus said:


> CUDA isn't really SIMD based as the vectorization isn't exposed anywhere at the software level, there is no exact concept of width or executing everything in lock-step as is the case in SIMD, the warps are just logical groups as far as the software side of things is concerned. In fact the hardware isn't organized in a SIMD fashion either, it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.



PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

If you don't believe me, then focus on the CPU-version of high-level SIMD. ISPC applies the lessons of CUDA except it outputs x86 AVX code instead. All the concepts of ISPC and CUDA remain compatible at a high level.

SIMD is almost identical to per-thread parallelism. That's the lesson of ISPC and CUDA. The compiler can convert per-thread parallelism into SIMD very, very easily. And OpenMP 4.5 proves that the approach works in standard C/C++ code.


----------



## bug (Apr 24, 2020)

Flanker said:


> The question comes down to:
> Is it worth putting in XXX% more work to achieve a program that runs YYY% faster. Nothing is free


In the world of programming, few things are worse than making the code harder to read. It is never only about the cost of writing that code, but also maintaining it. The cost snowballs.


----------



## Vya Domus (Apr 24, 2020)

dragontamer5788 said:


> PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).



PTX is not assembly, there is no physical 32 wide vector register/instruction in the hardware, it's a software abstraction, GPUs aren't SIMD machines. They used to be but not anymore, this is basically a legacy abstraction, it wouldn't surprise me if in the near feature manufactures are going to drop the concept of having a wavefront/warp entirely. It used to be that with OpenCL you had to use actual vector formats because the hardware was indeed configured like that.



dragontamer5788 said:


> SIMD is almost identical to per-thread parallelism.



It just isn't, while you can emulate SIMD architectures with SIMT with practically no performance penalty, the other way around can't be done efficiently because there are things that you simply can't do with SIMD. ISPC is basically a wrapper for SIMD, there is nothing special about it, both CUDA and ISPC are ways to express parallelism while writing essentially scalar code.



dragontamer5788 said:


> Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.



OK, let's clarify this : if there isn't a single register where one decoded instruction operates in lock-step on it until it's done then it's not SIMD.

It looks like it's you who doesn't understand what happens at the lowest level, basically SIMD means 1 instruction stream (aka thread), SIMT means multiple, that's the whole idea behind why GPUs are so fast. With SIMD when the I/O is blocked, nothing happens until the instruction finishes executing because it's part of just one thread.

With SIMT when one or more threads are blocked the SM can pause all threads within that warp and can switch to a different instruction where the data is available within a generic register space. That's right, there is a register space, not just 1 fixed register which is part of a set as it would be the case with SIMD. Moreover a SIMD EU would never proceed unless all data elements are loaded from memory, meanwhile a SIMT EU can, with limited performance, continue execution on a subset of the total threads within the warp (that's how GPUs do branching). 

The fundamental difference is : *SIMD depends exclusively on the data, SIMT does not.*


----------



## dragontamer5788 (Apr 24, 2020)

Vya Domus said:


> PTX is not assembly



Fine. I'll talk about SASS Assembly then. SASS assembly matches 1-to-1 with the actual machine code that NVidia GPUs execute, but it changes from generation to generation. SASS is undocumented because PTX is close enough to machine code that most people don't care. But if you care, I'm more than willing to talk about SASS instead. https://arxiv.org/pdf/1804.06826.pdf



> there is no physical 32 wide vector register/instruction in the hardware



This is blatantly false. All NVidia Volta and Turing machines have a 32-wide warp, even at the SASS assembly level. The SASS assembly has warp-specific barriers, read/write ordering, and other very low-level details revealed. I suggest you study Chapter 2 of the paper carefully, to see how these 32-wide warps actually execute at the SASS / Machine code level for Volta.



> this is basically a legacy abstraction, it wouldn't surprise me if in the near feature manufactures are going to drop the concept of having a wavefront/warp entirely.



You are confused about independent thread scheduling. While a program counter is now tracked on a per-thread basis, execution continues to happen on a 32-wide warp basis. Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.



			https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#independent-thread-scheduling
		




> Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.



Its a bit of a difficult read. But let me emphasize this point:



> A schedule optimizer determines how to group active threads from the same warp together into SIMT units.



The SIMT units are still executing in warps, even on Volta / Turing with Independent Thread Scheduling turned on. You are misunderstanding what is happening at the lowest level.

----------

Don't be swayed by NVidia's "SIMT" marketing. SIMT is damn near identical to late 80s SIMD code, including StarLisp or StarC. NVidia certainly is pushing the envelope with some of their designs, but a huge chunk of their literature is just marketing hype. I do think NVidia is the most advanced SIMD chips available today, but AMD GCN and Intel AVX512 are surprisingly close to its capabilities.


----------

