
AVX vs non-AVX

In light of the Vulkan vs OpenGL comparison posted some time ago about why developers supposedly don't want to use Vulkan, I thought it would make sense to show something similar on the CPU side of things :).

Scalar, non-AVX :

Code:
#include <vector>
using std::vector;

// Interpolates every even row from its two odd neighbours (scalar version).
void interpolate(vector<vector<int>>& mat)
{
    for (int i = 2; i < (int)mat.size() - 1; i += 2)
        for (int j = 0; j < (int)mat[0].size(); j++)
        {
            // implicit int -> float -> int conversion happens here
            mat[i][j] = mat[i-1][j] + 0.5f * (mat[i+1][j] - mat[i-1][j]);
        }
}

AVX :

Code:
#include <immintrin.h>
#include <vector>
using std::vector;

// Same interpolation, 8 columns at a time with AVX intrinsics
// (assumes each row's width is a multiple of 8).
void interpolate_avx(vector<vector<int>>& mat)
{
    for (int i = 2; i < (int)mat.size() - 1; i += 2)
        for (int j = 0; j < (int)mat[0].size(); j += 8)
        {
            // load rows i-1 and i+1, convert to float, interpolate with 0.5f, convert back, store
            _mm256_storeu_si256((__m256i *)&mat[i][j], _mm256_cvtps_epi32(_mm256_add_ps(_mm256_mul_ps(_mm256_sub_ps(_mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i+1][j])), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j]))), _mm256_set1_ps(0.5f)), _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j])))));
        }
}

That's why game developers stay away from optimizing games using AVX, I guess :). In all seriousness, there is always a price to pay for better performance: the code above runs about 6.5 times faster even though it looks unintelligible. Same thing with Vulkan vs OpenGL vs DirectX or whatever; it makes no sense to compare things like this.
 
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
 
Shouldn't the compiler handle that on its own?
 
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
Nope. It is all on the developer's head.
This is why devs are reluctant to use AVX or FMA instructions.
These instructions significantly increase performance, but require more time to debug and optimize code.

Source: four years of developing code for the onboard computers of missile engines (РД type)
 
The compiler and IDE can only babysit you so much.

Challenge accepted. AI IDE/compilers incoming. :p

Seriously though that looks awful lol.
 
Shouldn't the compiler handle that on its own?

Theoretically, yes; in practice GCC didn't vectorize this using AVX, probably because of the implicit conversion from integer to float, so it decided it shouldn't touch it.


No YMM registers in sight for the non-AVX function, which means no AVX instructions.

Seriously though that looks awful lol.

It does, but it doesn't necessarily need to; there are libraries that do the same thing under much more legible code. The point was that it makes no sense to compare things like that: one does something, the other something else.
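Even without a library, the same inner-loop body from post #1 reads a lot better once it's split into named steps. A quick sketch (same intrinsics, just reformatted; the helper name interp_row_avx is mine):

Code:
#include <immintrin.h>
#include <vector>
using std::vector;

// Same computation as the one-liner above, split into named steps.
static inline void interp_row_avx(vector<vector<int>>& mat, int i, int j)
{
    __m256 half = _mm256_set1_ps(0.5f);
    // rows i-1 and i+1 as packed floats
    __m256 prev = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j]));
    __m256 next = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i+1][j]));
    // prev + 0.5f * (next - prev)
    __m256 lerp = _mm256_add_ps(_mm256_mul_ps(_mm256_sub_ps(next, prev), half), prev);
    // convert back to 32-bit ints and store into row i
    _mm256_storeu_si256((__m256i *)&mat[i][j], _mm256_cvtps_epi32(lerp));
}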
 
Nope. It is all on the developer's head.
This is why devs are reluctant to use AVX or FMA instructions.
These instructions significantly increase performance, but require more time to debug and optimize code.

Source: four years of developing code for the onboard computers of missile engines (РД type)
That's exactly what I mean lol. Maybe I should replace "more work" with "more time" to make it clearer?
 
Here's something funny: https://godbolt.org/z/tKVQXA

With those flags it decided it could use AVX-512 of all things, so it probably figured it wasn't fast enough under AVX2? More often than not you can't rely on the compiler to do this sort of thing, and keep in mind this computation is extremely simple. Things usually aren't like that.
 
In the previous godbolt link (#7) you didn't set any optimization flags, only the instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see AVX+ instructions. -O3 generates an insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases it isn't needed just to see whether certain optimizations happen or not.

This code may look obvious and simple, but when it comes to code generation there are many things to be taken into consideration. If something cannot be proven, extra code has to be generated to cover such a case:
- Can the vectors used in the inner loop alias?
- Do the vectors contain an uneven number of elements, so that a few at the end have to be treated one by one?
- Are there any language rules that make certain optimizations impossible?

All of these require a certain dialogue with the compiler to make sure that it can both generate optimal code and not generate too much unnecessary code. It takes time and effort.
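To make that dialogue concrete, here's a minimal sketch of my own (not from this thread): the same kind of interpolation over a flat, contiguous row, with __restrict (a common compiler extension) answering the aliasing question up front so the compiler doesn't have to emit a runtime overlap check or give up.

Code:
#include <cstddef>

// Hypothetical flat-buffer variant: one row interpolated from its two neighbours.
// __restrict promises the compiler that the three pointers never alias.
void interpolate_row(int* __restrict dst,
                     const int* __restrict prev,
                     const int* __restrict next,
                     std::size_t width)
{
    for (std::size_t j = 0; j < width; ++j)
        dst[j] = (int)(prev[j] + 0.5f * (next[j] - prev[j]));
}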
 
In the previous godbolt link (#7) you didn't set any optimization flags, only the instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see AVX+ instructions. -O3 generates an insane amount of code trying to cover and fully optimize almost every possible scenario and does heavy unrolling. In most cases it isn't needed just to see whether certain optimizations happen or not.

But that's the thing: compilers are inconsistent and unpredictable with these things. With -march=native it generated AVX-512 code; it's impractical to always have to look up the assembly to see whether the right instructions are used.

Edit:

I can't actually get it to generate AVX with -O2 -ftree-vectorize: https://godbolt.org/z/XMNrNC
 
It's because you didn't specify an instruction set. Either set it explicitly (-mavx...), or set a target architecture that supports it with -march=skylake (or any other).
By the way, don't use -march=native on Compiler Explorer, as it uses the architecture the compiler runs on. Today it may run on an AVX-512 machine, tomorrow it may not ;)

Major compilers provide additional diagnostic flags that report when vectorization happens, and if it doesn't, give a reason why.
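For reference, here's a toy example of mine with the diagnostic flags I have in mind (GCC's -fopt-info-vec-* and Clang's -Rpass family):

Code:
// GCC:   g++ -O2 -ftree-vectorize -march=skylake -fopt-info-vec-optimized -fopt-info-vec-missed -c vec.cpp
// Clang: clang++ -O2 -march=skylake -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -c vec.cpp
#include <cstddef>
#include <vector>

void scale(std::vector<float>& v, float k)
{
    // simple enough that the vectorizer should report success on this loop
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= k;
}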
 
Damn straight: that's the compiler's job.



But I like high level languages. Go home, code ricer.

Write sufficiently convoluted code and the compiler will do jack; not everything is easily analyzable like linear algebra.
 
Nothing some good old machine language can't solve. Bonus points for using punch cards.
 
Write sufficiently convoluted code and the compiler will do jack; not everything is easily analyzable like linear algebra.

If it can figure out the Linux kernel inlining optimizations it can figure out most anything. I daresay it'd be more challenging to write convoluted code it couldn't figure out than to just write normal code.

And that's GCC, the shittiest of the bunch honestly.

That's just my opinion, but of course it's right, because I have no idea what I'm talking about, and in this day and age that stops nobody (this is me razzing myself; I do indeed have an idea, just not much of one).

I primarily write Java and C# and only dabble in low level, remember. These languages take care of everything, heck even garbage collection. You can be as messy as you want. It's good for lazy people like me.
 
This is why devs are reluctant to use AVX or FMA instructions.

These instructions significantly increase performance, but require more time to debug and optimize code.
AVX is certainly harder to use, the syntax of the intrinsics is nasty, and it can quickly get out of hand for complex stuff, like in the example in post #1. I would have to really study that line to understand it, compared to the plain C++ one.

This might seem like an "impossible" task, but for most applications only a small part of the code is really performance critical, and often only a few lines of AVX intrinsics are needed to give a huge performance boost to the entire application.

Before even considering SIMD (like AVX), a few general optimizations are needed. Firstly, the data should be dense (this may require restructuring in the non-AVX code too) and it should be traversed linearly; this is part of "cache optimization". Secondly, bloat such as branching, function calls etc. inside the critical path (especially loops) should be reduced to a minimum. These optimizations help a lot even before you use AVX, and they also help with what the compiler can do for you, but they are largely incompatible with the coding paradigms used in most applications today, which is the primary reason why we don't see low-level optimizations in many applications and especially games.
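To illustrate the "dense data" point with a made-up example (the names here are mine): an array-of-structs layout strides over memory, while a struct-of-arrays layout keeps the field you're crunching contiguous, which is what both the cache and the vector units want.

Code:
#include <vector>

// Array-of-structs: consecutive y values are 12 bytes apart in memory.
struct Particle  { float x, y, z; };

// Struct-of-arrays: all y values are packed together, ready to be streamed.
struct Particles { std::vector<float> x, y, z; };

void apply_gravity_aos(std::vector<Particle>& p, float dy)
{
    for (Particle& q : p) q.y -= dy;   // strided access, harder to vectorize well
}

void apply_gravity_soa(Particles& p, float dy)
{
    for (float& y : p.y) y -= dy;      // dense, linear access over one contiguous array
}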

There are also some pitfalls with using certain AVX features like FMA; while FMA is super fast, it will give higher precision than not using FMA (since it only rounds once, after the combined multiply-add), which may be a problem for some applications, especially if you have two versions of the software.

And since this thread is angled towards games, there are a few other things to consider as well, and these are some of the reasons why AVX is rarely used in games:
- Most games use off-the-shelf engines, and the studios do no engine code themselves.
- Most games are rushed, "stitched together" code bases which are "discarded" after release. No one usually cares about code quality or maintenance; it's usually just "good enough". This is the reason why most shipping games today are "broken".
- Hardware support: AVX requires Sandy Bridge or Bulldozer and newer, and isn't available on Celeron, Pentium, Atom and various low-end CPUs (correct me if I'm wrong here).
- In most games, the rendering thread(s) are usually not bottlenecked by math in the CPU code, or the math on the CPU side is not dense enough.
Don't get me wrong, I certainly wish AVX were used extensively, squeezing out every drop of performance we can get. But games are probably one of the use cases where the benefits are the least measurable. One good argument for AVX is energy efficiency, even if it doesn't necessarily help your FPS.
 
Write sufficiently convoluted code and the compiler will do jack; not everything is easily analyzable like linear algebra.

You might be surprised.


The code you posted in the first post is clearly recognized by GCC's auto-vectorizer and cleanly compiles into AVX2 instructions, including vaddps. Auto-vectorizers are getting surprisingly good these days; not as good as hand-crafted assembly, but good enough that it's worth learning how to use the "-mavx2 -O3" compiler flags at least.

EDIT: Wrong link the first time. It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that; there's probably something math-unsafe in the vectorized version of the code that needs the higher optimization level to ram through.
 
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*
 
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*

The REAL benefit of SIMD programming requires a paradigm shift. There are specially designed programming languages, such as CUDA or OpenCL (or for CPU programmers, ISPC), which are better high-level representations that compile more easily into highly optimized SIMD / SSE code.

The sad truth is, C / C++ are awkward to compile into highly optimized SIMD. However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD. Intel clearly got the message, and AVX512 includes a ton of stuff that's useful to the compiler instead of the programmer, kind of putting AVX512 on a similar footing to NVidia PTX or AMD GCN.

---------

With that being said, a ton of effort has been put into auto-vectorizing standard C / C++ code. And it's going to get easier, with OpenMP extensions (which are similar to the CUDA extensions) providing a language extension to C/C++ to more easily play with SIMD compilers. This is relatively state-of-the-art stuff, but feel free to play with OpenMP's "#pragma omp for simd" statement in GCC or Clang if you don't believe me. It's surprisingly easy... well... easy if you know the paradigm of OpenMP to start with.
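If you want to try it, here's a minimal sketch of my own (using the standalone "#pragma omp simd" form, which with GCC or Clang only needs -fopenmp-simd and no OpenMP runtime):

Code:
// Build, for example: g++ -O2 -fopenmp-simd -march=skylake -c omp_simd.cpp
#include <cstddef>

void saxpy(float* __restrict y, const float* __restrict x, float a, std::size_t n)
{
    // Asks the compiler to vectorize the loop: each lane behaves like one of the
    // per-thread work items described above, but it all compiles down to SIMD.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}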
 
It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that; there's probably something math-unsafe in the vectorized version of the code that needs the higher optimization level to ram through.

Probably the integer-float conversion. This is a basic example though; few things are this explicit and easy to auto-vectorize. Compilers also do a performance-cost analysis which might be wrong and lead to unoptimized code. For instance, they'll hardly ever touch memory I/O instructions unless they absolutely have to, and by that I mean they'll never really try to optimize data movement, as they can't know whether it would actually be faster or even safe.

However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD.

CUDA isn't really SIMD based, as the vectorization isn't exposed anywhere at the software level: there is no exact concept of width or of executing everything in lock-step as is the case with SIMD, and the warps are just logical groups as far as the software side of things is concerned. In fact, the hardware isn't organized in a SIMD fashion either; it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.
 
CUDA isn't really SIMD based, as the vectorization isn't exposed anywhere at the software level: there is no exact concept of width or of executing everything in lock-step as is the case with SIMD, and the warps are just logical groups as far as the software side of things is concerned. In fact, the hardware isn't organized in a SIMD fashion either; it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.

PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

If you don't believe me, then focus on the CPU-version of high-level SIMD. ISPC applies the lessons of CUDA except it outputs x86 AVX code instead. All the concepts of ISPC and CUDA remain compatible at a high level.

SIMD is almost identical to per-thread parallelism. That's the lesson of ISPC and CUDA. The compiler can convert per-thread parallelism into SIMD very, very easily. And OpenMP 4.5 proves that the approach works in standard C/C++ code.
 
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
In the world of programming, few things are worse than making the code harder to read. It is never only about the cost of writing that code, but also maintaining it. The cost snowballs.
 
PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

PTX is not assembly; there is no physical 32-wide vector register/instruction in the hardware, it's a software abstraction, and GPUs aren't SIMD machines. They used to be, but not anymore; this is basically a legacy abstraction, and it wouldn't surprise me if in the near future manufacturers drop the concept of having a wavefront/warp entirely. It used to be that with OpenCL you had to use actual vector formats because the hardware was indeed configured like that.

SIMD is almost identical to per-thread parallelism.

It just isn't. While you can emulate SIMD architectures with SIMT with practically no performance penalty, the other way around can't be done efficiently, because there are things that you simply can't do with SIMD. ISPC is basically a wrapper for SIMD, there is nothing special about it; both CUDA and ISPC are ways to express parallelism while writing essentially scalar code.

Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.

OK, let's clarify this: if there isn't a single register on which one decoded instruction operates in lock-step until it's done, then it's not SIMD.

It looks like it's you who doesn't understand what happens at the lowest level. Basically, SIMD means one instruction stream (aka thread), SIMT means multiple; that's the whole idea behind why GPUs are so fast. With SIMD, when the I/O is blocked, nothing happens until the instruction finishes executing, because it's part of just one thread.

With SIMT, when one or more threads are blocked, the SM can pause all threads within that warp and switch to a different instruction whose data is available within a generic register space. That's right, there is a register space, not just one fixed register that is part of a set, as would be the case with SIMD. Moreover, a SIMD EU would never proceed unless all data elements are loaded from memory, whereas a SIMT EU can, with limited performance, continue execution on a subset of the total threads within the warp (that's how GPUs do branching).

The fundamental difference is: SIMD depends exclusively on the data, SIMT does not.
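For reference, this is what "branching" looks like on the SIMD side (a quick AVX2 sketch of mine): nothing gets paused or rescheduled, you compute both sides for every lane and blend with a mask.

Code:
#include <immintrin.h>

// if (x > 0) x += 10; else x -= 10;  -- for 8 ints at once, via masking (requires AVX2).
__m256i add_or_sub_ten(__m256i x)
{
    __m256i mask  = _mm256_cmpgt_epi32(x, _mm256_setzero_si256()); // all-ones in lanes where x > 0
    __m256i plus  = _mm256_add_epi32(x, _mm256_set1_epi32(10));    // "then" side, computed for every lane
    __m256i minus = _mm256_sub_epi32(x, _mm256_set1_epi32(10));    // "else" side, computed for every lane
    return _mm256_blendv_epi8(minus, plus, mask);                  // per-lane select: plus where the mask is set
}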
 
PTX is not assembly

Fine. I'll talk about SASS Assembly then. SASS assembly matches 1-to-1 with the actual machine code that NVidia GPUs execute, but it changes from generation to generation. SASS is undocumented because PTX is close enough to machine code that most people don't care. But if you care, I'm more than willing to talk about SASS instead. https://arxiv.org/pdf/1804.06826.pdf

there is no physical 32-wide vector register/instruction in the hardware

This is blatantly false. All NVidia Volta and Turing machines have a 32-wide warp, even at the SASS assembly level. The SASS assembly has warp-specific barriers, read/write ordering, and other very low-level details revealed. I suggest you study Chapter 2 of the paper carefully, to see how these 32-wide warps actually execute at the SASS / Machine code level for Volta.

this is basically a legacy abstraction, and it wouldn't surprise me if in the near future manufacturers drop the concept of having a wavefront/warp entirely.

You are confused about independent thread scheduling. While a program counter is now tracked on a per-thread basis, execution continues to happen on a 32-wide warp basis. Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.


Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

It's a bit of a difficult read. But let me emphasize this point:

A schedule optimizer determines how to group active threads from the same warp together into SIMT units.

The SIMT units are still executing in warps, even on Volta / Turing with Independent Thread Scheduling turned on. You are misunderstanding what is happening at the lowest level.

----------

Don't be swayed by NVidia's "SIMT" marketing. SIMT is damn near identical to late-80s SIMD code, including StarLisp or StarC. NVidia certainly is pushing the envelope with some of their designs, but a huge chunk of their literature is just marketing hype. I do think NVidia makes the most advanced SIMD chips available today, but AMD GCN and Intel AVX512 are surprisingly close to their capabilities.
 
Last edited: