Agree! The good news is that many workloads are essentially vectorizable in nature. Unfortunately, the coding practices taught in school these days tell people to ignore the real problem (the data) and instead focus on building a complex "architecture" that hides and scatters the data and state all over the place, making SIMD (or any decent performance, really) virtually impossible.
Vectorizing the data is always a good start, but it may not be enough. Whenever you see a loop iterating over a dense array, there is some theoretical potential there, but the solution may not be obvious. Often the answer is to restructure the data so that similar data is grouped together (the data-oriented approach) rather than the typical "world modelling"; then the potential for SIMD often becomes obvious. But as we know, a compiler can never do this restructuring for you; the developer still has to do the groundwork (see the sketch below).
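To make that concrete, here is a minimal sketch of the kind of restructuring I mean (the particle example and all the names are made up for illustration, not from any real codebase):

```c++
#include <vector>
#include <cstddef>

// "World modelling" style: array of structs. Each particle's x is separated
// from the next particle's x by the rest of the struct, so a loop over the
// x values does strided loads and is hard to vectorize.
struct Particle { float x, y, z; bool alive; };
// std::vector<Particle> world;

// Data-oriented style: struct of arrays. Each field is dense and contiguous,
// so 16 consecutive x values fill one AVX-512 register.
struct Particles {
    std::vector<float> x, y, z;
};

// A loop like this is now trivially vectorizable, by the compiler or by hand:
void advance_x(Particles& p, const std::vector<float>& vx, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += vx[i] * dt;   // one 16-wide FMA per 16 particles
}
```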
I haven't had time to look into the details of AMX yet, but my gut feeling tells me that any 2D data should have some potential here: images, video, etc.
In my experience, performance does not come from a 10%-20% IPC improvement, i.e. executing the same stupid sequence of instructions slightly faster by shaving a fifth of a clock cycle off here and there, or even from adding 50% more cores. That's not where it is.
Performance comes from:
1) Understanding the nature of the problem you are trying to solve -> this determines your options for organizing data and which algorithms to use. Vectorizing the problem quickly becomes the best way to get a stepwise improvement, e.g. 2x or more. Vectorization is also a more efficient way to parallelize than just adding general-purpose cores (more independent, uncoordinated cores fighting over the same memory resources): more cores work best on unrelated tasks (different data), while vectorization works best for parallelizing one task over one (potentially huge) data set and getting that task done fastest.
2) Understanding the instruction set of the processor you are targeting -> focus on the instructions that get the job done fastest (currently that is AVX-512; see the sketch after this list).
3) Understanding caching and memory access: how data should be organized and accessed to maximize throughput and keep the CPU optimally fed.
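A small, made-up example of points 1 and 2 together: a per-element condition that would be a branchy scalar loop becomes a branch-free, 16-wide loop using AVX-512's mask registers (one of the features that makes AVX-512 more than just "wider AVX2"):

```c++
#include <immintrin.h>
#include <cstddef>

// Add 'delta' only to elements below a threshold. The scalar version takes a
// branch per element; with AVX-512 masks we process 16 floats per iteration
// with no branches at all.
void bump_below(float* a, std::size_t n, float threshold, float delta) {
    const __m512 t = _mm512_set1_ps(threshold);
    const __m512 d = _mm512_set1_ps(delta);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(a + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, t, _CMP_LT_OQ); // lanes where v < threshold
        v = _mm512_mask_add_ps(v, m, v, d);                 // add only in masked lanes
        _mm512_storeu_ps(a + i, v);
    }
    for (; i < n; ++i)                                      // scalar tail
        if (a[i] < threshold) a[i] += delta;
}
```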
If you do this with a Skylake or Cascade Lake AVX-512-capable CPU (e.g. Cascade Lake-X: Core i9-10980XE with 2 FMA units per core), the bottleneck is NOT the CPU but RAM (and the memory hierarchy). The focus shifts to reading and writing data as fast as possible: maximize use of the different cache levels by doing as many operations as possible on the data while it's still hot in cache, before you move on to the next subset of data (sized to fit the caches). For example:
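Here is a sketch of that blocking idea (the block size and the three steps are purely illustrative, not tuned for any particular chip):

```c++
#include <cstddef>
#include <cmath>

// Naive: three full passes over a big array, each pass streaming from RAM.
// Blocked: do all three steps on one cache-sized chunk before moving on, so
// steps 2 and 3 hit data that is still hot in L2 instead of going to memory.
constexpr std::size_t kBlock = 128 * 1024 / sizeof(float);  // ~128 KB chunk

void process_blocked(float* data, std::size_t n) {
    for (std::size_t b = 0; b < n; b += kBlock) {
        const std::size_t end = (b + kBlock < n) ? b + kBlock : n;
        for (std::size_t i = b; i < end; ++i) data[i] *= 2.0f;              // step 1
        for (std::size_t i = b; i < end; ++i) data[i] += 1.0f;              // step 2
        for (std::size_t i = b; i < end; ++i) data[i] = std::sqrt(data[i]); // step 3
        // The naive version would pull the whole array across the memory bus
        // three times; this version pulls each cache-sized chunk across once.
    }
}
```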
Designing the data structures and algorithms around optimal data flow becomes the way you get max performance.
The point is that a Cascade Lake-X Core i9-10980XE is not the bottleneck, the memory subsystem is. You just can't feed this beast fast enough: it chews through everything you can throw at it, quickly saturating the full memory bandwidth, and that is as good as it gets, assuming you vectorize your data and algorithms.
I'll try to make the point again using the 10980XE as an example. Thanks to its superscalar nature, each core can reorder independent instructions in the pipeline and execute them in parallel across 10 differently capable execution ports, so each core already has incredible parallelism as it is. BUT most importantly for what I'm talking about here, a 10980XE has 36 vector FMA/ALU units (2 per core). Each core effectively has two 16-float-wide (or 16 x 32-bit-integer-wide) vector FMA/ALUs that can execute 2 SIMD instructions in parallel, if programmed correctly, and that is the big if. With proper programming you effectively have 36 vector FMA or ALU "cores" (depending on whether you're doing float or integer work).
From a pure number-crunching perspective, that makes the 10980XE a 36-core, 16-float-wide number-crunching CPU (not an 18-core one)!
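Here is what exploiting the two FMA ports looks like in practice: a dot product with two independent accumulator chains so both ports have work every cycle (a sketch; real code would use even more chains to hide the multi-cycle FMA latency):

```c++
#include <immintrin.h>
#include <cstddef>

// Two independent accumulators -> two independent FMA dependency chains,
// so the two FMA ports on a Skylake-X core can both stay busy.
float dot(const float* a, const float* b, std::size_t n) {
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {   // 32 floats = two 16-wide FMAs per iteration
        acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                               _mm512_loadu_ps(b + i),      acc0);
        acc1 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i + 16),
                               _mm512_loadu_ps(b + i + 16), acc1);
    }
    float sum = _mm512_reduce_add_ps(_mm512_add_ps(acc0, acc1));
    for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
    return sum;
}
// Peak math for the 10980XE: 2 FMA ports x 16 lanes x 2 FLOPs (multiply+add)
// = 64 single-precision FLOPs per cycle per core, 1152 for all 18 cores.
```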
So why aren't we seeing these gains reflected in most of today's apps? AVX-512 has been on the market for 3 years now (since the Core i9-7980XE), so we should be seeing more of this, right?
Answer: developers have simply not taken the time to vectorize their code. Why is that? A few reasons I can think of:
1) They are under too much budget pressure, so they cannot go back and redesign their inefficient sequential code (commercial dev houses). This is inexcusably poor management! Why? Well, if there is a business case for faster CPUs (and that's the whole business model for AMD and Intel, so yes, there is), then there is also a business case for faster software.
If this is the case, dev houses are basically relying on that gen-over-gen 10%-20% IPC improvement to continue, giving users minuscule, imperceptible performance increases generation over generation. By the way, it is really hard to shave yet another fraction of a clock cycle off each instruction when you have already been doing exactly that, gen over gen, for decades.
2) Incompetent software developers (no, a Python class does not make you a computer scientist).
3) Lazy developers, and NO, a compiler will NOT compensate for laziness.
4) Also Intel's fault: they could have spent time developing efficient API layers that use the new instructions optimally (encapsulating the most common algorithms and data structures in vectorized form), made them broadly available for free, and "sold" them to the dev houses (I mean educated them, not made money from selling the APIs). They did some of this, but sorry, MKL (Intel Math Kernel Library) is not enough.
This problem will be even bigger with AMX, because if AVX-512 requires a software redesign, then AMX is even more of one: you really need to "vectorize/matricize", putting your data into vectors and/or matrices, to get maximum performance and throughput out of Sapphire Rapids. It will be even harder for a compiler to do this automatically (compared to AVX-512); not going to happen! See the sketch below.
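For the curious, here is roughly what "matricizing" for AMX looks like, going by Intel's published programming reference. Sapphire Rapids isn't shipping yet, so treat this as an untested sketch: the tile shapes, config layout, and packed-B requirement are from the docs, while the helper function itself is made up for illustration:

```c++
#include <immintrin.h>
#include <cstdint>

// One AMX tile op does a 16x64 (int8) x 64x16 (int8) multiply, accumulating
// into a 16x16 int32 tile: a whole small GEMM per instruction.
// Build with -mamx-tile -mamx-int8; on Linux the process must also request
// AMX tile state via arch_prctl before touching the tile registers.

struct alignas(64) TileConfig {   // 64-byte configuration blob for ldtilecfg
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];           // bytes per row for each tile register
    uint8_t  rows[16];            // number of rows for each tile register
};

void tile_gemm_16x16(const int8_t* A, const int8_t* B_packed, int32_t* C) {
    TileConfig cfg{};             // zero-initialized
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // tmm0: C, 16x16 int32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   // tmm1: A, 16x64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   // tmm2: B, pre-packed in groups of 4
    _tile_loadconfig(&cfg);
    _tile_zero(0);
    _tile_loadd(1, A, 64);                 // load A tile, 64-byte row stride
    _tile_loadd(2, B_packed, 64);          // load pre-packed B tile
    _tile_dpbssd(0, 1, 2);                 // C += A*B via int8 dot products -> int32
    _tile_stored(0, C, 64);                // store the 16x16 int32 result
    _tile_release();
}
```

Note what the data layout demands: B has to be re-packed into the interleaved tile format before the multiply is even possible. That re-packing IS the "matricizing" work the programmer has to do; no compiler is going to derive it from a triple-nested scalar loop.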
In summary, it will be down to the individual programmer to get 2x-4x performance out of the next gen of AVX-512/AMX Intel CPUs. If not, we will just get the usual 10%-20% IPC improvement and a lot of wasted silicon in the AVX-512 and AMX units.
The bigger point I'm making is that design and programming have to change to "vectorize/matricize" data and algorithms by default, NOT as some sort of obscure afterthought.