The main idea I wanted to pass is that you don't have to, consciously, call Intel MKL. That's the whole point of high-level programming after all.
Some people seem to think utilizing AVX-512 needs rewriting of programs. Or that the code will only work on Intel, because it's full of "multiplyUsingAvx512" or whatever.
AVX(2) have proven to scale very well on AMD too, sometimes even better relatively speaking vs. no AVX, which has probably to do with the preparations needed in order to use any kind of SIMD, which actually helps to lighten the load for the front-end of the CPU. Too bad AMD is still not implementing AVX-512. Once client applications starts to utilize it properly, people will not look back.
As to optimizing programs in general; in most code bases only a fraction of the code is actually performance critical. Even for those rare cases where applications contain assembly optimizations, it's usually just a couple of tiny spots at the choke point in an algorithm in a tight loop somewhere, where using assembly results in a 50% gain or something. Most such cases are ideal for SIMD, which means you would use AVX intrinsics instead of assembly instructions (even though these are just macros mapping to AVX assembly instructions), and now you may get a 10-50x gain instead.
Optimizing applications usually focuses on optimizing algorithms and some of the core engine surrounding them. Most performance critical paths are fairly limited in terms of the code involved, and throughout the rest of the application code your high-level abstractions will not make any significant impact on performance at all. But in that critical path, all bloat like abstractions, function calls, non-linear memory accesses and branching will come at a cost. So the first step of optimizing this is removal of abstractions and branching (to the extent possible), then cache optimize it with linear memory layouts (this step may require rewriting code surrounding the algorithm). By this point you should have tight loops without function calls, memory allocations etc. inside them. And only then you are ready to use SIMD, but will also get huge gains from doing so. Doing this kind of optimizations is generally not possible in languages higher than C++. You may still be able to interface with libraries containing low-level optimizations and get some gains there, but that would be it.
One of the good news about writing code for SIMD is that the preparations are the same, so upgrading an existing code from SSE to AVX or a newer version of AVX is simple, just change a few lines of code and tweak some loops. The hard work is the preparations for SIMD.
The OS - yes. The software and libraries? Not really.
That's where you're wrong.
I mentioned Intel Clear Linux, one of the main features is the optimizations to
libc, the runtimes for the C standard library. Almost every native application uses this library. The gains are usually in the 5-30% range, so nothing fantastic, but that's free performance gains for most of your applications, and who doesn't want that? The only reason why I don't use it is that this Linux distro is an experimental rolling release distro, and I need a stable machine for work, and don't have time to spend all day troubleshooting bugs. If it was properly supported by e.g. Ubuntu, I would switch to it.
The counterpart of libc for MS Visual Studio is msvcrt.dll and msvcpxx.dll, which most of your heavier applications and games rely on. There are of course
some optimizations in MS's standard library, and they even open sourced some of it recently, but from what I can see it's mostly older SSE. If these were updated to utilize AVX2 or even AVX-512, I'm sure many Windows users would appreciate it. The problem is compatibility; so they would either have to ship two versions or drop hardware support.
On Windows everything is usually supplied with software, so one doesn't really have to care.
On Linux programs tend to use OS-wide libraries / interpreters / compilers, so sometimes using Intel MKL requires a bit of work. But it shouldn't be an issue for people who already tolerate Linux.
Another misconception.
Most Windows software rely on MS Visual Studio's runtime libraries, including everything bundled with Windows itself, which is why most Windows applications don't have to "care". The only times they do, is when they are compiled with a more recent Visual Studio version, as Visual Studio choose to duplicate the library for every new version, which I assume is their approach for compatibility.
This is no different than Linux, except it uses libc instead, which comes bundled with every Linux distro. Most Linux software is also POSIX compliant, which makes it easy to port to BSD, MacOS, and even Windows (which is have partial compliance).
The differences which may have confused you are when it comes to GUI libraries etc., since there is not
one "GUI API" like in Windows. Linux applications may rely on GTK, Qt, wxWidgets and many others. When such applications are ported to Windows, they usually needs runtimes for those libraries, examples of such applications which you may be familiar with includes; Firefox, VLC, LibreOffice, GIMP, Handbrake, VirtualBox, Teamviewer etc.