I don't even know if AVX-512 has any advantages over GPU compute in any usage scenario, especially now that GPU clocks are approaching AVX-512 clocks. But Intel will still use the AVX-512 benchmarks to "prove" that they're beating the competition.
There should be no doubt that AVX-512 is much faster than AVX2. Not only does it have twice the width, it also adds many more operations and a lot more flexibility, which should allow compilers to autovectorize even more code (in cases where programmers don't use intrinsics directly).
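To make the "more flexibility" part concrete: AVX-512 adds per-lane mask registers, which turn a branch inside a loop into a single predicated instruction, and that is exactly the kind of thing that lets compilers vectorize loops they would otherwise leave scalar. A minimal sketch with intrinsics (the function name and data layout are my own, purely for illustration):

```c
#include <immintrin.h>
#include <stddef.h>

/* Clamp negative floats to zero, 16 lanes at a time.
 * The condition lives in a mask register (__mmask16), and the store
 * is predicated on it -- no separate compare/blend sequence needed. */
void clamp_negatives(float *data, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v    = _mm512_loadu_ps(data + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
        /* Only the lanes where m is set get written. */
        _mm512_mask_storeu_ps(data + i, m, _mm512_setzero_ps());
    }
    for (; i < n; i++)   /* scalar tail */
        if (data[i] < 0.0f)
            data[i] = 0.0f;
}
```

Under AVX2 the same idea needs an explicit vector compare feeding a blend or vmaskmovps, and once the loop body gets any more complicated than this, compilers frequently give up on vectorizing it at all.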
But keep in mind that running existing AVX2 code through AVX-512 units will have no real benefit; the code has to be recompiled or rewritten before the wider registers and new instructions can do anything for it.
Even VIA has implemented AVX-512 in its latest design, despite running it through two fused 256-bit vector units. This may seem pointless to some of you, but it still gains benefits such as: 1) the new types of operations in AVX-512, 2) improved instruction cache utilization, and 3) better ISA compatibility with future software. This is kind of analogous to Zen 1, which added AVX2 support despite executing it on two fused 128-bit vector units (and to Sandy Bridge's initial AVX(1) implementation before that).
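To illustrate point 1: stream compaction (vcompressps) is a class of operation with no single-instruction AVX2 equivalent, and even a double-pumped 256-bit implementation gets it far more cheaply than the shuffle-table workarounds AVX2 code has to resort to. A hedged sketch (the function name is my own):

```c
#include <immintrin.h>
#include <stddef.h>

/* Copy only the positive elements of src to dst, returning the count.
 * vcompressps packs the selected lanes contiguously in one instruction;
 * on AVX2 this takes a lookup-table-driven shuffle per vector. */
size_t keep_positives(float *dst, const float *src, size_t n)
{
    size_t out = 0, i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v    = _mm512_loadu_ps(src + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(dst + out, m, v);
        out += (size_t)_mm_popcnt_u32(m);  /* lanes actually stored */
    }
    for (; i < n; i++)   /* scalar tail */
        if (src[i] > 0.0f)
            dst[out++] = src[i];
    return out;
}
```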
Quoting Linus Torvalds - "I'd much rather see that transistor budget used on other things that are much more relevant. Even if it's still FP math (in the GPU, rather than AVX-512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX-512) like AMD did."
And people who don't know better will use this quote forever, despite it being total BS.
SIMD inside the CPU has essentially no invocation latency and can be mixed in with other operations. Communicating with a GPU is only worth it for huge batches of data, due to the extreme round-trip latency.
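As a toy illustration of that mixing (all names here are mine): a four-wide dot product that inlines to a handful of instructions with single-digit-cycle latency, sitting directly inside ordinary scalar control flow. A GPU round trip for something this small would cost microseconds in launch and transfer overhead before any math happened.

```c
#include <immintrin.h>

/* 4-wide dot product, small enough to drop into branchy scalar code.
 * The SIMD instructions issue like any other CPU instructions; there
 * is no dispatch, copy, or synchronization step. */
static inline float dot4(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* Horizontal sum of the four lanes. */
    __m128 sum  = _mm_add_ps(prod, _mm_movehl_ps(prod, prod));
    sum         = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 1));
    return _mm_cvtss_f32(sum);
}
```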
It doesn't. Everyone would like to say that it does in latency-sensitive applications, but that makes little sense in reality: the type of application that can be sped up using SIMD is likely to be highly data-parallel, and for those sorts of algorithms throughput is more important than latency.
What?
Basically there is a tension between needing high levels of parallelization and needing low latency; the two requirements almost never coincide. In other words, applications that "need" both massively parallel processing and low latency don't really exist. Wide SIMD support in CPUs is stupid, a development that should never have been taken this far; there are simply better ways to do massively parallel computation.
There are many types of parallelism. SIMD in the CPU is for parallelism on a smaller scale, intermixed with a lot of scalar logic; multithreading is for larger independent chunks of work; and GPUs are for even larger, computationally dense (but logic-light) chunks of work.