I think the group of people needing AVX-512 is quite a bit smaller than those needing extra cores. The only people I think would benefit from it would be those who compile their own software (ex. Gentoo users), because it will be years before most software has it compiled in for you.
That depends on the timeframe.
In the short term, only custom software will use AVX-512. Within a couple of years professional production software will gradually start using it, and then those users will demand it.
When it comes to compiling software manually: the scope of SIMD optimizations in compilers is very limited; they are basically restricted to specific patterns. So if you want gains from AVX, you'll have to redesign the software and use low-level intrinsics. I know Intel recently submitted "AVX1" optimizations for glibc, so you can get some small gains there, but software in general can't be automatically optimized to use AVX.
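To give an idea of what "using low-level intrinsics" means in practice, here is a minimal sketch (hypothetical function names, and it assumes n is a multiple of 8 just to keep it short):

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar version: whether the compiler vectorizes this depends on flags,
// aliasing information and the instruction set baseline it may assume.
void add_arrays(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// Hand-written AVX version with intrinsics: 8 floats per operation.
// (Assumes n is a multiple of 8 to keep the sketch short.)
void add_arrays_avx(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(dst + i);
        __m256 b = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(a, b));
    }
}
```

For anything more complicated than this, the compiler rarely finds the vector form on its own, which is why the gains end up requiring a redesign.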
When it comes to extra cores, see below.
...and the benefit is very application specific. It would really only apply to software doing a lot of floating point math, and even then only math that doesn't have a dependency on earlier calculations, so instruction-level parallelism and its constraints limit how much benefit AVX can provide. Basically, the only workload that would benefit from this is high-volume floating point math designed to stream values through several independent floating point ops in a row. As I said earlier, the people who care about this are likely doing machine learning or statistical analysis if they're not already using a GPU.
AVX is cute because...
It's funny that you compare AVX with GPUs just because both use some kind of SIMD. The magnitude of latency is completely different between the options. Converting data between normal CPU registers and vector registers has a cost, but synchronizing threads is still 10,000-100,000× more costly. And communicating between CPU threads and the GPU is easily 10-100× more costly than that.
Three of the ways modern (x86) CPUs achieve parallelism are the following:
1 - Superscalar processing
Each core has multiple ALUs and/or FPUs, and is able to execute multiple (independent) instructions on these per clock. Superscalar processing is essential for high IPC, but performance scaling is dependent on the quality of the software and the CPU's own ability to do out-of-order execution, branch prediction, cache prefetching and dependency analysis.
Superscalar scaling will be one of the primary sources of IPC gains going forward, but it only scales well when code is cache/data optimized. Provided the software is fairly well written, it will continue to scale on increasingly superscalar processors, and the more instruction-level parallelism the code allows, the more existing software will scale without any recompilation or redesign.
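A small sketch of what exposing instruction-level parallelism looks like (hypothetical summation functions, nothing from a real codebase):

```cpp
#include <cstddef>

// One accumulator: every add waits for the previous one, so even a wide
// out-of-order core executes this as a single serial dependency chain.
float sum_serial(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

// Four independent accumulators: the adds within an iteration don't depend
// on each other, so a superscalar core can keep several in flight per clock.
// (Assumes n is a multiple of 4 to keep the sketch short.)
float sum_ilp(const float* x, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The second version is the kind of code that keeps getting faster as cores grow wider, without any recompilation.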
2 - SIMD
SIMD/vector instructions have existed in many forms for x86, ranging from MMX and SSE to AVX and more. Executing operations with SIMD costs much less than more cores or GPUs in terms of transistors, energy consumption and latency. AVX (and similar) can easily be applied with minimal cost (provided the code is well structured); you keep most of the program in normal C/C++ code, and within a few loops or whole functions you process a stream of data with SIMD at up to ~50× speedup, with very low overhead and no issues mixing normal instructions and vector instructions.
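As a rough sketch of the kind of loop that streams well through SIMD versus one that doesn't (hypothetical functions over plain float arrays):

```cpp
#include <cstddef>

// SIMD-friendly: every element is independent, so the loop can be turned
// into 8-wide AVX operations, either by the compiler or by hand.
void scale_and_add(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// SIMD-hostile: each iteration depends on the previous result, so the work
// is one long serial chain no matter how wide the vector units are.
float smooth(const float* x, float state, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        state = 0.9f * state + 0.1f * x[i];
    return state;
}
```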
GPUs, on the other hand, will perform well when you have a workload which is "separate" from your CPU code and only requires minimal interaction.
And SIMD is not a "machine learning" thing; it's essential for most professional workloads, like graphics, video, CAD, simulation, etc.
3 - Multithreading
Everyone talks a lot about multithreading, but few know how it actually works. Multithreading only works well when there is minimal interaction between threads. It's easy to scale linearly with multithreading when you have n chunks of independent work you can run on n threads. But contrary to superscalar execution and SIMD, which work on a local scope, multithreading only works well on a larger scope. If your workload has dependencies so it can't be split into large independent chunks, then you'll be left with threads wasting a lot of time on synchronization overhead and hazards. This is why many such workloads see diminishing returns with more threads, and the overall performance will eventually reach a plateau, even if you keep adding more cores.
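A minimal sketch of the easy case, n independent chunks on n threads (process_chunk is a hypothetical per-chunk worker):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Independent chunks, one thread per chunk, no shared state while the work
// runs. The only synchronization point is the final join, so this scales
// close to linearly with core count.
void run_chunks(std::vector<std::vector<float>>& chunks,
                void (*process_chunk)(std::vector<float>&)) {
    std::vector<std::thread> workers;
    workers.reserve(chunks.size());
    for (auto& chunk : chunks)
        workers.emplace_back(process_chunk, std::ref(chunk));
    for (auto& worker : workers)
        worker.join();
}
```

The moment the chunks have to talk to each other mid-flight, you add locks or atomics into that loop, and the clean scaling disappears.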
It's important to understand that many workloads simply can't scale linearly with thread count. The misconception that they can has existed ever since we all got cheap dual-core Athlon 64s and people were disappointed they didn't double their performance. Not everything can scale that way, even with "infinite" developer resources.
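For a rough sense of why the plateau shows up, Amdahl's law is the usual back-of-the-envelope: if a fraction p of the work parallelizes and the rest stays serial, the best-case speedup on N threads is 1 / ((1 - p) + p / N). Even with p = 0.9, infinite cores top out at 10×.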
-----
So which type of parallelism is best, you might ask? It depends on the workload, and since most desktop workloads need all of the above, we need to continue scaling on all three areas.