Thursday, October 19th 2017
Intel "Cannon Lake" Could Bring AVX-512 Instruction-set to the Mainstream
Intel's next-generation "Cannon Lake" CPU micro-architecture could see the introduction of the AVX-512 instruction-set to the mainstream segments (MSDT or mainstream-desktop, and mobile). It is currently available on the company's Core X "Skylake-X" HEDT processors, and on the company's Xeon "Skylake-W," Xeon Scalable "Skylake-SP," and in a limited form on the Xeon Phi Knights Landing and Knights Mill scalar compute chips.
The upcoming "Cannon Lake" mainstream silicon will feature AVX512F, AVX512CD, AVX512DQ, AVX512BW, and AVX512VL instructions, and will support AVX512_IFMA and AVX512_VBMI commands, making it a slightly broader implementation of AVX-512 than the "Skylake-SP" silicon. The new AVX-512 will vastly improve performance of compute-intensive applications that take advantage of it. It will also be a key component of future security standards.
Source:
Anandtech
52 Comments on Intel "Cannon Lake" Could Bring AVX-512 Instruction-set to the Mainstream
But the problem is, if they have already implemented this on Skylake-SP, why not Coffee Lake or even Kaby Lake? And why did they change their mind with Cannon Lake? From a timeline perspective, it seems doubtful.
Intel needs to solve that before I'll get excited about AVX-512 (as let's be honest, it's only generally useful for the 1% of stuff I can't send to the GPU in the first place).
While the main x86 operations work on 64-bit values, SSE registers are 128-bit wide, AVX 256-bit, and AVX-512 512-bit. The wider registers don't increase range or precision; they hold multiple values that get processed in parallel (SIMD), so one instruction can operate on, say, sixteen 32-bit floats at once.
Usefulness of these extensions relies heavily on both compilers and software being aware of them and actually using them. On the other hand, these operations are useful for some types of software (typically workloads doing lots of number crunching) and less useful for others.
SSE dates back to the Pentium III era.
AVX has been there since Sandy Bridge (and Bulldozer).
AVX2 since Haswell (and Excavator).
Wiki is actually pretty good on the topic:
en.wikipedia.org/wiki/Advanced_Vector_Extensions
en.wikipedia.org/wiki/Streaming_SIMD_Extensions
If I look at local shops, I could go and buy i7 8700 or i5 8400 right now for what seem to be MSRP prices. i7 8700K is not in stock but that is not very surprising.
Things like AVX-512 will probably be a near requirement for Xbox 360/PS3 emulation, if we ever get there.
The real problem is it will take years before consumer software will utilize it.
As with other CPU instructions, the software has to be compiled to use this feature. In some cases compilers can automatically vectorize certain structures (this requires compiler flags), but usually the programmer has to apply specific intrinsics, which are basically macros mapping almost directly to assembly. A vector unit is able to process multiple pieces of data at once, e.g. an AVX-512 unit can process up to 1×512-bit, 2×256-bit, 4×128-bit, 8×64-bit, 16×32-bit, etc. operations per clock. Each CPU core may contain multiple AVX and FMA units on different execution ports, and some only do e.g. multiplication. If you need AVX-512…
Also, Ice Lake will be the next desktop architecture.
Core configurations are usually decided during tapeout, and clocks and model names closer to launch. Even Intel doesn't know yet what the models will look like.
And I like the socket names; old socket +10 :p
-----
The source from Anandtech is actually quite an interesting read. It also provides some early indications on what Ice Lake will bring, both in terms of new AVX features and other instructions.
What I find most interesting is the "Fast Short REP MOV". Those of you with assembly experience know a CPU spends a lot of cycles not only moving data from memory to CPU registers, but also shuffling data between registers to be able to execute the next ALU or FPU instruction. A single ALU/FPU operation may require up to 3-4 MOV operations. It may seem very wasteful to spend clock cycles just moving a few bits around instead of spending them calculating stuff, so anything which reduces these "wasteful" operations will help throughput without increasing computational resources.
Additionally, Cannon Lake will add support for SHA-NI, which brings hardware acceleration of SHA-1 and SHA-256. This will certainly speed those algorithms up considerably, but I'm a firm believer that algorithm-specific instructions don't belong in a general-purpose CPU. Whether it's algorithms for cryptography or compression, these algorithms keep evolving, quickly making the acceleration outdated. MD5 and SHA-1 are already outdated in cryptography, so these instructions are surely added just to show gains in some specific benchmarks for enterprise customers. For general-purpose use, this acceleration is mostly a waste of die space and energy consumption. How much of your CPU time is really spent on AES, SHA, MD5, etc.? Probably less than 1%, unless you run some kind of web server, which is why I believe these features belong in specialized processors for such workloads. Back in the 80s, Intel made specialized co-processors for math (8087, etc.); I think they should have used this approach for special enterprise features.