Intel NPU 4 Makes Lunar Lake Copilot+ Ready
This is Intel's second processor generation to feature a neural processing unit (NPU), a component that efficiently accelerates AI. However, Intel refers to the one in Lunar Lake as NPU 4, because it counts its pre-NPU AI acceleration hardware, such as the GFNI, AVX-512, and VNNI instruction sets, as primordial NPUs, even though none of today's NPU-specific workloads can run on them.
The big story here is that the NPU 4 powering Lunar Lake has four times the AI inferencing performance of the NPU 3 found in Meteor Lake, leaping from 12 TOPS on Meteor Lake to 48 TOPS on Lunar Lake. This exceeds the 40 TOPS requirement set by Microsoft to accelerate local Copilot+ sessions and qualify for the Copilot+ AI PC certification.
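The generational math above can be checked in a couple of lines, using only the figures stated in the article:

```python
# Figures as reported: NPU TOPS per generation and Microsoft's
# Copilot+ threshold. Variable names are illustrative, not official.
METEOR_LAKE_NPU_TOPS = 12   # NPU 3
LUNAR_LAKE_NPU_TOPS = 48    # NPU 4
COPILOT_PLUS_MIN_TOPS = 40  # Microsoft's Copilot+ requirement

gain = LUNAR_LAKE_NPU_TOPS / METEOR_LAKE_NPU_TOPS
print(f"Generational gain: {gain:.0f}x")                        # 4x
print(LUNAR_LAKE_NPU_TOPS >= COPILOT_PLUS_MIN_TOPS)             # True
```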
This scaling in AI inferencing performance comes not just from architectural improvements (which also reduce the NPU's power footprint), but also from increasing the NCE (neural compute engine) count from 2 on NPU 3 (Meteor Lake) to 6 on NPU 4 (Lunar Lake), with proportionate increases in scratchpad RAM, DMA bandwidth, and L2 cache.
The NPU 4 multiply-accumulate (MAC) array, which handles matrix multiplication and convolution, supports the INT8 and FP16 data types at 2048 MAC/cycle for INT8, or 1024 MAC/cycle for FP16. Intel claims a doubling in performance/Watt over NPU 3, thanks to improved activation functions and data conversion, upgrades to the SHAVE DSP, and a doubling of the DMA engine's bandwidth.
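A back-of-envelope sketch shows how the MAC figures could add up to 48 TOPS. This assumes the 2048 INT8 MAC/cycle figure is per NCE, that each MAC counts as two operations (a multiply plus an accumulate), and an NPU clock of roughly 1.95 GHz — the clock is an illustrative value chosen to land on 48 TOPS, not an official Intel figure:

```python
# Hypothetical TOPS estimate for NPU 4 (Lunar Lake).
# Assumptions: 2048 INT8 MACs/cycle is per NCE; 1 MAC = 2 ops;
# ~1.95 GHz is an assumed clock, not disclosed by Intel.
NCE_COUNT = 6
INT8_MACS_PER_CYCLE_PER_NCE = 2048
OPS_PER_MAC = 2          # multiply + accumulate
CLOCK_HZ = 1.95e9        # assumed

tops = (NCE_COUNT * INT8_MACS_PER_CYCLE_PER_NCE
        * OPS_PER_MAC * CLOCK_HZ) / 1e12
print(f"Estimated INT8 throughput: {tops:.1f} TOPS")  # ~47.9 TOPS
```

The same formula applied to NPU 3 (2 NCEs) at a similar per-engine width illustrates why tripling the engine count alone does not reach 4x; the remaining gain has to come from clocks and per-engine improvements.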
All told, NPU 4 delivers 12 times the raw vector performance of NPU 3, 4 times the AI TOPS, and twice the bandwidth between the NPU and the fabric.