AMD saw client AI acceleration on the PC coming as far back as 2019, when generative AI hadn't yet made landfall and client applications for AI included background blur, noise cancellation, and the like. In the years that followed, AMD acquired Xilinx, an acquisition as important to the company as its mid-2000s acquisition of ATI. This gave AMD a wealth of Xilinx IP, including FPGAs and the compute elements behind the XDNA architecture driving its NPUs. With the new Ryzen AI 300 series processors, AMD is introducing the new XDNA 2 NPU architecture, as it attempts to create a 50 TOPS-class NPU that meets Microsoft's Copilot+ requirements within the confines of a 15 W TDP.
AMD now accelerates AI along the entire computing stack, from the cloud to the enterprise to the AI PC at the edge, which lets it serve both cloud-based and client-based markets.
AMD lists out the advantages of locally accelerated AI over cloud-accelerated AI, now backed by the 50 TOPS NPU on Ryzen AI 300 processors. The "North Star Experience" is what AMD thinks future AI models will require for the best experience, so we're not even close yet.
The Ryzen AI 300 Strix Point processor has AI acceleration on all three of its compute engines: the Zen 5 cores with their 512-bit FPUs accelerate AI-relevant ISA extensions such as AVX-512 and VNNI; the RDNA 3.5 iGPU has 1,024 stream processors and 32 AI accelerators of its own; and then there's the NPU.
What's interesting, though, is that Ryzen 9000 Granite Ridge lacks an NPU, because the processor reuses the cIOD of Ryzen 7000 Raphael, which doesn't have one. This could be a problem for AMD down the line, as competing Intel Arrow Lake desktop chips will have at least a 16 TOPS-class NPU, which isn't enough for Copilot+, but does cover Copilot.
XDNA 2 Architecture
The 50 TOPS XDNA 2 NPU of the Ryzen AI 300 can offer a whopping 35x performance-per-Watt uplift for an AI model over a CPU core, which is why Microsoft insists on the presence of an NPU to qualify for Copilot+; not even an iGPU with an 8x performance-per-Watt increase will do.
The XDNA 2 NPU of Ryzen AI 300 is the first to hit 50 TOPS. Intel isn't far behind with the NPU 4 of Lunar Lake peaking at 48 TOPS, and Apple's M4 sits at 38 TOPS.
AMD's XDNA architecture deploys memory-sensitive logic in a manner that is most optimal for AI deep-learning neural networks.
The original XDNA NPU on Phoenix and Hawk Point featured 20 AI engine tiles for 10 TOPS on Phoenix, and 16 TOPS on Hawk Point (thanks to increases in clock speeds). The XDNA 2 NPU on Strix Point increases the AIE tile count to 32, with a doubling in MACs per tile, a 60% increase in on-chip memory, and support for the new Block Float 16 (not to be confused with BFloat16). These changes result in a 5x AI inferencing performance increase over Phoenix, and a 2x performance/Watt increase.
Block FP16
Block Float 16 is a numerical data format used to represent numbers in a way that makes calculations faster and more efficient. Please note that Block Float 16 is NOT BFloat16, which we've been hearing about for a while, too.
Traditionally, floating point numbers are stored as either 64-bit (FP64, double), 32-bit (FP32, float), or 16-bit (FP16, half) values, but these are relatively expensive to store and process. Block Float 16 offers FP16-like precision, but instead of storing a separate exponent for each number (which sets the number's scale), it builds groups of eight numbers and uses a shared exponent for all of them. A (slightly simplified) example is storing the numbers 1,000,000, 2,000,000, 3,000,000, 4,000,000, 5,000,000, 6,000,000, 7,000,000, and 8,000,000: Block Float 16 stores them as 1, 2, 3, 4, 5, 6, 7, 8 plus "x1,000,000 for all numbers." This takes up considerably less memory, which not only reduces storage requirements, it also lowers the memory bandwidth required, a very important improvement, especially for NPU devices.
The exact memory requirement is 9 bits for each Block Float 16 value, an almost 50% improvement over FP16. Each value stores 8 bits of precision (aka significand, mantissa, or fraction), plus a shared 8-bit exponent for each group of eight numbers. Thus, a group of eight Block Float 16 values uses 8 (numbers) x 8 (bits per number) + 8 (bits shared exponent) = 72 bits, or 9 bits per value.
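To make the idea concrete, here is a minimal Python sketch of block floating-point encoding along the lines described above, with groups of eight values sharing one 8-bit exponent and an 8-bit signed mantissa per value. The group size, rounding behavior, and function names are illustrative assumptions, not AMD's exact XDNA 2 implementation.

# A minimal sketch of block floating-point encoding: eight values share one
# exponent, each keeps an 8-bit signed mantissa. Illustration only, not AMD's
# exact Block FP16 implementation.
import math

GROUP_SIZE = 8
MANTISSA_BITS = 8  # per-value storage (signed)

def encode_block(values):
    """Quantize a group of floats to 8-bit mantissas plus one shared exponent."""
    assert len(values) == GROUP_SIZE
    # Pick the exponent from the largest magnitude so every value fits in 8 bits.
    max_mag = max(abs(v) for v in values) or 1.0
    shared_exp = math.frexp(max_mag)[1]  # max_mag ~= m * 2**shared_exp, 0.5 <= m < 1
    scale = 2.0 ** (shared_exp - (MANTISSA_BITS - 1))
    mantissas = [max(-128, min(127, round(v / scale))) for v in values]
    return mantissas, shared_exp

def decode_block(mantissas, shared_exp):
    """Reconstruct approximate floats from the shared-exponent block."""
    scale = 2.0 ** (shared_exp - (MANTISSA_BITS - 1))
    return [m * scale for m in mantissas]

if __name__ == "__main__":
    group = [1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6]
    mants, exp = encode_block(group)
    print(mants, exp)                 # eight small integers plus one shared exponent
    print(decode_block(mants, exp))   # values close to the originals

For the example group above, the eight mantissas plus the shared exponent occupy 72 bits in total, and decoding reconstructs each value to within roughly 2% of the original.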
While alternate ways to store data are nothing new (the most well-known is probably INT8, which uses just eight bits to store a number), most of them require that your model be quantized for that specific data type. Quantization is a computationally intensive process that everyone wants to avoid. Especially for smaller vendors of AI hardware, like AMD, it won't be easy to convince software developers to optimize their models for specific hardware; the reality is that everyone will optimize for NVIDIA and probably not care about the rest. Block Float 16 solves this by letting you use models that are optimized for FP16, without any quantization, and the results will still be very good, virtually indistinguishable from native FP16, at almost twice the performance.
Ryzen AI Software
Any AI acceleration hardware is only as good as its software stack. AMD knows this, which is why it has started its journey toward becoming a software-first company; learn all about it here. AMD supports all prevalent models and algorithms, optimization and quantization tools, and execution through the ONNX Runtime.
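As a rough sketch of what execution through the ONNX Runtime can look like in practice, the snippet below loads an FP16-optimized ONNX model and asks the runtime to prefer the NPU back end, with a CPU fallback. The provider name "VitisAIExecutionProvider", the model file name, and the input shape are assumptions for illustration; consult AMD's Ryzen AI software documentation for the exact setup on a given system.

# Sketch of running an ONNX model with the NPU back end preferred, assuming the
# Ryzen AI stack exposes it as "VitisAIExecutionProvider" (name assumed here).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_fp16.onnx",  # hypothetical FP16-optimized model file
    providers=[
        "VitisAIExecutionProvider",  # NPU path (assumed provider name)
        "CPUExecutionProvider",      # fallback for unsupported operators
    ],
)

# Feed a dummy input matching the model's expected shape (assumed 1x3x224x224, float32).
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)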