
AMD Zen 5 Technical Deep Dive


Machine Learning / AI


AMD saw client AI acceleration on the PC coming as far back as 2019, back when generative AI hadn't made landfall and client applications for AI amounted to background blur, noise cancellation, and the like. In the years that followed, AMD acquired Xilinx, an acquisition as important to the company as its mid-2000s acquisition of ATI. This gave AMD a wealth of Xilinx IP, including FPGAs and the compute elements behind the XDNA architecture driving its NPUs. With the Ryzen AI 300 series processors, AMD is introducing the XDNA 2 NPU architecture, as it attempts to build a 50 TOPS-class NPU that meets Microsoft's Copilot+ requirements within the confines of its 15 W TDP.


AMD now accelerates AI across the entire computing stack, from the cloud, through the enterprise, down to the AI PC at the edge, which lets it serve both the cloud-based and the client-based markets.


AMD lists the advantages of locally accelerated AI over cloud-accelerated AI, now served by the 50 TOPS NPU on Ryzen AI 300 processors. The "North Star Experience" is what the company thinks future AI models will require for the best experiences, so we're not even close yet.


The Ryzen AI 300 Strix Point processor has AI acceleration on all three of its compute engines: the Zen 5 cores with their 512-bit FPU accelerate AI-relevant ISA extensions such as AVX-512 and VNNI; the RDNA 3.5 iGPU has 1,024 stream processors and 32 AI accelerators of its own; and then there's the NPU.

What's interesting, though, is that Ryzen 9000 Granite Ridge lacks an NPU, because it reuses the client I/O die (cIOD) of Ryzen 7000 Raphael, which has none. This could be a problem for AMD down the line, as competing Intel Arrow Lake desktop chips will have at least a 16 TOPS-class NPU, which is not enough for Copilot+, but enough for basic Copilot acceleration.

XDNA 2 Architecture


The 50 TOPS XDNA 2 NPU of the Ryzen AI 300 can offer a whopping 35x performance-per-watt advantage over a CPU core when running an AI model, which is why Microsoft insists on the presence of an NPU to qualify for Copilot+; not even an iGPU, with its 8x performance-per-watt advantage, will do.


The XDNA 2 NPU of Ryzen AI 300 is the first to hit 50 TOPS. Intel isn't far behind with the NPU 4 of Lunar Lake peaking at 48 TOPS, and Apple's M4 sits at 38 TOPS.


AMD's XDNA architecture deploys its memory-sensitive logic in a manner optimal for AI deep-learning neural networks.


The original XDNA NPU on Phoenix and Hawk Point featured 20 AI engine tiles for 10 TOPS on Phoenix, and 16 TOPS on Hawk Point (thanks to increases in clock speeds). The XDNA 2 NPU on Strix Point increases the AIE tile count to 32, with a doubling in MACs per tile, a 60% increase in on-chip memory, and support for the new Block Float 16 (not to be confused with BFloat16). These changes result in a 5x AI inferencing performance increase over Phoenix, and a 2x performance/Watt increase.

Block FP16


Block Float 16 is a numerical data format used to represent numbers in a way that makes calculations faster and more efficient. Please note that Block Float 16 is NOT BFloat16, which we've been hearing about for a while, too.


Traditionally, floating point numbers are stored as either 64-bit (FP64, double), 32-bit (FP32, float) or 16-bit (FP16, half), but these are relatively slow. Block Float 16 offers FP16-like precision, but instead of storing an exponent for each number (roughly, its order of magnitude), it builds groups of eight numbers and uses a shared exponent for all of them. A (slightly simplified) example is storing the numbers 1,000,000, 2,000,000, 3,000,000, 4,000,000, 5,000,000, 6,000,000, 7,000,000 and 8,000,000. Block Float 16 stores them as 1, 2, 3, 4, 5, 6, 7, 8 and "×1,000,000 for all numbers." This takes up considerably less memory, which not only reduces storage requirements, but also lowers the memory bandwidth required—a very important improvement, especially for NPU devices.

The exact memory requirement is 9 bits for each Block Float 16 value, an almost 50% improvement over FP16. Each value stores 8 bits of precision (aka significand, mantissa or fraction), plus a shared 8-bit exponent for each group of eight numbers. Thus, a group of eight Block Float 16 values uses 8 (numbers) x 8 (bits per number) + 8 (bits of shared exponent) = 72 bits, or 9 bits per Block Float 16 value.
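To make the shared-exponent idea concrete, here is a minimal Python sketch of a Block Float 16-style encoder and decoder. It is an illustrative model only, not AMD's actual XDNA 2 data path: the sign-plus-7-bit mantissa layout, the rounding, and the clamping are assumptions chosen to match the 8-bits-per-value, eight-values-per-block description above.

# Minimal sketch of Block Float 16 (BFP16) encoding, for illustration only.
# Assumed layout (not confirmed AMD internals): each block holds eight values,
# each value keeps an 8-bit signed mantissa (sign + 7 magnitude bits), and the
# whole block shares one 8-bit exponent, i.e. 8 x 8 + 8 = 72 bits per block.

import math

BLOCK_SIZE = 8        # values per block sharing one exponent
MANTISSA_BITS = 8     # per-value precision, including the sign bit

def encode_block(values):
    """Quantize eight floats into integer mantissas plus one shared exponent."""
    assert len(values) == BLOCK_SIZE
    max_abs = max(abs(v) for v in values)
    # Pick the shared exponent so the largest magnitude fits into 7 bits.
    shared_exp = math.frexp(max_abs)[1] - (MANTISSA_BITS - 1)
    scale = 2.0 ** shared_exp
    mantissas = [max(-127, min(127, round(v / scale))) for v in values]
    return mantissas, shared_exp

def decode_block(mantissas, shared_exp):
    """Reconstruct approximate floats from the mantissas and shared exponent."""
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]

# The example from the text: eight values with a similar order of magnitude.
block = [1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6]
mantissas, exp = encode_block(block)
print(mantissas, exp)                  # small integers plus one shared exponent
print(decode_block(mantissas, exp))    # close to the original values
print((BLOCK_SIZE * MANTISSA_BITS + 8) / BLOCK_SIZE, "bits per value")  # 9.0

In this toy scheme 1,000,000 decodes to roughly 983,000; small rounding errors of that kind are the trade-off inference workloads accept in exchange for the memory and bandwidth savings.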


While alternate ways to store data are nothing new (the most well-known is probably INT8, which uses just eight bits to store a number), most of them require that your model be quantized for that specific data type. Quantization is a computationally intensive process that everyone wants to avoid. Especially for smaller vendors of AI hardware, like AMD, it won't be easy to convince software developers to optimize their models for specific hardware; the reality is that everyone will optimize for NVIDIA and probably not care about the rest. Block Float 16 solves this by letting you use models that are optimized for FP16, without any quantization, and the results will still be very good, virtually indistinguishable from native FP16, but at almost twice the performance.

Ryzen AI Software



Any AI acceleration hardware is only as good as its software stack. AMD knows this, which is why it has started its journey toward becoming a software-first company; learn all about it here. AMD supports all prevalent models and algorithms, along with optimization and quantization tools, and handles execution through the ONNX Runtime.
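As a rough illustration of what that execution path looks like from an application's point of view, the sketch below runs an ONNX model through ONNX Runtime and asks for the NPU first, falling back to the CPU. The "VitisAIExecutionProvider" name, the model file, and the input shape are assumptions for illustration; the exact provider options come from AMD's Ryzen AI documentation.

# Minimal sketch: dispatching an ONNX model through ONNX Runtime.
# Assumes the Ryzen AI software stack is installed and exposes a Vitis AI
# execution provider; the model path and input shape are placeholders.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",                          # hypothetical, pre-exported model
    providers=[
        "VitisAIExecutionProvider",        # NPU path (assumed provider name)
        "CPUExecutionProvider",            # fallback if the NPU is unavailable
    ],
)

# Run one inference with dummy data shaped like a typical image-classifier input.
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)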