Tuesday, October 29th 2024
Social Media Imagines AMD "Navi 48" RDNA 4 to be a Dual-Chiplet GPU
A user on the Chinese tech forum ChipHell who goes by zcjzcj11111 has put forward a fascinating take on what the next-generation AMD "Navi 48" GPU could be, and turned their imagination into a render. Apparently, the "Navi 48," which powers AMD's series-topping performance-segment graphics card, is a dual-chiplet design, similar to the company's latest Instinct MI300 series AI GPUs. This won't be a disaggregated GPU such as the "Navi 31" and "Navi 32," but rather a scale-out multi-chip module of two GPU dies that could otherwise run on their own in single-die packages. Want to call this a multi-GPU-on-a-stick? Go ahead, but there are a couple of key differences.
On AMD's Instinct AI GPUs, the chiplets have full cache coherence with each other and can address memory controlled by one another. This cache coherence makes the chiplets work like one giant chip. In a multi-GPU-on-a-stick, there would be no cache coherence; the two dies would be mapped by the host machine as two separate devices, and you would then be at the mercy of implicit or explicit multi-GPU technologies for performance to scale. That isn't what happens on the AI GPUs: despite multiple chiplets, the GPU is seen by the host as a single PCI device, with all of its cache and memory visible to software as a contiguously addressable block.
We imagine the "Navi 48" is modeled along the same lines as the company's AI GPUs, with the graphics driver seeing the package as a single GPU. For this to work, the two chiplets are probably connected by Infinity Fabric Fanout links, an interconnect with far higher bandwidth than a serial bus such as PCIe, which is likely needed for the cache coherence to be effective. The "Navi 44" is probably just one of these chiplets sitting in its own package.
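To make the distinction concrete, here is a minimal host-side sketch of what the two models look like from software. The CUDA runtime API is used purely for illustration (AMD's ROCm/HIP stack has equivalent calls): a coherent package like the one described above would enumerate as a single device with one memory pool, while two independent dies would show up as two devices the application has to manage itself.

```cuda
// Illustrative sketch only: how a host application sees the GPU(s).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("Devices visible to the runtime: %d\n", deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        // One coherent package: a single entry with all VRAM in one pool.
        // Two independent dies: two entries, each with its own pool, and
        // the application must split work and copy data between them.
        printf("Device %d: %s, %zu MiB VRAM\n",
               i, prop.name, prop.totalGlobalMem >> 20);
    }
    return 0;
}
```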
In the render, the substrate and package are made to resemble those of the "Navi 32," which agrees with the theory that "Navi 48" will be a performance-segment GPU and a successor to the "Navi 32," "Navi 22," and "Navi 10," rather than a successor to enthusiast-segment GPUs like the "Navi 21" and "Navi 31." AMD made this much clear in its recent interviews with the media.
Do we think the ChipHell rumor is plausible? Absolutely, considering that nobody took the very first such renders of the AM5 package with its oddly-shaped IHS seriously either. A chiplet-based "Navi 48" would be in character for a company like AMD, which loves chiplets, MCMs, and disaggregated devices.
Sources:
ChipHell Forums, HXL (Twitter)
59 Comments on Social Media Imagines AMD "Navi 48" RDNA 4 to be a Dual-Chiplet GPU
At least, we will know soon.
www.pcgamesn.com/amd/rdna-4-2025
10.5 TB/s read - 5.3 TB/s write - 10.0 TB/s copy.
buildapc/comments/15ury94 Bandwidth is essential. Otherwise, the chiplets won't work as expected and will fail because of low performance.
Learn about inter-GPU bandwidths.
github.com/te42kyfo/gpu-benches
Memory access patterns on GPUs are almost always contiguous: each core reads/writes its own separate chunk of VRAM, so if you break a monolithic die up into chiplets, the memory bandwidth requirements stay the same. GPU threads do not communicate with each other the way CPU cores do; they don't even have the hardware for complex synchronization beyond simple barriers, and you can't synchronize threads globally. It's simply not how these things are designed. You can really only communicate between threads on the same GPU core, and that core will always access the same chunk of memory its memory controller is attached to. GPU cores on different chiplets would not need to access VRAM that is only reachable through a different chiplet. You're the one who needs to read more on GPU architectures.
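To illustrate that point, here is a minimal kernel sketch (CUDA syntax used for illustration; the buffer names and sizes are made up): each block works on its own contiguous slice of the buffer, and the only barrier available is the block-local __syncthreads(); there is no grid-wide barrier inside a kernel launch.

```cuda
#include <cuda_runtime.h>

// Each block scales its own contiguous slice of the buffer. The only
// synchronization is __syncthreads(), which is local to one block;
// blocks (and, by extension, cores on different chiplets) never wait
// on each other inside the kernel.
__global__ void scale_slice(float *data, int per_block, float factor) {
    float *slice = data + (size_t)blockIdx.x * per_block; // slice owned by this block

    for (int i = threadIdx.x; i < per_block; i += blockDim.x)
        slice[i] *= factor;

    __syncthreads(); // barrier among this block's threads only
}

int main() {
    const int blocks = 8, per_block = 1 << 20;   // made-up sizes
    float *data = nullptr;
    cudaMalloc(&data, (size_t)blocks * per_block * sizeof(float));

    scale_slice<<<blocks, 256>>>(data, per_block, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```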
CPUs with chiplets don't need more memory bandwidth either, so this makes no sense. CPUs are a different matter anyway: there is usually a lot of inter-thread communication, which does pose a problem for threads on different cores, but it's more a matter of latency than bandwidth.
Btw, SLI worked in a totally different manner and is completely irrelevant to this subject. Each GPU stored a copy of what was contained in VRAM, so every time the frame buffers were updated, the data had to be copied between the cards.
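For context, this is roughly what that duplication looks like from the API side, as a hedged sketch (CUDA peer copies used purely for illustration; the buffer size is made up): with two separate devices, every update to data both GPUs need has to be pushed across explicitly, which a single coherent package would not require.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("needs two GPUs\n"); return 0; }

    const size_t bytes = 64 << 20;   // made-up 64 MiB buffer
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);        // working copy on GPU 0

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);        // mirror on GPU 1

    // Explicit device-to-device copy every time the shared data changes.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```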
1. Navi 31 failed, in the same way that CrossFire failed. Higher bandwidth means lower latency.
So, now explain how cutting a monolithic chip into partitions improves latency?