Tuesday, October 29th 2024
Social Media Imagines AMD "Navi 48" RDNA 4 to be a Dual-Chiplet GPU
A user on the Chinese tech forum ChipHell who goes by zcjzcj11111 has posted a fascinating take on what the next-generation AMD "Navi 48" GPU could be, and put their imagination into a render. Apparently, the "Navi 48," which powers AMD's series-topping performance-segment graphics card, is a dual-chiplet design, similar to the company's latest Instinct MI300 series AI GPUs. This won't be a disaggregated GPU such as the "Navi 31" and "Navi 32," but rather a scale-out multi-chip module of two GPU dies that could otherwise run on their own in single-die packages. Want to call this a multi-GPU-on-a-stick? Go ahead, but there are a couple of important differences.
On AMD's Instinct AI GPUs, the chiplets have full cache coherence with each other, and can address memory controlled by each other. This cache coherence makes the chiplets work like one giant chip. In a multi-GPU-on-a-stick, there would be no cache coherence; the two dies would be mapped by the host machine as two separate devices, and you'd then be at the mercy of implicit or explicit multi-GPU technologies for performance to scale. This isn't what happens on AI GPUs—despite multiple chiplets, the GPU is seen by the host as a single PCI device, with all its cache and memory visible to software as a contiguously addressable block.
We imagine the "Navi 48" is modeled along the same lines as the company's AI GPUs. The graphics driver sees this package as a single GPU. For this to work, the two chiplets are probably connected by Infinity Fabric Fanout links—an interconnect with far more bandwidth than a serial bus like PCIe. This is probably needed for the cache coherence to be effective. The "Navi 44" is probably just one of these chiplets sitting in its own package.
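To picture the difference in software terms, here is a minimal, purely illustrative Vulkan sketch (not based on any actual "Navi 48" driver code): a cache-coherent dual-chiplet package would still enumerate as a single physical device, whereas a classic multi-GPU-on-a-stick shows up as two.

```cpp
// Illustrative sketch: count the GPUs the driver exposes to applications.
// A coherent MCM like the rumored "Navi 48" would appear as ONE device here;
// a multi-GPU-on-a-stick would appear as two.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkInstanceCreateInfo ci{};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("GPU: %s\n", props.deviceName);  // one entry per device the driver exposes
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```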
In the render, the substrate and package are made to resemble those of the "Navi 32," which agrees with the theory that "Navi 48" will be a performance-segment GPU and a successor to the "Navi 32," "Navi 22," and "Navi 10," rather than a successor to enthusiast-segment GPUs like the "Navi 21" and "Navi 31." This much was made clear by AMD in its recent interviews with the media.
Do we think the ChipHell rumor is plausible? Absolutely, considering nobody took the very first renders of the AM5 package with its oddly-shaped IHS seriously, either. A chiplet-based "Navi 48" would be entirely in character for a company like AMD, which loves chiplets, MCMs, and disaggregated devices.
Sources:
ChipHell Forums, HXL (Twitter)
59 Comments on Social Media Imagines AMD "Navi 48" RDNA 4 to be a Dual-Chiplet GPU
The issue with CrossFire and SLI is that both needed to connect over a bus, so step one was to figure out which GPU was doing what. You then lost time in communication and populating data to the correct memory space... and it all leaves a sour taste in the mouth when two mid-level GPUs scale like 1 + 1 = 1.5, whereas you could instead buy a GPU two levels higher and get 1.8x the performance instead of 1.5x... and things worked without all of the driver shenanigans. That, for the record, is why CrossFire and SLI died. It'd be nice to see that come back... but with 3060 GPUs still selling for almost $300 on eBay, I cannot see the benefit to pursuing it.
So we are clear, my optimism lies with AMD pumping out a dual-chip version that works in the middle segment and blows monolithic silicon out of the water. If AMD is rational, they'll give most of that cost saving back to the customer and sell to the huge consumer base that wants more power without having to take out a mortgage to get it. Good QHD gameplay and 120 Hz+ 1080p performance, with compromise 4K performance, at the $500 mark would be something for everyone to love without really getting into a fight with Nvidia. Not sure where Intel comes out in all of this, but Battlemage has to be less bumpy than Arc. Hopefully they can also release something that makes the gigantic and underserved middle market happy... and gives the Ngreedia side of Nvidia a black eye.
All of this is fluff anyways...so why not indulge in a fantasy?
The closest we got was SLI/Crossfire and that was a bunch of driver magic from both sides. SLI/Crossfire died due to new incoming rendering methods that made the whole thing expensive to maintain. Plus, incoming DX12 and Vulkan with their own ways to handle multi-GPU - the implicit and explicit mentioned in the article. Which basically no game developers tried to properly implement. The ingenious bit was figuring out which parts of a GPU can be separated. The problem always has been and still is that splitting up the compute array is not doable, at least has not been so far. It has been 15+ years since AMD first publicly said they are trying to go for that. Nvidia has been rumored to look into the same thing for 10+ years as well. Both AMD and Nvidia occasionally publish papers about how to split a GPU but the short version of conclusions has been that you really can't.
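For reference, this is roughly what the "explicit" path looks like in Vulkan 1.1's device groups - a small, illustrative sketch of the API surface game developers would have had to build on, not code from any actual title:

```cpp
// Illustrative sketch: Vulkan 1.1 exposes explicit multi-GPU through device groups.
// The application must opt in and distribute work itself, which is the effort most
// game developers never took on.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // Request Vulkan 1.1 so the device-group (explicit multi-GPU) API is available.
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ci{};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t groupCount = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);
    std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
    for (auto& g : groups) g.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
    vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

    // Each group lists GPUs that can be linked into one logical device; the application
    // still has to split rendering across them itself (alternate-frame, split-frame, etc.).
    for (uint32_t i = 0; i < groupCount; ++i)
        std::printf("device group %u: %u physical device(s)\n",
                    i, groups[i].physicalDeviceCount);

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```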
Again, the context here is a gaming GPU - something that is very latency-sensitive. That also directly brings out the bandwidth problem: a couple of orders of magnitude higher than what they did on CPUs.
The dedicated connector had a purpose - it was dedicated and under the direct control of the GPUs. It was not about bandwidth but more about latency and guaranteed availability. Remember, PCIe does not guarantee that a GPU is able to send stuff to the other one over it quickly enough. Plus, in some situations the PCIe interface of the GPU could be busy with something else - reading stuff from RAM, for example, textures or whatnot. The consideration was whether it was worth doing a separate connector for that, and it did seem to be a benefit for a long while. I guess in the end PCIe simply got fast enough :)
Nice in theory, but interconnects are not magic. An interconnect good enough to actually replace an internal connection in a GPU - say, the connections to the shader engines the video was talking about - does come with a cost. And that cost is probably power, given the bandwidth requirements - a lot of power. Unless there is some new breakthrough that makes everything simple, this makes no sense to do in the mid segment. The added complexity and overhead starts justifying itself when die sizes get very large - in practice, when dies are nearing either the reticle limit or yield limits. And this is not a cheap solution for a GPU.
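A rough, hedged back-of-the-envelope on that power point - every number below is an assumption picked for illustration, not a leaked spec:

```cpp
// If a coherent die-to-die link has to carry traffic on the order of local memory
// bandwidth, the interconnect's power cost alone becomes visible in a gaming budget.
#include <cstdio>

int main() {
    const double mem_bw_GBps     = 640.0;  // e.g. 256-bit GDDR6 at 20 Gbps/pin (assumed)
    const double link_pJ_per_bit = 1.5;    // assumed energy per bit for an organic-substrate fanout link

    const double bits_per_s = mem_bw_GBps * 1e9 * 8.0;
    const double link_watts = bits_per_s * link_pJ_per_bit * 1e-12;

    std::printf("Moving %.0f GB/s across the die-to-die link costs roughly %.1f W\n",
                mem_bw_GBps, link_watts);  // ~7.7 W with these assumed numbers
    return 0;
}
```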
The savings might not be quite as significant as you imagine. The RX 7000 series is a chiplet design, and while cheaper than Nvidia, pricing ended up basically where it has always been relative to the competition. These cards did not end up dominating the generation by being cheap.
But, from a different perspective, multi-GPU is dual or quad or however many GPUs. Unfortunately, nobody has really gone through the effort to write software to run on these. DX12 and implicit/explicit multi-GPU allow doing it just fine, in theory :) I did not mean memory - more internal than that, cache and work coordination basically. Ideally we want what today, under 10 ms per frame? Any delays will add up quickly.
Workloads in datacenter generally do not care about that aspect as much.
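For scale, here is the frame-time arithmetic behind that 10 ms figure (illustrative only):

```cpp
// At the refresh rates gamers target, the whole frame has a budget of roughly 7-17 ms,
// so any cross-die coordination overhead has to stay far below that to go unnoticed.
#include <cstdio>

int main() {
    const double targets_hz[] = {60.0, 100.0, 144.0};
    for (double hz : targets_hz) {
        double budget_ms = 1000.0 / hz;  // total time available to produce one frame
        std::printf("%6.1f Hz -> %5.2f ms per frame\n", hz, budget_ms);
    }
    return 0;
}
```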
Multi-GPU resource coherency over InfinityFabric has existed since at least RDNA 2.
Because every data item takes the same time to process, and you always know which data items come next, it's very easy to hide memory access like so:
data transfer -> processing
                 data transfer -> processing
                                  data transfer -> ...
So if you increase the latency, but the execution time is always more than what it takes to initiate the next data transfer, it makes no real difference to the overall performance; it's far more important how many memory transfers you can initiate than how long they each take.
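A toy CPU-side sketch of the same idea (illustrative only, not GPU code): as long as processing a chunk takes longer than fetching the next one, the transfer latency disappears behind the compute.

```cpp
// Double-buffered overlap: keep the next transfer in flight while processing the current chunk.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

// Stand-ins for a memory transfer and the shader work that consumes it.
std::vector<int> fetch_chunk(int i) { return std::vector<int>(1 << 16, i); }
long long process_chunk(const std::vector<int>& c) {
    return std::accumulate(c.begin(), c.end(), 0LL);
}

int main() {
    const int chunks = 8;
    long long total = 0;
    auto next = std::async(std::launch::async, fetch_chunk, 0);  // prefetch chunk 0
    for (int i = 0; i < chunks; ++i) {
        std::vector<int> current = next.get();  // only stalls if the prefetch is late
        if (i + 1 < chunks)
            next = std::async(std::launch::async, fetch_chunk, i + 1);  // start the next transfer
        total += process_chunk(current);        // compute overlaps the in-flight transfer
    }
    std::printf("total = %lld\n", total);
    return 0;
}
```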
I mean, everyone seems to be expecting something more like the MI300X. But for some reason AMD has not done even a gaming demonstration with something like that. Nor has Nvidia. Both have been doing chiplets in the data center for several generations now.
InfinityFabric has already been used for 'bonding' the resources of 2 (otherwise discrete) GPUs into one, and that it's viable for more than strictly Machine Learning.
Potentially, both dies sharing a package and membus would mean less latency. Those Radeon Instinct chiplets are connected via IF, too. It's the same concept/technology, differing in scale and implementation.
I think this 'rumor' has some basis in reality, given how AMD's been using its technologies: Modularity and Scalability.
Shared membus typically does not mean less latency.
Edit:
If you meant W6800X Duo with the memory bus sharing then that is not the case. On that card, both 6800 GPUs manage 32GB VRAM for the total of 64GB on the card. Yes, the interconnect makes the connection between GPUs faster but that is not membus sharing.
I was merely pointing out that the technology has already been demonstrated on existing 'for-graphics' Navi21 silicon (non-MI/AI use hardware, using the W6800 X Duo).
Meaning, that this hypothetical dual-die Navi4x is not entirely unrealistic.
My comments on latency and membus were explicitly pointed towards the topic's prospective dual-die GPU.
Physical proximity decreases latency; sharing a membus rather than having to communicate over 2 (distant) memory buses will be lower latency.
(Ex. dual 2.6 GHz Troy Opteron 252s, ea. w/ single-channel RAM vs. a single 2.6 GHz dual-core Toledo FX-60 w/ dual-channel RAM.) P.S. InfinityFabric is a superset of HyperTransport.
Just look at how much Intel Arc GPUs improved from drivers alone.
2. People forgot that mGPU is already feasible???
3. All of RDNA 1/2/3 supports mGPU without any physical connections.
4. The real problem with mGPU, as I've found out, is that mGPU is heavily CPU bound/bottlenecked on most games that have it properly done, unlike a few older games that moved to Vulkan for it.
6. Stop complaining about scaling for GPUs, because even dual-core CPUs didn't get 200% increases until things started using more than 2 cores. Besides, GPUs don't always output every frame they can. Nvidia literally has a Fast option in the control panel to dump frames the GPU knows the monitor can't display fast enough. It's under the V-Sync settings. Frame generation does the opposite of that setting.
7. All upscaling requires far more work in drivers, patches for games, updates & integration into the game than SLI or CrossFire ever needed. (Let alone the AI that Nvidia has to run it through before making a driver or a patch for a game.)
Basically, you would envision something similar to chiplet Zen, where the memory controller is separate from compute and connected to both/all compute dies via IF? Isn't that the other way around? DX12 is a lower-level API than, say, DX11. The developer needs to do more heavy lifting there, and the API implementation in drivers does less and has less of an impact.
Intel Arc is a strange example here. Arc had DX12 drivers working quite well when they came out; it was DX11, DX9 etc. that were a big problem. The thing with this example, though, is that it does not really say too much about drivers and APIs - creating a driver stack practically from scratch is a daunting task, and Intel had to prioritize and catch up later. This really isn't about cards but about APIs and what developers choose to implement. SLI and CrossFire were a convenient middle ground where Nvidia and AMD respectively held some hands to make things easier.
Both current frontrunners in graphics APIs - DX12 and Vulkan - can work with multiple GPUs just fine. The problem is that writing this seems to be more trouble than it is worth so far.
No, jokes aside, that chiplet approach is likely different than SLI. You have no PCIe constraint, as the latency and bandwidth will exceed what SLI provides by more than 100 times. So no more micro-stuttering in the first place.
I think it's a double die working as one. Remember, the whole idea of chiplets is to build the latest tech for GPUs/CPUs on the newest node, but put things like an I/O die on older tech, as those parts don't scale that well, or would add too much cost given how many bad dies you would otherwise get.
The sole reason Ryzen and EPYC are so damn efficient is that AMD simply creates CCDs that can be reused, instead of throwing out half of the chip (as in the older case of Intel).
Nvidia is still doing monolithic dies, with a huge cost per wafer and efficiency thrown out of the window (TDP: 600 W coming).
What chiplets bring is efficient manufacturing - smaller individual dies, better yields. The main reason a chiplet design makes sense is that the additional packaging cost is smaller than what it would cost to actually go for the monolithic die. That's not the only reason, of course; reuse is another, exemplified by Ryzen's I/O die. Or straight-up limits in manufacturing, the reticle limit for example.
When it comes to power efficiency and performance, though, a chiplet design is a straight-up negative. Some of this can be mitigated but not completely negated.
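To put the yield argument in numbers - a hedged sketch using a simple Poisson defect model with assumed, illustrative values, not real fab data:

```cpp
// With a Poisson defect model, two half-size dies yield better than one big die,
// which is where the chiplet cost saving comes from before packaging costs are added back.
#include <cmath>
#include <cstdio>

int main() {
    const double defect_density = 0.1;   // defects per cm^2 (assumed)
    const double big_die_cm2    = 3.5;   // ~350 mm^2 monolithic die (assumed)
    const double small_die_cm2  = big_die_cm2 / 2.0;

    const double yield_big   = std::exp(-defect_density * big_die_cm2);
    const double yield_small = std::exp(-defect_density * small_die_cm2);

    std::printf("monolithic die yield: %.1f%%\n", 100.0 * yield_big);
    std::printf("per-chiplet yield:    %.1f%%\n", 100.0 * yield_small);
    // Chiplets are tested before packaging, so bad ones are discarded cheaply
    // instead of ruining a whole large die.
    return 0;
}
```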
So, multi-GPU causing stuttering is a complete MYTH!
And 2.5 TB/s of bandwidth is enough for Apple to glue two dies together so the Ultra SKU functions basically as a monolithic chip.
If the latency is low enough, this rumor isn't even that unrealistic.
developer.nvidia.com/blog/autodmp-optimizes-macro-placement-for-chip-design-with-ai-and-gpus/
CrossFire was easier to purchase due to not needing special, sometimes Nvidia-only boards.
But getting CrossFire to actually do its job and improve performance was a mess. SLI was not much better, but I remember with AMD I had to use specific versions of their drivers because they'd regress per game.