Monday, November 7th 2022
AMD RDNA3 Navi 31 GPU Block Diagram Leaked, Confirmed to be PCIe Gen 4
An alleged leaked company slide details AMD's upcoming 5 nm "Navi 31" GPU powering the next-generation Radeon RX 7900 XTX and RX 7900 XT graphics cards. The slide details the "Navi 31" MCM, with its central graphics compute die (GCD) chiplet built on the 5 nm EUV silicon fabrication process, surrounded by six memory cache dies (MCDs), each built on the 6 nm process. The GCD interfaces with the system over a PCI-Express 4.0 x16 host interface. It features the latest-generation multimedia engine with dual-stream encoders, and the new Radiance display engine with DisplayPort 2.1 and HDMI 2.1a support. Custom interconnects tie it to the six MCDs.
Each MCD has 16 MB of Infinity Cache (L3 cache) and a 64-bit GDDR6 memory interface (two 32-bit GDDR6 paths). Six of these add up to the GPU's 384-bit GDDR6 memory interface. In practical terms, the GPU still presents a contiguous, monolithic 384-bit-wide memory bus, just as every modern GPU uses multiple on-die memory controllers to achieve a wide bus. "Navi 31" hence has a total Infinity Cache size of 96 MB—less than the 128 MB on "Navi 21," but AMD has shored up cache sizes elsewhere across the GPU: the L0 caches on the compute units grow by 240%, the L1 caches by 300%, and the L2 cache shared among the shader engines by 50%. The slide confirms the RX 7900 XTX uses 20 Gbps GDDR6 memory, for 960 GB/s of memory bandwidth.

The GCD features six Shader Engines, each with 16 compute units (or 8 dual compute units), which work out to 1,024 stream processors per engine. AMD claims to have doubled the IPC of these stream processors over RDNA2, and the new RDNA3 ALUs also support BF16 instructions. The SIMD engine of "Navi 31" has an FP32 throughput of 61.6 TFLOP/s, a 168% increase over the 23 TFLOP/s of "Navi 21." The slide doesn't quite detail the new ray tracing engine, but references new RT features, larger caches, and a 50% higher ray intersection rate, for up to a 1.8X RT performance increase over the RX 6950 XT at 2.505 GHz engine clocks. There are other major upgrades to the GPU's raster 3D capabilities, including a 50% increase in prims/clock rates and a 100% increase in prim/vertex cull rates. The pixel pipeline sees similar 50% increases in rasterized prims/clock and pixels/clock, along with synchronous pixel-wait.
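The headline figures above can be sanity-checked with some quick arithmetic. A minimal sketch, assuming an FMA counts as 2 FLOPs and that RDNA3's dual-issue SIMDs contribute a further 2x FP32 ops per stream processor per clock (the usual way the 61.6 TFLOP/s figure is derived):

```python
# Back-of-the-envelope check of the Navi 31 numbers quoted above.
# Assumption: dual-issue SIMDs double FP32 throughput per SP per clock.

def gddr6_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s from per-pin data rate and bus width."""
    return data_rate_gbps * bus_width_bits / 8

def fp32_tflops(stream_processors: int, clock_ghz: float, dual_issue: bool = True) -> float:
    """Peak FP32 throughput: SPs x 2 FLOPs (FMA) x dual-issue factor x clock."""
    ops_per_clock = 2 * (2 if dual_issue else 1)
    return stream_processors * ops_per_clock * clock_ghz / 1000

sps = 6 * 16 * 64  # 6 shader engines x 16 CUs x 64 SPs = 6,144
print(sps)                                   # 6144
print(gddr6_bandwidth_gbs(20, 384))          # 960.0 GB/s
print(round(fp32_tflops(sps, 2.505), 1))     # 61.6 TFLOP/s
```

The same formula without dual-issue reproduces "Navi 21's" roughly 23 TFLOP/s (5,120 SPs at ~2.25 GHz), which is where the quoted 168% increase comes from.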
Source:
VideoCardz
79 Comments on AMD RDNA3 Navi 31 GPU Block Diagram Leaked, Confirmed to be PCIe Gen 4
One can hope that Navi 33 (June 2023?) will fix these problems but I think they need Navi 34 with PCIe 4.0 x8, AV1, DisplayPort 2.1 and a price tag of $129.
They weren't designed to be standalone products - they're literally in the AMD roadmaps from almost 4 years ago as laptop-only auxiliary GPU cores to complement an AMD iGPU.
A pandemic can't erase plans that never existed - when there were never any plans for decent offers, you can't blame external factors.
Its roadmap is lackluster.
There are market segments that need improved gaming performance on entry-level discrete cards (the RX 6500 XT is stupid to begin with, since it offers 0% improvement over the older RX 5500 XT), which can go into any system, Intel or AMD...
iGPUs are cool, but they require you to buy the entire platform. You can have an older PC and you might want to upgrade just the graphics card to get some new features and a bit more performance. They do not have to make a lot of those cards.
The performance gap between RX 6400 and RTX 4090 is too wide.
AMD can be back in business if it succeeds in moving this down to only 400%.
AMD also needs to start making pipecleaners. How about a 50 sq. mm GPU on TSMC N3 with chiplets on N7 made now, NOW?
I don't care much for RT anyway, and prefer 240 Hz on a 3K screen. Time to retire the 3090 - 7900 XTX, here I come.
This setup was fine in RDNA2; it just needed refinement. RDNA3 doesn't refine this setup at all - it slows it down, especially in ray tracing. If AMD had added a separate ray tracing instruction cache and ray tracing data cache to RDNA2 for the RT core on the back end, the RT cores would be unbottlenecked from the texture units they're linked to. You could then unlink the shader clocks from the RT cores and size them appropriately, instead of a single double-size front end being unlinked from the shaders like it is in RDNA3. The shaders are underclocked because the instruction cache can't feed them fast enough at the same clock, since there are more of them now. Or they could have just doubled the instruction cache, as they did to improve ray tracing on the back end, without adding shaders to each unit. If you want proof, all you have to do is look at Navi vs. Navi 2: they doubled texture units, doubled ROPs, doubled shaders, doubled compute units, and added Infinity Cache, all on the same 7 nm node, with power going from 225 W to 330 W - a ~50% increase. They couldn't do that in Navi 3 at all, because its front end can't feed the increase in shaders and doubled RT cores, so it wouldn't be able to support double the texture units inside the WGP/compute unit. That is why there is only a 20% increase in texture units this time.
P.S. Anyone feel like modifying this in Photoshop to show the new way RDNA3 is set up?
Nvidia will never replace the 1650 because nothing at that size has a snowball's chance in hell of raytracing or running DLSS acceptably. RTX and DLSS are Nvidia's competitive edge and they won't undermine that no matter what.
I'd love to see a GA108 die: maybe 1920 CUDA cores clocked at ~1.8 GHz, 6 GB of cheap, slow 12 Gbps GDDR6, and all running at <75 W on Samsung 8 nm or TSMC N6. There'd be no need for Tensor cores or RT cores, and the aim would be to make the die as small and cheap to produce as possible.
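The hypothetical card sketched above can be put into rough numbers. A minimal sketch, assuming 2 FP32 ops (FMA) per CUDA core per clock and a 192-bit bus (the usual width for a 6 GB GDDR6 configuration) - all figures here are the commenter's hypotheticals, not a real product:

```python
# Rough throughput for the hypothetical low-end die described above.
# Assumptions: 2 FP32 ops (FMA) per core per clock, 192-bit bus for 6 GB.

def fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOP/s."""
    return cores * 2 * clock_ghz / 1000

def mem_bw_gbs(rate_gbps: float, bus_bits: int) -> float:
    """Peak memory bandwidth in GB/s."""
    return rate_gbps * bus_bits / 8

print(round(fp32_tflops(1920, 1.8), 2))  # ~6.91 TFLOP/s
print(mem_bw_gbs(12, 192))               # 288.0 GB/s
```

That puts the imagined part roughly in GTX 1660-class compute territory, which fits the sub-75 W budget goal the comment describes.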
It'll never happen though. Nvidia have abandoned the low end to AMD and Intel iGPUs; just above that, they can still make profit for zero effort by churning out the 1050 Ti and 1650. They're neither modern nor efficient, but Nvidia doesn't care about that; they only care about profit.
An incredibly important distinction to make is fabric bandwidth versus theoretical VRAM bandwidth. Getting 960 GB/s at the PHYs does NOT equate to pushing terabytes per second of data across the die. One of the major steps forward in RDNA3's design is that increased fabric bandwidth. This is something other companies are absolutely killing right now - not to name any fruit-related names - leaving people confused about how they can manage near-linear scaling with increased shader grouping.
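The distinction above can be illustrated with a deliberately simple toy model: the bandwidth the shaders actually see blends on-die cache/fabric bandwidth with off-die VRAM bandwidth, weighted by hit rate. The cache-bandwidth figure and hit rate below are illustrative assumptions, not AMD specs; only the 960 GB/s PHY figure comes from the article:

```python
# Toy model: PHY bandwidth is not the bandwidth delivered to the shaders.
# CACHE_BW and the hit rate are assumed, illustrative numbers.

def effective_bandwidth(hit_rate: float, cache_bw: float, vram_bw: float) -> float:
    """Blend of on-die cache bandwidth and off-die VRAM bandwidth,
    weighted by the cache hit rate (a deliberately simple model)."""
    return hit_rate * cache_bw + (1 - hit_rate) * vram_bw

VRAM_BW = 960.0    # GB/s at the GDDR6 PHYs (from the article)
CACHE_BW = 3000.0  # GB/s, an assumed on-die cache/fabric figure

# With a 50% hit rate, the shaders see far more than the PHY bandwidth:
print(effective_bandwidth(0.5, CACHE_BW, VRAM_BW))  # 1980.0
```

The point of the sketch is only directional: raising fabric/cache bandwidth (or hit rate) lifts delivered bandwidth well past what the memory PHYs alone could supply.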
Why don't you try posting facts and proving it?
"Shader utilization" on RDNA3 will be lower than it is on RDNA2. I already showed why, with facts.
5700 XT Reference on the left, 6900 XT reference on the right. The front end CAN feed the shader engines because it runs at higher clocks. Also, doubled RT cores? Where did you read that? Navi 31's ray accelerators were only increased by 50% according to the top post detailing the block diagram, and AMD themselves don't even mention a full 2x performance increase of the cores. Where did you get "double" from?
The reason there has only been a 20% increase in TMU count is that there has only been a 20% increase in CU count - from 80 to 96. It's the same reason TMU count doubled from the 5700 XT to the 6900 XT: they went from 40 to 80 CUs. That's how the architecture is laid out; I'm sorry you don't like it?
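The arithmetic in the comment above is easy to verify: RDNA ties 4 TMUs to each CU, so TMU count scales in lockstep with CU count.

```python
# TMU count scales directly with CU count (4 TMUs per CU in RDNA).

TMUS_PER_CU = 4

def tmus(cus: int) -> int:
    """Total texture units for a given CU count."""
    return cus * TMUS_PER_CU

# 5700 XT (40 CUs) -> 6900 XT (80 CUs): TMUs double
print(tmus(40), tmus(80))                 # 160 320
# 6900 XT (80 CUs) -> Navi 31 (96 CUs): +20%
print(round(tmus(96) / tmus(80) - 1, 2))  # 0.2 -> +20%
```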
Great analysis BTW. Looks like quite a few people can read between the lines of the slides AMD has shown. I've been reading/hearing it in a few analyses of RDNA3.
Indeed, it looks like the "worst" spot in AMD's design is the front end.
They know it, hence the decoupled front-end/shader clocks. They are trying to distribute die space and keep costs low by not using expensive (in power and cost) and far too complex components - GDDR6X, high-bandwidth memory controllers*, dedicated RT/AI cores - and instead use an enhanced, larger cache system as a substitute for GDDR6X/HBMCs*.
Their strongest point is the cache system and the MCD-to-GCD feed, as well as inside the GCD (L1/L2). In fact, it got so much stronger (in effective bandwidth) that they even reduced the L3 Infinity Cache (128 MB down to 96 MB) while it's still faster overall than RDNA2.
If in the future (2023) we see a 7950 XTX, it will probably be a larger GCD with at least a larger front end, plus maybe a few more CUs (for example, +100-150 mm²).
They already know how to make a 520 mm² GCD.
They can match or pass the 600 mm² of Ada in total die area (GCD+MCDs) at lower cost. They could also move to 4 nm for this one.
Imagine a $1,500 (450-500 W) 7950 XTX competing with a $2,500 (550-600 W) 4090 Ti.
Pure speculation, but not far-fetched IMHO. I saw that... he also said the 7900 XTX should cost $500 just to make sense... :kookoo::banghead:
Wishful thinking or just pure absurdity?
not to mention trying to emulate a British accent while obviously being an ESL speaker... plus the voice; I can't stand his manner of speaking