Monday, November 7th 2022

AMD RDNA3 Navi 31 GPU Block Diagram Leaked, Confirmed to be PCIe Gen 4
An alleged leaked company slide details AMD's upcoming 5 nm "Navi 31" GPU powering the next-generation Radeon RX 7900 XTX and RX 7900 XT graphics cards. The slide details the "Navi 31" MCM, with its central graphics compute die (GCD) chiplet that's built on the 5 nm EUV silicon fabrication process, surrounded by six memory cache dies (MCDs), each built on the 6 nm process. The GCD interfaces with the system over a PCI-Express 4.0 x16 host interface. It features the latest-generation multimedia engine with dual-stream encoders, and the new Radiance display engine with DisplayPort 2.1 and HDMI 2.1a support. Custom interconnects tie it to the six MCDs.
Each MCD has 16 MB of Infinity Cache (L3 cache) and a 64-bit GDDR6 memory interface (two 32-bit GDDR6 paths). Six of these add up to the GPU's 384-bit GDDR6 memory interface. In practical terms the GPU still presents a contiguous, monolithic 384-bit memory bus, since every modern GPU already uses multiple on-die memory controllers to achieve a wide bus. "Navi 31" hence has a total Infinity Cache size of 96 MB, which is smaller than the 128 MB on "Navi 21," but AMD has shored up cache sizes elsewhere across the GPU: the L0 caches on the compute units grow by 240%, the L1 caches by 300%, and the L2 cache shared among the shader engines by 50%. The slide confirms the RX 7900 XTX uses 20 Gbps GDDR6 memory, for 960 GB/s of memory bandwidth.
The GCD features six Shader Engines, each with 16 compute units (or 8 dual compute units), which works out to 1,024 stream processors per shader engine, or 6,144 in total. AMD claims to have doubled the IPC of these stream processors over RDNA2, and the new RDNA3 ALUs also support BF16 instructions. The SIMD engine of "Navi 31" has an FP32 throughput of 61.6 TFLOP/s, a 168% increase over the 23 TFLOP/s of "Navi 21." The slide doesn't detail the new ray tracing engine, but references new RT features, larger caches, and a 50% higher ray intersection rate, for up to a 1.8X RT performance increase over the RX 6950 XT at 2.505 GHz engine clocks. There are other major upgrades to the GPU's raster 3D capabilities, including a 50% increase in prims/clock and a 100% increase in prim/vertex cull rates. The pixel pipeline sees similar 50% increases in rasterized prims/clock and pixels/clock, along with synchronous pixel-wait.
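For readers who want to sanity-check the slide's figures, here is a minimal back-of-the-envelope calculation in Python. It assumes the conventional 64 stream processors per compute unit, an FMA counted as two operations, RDNA3's dual-issue doubling of FP32 throughput, and that the 61.6 TFLOP/s figure is quoted at the 2.505 GHz engine clock; none of these conventions is spelled out on the slide itself.

```python
# Back-of-the-envelope check of the figures quoted in the slide.
# Assumptions: 64 SPs per CU, FMA = 2 ops, RDNA3 dual-issue doubles FP32
# throughput, and the TFLOP/s figure is taken at the 2.505 GHz engine clock.

MCD_COUNT        = 6
BUS_PER_MCD_BITS = 64      # two 32-bit GDDR6 channels per MCD
GDDR6_GBPS       = 20      # 20 Gbps memory speed
IC_PER_MCD_MB    = 16

SHADER_ENGINES   = 6
CUS_PER_ENGINE   = 16
SPS_PER_CU       = 64
CLOCK_GHZ        = 2.505

bus_width_bits = MCD_COUNT * BUS_PER_MCD_BITS                   # 384-bit
bandwidth_gbs  = bus_width_bits * GDDR6_GBPS / 8                # 960 GB/s
infinity_cache = MCD_COUNT * IC_PER_MCD_MB                      # 96 MB
stream_procs   = SHADER_ENGINES * CUS_PER_ENGINE * SPS_PER_CU   # 6,144
fp32_tflops    = stream_procs * 2 * 2 * CLOCK_GHZ / 1000        # ~61.6

print(f"{bus_width_bits}-bit bus, {bandwidth_gbs:.0f} GB/s, {infinity_cache} MB Infinity Cache")
print(f"{stream_procs} stream processors, {fp32_tflops:.1f} FP32 TFLOP/s")
```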
Source: VideoCardz
79 Comments on AMD RDNA3 Navi 31 GPU Block Diagram Leaked, Confirmed to be PCIe Gen 4
If you really need it, Threadripper is your friend, or even a decent EPYC platform. They all excel at offering more lanes.
If you wait for that, you might end up with grey hairs; don't expect any 8900 before 2025 at the earliest :D
You know Moore's law is dead; it's extremely expensive to use state-of-the-art manufacturing processes IF they exist at all :D
But good design by AMD - don't waste your time chasing ridiculous and needless PCIe version number inflation :D
If AMD is able to sell the 7900 XTX in volume at $900 with huge margins, they will become even more profitable: more R&D budget for future generations, more budget to buy new wafers (and they need fewer of them). They will be able to produce more cards for less, and they will be able to wage price wars and win long term.
Right now, Nvidia must keep their mindshare as intact as possible and must win the flagship no matter the cost to preserve it. Their current chip is huge, and that hurts the defect rate. With TSMC's currently advertised defect density for 5/4 nm, they probably get around 56% yield, or 46 good dies out of 82 candidates. Depending on where a defect lands, yield could be a bit higher if they can disable a portion of the chip, but don't expect anything huge.
AMD, on the other hand, can produce 135 good dies per wafer out of 180 candidates, for a 74.6% yield. Then you need the MCDs, which are on a much cheaper node: there you get 1,556 good dies out of 1,614 candidates, for a 96.34% yield. Another advantage is that it's probably the same MCD for Navi 31 and Navi 32, allowing AMD to quickly shift them to wherever they are most needed.
AMD can produce around 260 7900 XTXs with two 5 nm wafers and one 6 nm wafer.
Nvidia can produce around 150 4090s with three 4 nm wafers.
The best comparison would be with AD103: out of 139 candidates there would be 96 good dies, for a 69.3% yield. With three 4 nm wafers, that means 288 cards.
Then we add in the cost of each node: 5 and 4 nm cost way more than 6/7 nm. I was not able to find accurate 2022/2023 pricing, and in the end it comes down to whatever AMD and Nvidia negotiated. But these days people are moving to 5 nm and there is capacity available on 7/6 nm (it's also why AMD is still producing Zen 3 in volume). Rumors are that a TSMC 5 nm wafer costs around $17k and a 6 nm wafer around $7-9k, but take this with a grain of salt (and each vendor has their own negotiated price too). A rough sketch of the yield math follows below.
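For what it's worth, here is a minimal sketch of how those yield figures fall out of a simple Poisson defect model. The defect density (~0.095 defects/cm², roughly in line with what TSMC has advertised for its 5 nm class nodes), the die areas, and the candidates-per-wafer counts are assumptions pulled from public estimates and from the numbers quoted above, so treat the output as illustrative only.

```python
from math import exp

# Simple Poisson yield model: yield = exp(-defect_density * die_area).
# Defect density and die areas are rough public estimates, not official
# figures; candidate counts per wafer are the ones quoted above. The same
# density is applied to the 6 nm MCD purely to reproduce the figures above;
# a mature node would likely do better.
DEFECT_DENSITY = 0.095  # defects per cm^2, assumed

dies = {
    # name:           (area_cm2, candidates_per_wafer)
    "AD102 (4090)":   (6.08, 82),
    "AD103 (4080)":   (3.79, 139),
    "Navi 31 GCD":    (3.06, 180),
    "Navi 31 MCD":    (0.375, 1614),
}

for name, (area, candidates) in dies.items():
    y = exp(-DEFECT_DENSITY * area)
    good = candidates * y
    print(f"{name:14s} yield {y:5.1%}  ~{good:6.0f} good dies / wafer")

# Cards per small wafer bundle, ignoring binning and salvage parts:
gcd_good = 180 * exp(-DEFECT_DENSITY * 3.06)    # per 5 nm wafer
mcd_good = 1614 * exp(-DEFECT_DENSITY * 0.375)  # per 6 nm wafer
xtx_cards = min(2 * gcd_good, mcd_good / 6)     # 2x 5 nm + 1x 6 nm wafers
print(f"~{xtx_cards:.0f} 7900 XTX per two 5 nm wafers plus one 6 nm wafer")
```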
So right now, it looks like, ray tracing aside, AMD can produce nearly as many of these cards as Nvidia can produce 4080s, with one of the wafers on a much cheaper node. They can also reuse a portion of that wafer and allocate it to other GPU models easily.
This is what is disruptive. It's disruptive for AMD, not for the end users (sadly), but over time those advantages will add up and Nvidia will feel it: their margins will shrink, etc. But Nvidia is huge and rich. This will take many generations, and Nvidia will probably have time to get their own chiplet strategy in place by then.
Intel is in slightly hotter water because the money is already drying up: they are barely competitive on the desktop, they struggle in laptops/notebooks, and they suffer in data centers. They still compete on performance, but their margins have shrunk so much that they need to cancel products to stay afloat.
I think it's easy to miss that with all the hype and the clickbait titles we've had. But it's also normal for end users to look at what benefits them the most.
Actually we had "chiplets" called Radeon HD 3870 X2, Radeon HD 4870 X2, Radeon HD 6990, Radeon HD 7990, Radeon R9 295X2, Radeon Pro Duo. And their support was abandoned.
ARF, all your examples are actually just CrossFire on a single board. They are not even chiplets.
But the main problem is how you make the system see only one GPU. On a CPU, you can add as many cores as you want and the OS will be able to use them all. With the I/O die you can have a single memory domain, and even if you don't, operating systems have supported Non-Uniform Memory Access (NUMA) for decades.
For it to be possible on a GPU, you would need one common front end with compute dies attached to it. I am not sure whether the MCDs would need to connect to the front end or to the compute dies, but I am under the impression it would be the front end.
The thing is, you want edge space to communicate with your other dies: MCDs, display outputs, PCIe. There is no more edge space available. The biggest consumer of area is the compute units, not the front end. I think this is why they did it this way.
In the future we could see a large front end with a significant amount of cache, with multiple independent compute dies connected to it. Cache does not scale well on newer nodes, so the front end could be made on an older node (if that is worth it; some areas might need to be faster).
Anyway, design is all about tradeoffs. Multiple compute chiplets could be done, but the tradeoffs are probably not worth it yet, so they went with this imperfect design that still does not allow multiple GCDs.
It was called crossfire-on-a-stick because it was just that. You could actually expose each GPU separately in some OpenCL workloads and have them run independently.
How well are AMD's "huge margins" going with the release of Zen 4? Remind me of the economics of a product with a 50% margin that sells 1,000 units compared to a product with a 30% margin that sells 10,000 units. Defective 4090 dies can become a 4080 Ti. Their defect rate also doesn't matter that much when they can charge whatever they like due to the lack of competition, and that's before you consider RTRT and a driver and software development team that is competent and huge (compared to AMD's).
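To make the margin-versus-volume point concrete, a trivial sketch; the $1,000 selling price is purely hypothetical and only serves to compare the two scenarios:

```python
# Hypothetical example: same selling price, different margin and volume.
PRICE = 1_000  # USD, assumed purely for illustration

scenarios = {
    "50% margin, 1,000 units":  (0.50, 1_000),
    "30% margin, 10,000 units": (0.30, 10_000),
}

for name, (margin, units) in scenarios.items():
    gross_profit = PRICE * margin * units
    print(f"{name}: ${gross_profit:,.0f} gross profit")
# 50% margin, 1,000 units:  $500,000 gross profit
# 30% margin, 10,000 units: $3,000,000 gross profit
```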
I guess we'll see what the performance is like when the reviews come out.
AMD doesn't seem to understand their position in the GPU market - they're the underdog, with worse development resources, technology (primarily software, but also dedicated hardware units) that is several years behind the competition, significantly lower market share etc. It's not a reasonable prediction to say "NVIDIA will suffer", it's AMD that needs to actually compete. They needed to go balls to the wall and release a multi die card that blew the 4090 out of the water, while being the same price or cheaper to change this "mindshare", instead they've released another generation of cards that on paper is "almost as good", while being a bit cheaper and without the NVIDIA software stack advantage.
Not all defects are equal. If you have a slight defect in one of the compute units, you can just disable it; same thing with one of the memory controllers. But if it's in an area with no redundancy, you cannot reuse the chip. And that assumes there is only one defect per chip.
Cheaper SKUs can also just be chips that cannot clock as high as others. Nothing says, for example, that a 6800 has a defect; they might simply have found that it didn't clock high enough to be a 6800 XT or 6900 XT/6950 XT. It's also possible that one of the compute units really didn't clock well, so they had to disable it to reach the target clocks (a good chip without defects, but not good enough).
That affects everyone, but bigger chips are more at risk of defects. In the end the numbers can roughly be taken at face value, because the larger the chip, the higher the chance that something is wrong in a spot that can't be salvaged by deactivating a portion of it.
2. Perhaps the Infinity Fabric isn't quite there yet to let them put the CUs on multiple dies? Or perhaps the front end requires 5 nm to perform effectively, which would mean AMD would have to do two different designs on 5 nm wafers, complicating their production schedule and increasing costs?
In other words, AMD is taking advantage of their cost savings and releasing their top-tier product at $600 less than Nvidia's top-tier product, and at $200 less than what the apparently competitive GPU is launching at. Things get even worse for Nvidia in markets outside of the USA. For example, here in Australia, if AMD does not price gouge us like Nvidia is doing, then the 7900 XTX will be roughly half the price of a 4090, and that price saving is enough to build the rest of your PC (e.g. 5800X3D, 16 GB/32 GB DDR4, B550 motherboard, 850 W PSU, $200 case).
Their top-tier product competes in raster with Nvidia's 3rd/4th product down the stack (4090 Ti, 4090, 4080 Ti, 4080), therefore the fact that it's cheaper is borderline irrelevant. That's without getting started on the non-raster advantages NVIDIA has.
In other words, the situation hasn't changed since the 6950 XT vs the 3090 Ti, and the 6900 XT vs the 3090.
Next year I am planning on building a separate strictly gaming PC and I want to convert my current build to an HTPC that will handle everything else, including recording with a capture card. I would definitely want a cheap graphics card with AV1 encoding. If NVIDIA or AMD do not offer such a product, I might go with Intel (A380 or whatever).
I personally won't trade a tiny amount of RT cores, a newer encoder/decoder, and HDMI 2.1 at 40 Gbps for their predecessor, the 5500 XT 8 GB. Even a 470 4 GB from 2016 brutalises the 6400 in all scenarios.