For AMD, a lot is riding on the success of the new RDNA 2 graphics architecture as it powers not just the Radeon RX 6000 series graphics cards, but also the GPU inside next-generation game consoles designed for 4K Ultra HD gaming with raytracing—a really tall engineering goal. AMD was first to market with a 7 nm GPU more than 15 months ago, with the original RDNA architecture and Navi. The company hasn't changed its process node, but implemented a host of new technologies, having acquired experience with the node. At the heart of the Radeon RX 6900 XT is the 7 nm Navi 21 silicon, which has been fondly referred to as "Big Navi" over the past year or so. This is a massive 519.8 mm² die with 26.8 billion transistors, which puts it roughly in the same league as NVIDIA's 8 nm "GA102" (28.3 billion transistors on a 628.4 mm² die). The die talks to the outside world with a 256-bit wide GDDR6 memory interface, a PCI-Express 4.0 x16 host interface, and display I/O that's good for multiple 4K or 8K displays due to DSC.
Through new design methodologies and component-level optimization throughout the silicon, along with new power-management features, AMD claims to have achieved two breakthroughs that enabled it to double the compute unit counts over the previous generation while staying within a reasonable power envelope. Firstly, the company managed to halve the power draw per CU while adding a 30% increase in engine clocks, which can both be redeemed for performance gain per CU.
The RDNA 2 compute unit is where a bulk of the magic happens. Arranged in groups of two called Dual Compute Units, which share instruction and data caches, the RDNA 2 compute unit still packs 64 stream processors (128 per Dual CU) and has been optimized for increased frequencies, new kinds of math precision, new hardware that enable the Sampler Feedback feature, and the all-important Ray Accelerator, a fixed-function hardware component that calculates up to one triangle or four box ray intersections per clock cycle. AMD claims the Ray Accelerator makes intersection performance up to ten times faster than if it were performed over compute shaders.
AMD also redesigned the render backends of the GPU from the ground up, towards enabling features such as Variable Rate Shading (both tier-1 and tier-2). The company has doubled ROP counts over Navi by giving the chip 128 ROPs. The RX 6800 XT and RX 6900 XT enjoy all 128 ROPs, while the RX 6800 gets 96.
Overall, the Navi 21 silicon has essentially the same component hierarchy as Navi 10. The Infinity Fabric interconnect is the link that binds all the components together. At the outermost level, you have the chip's 256-bit GDDR6 memory controllers, a PCI-Express 4.0 x16 host interface, and the multimedia and display engines (which have been substantially updated from RDNA). A notch inside is the chip's 128-megabyte Infinity Cache, which we detail below. This cache is the town square for the GPU's high-speed 4 MB L2 caches and the graphics command processor, which dispatches the workload among four shader engines. Each of these shader engines packs 10 RDNA 2 Dual Compute Units (or 20 CUs) along with the updated render backends and L1 cache. Combined, the silicon has 5,120 stream processors across 80 CUs, 80 Ray Accelerators (1 per CU), 320 TMUs, and 128 ROPs.
The Radeon RX 6900 XT maxes out the Navi 21 silicon by enabling all 80 RDNA 2 compute units. This works out to 5,120 stream processors, 80 Ray Accelerators, 320 TMUs, and 128 ROPs. The card comes with 16 GB of GDDR6 memory running at 16 Gbps (GDDR6-effective), across the chip's 256-bit wide memory interface, which works out to 512 GB/s of memory bandwidth. The Infinity Cache runs at the highest possible 2 TB/s data-rate; while AMD has increased the overdrive engine clock speed slider limit to 3.00 GHz (up from 2.80 GHz on the RX 6800 XT).
Infinity Cache, or How AMD is Blunting NVIDIA's G6X Advantage
Despite its lofty design goals and a generational doubling in memory size to 16 GB, the RX 6900 XT has a rather unimpressive memory setup compared to NVIDIA's RTX 3090. That is, at least on paper, with just a 256-bit bus width and JEDEC-standard 16 Gbps GDDR6, which works out to 512 GB/s raw bandwidth. NVIDIA has increased bus width to 384-bit on the RTX 3090 and innovated 19.5 Gbps GDDR6X memory to go with its cards, offering bandwidth rivaling those of 4096-bit HBM2 setups. Memory compression secret sauce can at best increase effective bandwidth by a high single-digit percent.
AMD took a frugal approach to this problem, not wanting to invest in expensive HBM+interposer based solutions, which would throw overall production costs way off balance. It looked at how AMD's "Zen" processor team leveraged large last-level caches on EPYC processors to significantly improve performance and carried the idea over to the GPU. About 20% of the Navi 21 silicon die area now holds what AMD calls the "Infinite Cache," which is really just a new L3 cache that is 128 MB in size and talks to the GPU's four shader engines at 1024 bits per pin, per cycle. This cache has an impressive bandwidth of 2 TB/s and can be used as a victim cache by the 4 MB L2 caches of the four shader engines.
The physical media of Infinity Cache is the same class of SRAM as for the L3 cache on Zen processors. It offers four times the density of 4 MB L2 caches, lower bandwidth in comparison, but four times the bandwidth over GDDR6. It also significantly reduces energy consumption, by 1/6th for the GPU to fetch a byte of data compared to doing so from GDDR6 memory. I'm sure the questions on your mind are what difference 128 MB makes and why no one has done this earlier.
To answer the first question, even with just 128 MB spread across two slabs of 64 MB, each, Infinity Cache takes up roughly 20% of the die area of the Big Navi silicon, and AMD's data has shown that much of the atomic workloads involved in raytracing and raster operations are bandwidth rather than memory-size intensive. Having a 128 MB fast victim cache running at extremely low latencies (compared to DRAM) helps. As for why AMD didn't do this earlier, it's only now that there's an alignment of circumstances where the company can afford to go with a fast 128 MB victim cache as opposed to just cramming in more CUs to get comparable levels of performance, but for less power consumption—as a storage rather than a logic device, spending 20% of the die area on Infinity Cache instead of 16 more CUs does result in power savings.