
AMD Announces the Radeon RX 6000 Series: Performance that Restores Competitiveness

Plus, while AMD might feel encouraged to slow things down a bit on the CPU side, since they are starting to compete with themselves a bit
Plus, they are focused on a new socket. The Ryzen 5000 series of CPUs is the last for socket AM4. The next will likely be AM5.
 
Plus, they are focused on a new socket. The Ryzen 5000 series of CPUs is the last for socket AM4. The next will likely be AM5.
And they can't rush that out before PCIe 5.0 is at least technically viable (likely needs new on-board hardware to ensure signal integrity, which might not be available at consumer price levels for a while) and DDR5 has wide availability. Definitely good reasons to hold off AM5 for a while yet.

But using that as an argument that AMD will try to quicken their GPU development pace? Nah, sorry, not buying that. 16 months between RDNA 1 and RDNA 2. Now we're supposed to get RDNA 3 in < 14 months? And remember, a launch later in the year than this isn't happening no matter what. It's either pre holiday season or CES. Which makes that 12 months, not 14. I really don't see that as likely. I'll be more than happy to be proven wrong, but I'm definitely sticking to a more cautious approach here.
 
Why do you think they'll just straight up go with PCIe 5.0? They most certainly can skip that.

DDR5 is a given, PCIe 5.0 is not much of a necessity even on servers. Of course with Xilinx (acquisition) they might surprise us or something.
 
Why do you think they'll just straight up go with PCIe 5.0? They most certainly can skip that.

DDR5 is a given, PCIe 5.0 is not much of a necessity even on servers. Of course with Xilinx they might surprise us or something.
I don't think it's necessary at all, but launching a new long-term platform ~a year before the availability of an I/O standard is generally a bad idea. Of course it's possible that they could launch AM5 with the promise of future PCIe 5.0 support (i.e. first-gen motherboards and CPUs won't have 5.0, but will be compatible with next-gen CPUs and mobos that do, just at 4.0 speeds when mixed), but again, that's rather sloppy.
 
I think they are able to cut/disable CUs in pairs. If you look at the RDNA 1/2 full dies you will see 20 and 40 identical rectangles respectively. Each of these rectangles is 2 CUs.

Note: CU is now a bit of a historical artifact. RDNA and RDNA 2 are organized into WGPs, or "Dual Compute Units" (because each WGP has the resources of 2x CUs of old). That's why there are 40 RDNA clusters, which count as 80 "CUs" (even though CUs don't really exist anymore).

CUs were in Vega, and are a decent unit to think about while programming the GPU. WGPs work really hard to "pretend" to work like 2x CUs for backwards compatibility purposes... but they're really just one unit now.

-----

As such: the proper term for those 40x clusters on your RDNA2 die shot is Workgroup Processor (WGP)... or "Dual-compute units" (if you want to make a comparison to Vega).
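If it helps to see the counting laid out, here's a trivial sketch (purely my own illustration of the WGP-to-CU equivalence described above, not anything from AMD):

```python
# Sanity-check of the WGP ("dual compute unit") counting described above.
# The full-die WGP counts are the public configs; the 2-CUs-per-WGP mapping
# is just the marketing equivalence being explained.

CUS_PER_WGP = 2  # each Workgroup Processor presents the resources of 2 legacy CUs

full_dies = {
    "Navi 10 (RDNA 1)": 20,  # 20 WGPs, marketed as 40 CUs
    "Navi 21 (RDNA 2)": 40,  # 40 WGPs, marketed as 80 CUs
}

for die, wgps in full_dies.items():
    print(f"{die}: {wgps} WGPs = {wgps * CUS_PER_WGP} 'CUs'")
```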
 
But using that as an argument that AMD will try to quicken their GPU development pace? Nah, sorry, not buying that. 16 months between RDNA 1 and RDNA 2. Now we're supposed to get RDNA 3 in < 14 months? And remember, a launch later in the year than this isn't happening no matter what. It's either pre holiday season or CES. Which makes that 12 months, not 14. I really don't see that as likely. I'll be more than happy to be proven wrong, but I'm definitely sticking to a more cautious approach here.
You forget that during these 16 months they effectively launched 3 architectures, RDNA2 + 2 custom APUs, with different architectures and features for consoles. Now the whole GPU design team is free to work on the new GPU generation.
 
I don't think it's necessary at all, but launching a new long-term platform ~a year before the availability of an I/O standard is generally a bad idea. Of course it's possible that they could launch AM5 with the promise of future PCIe 5.0 support (i.e. first-gen motherboards and CPUs won't have 5.0, but will be compatible with next-gen CPUs and mobos that do, just at 4.0 speeds when mixed), but again, that's rather sloppy.
They added PCIe 4.0 to Zen later.
 
You forget that during these 16 months they effectively launched 3 architectures, RDNA2 + 2 custom APUs, with different architectures and features for consoles. Now the whole GPU design team is free to work on the new GPU generation.
"The whole design team" is at least four separate design teams (two for Zen). It's not like all the Zen design engineers can just slot into a GPU design team without a significant retraining period. The semi-custom team is no doubt already working on 5nm refreshes for both console makers, but some of their engineers could have been moved to a field closer to their expertise, whether that's CPU, GPU, I/O, fabric, etc. Ryzen is under continuous development; one team just finished Zen 3, the other is hard at work with Zen 4, and no doubt the Zen 2 team is now ramping up development of Zen 5. There might be some minor shuffling, but nothing on the scale you are indicating.

They added PCIe 4.0 to Zen later.
That's true. But that was quite a long time after AM4 launched, not a year or less.
 
"The whole design team" is at least four separate design teams (two for Zen). It's not like all the Zen design engineers can just slot into a GPU design team without a significant retraining period. The semi-custom team is no doubt already working on 5nm refreshes for both console makers, but some of their engineers could have been moved to a field closer to their expertise, whether that's CPU, GPU, I/O, fabric, etc. Ryzen is under continuous development; one team just finished Zen 3, the other is hard at work with Zen 4, and no doubt the Zen 2 team is now ramping up development of Zen 5. There might be some minor shuffling, but nothing on the scale you are indicating.
I wonder where you get the info on the console 5nm refreshes. Do you have any source, or are you just guessing? Sony made it clear there will be no refreshes this generation, at least, and there is no leak or hint of that yet. If any of that comes, it will most probably be well after RDNA3.
 
I wonder where you get the info on the console 5nm refreshes. Do you have any source, or are you just guessing? Sony made it clear there will be no refreshes this generation, at least, and there is no leak or hint of that yet. If any of that comes, it will most probably be well after RDNA3.
Doesn't mean they won't evolve what's out for a cheaper BOM, it's what they do.
 
My estimation, absolutely based on (my) logic, is that AMD will stay away from GDDR6X. First, because they can get away with the new IC implementation. And second, because of all the extra expense: GDDR6X is more expensive, draws almost 3x the power of “simple” GDDR6, and the memory controller needs to be more complex too (= more die area and fab cost).

This I “heard” partially...
The three 6000 cards we’ve seen so far are based on Navi 21, right? 80 CUs full die. They may have one more N21 SKU with even fewer CUs, I don’t know how many, probably 56 or even fewer active, with 8GB(?) and probably the same 256-bit bus. But I don’t think this is coming soon, because they may have to build inventory first (given the present good fab yields) and also see how things will go with nVidia.

Further down they have Navi 22. Probably a (?)40CU full die with a 192-bit bus, (?)12GB, clocks up to 2.5GHz, 160~200W, and who knows how much IC. That will be better than the 5700 XT.
And also cut-down versions of N22 with 32~36 CUs, 8/10/12GB, 160/192-bit (as 5600/5700 replacements) and so on, but at this point it’s all pure speculation and things may change in the future.

There are also rumors of Navi 23 with 24~32 CUs, but... it’s way too soon.

Navi21: 4K
Navi22: 1440p and ultrawide
Navi23: 1080p only
That does make sense on the GDDR6X situation, given the cost, complexity, and power relative to GDDR6 and with the Infinity Cache being so effective. I'd like to think that with 192-bit they'd have more than 40 CUs, considering the Infinity Cache. If it were 128-bit with 64MB of Infinity Cache, I could see something like 36 CUs being quite reasonable. I think aiming higher than RDNA1 is in AMD's best interest for both longevity and margins, or at least matching it at better efficiency and production cost.

Yep, CUs are grouped two by two in ... gah, I can't remember what they call the groups. Anyhow, AMD can disable however many they like as long as it's a multiple of 2.
Looking at them, I actually wouldn't expect them to cut that few, realistically, for a few reasons. SKU differentiation is one obvious reason, but the other is heat distribution balance. I'm not sure cutting just 4 CUs in total is really ideal; slices of 2 CUs diagonal from each other on opposite sides of the die kind of makes more sense. That said, AMD has a lot of tech packed into their circuitry these days, with precision boost and granular management over it, so they could probably cut only 2 CUs if they felt inclined and not have to worry drastically about heat management and hot spots becoming a real concern. If it were me, I'd probably approach it like I described, trying to keep heat distribution as even as possible when cutting CUs down. SKU differentiation is really the biggest concern, though. I don't think they are going to slice these up 50 ways to kingdom come, unless they were trying to stir up a bit of a bidding contract war between the AIBs for slightly better-binned die SKUs in rather finely incremental steps. I suppose it could happen, but it depends on the added time and cost to sort through all that.
 
I wonder where you get the info on the console 5nm refreshes. Do you have any source, or are you just guessing? Sony made it clear there will be no refreshes this generation, at least, and there is no leak or hint of that yet. If any of that comes, it will most probably be well after RDNA3.
No source, but every single console generation since the PS1 has had some sort of refresh. I'm not talking about the new tier, mid-generation upgrades that we saw with the current generation. Refresh = same specs, new process, smaller, cheaper die with lower power draw. The PS1 had at least one slim version. The PS2 had at least 2. I don't think the OG Xbox had one, but the 360 had two, and the One had one (the S). The PS3 had at least a couple, and the PS4 had one (the Slim). Given that 5nm is already in volume production today, it stands to reason that it'll be cheap enough in 2-3 years that console makers will want to move to it. Even if the cost per die is the same due to the more advanced process, they'll save on the BOM through lower power draw = smaller PSU and heatsink.

Looking at them, I actually wouldn't expect them to cut that few, realistically, for a few reasons. SKU differentiation is one obvious reason, but the other is heat distribution balance. I'm not sure cutting just 4 CUs in total is really ideal; slices of 2 CUs diagonal from each other on opposite sides of the die kind of makes more sense. That said, AMD has a lot of tech packed into their circuitry these days, with precision boost and granular management over it, so they could probably cut only 2 CUs if they felt inclined and not have to worry drastically about heat management and hot spots becoming a real concern. If it were me, I'd probably approach it like I described, trying to keep heat distribution as even as possible when cutting CUs down. SKU differentiation is really the biggest concern, though. I don't think they are going to slice these up 50 ways to kingdom come, unless they were trying to stir up a bit of a bidding contract war between the AIBs for slightly better-binned die SKUs in rather finely incremental steps. I suppose it could happen, but it depends on the added time and cost to sort through all that.
I didn't say they would be cutting 2 off anything, I said they can cut any number as long as it's 2x something. I.e. 2, 4, 6, 8, 10, 12... Even numbered cuts only, in other words. Nor did I say anything about where they would be cut from - that is either decided by where on the die there are defects, or if there aren't any, whatever is convenient engineering-wise. To quote myself, this is my (very rough and entirely unsourced) guess for the Navi 2 lineup in terms of CUs:
80-72-60-(new die)-48-40-32-(new die)-28-24-20 sounds like a likely lineup to me, which gives us everything down to a 5500 non-XT, with the possibility of 5400/5300 SKUs with disabled memory, lower clocks, etc.
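To make the "even-numbered cuts only" point concrete, here's a throwaway sketch (my own illustration of the constraint, not a leak or roadmap) that lists the CU counts reachable from the full 40-WGP die by fusing off whole WGPs:

```python
# Salvage SKUs are made by disabling whole WGPs, and each WGP carries 2 CUs,
# so every possible CU count derived from the 80-CU die is a multiple of 2.
# Which counts actually ship is pure speculation (see the guessed lineup above).

FULL_WGPS = 40      # Navi 21 full die
CUS_PER_WGP = 2

possible_cu_counts = [(FULL_WGPS - disabled) * CUS_PER_WGP
                      for disabled in range(FULL_WGPS)]

print(possible_cu_counts[:8])        # [80, 78, 76, 74, 72, 70, 68, 66]
print(72 in possible_cu_counts,      # True  (RX 6800 XT style cut)
      60 in possible_cu_counts,      # True  (RX 6800 style cut)
      63 in possible_cu_counts)      # False (odd counts aren't possible)
```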
 
Seen this yet?


Yeah, not sure if it was posted, but AMD put up benchmarks with SAM enabled but no Rage Mode.


Results chop and change a bit, but it gives an idea what to expect.
 
I didn't say they would be cutting 2 off anything, I said they can cut any number as long as it's 2x something. I.e. 2, 4, 6, 8, 10, 12... Even numbered cuts only, in other words. Nor did I say anything about where they would be cut from - that is either decided by where on the die there are defects, or if there aren't any, whatever is convenient engineering-wise. To quote myself, this is my (very rough and entirely unsourced) guess for the Navi 2 lineup in terms of CUs:
I was injecting my thoughts on the 2-CU situation, or twin units, whatever you wish to call or abbreviate them. What I was saying is that it's unlikely AMD would bother with a SKU that differentiates by as few as 2 CUs; to me it seems most probable it would be somewhere between 6 and 12 between two different SKUs at this point. I do see AMD leaning toward cutting fewer CUs where possible, though, and charging a higher premium for better performance. CU count is probably far more important than bandwidth with the current design, since it's needed to take full advantage of the bandwidth available. Much of what happens hinges on the Infinity Cache size and bus width in any future SKUs; even outside VRAM, those change things a fair bit. HBM2 with Infinity Cache for new SKUs with even more CUs is a real possible scenario to consider too, even without changing the bus width: that's tons of extra bandwidth, and more CUs to go along with it. HBM2 is more power-friendly than GDDR6, if I'm not mistaken, and occupies less space, so a bigger chip is quite tangible, though I don't know about the yields of that. That said, they could do 3 lower SKUs initially, then try to build a bigger, higher-CU-count chip with HBM2, in that order, to maximize yields as TSMC's node continues to mature over time. The cost factor would be the concern with HBM2, but it would give better power, bandwidth, and space savings.

Yeah, not sure if it was posted, but AMD put up benchmarks with SAM enabled but no Rage Mode.


Results chop and change a bit, but it gives an idea what to expect.
That's quite interesting: once you drop from 4K to 1440p, RDNA2 performance pulls ahead rapidly relative to Ampere. I'd really like to see AMD add 1080p results to this list of benchmarks. The Infinity Cache seems to flex its benefit the most at lower resolutions in particular, which makes sense given the limited amount of cache to work with; the huge latency reduction and bandwidth increase it provides naturally gets better mileage there. It's actually very much akin to the Intel situation at 1080p for so long with esports high-refresh-rate gaming. I presume these cards are going to sell like hot cakes to that crowd of users, because these cards will scream along nicely at 1080p high refresh rates as far as I'm seeing, relative to the cost. It'll be interesting to see what happens with RTRT at different resolutions. That Infinity Cache seems really effective at lower resolutions.
 
That's quite interesting: once you drop from 4K to 1440p, RDNA2 performance pulls ahead rapidly relative to Ampere. I'd really like to see AMD add 1080p results to this list of benchmarks. The Infinity Cache seems to flex its benefit the most at lower resolutions in particular, which makes sense given the limited amount of cache to work with; the huge latency reduction and bandwidth increase it provides naturally gets better mileage there. It's actually very much akin to the Intel situation at 1080p for so long with esports high-refresh-rate gaming. I presume these cards are going to sell like hot cakes to that crowd of users, because these cards will scream along nicely at 1080p high refresh rates as far as I'm seeing, relative to the cost.
Looking at AMD’s latest performance across resolutions relative to Ampere, it might seem like it doesn’t do well at the higher/highest ones.

It’s not really that the RDNA2 architecture/IC doesn’t scale well across different resolutions, or that it does better at lower ones. It’s the Ampere architecture that doesn’t scale well across resolutions.
And you can see that from benchmarks comparing Turing vs Ampere. Turing and RDNA2 both have a more “normal” scaling across the three well-known resolutions: 1080p, 1440p, and 4K.

Looking at benchmarks of Turing vs Ampere across the three resolutions, you can see that as you go up, Ampere pulls away from Turing, reaching average relative performance gains of around 30% at 4K. But at 1080p that difference is “only” 20%.
It’s a matter of Ampere’s architecture.

Also, this relative comparison (we don’t actually have full benchmarks between Turing and RDNA2) sort of confirms that AMD’s IC, with its high (effective) bandwidth, is working well and delivering on its promise of acting like a real wide bus.
 
Looking at AMD’s latest performance across resolutions relative to Ampere, it might seem like it doesn’t do well at the higher/highest ones.

It’s not really that the RDNA2 architecture/IC doesn’t scale well across different resolutions, or that it does better at lower ones. It’s the Ampere architecture that doesn’t scale well across resolutions.
And you can see that from benchmarks comparing Turing vs Ampere. Turing and RDNA2 both have a more “normal” scaling across the three well-known resolutions: 1080p, 1440p, and 4K.

Looking at benchmarks of Turing vs Ampere across the three resolutions, you can see that as you go up, Ampere pulls away from Turing, reaching average relative performance gains of around 30% at 4K. But at 1080p that difference is “only” 20%.
It’s a matter of Ampere’s architecture.

Also, this relative comparison (we don’t actually have full benchmarks between Turing and RDNA2) sort of confirms that AMD’s IC, with its high (effective) bandwidth, is working well and delivering on its promise of acting like a real wide bus.
AFAIK that is mainly because it's only at 4k (and higher) that you can make any real use of the increased FP32 of Ampere, while at lower resolutions you're bottlenecked by other parts of the arch (which weren't doubled).
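A rough way to see why that matters (my own back-of-the-envelope, not a benchmark): per-frame pixel work grows with resolution while per-frame fixed costs like geometry and CPU/driver overhead largely don't, so 4K simply offers far more parallel FP32 work to fill those doubled units.

```python
# Back-of-the-envelope pixel counts per frame at the three common resolutions.
# Purely illustrative: it only shows how much more parallel per-pixel work
# 4K offers compared to 1080p/1440p, not actual GPU utilisation.

resolutions = {"1080p": (1920, 1080), "1440p": (2560, 1440), "4K": (3840, 2160)}
base = 1920 * 1080

for name, (w, h) in resolutions.items():
    pixels = w * h
    print(f"{name}: {pixels / 1e6:.1f} MPix per frame, "
          f"{pixels / base:.2f}x the pixel work of 1080p")
```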
 
I'll assume you're probably right about Ampere, but as far as resolution scaling is concerned for RDNA2, 1080p will make better use of the bandwidth available than 4K: more frames for the same amount of bandwidth, assuming the CPU can keep pace and the GPU's CUs can keep all that available bandwidth fed well enough. All I know is that, relative to Ampere, the scaling on RDNA2 did noticeably better when the resolution dropped from 4K to 1440p, and I suspect that follows through to 1080p as well, because it wasn't an anomaly from the looks of it at all; across all the tests the gap narrows, or RDNA2 pulls ahead, or pulls away even further. You might be right about Ampere, but the Infinity Cache could be playing a role on top of that, much like an SSD with overprovisioning: at a lower resolution you'll have more Infinity Cache "overprovisioning" to work with, so to speak.
 
I guess this “issue” will be cleared up as benchmarks go public with all architectures in them at all resolutions.
 
I'm confusing myself trying to think about it now, honestly. I get what you're saying about Ampere, but at the same time the Infinity Cache is drastically better on bandwidth and I/O. At lower resolutions it could come into play more, in terms of having a readily obvious impact on frame rate over a given time frame, if the CPU's and GPU's other requirements can still pull their weight accordingly. I need to see a clearer picture of what's happening and an understanding of why. I'm sure "Tech Jesus" at Gamer's Nexus will explain it all in over-provisioned deep analysis.

AFAIK that is mainly because it's only at 4k (and higher) that you can make any real use of the increased FP32 of Ampere, while at lower resolutions you're bottlenecked by other parts of the arch (which weren't doubled).
Honestly, while that perhaps contributes, the Infinity Cache certainly works out to a 2.17x bandwidth increase, with a 108.5% I/O improvement or 54.25% reduced latency in essence, which is more pronounced than adjusting for more FP32 workloads rather than FP16, for example. I think the Ampere aspect comes into play as well, but perhaps the Infinity Cache is the bigger element, unless I'm way off base in my assessment of the situation.
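For what it's worth, here's the toy math I'm leaning on (my own simplification, not AMD's methodology): if a fraction of memory traffic hits the on-die cache and never touches GDDR6, and the cache itself isn't the limit, the effective bandwidth multiplier is roughly 1 / (1 - hit rate). The quoted 2.17x happens to fall out of that at a hit rate of about 54%.

```python
# Crude effective-bandwidth model for a big last-level cache (my assumption,
# not AMD's math): traffic served from the cache never touches GDDR6, so if
# the cache isn't the bottleneck, effective_bw ~= raw_bw / (1 - hit_rate).

raw_bw_gbs = 16 * 256 / 8            # 16 Gbps GDDR6 on a 256-bit bus = 512 GB/s

def effective_multiplier(hit_rate: float) -> float:
    return 1.0 / (1.0 - hit_rate)

for hit_rate in (0.40, 0.54, 0.70):
    mult = effective_multiplier(hit_rate)
    print(f"hit rate {hit_rate:.0%}: ~{mult:.2f}x -> "
          f"~{raw_bw_gbs * mult:.0f} GB/s effective")
```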
 
Honestly, while that perhaps contributes, the Infinity Cache certainly works out to a 2.17x bandwidth increase, with a 58.5% I/O latency reduction in essence, which is more pronounced than adjusting for more FP32 workloads rather than FP16, for example. I think the Ampere aspect comes into play as well, but perhaps the Infinity Cache is the bigger element, unless I'm way off base in my assessment of the situation.
I was only speaking of how Ampere scales in comparison to Turing. Comparing how a so-far-unreleased architecture with a never-before-seen feature scales to how two other architectures scale ... that's impossible. We know that Ampere does relatively better at 4k than lower resolutions. From what we've seen from AMD so far, the same is not true for RDNA 2 - it seems to scale much more traditionally. But we can't know anything for sure until we have reviews in. Still, AMD's 1440p numbers look quite a lot better when compared to Ampere than their 4k ones do.
 
We sure need a more technical explanation and approach to this new thing. I’m also interested in the more technical parts and details of any technology that comes.

From my simple non-technical (let alone professional) understanding and explanation, I’m thinking that if the IC is truly delivering wide bandwidth (800+bit effective) across different workload levels (up to 4K, which is more common than 8K) and scales well across them, then the real bottleneck for any better performance is, as you also stated directly or indirectly, the cores of the GPU and its surrounding I/O. And if that’s really true, they’ve managed to remove the bandwidth bottleneck completely, up to 4K at least.

It’s radical! But also not a reinvention of the wheel. I can’t believe that nVidia’s engineers haven’t thought of such an implementation. But I can compare nVidia’s approach to Intel’s. AMD has taken steps in the CPU world toward a unified arch with chiplets that scales really well from just 1 to a large number of them. With its cons.

Intel does not do that, but rather was always betting on a stronger arch in its cores, which couldn’t scale well beyond a certain number. Today nVidia’s approach is doing the same in reverse: it performs better on heavy workloads but does not scale well on lighter ones.

nVidia can’t implement such a large cache because it doesn’t have room for it in its arch, which is occupied by Tensor and RT cores. That’s why they need the super-high-speed 6X VRAM to keep feeding the CUDA cores with data.
In a far-fetched sense, you could say that AMD’s arch (both CPU and GPU) is more open and nVidia’s more closed and proprietary. Also, RDNA in general is more of a gaming approach, and Ampere (starting with Turing) is more of a workload one that can do well in loads other than gaming, like GCN, which was really strong outside gaming.

Rumors say that the next RDNA3 will be closer to the Zen 2/3 approach: chunks of cores/dies tied together with large pools of cache.
That’s why I believe it will not come soon. It will be way more than a year.
 
From my simple non-technical (let alone professional) understanding and explanation, I’m thinking that if the IC is truly delivering wide bandwidth (800+bit effective) across different workload levels (up to 4K, which is more common than 8K) and scales well across them, then the real bottleneck for any better performance is, as you also stated directly or indirectly, the cores of the GPU and its surrounding I/O. And if that’s really true, they’ve managed to remove the bandwidth bottleneck completely, up to 4K at least.

Okay, people tend to think of bandwidth as a constant thing (I'm always pushing 18Gbps or whatever the hell it is) at all times, and that if I'm not pushing the most amount of data at all times the GPU is going to stall.

The reality is that only a small subset of data is all that necessary for keeping the GPU fed so it doesn't stall. The majority of the data (in a gaming context anyway) isn't anywhere near as latency sensitive and can be much more flexible about when it comes across the bus. IC helps by doing two things. It:
A: Stops writes and subsequent retrievals from going back out to general memory for the majority of that data (letting it live in cache, where a shader is likely going to retrieve that information from again), and
B: Helps act as a buffer for further deprioritising data retrieval, letting likely-needed data be retrieved early, held momentarily in cache, then ingested into the shader pipeline rather than written back out to VRAM.

As for Nvidia, yep, they would have, but the amount of die space being chewed up by even 128MB of cache is pretty ludicrously large. AMD has balls chasing such a strategy tbh (which is probably why we saw 384-bit engineering sample cards earlier in the year; if IC didn't perform, they could fall back to a wider bus).
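To illustrate point A in toy form (all the numbers here, the 64 KB "tile" size, the random access pattern, the 128 MB capacity, are made up for illustration and are not a model of real RDNA 2 behaviour):

```python
# Toy sketch of point A above: recently written/read tiles get re-used from the
# on-die cache instead of going back out to VRAM.

from collections import OrderedDict
import random

CACHE_LINES = 128 * 1024 * 1024 // (64 * 1024)   # 128 MB of 64 KB "tiles"
cache = OrderedDict()                             # LRU order: oldest first
vram_transfers = served_on_die = 0

random.seed(0)
for _ in range(200_000):
    tile = random.randint(0, 4095)                # working set ~256 MB of tiles
    if tile in cache:
        cache.move_to_end(tile)                   # hit: refresh LRU position
        served_on_die += 1
    else:
        vram_transfers += 1                       # miss: real VRAM traffic
        cache[tile] = True
        if len(cache) > CACHE_LINES:
            cache.popitem(last=False)             # evict least recently used

print(f"served on-die: {served_on_die / (served_on_die + vram_transfers):.0%}")
```

With a working set not much bigger than the cache, a large share of traffic stays on-die; blow well past 128 MB and the benefit shrinks, which is presumably part of why the gains look resolution-dependent.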
 
I'm confusing myself trying to think about it now, honestly. I get what you're saying about Ampere, but at the same time the Infinity Cache is drastically better on bandwidth and I/O. At lower resolutions it could come into play more, in terms of having a readily obvious impact on frame rate over a given time frame, if the CPU's and GPU's other requirements can still pull their weight accordingly. I need to see a clearer picture of what's happening and an understanding of why. I'm sure "Tech Jesus" at Gamer's Nexus will explain it all in over-provisioned deep analysis.

Honestly, while that perhaps contributes, the Infinity Cache certainly works out to a 2.17x bandwidth increase, with a 108.5% I/O improvement or 54.25% reduced latency in essence, which is more pronounced than adjusting for more FP32 workloads rather than FP16, for example. I think the Ampere aspect comes into play as well, but perhaps the Infinity Cache is the bigger element, unless I'm way off base in my assessment of the situation.
I think this also encapsulates the gist of it somewhat.
Prior to this, AMD struggled with instruction pipeline functions. Successively, they streamlined the pipeline operation flow, dropped instruction latency to 1, and started implementing dual-issued operations. That, or I don't know how they could increase shader speed by 7.9x by implementing simple progressions to the same architecture.

As for Nvidia, yep, they would have, but the amount of die space being chewed up by even 128MB of cache is pretty ludicrously large. AMD has balls chasing such a strategy tbh (which is probably why we saw 384-bit engineering sample cards earlier in the year; if IC didn't perform, they could fall back to a wider bus).
And remember, this is only because they had previously experimented with it; otherwise there would be no chance that they knew first-hand how much power budget it would cost them. SRAM has a narrow efficiency window.
There was a piece a while back which compared AMD's and Intel's cell-to-transistor ratios, with the summary being that AMD had integrated higher and more efficient transistor-count units, all because of the available die space.
 
In case anyone missed it.:roll::roll:

 