RDNA 2.0/Big Navi will be plentiful at launch, as confirmed by an AMD employee

And you guys wouldn't believe me @R-T-B, such a noob I was...

It's not that I don't/didn't believe you, it's that I don't believe anyone until the facts are present.

So, you guessed right? Woohoo, you still guessed.
 
If these are as good as some of the hopeful seem to think they will be... I wonder if they will last more than a second (and whether the sites will be attacked by bots).

Oh dear... did you post news before TPU did? Shame on you............ lolol :rolleyes: lolol
 
512 GB/s... good luck with that.

I'm struggling to see how AMD is going to push past a 3070 with the same bandwidth...
 
I mean, they do beat the pants off Intel for most tasks not called "gaming" with their secret-sauce SMT, so it's not like they need to do anything revolutionary, do they? They just have to take a better/more efficient approach than Nvidia; part of it could include huge caches, or perhaps something else!
 
Oh dear... did you post news before TPU did? Shame on you............ lolol :rolleyes: lolol

Shit that stuff was available all weekend when the news crew takes off.

so it's not like they need to do anything revolutionary, do they? They just have to take a better/more efficient approach than Nvidia; part of it could include huge caches, or perhaps something else!

They have been trying to do that for the last 6 years and have fallen fairly short. My gut says they won't surpass a 3080 but will be close enough again. They lucked out in that they can punt on power efficiency this time.
 
512 GB/s... good luck with that.

I'm struggling to see how AMD is going to push past a 3070 with the same bandwidth...
Navi 21 is between the 3080 and 3090 with less power consumption. Navi 22 has more performance than the 3070, also with less power consumption.
 
512 GB/s... good luck with that.

I'm struggling to see how AMD is going to push past a 3070 with the same bandwidth...

RDNA really has a much better memory structure than GCN. I can imagine that AMD will work on making it even better on RDNA2.

But you're right, this is an area that AMD has traditionally done much worse than NVidia in. It's why AMD needed 1 TB/s on Radeon VII or 500 GB/s on Vega to keep up in the last few generations. Still, given how good the 5700XT is in synthetics, I think AMD is taking good steps forward with RDNA in general.
 
RDNA really has a much better memory structure than GCN.
Is it? I thought doubling the number of schedulers did the deed. Console developers were seeing huge uplifts when coding for latency rather than throughput. What RDNA did was halve the number of shaders per thread set and double the number of schedulers. Now they can push 2 scalar thread operations instead of one.
Scalars allow less than the whole SIMD workload to be retired every cycle; they don't have to wait for the whole SIMD operation to finish on all items (+4 cycles). Although scheduling has changed altogether, now that everything is issued at +1 cycle.
It looks like AMD just changed the default GPU compiler to do work based on the scalar execution mask instead of the vector mask every cycle. I'm sorry if that sounds too stupid to be true. However, I do remember scalars were still every cycle on GCN; the +4 cycle instruction latency was for the vector mask, I believe.

Compute shaders changed and enabled a lot of things. Previously, fixed-function units had their own cache and did their own thing.
That created a problem in the scheduling department. 4 cycles of latency is a lot. What's more, the hardware paid for that by having fewer register instances if different pointers were loaded.
The way I understand it, it didn't matter to the hardware whether 4 SIMD16 units accessed 4 different wavefronts or 1 wavefront. It still took +4 cycles to retire them, when it could do it in 1 in manual compute mode. That is what AMD had to convince developers to do before; the default compiler didn't do this on its own, since it needs to be toggled once the data is loaded. Otherwise, it is not for memory set management.

If you retired stuff faster, the register file would pick up speed, running more wavefronts simultaneously and not waiting on the common memory writeback interval.
I think they set its granularity according to the memory byte-access pattern and forgot to tune it according to the GPU execution pattern.
 
Is it? I thought doubling the number of schedulers did the deed. Console developers were seeing huge uplifts when coding for latency rather than throughput. What RDNA did was halve the number of shaders per thread set and double the number of schedulers. Now they can push 2 scalar thread operations instead of one.
Scalars allow less than the whole SIMD workload to be retired every cycle; they don't have to wait for the whole SIMD operation to finish on all items (+4 cycles). Although scheduling has changed altogether, now that everything is issued at +1 cycle.
It looks like AMD just changed the default GPU compiler to do work based on the scalar execution mask instead of the vector mask every cycle. I'm sorry if that sounds too stupid to be true. However, I do remember scalars were still every cycle on GCN; the +4 cycle instruction latency was for the vector mask, I believe.

This helped.

It looks like it improved shader execution,

which is what I'm pretty sure Ampere is trying to do vs Turing.
 
which is what I'm pretty sure Ampere is trying to do vs Turing.
Nvidia stuff? That is hard. Especially since I don't understand how they partition 'live' threads. I get it on AMD: the execution mask doesn't let a wavefront through unless all flags are cleared. On Nvidia, the way they can resection stuff sounds like brain surgery.
[attached chart]

The person who charted that graph looks like they're missing some landmarks.
 
Is it? I thought doubling the number of schedulers did the deed. Console developers were seeing huge uplifts when coding for latency rather than throughput. What RDNA did was halve the number of shaders per thread set and double the number of schedulers. Now they can push 2 scalar thread operations instead of one.
Scalars allow less than the whole SIMD workload to be retired every cycle; they don't have to wait for the whole SIMD operation to finish on all items (+4 cycles). Although scheduling has changed altogether, now that everything is issued at +1 cycle.
It looks like AMD just changed the default GPU compiler to do work based on the scalar execution mask instead of the vector mask every cycle. I'm sorry if that sounds too stupid to be true. However, I do remember scalars were still every cycle on GCN; the +4 cycle instruction latency was for the vector mask, I believe.

Compute shaders changed and enabled a lot of things. Previously, fixed-function units had their own cache and did their own thing.
That created a problem in the scheduling department. 4 cycles of latency is a lot. What's more, the hardware paid for that by having fewer register instances if different pointers were loaded.
The way I understand it, it didn't matter to the hardware whether 4 SIMD16 units accessed 4 different wavefronts or 1 wavefront. It still took +4 cycles to retire them, when it could do it in 1 in manual compute mode. That is what AMD had to convince developers to do before; the default compiler didn't do this on its own, since it needs to be toggled once the data is loaded. Otherwise, it is not for memory set management.

If you retired stuff faster, the register file would pick up speed, running more wavefronts simultaneously and not waiting on the common memory writeback interval.
I think they set its granularity according to the memory byte-access pattern and forgot to tune it according to the GPU execution pattern.

1. RDNA has 3 levels of cache: L0, L1, and L2.

L1 and L2 were from GCN: L2 shared across all CUs, L1 shared across 4 CUs at a time. L0 finally adds a cache per CU. Yeah, GCN didn't have per-CU caches, crazy.

2. RDNA can schedule reads and writes out of order of each other. This is because "S_WAITCNT VM_CNT" and "S_WAITCNT VS_CNT" (vector-memory count and vector-store count) are split up, so stores can "float" out of order of the VM_CNT. GCN only had "VM_CNT" for both loads and stores, reducing the ability of the CUs to float memory operations out of order.

3. Bigger caches: the usual. Caches always improve whenever there's a process advancement, so this was to be expected. #1 and #2 are the bigger architectural differences.
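If it helps, here's a toy Python sketch of point #2 (my own illustration of the idea, not real ISA semantics or hardware behaviour): with one shared counter, "wait for my loads" drags the stores in with it; with split counters, the stores can keep floating.

```python
# Toy model: GCN-style shared counter vs RDNA-style split load/store counters.
# Purely illustrative; not how the hardware is actually implemented.

class MemCounters:
    def __init__(self, split_counters):
        self.split = split_counters
        self.vm_cnt = 0   # outstanding vector loads (plus stores when not split)
        self.vs_cnt = 0   # outstanding vector stores (only used when split)

    def issue_load(self):
        self.vm_cnt += 1

    def issue_store(self):
        if self.split:
            self.vs_cnt += 1
        else:
            self.vm_cnt += 1

gcn = MemCounters(split_counters=False)
rdna = MemCounters(split_counters=True)
for cu in (gcn, rdna):
    cu.issue_store()   # slow store whose result nobody is waiting on
    cu.issue_load()    # load whose result the next instruction needs

# To consume the load result, the shader waits for vm_cnt to drain.
# GCN: the store also sits in vm_cnt, so the wait covers it too.
# RDNA: the store sits in vs_cnt and can keep "floating" past the wait.
print("GCN  vm_cnt:", gcn.vm_cnt, "vs_cnt:", gcn.vs_cnt)    # 2 0
print("RDNA vm_cnt:", rdna.vm_cnt, "vs_cnt:", rdna.vs_cnt)  # 1 1
```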

What's more, the hardware paid for that by having fewer register instances if different pointers were loaded.

RDNA has 512 VGPRs per CU now and 20 wavefronts per CU.

Compared to 256 VGPRs per vALU with 10 wavefronts, RDNA is "tied" with GCN with regard to register sets. (Yeah, RDNA "lost half" its wavefronts by switching to Wave32, but doubling the register count almost certainly "brings them back" and counteracts the issue.)
 
(Yeah, RDNA "lost half" its wavefronts by switching to Wave32,
Don't look at it that way. It issues a 32-item SIMD32 at once now. Previously, it issued 16-item SIMD16s, so it didn't care whether you accessed 4 different 16/64-item wavefronts consecutively or 1 single 64/64 wavefront. When you look at it, it makes sense: since both halves of the old wavefront are free of each other, it doesn't take 2 cycles but one, whereas 4 SIMD units issued one after the other incurred a 4-cycle penalty to retire either wavefront.
It is crazy how things could be like that. I can recall sebbbi mentioning a 5-way split giving memory accesses isolation from the runtime. It must be tight having to code for 48 VGPRs to keep memory accesses unimpeded...
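Quick back-of-envelope in Python on that point, using the commonly cited wave64-on-SIMD16 vs wave32-on-SIMD32 figures (my own arithmetic, not a vendor number). Lanes per clock per "CU" stay the same; what drops is the per-wave issue/retire latency:

```python
# GCN: a 64-item wave is fed through a 16-lane SIMD -> 4 cycles per instruction issue.
gcn_cycles_per_wave_instr = 64 // 16      # 4

# RDNA: a 32-item wave matches the 32-lane SIMD -> 1 cycle per instruction issue.
rdna_cycles_per_wave_instr = 32 // 32     # 1

# Raw lane throughput per "CU" is unchanged: 4x SIMD16 vs 2x SIMD32 = 64 lanes/clock.
gcn_lanes_per_clock  = 4 * 16
rdna_lanes_per_clock = 2 * 32

print(gcn_cycles_per_wave_instr, rdna_cycles_per_wave_instr)   # 4 1
print(gcn_lanes_per_clock, rdna_lanes_per_clock)               # 64 64
```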

I always wondered when there would be an AI for those kinds of linear software-tuning jobs.
 
Don't look at it that way. It issues a 32-item SIMD32 at once now. Previously, it issued 16-item SIMD16s, so it didn't care whether you accessed 4 different 16/64-item wavefronts consecutively or 1 single 64/64 wavefront. When you look at it, it makes sense: since both halves of the old wavefront are free of each other, it doesn't take 2 cycles but one, whereas 4 SIMD units issued one after the other incurred a 4-cycle penalty to retire either wavefront.
It is crazy how things could be like that. I can recall sebbbi mentioning a 5-way split giving memory accesses isolation from the runtime. It must be tight having to code for 48 VGPRs to keep memory accesses unimpeded...

My perspective was from a "number of vGPR bits on the CU" angle. Which, btw, I got the numbers wrong on earlier. Looking at the RDNA ISA again, there are 1024 VGPRs per WGP. But my overall point is correct :) Let me explain my point more carefully (and with correct numbers this time!!).

A GCN CU has 4x vALUs. (I'm going to ignore the sALU on both GCN and RDNA for simplicity.) Each vALU had 256 vGPRs for 64 threads. Only 16 of these threads ever executed at a time, but all 4 vALUs x 256 VGPRs x 32-bit x 64-thread registers had to exist in memory. That's 2 million bits per GCN CU.

Let's perform a similar calculation for RDNA. RDNA is organized as WGPs (a "dual compute unit"), with 4 vALUs. Normalizing against GCN, there are only 2 vALUs (one WGP is roughly comparable to 2x CUs from GCN). That is to say, an RX 580 (40 CUs) is upgraded to a 5700 XT (20 WGPs, sold as a "40 CU" card). For RDNA: that's 2 vALUs x 1024 vGPRs x 32-bits x 32-threads == 2 million bits per "half WGP".

For both GCN and RDNA, there are "2 million VGPR bits per CU" (whether it's an "old GCN CU" or a "new RDNA CU" - half a WGP).
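Multiplying both sets of numbers out (just the arithmetic as stated above):

```python
# "2 million VGPR bits per CU", checked with the numbers from the two paragraphs above.
gcn_bits  = 4 * 256  * 32 * 64   # 4 vALUs x 256 vGPRs x 32-bit x 64 threads (GCN CU)
rdna_bits = 2 * 1024 * 32 * 32   # 2 vALUs x 1024 vGPRs x 32-bit x 32 threads (half a WGP)
print(gcn_bits, rdna_bits)       # 2097152 2097152 -> about 2 million bits each
```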

I always wondered when there would be an AI for those kinds of linear software-tuning jobs.

To steal your phrase: Don't look at it that way :)

* Kernel A uses 70 vGPRs
* Kernel B uses 200 vGPRs
* Kernel C uses 400 vGPRs.

That means you can run 1x Kernel C (400 vGPRs) + 2x Kernel B (400 vGPRs) + 3x Kernel A (210 vGPRs) per RDNA SIMD (4x SIMDs per WGP). Dispatching GPU code is basically a "malloc for VGPRs".

Once KernelC completes, you can fill up the remaining space with 5x KernelA (for a total of 8x KernelA running).
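Here's a tiny Python sketch of that "malloc for VGPRs" idea using the kernel sizes above. The 1024-register budget per SIMD and the greedy refill are just my illustration of the occupancy math, not the actual hardware scheduler:

```python
VGPRS_PER_SIMD = 1024   # assumed register budget per RDNA SIMD for this example

def try_launch(resident, need):
    """Launch one wave if enough vGPRs are still free on this SIMD."""
    if sum(resident) + need <= VGPRS_PER_SIMD:
        resident.append(need)
        return True
    return False

simd = []
try_launch(simd, 400)                 # 1x Kernel C
for _ in range(2):
    try_launch(simd, 200)             # 2x Kernel B
for _ in range(3):
    try_launch(simd, 70)              # 3x Kernel A
print(sum(simd))                      # 1010 of 1024 vGPRs allocated

simd.remove(400)                      # Kernel C finishes; its registers free up
while try_launch(simd, 70):           # refill the hole with more Kernel A waves
    pass
print(simd.count(70))                 # 8 copies of Kernel A now resident
```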

---------

Damn it, it looks like I misremembered a few things. It was the Scalar-L1 cache that changed, not the vGPR-L1 cache. Whatever, here's the answer (from https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf) :

[screenshot of the relevant slide from the linked RDNA architecture deck]
 
That means you can run 1x Kernel C (400 vGPRs) + 2x Kernel B (400 vGPRs) + 3x Kernel A (210 vGPRs) per RDNA SIMD (4x SIMDs per WGP). Dispatching GPU code is basically a "malloc for VGPRs".
Not run, "load".
Running them takes much more time; in fact, just as much time as necessary. Just think how fast 10 waves per SIMD16 would retire on GCN (about 640 work items when all 4 are at the same instruction) in its runtime: 3240 cycles for a full-HD framebuffer. Comparatively, having just '1' will increase the frame composition time tenfold.
That is what changed between the generations, imo. Developers did care, but up until RDNA AMD didn't care how long the code took to finish.
They should give a rough estimate on task-lifetime benchmarks.
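For reference, here's where the 3240 figure above seems to come from (my reading of the arithmetic, not an official number):

```python
# One full-HD buffer split into 640-item batches, one batch retired per step.
items_full_hd = 1920 * 1080     # 2,073,600 work items
batch = 10 * 64                 # 10 wave64s = 640 work items
print(items_full_hd / batch)    # 3240.0
```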
 
Navi 21 is between the 3080 and 3090 with less power consumption. Navi 22 has more performance than the 3070, also with less power consumption.

Ah, now I see what you mean. Of course! o_O:laugh::kookoo: Thanks for explaining things so clearly, great evidence to support that claim as well. Mind if I nag you with it for the duration of 2021 when it all turns out 'not quite' like that?

RDNA really has a much better memory structure than GCN. I can imagine that AMD will work on making it even better on RDNA2.

But you're right, this is an area that AMD has traditionally done much worse than NVidia in. It's why AMD needed 1 TB/s on Radeon VII or 500 GB/s on Vega to keep up in the last few generations. Still, given how good the 5700XT is in synthetics, I think AMD is taking good steps forward with RDNA in general.

It's not even about being worse than Nvidia. Even if they get the same efficiency out of it as Nvidia, they're still stuck right at 3070 level, and not above it, with 512 GB/s.

Steps forward are fine, but what I'm seeing is that Navi won't just do on the PC what it's made to do on consoles. There is some magical cache... and yet no word or live preview of its performance advantages... smoke > fire... This is supposed to release this fall, right? Not that Nvidia's approach is much different, btw; RTX IO is of a similar nature, I believe, and in a similar state.
 
Ah, now I see what you mean. Of course! Thanks for explaining things so clearly, great evidence to support that claim as well. Mind if I nag you with it for the duration of 2021 when it all turns out 'not quite' like that?
You don’t follow the news? Or you don’t like the fact that AMD is in the high-end (3080/3090 level)?
You can buy the 3080, for example, but then you can’t game :laugh: it only crashes.
 
You don’t follow the news? Or you don’t like the fact that AMD is in the high-end (3080/3090 level)?
You can buy the 3080, for example, but then you can’t game :laugh: it only crashes.
I've had an RTX 3080 since the day after launch day and not had a single crash, and as for AMD matching/exceeding 3080 performance, time will tell; you say it as if they're already there, and the specific products aren't even announced yet.
 
You can buy the 3080, for example, but then you can’t game :laugh: it only crashes.

Yeah, it only happens when you push the GPU too hard; if left to its own devices there shouldn't be any CTDs, which really isn't nVidia's fault but the fault of the board partners going cheap on the power-filtering side by choosing not to stick to nVidia's reference design.

[image: Founders Edition board capacitor layout]

This is the minimum nVidia said to go with: 4x POSCAPs and 2x 10-MLCC banks.

[image: Zotac Trinity board capacitor layout]

Instead, this is what a lot of board partners did, and these are the cards that are suffering from CTDs.

I've had an RTX 3080 since the day after launch day and not had a single crash, and as for AMD matching/exceeding 3080 performance, time will tell; you say it as if they're already there, and the specific products aren't even announced yet.

You bought the card done the right way, with an all-MLCC filtering setup:

[image: ASUS TUF board capacitor layout]
 
You don’t follow the news? Or you don’t like the fact that AMD is in the high-end (3080/3090 level)?
You can buy the 3080, for example, but then you can’t game :laugh: it only crashes.

The 'news', you say? The AMD rumor mill does not appear in any half-serious newspaper, site, or tech outlet with half a brain. And if it does, it's usually a mistake.

If your definition of 'news' is some sweaty nerds and waifu lovers on Youtube, you're doing it wrong.

Get a grip
 
sorry little offtopic:
@wolf does your 3080 boost over 2000 MHz? If not, could ya try it plz?
TiA
 
sorry little offtopic:
@wolf does your 3080 boost over 2000 MHz? If not, could ya try it plz?
TiA
This is what it did for W1z...

 
sorry little offtopic:
@wolf does your 3080 boost over 2000 MHz? If not, could ya try it plz?
TiA
It doesn't at stock; I've OC'd it mildly above 2000 with no issue. It just needs so much more power compared to sub-1900 MHz undervolted that it makes little sense to run it there.
 
The 'news', you say? The AMD rumor mill does not appear in any half-serious newspaper, site, or tech outlet with half a brain. And if it does, it's usually a mistake.

If your definition of 'news' is some sweaty nerds and waifu lovers on Youtube, you're doing it wrong.

Get a grip
Yea news, you know what it means? Tech news not Nvidia news.:laugh:
 
Yea news, you know what it means? Tech news not Nvidia news.:laugh:

All of the rumors I have seen slot it between the 3080 and the 3070, closer to the former. Nothing I have seen shows it beating a 3080. Can you link your sources? I would honestly like to see them. I thought I kept up on rumors pretty well.
 