Tuesday, October 20th 2020

AMD Radeon RX 6000 Series "Big Navi" GPU Features 320 W TGP, 16 Gbps GDDR6 Memory

AMD is preparing to launch its Radeon RX 6000 series of graphics cards, codenamed "Big Navi", and we are getting more and more leaks about the upcoming cards. Set for an October 28th launch, the Big Navi GPU is based on the Navi 21 revision, which comes in two variants. Citing his sources, Igor Wallossek of Igor's Lab has published some information about the upcoming graphics card release. More specifically, there are more details about the Total Graphics Power (TGP) of the cards and how it is used across the board (pun intended). To clarify, TDP (Thermal Design Power) is a measurement applied only to the chip, or die, of the GPU and how much thermal headroom it has. It doesn't measure the power of the whole graphics card, as there are more heat-producing components on the board.

The breakdown of the Navi 21 XT graphics card goes as follows: 235 Watts for the GPU alone, 20 Watts for Samsung's 16 Gbps GDDR6 memory, 35 Watts for voltage regulation (MOSFETs, inductors, capacitors), 15 Watts for fans and other components, and 15 Watts used up by the PCB and the losses found there. This puts the combined TGP at 320 Watts, showing just how much power is used by the non-GPU elements. For custom OC AIB cards, the TGP is boosted to 355 Watts, as the GPU alone is using 270 Watts. When it comes to the Navi 21 XL GPU variant, the cards based on it are using 290 Watts of TGP, as the GPU sees a reduction to 203 Watts, and the GDDR6 memory uses 17 Watts. The non-GPU components found on the board use the same amount of power.
When it comes to the selection of memory, AMD uses Samsung's 16 Gbps GDDR6 modules (K4ZAF325BM-HC16). The bundle AMD ships to its AIBs contains 16 GB of this memory paired with the GPU core. However, AIBs are free to use different memory if they want to, as long as it is a 16 Gbps module. You can see the tables below for the TGP breakdown of each card.
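
The breakdown quoted above can be tallied with a short sketch. The dictionary names and the `summed_tgp` helper are made up for illustration, and the 20 W memory figure for the custom OC card is an assumption, since the article only states its GPU power. The listed XL components add up to 285 W, 5 W short of the quoted 290 W, and the article does not say where the difference goes.

```python
# Board-level power shared by all variants, as listed in the article (Watts).
SHARED_BOARD_W = {
    "voltage regulation": 35,
    "fans and other": 15,
    "PCB losses": 15,
}

# Per-variant figures from the article (Watts).
# Assumption: the custom OC card keeps the XT's 20 W memory figure.
CARDS = {
    "Navi 21 XT": {"gpu": 235, "memory": 20, "quoted_tgp": 320},
    "Navi 21 XT (custom OC)": {"gpu": 270, "memory": 20, "quoted_tgp": 355},
    "Navi 21 XL": {"gpu": 203, "memory": 17, "quoted_tgp": 290},
}


def summed_tgp(card):
    """Add GPU, memory and the shared board components for one card."""
    return card["gpu"] + card["memory"] + sum(SHARED_BOARD_W.values())


totals = {name: summed_tgp(card) for name, card in CARDS.items()}

for name, card in CARDS.items():
    gap = card["quoted_tgp"] - totals[name]
    print(f"{name}: summed {totals[name]} W, quoted {card['quoted_tgp']} W, gap {gap} W")
```

With these inputs, the two XT rows match the quoted totals and only the XL row shows a gap.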
Sources: Igor's Lab, via VideoCardz

153 Comments on AMD Radeon RX 6000 Series "Big Navi" GPU Features 320 W TGP, 16 Gbps GDDR6 Memory

#151
mtcn77
dragontamer5788: Really, AMD needs to put out a new optimization guide that contains information like this (which they haven't written one since the 7950 series
Thank you for some very valuable insight. It is all a game to me, however it is a learning opportunity nonetheless.
dragontamer5788: If you are shader-launch constrained, it isn't a big deal to have a for(int i=0; i<16; i++){} statement wrapping your shader code. Just loop your shader 16 times before returning.
I'm intrigued, this trains up the L2 caches, I estimate?
What I find generally lacking is, to put it very simply, an easy demonstration of what the workloads are compared with what they could have been.
Suppose there are 64 CUs, or let's just say 80 CUs for the sake of the latest series. According to the 'engine optimisation hot lap' guideline, the CUs start up one by one as they are issued work. By Gauss's summation, that is on average 40.5 CUs not working during the first 80 cycles. We could take it either as a 50.6% duty cycle over 80 cycles of latency, or as a static 40.5 cycles of latency at the start of every GPU workflow. The issue is what we could do with the hardware if we directed the GPU power budget differently. If we instructed the GPU to 'load' but not do any work, we could keep loading not just for all 80 CUs, but for each of the 80 CUs times 40 waves per CU. If the GPU is running at 2.5 GHz, that is 2‰ of the GPU time! There is a giant window of opportunity in which the GPU could be exempted from any real shader work and just track the instruction flow to prepare the shaders for operation.
It is crazy, but I think Nvidia won't let AMD rest on its laurels, if they don't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array and the GPU waits for 5 MHz until all waves are loaded, then it releases them and off it goes! I know there are kinks. I just don't know any better. :oops:
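
The ramp-up arithmetic in the comment above can be checked with a minimal sketch. It assumes the simplified model the comment describes, where CU number k receives its first work at cycle k. The function name and that model are illustrative only, not a description of how real hardware schedules waves.

```python
def ramp_up_idle(num_cus=80):
    """Average idle time during a one-CU-per-cycle ramp-up.

    Assumption: CU k (1-based) is issued its first work at cycle k,
    so it idles for k cycles of the ramp-up window.
    """
    idle_cycles = list(range(1, num_cus + 1))
    total_idle = sum(idle_cycles)
    # Gauss's closed form for 1 + 2 + ... + n.
    assert total_idle == num_cus * (num_cus + 1) // 2
    avg_idle = total_idle / num_cus      # average idle cycles per CU
    idle_fraction = avg_idle / num_cus   # share of the ramp-up window spent idle
    return avg_idle, idle_fraction


avg_idle, idle_fraction = ramp_up_idle(80)
wave_slots = 80 * 40  # 80 CUs times 40 waves per CU, as in the comment

print(f"average idle cycles per CU: {avg_idle}")
print(f"idle share of the 80-cycle window: {idle_fraction:.1%}")
print(f"wave slots to fill: {wave_slots}")
```

This reproduces the 40.5 and 50.6% figures and gives 3200 wave slots. It does not attempt to reproduce the 2‰ estimate, which depends on a time base the comment does not spell out.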
#152
mahirzukic2
If this is really possible, it would be really awesome.
#153
mtcn77
mahirzukic2: If this is really possible, it would be really awesome.
Yes, just power gate them until they are ready for full operation with no delay, since they indicate that keeping pipelines full, rather than emptying them, is already an established problem. If it helps, turning off the shader array could also raise the overclock ceiling, which speeds up the idle recovery as well.
Funny thing is, the RGP capture looks like a tapered trapezoid at the far end of the timeline, so they ought to work on the retiring speed too.
I still don't get it. All thread blocks are limited to a size of 1024. Even an AI could map out all possible permutations of a 1024-unit workgroup. They aren't trying hard enough. Haven't they ever played any StarCraft? Build orders are everything. Just 4pool, gg wp.