Thank you for some very valuable insight. It is all a game to me, but it is a learning opportunity nonetheless.
I'm intrigued: this warms up the L2 caches, I presume?
What I find generally lacking is, to put it very simply, a clear demonstration of what the workloads actually are compared to what they could have been.
Suppose there are 64 CUs, or let's say 80 CUs for the sake of the latest series. According to the 'engine optimisation hot lap' guideline, the CUs start up one by one as they are issued work. Averaged over that ramp, 40.5 CUs are not working during those 80 cycles (the Gauss sum of 1 through 80, divided by 80). We can read that either as roughly 50.6% of the CU-cycles sitting idle across the 80-cycle window, or as a flat 40.5 cycles of latency placed at the start of every GPU workload.

The real question is what we could do with the hardware if we directed the GPU's power budget differently. If we instructed the GPU to 'load' but not do any work yet, we could keep loading not just for all 80 CUs, but for each of the 80 CUs times 40 waves per CU. With the GPU running at 2.5 GHz, that is 2‰ of the GPU's time! There is a giant window of opportunity in which the GPU can be excused from any real shader work and simply track the instruction flow to prepare the shaders for execution.
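Just to keep my own arithmetic honest, here is the same back-of-the-envelope calculation as a small Python sketch. The ramp model (one CU handed work per cycle, each CU idle until its turn), the 40 waves per CU and the 2.5 GHz clock are simply the assumptions from above, not measured behaviour:

```python
# Back-of-the-envelope figures for the CU ramp-up described above.
# Assumptions: 80 CUs, one CU issued work per cycle, 40 waves per CU, 2.5 GHz.

N_CUS = 80
WAVES_PER_CU = 40
CLOCK_HZ = 2.5e9

# During cycle i (1-based), CUs i..80 are still waiting to be issued work.
idle_per_cycle = [N_CUS - (i - 1) for i in range(1, N_CUS + 1)]
avg_idle = sum(idle_per_cycle) / N_CUS        # Gauss sum: (1 + 80) / 2 = 40.5
idle_fraction = avg_idle / N_CUS              # 40.5 / 80 = 50.625 %

total_wave_slots = N_CUS * WAVES_PER_CU       # 80 * 40 = 3,200 wave slots
load_time_us = total_wave_slots / CLOCK_HZ * 1e6  # one cycle per slot at 2.5 GHz

print(f"average idle CUs over the ramp: {avg_idle}")               # 40.5
print(f"idle share of the 80-cycle window: {idle_fraction:.1%}")   # 50.6%
print(f"wave slots to fill: {total_wave_slots}")                   # 3200
print(f"one cycle per wave slot at 2.5 GHz: {load_time_us:.2f} µs")  # 1.28 µs
```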
It is crazy, but I think Nvidia won't let AMD sit on its laurels if AMD doesn't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array, the GPU waits at 5 MHz until all the waves are loaded, then the mask is released and off it goes! I know there are kinks. I just don't know any better.
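To make the 'hold everything behind the mask, then release' idea concrete, here is a toy comparison. Everything in it is invented for illustration: the pre-load durations (10, 40, 80 cycles) are made-up numbers, and 'idle CU-cycles' is just my own bookkeeping metric, not anything a real driver exposes. The batched release only comes out ahead if the pre-load can run faster than the one-CU-per-cycle trickle or overlap with earlier work:

```python
# Toy model: idle CU-cycles with a staggered ramp vs. a batched release
# behind an execution mask. Pure speculation, not any real GPU's behaviour.

N_CUS = 80

def idle_cu_cycles_staggered(n_cus: int) -> int:
    """CU i waits i cycles before it is issued work (the ramp above)."""
    return sum(range(1, n_cus + 1))  # Gauss: 80 * 81 / 2 = 3,240 CU-cycles

def idle_cu_cycles_batched(n_cus: int, load_cycles: int) -> int:
    """All CUs wait behind the mask for load_cycles, then start together."""
    return n_cus * load_cycles

staggered = idle_cu_cycles_staggered(N_CUS)
for load_cycles in (10, 40, 80):
    batched = idle_cu_cycles_batched(N_CUS, load_cycles)
    print(f"pre-load {load_cycles:2d} cycles: {batched:5d} idle CU-cycles "
          f"(staggered ramp: {staggered})")
```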