
AMD Radeon RX 6000 Series "Big Navi" GPU Features 320 W TGP, 16 Gbps GDDR6 Memory

Joined
Aug 15, 2017
Messages
18 (0.01/day)
Performance per watt did go up on Ampere, but that's to be expected given that Nvidia moved from TSMC's 12 nm to Samsung's 8 nm 8LPP, a 10 nm extension node. What is not impressive is an only ~10% performance-per-watt increase over Turing while being built on a roughly 25% denser node. The RDNA2 architecture being on 7 nm+ looks even worse efficiency-wise, given that the density of 7 nm+ is much higher, but let's wait for the actual benchmarks.

Did you literally just completely ignore the chart that was a few posts above you? 100/85 = 117.6%, so that's still a 17.6% improvement in performance per watt over the most efficient Turing GPU.
 
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
Really, AMD needs to put out a new optimization guide that contains information like this (they haven't written one since the 7950 series).
Thank you for some very valuable insight. It is all a game to me; however, it is a learning opportunity nonetheless.
If you are shader-launch constrained, it isn't a big deal to have a for(int i=0; i<16; i++){} statement wrapping your shader code. Just loop your shader 16 times before returning.
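To sketch what that loop trick might look like in code, here is a hypothetical CUDA-flavoured example, not anything from an actual AMD guide; the kernel body, the names, and the factor of 16 are placeholders:
Code:
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: instead of issuing 16 tiny "shader" launches, one launch
// loops the per-thread work 16 times before returning, so the launch/wave
// start-up cost is paid once instead of 16 times. The arithmetic is a stand-in.
__global__ void shade_looped(float *out, const float *in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float v = in[idx];
    for (int i = 0; i < 16; i++)      // loop the "shader" 16 times before returning
        v = v * 0.99f + 0.01f;        // placeholder for the real per-thread work
    out[idx] = v;
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    shade_looped<<<(n + 255) / 256, 256>>>(out, in, n);   // one launch, 16x the work
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}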
I'm intrigued; this trains up the L2 caches, I presume?
What I find generally lacking is, to put it very simply, an easy demonstration of what the workloads actually are compared with what they could have been.
Suppose we say there are 64 CUs - let's call it 80 CUs for the sake of the latest series - and, per the 'engine optimisation hot lap' guideline, the CUs start up one by one to be issued work. That works out to an average of 40.5 CUs not working over the next 80 cycles when we calculate it with the Gauss summation method. We could either take it as 50.6% duty over an 80-cycle latency window, or as a flat 40.5 cycles of latency placed at the start of every GPU workload. The question is what we could do with the hardware if we directed our GPU power budget differently. If we instructed the GPU to 'load' but not do any work, we could keep loading in not just all 80 CUs, but each of the 80 CUs times 40 waves per CU. If the GPU is running at 2.5 GHz, that is 2‰ of the GPU time! There is a giant window of opportunity where the GPU can be excused from any real shader work and just track the instruction flow to prepare the shaders for operation.
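For what it's worth, here is how I get those numbers - a back-of-the-envelope sketch of my own reading (80 CUs, one CU handed work per cycle), not an official model:
Code:
#include <cstdio>

int main()
{
    const int cus = 80;                         // one CU issued work per cycle

    // Gauss sum 1 + 2 + ... + 80 = 80 * 81 / 2 = 3240 idle CU-cycles during the ramp
    int idle_cu_cycles = cus * (cus + 1) / 2;

    double avg_idle_cus = (double)idle_cu_cycles / cus;   // 40.5 CUs idle on average
    double idle_share   = avg_idle_cus / cus;             // 0.506 -> the 50.6% figure

    printf("idle CU-cycles over the %d-cycle ramp : %d\n", cus, idle_cu_cycles);
    printf("average idle CUs                      : %.1f\n", avg_idle_cus);
    printf("share of the ramp window              : %.1f%%\n", 100.0 * idle_share);
    return 0;
}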
It is crazy, but I think Nvidia won't let AMD sit on its laurels if they don't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array, the GPU waits at 5 MHz until all waves are loaded, then it releases them and off it goes! I know there are kinks. I just don't know any better. :oops:
 
Joined
Jul 5, 2019
Messages
318 (0.16/day)
Location
Berlin, Germany
System Name Workhorse
Processor 13900K 5.9 Ghz single core (2x) 5.6 Ghz Allcore @ -0.15v offset / 4.5 Ghz e-core -0.15v offset
Motherboard MSI Z690A-Pro DDR4
Cooling Arctic Liquid Cooler 360 3x Arctic 120 PWM Push + 3x Arctic 140 PWM Pull
Memory 2 x 32GB DDR4-3200-CL16 G.Skill RipJaws V @ 4133 Mhz CL 18-22-42-42-84 2T 1.45v
Video Card(s) RX 6600XT 8GB
Storage PNY CS3030 1TB nvme SSD, 2 x 3TB HDD, 1x 4TB HDD, 1 x 6TB HDD
Display(s) Samsung 34" 3440x1400 60 Hz
Case Coolermaster 690
Audio Device(s) Topping Dx3 Pro / Denon D2000 soon to mod it/Fostex T50RP MK3 custom cable and headband / Bose NC700
Power Supply Enermax Revolution D.F. 850W ATX 2.4
Mouse Logitech G5 / Speedlink Kudos gaming mouse (12 years old)
Keyboard A4Tech G800 (old) / Apple Magic keyboard
Thank you for some very valuable insight. It is all a game to me; however, it is a learning opportunity nonetheless.

I'm intrigued; this trains up the L2 caches, I presume?
What I find generally lacking is, to put it very simply, an easy demonstration of what the workloads actually are compared with what they could have been.
Suppose we say there are 64 CUs - let's call it 80 CUs for the sake of the latest series - and, per the 'engine optimisation hot lap' guideline, the CUs start up one by one to be issued work. That works out to an average of 40.5 CUs not working over the next 80 cycles when we calculate it with the Gauss summation method. We could either take it as 50.6% duty over an 80-cycle latency window, or as a flat 40.5 cycles of latency placed at the start of every GPU workload. The question is what we could do with the hardware if we directed our GPU power budget differently. If we instructed the GPU to 'load' but not do any work, we could keep loading in not just all 80 CUs, but each of the 80 CUs times 40 waves per CU. If the GPU is running at 2.5 GHz, that is 2‰ of the GPU time! There is a giant window of opportunity where the GPU can be excused from any real shader work and just track the instruction flow to prepare the shaders for operation.
It is crazy, but I think Nvidia won't let AMD sit on its laurels if they don't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array, the GPU waits at 5 MHz until all waves are loaded, then it releases them and off it goes! I know there are kinks. I just don't know any better. :oops:
If this is really possible, it would be really awesome.
 
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
If this is really possible, it would be really awesome.
Yes, just power gate them until they are ready for full operation with no delay, since they note it is already an established problem to keep pipelines full rather than to drain them. If it helps, turning off the shader array could expand the overclocking ceiling, which would also speed up idle recovery.
Funny thing is, the RGP (Radeon GPU Profiler) capture looks like a tapered trapezoid at the far end of the timeline, so they ought to work on the retiring speed as well.
I still don't get it, though. All thread blocks are limited to 1024 threads. Even an AI could pattern all possible permutations of a 1024-unit workgroup. They aren't trying hard enough; haven't they even played any StarCraft... build orders are everything. Just 4-pool, gg wp.
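If anyone wants to see the 1024 limit I'm talking about, here is a quick query sketch in CUDA terms (device 0 assumed); current GPUs report 1024 for maxThreadsPerBlock, and compute-shader thread groups have the same cap:
Code:
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 assumed
        printf("no CUDA device found\n");
        return 1;
    }
    // Current CUDA GPUs report 1024 here, matching the limit mentioned above.
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions  : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}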
 