Thank you for some very valuable insight. It is all a game to me, but it is a learning opportunity nonetheless.
I'm intrigued: this warms up the L2 caches, I presume?
What I find generally lacking is, to put it very simply, a clear demonstration of what the workloads actually are compared to what they could have been.
Suppose there are 64 CUs, or let's say 80 CUs for the sake of the latest series. According to the 'engine optimisation hot lap' guideline, the CUs start up one by one as they are issued work. Averaged over that ramp, 40.5 CUs are not working during those 80 cycles (the Gauss sum of 1 through 80, divided by 80). We can read that either as roughly 50.6% of the CU-cycles sitting idle across the 80-cycle window, or as a flat 40.5 cycles of latency placed at the start of every GPU workload.

The real question is what we could do with the hardware if we directed the GPU's power budget differently. If we instructed the GPU to 'load' but not do any work yet, we could keep loading not just for all 80 CUs, but for each of the 80 CUs times 40 waves per CU. With the GPU running at 2.5 GHz, that is 2‰ of the GPU's time! There is a giant window of opportunity in which the GPU can be excused from any real shader work and simply track the instruction flow to prepare the shaders for execution.
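Just to keep my own arithmetic honest, here is the same back-of-the-envelope calculation as a small Python sketch. The ramp model (one CU handed work per cycle, each CU idle until its turn), the 40 waves per CU and the 2.5 GHz clock are simply the assumptions from above, not measured behaviour:

```python
# Back-of-the-envelope figures for the CU ramp-up described above.
# Assumptions: 80 CUs, one CU issued work per cycle, 40 waves per CU, 2.5 GHz.

N_CUS = 80
WAVES_PER_CU = 40
CLOCK_HZ = 2.5e9

# During cycle i (1-based), CUs i..80 are still waiting to be issued work.
idle_per_cycle = [N_CUS - (i - 1) for i in range(1, N_CUS + 1)]
avg_idle = sum(idle_per_cycle) / N_CUS        # Gauss sum: (1 + 80) / 2 = 40.5
idle_fraction = avg_idle / N_CUS              # 40.5 / 80 = 50.625 %

total_wave_slots = N_CUS * WAVES_PER_CU       # 80 * 40 = 3,200 wave slots
load_time_us = total_wave_slots / CLOCK_HZ * 1e6  # one cycle per slot at 2.5 GHz

print(f"average idle CUs over the ramp: {avg_idle}")               # 40.5
print(f"idle share of the 80-cycle window: {idle_fraction:.1%}")   # 50.6%
print(f"wave slots to fill: {total_wave_slots}")                   # 3200
print(f"one cycle per wave slot at 2.5 GHz: {load_time_us:.2f} µs")  # 1.28 µs
```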
It is crazy, but I think Nvidia won't let AMD sit on its laurels if AMD doesn't discover buffered instruction and data flow cycling first. Imagine: the execution mask is off for the whole shader array, the GPU waits at 5 MHz until all the waves are loaded, then the mask is released and off it goes! I know there are kinks. I just don't know any better.
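To make the 'hold everything behind the mask, then release' idea concrete, here is a toy comparison. Everything in it is invented for illustration: the pre-load durations (10, 40, 80 cycles) are made-up numbers, and 'idle CU-cycles' is just my own bookkeeping metric, not anything a real driver exposes. The batched release only comes out ahead if the pre-load can run faster than the one-CU-per-cycle trickle or overlap with earlier work:

```python
# Toy model: idle CU-cycles with a staggered ramp vs. a batched release
# behind an execution mask. Pure speculation, not any real GPU's behaviour.

N_CUS = 80

def idle_cu_cycles_staggered(n_cus: int) -> int:
    """CU i waits i cycles before it is issued work (the ramp above)."""
    return sum(range(1, n_cus + 1))  # Gauss: 80 * 81 / 2 = 3,240 CU-cycles

def idle_cu_cycles_batched(n_cus: int, load_cycles: int) -> int:
    """All CUs wait behind the mask for load_cycles, then start together."""
    return n_cus * load_cycles

staggered = idle_cu_cycles_staggered(N_CUS)
for load_cycles in (10, 40, 80):
    batched = idle_cu_cycles_batched(N_CUS, load_cycles)
    print(f"pre-load {load_cycles:2d} cycles: {batched:5d} idle CU-cycles "
          f"(staggered ramp: {staggered})")
```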