Don't look at it that way. It now issues a full 32-lane wave32 at once on a SIMD32. Previously it issued 16-lane SIMD16 slices, so it didn't care whether it worked through 4 different 16/64 slices of separate wavefronts back to back or a single 64/64 wavefront. Looked at that way it makes sense: since both halves of the old 64-wide wavefront are independent of each other, issuing takes one cycle instead of two, whereas the four SIMD16 slices issued one after another incurred a 4-cycle penalty to retire a wavefront.
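As a back-of-the-envelope sketch of that cadence (just the lane-width arithmetic from this post, nothing taken from the ISA docs):

```python
def cycles_to_issue(wave_width, simd_width):
    """Cycles a wavefront occupies a SIMD, assuming one row of lanes per cycle."""
    return -(-wave_width // simd_width)  # ceiling division

print(cycles_to_issue(64, 16))  # GCN: wave64 on a SIMD16 -> 4 cycles
print(cycles_to_issue(32, 32))  # RDNA: wave32 on a SIMD32 -> 1 cycle
print(cycles_to_issue(64, 32))  # RDNA wave64 mode: two wave32 halves -> 2 cycles
```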
It's crazy how that works out. I recall sebbbi mentioning that a 5-way split is what hides memory accesses from the runtime. It must be tight having to code to 48 VGPRs just to keep memory accesses from becoming an impediment...
My perspective was from a "number of vGPR bits on the CU" perspective. Which, btw, I got the numbers wrong on earlier. Looking at the RDNA ISA again, there's 1024 VGPRs per SIMD. But my overall point is correct.
Let me explain my point more carefully (and with the correct numbers this time!!).
A GCN CU has 4x vALUs. (I'm going to ignore the sALU on both GCN and RDNA for simplicity). Each vALU had 256 vGPRs for 64 threads. Only 16 of those threads ever executed at a time, but all 4 vALUs x 256 VGPRs x 32-bits x 64 threads of register state had to exist in memory. That's 2-million bits per GCN CU.
Let's perform a similar calculation for RDNA. RDNA is organized as WGPs (a "dual compute unit") with 4 vALUs. Normalizing against GCN, that's 2 vALUs per half-WGP (one WGP is roughly comparable to 2x CUs from GCN). That is to say, an RX 580 (40 CUs) is upgraded to a 5700 XT (20 WGPs, sold as a "40 CU" card). For RDNA, that's 2 vALUs x 1024 vGPRs x 32-bits x 32 threads == 2-million bits per "half WGP".
For both GCN and RDNA, there's "2-million VGPR-bits per CU" (whether it's an "old GCN CU" or a "new RDNA CU", i.e. half a WGP).
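Here's a minimal sketch of that arithmetic, using only the per-vALU figures from this post:

```python
def vgpr_bits(valus, vgprs_per_valu, threads, bits_per_reg=32):
    """Total register-file bits implied by the counts above."""
    return valus * vgprs_per_valu * threads * bits_per_reg

gcn_cu = vgpr_bits(4, 256, 64)          # GCN CU: 4 vALUs x 256 VGPRs x 64 threads
rdna_half_wgp = vgpr_bits(2, 1024, 32)  # RDNA half-WGP: 2 vALUs x 1024 VGPRs x 32 threads

print(gcn_cu, rdna_half_wgp)  # 2097152 and 2097152 -> ~2 million bits each
```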
I always wondered when there would be an AI for those kinds of linear software-tuning jobs.
To steal your phrase: Don't look at it that way
* Kernel A uses 70 vGPRs
* Kernel B uses 200 vGPRs
* Kernel C uses 400 vGPRs.
That means you can run 1x Kernel C (400 vGPRs) + 2x Kernel B (400 vGPRs) + 3x Kernel A (210 vGPRs) on a single RDNA SIMD (1024 vGPRs per SIMD, 4x SIMDs per WGP). Dispatching GPU code is basically a "malloc for VGPRs".
Once Kernel C completes, you can fill up the remaining space with 5x more Kernel A (for a total of 8x Kernel A running).
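A toy sketch of that "malloc for VGPRs" bookkeeping, assuming the 1024-VGPR-per-SIMD budget and the kernel sizes from this example:

```python
SIMD_VGPR_BUDGET = 1024  # per-SIMD register file, as counted above

kernels = {"A": 70, "B": 200, "C": 400}  # VGPRs per wave of each kernel

def extra_waves_that_fit(resident, candidate, budget=SIMD_VGPR_BUDGET):
    """How many more waves of `candidate` fit next to the already-resident waves."""
    used = sum(kernels[k] for k in resident)
    return (budget - used) // kernels[candidate]

# 1x C + 2x B + 3x A = 400 + 400 + 210 = 1010 VGPRs, which fits in 1024
resident = ["C", "B", "B", "A", "A", "A"]
assert sum(kernels[k] for k in resident) <= SIMD_VGPR_BUDGET

# Once Kernel C retires, its 400 VGPRs free up: room for 5 more waves of A
resident.remove("C")
print(extra_waves_that_fit(resident, "A"))  # 5 -> 8x Kernel A in total
```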
---------
Damn it, it looks like I misremembered a few things. It was the scalar L1 cache that changed, not the vGPR L1 cache. Whatever, here's the answer (from https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf):