If one googles GFX1010:
// On GFX10 I$ is 4 x 64 bytes cache lines. By default prefetcher keeps one cache line behind and reads two ahead. We can modify it with S_INST_PREFETCH for larger loops to have two lines behind and one ahead. Therefore we can benefit from aligning loop headers if loop fits 192 bytes. If loop fits 64 bytes it always spans no more than two cache lines and does not need an alignment. Else if loop is less or equal 128 bytes we do not need to modify prefetch. Else if loop is less or equal 192 bytes we need two lines behind.
-> L0 cache, which is referred to below.
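The size thresholds in the quoted comment can be written out as a small decision function. This is only a sketch of the rule as the comment states it, not the compiler's actual code: the name `loop_prefetch_plan` and the tuple layout are made up for illustration, and the behaviour for loops larger than 192 bytes (leave everything at the default) is an assumption, since the comment only says alignment helps when the loop fits in 192 bytes.

```python
CACHE_LINE_BYTES = 64  # GFX10 I$ line size, per the quoted comment


def loop_prefetch_plan(loop_size_bytes):
    """Return (align_loop_header, lines_behind, lines_ahead).

    Hypothetical helper: mirrors the thresholds in the quoted comment.
    The default prefetch setting is one line behind, two ahead.
    """
    if loop_size_bytes <= CACHE_LINE_BYTES:
        # Spans at most two cache lines anyway: no alignment needed.
        return (False, 1, 2)
    if loop_size_bytes <= 2 * CACHE_LINE_BYTES:
        # Align the header, keep the default prefetch mode.
        return (True, 1, 2)
    if loop_size_bytes <= 3 * CACHE_LINE_BYTES:
        # Align the header and switch to two lines behind, one ahead
        # (the mode the comment says S_INST_PREFETCH can select).
        return (True, 2, 1)
    # Assumption: larger loops get no special treatment.
    return (False, 1, 2)


for size in (48, 100, 160, 400):
    print(size, loop_prefetch_plan(size))
```

The 192-byte limit is three of the four 64-byte lines, which matches the "two behind, one ahead" mode: header line plus two lines kept behind the current fetch position.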
// In WGP mode the waves of a work-group can be executing on either CU of the WGP. Therefore the L0, which is per CU, needs to be invalidated. Otherwise, in CU mode, all waves of a work-group are on the same CU, and so the L0 does not need to be invalidated.
-> CU mode and WGP mode
// HWRC = Register destination cache
&
// Try to reassign registers on GFX10+ to reduce register bank conflicts.
// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to bank 1, etc. SGPRs have 8 banks and are allocated in pairs, so that s0:s1, s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
// The shader can read one dword from each of these banks once per cycle. If an instruction has to read more register operands from the same bank an additional cycle is needed. HW attempts to pre-load registers through input operand gathering, but a stall cycle may occur if that fails. For example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands, potentially incurring 2 stall cycles.
// The pass tries to reassign registers to reduce bank conflicts.
// In this pass bank numbers 0-3 are VGPR banks and 4-11 are SGPR banks, so that 4 has to be subtracted from an SGPR bank number to get the real value. This also corresponds to bit numbers in bank masks used in the pass.
-> HWRC and banking are part of Super-SIMD patents:
https://patents.google.com/patent/US20180357064A1
https://patents.google.com/patent/US20180121386A1
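To make the bank numbering in the quoted comments concrete, here is a rough sketch of the mapping and of the worst-case operand read cost. The function names and the `("v", n)` / `("s", n)` operand encoding are invented for illustration. Two assumptions are baked in: banks are read in parallel, so the cost is set by the most heavily used bank, and the same register used twice counts as a single read. Operand gathering is ignored, so the stall count is an upper bound.

```python
from collections import Counter

NUM_VGPR_BANKS = 4   # v0, v4, v8... -> bank 0; v1, v5, v9... -> bank 1
NUM_SGPR_BANKS = 8   # s0:s1, s16:s17 -> bank 0; s2:s3, s18:s19 -> bank 1
SGPR_BANK_OFFSET = NUM_VGPR_BANKS  # the pass numbers SGPR banks 4-11


def vgpr_bank(reg):
    """Real VGPR bank: round-robin over 4 banks."""
    return reg % NUM_VGPR_BANKS


def sgpr_bank(reg):
    """Real SGPR bank: registers are banked in pairs over 8 banks."""
    return (reg // 2) % NUM_SGPR_BANKS


def pass_bank(kind, reg):
    """Bank number as used inside the pass: 0-3 VGPR, 4-11 SGPR."""
    if kind == "v":
        return vgpr_bank(reg)
    return sgpr_bank(reg) + SGPR_BANK_OFFSET


def read_cycles(operands):
    """Worst-case cycles to read all source operands.

    Assumption: one dword per bank per cycle, banks read in parallel,
    duplicate registers read once, no operand gathering.
    """
    counts = Counter(pass_bank(kind, reg) for kind, reg in set(operands))
    return max(counts.values(), default=0)


def stall_cycles(operands):
    """Extra cycles beyond the single conflict-free read cycle."""
    return max(read_cycles(operands) - 1, 0)


# The example from the comment: V_FMA_F32 V111 = V0 + V4 * V8
fma = [("v", 0), ("v", 4), ("v", 8)]
print(read_cycles(fma), stall_cycles(fma))
```

Reassigning v4 and v8 to registers in other banks, for example v1 and v2, brings the same instruction down to one read cycle, which is the kind of change the pass is looking for.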
//In one embodiment, each bank of the vector destination cache holds 4 entries, for a total 8 entries with 2 banks.
-> destination register cache // HWRC => 8 destination registers with 3-entry source operand forwarding.
//In one embodiment, source operand buffer holds up to 6 VALU instructions' source operands. In one embodiment, source operand buffer includes dedicated buffers for providing 3 different operands per clock cycle to serve instructions like a fused multiply-add operation which performs a*b+c.
-> source operand buffer => 6 * 3-entry source operand buffer