One more reason for the saturation issues: the GCN compute unit is far too asymmetrical, with too little granularity and too many special-purpose modules ... let me illustrate:
View attachment 91697
- separate units for integer ops and separate ones for float vectors, as opposed to each CUDA core having both ALUs inside
- too many special-purpose decode hardware blocks, as opposed to one unit that knows how to decode everything and shares its internal logic
- too many special-purpose cache units, each tied to its own special-purpose block, as opposed to a more flexible approach with one bigger unified shared cache pool and bigger multipurpose, unified local caches
Basically it's a low-latency, throughput-favoring design that is wasteful and inflexible. Depending on the kind of code running at any particular moment, a bunch of the units are doing nothing while still being fully powered on, waiting to maybe do something useful in the next clock cycle. To gracefully saturate GCN (both peak efficiency and 100% usage) you need the right ratio of int/float instructions and the right amount of memory operations sprinkled through the code
... which is, incidentally, easier to do using async compute.
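To make that last point concrete, here's a minimal sketch of the idea (mine, not from the attachment), written in CUDA just because it's short and readable; on GCN you'd do the equivalent by putting work on a separate compute queue via D3D12/Vulkan. One kernel is memory/int-bound, the other is float-ALU-bound, and launching them concurrently lets the hardware interleave the two instruction mixes so fewer units sit idle. All names and sizes are made up.

```cuda
#include <cuda_runtime.h>

// Memory/int-heavy: mostly address math and loads, float units largely idle.
__global__ void gatherKernel(const float* __restrict__ src,
                             const int* __restrict__ idx,
                             float* __restrict__ dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];           // integer indexing + memory traffic
}

// ALU/float-heavy: a long FMA chain, memory pipes largely idle.
__global__ void fmaKernel(float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = fmaf(v, 1.0001f, 0.5f); // pure float work
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *src, *dst, *data;
    int *idx;
    cudaMalloc(&src,  n * sizeof(float));
    cudaMalloc(&dst,  n * sizeof(float));
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&idx,  n * sizeof(int));

    // Zero-init so the gather indices are in range (all zero -> src[0]).
    cudaMemset(src,  0, n * sizeof(float));
    cudaMemset(dst,  0, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    cudaMemset(idx,  0, n * sizeof(int));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + 255) / 256);

    // Launched on separate streams, the two workloads can overlap, which is
    // the same trick async compute queues let you pull on GCN: the gather
    // keeps the memory path busy while the FMA chain feeds the float ALUs.
    gatherKernel<<<grid, block, 0, s0>>>(src, idx, dst, n);
    fmaKernel  <<<grid, block, 0, s1>>>(data, n, 256);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(src); cudaFree(dst); cudaFree(data); cudaFree(idx);
    return 0;
}
```

Run either kernel alone and chunks of the CU sit powered-on but idle; run them together and the overall instruction mix gets much closer to the int/float/memory ratio the hardware actually wants.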