For my simple non-technical (let alone professional) understanding, I'm thinking that if IC is truly delivering wide effective bandwidth (equivalent to an 800+ bit bus) across different workload levels (up to 4K, which is far more common than 8K), and scales well across them, then the real bottleneck for any better performance is, as you also stated directly or indirectly, the GPU's cores and their surrounding I/O. And if that's really true, they've managed to remove the bandwidth bottleneck completely, at least up to 4K.
Okay, people tend to think of bandwidth as a constant thing (I'm always pushing 18 Gbps or whatever the hell it is), and that if I'm not pushing the maximum amount of data at all times, the GPU is going to stall.
The reality is that only a small subset of the data is actually necessary to keep the GPU fed and prevent stalls. The majority of the data (in a gaming context anyway) isn't anywhere near as latency-sensitive, and there's much more flexibility in when it comes across the bus. IC helps by doing two things:
A: It stops writes and subsequent retrievals from going back out to general memory for the majority of that data (letting it live in cache, where a shader is likely going to retrieve that information again), and
B: It acts as a buffer for further deprioritising data retrieval, letting likely-needed data be fetched earlier, held momentarily in cache, then ingested into the shader pipeline, rather than being written back out to VRAM and re-fetched.
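To put some rough numbers on why a big cache can make a modest bus feel much wider, here's a back-of-envelope model blending cache and VRAM bandwidth by hit rate. All the figures (the 512 GB/s VRAM number, the on-die cache bandwidth, the hit rates) are my own illustrative assumptions, not vendor specs:

```python
# Toy "effective bandwidth" model: a weighted blend of on-die cache
# bandwidth and VRAM bandwidth, by cache hit rate. Numbers are assumptions.
VRAM_BW_GBPS = 512.0    # assumed: e.g. 256-bit GDDR6 at 16 Gbps per pin
CACHE_BW_GBPS = 2000.0  # assumed on-die cache bandwidth

def effective_bandwidth(hit_rate: float) -> float:
    """Hits are served at cache speed, misses at VRAM speed."""
    return hit_rate * CACHE_BW_GBPS + (1.0 - hit_rate) * VRAM_BW_GBPS

for hr in (0.0, 0.4, 0.6):
    print(f"hit rate {hr:.0%}: ~{effective_bandwidth(hr):.0f} GB/s")
```

The point isn't the exact numbers; it's that even a middling hit rate multiplies the apparent bandwidth well past what the physical bus delivers, which is exactly why the "constant 18 Gbps" mental model misleads.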
As for Nvidia, yep, they would have considered it, but the amount of die space chewed up by even 128 MB of cache is ludicrously large. AMD has balls chasing such a strategy tbh (but that's probably why we saw 384-bit Engineering Sample cards earlier in the year: if IC didn't perform, they could fall back to a wider bus).
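To see why that die-space cost is "ludicrously large", here's the rough arithmetic. The density figure is an assumption on my part (effective µm² per bit for 7nm-class SRAM including array overhead), so treat the result as order-of-magnitude only:

```python
# Rough die-area estimate for 128 MB of on-die SRAM.
# The density value is an assumed effective figure (cell + overhead),
# not a published process spec.
MB = 128
bits = MB * 1024 * 1024 * 8           # 1,073,741,824 bits
um2_per_bit = 0.06                     # assumed effective density, 7nm-class
area_mm2 = bits * um2_per_bit / 1e6    # 1 mm^2 = 1e6 um^2
print(f"~{area_mm2:.0f} mm^2 for {MB} MB of SRAM")
```

Even under generous assumptions that's a meaningful chunk of a consumer GPU die spent on cache instead of shader cores, which is why keeping a wider-bus design in reserve was a sensible hedge.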