When HBM technology is brought to Radeon, the memory-bandwidth problem is solved.
There is no need to enlarge the secondary cache the way Maxwell does.
Except that if the R9 390X is indeed 4096sp, 512 GB/s (which would be four 1GB HBM stacks at 128 GB/s each) would really only feed it up to around 1120 MHz if it scales like Hawaii, or around 1200 MHz with the compression tech we saw in the R9 285. With or without factoring in scaling (96-97%), that doesn't touch big Maxwell (at probably a fairly similar die size, though granted Fiji would be slightly smaller on the same process)... and you can bet we'll see a '770'-like GM204 refresh (or a weak-sauce, heavily cut-down big Maxwell SKU) if its stock clock is 1 GHz. While this approach to bandwidth would work for a 28nm or even 20nm part on their current architecture, compared to what's possible on 16nm it's nowhere near enough if they actually want to compete.
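To put rough numbers on that, here's a minimal Python sketch of the clock estimate; the 290X figures are its public specs, while the ~7% compression gain is my own assumption extrapolated from Tonga, not a published number:

```python
# At what core clock does a hypothetical 4096sp part exhaust 512 GB/s,
# assuming it consumes bandwidth in the same proportion as Hawaii?
HAWAII_SP, HAWAII_MHZ, HAWAII_BW = 2816, 1000, 320      # R9 290X: sp, MHz, GB/s
bw_per_sp_mhz = HAWAII_BW / (HAWAII_SP * HAWAII_MHZ)    # GB/s per sp*MHz

FIJI_SP, FIJI_BW = 4096, 512                            # 4 stacks * 128 GB/s (rumored)
max_mhz = FIJI_BW / (FIJI_SP * bw_per_sp_mhz)
print(f"Hawaii-like scaling: BW-limited at ~{max_mhz:.0f} MHz")      # ~1100 MHz

DCC_GAIN = 1.07  # assumed effective gain from R9 285-style color compression
print(f"With 285-style compression: ~{max_mhz * DCC_GAIN:.0f} MHz")  # ~1177 MHz
```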
The reason is that 4096sp generally won't be used to its full extent in core gameplay; closer to ~3800 (just as you saw with the 280X vs GK104, or the same effect at half the scale with 7950/280 vs 7970/280X). When you take whatever that number is, divide it by the 2560 effective units in GM204, and factor in that GM204 can do 1500 MHz BECAUSE of its large secondary cache... that ain't good. Btw, this is why big Maxwell is essentially 3840 units ([128sp+32sfu]*24), the same way GK104 was essentially 1792 ([192+32]*8): the optimal count for 32/64 ROPs sits right around there. Slightly higher in GK104's case (hence why the 280X was slightly faster per clock), but that was a fairly small chip that could expect decent yields. Slightly lower in big Maxwell's case, but I'd be willing to bet most parts sold will be under that threshold (a gap of less than one shader module).
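A quick sketch of that unit math in the same vein (the ~3800 usable figure and the clocks are the estimates argued above, not measurements):

```python
# Effective per-clock shader units, counting SFUs alongside the main ALUs.
def effective_units(sm_count, sp_per_sm, sfu_per_sm=32):
    return sm_count * (sp_per_sm + sfu_per_sm)

gm204 = effective_units(16, 128)   # 2560
gm200 = effective_units(24, 128)   # 3840  ([128sp+32sfu]*24)
gk104 = effective_units(8, 192)    # 1792  ([192+32]*8)

# Crude units*clock throughput comparison at the clocks argued above:
fiji_like = 3800 * 1100            # ~3800 usable sp at a BW-limited ~1100 MHz
gm204_hi  = gm204 * 1500           # GM204's cache lets it sustain ~1500 MHz
print(f"Fiji-like vs GM204: {fiji_like / gm204_hi:.2f}x")  # ~1.09x, before GM200
```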
What's unfortunate is that while excessive compute and high bandwidth are good for certain things (TressFX, etc.), it's generally a better play to have fewer units than the ROPs can handle in most core gaming situations, since that's more power/die/bandwidth efficient (again, see GK104 vs 280X), and, if need be, to scale the core clock so all the units (texture, ROPs, etc.) perform at an optimal ratio. If we essentially get a 2x 280X just because AMD has the bandwidth to do it (and clock speeds won't allow a leaner core config at a higher clock to saturate it, similar to their more recent bins that generally top out around ~1100 MHz), they're missing the big picture in an effort to pull out all the stops and build something slightly faster through brute force... It'll be Tahiti vs GK104 all over again, just on a slightly larger scale.
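For what the GK104-vs-Tahiti comparison looks like in raw numbers (clocks here are typical shipping figures, and units*clock is obviously a crude proxy for performance):

```python
# "Fewer units at higher clock" vs "more units at lower clock":
# the two land at roughly similar gaming performance, so Tahiti's
# extra ALU budget largely goes unused in core gameplay.
gk104_budget  = 1536 * 1058   # GTX 680-class sp * MHz
tahiti_budget = 2048 * 1000   # 280X-class sp * MHz
print(f"Tahiti spends ~{tahiti_budget / gk104_budget:.0%} of GK104's "
      "ALU budget for roughly comparable gaming performance")  # ~126%
```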
All they're doing is moving the goalposts with CUs and bandwidth, more or less as they have since R600, when a fundamental efficiency change is sorely needed. I'm talking about moves like going to VLIW4 instead of VLIW5 (when the average issue packed ~3.44 ops), the shift to 4x SIMD-16 with a better scheduler, or, to a lesser extent, what they did with compression in the 285. Even if the bandwidth problem is solved for another generation (and even that's arguable, when more than 4GB is going to quickly become the norm and HBM won't get there for a year or more, not to mention GM200 will simply be out of their league on the same process), the fundamental issue is the lack of architectural evolution to cope with outside factors (bandwidth, process-node capabilities) not lining up with what they currently have. Some of that is probably tied to, and hamstrung by, their ecosystem (HSA, APUs, etc.), but I still think it primarily comes down to a lack of resources in their engineering department over the last few years.
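The VLIW4-vs-VLIW5 point is easy to see with the 3.44 figure cited above (a simplified utilization model, since a real compiler can't always pack slots freely):

```python
# If the average instruction bundle packs ~3.44 scalar ops, a narrower
# issue width wastes fewer slots per cycle:
AVG_OPS = 3.44
for width in (5, 4):
    print(f"VLIW{width}: {AVG_OPS / width:.0%} slot utilization")
# VLIW5: 69%  ->  VLIW4: 86%
```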
I truly think Roy knows all this (that they're in a bad place right now and the immediate future doesn't look fabulous either), but his job is his job, and I respect that.