So, end of the day, what did we learn from the Fiji launches?
IMHO, it justified some of the viewpoints I've long held about AMD's architecture...do you guys agree or do you see things differently?
1. AMD really needs clockspeed capability in the ~1400-1500MHz range to field a truly compelling part, given the demands of today's games.
2. AMD needs to get their CU:ROP ratio under control. 1 CU : 1 ROP is not optimal; it's closer to something like 16 ROPs per 14-15 CUs, or 24 ROPs per 22 CUs. While compute is great, there comes a point where the excess becomes a liability. This is something nvidia learned going from Kepler to Maxwell.
3. AMD hasn't benefited from HDL (high-density libraries), or whatever other jazz (outsourcing? loss of key engineers?) has crept into their designs since Hawaii. Whatever process they're using (HPM?) likely saves them space and may in theory allow higher clocks at lower voltage, which could obviously be crucial for Fiji's design in one way (it's as large as a 28nm chip can be) or another (Nano might be compelling at something like 850-900MHz core / 400MHz HBM at 0.9v core / 1v memory vs a 970/980). Still, the voltage required for decent clockspeed/performance isn't great, nor is the overall scaling. While the newer 390(X) parts do better than the initial 290 series run, they still clock worse per volt than any of the original 7000 series by a decent margin (10%?). And even accounting for Maxwell having a 20-25% deeper pipeline (or whatever changed so that their clockspeed scaling now looks more like ARM's A57; I always assumed a design originally aimed at 20nm), nvidia's scaling has stayed consistent from Kepler to Maxwell. What is going on with AMD's clockspeed problems?
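To put a rough number on the CU/ROP balance gripe in point 2, here's a back-of-envelope comparison using the publicly listed specs for full Fiji and full GM200. Treating shaders-per-ROP as a crude proxy for compute:fill balance is my own simplification, not an official metric:

```python
# Crude compute:fill balance comparison (my own back-of-envelope framing).
# Shader and ROP counts are the publicly listed specs for each full chip.
chips = {
    "Fiji (Fury X)": {"shaders": 4096, "rops": 64},
    "GM200 (Titan X)": {"shaders": 3072, "rops": 96},
}

for name, c in chips.items():
    print(f"{name}: {c['shaders'] // c['rops']} shaders per ROP")

# Fiji carries twice the shader ALUs per ROP that GM200 does, which is
# exactly the imbalance point 2 is complaining about.
```

By this (admittedly crude) measure, Fiji sits at 64 shaders per ROP vs GM200's 32.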
End of the day, I think AMD's arch could be a good one, give or take a few tweaks and changes in philosophy; GCN needs to be rebalanced. I certainly have no idea what design rules apply (e.g. whether 16 CUs take up just as much space as 15 within a shader engine given the overall chip parameters, or whether AMD can reconfigure such an engine to hold 12-24 ROPs), but something needs to be done in that regard for efficiency. By the same token, changes in process/design need to be made so the chips' clockspeeds can scale, even at the cost of die size (remember the decap ring in RV790?), or they are flat-out doomed. There is always an argument for adding more units and lowering clockspeed/voltage, probably especially as we move toward more mobile-oriented (low-voltage) processes, but ATi has always been at the top of their game when using fewer units with greater clockspeed potential than their competition; it saves space/cost (and, on former processes, stock power consumption) while also making them the 'overclocker's choice'. This philosophy has surely helped nvidia succeed as well. Look at how many references there are in this review alone noting that nvidia's arch, even at a stock 1000-<1300MHz and with fewer units (say 2080-2560 sp+sfu in GM204, or 3360-3840 in GM200), can consistently overclock super high, to ~1500MHz (even if sometimes while drawing lots of power).
This, in part, is what scares me about 14LPP/16nmFF+ with respect to AMD. Samsung's 14nm is seemingly smaller and cheaper (~10%), but that's likely offset by lower clock potential vs TSMC and 16nmFF+. I could very much see AMD dropping a Fiji shrink that comes in under 225w (typical first parts from AMD on a new process run around 188w, half of 375w) and capitalizes completely on the die savings and extra perf/volt, perhaps even putting two such small chips (and 2x HBM2) on a single interposer for a crazy-awesome part. But the compelling nature of such a chip, say (for argument's sake) Fiji at 1400MHz core / 625MHz HBM with 8GB, only goes so far. I seriously fear (for competition's sake) that nvidia will use 14/16nm both to increase floating point (and/or decrease dedicated special-function hardware) to create a part similar in layout to AMD's while maximizing clockspeed.
Say, for instance, 224-240sp (+/- 32 sfu) in an SMP (Pascal shader module), up from 128sp (+32 sfu) in Maxwell or 192sp (+32 sfu) in Kepler. In a 16-module design (ex: GP104, or whatever replaces GM200 in the slightly-lower-end performance market), the result is something like 3584 (4096) or 3840sp, similar to or more efficient than Fiji. While nvidia may not capitalize completely on die savings (as AMD surely will) nor on absolute power consumption (I could see them doing another '980' made to draw >225w), we could conceivably see something that continues to follow the clock-scaling path of ARM processors (which on 14/16nm are planned for 2GHz+). While this is certainly all rumor, the theory is backed by the voices in the wind saying nvidia ran back to TSMC for their next designs after Samsung's (/GF's) yields proved terrible. For AMD, that has to be an incredibly scary thought... I very much doubt they want a rematch of 970 vs 290x (but imagine a 970 that wasn't gimped and a 290x that drew less power).
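Just to sanity-check the shader-count arithmetic in that hypothetical (the per-module sizes and 16-module layout are pure speculation on my part, nothing confirmed about Pascal):

```python
# Hypothetical Pascal shader-count arithmetic. The module sizes below are
# my speculation: 224sp, 240sp, or 224sp with the 32 SFUs folded in as sp.
modules = 16  # e.g. a GP104 / GM200-replacement-class chip (speculative)

for sp_per_module in (224, 240, 224 + 32):
    total = sp_per_module * modules
    print(f"{sp_per_module} sp/module x {modules} modules = {total} sp")

# Yields 3584, 3840, and 4096 sp respectively, which is where the
# "3584(4096) or 3840sp" figures above come from.
```

Any of those totals lands in Fiji's 4096-sp ballpark, which is the whole point of the comparison.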
TLDR: I've always appreciated AMD's strengths in engineering, design choices, and pushing technology forward... but something needs to change. I surely hope it does by next generation.