Man... the power consumption on these cards always makes me smile.
What it can do within 200W certainly isn't shabby on its face, and it's probably somewhat indicative of what we'll see from 6GB GM200 parts (at 300W). Shame they'll probably split what we really want (a full-fat 6GB part) into a 12GB lower-clocking Titan and a high-clocking 5.5GB 21-SMM numbered GeForce, at least initially. Gotta milk that cow.
The thing I find interesting is that, in essence, trading shader ALUs per unit (compared to Kepler) for a more mobile-CPU-like clockspeed target seems to have worked to an extent... but is it truly the most efficient approach beyond the current circumstances? Already we see the architecture requiring more than optimal voltage, and topping out with only minimal voltage headroom but substantially higher power consumption (a tight tolerance, but not perfectly optimized for 28nm's properties). Obviously, factors other than unit setup/clockspeed (like the use of cache and a smaller bus) also play a role in total power consumption.
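To make that intuition concrete, here's a back-of-the-envelope sketch of why the last few clock bins get so expensive: dynamic power scales roughly with f * V^2, and near the top of the curve each extra clock step needs a disproportionate voltage bump. The V/f points below are purely illustrative, not measured GM204 data:

```python
# Back-of-the-envelope dynamic power model: P ~ C * f * V^2.
# The V/f points below are purely illustrative, not measured GM204 data.

def power(f_mhz, v):
    C = 1.0  # arbitrary switching-capacitance constant; only ratios matter
    return C * f_mhz * v ** 2

# Hypothetical curve: near the top of the range, each extra clock step
# needs a disproportionate voltage bump.
vf_points = [(1100, 1.00), (1250, 1.05), (1400, 1.16), (1500, 1.26)]

base_f, base_v = vf_points[0]
base_p = power(base_f, base_v)
for f, v in vf_points:
    p = power(f, v)
    print(f"{f} MHz @ {v:.2f} V: {f / base_f - 1:+.0%} clock, "
          f"{p / base_p - 1:+.0%} power")
```

On these made-up points, the last ~36% of clock costs over twice the power, which is the same shape of behavior described above: clocks top out while power keeps climbing.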
What I'm getting at is this: all things being equal (both companies using HBM for the larger buffers their cores will require, and only just-adequate cache/total bandwidth to meet each chip's needs), and with AMD fielding a tightly optimized architecture (say 96 CUs/96 ROPs vs. 32 SMMs/96 ROPs within a given TDP), will a unit structure/clock balance like this hold up, especially on Samsung's 14nm process?
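A toy version of that comparison, using the same illustrative f * V^2 model as above (every unit count, clock, and voltage here is invented, not an actual CU/SMM power figure):

```python
# Toy comparison: equal theoretical throughput (units * clock), two ways.
# Same illustrative P ~ units * f * V^2 model; every number here is invented.

def chip_power(units, f_mhz, v):
    return units * f_mhz * v ** 2

wide   = chip_power(units=96, f_mhz=800,  v=0.90)  # wide-and-slow, lower V
narrow = chip_power(units=64, f_mhz=1200, v=1.15)  # narrow-and-fast, higher V

print(f"throughput parity: {96 * 800} vs {64 * 1200}")        # 76800 both
print(f"relative power: wide = {wide / narrow:.2f}x narrow")  # ~0.61x
```

The wide config comes out well ahead on paper, though this ignores the area/leakage cost of the extra units and assumes perfect utilization, which is exactly where the real tradeoff lives.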
One can argue Maxwell was tailored toward the approximate die size and voltage/clock characteristics of TSMC 20nm (higher performance per volt, maybe 20-25% [the difference between Maxwell and older 28nm chips], but a lower voltage ceiling), and was then more or less successfully back-ported. If Nvidia planned a similar route on TSMC 16nm for either a shrink or Pascal, the trick will be meaningfully less impressive on Samsung/GF 14nm, which is tailored toward density over clockspeed (TSMC is about 10% faster, GF/Samsung about 10% smaller when comparing early 16nm/14nm; mature 16nm+/14nm may be more similar [while closer to the Samsung 14nm methodology] as both companies seek parity to satisfy customers double-sourcing).
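Plugging the rough ~10% figures above into the same toy model: at iso-area and iso-throughput, a denser node buys more units at unchanged V/f, while a faster node buys more clock, typically with some voltage bump (the 1.05 V figure below is an assumption, not process data):

```python
# Rough take on the 16nm-vs-14nm tradeoff using the ~10% figures above.
# Same illustrative P ~ units * f * V^2 model; the voltage bump is assumed.

base  = 100 * 1000 * 1.00 ** 2  # baseline: 100 units @ 1000 MHz, 1.00 V
dense = 110 * 1000 * 1.00 ** 2  # ~10% denser node: more units, same V/f
fast  = 100 * 1100 * 1.05 ** 2  # ~10% faster node: more clock, assumed V bump

print(f"throughput: dense {110 * 1000}, fast {100 * 1100}")  # parity
print(f"power vs base: dense {dense / base:.2f}x, fast {fast / base:.2f}x")
```

If the faster node could hold voltage flat at the higher clock, the two routes would be a wash in this model; the density route only pulls ahead once extra clock starts demanding extra voltage, which is the regime Maxwell already sits in on 28nm.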
IOW, more units at lower clockspeed is the smarter play going forward, at least initially. AMD has been geared in this direction for some time (post 7000-series), while Nvidia's methodology is different (in essence clockspeed vs. IPC). Couple that with the greater dependence on shader resources in the PS4 (which could be seen as a 939sp core architecture plus extra compute in some circumstances when scaling from the Xbox One, which is either shader-limited for 16 ROPs, or could be seen as 470sp plus extra compute for 8 practical ROPs in some cases), and this setup, especially at the high end (where it is essentially a good match for 64 ROPs, even if matched to 96 because of the larger bus required for GDDR5 and the chip sizes of 28nm), may not be as impressive as it currently appears.
Not saying the architecture isn't sound, or that overall it isn't a better-optimized solution for 28nm than even Fiji... just that it's far from perfect, and the current methodology will probably not be as fruitful going forward.