This Nvidia-provided slide gives a brief insight into how the GTX 970 is constructed. The three disabled SMs are shown at the top, with the 256KB L2s and pairs of 32-bit memory controllers at the bottom. Notice the greyed-out right-hand L2 for this GPU? Because the L2s are tied into the ROPs, this is a direct consequence of reducing the overall ROP count. The GTX 970 has 1,792KB of L2 cache, not 2,048KB, but, as Alben points out, it still has a greater cache-to-SMM ratio than the GTX 980.
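The cache figures above are easy to check by hand. The short sketch below does so using only numbers from the text: 256KB per L2 slice, seven active slices and 13 SMs on the GTX 970, and eight slices and 16 SMs (13 plus the three disabled ones) on the GTX 980. The function and variable names are purely illustrative.

```python
# Figures from the article: 256KB per L2 slice; GTX 980 has 8 slices and
# 16 SMMs, GTX 970 has 7 active slices and 13 active SMMs.
L2_SLICE_KB = 256


def l2_per_smm(active_slices, active_smms):
    """Return (total L2 in KB, L2 in KB per SMM)."""
    total_kb = active_slices * L2_SLICE_KB
    return total_kb, total_kb / active_smms


gtx_980 = l2_per_smm(active_slices=8, active_smms=16)
gtx_970 = l2_per_smm(active_slices=7, active_smms=13)

print(f"GTX 980: {gtx_980[0]}KB L2, {gtx_980[1]:.1f}KB per SMM")
print(f"GTX 970: {gtx_970[0]}KB L2, {gtx_970[1]:.1f}KB per SMM")
```

The GTX 970 works out to roughly 138KB of L2 per SMM against 128KB for the GTX 980, which is the ratio Alben refers to.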
Historically, up to and including the Kepler generation, cutting off the L2/ROP portion would have required the entire right-hand quad section to be deactivated too. Now, with Maxwell, Nvidia is able to use some smarts and still tap into both 32-bit memory controllers in that section (64 bits in total) and their associated DRAM even though the final L2 is disabled. In other words, compared to previous generations, it can preserve more of the performance architecture even though a key part of a quad is purposely left out. This is good engineering.
But while it's still accurate to say the GeForce GTX 970 has a 256-bit bus through to a 4GB framebuffer - the memory controllers are all active, remember - cutting out some of the L2 but keeping all the MCs intact causes other problems. There is no eighth L2 to access, as there usually would be, meaning that the seventh L2 would be hit twice. The way in which the L2s work makes this a very undesirable exercise, Alben explains, because it forces all the other L2s to operate at half their normal speed.
Smoke and mirrors
Finally coming back to the point, Nvidia gets around this L2 problem by splitting the 4GB of memory into a regular 3.5GB section, made up of seven MCs and their associated DRAM, and a 0.5GB section for the last memory controller. The company could have marketed the GeForce GTX 970 as a 3.5GB card, or even deactivated the entire right-hand quad and used a 192-bit memory interface allied to 3GB of memory, but chose not to do so. How does this square the huge memory-bandwidth drop-off in the Lazygamer Nai test with Nvidia's statement that games barely suffer from this smart engineering? Above the 3.5GB mark, the Lazygamer test simply probes bandwidth on a single DRAM, which is admittedly low, at 1/8th of the total speed, while in-game code, according to Nvidia, doesn't pinpoint memory in this way. There's certainly a memory-bandwidth drop-off when the 0.5GB section is called into action, Alben states, but it's nowhere near as severe if nonrecurring code is shunted into the last MC.
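To put rough numbers on the 3.5GB/0.5GB split, here is a simple peak-rate sketch. It assumes a 7Gbps effective GDDR5 data rate per pin, a figure that does not come from the text above, and treats each of the eight 32-bit MCs as owning an equal 0.5GB share of the 4GB. It is a back-of-the-envelope model of peak rates only, not a description of how the driver actually schedules accesses.

```python
# Assumption (not stated in the article): 7Gbps effective data rate per pin.
DATA_RATE_GBPS = 7.0
MC_WIDTH_BITS = 32   # each memory controller is 32 bits wide
TOTAL_MCS = 8        # 8 x 32-bit = 256-bit bus
GB_PER_MC = 0.5      # 4GB spread evenly across eight controllers


def segment(num_mcs):
    """Return (capacity in GB, peak bandwidth in GB/s) for a group of MCs."""
    capacity_gb = num_mcs * GB_PER_MC
    bandwidth_gbs = num_mcs * MC_WIDTH_BITS * DATA_RATE_GBPS / 8  # bits to bytes
    return capacity_gb, bandwidth_gbs


fast = segment(7)           # the regular section behind seven MCs
slow = segment(1)           # the section behind the last MC
total = segment(TOTAL_MCS)  # all eight MCs taken together

for name, (cap, bw) in (("fast", fast), ("slow", slow), ("total", total)):
    print(f"{name}: {cap}GB at up to {bw:.0f}GB/s")

print(f"slow section share of total bandwidth: {slow[1] / total[1]}")
```

Under that assumption the single-MC section peaks at one eighth of the full-bus figure, matching the 1/8th ratio quoted for the synthetic test.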
In a nutshell, Nvidia is using smart engineering to get the most out of the GTX 970's architecture. The reduced ROP count is relatively unimportant because this GPU could not make use of the extra ROPs anyway - the 13 SM units, each running at four pixels per clock (so 52 in total), limit the GPU more than the 56 pixels per clock the ROPs can process. The GeForce GTX 970's performance hasn't changed, obviously, but Nvidia wasn't clear on how the back-end works... and it has taken investigation by enthusiasts to uncover the real reason why this 256-bit architecture isn't as good as the GTX 980's.
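The ROP argument boils down to taking the smaller of two per-clock rates. A minimal sketch, using only the figures quoted above (13 SMs at four pixels per clock each, 56 pixels per clock through the ROPs), with hypothetical names chosen for illustration:

```python
PIXELS_PER_SM_PER_CLOCK = 4  # per the article


def pixel_rate(active_sms, rop_pixels_per_clock):
    """Return (SM pixel rate, effective pixel rate) in pixels per clock.

    The effective rate is capped by whichever stage is slower.
    """
    sm_rate = active_sms * PIXELS_PER_SM_PER_CLOCK
    return sm_rate, min(sm_rate, rop_pixels_per_clock)


sm_rate, effective = pixel_rate(active_sms=13, rop_pixels_per_clock=56)
bottleneck = "SMs" if sm_rate < 56 else "ROPs"
print(f"SMs: {sm_rate} px/clk, ROPs: 56 px/clk, limited by {bottleneck}")
```

With 52 pixels per clock coming out of the SMs, the 56 the ROPs can handle is never the limiting factor.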