Traditional
Unified Shader
If you have a screen render that fits into the existing pipeline in "4 cycles", with a single pass for each cycle in the rendering stage as shown in the diagram, then increasing the number of shaders doesn't change anything. The spare capacity doesn't help. A low-FSAA, low-AA 1280x1024 frame can "fit in" the "4 cycle" path, a single pass for each stage.
If you have a 1920x1200 scene with 16x FSAA and 16x AA, then a screen render will require more than one pass through each stage.
In instance A, clock speed will get you faster FPS; more shaders don't help much.
In instance B, increasing the shader count means more can be done in each pass, meaning fewer passes, ultimately getting down to a single pass through each stage. Here the gains come from the extra shaders in addition to the increased clocks.
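That mental model can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not of how real hardware schedules work; the shader counts, per-pass capacities, and clock figures are made-up numbers:

```python
import math

def frame_time(total_work, work_per_shader, num_shaders, clock_ghz):
    """Mental model: a frame needs some number of passes through the
    pipeline, and each pass costs a fixed number of cycles."""
    work_per_pass = work_per_shader * num_shaders
    passes = math.ceil(total_work / work_per_pass)
    cycles_per_pass = 4  # the "4 cycle" path from the diagram
    return passes * cycles_per_pass / (clock_ghz * 1e9)

# Instance A: the frame already fits in one pass -> extra shaders sit idle
assert frame_time(1_000, 10, 128, 1.35) == frame_time(1_000, 10, 256, 1.35)

# Instance B: a heavy frame needs several passes -> extra shaders cut passes
assert frame_time(10_000, 10, 256, 1.35) < frame_time(10_000, 10, 128, 1.35)
```

With the light frame, doubling the shaders changes nothing because the pass count is already 1; with the heavy frame, doubling them halves the pass count.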
That's how I've always understood it. If there is a fallacy in the logic... let me know.
No, no, no... you've understood it wrong. In your image, where it says "shader core", that's not one shader processor; it's the entire shader array. The next stage can be calculated in any available ALU within the core. To explain this simply I'll use the G80 as an example, since its SPs are fully scalar. The R600 is more complicated because it needs some pre-arrangement, but it works the same way in the sense that the next stage of the same fragment, or the next fragment within the same stage, can be calculated in the next available unit. That just means you can do A -> B -> C -> D for one fragment, or calculate several pixels in stage A together and then continue. The latter is how they work nowadays.
Example: the G80 GTX has 128 SPs. Imagine you want to calculate vertex data. A vertex is represented by x, y and z coordinates, each one a floating-point variable. Say vertex 1 is V1(x1, y1, z1), vertex 2 is V2(x2, y2, z2)... vertex n is Vn(xn, yn, zn). In the SP core (of 128), each dimension can be calculated in one ALU, which belongs to one SP. (There's controversy here, as Nvidia said each SP is capable of 2 ops per clock, but it seems it can't.)
It works like this:
clock cycle 1 : sp1 runs x1 - sp2 runs y1 - sp3 z1 - sp4 x2 - sp5 y2 - ... - sp127 x43 - sp128 y43 <<< as you can see V43 is not finalized yet (128 SPs cover 42 full vertices plus two components of the 43rd), but it doesn't matter, because:
clock cycle 2 : sp1 z43 - sp2 x44 - ...
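That round-robin assignment can be reproduced with a tiny script. This is a sketch of the scheme described above, not of the real dispatch hardware:

```python
def schedule(num_sps, num_vertices):
    """Assign one scalar component (x, y or z of some vertex) to each SP,
    filling SPs in order, cycle after cycle."""
    components = [(v, d) for v in range(1, num_vertices + 1) for d in "xyz"]
    return [components[i:i + num_sps]
            for i in range(0, len(components), num_sps)]

cycles = schedule(num_sps=128, num_vertices=1000)
print(cycles[0][126], cycles[0][127])  # sp127, sp128 in cycle 1: (43, 'x') (43, 'y')
print(cycles[1][0], cycles[1][1])      # sp1, sp2 in cycle 2:     (43, 'z') (44, 'x')
```

Vertex 43 is split across the cycle boundary, exactly as in the worked example.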
And so on. Imagine we have a core with 64 SPs running at twice the clock speed. The result: the throughput (GFLOPS) is exactly the same, and thus the code gets calculated just as fast. The same goes for 256 SPs running at half the speed. There won't be any idle SP at any time, unless:
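A quick arithmetic check of that equivalence (1.35 GHz is the G80 GTX shader clock; the workload size is an arbitrary illustration):

```python
import math

def frame_compute_time(num_components, num_sps, clock_hz):
    """One scalar component per SP per clock: cycles needed, divided by clock."""
    cycles = math.ceil(num_components / num_sps)
    return cycles / clock_hz

n = 300_000  # 100,000 vertices x 3 components
t128 = frame_compute_time(n, 128, 1.35e9)
t64  = frame_compute_time(n,  64, 2.70e9)   # half the SPs, double the clock
t256 = frame_compute_time(n, 256, 0.675e9)  # double the SPs, half the clock
assert math.isclose(t128, t64) and math.isclose(t128, t256)
```

SPs x clock is what fixes the throughput, which is why the three configurations finish in the same time.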
A: It can't fetch enough data from the memory pool or the frame buffer, whatever the reason for this: other units are slow, not enough data sent by the CPU...
B: The unit that has to continue the work, i.e. the ROPs, can't keep up and has signalled the SPs to stop, because the frame buffer is full of unprocessed data.
You can mix data types in the above example too, as long as they don't belong to the same cluster (I think). G80 and G92 have clusters of 16 SPs: the GTX and the G92 GTS have 8 clusters (8x16=128), the GT has 7. I don't think different data types are allowed within the same cluster, but I wouldn't bet a leg on it either...
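For reference, the cluster counts mentioned above work out like this (card labels are my shorthand for the models named in the post):

```python
SPS_PER_CLUSTER = 16
cards = {"GTX (G80)": 8, "GTS (G92)": 8, "GT": 7}
for card, clusters in cards.items():
    # total SP count = clusters x SPs per cluster
    print(card, clusters * SPS_PER_CLUSTER, "SPs")
# prints 128 SPs for the 8-cluster parts and 112 SPs for the 7-cluster GT
```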