No, it doesn't. Starting with the 3000 series, Huang decided to claim twice the number of shaders the cards actually had.
Not that Huang's claims needed anything of substance to support them (e.g. the 4000 series being "2-4 times faster"), but the formal reason was that the new shaders could do FP32+FP32, so they "should be counted as 2". It is still one shader (a dumb mini-CPU).
No, they are 2 separate units entirely. In fact, they are groups of units.
Where, on this, is a shader? Those listed cores don't have any scheduling logic like a CPU has; that's up in the warp scheduler. They have no instruction or data cache of their own either; that lives in the register file and instruction buffer (and L1, obviously).
From a CPU perspective, Nvidia has 4 cores per SM on Maxwell, Pascal (bar GP100), Turing, Ampere, and Lovelace.
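Side note: you can poke at this from software, but only so far. Here's a minimal sketch using the CUDA runtime API; note the API only reports SMs, and the FP32-units-per-SM figure below is my assumption from the architecture whitepapers, because the driver doesn't expose a "shader count" at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // FP32 units per SM, assumed from the whitepapers (not exposed by the
    // API): Maxwell/consumer Pascal = 128, Volta/Turing = 64,
    // consumer Ampere/Ada = 128. GP100/GA100 differ; ignored here.
    int fp32PerSM = (prop.major == 7) ? 64 : 128;
    printf("%s: %d SMs x %d FP32 units = %d \"shaders\"\n",
           prop.name, prop.multiProcessorCount, fp32PerSM,
           prop.multiProcessorCount * fp32PerSM);
    return 0;
}
```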
So, OK, we've established that shaders are not similar to CPU cores. On Maxwell here, a shader is a unit that can do FP32 math; it's an FPU. How do we know it's just FP32 math? Well, the load/store units there would be akin to an AGU: they request and store data, which does involve math, but not the same type of math. And the Special Function Units handle more complex math like sine and reciprocal square root, transcendentals that run on hardware other than the standard FP32 pipes.
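To make that concrete, here's a toy kernel (names made up for illustration) where each line maps to a different unit in the diagram:

```cuda
// Illustrative only: mapping source lines to the hardware units above.
__global__ void unit_demo(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer address math
    if (i < n) {
        float x = in[i];            // LD/ST unit: load from memory
        float y = x * 2.0f + 1.0f;  // FP32 "shader": one fused multiply-add
        out[i] = __sinf(y);         // SFU: hardware fast-path sine intrinsic
    }
}
```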
So Nvidia uses FP32 units for its shader counts, great. Now let's address your claim that Ampere uses 2 paired FPUs to reach the 128 shaders per SM number (note that this is the same count Maxwell and gaming Pascal reach).
As you can see, the general structure remains the same, though there are changes carried over from Turing (fewer SFUs, LD/ST units closer to the cache, the cache moved locations along with the SFUs, etc.).
As you can also see, there are 2 datapaths in each SM subsection, 8 in total. One of these datapaths only does FP32 math, while the other can do both, and it does both by having INT32 and FP32 units (ALUs and FPUs) on the same scheduling system, pairing them together but keeping the actual hardware units separate (likely for ease of design or for lower power draw, possibly to ease the next issue I'll talk about). Prior to Turing, all shaders were set up like this, which caused some issues with context switching: the SM would take a few cycles (upwards of 10) to swap to an integer instruction and then another few cycles to swap back to FP instructions.
With Ampere they decided to keep the split structure from Turing but added in more FP-only units. Since only around 30% of instructions are integer, it makes sense to have only 1/3 of the SM be ALUs rather than the 1/2 Turing has. Ideally they'd split INT off into its own datapath again, but it's possible that the extra die area incurred from such a move makes the economics infeasible, or that those INT units mostly handle work the FPUs would stall on anyway, so it wouldn't gain much performance.
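For a feel of where that ~30% integer work comes from even in "pure FP" code, take a bog-standard grid-stride SAXPY (illustrative, not from any real codebase): the index math, the bounds check, and the stride add are all INT32 instructions, which Turing/Ampere can issue on the second datapath while the FP32 pipe keeps chewing through FMAs.

```cuda
__global__ void saxpy_strided(float a, const float* x, float* y, int n) {
    int stride = gridDim.x * blockDim.x;                 // INT32 multiply
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 index math
         i < n;                                          // INT32 compare
         i += stride) {                                  // INT32 add
        y[i] = a * x[i] + y[i];                          // FP32 FMA
    }
}
```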
AMD, meanwhile, defines them similarly:
This is a Vega 20 compute unit, or CU, as found in the MI50 and MI60 datacenter cards.
They specifically call out Vector ALUs, or groups of FPUs and ALUs all paired together on the same datapath. Vega splits them up into subunits: 4 units of 16 lanes each, and 2 major units of 32 each. This was changed in RDNA, which merged 2 of the 16-wide units into a single 32-wide one and then grouped those into sets of 4 (the WGPs).
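Writing that arithmetic out (a quick host-side sketch; the unit counts are as I read AMD's block diagrams):

```cuda
#include <cstdio>

int main() {
    int vega_cu  = 4 * 16;       // GCN/Vega CU: 4 SIMD16 units = 64 shaders
    int rdna_cu  = 2 * 32;       // RDNA CU: 2 SIMD32 units = 64 shaders
    int rdna_wgp = 2 * rdna_cu;  // WGP: 2 CUs -> 4 SIMD32 = 128 shaders
    printf("Vega CU: %d | RDNA CU: %d | RDNA WGP: %d\n",
           vega_cu, rdna_cu, rdna_wgp);
    return 0;
}
```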
But, still, the definitions are the same: 1 shader = 1 32-bit FPU.
Also: all shaders can do 2 floating point operations per clock cycle. All of them. That's because each one can issue a fused multiply-add (FMA) every clock, and an FMA counts as two FLOPs (one multiply plus one add). That's why the formula to calculate TFLOPS is Shaders * 2 * Clockspeed.
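Quick sanity check of that formula in code; the card numbers are from memory (roughly an RTX 3080: 8704 shaders at a ~1.71 GHz boost), so treat them as illustrative:

```cuda
#include <cstdio>

int main() {
    double shaders       = 8704;  // FP32 units ("CUDA cores")
    double clock_ghz     = 1.71;  // boost clock
    double flops_per_clk = 2.0;   // one FMA = 1 multiply + 1 add
    double tflops = shaders * flops_per_clk * clock_ghz / 1000.0;
    printf("%.1f TFLOPS FP32\n", tflops);  // ~29.8, matching the spec sheet
    return 0;
}
```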