I feel the re-release of the RTX 2060 with a slight bump in specs really made the current RTX 3060 look bad, despite the latter's much higher CUDA core count and more advanced node. Power consumption is generally very close, and while the RTX 3060 does offer better performance, the step up isn't fantastic, at least in my opinion. Ultimately, it is the pricing of this card that will do it in. Where I live, the RTX 2060 refresh costs a little more than an RX 6600 XT, and the latter tends to edge it out in most games. One can argue that the RTX 2060 benefits from DLSS, but I feel the RX 6600 XT may have a slightly longer runway, with a decent enough FSR to fall back on.
You can't compare a Turing CUDA core count to an Ampere CUDA core count.
A Turing "CUDA core" was made up of an integer block and an FP32 block. Technically, in a perfectly-designed and perfectly-scheduled workload, Turing could execute INT and FP32 simultaneously on the same "core".
Ampere's integer blocks can also run FP32, which means the two blocks that used to be counted as one "INT or FP32" CUDA core are now each counted as a CUDA core in its own right. That's why Nvidia seemed to double the core count in a single generation, when in reality all they did was extend their INT blocks slightly so they can run FP32 instead (not both at the same time). The downside is that only half of Ampere's CUDA cores can run integer math at all.
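To make the counting difference concrete, here's a minimal sketch (Python, purely illustrative and my own framing, not anything Nvidia publishes), using the commonly cited per-SM figures: 64 FP32 blocks + 64 INT blocks per Turing SM, versus 64 FP32 blocks + 64 FP32-or-INT blocks per Ampere SM.

```python
# Illustrative core-counting sketch, assuming the commonly cited per-SM layouts:
# Turing SM: 64 FP32 blocks + 64 INT blocks, counted as 64 "CUDA cores"
# Ampere SM: 64 FP32 blocks + 64 FP32-or-INT blocks, counted as 128 "CUDA cores"

def turing_cuda_cores(sm_count: int) -> int:
    # Each FP32 + INT pair counts as ONE core.
    return sm_count * 64

def ampere_cuda_cores(sm_count: int) -> int:
    # Each block counts as its own core, so the headline number doubles per SM.
    return sm_count * 128

print(turing_cuda_cores(34))  # RTX 2060 Super, 34 SMs -> 2176 cores (all can do INT)
print(ampere_cuda_cores(28))  # RTX 3060, 28 SMs       -> 3584 cores (half can do INT)
```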
So, in a completely hypothetical FP32-only situation, the 3060 has 3584 FP32-capable blocks compared to the 2060S's 2176. Clock for clock, the 3060 is about 65% better.
And, in a completely hypothetical INT-only situation, the 3060 has 1792 INT-capable blocks compared to the 2060S's 2176. Clock for clock, the 3060 is about 18% worse.
In reality, workloads are a mix of FP32 and INT, so the 3060 falls somewhere between about 18% worse and 65% better. A lot of gaming work is FP32, but you have to offset that 65% advantage because the 3060 also has fewer texture units and ROPs than the 2060S (roughly 18% and 25% fewer, respectively). Remember that Ampere's core counts were doubled just by tweaking the existing INT blocks: if the 3060 were counted by INT + FP32 block pairs like Turing, it would be a 1792-core part, and the reduced TMU, ROP, and L2 cache counts are the obvious side effect of having fewer SMs.
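A rough back-of-the-envelope sketch of that "somewhere between" claim (my own toy model, not a real performance estimate): assume per-clock throughput scales with the number of blocks that can issue a given instruction type, and ignore the TMU/ROP/cache differences entirely.

```python
# Toy per-clock throughput model: weight each card's FP32-capable and INT-capable
# block counts by the workload's FP32/INT mix. Ignores TMUs, ROPs, caches, clocks.

def weighted_blocks(fp32_blocks: int, int_blocks: int, fp32_fraction: float) -> float:
    # Blend of capacities for a workload that is fp32_fraction FP32, rest INT.
    return fp32_fraction * fp32_blocks + (1.0 - fp32_fraction) * int_blocks

RTX_3060  = dict(fp32_blocks=3584, int_blocks=1792)   # only half its "cores" do INT
RTX_2060S = dict(fp32_blocks=2176, int_blocks=2176)   # every "core" does both

for frac in (1.0, 0.75, 0.5, 0.0):
    ratio = (weighted_blocks(**RTX_3060, fp32_fraction=frac)
             / weighted_blocks(**RTX_2060S, fp32_fraction=frac))
    print(f"{frac:.0%} FP32: 3060 is {ratio:.2f}x the 2060S per clock")

# 100% FP32 -> ~1.65x, 0% FP32 -> ~0.82x; mixed workloads land in between.
```

The endpoints reproduce the two hypothetical cases above; real games sit somewhere in the middle, before you even account for the TMU/ROP deficit.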
TL;DR
Looking at matchups in reviews, it's fairly accurate to say that in the current suite of games people are testing, an Ampere CUDA core is only worth about two-thirds the performance of a similarly-clocked Turing CUDA core. If you want to compare on paper specs, a 3000-core Ampere card will roughly match a 2000-core Turing card. Simples!