We actually have some pretty interesting benches to extrapolate from, to see what kind of options switch 2 devs will be looking at.
We have Nvidia-provided 'ballpark' benchmarks from Nvidia's dlss development guides for dlss 2 and 3.1.blah.blah.
And now we have a 4090 benchmark for dlss 3.5 through Nsight.
This shows a pretty interesting improvement curve for dlss's fixed execution time: roughly 20-30% jumps for each bench. (Jumps generally get bigger the higher the input res, some close to 40%.)
I'm going to bring this up now, because I don't think the person this setup was intended for would ever actually make it to the point where I could use it, but that 0.2 ms execution time for dlss 3.5 wasn't actually quite for 4k; it was for 1440 ultrawide, or 3440 x 1440. That's more pixels than standard 1440p, but only about 60% of 4k's pixel count.
So it's an easy math problem to solve with acceptable accuracy: 4k has roughly 1.67x the pixels, so 0.2 ms x 1.67 ≈ 0.33 ms, call it ~0.32 ms. Which falls neatly in line with the dlss improvement pattern, with the previous bench (3.1.blah) being 0.51 ms.
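That pixel scaling can be sanity-checked in a couple of lines (a quick sketch; the 0.2 ms figure is from the Nsight bench above, and linear-with-output-pixels scaling is the assumption):

```python
# Pixel-count scaling for the 3440x1440 -> 4k extrapolation.
# Assumption: dlss execution time scales roughly linearly with output pixels.
uw_pixels = 3440 * 1440    # 4,953,600
uhd_pixels = 3840 * 2160   # 8,294,400

ratio = uhd_pixels / uw_pixels   # ~1.67x more pixels at 4k
est_4k_ms = 0.2 * ratio          # ~0.33 ms; rounded to ~0.32 ms in the text
print(round(ratio, 2), round(est_4k_ms, 2))
```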
Comparing benches for Ampere's 3090, we have 4k on 2.blah at 1.028 ms, and on 3.1.blah at 0.79 ms.
The pattern fits well enough.
So now we can start nailing down some associations.
The 4090 has 16384 cuda cores, the 3090 has 10496: a difference of 1.561x. The 4090's boost clock (used for benches) is 2.52 GHz, the 3090's is 1.695 GHz: a difference of 1.4867x, for a total difference of about 2.321x. (These are all rounded results.)
Let's ballpark test:
3090 cuda fp32 = 35.58 tflops; 35.58 x 2.321 = 82.58.
4090 cuda fp32 = 82.58 tflops. Dang good match.
Tensor sparse fp16:
3090 = 284.64 tflops; 284.64 x 2.321 = 660.65.
4090 = 660.6 tflops. Damn good. Peak theoretical is a good match.
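Both scaling checks in one quick script (the core counts, clocks, and tflops numbers are Nvidia's published specs; the cores-times-clock scaling model is the assumption being tested):

```python
# Spec-sheet scaling check: does (cuda cores x boost clock) predict the
# 4090's throughput from the 3090's? Specs are Nvidia's published figures.
cores_3090, clock_3090 = 10496, 1.695   # boost clock in GHz
cores_4090, clock_4090 = 16384, 2.52

scale = (cores_4090 / cores_3090) * (clock_4090 / clock_3090)

fp32_3090 = 35.58           # tflops, dense fp32
fp16_sparse_3090 = 284.64   # tflops, sparse tensor fp16

print(round(scale, 3))                     # ~2.321
print(round(fp32_3090 * scale, 1))         # ~82.6, vs the 4090's 82.58
print(round(fp16_sparse_3090 * scale, 1))  # ~660.6, vs the 4090's 660.6
```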
Testing on dlss performance results, with the known 3090 and 4090 benches provided by Nvidia for dlss 3.1.blah. Let's see how peak theoretical stacks up to real world performance.
Dlss 3.1.blah:
4090 4k = 0.51 ms
3090 4k = 0.79 ms
Only a 1.55x difference instead of that 2.321x difference. Huh. Peak theoretical over-promising for the high end, as expected. Had peak theoretical been real world accurate, the 3090 would take 1.18 ms to execute dlss (0.51 x 2.321). A ~1.49x offset in favor of the weaker hardware.
8k: 4090 = 1.97 ms vs 3090 = 2.98 ms. Only a 1.51x real world difference. Peak theoretical over-promising again. I should probably do this for each resolution and take an average, but whatever.
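The spec-vs-measured gap can be computed the same way at both resolutions (the benchmark times are the Nvidia figures quoted above; the "over promise" framing is this post's own speculation):

```python
# Compare the 2.321x spec-sheet scale against the measured dlss 3.1.x
# execution-time ratios between the 3090 and 4090.
spec_scale = 2.321

benches = {            # resolution: (4090 ms, 3090 ms), Nvidia's figures
    "4k": (0.51, 0.79),
    "8k": (1.97, 2.98),
}

for res, (t4090, t3090) in benches.items():
    real_scale = t3090 / t4090            # measured speedup, only ~1.5x
    over_promise = spec_scale / real_scale
    print(res, round(real_scale, 2), round(over_promise, 2))
```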
The ratio between peak theoretical and real use reminds me of the difference between the "double fp32" Ampere (and on) cuda peak theoretical and real world application: with typical ~30% integer use, you're left with only ~1.7x the fp32. This bodes well for less powerful hardware; it shows it won't get hit as hard in real world applications as the peak theoretical would make it seem. Although unlike the cuda core example, this is likely dlss on tensor cores being less beneficial the faster the cuda cores are.
Both of these cards use only a small fraction of the render time to perform dlss, leaving well over 90% of the frame time to the cuda cores. The tensor cores are barely utilized. This is great news for the question of "can a system with only 48 tensor cores clocked at 1 GHz use dlss well?"
So now let's apply this to the t239 ga10f gpu in the switch 2 at 1 GHz:
Ga10f = 1536 cuda cores and 48 tensor cores for 3.072 Tflops dense fp32, and 24.576 Tflops sparse Fp16 @ 1Ghz.
Grabbing our metrics from before:
3090 = 10496 cuda cores @ 1.695 Ghz
4090 = 16384 cuda cores @ 2.52 Ghz
GA10F=1536 Cuda cores @ 1 Ghz. (No boost for you, extra conservative speculation)
The 3090 has 6.83x the cuda cores and 1.695x the clock, for a total of 11.58x the ga10f.
The 4090 has 10.67x the cuda cores and 2.52x the clock, for 26.88x the ga10f.
Let's test:
3090: 35.58 tflops ÷ 11.58 = 3.072.
Ga10f: 3.072 tflops x 11.58 = 35.57. Good match.
4090: 82.58 tflops ÷ 26.88 = 3.072. Another good match for peak theoretical.
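Recomputing those scale factors a bit more precisely as a script (the GA10F core count and 1 GHz clock are the speculated t239 figures, not confirmed specs):

```python
# Scale factors from the GA10F (1536 cuda cores at a conservative 1 GHz)
# up to the 3090 and 4090, with a tflops cross-check.
cores_ga10f, clock_ga10f = 1536, 1.0   # speculated t239 config, no boost

scale_3090 = (10496 / cores_ga10f) * (1.695 / clock_ga10f)   # ~11.58x
scale_4090 = (16384 / cores_ga10f) * (2.52 / clock_ga10f)    # ~26.88x

print(round(35.58 / scale_3090, 3))   # ~3.072 -> the GA10F's 3.072 tflops
print(round(82.58 / scale_4090, 3))   # ~3.072
```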
So let's extrapolate dlss 4k execution times.
3090 was 0.79 ms: 0.79 ms x 11.58 = 9.15 ms.
4090 was 0.51 ms: 0.51 ms x 26.88 = 13.71 ms. But what about that 1.49x "over promise"? 13.71 ÷ 1.49 = 9.2 ms. Yeah, now that's a good match.
Does the 3090 also have an 'over promise ratio'? Maybe, but in order to get the data I need to confirm that, I would need to bench the ga10f dlss execution time, and if I could do that, I wouldn't be doing this.
So here we are. We are looking at roughly 9.2 ms to dlss from 1080p to 4k on dlss 3.1.blah.
We have a real world bench for a 4090 on dlss 3.5 at 0.2 ms (actually between 0.1 and 0.2) at 1440p ultrawide, extrapolated to 4k as ~0.32 ms. So: 0.32 x 26.88 = 8.6 ms. Apply the speculative "over promise" ratio? 8.6 ÷ 1.49 = 5.77 ms.
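Putting the whole extrapolation in one place (the scale factors and the ~1.49x correction come from the estimates above; every number here is speculation stacked on speculation):

```python
# Extrapolated GA10F 4k dlss execution times from the big-card benches,
# with and without the ~1.49x spec-vs-real correction.
scale_3090, scale_4090 = 11.58, 26.88
over_promise = 1.49   # speculative correction derived above

# dlss 3.1.x, 1080p -> 4k:
print(round(0.79 * scale_3090, 2))                 # from the 3090 bench, ~9.15 ms
print(round(0.51 * scale_4090 / over_promise, 2))  # from the 4090 bench, ~9.2 ms

# dlss 3.5, from the ~0.32 ms 4k estimate:
print(round(0.32 * scale_4090, 2))                 # uncorrected, ~8.6 ms
print(round(0.32 * scale_4090 / over_promise, 2))  # corrected, ~5.77 ms
```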
Either one of those is very promising. Most of dlss can also be run concurrently with the cuda cores, with few dependencies.
1440p also has roughly half the execution time cost, and an even smaller input resolution, meaning even more graphical bells and whistles can be stuffed into the frame, with better performance on top.
Because the tensor cores have been so underutilized on high end pc hardware, it looks like the switch 2 will actually have plenty of frame time to render as high fidelity a frame as it can with 1536 cuda cores, and still have plenty of frame time left to perform dlss on the tensor cores.
I'm feeling 1440p is going to be the sweet spot for most switch 2 games docked, which is a great upscale fit for 4k tvs.
Switch "ports"/upgrades (as in not just playing through bc, but the bigger non-switch version of the game)... will they do wii u ports again? Xenoblade X at last? And ps4 ports will likely be 4k, even 4k 60.
And there will of course be those standout 4k 60fps modern games, likely with a smart hybrid forward renderer.
Thinking about the other systems on the market, like the ps5, paints an interesting picture as well.
At 10 tflops, the ps5 is a bit more than 3x as powerful as the switch 2. But with dlss performance mode, the switch 2 only needs to render 1/4 of the pixels. Something interesting to look forward to for sure.
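A back-of-envelope version of that comparison (the ~10 tflops ps5 figure is a round number, and the 1/4 internal resolution of dlss performance mode is the assumption):

```python
# Raw-power gap vs render-resolution gap against the ps5.
ps5_tflops = 10.0       # rounded; actual spec is ~10.3
ga10f_tflops = 3.072    # speculated t239 at 1 GHz

power_gap = ps5_tflops / ga10f_tflops   # ~3.26x fp32 advantage for the ps5
pixel_gap = 4                           # 4k native vs 1080p internal (dlss performance)

print(round(power_gap, 2), pixel_gap)
```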
There will be no 4k 120fps games with anything resembling a modern game. Anyone who says anything like that should be laughed at.