Wednesday, September 13th 2023
Nintendo Switch 2 to Feature NVIDIA Ampere GPU with DLSS
Rumors about Nintendo's next-generation Switch handheld gaming console have been piling up ever since competition in the handheld console market intensified. Since the release of the original Switch, Valve has released the Steam Deck, ASUS made the ROG Ally, and others are also exploring the market. Now the next-generation Nintendo Switch 2 appears to be getting closer, as information has surfaced about the chipset that will power the device. Thanks to Kepler_L2 on Twitter/X, we have the codenames of the upcoming processors. The first-generation Switch came with NVIDIA's Tegra X1 SoC built on a 20 nm node. Later on, NVIDIA supplied Nintendo with a Tegra X1+ SoC made on a 16 nm node. There were no performance increases recorded, just improved power efficiency. Both of them used four Cortex-A57 and four Cortex-A53 cores with GM20B Maxwell GPUs.
For the Nintendo Switch 2, NVIDIA is said to utilize a customized variant of its Jetson Orin SoC for automotive applications. The reference Orin SoC carries the codename T234, while this alleged adaptation has the codename T239; this version is most likely optimized for power efficiency. The reference Orin design is a considerable uplift compared to the Tegra X1, as it boasts 12 Cortex-A78AE cores and LPDDR5 memory, along with the Ampere GPU microarchitecture. Built on Samsung's 8 nm node, its efficiency would likely yield better battery life and position the second-generation Switch well in the now expanded handheld gaming console market. Including the Ampere architecture would also bring technologies like DLSS, which would benefit the low-power SoC.
Sources:
@Kepler_L2, GitHub, via Tom's Hardware
118 Comments on Nintendo Switch 2 to Feature NVIDIA Ampere GPU with DLSS
You folks seem to think it takes a GeForce 4090 to do 4k well. It does not. 4k was being done, very well, on a GeForce GTX 1070 years ago. The current NVidia ARM SOCs are more than capable of it.
You also, somehow, don't seem to understand that those 6/7 year old gpu's were running what are now 6 to 10 year old games at the time they came out, and new games have actually kept coming out since then. Those gpus will NOT run modern games at 4k like they ran decade old games at 4k. Switch 2 will be a modern system focusing on MODERN games. Nobody is going to be impressed by it running games from 2010-2016 at 4k. Nobody cares, that's expected, people would be shocked if it couldn't. Also the 1070 didn't do 4k "very well", it typically did 30-40fps, which is.... serviceable. It did NOT do 4k 60, as the standard, as you are trying to imply.
A gtx 1070 is a 6 tflop (12 tera ops total considering 6 tflops fp32 and 6 tops int32) machine that took 150 watts to power. That's twice the cuda compute power of the ga10f in the t239 in the switch 2 at its likely 1 GHz clock speed, and 10 watt or less power draw.
Obviously, the t239 is waaaaaaaaaaayyy more powerful per watt than the gtx 1070, but it doesn't get to use anywhere NEAR the same amount of power draw as the gtx 1070's 150 watts, because it's a tegra, a mobile design, so in the end it's only half as powerful in cuda compute, and can run for hours on a battery, in a small contained enclosure without overheating, which the 1070 could never dream of.
So then, if the statements "the switch 2 only has half the cuda compute of the gtx 1070." AND the statement "the switch 2 will be running (not all, but it will run) more advanced games than the gtx 1070 could run at 4k, because it has waaaaaaayyy more compute" are both true (they are)....
What's the proprietary nvidia only hardware feature that satisfies both statements that did not exist back then for gp architecture like the 1070?
If you figure out what that is, you're back at my very first post you responded to, which lays out that exact math in detail, and you should have figured out why you should have never posted anything you did to begin with.
Ok, we're done.
We have nvidia provided 'ballpark' benchmarks from nvidias dlss development guides for dlss 2 and 3.1.blah.blah
And now we have a 4090 benchmark for dlss 3.5 through Nsight.
This shows a pretty interesting improvement scale of the dlss fixed execution time of about 20-30% improvement jumps for each bench. (Jumps generally get higher the higher the input res, some close to 40%)
I'm going to bring this up now, because I don't think the person this setup was intended for would ever actually make it to the point where I could use it, but that 0.2 ms execution time for dlss 3.5 wasn't actually quite for 4k, it was for 1440p ultrawide, or 3440 x 1440. Which is more pixels than standard 1440p, but only about 60% of the pixels of 4k.
So it's an easy math problem to solve with acceptable accuracy. 0.2 ms * 1.6 == around 0.32 ms. Which falls neatly in line with the dlss improvement pattern, with the previous bench (3.1.blah) being 0.51 ms.
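The scaling step above can be sketched in a few lines. This is only an illustration and assumes, as the post does, that DLSS execution time grows linearly with output pixel count; the function names are made up for the example.

```python
# Sketch: scale a DLSS execution time by output pixel count.
# Assumption (the post's simplification): cost is linear in output pixels.

def pixels(width: int, height: int) -> int:
    """Total pixel count of a resolution."""
    return width * height


def scale_time_ms(time_ms: float, src: tuple, dst: tuple) -> float:
    """Scale a time measured at resolution src to resolution dst."""
    return time_ms * pixels(*dst) / pixels(*src)


ULTRAWIDE_1440P = (3440, 1440)
UHD_4K = (3840, 2160)

ratio = pixels(*UHD_4K) / pixels(*ULTRAWIDE_1440P)
estimate_ms = scale_time_ms(0.2, ULTRAWIDE_1440P, UHD_4K)

print(f"4k has {ratio:.2f}x the pixels of 3440x1440")
print(f"0.2 ms at 3440x1440 -> about {estimate_ms:.2f} ms at 4k")
```

The exact pixel ratio comes out slightly above the rounded 1.6 used in the post, so the estimate lands a little over 0.32 ms, which does not change the conclusion.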
Comparing benches for Ampere's 3090, we have 4k on 2.blah at 1.028 ms, and on 3.1.blah at 0.79 ms.
The pattern fits well enough.
So now we can start nailing down some associations.
The 4090 has 16384 cuda cores, the 3090 has 10496, a difference of 1.561x. The 4090's boost clock (used for benches) is 2.52 GHz, the 3090's is 1.695 GHz, a difference of 1.4867x, for a total difference of about 2.321x. (These are all rounded results)
Let's ballpark test:
Cuda 3090 tflops = 35.58, X 2.321= 82.58
Cuda 4090 Tflops = 82.58 Tflops. Dang good match.
Tensor sparse fp16
3090 = 284.64 * 2.321 = 660.649
4090 = 660.64 ÷ 2.321 = 284.6359
Damn good. Peak Theoretical is a good match.
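For anyone who wants to re-run the ballpark check, here is a minimal sketch. It assumes the usual peak-FP32 convention of 2 FLOPs per cuda core per clock (one fused multiply-add); the core counts and boost clocks are the ones quoted above, and the function name is just for illustration.

```python
# Sketch: peak theoretical FP32 throughput from core count and clock.
# Assumption: 2 FLOPs per cuda core per clock (one fused multiply-add).

def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPS = cores * 2 FLOPs * clock (GHz) / 1000."""
    return cuda_cores * 2 * clock_ghz / 1000.0


rtx3090 = fp32_tflops(10496, 1.695)
rtx4090 = fp32_tflops(16384, 2.52)

# Combined scale factor: core-count ratio times clock ratio.
scale = (16384 / 10496) * (2.52 / 1.695)

print(f"3090: {rtx3090:.2f} TFLOPS")
print(f"4090: {rtx4090:.2f} TFLOPS")
print(f"4090 / 3090 scale factor: {scale:.3f}")
```

Because peak TFLOPS is just cores times clock times a constant, the scale factor and the TFLOPS ratio are the same number by construction, which is why the match looks so clean.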
Testing on dlss performance results, with known 3090 and 4090 performance benches provided by nvidia for the 3.1.blah dlss. Let's see how peak theoretical stacks up to real world performance.
Dlss 3.1.blah:
4090 4k = 0.51 ms
3090 4k = .79 ms
Only a 1.55x difference instead of that 2.321x difference. Huh. Peak Theoretical over promising for the high end, as expected. Had peak theoretical been real world accurate, the 3090 would take 1.18 ms to execute dlss. A 1.49x offset in favor of the weaker hardware.
8k: 4090 = 1.97 vs 3090 = 2.98. Only 1.51x real world performance difference. Peak Theoretical over promising again. I should probably do this for each resolution and take an average, but whatever.
The ratio between Peak theoretical and real use reminds me of the difference between the "Double fp32" ampere (and on) Cuda Peak theoretical, and real world application with typical 30% integer use, leaving only 1.7X fp32. This bodes well for less powerful hardware, it shows it won't get hit as hard in real world applications as the peak theoretical would make it seem. Although unlike the cuda core example, this is likely dlss on tensor cores being less beneficial the faster the cuda cores are.
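The "over promise" offset can be reproduced from the timings quoted above. This is a rough sketch: the 2.321x figure and the millisecond values are taken from the post as-is, and the variable names are illustrative.

```python
# Sketch: compare the peak-theoretical ratio with measured DLSS timings.
# All inputs are the figures quoted in the post (DLSS 3.1.x, 4090 vs 3090).

THEORETICAL_RATIO = 2.321  # 4090 over 3090, peak theoretical

measured_4k = 0.79 / 0.51  # 3090 time / 4090 time at 4k
measured_8k = 2.98 / 1.97  # 3090 time / 4090 time at 8k

# How much the peak-theoretical ratio overstates the measured gap.
over_promise_4k = THEORETICAL_RATIO / measured_4k
over_promise_8k = THEORETICAL_RATIO / measured_8k

print(f"4k: measured {measured_4k:.2f}x, over-promise {over_promise_4k:.2f}x")
print(f"8k: measured {measured_8k:.2f}x, over-promise {over_promise_8k:.2f}x")
```

Two data points are not much to average over, so the 1.49x offset used later should be read as a ballpark, not a measured constant.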
Both of these are barely using a fraction of the render time to perform dlss, leaving well over 90% of the render time to the cuda cores. The tensor cores are barely utilized. This is great news for the question of 'can a system with only 48 tensor cores clocked at 1ghz use dlss well'.
So now let's apply this to the t239 ga10f gpu in the switch 2 at 1 GHz:
Ga10f = 1536 cuda cores and 48 tensor cores for 3.072 Tflops dense fp32, and 24.576 Tflops sparse Fp16 @ 1Ghz.
Grabbing our metrics from before:
3090 = 10496 cuda cores @ 1.695 Ghz
4090 = 16384 cuda cores @ 2.52 Ghz
GA10F=1536 Cuda cores @ 1 Ghz. (No boost for you, extra conservative speculation)
3090 has 6.83 X the cuda cores, and 1.695x the clock, for a total of 11.577X the ga10f.
4090 has 10.65 X the cuda cores and 2.52 X the clock, for 26.838 X the GA10F.
Let's test:
3090: 35.58 tflops ÷ 11.577 = 3.073.
Ga 10f 3.072 tflops * 11.577 = 35.56. Good match.
4090 = 82.58 tflops ÷ 26.838 = 3.076. Another good match for peak theoretical.
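The same comparison against the GA10F can be sketched with unrounded core ratios. The `Gpu` class below is hypothetical and exists only for this example; the 1 GHz GA10F clock is the post's conservative guess, not a confirmed spec.

```python
# Sketch: peak-theoretical ratios of the 3090 and 4090 against the GA10F.
# Assumptions: 2 FLOPs per core per clock; GA10F at a speculative 1 GHz.
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    cuda_cores: int
    clock_ghz: float

    @property
    def fp32_tflops(self) -> float:
        return self.cuda_cores * 2 * self.clock_ghz / 1000.0


GA10F = Gpu("GA10F", 1536, 1.0)
RTX3090 = Gpu("RTX 3090", 10496, 1.695)
RTX4090 = Gpu("RTX 4090", 16384, 2.52)


def ratio(big: Gpu, small: Gpu) -> float:
    """How many times more peak FP32 throughput big has than small."""
    return big.fp32_tflops / small.fp32_tflops


for gpu in (RTX3090, RTX4090):
    print(f"{gpu.name}: {gpu.fp32_tflops:.2f} TFLOPS, "
          f"{ratio(gpu, GA10F):.3f}x the GA10F")
```

With unrounded inputs the ratios come out marginally higher than the 11.577x and 26.838x above (the 4090 core ratio is closer to 10.67 than 10.65), a difference well inside the noise of this kind of estimate.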
So let's extrapolate dlss 4k execution times.
3090 was 0.79 ms: 0.79 ms X 11.577 = 9.146 ms.
4090 was 0.51 ms: 0.51 ms X 26.838 = 13.687 ms. But what about that 1.49X "over promise?" 13.687 ÷ 1.49 = 9.18 ms. Yeah, now that's a good match.
Does the 3090 also have an 'over promise ratio'? Maybe, but in order to get the data I need to confirm that, I would need to bench the ga10f dlss execution time, and if I could do that, I wouldn't be doing this.
So here we are. We are looking at roughly 9.1 to 9.2 ms to dlss from 1080p to 4k for dlss 3.1.blah.
We have a real world bench for a 4090 on dlss 3.5 being .2 ms (actually between 0.1 to 0.2) with 1440p ultrawide, and extrapolated to 4k from that with .32 ms, so, .32 X 26.838 = 8.588 ms. Speculate the "overpromise" ratio? 8.588 ÷ 1.49 = 5.76 ms?
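Putting the extrapolations side by side, as a rough sketch: the ratios, timings, and the 1.49x "over promise" factor are all the post's own numbers, and the factor in particular is speculation, not a measurement. The frame-budget share assumes DLSS runs serially with the rest of the frame, which is pessimistic if much of it overlaps with cuda work.

```python
# Sketch: extrapolate GA10F DLSS execution times from the figures in the post.
# Assumptions: linear scaling with peak throughput; OVER_PROMISE is speculative.

RATIO_3090 = 11.577   # 3090 peak FP32 relative to GA10F @ 1 GHz (post's figure)
RATIO_4090 = 26.838   # 4090 peak FP32 relative to GA10F @ 1 GHz (post's figure)
OVER_PROMISE = 1.49   # how much peak theoretical overstated the 4090's lead


def naive_ms(measured_ms: float, ratio: float) -> float:
    """Scale a measured time by the raw peak-theoretical ratio."""
    return measured_ms * ratio


def adjusted_ms(measured_ms: float, ratio: float,
                over_promise: float = OVER_PROMISE) -> float:
    """Same as naive_ms, then divided by the over-promise factor."""
    return naive_ms(measured_ms, ratio) / over_promise


estimates = {
    "DLSS 3.1, from 3090 (naive)": naive_ms(0.79, RATIO_3090),
    "DLSS 3.1, from 4090 (adjusted)": adjusted_ms(0.51, RATIO_4090),
    "DLSS 3.5, from 4090 (naive)": naive_ms(0.32, RATIO_4090),
    "DLSS 3.5, from 4090 (adjusted)": adjusted_ms(0.32, RATIO_4090),
}

FRAME_60FPS_MS = 1000 / 60  # about 16.67 ms per frame at 60 fps

for label, ms in estimates.items():
    share = 100 * ms / FRAME_60FPS_MS
    print(f"{label}: {ms:.2f} ms ({share:.0f}% of a 60 fps frame)")
```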
Either one of those is very promising. Most of dlss can be run concurrently, with few dependencies with cuda cores.
1440p also has roughly half the execution time cost, and an even smaller input resolution, meaning even more graphical bells and whistles can be stuffed in the frame and better performance on top.
Because the tensor cores have been so very under utilized on high end pc hardware, it looks like switch 2 will actually have plenty of frame time, to render as high fidelity a frame as it can with 1536 cuda cores, and plenty of frame time to perform dlss with tensor cores.
I'm feeling 1440p is going to be the majority sweet spot for switch 2 docked, which is a great upscale fit for 4k tv's.
Switch 'ports'/upgrades (as in not just playing through bc, but the bigger non switch version of the game)... wii u ports... will they do that again? Xenoblade x at last? And ps4 ports will likely be 4k, and even 4k 60.
And there will of course be those standout 4k 60fps modern games, likely with a smart hybrid forward renderer.
Thinking about the other systems on the market like the ps5, brings up an interesting picture as well.
At 10 tflops, the ps5 is a bit more than 3x more powerful than the switch 2. But with dlss performance, the switch 2 only needs to render at 1/4th the resolution. Something interesting to look forward to for sure.
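To put a number on that comparison, here is a deliberately crude sketch of compute per rendered pixel. It assumes the PS5 renders native 4k (many games do not), uses the post's figures of 10 and 3.072 tflops, and ignores the cost of DLSS itself, memory bandwidth, and architectural differences.

```python
# Sketch: peak TFLOPS available per rendered megapixel.
# Assumptions: PS5 at native 4k; Switch 2 rendering 1080p internally
# (DLSS performance mode to 4k); DLSS cost and bandwidth are ignored.

PS5_TFLOPS = 10.0        # figure used in the post
SWITCH2_TFLOPS = 3.072   # GA10F at a speculative 1 GHz


def megapixels(width: int, height: int) -> float:
    return width * height / 1e6


compute_ratio = PS5_TFLOPS / SWITCH2_TFLOPS
pixel_ratio = megapixels(3840, 2160) / megapixels(1920, 1080)

ps5_per_mp = PS5_TFLOPS / megapixels(3840, 2160)
switch2_per_mp = SWITCH2_TFLOPS / megapixels(1920, 1080)

print(f"PS5 has {compute_ratio:.2f}x the compute, renders {pixel_ratio:.0f}x the pixels")
print(f"PS5 at native 4k:          {ps5_per_mp:.2f} TFLOPS per megapixel")
print(f"Switch 2 at 1080p internal: {switch2_per_mp:.2f} TFLOPS per megapixel")
```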
There will be no 4k 120fps games with anything resembling a modern game. Anyone who says anything like that should be laughed at.
You guessed that gpu's with hardware dedicated to hardware accelerated ray tracing would be good at ray tracing.
That's so good..... for you.
The GA10F does NOT have the hardware necessary to accelerate Nvidia's dlss 3 frame generation, which would be Ada's extra big, extra beefy OFA (optical flow accelerator).
DLSS enables higher and more stable frame rates plus better quality upscaling, that's far more meaningful for console players and will likely be utilised on every Nintendo developed game on the switch 2.
Also just because the odd game markets it, it's not representative of the market as a whole. Sony and Microsoft barely mention it, and I expect Nintendo won't mention it at all as their consoles are not about cutting edge visuals. The main talked about feature on launch of PS5 and Xbox series is their support for VRR and higher frame rates. For switch it was about the portability, the detachable controllers, its dock etc.
Can see the compromises that had to be made on the Zelda switch game. DLSS will open things up for Nintendo.
For reference I got the 3080 instead of AMD primarily for two reasons, SGSSAA (drivers) and cost. This was in the middle of the price gouging, Nvidia was selling FE in the UK at MSRP, and AMD wasn't. If AMD add SGSSAA my next GPU is an AMD, I take extra VRAM over dedicated RT any day of the week.
From a 3050 TPU review I can see the 3050 gets an average of 30-35fps at 4k highest settings in a wide range of AAA games from the last 5 years or so. Assuming you can double the fps with lower settings & then get another 50% with DLSS that puts the 3050 in the 90fps range. So even with the ambitious assumption that the Switch 2 will have half the performance of the 3050, we're still only talking in the 45fps range at 4k with lowered settings.
Are any of my assumptions or rough maths majorly flawed anywhere here?
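The rough maths in that estimate, written out as a sketch. The multipliers (2x from lower settings, 1.5x from DLSS, half the performance of a 3050) are the poster's assumptions, not measurements, and the function name is illustrative.

```python
# Sketch: chain the poster's assumed multipliers onto a baseline 4k frame rate.
# Every multiplier here is an assumption from the comment, not a benchmark.

def estimate_fps(base_fps: float, settings_gain: float = 2.0,
                 dlss_gain: float = 1.5, relative_perf: float = 1.0) -> float:
    """Baseline fps times gains from lower settings, DLSS, and relative GPU speed."""
    return base_fps * settings_gain * dlss_gain * relative_perf


# RTX 3050 baseline of 30-35 fps at 4k, highest settings.
low_3050, high_3050 = estimate_fps(30), estimate_fps(35)

# Switch 2 assumed to have half the performance of the 3050.
low_sw2 = estimate_fps(30, relative_perf=0.5)
high_sw2 = estimate_fps(35, relative_perf=0.5)

print(f"RTX 3050: {low_3050:.0f}-{high_3050:.0f} fps")
print(f"Switch 2: {low_sw2:.0f}-{high_sw2:.1f} fps")
```

The result is only as good as the multipliers: DLSS gains in particular depend heavily on the input resolution and on how GPU-bound the game is.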
Ga10f:
One 12 SM ampere rtx gpc: 6 TPCs,
6 Polymorph engines,
1,536 cuda cores,
48 Tensor cores,
48 TMU's,
16 Rops,
12 Ray trace cores.
(Lapsus$ ransom attack, nvn2 graphics api dump)
2 LPDDR5 ram blocks, dual channel, 128 bit bus, standard 102 GB/s.
(Commercial shipping manifest product description for t239, being shipped to nvidia india)
Capacity not confirmed, rumored to be 12 GB for retail unit and 16 GB for dev sdk.
Likely downclocked to 1-1.3 GHz docked, 500-650 MHz portable (Lapsus$ ransom attack nvn2 clock speed profiles. There was one higher but that's likely just a stress test)
Cpu: Arm 1 cluster, 8 cores. (Nvidia employee updating public Linux for tegra.)
Almost certainly an a78c. Probably 2 or 2.5 GHz. That's about the long and short of it. There will be the standard console boons, closed environment, no pc overhead. Horizon is a very lean os.
Nintendo will almost assuredly strip ptx from the shader compiles again, for just the cubins and a nice little performance boost.
And NVN, or I guess nvn2 now is a very low level api which gains another nice little performance boost.
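The unit counts and the bandwidth figure in the spec list above can be sanity-checked with a short sketch. It assumes the per-SM layout of consumer Ampere (128 FP32 cuda cores, 4 tensor cores, 4 TMUs, and 1 RT core per SM) and LPDDR5 running at 6400 MT/s; neither is confirmed for the T239 in the leaks cited here.

```python
# Sketch: derive GA10F unit counts and memory bandwidth from the leaked specs.
# Assumptions: consumer-Ampere per-SM layout; LPDDR5 at 6400 MT/s.

SM_COUNT = 12
PER_SM = {"cuda_cores": 128, "tensor_cores": 4, "tmus": 4, "rt_cores": 1}

totals = {unit: count * SM_COUNT for unit, count in PER_SM.items()}


def bandwidth_gb_s(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Peak bandwidth = bytes per transfer * transfers per second."""
    return bus_width_bits / 8 * data_rate_mtps / 1000


bw = bandwidth_gb_s(128, 6400)

for unit, total in totals.items():
    print(f"{unit}: {total}")
print(f"memory bandwidth: {bw:.1f} GB/s")
```

Under those assumptions the totals match the list (1,536 cuda cores, 48 tensor cores, 48 TMUs, 12 RT cores), and the 128 bit bus works out to 102.4 GB/s.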
Are you high right now, or is this just some strange psychological need? Just asking. Politely. Being generous makes the argument more apparent.
That being said I should have used the mobile RTX 3050 Max-Q, but thought about that model later and chose not to change my post.
Good thinking on the 3050 Max-Q, that's actually the closest thing we've got for comparison in the 30-series, I wonder if there's some 4k benchmarks of that.
NVIDIA GeForce RTX 3050 Laptop GPU - Benchmarks and Specs - NotebookCheck.net Tech
Yeah there's a limited number of 4k benchmarks with it though, which makes sense tbh because it is clearly not up to 4k AAA gaming! They're kind of all over the place so it's hard to make any conclusions really. I guess just using 3DMark results & comparing to GPUs with lots of 4k testing is probably good enough though.
Have a strong hunch it's the denoisers that have been kicking the butts of smaller rtx cards, and not the actual raytracing calculations.
And yes everything is marketing.
www.techpowerup.com/gpu-specs/geforce-rtx-2050-mobile.c3859
At its non boost clock of 735 MHz this thing is basically as close to the ga10f @ 1 GHz as you can get. It just has way less ram capacity, but the same bandwidth.