Wednesday, September 13th 2023

Nintendo Switch 2 to Feature NVIDIA Ampere GPU with DLSS

The rumors of Nintendo's next-generation Switch handheld gaming console have been piling up ever since competition in the handheld console market intensified. Since the release of the original Switch, Valve has released the Steam Deck, ASUS the ROG Ally, and others are exploring the market. The next-generation Nintendo Switch 2 now appears closer than ever, as we have information about the chipset that will power the device. Thanks to Kepler_L2 on Twitter/X, we have the codenames of the upcoming processors. The first-generation Switch came with NVIDIA's Tegra X1 SoC built on a 20 nm node; later on, NVIDIA supplied Nintendo with a Tegra X1+ SoC made on a 16 nm node. There were no performance increases, just improved power efficiency. Both used four Cortex-A57 and four Cortex-A53 cores with a GM20B Maxwell GPU.

For the Nintendo Switch 2, NVIDIA is said to be supplying a customized variant of its Jetson Orin SoC, originally designed for automotive applications. The reference Orin SoC carries the codename T234, while this alleged adaptation is codenamed T239; the version is most likely optimized for power efficiency. The reference Orin design is a considerable uplift over the Tegra X1, boasting 12 Cortex-A78AE cores, LPDDR5 memory, and the Ampere GPU microarchitecture. Built on Samsung's 8 nm node, its efficiency would likely yield better battery life and position the second-generation Switch well in the now-crowded handheld gaming console market. Including the Ampere architecture would also bring technologies like DLSS, which would benefit the low-power SoC.
Sources: @Kepler_L2, GitHub, via Tom's Hardware

117 Comments on Nintendo Switch 2 to Feature NVIDIA Ampere GPU with DLSS

#101
Mussels
Freshwater Moderator
SoupsammichAt its non-boost clock of 735 MHz, this thing is basically as close to the GA10F @ 1 GHz as you can get. It just has way less RAM capacity, but the same bandwidth.
You know that these have nothing in common with a device that's meant to use about 5 W of power, right?

The VRAM alone on that GPU would use more power than the entire device's power budget
Posted on Reply
#102
HOkay
MusselsYou know that these have nothing in common with a device that's meant to use about 5 W of power, right?

The VRAM alone on that GPU would use more power than the entire device's power budget
Yes, we're trying to give evidence to show that 4K120 gaming on the Switch 2 is a ludicrous pipe dream, and these are the closest things to the Switch 2 that we have numbers for. They clearly show it can't get close to 4K60, let alone 4K120, even with a vastly higher power budget!
SoupsammichI raise you the 2050 mobile. Surprise, it's Ampere.

www.techpowerup.com/gpu-specs/geforce-rtx-2050-mobile.c3859

At its non-boost clock of 735 MHz, this thing is basically as close to the GA10F @ 1 GHz as you can get. It just has way less RAM capacity, but the same bandwidth.

I didn't realise they released an Ampere 20-series! I guess it must have been right at the end of the 20-series cycle. Good find.
Posted on Reply
#103
Soupsammich
MusselsYou know that these have nothing in common with a device that's meant to use about 5 W of power, right?

The VRAM alone on that GPU would use more power than the entire device's power budget
Yup, that's the idea. And no 4K 120 fps, surprise surprise. Well, probably not a surprise to most.

Yeah, that GDDR6 is probably taking the majority of those 30 watts.

Good thing there's LPDDR.
Posted on Reply
#104
lexluthermiester
HOkayI'm sorry if I came off as inflammatory; I assumed you had some counterexamples in mind that would give evidence for your stance, and I was hoping throwing out one example that goes against your point would get you to give me a counterexample.
It did come off that way a bit, but I'm used to people giving me flak.

My real point was this: The NVIDIA SoC Nintendo is reported to be using is very capable, and while some visual effects will have to be scaled down, playable 4K30 or 4K60 is not outside the realm of possibility. Anyone making the blanket statement that it's NOT possible needs to take a step back and look at the bigger picture, for the simple reason that the Jetson platform can already do so and the Nintendo SoC is going to be a customized and enhanced version of that.
Posted on Reply
#105
Soupsammich
lexluthermiesterIt did come off that way a bit, but I'm used to people giving me flak.

My real point was this: The NVIDIA SoC Nintendo is reported to be using is very capable, and while some visual effects will have to be scaled down, playable 4K30 or 4K60 is not outside the realm of possibility. Anyone making the blanket statement that it's NOT possible needs to take a step back and look at the bigger picture, for the simple reason that the Jetson platform can already do so and the Nintendo SoC is going to be a customized and enhanced version of that.
No one said 4K30 or even 4K60 (1080p input) was outside the realm of possibility.
Posted on Reply
#106
lexluthermiester
john_Looking at the RTX 3050, which is probably twice this Switch iGPU, with extra memory bandwidth, no limitations on how that bandwidth is split between the GPU and CPU parts of the SoC, and, more importantly, none of the power limitations the Switch will have, it will be difficult even with the advantage of games tailored to the Switch 2's specific hardware and capabilities. Graphics will be low-to-mid settings at best, and DLSS Performance will probably be used at 4K. Of course, some games will have simpler graphics and lower needs by design. Those will play nicely.
It's way too early to call or make any exact conclusions.
Posted on Reply
#107
Soupsammich
lexluthermiesterIt's way too early to call or make any exact conclusions.
It's really not; we've literally known the render config and tested clocks for over a year because of the Lapsus$ attack.

We know the architecture, the number of CUDA cores, the tensor cores, the TMUs, the ROPs.

We know the bus width, we know the RAM type. We know it's a unified memory architecture.

We have literally never known this much about a Nintendo system this early.

There is precious little mystery left for your appeal to the mysterious.
Posted on Reply
#108
Mussels
Freshwater Moderator
HOkayYes, we're trying to give evidence to show that 4K120 gaming on the Switch 2 is a ludicrous pipe dream, and these are the closest things to the Switch 2 that we have numbers for. They clearly show it can't get close to 4K60, let alone 4K120, even with a vastly higher power budget!


I didn't realise they released an Ampere 20-series! I guess it must have been right at the end of the 20-series cycle. Good find.
They'll be lucky to do 1080p 60 at native res with these wattages.
Things are just getting absurd and derailed in here.
Posted on Reply
#109
chrcoluk
I think we will see maybe 720p native resolution, DLSS-upscaled to 1080p, on Zelda games, and on top of that better draw distances and more assets on screen using the cycles saved by DLSS. Less complex games should be able to run 1080p upscaled to 1440p.
Posted on Reply
#110
Soupsammich
MusselsThey'll be lucky to do 1080p 60 at native res with these wattages.
Things are just getting absurd and derailed in here.
1080p is all you need for Performance-mode 4K.

*Edit* Somebody reminded me that I forgot Ultra Performance is no longer 8K-only, so all you need for Ultra Performance 4K is 720p, and 360p for Ultra Performance to 1080p in portable mode.*

We know the clock speeds and the render config from the stolen NVN2 API. The wattage is a function of the lithography and feature set.

1536 Ampere CUDA cores downclocked to 1 GHz (the NVN2 test docked clock was something like 1.125 GHz, by the way; I've been lowballing) is 3.072 TFLOPS FP32, and 24.576 TFLOPS sparse FP16 on the tensor cores for DLSS.

It has a dual-channel 128-bit bus for its LPDDR5, for a standard 102 GB/s.

It will be able to handle 1080p 60 native just fine if someone targets that, and the stated performance must fit within the target TDP, or it would never have been taped out. And again, 1080p is all the input res you need for 4K.
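For anyone who wants to check the arithmetic, here's a quick sketch. All inputs are the leaked/rumored T239 figures quoted in this thread, not confirmed specs:

```python
# Back-of-envelope check of the figures above.
# Inputs are the rumored T239 numbers from the NVN2 leak, not confirmed specs.

cuda_cores = 1536       # rumored GA10F render config
clock_ghz = 1.0         # conservative docked clock (leaked test clock ~1.125 GHz)

# FP32: each CUDA core retires 2 FLOPs per cycle (one FMA)
fp32_tflops = cuda_cores * 2 * clock_ghz / 1000
print(f"FP32: {fp32_tflops} TFLOPS")                 # 3.072

# Ampere tensor cores are rated at 8x the FP32 rate for sparse FP16
fp16_sparse_tflops = fp32_tflops * 8
print(f"Sparse FP16: {fp16_sparse_tflops} TFLOPS")   # 24.576

# LPDDR5 on a 128-bit bus at 6400 MT/s
bandwidth_gb_s = 128 / 8 * 6400 / 1000
print(f"Bandwidth: {bandwidth_gb_s} GB/s")           # 102.4
```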

One of the closed-door demos Nintendo showed at Gamescom was Breath of the Wild, literally running at 4K 60 fps with no loading times.

Nintendo probably wouldn't be showing this to people they want making games for their system if it weren't feasible... or if they didn't intend to do it themselves.

4K120 is ridiculous; 4K60 docked is going to happen on this system, just like 1080p 60 fps happened on the Switch. Like on the Switch, it's not going to be the standard, and it's mostly going to be Switch and PS4/Xbox One ports, but it's going to happen.
chrcolukI think we will see maybe 720p native resolution, DLSS-upscaled to 1080p, on Zelda games, and on top of that better draw distances and more assets on screen using the cycles saved by DLSS. Less complex games should be able to run 1080p upscaled to 1440p.
Scene complexity doesn't directly matter for DLSS; it has a fixed render time based on input and output resolution, no matter how simple or complex the source inputs are.

Quality or Balanced mode is likely something that will never happen on this system; it will almost undoubtedly always be Performance.

If they were targeting 1440p, the input res would be 720p. As I stated earlier, this is my bet for the standard on the system.

If you have the render time to make a 1080p frame of your desired fidelity and frequency on the CUDA cores, you can run DLSS Performance to 4K on the tensor cores, concurrently. I really can't imagine anyone hitting 1080p native and wanting to do 1440p Quality instead of 4K Performance on this thing.
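To make the mode/resolution relationships above concrete, here's a small sketch. The scale factors are the commonly cited per-axis DLSS ratios; the actual render resolution is ultimately up to each game, so treat this as illustrative:

```python
# Per-axis render-scale factors commonly cited for DLSS modes (illustrative).
DLSS_SCALE = {
    "quality": 1.5,
    "balanced": 1.72,
    "performance": 2.0,
    "ultra_performance": 3.0,
}

def input_resolution(out_w, out_h, mode):
    """Render resolution DLSS needs to produce the given output resolution."""
    s = DLSS_SCALE[mode]
    return round(out_w / s), round(out_h / s)

print(input_resolution(3840, 2160, "performance"))        # (1920, 1080)
print(input_resolution(3840, 2160, "ultra_performance"))  # (1280, 720)
print(input_resolution(1920, 1080, "ultra_performance"))  # (640, 360)
```

The last line is the portable-mode case mentioned earlier in the thread: 360p in, 1080p out.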
Posted on Reply
#111
Soupsammich
Had a little time, so I decided to do a little referential power breakdown of the RTX 2050 mobile, and how that would differ in an actual mobile/hybrid device. I'm looking at docked power draw for the Switch 2.

So let's break it down:
2048 CUDA cores; the T239's 1536 is 25% less to power.

4 GB GDDR6, a massive power hog compared to the LPDDR5 in the T239.

Boost clock 1.245 GHz, versus my downclock estimate of 1 GHz for the T239.

There are other factors I'm not accounting for, because I haven't found isolated/normalized power draws for them: the aux power draw shown in the power breakdown article (I don't know how it applies to the Lenovo laptop, so I'm leaving whatever watts it may be in), the same for PCB loss, and the fact that the 2050 has to power two GPCs and the I/O crossbar between them, which is additional wattage over the T239 that won't be subtracted. So whatever we end up with is going to be higher than if I had all the data broken down.

So let's get rid of the RAM first.

Starting point: 30 watts.

GDDR6X was measured to have a power draw of 2.5 watts per GB here:
www.igorslab.de/en/350-watts-for-nvidias-new-top-of-the-line-geforce-rtx-3090-ampere-model-explained-chip-area-calculated-and-boards-compared/

However, the 2050 mobile does not appear to use GDDR6X but plain GDDR6, and Micron showed GDDR6X to be 15% more power-efficient than GDDR6 here:

www.tomshardware.com/news/micron-reveals-gddr6x-details-the-future-of-memory-or-a-proprietary-dram

So 15% more than 2.5 W is 2.875 W, and 2.875 W × 4 GB = 11.5 watts drawn for the 4 GB of GDDR6.

LPDDR4, which was used in the Switch, has been measured to draw around 2 watts for a complete 2× 32-bit-bus × 3 GB = 6 GB unit at the familiar 25 GB/s bandwidth (the Switch had two 32-bit × 2 GB units for the same bus and bandwidth), from the University of Maryland here:

user.eng.umd.edu/~blj/papers/isca99.pdf

LPDDR5 has been stated by Samsung to be 30% more efficient than LPDDR4X, shown here:

www.smartprix.com/bytes/lpddr5-ram-phones/

which in turn was stated to be 20% more power-efficient than LPDDR4 here:
www.faceofit.com/lpddr4-vs-lpddr4x/

So 70% of 2 watts = 1.4 watts, and 80% of that = 1.12 watts for a two-unit LPDDR5 block.

GDDR6 at 11.5 watts minus LPDDR5 at 1.12 watts = 10.38 watts dropped just by switching from GDDR to low-power RAM.

Now we are at 30 − 10.38 = 19.62 watts.

Now let's remove the fan difference. The Lenovo RTX 2050 laptop used in this example has a 5-watt fan, sourced by searching for its replacement parts.

The Switch uses a 3-watt fan, which shouldn't really need to change.

So 2 more watts down, for 17.62 watts.

Now we're down to the GPCs/SMs, since I'm leaving in the watts from the items I couldn't isolate.

The 2050 mobile has 2048 Ampere CUDA cores, or 16 SMs, versus the T239's 12, so 25% less to power:

17.62 × 0.75 = 13.215 watts.

And peak performance/power draw is calculated at its boost clock of 1.245 GHz versus the T239's 1 GHz, so scale down by roughly 20%:

13.215 × 0.80 = 10.572 watts.

And that's, of course, assuming the T239 is still on Samsung 8 nm, which doesn't have to be the case.
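If anyone wants to re-run the chain above, here it is in a few lines. Every input is the post's own measured/quoted figure; the final 0.80 clock factor is the post's rounding of the 1 GHz vs. 1.245 GHz ratio:

```python
# Reproducing the step-by-step estimate above. Every input comes from the
# sources quoted in the post; this is a ballpark, not a measurement.

watts = 30.0                          # measured RTX 2050 mobile laptop draw

# --- RAM swap: 4 GB GDDR6 -> 2-unit LPDDR5 block ---
gddr6_w_per_gb = 2.5 * 1.15           # GDDR6X at 2.5 W/GB, GDDR6 ~15% worse
gddr6_total = gddr6_w_per_gb * 4      # 11.5 W for 4 GB
lpddr5_block = 2.0 * 0.7 * 0.8        # LPDDR4 block at 2 W, scaled -> 1.12 W
watts -= gddr6_total - lpddr5_block   # -10.38 W -> 19.62 W

# --- Fan swap: 5 W laptop fan -> 3 W Switch fan ---
watts -= 5 - 3                        # -> 17.62 W

# --- SM count: 16 SMs in the 2050 vs 12 in the T239 ---
watts *= 12 / 16                      # -> 13.215 W

# --- Clock: 1.245 GHz boost vs 1 GHz estimate (rounded to a 0.80 factor) ---
watts *= 0.80                         # -> ~10.57 W

print(f"Estimated docked GPU+RAM draw: {watts:.3f} W")
```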
Posted on Reply
#113
Soupsammich
lexluthermiesterwww.tweaktown.com/news/94221/hypothetical-nintendo-switch-2-benchmarks-show-dlss-ray-tracing-and-even-4k-performance/index.html
Hmm. Isn't this interesting...
I certainly hope you aren't trying to imply that's validating anything you've been claiming, because it brutally, mercilessly slaughters your claims. An 18 ms execution time for 4K DLSS just leaves your claims face down in a ditch. To be clear, your claim of 4K 60 fps (or, good god, didn't you say 120 at one point?) ALL needs to fit within 16.66 ms, INCLUDING the time needed to do the 3D rendering. That video showed it took 18 ms for DLSS ALONE. Do you understand yet? 4K 60 fps is IMPOSSIBLE according to the video the article you posted is covering. Also, why didn't you just post Rich's video instead of some vulture riding on his work?

Also, it's not a coincidence Rich used the exact performance specs I've been listing.

On the bright side, the one part of this experiment that was really off, which Rich admitted he could not mitigate, was the severe VRAM bottleneck starving the tensor cores in a way that won't happen on a device with 12 GB of unified memory. So it won't be anywhere near 18 ms. But it will still be way outside the ballpark your claims need.
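The frame-budget arithmetic behind the 16.66 ms point is simple enough to sanity-check (the 18 ms figure is the laptop measurement from the video, not a T239 number):

```python
# A 60 fps target leaves 1000/60 ms per frame; Digital Foundry measured the
# DLSS-to-4K pass alone at ~18 ms on the RTX 2050 laptop (not the T239).

def frame_budget_ms(fps):
    return 1000.0 / fps

dlss_pass_ms = 18.0

for fps in (30, 60, 120):
    budget = frame_budget_ms(fps)
    verdict = "fits" if dlss_pass_ms < budget else "does NOT fit"
    print(f"{fps} fps -> {budget:.2f} ms budget; an 18 ms DLSS pass alone {verdict}")
```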
Posted on Reply
#114
THU31
Interesting video from Digital Foundry showing a laptop with performance comparable to what is expected from Switch 2.


What is shown here is that DLSS has a very high cost on such a low-power GPU. Reconstructing from 720p to 4K costs over 18 ms in Death Stranding and lowers performance by about 50% compared to native 720p.
But the chip might feature a dedicated deep learning accelerator alongside the tensor cores, which could significantly reduce DLSS processing time.

It's all speculation, though. But even if 4K isn't viable, 720p performance (or 1080p with DLSS) looks really good. Even Cyberpunk is playable, and that's just a PC laptop without any dedicated console optimization.


For me personally, what will make or break this console is backwards compatibility, for both physical and digital games.
Posted on Reply
#115
lexluthermiester
THU31For me personally, what will make or break this console is backwards compatibility, for both physical and digital games.
With you there, though I only care about physical carts.
Posted on Reply
#116
Soupsammich
THU31Interesting video from Digital Foundry showing a laptop with performance comparable to what is expected from Switch 2.


What is shown here is that DLSS has a very high cost on such a low-power GPU. Reconstructing from 720p to 4K costs over 18 ms in Death Stranding and lowers performance by about 50% compared to native 720p.
But the chip might feature a dedicated deep learning accelerator alongside the tensor cores, which could significantly reduce DLSS processing time.

It's all speculation, though. But even if 4K isn't viable, 720p performance (or 1080p with DLSS) looks really good. Even Cyberpunk is playable, and that's just a PC laptop, without any dedicated console optimization.


For me personally, what will make or break this console is backwards compatibility, for both physical and digital games.
Yup, it was an interesting video, with some important caveats.

1. Orin's/A100's double tensor cores/DLA are off the table.

A. You can already tell from the NVIDIA employees' T239 initialization kernels vs. the T234's that it's been removed. That space is used for RT cores on RTX architectures.

B. The... er... "source" Richard was conversing with on this topic confirmed it was a miscommunication immediately after the video went live.

2. This was not because of the tensor cores; tensor cores are massive overkill and are not the bottleneck. See the performance of the 2080 Ti in the DLSS 3.5 programming guide: it only has 68 gen-1 tensor cores (68/4 = the equivalent of 17 gen-2 tensor cores), yet it beats the 3060 Ti and 3070 in 4K DLSS execution time.

The bottleneck was the 4 GB of VRAM capacity (not bandwidth, for once). Rich actually demonstrates this in the video, particularly when he goes in depth on Death Stranding, where the VRAM was causing stuttering because it was constantly swapping assets it couldn't hold in and out.

DLSS to 4K requires 200 MB of VRAM set aside to feed the tensor cores' operating memory for full performance; if it doesn't get that, your tensor cores stall. As clearly demonstrated, there was no way these tensor cores were getting it.
Posted on Reply
#117
Soupsammich
SoupsammichYup, it was an interesting video, with some important caveats.

1. Orin's/A100's double tensor cores/DLA are off the table.

A. You can already tell from the NVIDIA employees' T239 initialization kernels vs. the T234's that it's been removed. That space is used for RT cores on RTX architectures.

B. The... er... "source" Richard was conversing with on this topic confirmed it was a miscommunication immediately after the video went live.

2. This was not because of the tensor cores; tensor cores are massive overkill and are not the bottleneck. See the performance of the 2080 Ti in the DLSS 3.5 programming guide: it only has 68 gen-1 tensor cores (68/4 = the equivalent of 17 gen-2 tensor cores), yet it beats the 3060 Ti and 3070 in 4K DLSS execution time.

The bottleneck was the 4 GB of VRAM capacity (not bandwidth, for once). Rich actually demonstrates this in the video, particularly when he goes in depth on Death Stranding, where the VRAM was causing stuttering because it was constantly swapping assets it couldn't hold in and out.

DLSS to 4K requires 200 MB of VRAM set aside to feed the tensor cores' operating memory for full performance; if it doesn't get that, your tensor cores stall. As clearly demonstrated, there was no way these tensor cores were getting it.
Correction to this: I copied from one row too high, which was RT cores instead of tensor cores. The 2080 Ti has the equivalent of 136 gen-2 tensor cores, to the 3070's 184.
Posted on Reply