Tuesday, December 17th 2024

NVIDIA GeForce RTX 5080 to Stand Out with 30 Gbps GDDR7 Memory, Other SKUs Remain on 28 Gbps
NVIDIA is preparing to unveil its "Blackwell" GeForce RTX 5080 graphics card, featuring cutting-edge GDDR7 memory technology. The RTX 5080 is expected to be equipped with 16 GB of GDDR7 memory running at an impressive 30 Gbps. Combined with a 256-bit memory bus, this configuration will deliver approximately 960 GB/s of bandwidth, a 34% improvement over its predecessor, the RTX 4080, which operates at 716.8 GB/s. The RTX 5080 will stand as the sole card in the lineup featuring 30 Gbps memory modules, while other models in the RTX 50 series will incorporate slightly slower 28 Gbps variants. This strategic differentiation is possibly due to the massive gap in CUDA core counts between the rumored RTX 5080 and RTX 5090.
The flagship RTX 5090 is set to push boundaries even further, implementing a wider 512-bit memory bus that could potentially achieve bandwidth exceeding 1.7 TB/s. NVIDIA appears to be reserving memory configurations larger than 16 GB exclusively for this top-tier model, at least until higher-capacity GDDR7 modules become available on the market. Despite these impressive specifications, the RTX 5080's bandwidth still falls approximately 5% short of the current RTX 4090, which benefits from a physically wider bus configuration. This gap between the 5080 and the anticipated 5090 suggests NVIDIA is maintaining a clear hierarchy within its product stack, and we will have to wait for the official launch to learn the what, how, and why of the Blackwell gaming GPUs.
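For reference, the quoted figures follow from the usual GDDR bandwidth formula (per-pin data rate × bus width / 8); a quick back-of-the-envelope sketch with the numbers above, taking the 4080's 22.4 Gbps and the rumored 28 Gbps / 512-bit configuration of the 5090 as given:

```cuda
#include <cstdio>

// Peak GDDR bandwidth = per-pin data rate (Gbps) * bus width (bits) / 8.
static double bandwidth_gb_s(double gbps_per_pin, int bus_bits) {
    return gbps_per_pin * bus_bits / 8.0;
}

int main() {
    double rtx5080 = bandwidth_gb_s(30.0, 256); // 960 GB/s
    double rtx4080 = bandwidth_gb_s(22.4, 256); // 716.8 GB/s
    double rtx5090 = bandwidth_gb_s(28.0, 512); // 1792 GB/s, i.e. just over 1.7 TB/s
    printf("RTX 5080: %.1f GB/s (+%.0f%% over RTX 4080 at %.1f GB/s)\n",
           rtx5080, (rtx5080 / rtx4080 - 1.0) * 100.0, rtx4080);
    printf("Rumored RTX 5090: %.1f GB/s\n", rtx5090);
    return 0;
}
```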
Sources:
Benchlife, via VideoCardz
44 Comments on NVIDIA GeForce RTX 5080 to Stand Out with 30 Gbps GDDR7 Memory, Other SKUs Remain on 28 Gbps
It is extremely obvious that in the near future a better variant with 20 or 24 GB of VRAM, and probably a much wider bus, is going to be released. So this card will be easily forgotten.
The 4080 has 24 Gbps memory (according to the review), which is downclocked to 22.4 Gbps. Same with the 4080 S, but downclocked to 23 Gbps; I guess it simply doesn't need more. The L2 "infinity cache" does outstanding work.
It's an interesting choice, though, to have faster-rated memory but run it at a lower speed... I do wonder what that is about.
It's a balanced configuration, however. I'll give you that much. All hardware resources generally end up fully utilized.
If you send me a 4080, I will gladly run some tests :D
Edit:
As Dr. Dro pointed out, gaming tests show that the 4080 loses performance at higher resolutions faster than otherwise comparable (or comparable-ish) GPUs that have a wider memory bus and more memory bandwidth. At lower resolutions the bigger cache on Ada can mask the relatively lower bandwidth, but it is simply no longer sufficient at higher resolutions.
- The card has 24 Gbps G6X. Because it already has more than enough *effective bandwidth* thanks to the big L2 cache, they downclocked it to 22.4 Gbps (23 Gbps on the 4080 S) to save energy. Go and check the performance reviews: it doesn't lose "extra" performance in 4K, because the bandwidth is sufficient. Go check the 7900 XT, which loses too much in 4K -> NVIDIA's "infinity cache" works better.
Sorry, can't do, I only have the bigger one.
It's a new architecture, and we know nothing about its performance characteristics. But judging from previous launches, it's reasonable to assume both memory compression and memory management are improved. And as usual, a lot of you will probably end up buying one in the end, despite joining the "politically correct" whining about VRAM size, memory bus width, or core count, when the only thing that really matters is how it performs in the real world.
Let's instead look forward to a brand new generation full of goodies. :)
Also, PCIe 5.0 should reduce stutters by roughly 50%, since any data spilling out of VRAM moves over the bus twice as fast:
Let's assume the card's memory is 8 GB and the dataset is 9 GB, so 1 GB is constantly moved over PCIe 4.0 or 5.0 (roughly 32 GB/s and 64 GB/s at x16):
PCIe 4.0: 1 GB takes ~30 ms -> ~33 FPS
PCIe 5.0: 1 GB takes ~15 ms -> ~67 FPS
Now assume the card's memory is 8 GB and the dataset is 10 GB, so 2 GB is moved:
PCIe 4.0: 2 GB takes ~60 ms -> ~16 FPS
PCIe 5.0: 2 GB takes ~30 ms -> ~33 FPS
33 FPS is manageable imho. (I played CS at 10 FPS for years...)
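To make the arithmetic above explicit, here is a minimal host-side sketch (assuming roughly 32 GB/s for PCIe 4.0 x16 and 64 GB/s for PCIe 5.0 x16, and counting only the transfer time of the spilled data):

```cuda
#include <cstdio>

// If "overflow_gb" of the working set has to cross PCIe every frame,
// the transfer time alone puts a ceiling on the frame rate.
static double fps_cap(double overflow_gb, double pcie_gb_per_s) {
    double transfer_ms = overflow_gb / pcie_gb_per_s * 1000.0; // ms spent moving the spill
    return 1000.0 / transfer_ms;
}

int main() {
    const double pcie4 = 32.0, pcie5 = 64.0; // assumed x16 throughput in GB/s
    for (double spill : {1.0, 2.0}) {
        printf("%.0f GB spill: PCIe 4.0 -> ~%.0f FPS cap, PCIe 5.0 -> ~%.0f FPS cap\n",
               spill, fps_cap(spill, pcie4), fps_cap(spill, pcie5));
    }
    return 0;
}
```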
But let's assume you have 9 GB of assets to use during a single frame with only 8 GB of VRAM; that would be a sad story, no question about that. That's not how rendering engines work, though:
Firstly, textures are stored as multiple mip levels and sampled with anisotropic filtering, so within the allocated memory there are dozens of versions of the "same" texture, and the GPU isn't going to be using all of that data during a single frame. So even if your dataset is 9 GB (even with dynamic loading of assets), only about ~1-1.5 GB of that will be used during any given frame. (Also, the highest mip levels will only be used by objects close to the camera, and only so many objects can ever be close at once, so for most of the scene very low mip levels are usually used.)
Secondly, the memory bandwidth will be a bottleneck long before you run into heavy swapping anyway. Take, for instance, the RTX 4060 with its 272 GB/s: if you target 60 FPS, then you could theoretically only ever access a total of ~4.5 GB during the 16.7 ms frame time, and half of that if you target 120 FPS. Additionally, accesses will not be evenly distributed throughout the frame time (vertex shaders and tessellation are much less memory intensive, for instance), so the real-world peak would probably be less than half of that. As you see, by the time you even come close to the VRAM limit of a card, you are already approaching "slide show" territory at ~15 FPS. So any reasoning about having lots of VRAM for "future proofing" is a moot point, not to mention that future games will be even more computationally intensive.
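A tiny sketch of that bandwidth-per-frame ceiling, using only the 272 GB/s figure quoted above (real games will touch far less unique data than this theoretical maximum):

```cuda
#include <cstdio>

// Upper bound on how much data a GPU can even read per frame,
// given its memory bandwidth and a target frame rate.
int main() {
    const double bandwidth_gb_s = 272.0; // RTX 4060 figure from the post
    for (double fps : {60.0, 120.0}) {
        double frame_ms = 1000.0 / fps;
        double max_gb   = bandwidth_gb_s * frame_ms / 1000.0; // GB of traffic per frame
        printf("%.0f FPS: %.1f ms/frame -> at most ~%.1f GB of memory traffic per frame\n",
               fps, frame_ms, max_gb);
    }
    return 0;
}
```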
If you do run into heavy swapping, you'll know, trust me; it would quickly become unplayable. But if you do, then you're probably doing something funky, like running a browser reloading a lot of tabs in the background, or the driver is bad.
The GPU certainly has enough compute power to accelerate a fluid simulation to 60 FPS (especially with a compressed L2 cache), but it is hindered by PCIe bandwidth. For example, the RTX 4070's memory is ~500 GB/s and its L2 is about 1.5 TB/s, but PCIe is only 20-30 GB/s because it's version 4, not 5.
For example, Game of Life accesses the closest 8 cells. Fluid mechanics can reach maybe 30 or more cells per cell. An N-body algorithm accesses all particles per particle when brute-forced. This is extremely cache sensitive. When the code updates cell data, the L2 stores it compressed, so a 40 MB cache acts like 160 MB, but this doesn't affect main memory.
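To illustrate the neighbour-access pattern being described, here is a minimal Game-of-Life-style CUDA stencil (a generic sketch, not tied to any particular fluid solver); adjacent threads re-read mostly the same rows, which is exactly why such kernels are so sensitive to how much of the grid fits in cache:

```cuda
// One Game of Life step: each cell reads its 8 neighbours.
__global__ void life_step(const unsigned char* in, unsigned char* out,
                          int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return; // skip borders

    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                alive += in[(y + dy) * width + (x + dx)];

    unsigned char cell = in[y * width + x];
    out[y * width + x] = (alive == 3 || (cell && alive == 2)) ? 1 : 0;
}
```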
I wish the RTX 5080 came with 200 MB of cache and 2x more compression, so that up to 1.6 GB of redundant data could stay on the GPU chip without touching VRAM. This would be like 8 GB + 1.6 GB. Add PCIe 5.0 bandwidth too; the perfect combo to counter some stutters.
In CUDA, there is an option to keep a selected region of data resident in L2 between different function calls or kernels.
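That refers to the L2 access-policy window ("persisting" L2 cache) in the CUDA runtime, available on Ampere and newer GPUs; a minimal sketch, where `pin_buffer_in_l2` is just a hypothetical helper name:

```cuda
#include <cuda_runtime.h>

// Ask the driver to keep accesses to `buf` resident in L2 across kernel
// launches on `stream` (subject to the device's persisting-L2 limits).
void pin_buffer_in_l2(cudaStream_t stream, void* buf, size_t bytes) {
    // Set aside a portion of L2 for persisting accesses (device-wide).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buf;
    attr.accessPolicyWindow.num_bytes = bytes;                        // window size
    attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // keep hits in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // everything else streams
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```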
Generally yes, I have seen VRAM overclocking (or any other bandwidth improvement) help out a lot, but additional VRAM only helps in outlier games with some mega-shady rendering going on, or when we're talking about adding to a seriously obsolete amount (which is ~5 GB in this day and age).
I don't think 30 GT/s is remotely enough to unlock the 5080's full potential, though. More like 45 GT/s or so, unless that cache somehow works 2+ times better than in Ada.
A game engine is designed more like a real-time system, where algorithms aren't chosen primarily based on O-notation (like in academia), but for achieving consistent performance through awareness of access patterns, cache optimization, and avoiding very costly system calls in critical paths. This holds true both for rendering and for game loops; a terrible worst case results in stutter while rendering, but it's even worse in a game loop, where such problems are the cause of many bugs and glitches in modern games. All performant code must be developed with some kind of assessment of the underlying hardware characteristics, and cache optimization is one of the most important. I could go much deeper, but hopefully you get the point. ;)

Compression rates vary based on data density; sparse data can be compressed very heavily, while more random data cannot. If your dataset is fairly predictable, then you can probably estimate the effective compression rate for your use case, but it wouldn't be applicable to anything else.
I'm not aware of the details of the upcoming generation, but generally speaking it's usually a hint better than the previous one. Since when did we expect recent games to run well on 9-year-old hardware that isn't even supported any more? Not that more VRAM would have saved it; at best you'd have a pretty slide show, as that GPU wouldn't be powerful enough or have enough bandwidth to pump out 60 FPS at good settings.
WDDM does its own batching of multiple kernel launches, so it sometimes can't even overlap data copies with compute. Ubuntu is better than Windows 11 in this regard.
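For completeness, the copy/compute overlap being talked about looks roughly like this on the CUDA side (a generic double-buffered sketch; `process` is a stand-in kernel, and the host buffer is assumed to be pinned with cudaMallocHost so the copies are truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void process(float* data, int n) { // stand-in compute kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Double-buffered pipeline: while chunk i is being processed on one stream,
// chunk i+1 can already be uploading on the other. Whether the driver
// actually overlaps the two is what the WDDM comment above is about.
void pipeline(const float* host_in, float* dev_buf[2], int chunk_elems, int chunks) {
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int i = 0; i < chunks; ++i) {
        int b = i % 2; // alternate buffers/streams; same-stream ordering prevents reuse races
        cudaMemcpyAsync(dev_buf[b], host_in + (size_t)i * chunk_elems,
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[b]);
        process<<<(chunk_elems + 255) / 256, 256, 0, streams[b]>>>(dev_buf[b], chunk_elems);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```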