
Crysis 3 Installed On and Run Directly from RTX 3090 24 GB GDDR6X VRAM

Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
SLI doesn't "double" your video memory. Think of it as RAID-1, but with videocards.
...you are correct in that it mirrors the memory on each card.
Your analogy is VERY flawed. If you were to compare SLI to RAID, it would be RAID0, as you are adding the capacity of one card to another, not mirroring one card with another as would be done with RAID1. And yes, the VRAM doubles. In the case of the RTX 3090, 24GB + 24GB = 48GB.
RAID1 is accurate for the RAM... RAID0 is accurate for the GPU itself, lol. The memory is mirrored, not unique and not combined/pooled. Each GPU has its own frame buffer with the same rendering and geometry information on each card (the same data). In the case of the RTX 3090, you still have a pool of 24GB to work with since the same data is mirrored on the second card.

......at least, that is how SLI worked through Turing..... did it change with Ampere? (gaming/SLI, not compute, note)
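The mirrored-vs-pooled distinction being argued here boils down to a few lines of arithmetic. A minimal sketch (the function name and the "mode" labels are illustrative, not any NVidia API):

```python
# Hypothetical helper contrasting the two accounting models debated above.
# "mirrored" matches gaming SLI (each card holds a copy of the same data);
# "pooled" matches NVLink-style memory pooling on compute cards.

def usable_vram(cards_gb, mode):
    """Return the VRAM (GB) an application can address for a list of cards."""
    if mode == "mirrored":  # RAID-1-like: every card stores the same data
        return min(cards_gb)
    if mode == "pooled":    # RAID-0-like: capacities add up
        return sum(cards_gb)
    raise ValueError(f"unknown mode: {mode}")

print(usable_vram([24, 24], "mirrored"))  # two RTX 3090s in SLI -> 24
print(usable_vram([24, 24], "pooled"))    # pooled compute setup -> 48
```

Both sides of the thread agree on the physical total (48GB of chips); the disagreement is which of these two functions describes what a game can actually address.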
 
Last edited:
Joined
Jul 5, 2013
Messages
29,304 (6.89/day)
RAID1 is accurate for the RAM...
Incorrect.
The memory is mirrored, not unique and not combined/pooled.
That is not how SLI works. One GPU (and its RAM) draws one part of the screen and the other GPU draws a different part of the screen before both parts are sent to the framebuffer. VRAM usage is completely independent.
In the case of the RTX 3090, you still have a pool of 24GB to work with since the same data is mirrored on the second card.
If you're adding one framebuffer to another the combined total is double.

......at least, that is how SLI worked through Turing..... did it change with Ampere? (gaming/SLI, not compute, note)
The way SLI has worked since NVidia bought out 3DFX is that it's not a scanline offset rendering scheme anymore. It has not changed dramatically since then. NVidia's SLI works by the primary card (which is always the card connected to the display) assigning workloads for itself and the slave card to do. Each card renders a section of the screen and moves it to the framebuffer (which always resides on the primary card) through the SLI bridge. Each card uses its own VRAM exclusively, and the VRAM is always additive. So in the case of 3090s in SLI, 24GB + 24GB does = 48GB.
 
Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
Incorrect.

That is not how SLI works. One GPU (and its RAM) draws one part of the screen and the other GPU draws a different part of the screen before both parts are sent to the framebuffer. VRAM usage is completely independent.

If you're adding one framebuffer to another the combined total is double.


The way SLI has worked since NVidia bought out 3DFX is that it's not a scanline offset rendering scheme anymore. NVidia's SLI works by the primary card (which is always the card connected to the display) assigning workloads for itself and the slave card to do. Each card renders a section of the screen and moves it to the framebuffer (which always resides on the primary card) through the SLI bridge. Each card uses its own VRAM exclusively, and the VRAM is always additive. So in the case of 3090s in SLI, 24GB + 24GB does = 48GB.
My guy... the data is mirrored in typical (gaming) SLI... it does not pool, it does not double. In other words... yes, you have two 24GB cards, but each card has the same data in it, so you get zero benefit of a pooled set of VRAM. It doesn't work that way. Each card reads its own VRAM... it is not a shared pool of 48GB. It is not "additive".

Please, go look online to confirm. Here's a start. :)


SLI Myth #6: SLI Doubles VRAM

Many gamers will also be aware of this one, but like the myth of doubling performance it's easy to fall prey to this misconception because it also seems quite logical on the surface when you think about it.

So to set the record straight, no, SLI does not double your available VRAM (Video Memory).

The VRAM in a multiple-video-card system isn't shared or added, but instead copied. What I mean is: say you have two 8GB video cards in SLI.

Instead of now having 16GB, you still only have access to 8GB, as during processing the data in the first GPU is copied to the second GPU.

So your system only ever uses 8GB at one time.

Seeing Double
A common misconception about SLI is that you can get double, triple, or even quadruple video RAM with more graphics cards. Unfortunately, Nvidia SLI only uses the RAM from one card, as each card needs to access the same information at the same time.


If you find something different, feel free to post it. But those links go on for days and back a decade. ;)

Edit: I vaguely recall DX12 supposedly being able to pool it, but... I can't find anything that's concrete... people are saying the same thing but get shut down left and right.

EDIT2: From Nvidia (circa 2012, lol) - https://nvidia.custhelp.com/app/ans...ry-shared-(i.e-do-both-2gb-cards-become-a-4gb

In SLI or Multi GPU mode, is memory shared (i.e do both 2GB cards become a 4GB configuration)?
No, each GPU maintains its own frame-buffer so you will not double your memory. Rendering data, such as texture and geometry information, is duplicated across both cards. This is also the case with Multi-GPU mode when using a single GeForce 7950 GX2, 9800 GX2, GTX295, GTX 590 and GTX 690 based card.
 
Last edited:
Joined
Jul 5, 2013
Messages
29,304 (6.89/day)
Yeah, let's see white papers from NVidia... Neither of those sites cite NVidia documentation.


In short, VRAM availability to applications and APIs will be equal to the amount on one card, but the cards themselves do and must use all VRAM available. If custom code is used, SLI performance can be optimized on a per-application basis, which includes both symmetric and asymmetric VRAM usage, meaning the standard limits and operational constraints can be and are altered.

That being said, in the context of a VRAM drive, the VRAM of each card can be used independently or in series while the card still runs SLI functions simultaneously.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,178 (2.76/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Yeah, let's see white papers from NVidia... Neither of those sites cite NVidia documentation.


In short, VRAM availability to applications and APIs will be equal to the amount on one card, but the cards themselves do and must use all VRAM available. If custom code is used, SLI performance can be optimized on a per-application basis, which includes both symmetric and asymmetric VRAM usage, meaning the standard limits and operational constraints can be and are altered.

That being said, in the context of a VRAM drive, the VRAM of each card can be used independently or in series while the card still runs SLI functions simultaneously.
Both GPUs need everything to render the scene if you're going with an AFR approach, because the GPUs are taking turns rendering each frame. Even with SFR you need most of the scene data, if not all of it. So no, VRAM is not additive as you suggest in an SLI rendering configuration.
 
Joined
Jul 5, 2013
Messages
29,304 (6.89/day)
Both GPUs need everything to render the scene if you're going with an AFR approach, because the GPUs are taking turns rendering each frame. Even with SFR you need most of the scene data, if not all of it. So no, VRAM is not additive as you suggest in an SLI rendering configuration.
Oh, then why does NVidia's own documentation say otherwise? Please go read.
 
Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
Oh, then why does NVidia's own documentation say otherwise? Please go read.
It does, outside of the scope we're talking about. Quadros and non-gaming applications can do this. Nobody disagrees... it's in the documentation. Thing is, I mentioned (multiple times) gaming/SLI... just to avoid this misunderstanding.

In short, VRAM availability to applications and APIs will be equal to the amount on one card, but the cards themselves do and must use all VRAM available
So you agree? When I load a game using SLI, the same data is mirrored on both cards. The memory is not in a shared pool when gaming with SLI.
 
Last edited:

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,178 (2.76/day)
Oh, then why does NVidias own documentation say otherwise? Please go read..
Read it yourself. :slap:
In all SLI-rendering modes all the graphics API resources (such as buffers or textures) that would normally be expected to be placed in GPU memory are automatically replicated in the memory of all the GPUs in the SLI configuration. This means that on an SLI system with two 512MB video cards, there is still only 512MB of onboard video memory available to the application. Any data update performed from the CPU on a resource placed in GPU memory (for example, dynamic texture updates) will usually require the update to be broadcast to the other GPUs. This can introduce a performance penalty depending on the size and characteristics of the data. Other performance considerations are covered in the section on SLI performance.
Remember, this is your source.
https://developer.download.nvidia.com/whitepapers/2011/SLI_Best_Practices_2011_Feb.pdf
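The replication rule the whitepaper describes can be written out as a toy accounting model: application-visible VRAM stays at one card's worth no matter how many GPUs there are, while physical usage and CPU-side update traffic scale with the GPU count. A hedged sketch (function name and figures are illustrative, not NVidia's API):

```python
# Toy model of the quoted replication rule: resources placed in GPU memory
# are copied to every GPU in the SLI group, so the application sees one
# card's worth of VRAM while dynamic updates are broadcast to all cards.

def sli_memory_model(per_card_mb, num_gpus, update_mb):
    visible = per_card_mb                     # NOT multiplied by num_gpus
    physical_used = per_card_mb * num_gpus    # every card holds a full copy
    broadcast_traffic = update_mb * num_gpus  # CPU update goes to each GPU
    return visible, physical_used, broadcast_traffic

# The whitepaper's two-card 512MB example, plus a 16MB dynamic texture update:
print(sli_memory_model(512, 2, 16))  # (512, 1024, 32)
```

The third value is why the whitepaper warns about a performance penalty for frequent CPU-side updates: the cost scales with the number of GPUs, not just the update size.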
 
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
...you are correct in that it mirrors the memory on each card.
RAID1 is accurate for the RAM... RAID0 is accurate for the GPU itself, lol. The memory is mirrored, not unique and not combined/pooled. Each GPU has its own frame buffer with the same rendering and geometry information on each card (the same data). In the case of the RTX 3090, you still have a pool of 24GB to work with since the same data is mirrored on the second card.

......at least, that is how SLI worked through Turing..... did it change with Ampere? (gaming/SLI, not compute, note)
I think the Tesla K80 would work like the RTX 3090 as well with its memory. From what I understand, it's actually a dual-GPU card on a single PCB, if I'm not mistaken, with a 12GB + 12GB configuration, and it can use NVLINK with memory pooling as well if you were using them in an mGPU supercomputer/server-style work setup. The reason I find this matter intriguing is that the K80 is an oddly interesting scenario: 24GB of VRAM, and they can be found for under $200 pretty easily second hand these days. If they performed well, they could be interesting as a cache accelerator of sorts.
 
Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
I think the Tesla K80 would work like the RTX 3090 as well with its memory. From what I understand, it's actually a dual-GPU card on a single PCB, if I'm not mistaken, with a 12GB + 12GB configuration, and it can use NVLINK with memory pooling as well if you were using them in an mGPU supercomputer/server-style work setup. The reason I find this matter intriguing is that the K80 is an oddly interesting scenario: 24GB of VRAM, and they can be found for under $200 pretty easily second hand these days. If they performed well, they could be interesting as a cache accelerator of sorts.
Perhaps. I just know that for gaming/SLI, that isn't the case. Best answer = Aquinus. :p

I know in the past, for dual-GPU GeForce/gaming cards, it didn't do that. The new ones with NVLink... no idea. I had no idea there was any dual-GPU-on-a-single-PCB card out (I know little about Tesla/Quadro land, admittedly).
 
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
Just to clear any confusion...

1602112050878.png

1602111884614.png

1602111757940.png


The way SLI has worked since NVidia bought out 3DFX is that it's not a scanline offset rendering scheme anymore.
I believe that's actually an SLI render option you can still force with nvidiaInspector, but I don't know of a practical reason anyone would do it.

To be fair, with that quote Nvidia is generalizing how it works for gaming workloads. It could work differently for CUDA, perhaps. Also, they did say "usually" in their own description, so it sounds like it can also be asynchronous in some instances, or maybe they were implying that not all updates require data mirroring; textures obviously do, but other assets don't necessarily. I don't think Nvidia was considering this kind of situation at all when making those remarks; it was a question framed around gaming workload scenarios.
 
Last edited:
Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
This ain't about who's right or wrong, so let's get this clear: it's about whether my system performance can WTFBBQ or not. To clarify, I'm curious about the use cases it might actually serve. Does it actually have a good use case? What is ATTO benchmark performance like, and how is the I/O!!? Is it faster or slower than NVMe, especially in relation to I/O? In theory, one would hope it's faster than NVMe, or hell, faster than Optane perhaps as well, possibly quicker than system RAM in the right scenarios.

If it's faster than NVMe or Optane, it would actually serve as a good hybrid cache between either of those storage mediums, even if it was slower than system memory, while not getting in the way of system memory at the same time; and system memory could always be used as a cache alongside it with PrimoCache, giving you more upside to an already fast storage option. Is it practical over NVMe? On cost per GB, definitely not, but neither is system memory, and Optane most definitely isn't very cost effective last I checked.

I think it's important to note as well that NVMe performance can drag a fair bit as the drive becomes more fully populated. I guess that would be true of system memory/VRAM as well, but it's masked more by how fast they are in the first place. As others have mentioned, VRAM has caveats, so it might not be as fast as it looks at face value.
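The hybrid-cache idea floated above (a fast tier such as RAM or a VRAM drive sitting in front of slower storage, PrimoCache-style) can be sketched as a minimal read-through LRU cache. This is a hypothetical illustration of the concept, not how PrimoCache is actually implemented:

```python
from collections import OrderedDict

class TierCache:
    """Minimal read-through cache: a small fast tier (RAM/VRAM) in front of
    a slow backing store, evicting least-recently-used blocks when full."""

    def __init__(self, backing, capacity_blocks):
        self.backing = backing        # dict-like slow store (SSD/NVMe stand-in)
        self.capacity = capacity_blocks
        self.fast = OrderedDict()     # block id -> data, in LRU order
        self.hits = self.misses = 0

    def read(self, block):
        if block in self.fast:
            self.hits += 1
            self.fast.move_to_end(block)       # mark as recently used
            return self.fast[block]
        self.misses += 1
        data = self.backing[block]             # slow path: fetch from backing
        self.fast[block] = data
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)      # evict the LRU block
        return data

# Usage: a 2-block fast tier over a 10-block backing store.
store = {i: f"block-{i}" for i in range(10)}
cache = TierCache(store, capacity_blocks=2)
cache.read(1); cache.read(1); cache.read(2); cache.read(3)  # block 1 evicted
print(cache.hits, cache.misses)  # 1 3
```

The win comes entirely from the hit rate: if the working set (e.g. the small prefetch files mentioned above) fits in the fast tier, most reads never touch the slow device.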
 
Last edited:

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,178 (2.76/day)
It's almost like people can't freaking read. :kookoo:
1602117035479.png
 
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
If you're referring to the K80, its intended use is actually entirely irrelevant, especially within the context of a GPU usage scenario that wasn't an intended usage scenario for any of these GPUs; then again, neither was CUDA, until it was. I mean, GPUs were kind of intended for gaming and were adapted to other fields because they could be, and were good at it.
 
Joined
Dec 31, 2009
Messages
19,376 (3.50/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
Sweet jebus...

Can we simply agree that when gaming using SLI there isn't any sharing or pooling, and move on? For Pete's sake, we're arguing a point nobody disagrees with (pooling is possible with other workloads/Tesla/Quadro). I think the problem is the missed reference/scope that was mentioned (twice). And here we are. :(

Anyway, people can take away what they want from it. I know what I stated, and I supported that assertion. I can't help it if that was missed and talked right through/over just to... I don't know why it was missed, lol.

G'Night, gents, it's Woodford Reserve time. ;)
 
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
Depending on the disk format and settings, my GTX 980 actually seems to be beating DDR4 for read speeds according to ATTO Disk Benchmark. Also, for what it's worth, the QD didn't make a pronounced difference. The allocation unit size was set to 2048 bytes; it definitely seems to play a big role, and the I/O size and file size certainly come into play. Interestingly, NTFS compression skews things a fair degree as well: in some cases it's dramatically better and in other instances it's worse. I think it has to do with the L1/L2 cache structure on the CPU, perhaps, but the results are surprisingly complex. The ideal allocation unit size seemed to be 512 to 2048 bytes, and the ideal I/O sizes seemed to fall into the 256KB to 2MB range.

Bottom line: it seems like it could be useful depending on the scenario, I guess!? It seems like it would be a great thing to store the small prefetch cache files on, as they tend to fall into those file sizes. I wouldn't think it would work well with all games, but I can see certain games where it might work well. It works as a page file as well, but doesn't show up in Windows Disk Management, which is kind of odd, but hey, whatever, 1 out of 2 ain't so bad. It's also about 37.5× higher IO/s than my SATA SSD: 8.85K IO/s at 2MB file size and QD1. I actually ended up getting 9.03K IO/s at QD256 as well; the DDR4 RAM disk pulls 4.93K IO/s at QD256 or 4.99K IO/s at QD1.

Perhaps I'm wrong, but it makes me think a K80 could be pretty fast in theory; it's got more bandwidth and is a newer architecture. Oddly, when I tampered with memory clock rates I didn't see an obvious downtick in performance. I wonder if that has anything at all to do with the boost states/clock states in the GPU BIOS: GPC/XBAR/L2C/SYS. I think they are core/cache/memory/PCIe bus? Something janky like that, anyway, so idk if they play a role or not in a VRAM storage application. I imagine they must have a partial impact in areas.

DDR4 2966MHz CL13
RAM QD1.jpg


GTX980
GPU NTFS 2048bytes.jpg
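For anyone wanting to reproduce a rough version of this I/O-size sweep on any mounted drive without ATTO, here is a minimal sketch of one cell of such a benchmark (timed sequential writes at a chosen chunk size). It's a crude stand-in for ATTO's methodology, and the function name and defaults are made up for illustration:

```python
import os
import tempfile
import time

def write_throughput(path, total_mb=64, io_size_kb=256):
    """Write total_mb of data in io_size_kb chunks and return MB/s.
    A rough stand-in for one cell of an ATTO-style I/O-size sweep."""
    chunk = os.urandom(io_size_kb * 1024)      # incompressible payload
    n_chunks = (total_mb * 1024) // io_size_kb
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:   # unbuffered writes
        for _ in range(n_chunks):
            f.write(chunk)
        os.fsync(f.fileno())                   # include flush-to-device time
    return total_mb / (time.perf_counter() - start)

# Sweep a few I/O sizes against a scratch file (point the dir at the
# RAM/VRAM disk mount to measure it instead of the system drive):
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "bench.bin")
    for kb in (4, 256, 2048):
        print(f"{kb:>5} KB I/O: {write_throughput(target, 16, kb):8.1f} MB/s")
```

The `os.fsync` call is the rough equivalent of ATTO's "Bypass Write Cache" option discussed below: without it, small runs largely measure the OS write cache rather than the device.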
 
Last edited:
Joined
Jul 5, 2013
Messages
29,304 (6.89/day)
Depending on the disk format and settings, my GTX 980 actually seems to be beating DDR4 for read speeds according to ATTO Disk Benchmark. Also, for what it's worth, the QD didn't make a pronounced difference. The allocation unit size was set to 2048 bytes; it definitely seems to play a big role, and the I/O size and file size certainly come into play. Interestingly, NTFS compression skews things a fair degree as well: in some cases it's dramatically better and in other instances it's worse. I think it has to do with the L1/L2 cache structure on the CPU, perhaps, but the results are surprisingly complex. The ideal allocation unit size seemed to be 512 to 2048 bytes, and the ideal I/O sizes seemed to fall into the 256KB to 2MB range. Bottom line: it seems like it could be useful depending on the scenario, I guess!? It seems like it would be a great thing to store the small prefetch cache files on, as they tend to fall into those file sizes. I wouldn't think it would work well with all games, but I can see certain games where it might work well. It works as a page file as well, but doesn't show up in Windows Disk Management, which is kind of odd, but hey, whatever, 1 out of 2 ain't so bad. It's also about 37.5× higher IO/s than my SATA SSD: 8.85K IO/s at 2MB file size and QD1. I actually ended up getting 9.03K IO/s at QD256 as well.

DDR4 2966MHz CL13
View attachment 171209

GTX980
View attachment 171210
I wonder what is holding back the write speeds? They should be more or less equal to the read speeds. What does CrystalDisk64 or DiskMark64 say? Or you could try the "Bypass Write Cache" option.
 
Joined
Mar 21, 2016
Messages
2,590 (0.79/day)
I don't have those disk benchmarks on hand. I think the write speed is limited by the PCIe bus itself. I actually tried the "Bypass Write Cache" option; it didn't seem to improve things to any notable degree. Something else really weird: it shows up in Defraggler, and its built-in bench shows really poor relative results, like 276MB/s, while the RAM disk pulls like 2566MB/s, +/- probably 5-10% on each. In ATTO Disk Benchmark you can make the write speeds scale higher if you tinker with the allocation unit size and compression enabled/disabled, along with the I/O size and file size. The QD has a very negligible impact as a whole.
 