Compression is not just useful for saving SSD space; it also saves bandwidth.
Say you have a link that can send 10 GB/s, and you want to send either 2 GB uncompressed or 1 GB compressed. The first will take at least 200 ms, while the second would take 100 ms.
That is just the raw data transfer, but you can see how compression can reduce latency on large transfers.
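The arithmetic above can be sketched in a few lines (the numbers are illustrative, matching the 10 GB/s example):

```python
# Back-of-envelope transfer times on a 10 GB/s link.
link_gb_per_s = 10       # link throughput, GB/s
uncompressed_gb = 2      # payload before compression
compressed_gb = 1        # payload after 2:1 compression

t_uncompressed_ms = uncompressed_gb / link_gb_per_s * 1000
t_compressed_ms = compressed_gb / link_gb_per_s * 1000

print(f"uncompressed: {t_uncompressed_ms:.0f} ms")  # 200 ms
print(f"compressed:   {t_compressed_ms:.0f} ms")    # 100 ms
```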
Also, these days the major energy cost comes from moving data around, not from the computation itself. If you can move the data in a compressed state, you save power there too.
But what I would like to know is: can we decompress just before use, and keep saving bandwidth and storage while the data sits in GPU memory? Just-in-time decompression!
They do not seem to do that here, but I think it would be the thing to do as soon as we have decompression engines fast enough to handle the load.
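To illustrate the idea, here is a minimal sketch in Python, with zlib standing in for a hypothetical hardware decompression engine: the data stays compressed at rest and is only expanded at the moment it is consumed.

```python
import zlib

# Pretend this is texture or vertex data sitting in GPU memory.
payload = b"some texture or vertex data " * 1000

# Store it compressed, saving memory and the bandwidth used to get it there.
stored = zlib.compress(payload)
print(f"stored: {len(stored)} bytes, original: {len(payload)} bytes")

def use_data(buf: bytes) -> int:
    # Stand-in for whatever work actually consumes the data.
    return len(buf)

# "Just-in-time" step: decompress only right before use.
result = use_data(zlib.decompress(stored))
assert result == len(payload)
```

In hardware this decompression step would have to run at memory-bus speed for the scheme to pay off, which is exactly the "fast enough decompression engine" condition above.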
I think the future might be interesting. If AMD wants to be the leader on PC, they might bring OMI (Open Memory Interface) to PC, where memory or storage attaches to the CPU via a very fast serial bus (using far fewer pins and less die space than current memory interfaces). The actual memory controller moves directly onto the memory module, so the CPU becomes memory-agnostic. You could upgrade your CPU or memory independently. Storage (like Optane) could also be attached this way.
The pin count is much smaller than with current memory interfaces, so you can have many more channels if required.
[Attachment 167596: OMI diagram]
This is based on the OpenCAPI protocol. OpenCAPI itself would be used to attach any kind of accelerator. AMD's chiplet architecture would probably make it easy for them to switch to these kinds of architectures, and it's probably the future.
These are open standards pushed by IBM, but I could see AMD using them, or pushing their own standards with a similar goal in the future. With these standards, the GPU could connect directly to the memory controller and vice versa.