Thursday, September 3rd 2020
NVIDIA RTX IO Detailed: GPU-assisted Storage Stack Here to Stay Until CPU Core-counts Rise
NVIDIA announced its RTX IO technology at the GeForce "Ampere" launch event. Storage is the weakest link in a modern computer from a performance standpoint, and SSDs have had a transformational impact. With modern SSDs using PCIe, consumer storage speeds are set to grow with each new PCIe generation, which doubles per-lane IO bandwidth. PCI-Express Gen 4 enables 64 Gbps of bandwidth per direction on M.2 NVMe SSDs. AMD has already implemented it across its Ryzen desktop platform, while Intel has it on its latest mobile platforms and is expected to bring it to its desktop platform with "Rocket Lake." While more storage bandwidth is always welcome, the storage processing stack (the task of processing ones and zeroes down to the physical layer) is still handled by the CPU. As storage bandwidth rises, the IO load on the CPU rises proportionally, to the point where it can begin to impact performance. Microsoft sought to address this emerging challenge with the DirectStorage API, and NVIDIA wants to build on it.
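The 64 Gbps figure follows from the per-lane transfer rate and the four lanes an M.2 NVMe SSD uses. A rough sketch of that arithmetic is given here, assuming the standard 128b/130b line encoding of PCIe Gen 3 and Gen 4 and ignoring packet-level protocol overhead; the function and variable names are only illustrative.

```python
# Back-of-the-envelope PCIe link bandwidth, per direction.
# Assumptions: 8 GT/s (Gen 3) and 16 GT/s (Gen 4) per lane, 128b/130b encoding,
# no allowance for packet headers or flow-control overhead.

GT_PER_S = {3: 8.0, 4: 16.0}   # giga-transfers per second, per lane
ENCODING = 128 / 130           # payload bits per bit on the wire


def link_bandwidth(gen, lanes=4):
    """Return (raw Gbps, approximate usable GB/s) for one direction of a link."""
    raw_gbps = GT_PER_S[gen] * lanes
    usable_gb_per_s = raw_gbps * ENCODING / 8   # bits to bytes
    return raw_gbps, usable_gb_per_s


for gen in (3, 4):
    raw, usable = link_bandwidth(gen)
    print(f"PCIe Gen {gen} x4: {raw:.0f} Gbps raw, about {usable:.2f} GB/s usable")
```

The usable figure for Gen 4 x4 comes out a little under 8 GB/s, which is consistent with the roughly 7 GB/s sequential reads quoted for current drives.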
According to tests by NVIDIA, reading uncompressed data from an SSD at 7 GB/s (the typical maximum sequential read speed of client-segment PCIe Gen 4 M.2 NVMe SSDs) requires the full utilization of two CPU cores. The OS typically spreads this workload across all available CPU cores/threads on a modern multi-core CPU. Things change dramatically when compressed data (such as game resources) is being read in a gaming scenario, with a high number of IO requests. Modern AAA games have hundreds of thousands of individual resources crammed into compressed resource-pack files. Although at the disk IO level ones and zeroes are still being moved at up to 7 GB/s, the decompressed data stream at the CPU level can be as high as 14 GB/s (best-case compression). On top of this, each IO request comes with its own overhead - a set of instructions for the CPU to fetch resource x from file y and deliver it to buffer z, along with instructions to decompress or decrypt the resource. This could take an enormous amount of CPU muscle at high IO throughput, and NVIDIA pegs the number of CPU cores required as high as 24. As we explained earlier, DirectStorage enables a path for devices to directly process the storage stack to access the resources they need. The API by Microsoft was originally developed for the Xbox Series X, but is making its debut on the PC platform.
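To put the quoted numbers side by side, here is a small sketch of the arithmetic. The 7 GB/s, 2-core and 24-core figures are NVIDIA's as reported above; the 2:1 compression ratio is the best case implied by the 14 GB/s figure, and the per-core number is only what falls out if the 24-core estimate is taken at face value with perfectly linear scaling, which is an assumption rather than a measurement.

```python
# Figures quoted in the article (NVIDIA's numbers, not independent measurements).
SSD_READ_GB_PER_S = 7.0      # sequential read speed of a PCIe Gen 4 NVMe SSD
CORES_UNCOMPRESSED = 2       # cores saturated by 7 GB/s of uncompressed reads
CORES_COMPRESSED = 24        # NVIDIA's upper estimate for compressed game data
BEST_CASE_RATIO = 2.0        # assumed 2:1 compression (7 GB/s in, 14 GB/s out)


def decompressed_rate(disk_gb_per_s, ratio):
    """Data rate after decompression for a given compression ratio."""
    return disk_gb_per_s * ratio


def implied_per_core(output_gb_per_s, cores):
    """Decompressed GB/s each core would handle under ideal linear scaling."""
    return output_gb_per_s / cores


out_rate = decompressed_rate(SSD_READ_GB_PER_S, BEST_CASE_RATIO)
per_core = implied_per_core(out_rate, CORES_COMPRESSED)
core_factor = CORES_COMPRESSED / CORES_UNCOMPRESSED

print(f"Decompressed stream: {out_rate:.0f} GB/s")
print(f"Implied per-core output (linear scaling assumed): {per_core:.2f} GB/s")
print(f"Core count grows {core_factor:.0f}x while data volume only doubles")
```

The gap between a 2x increase in data and a 12x increase in cores is the per-request and decompression overhead the paragraph describes.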
NVIDIA RTX IO is a layer built around DirectStorage, optimized further for gaming and for NVIDIA's GPU architecture. RTX IO brings to the table GPU-accelerated lossless data decompression, which means data remains compressed and bunched up with fewer IO headers as it is moved from the disk to the GPU, leveraging DirectStorage. NVIDIA claims that this improves IO performance by a factor of 2. NVIDIA further claims that GeForce RTX GPUs, thanks to their high CUDA core counts, are capable of offloading "dozens" of CPU cores, driving decompression performance beyond even what compressed data loads PCIe Gen 4 SSDs can throw at them.
There is, however, a tiny wrinkle. Games need to be optimized for DirectStorage. Since the API has been deployed on Xbox since the Xbox Series X, most AAA games for Xbox that have PC versions already have some awareness of the tech. The PC versions, however, will need to be patched to use it. Games will further need NVIDIA RTX IO awareness, and NVIDIA needs to add support on a per-game basis via GeForce driver updates. NVIDIA didn't detail which GPUs will support the tech, but given its wording, and the use of "RTX" in the branding of the feature, NVIDIA could release the feature to RTX 20-series "Turing" and RTX 30-series "Ampere." The GTX 16-series probably misses out, as what NVIDIA hopes to accomplish with RTX IO is probably too heavy for the 16-series, and this may have been purely a performance-impact based decision for NVIDIA.
52 Comments on NVIDIA RTX IO Detailed: GPU-assisted Storage Stack Here to Stay Until CPU Core-counts Rise
Just a fancy new name, and a claim that "we did it first and it is ours."
Anyway, besides the marketing quoting worst-case scenarios, that's definitely a much more efficient way of doing these transfers, and AMD will most probably be doing the same thing, with a different name.
Edit to add: To the OP and title, I don't see any reason for this kind of optimisation to disappear even when core counts increase. Doing it this way is much more efficient, just like DMA for disk drives is much more efficient. These technologies will be replaced by others eventually, but it makes no sense to make all this data pass through the CPU only for decompression.
Ages ago every game would let you choose how much of the installation you wanted to put on the HDD and how much would be left on the CD/DVD. Why not add an option to choose the compression level of stored data?
As for these numbers being a worst case scenario, I disagree, mainly as the scaling is most likely calculated with 100% scaling, i.e. 1 core working 100% with decompression = X, 12 cores = 12X, despite scaling never really being 100% in the real world. As such this is a favorable comparison, not a worst-case scenario, and saying "would require the equivalent of n cores" could just as well end up requiring more than this to account for imperfect scaling.
I sincerely hope AMD also adds a decompression accelerator to RDNA2, which would make a lot of sense given that they designed those for both MS and Sony in the first place.
Here I entirely agree with you. There's no reason to move this back to the CPU in the future - it's a workload that only really benefits the GPU (nothing but the GPU really uses compressed game assets, and in the edge cases where the CPU might need some it should be able to handle that), thus alleviating load on the PCIe link by bypassing the CPU, and given that GPUs are more frequently replaced than CPUs it also allows for more flexibility in terms of upgrades, adding new compression standards, etc. Keeping this functionality as a dedicated acceleration block on the GPU makes a ton of sense.
Sorry, but what world do you live in? NVMe SSDs have come down a lot in price, but cheap? No. Especially not in capacities like what would be needed for even three games with your 2-300GB install sizes. And remember, even with compressed assets games are now hitting 150-200GB. Not to mention the effect removing compression would have on download times, or install times if data was downloaded and then decompressed directly. Compressing game assets is the only logical way of moving forward.
GPU ram requirements would not change either, because only decompressed data is stored there, so no change there.
How cheap ssds are is for the user to decide. If you wanna save on ssd volume, you can opt for longer loading times, if you have ssd space to spare, you can opt for the uncompressed installation.
Nvidia is saying they could get 2x the effective bandwidth out of a PCIe Gen 4 x4 NVMe drive, that is 14 GB/s of effective bandwidth. Imagine no loading times and no texture pop-in in open-world games.
Good NVME SSDs cost about 150$ per TB
I don't know what the compression factor of these assets is, but let's say it's 1:6, so a 50GB game comes to 300GB of uncompressed data (not counting that not all of the assets are even GPU related, like sound assets or pre-rendered videos). That would mean you could store three uncompressed games on a 1TB drive.
Most games are not AAA games that are even this big, and a lot of games do fine the way they are now, so only a fraction of games would even need an option for an uncompressed install. Which means you could probably store even more games on that 1TB drive.
A lot of enthusiasts spend high three digits or even four digits on GPUs, so why not spend another 150$ on an additional SSD to immensely speed up those data-intensive AAA games?
To clear things up, I consider 'heavily approximated, always partial ray-tracing with a destructive compression-like algorithm, which will work on <1% of games' an advancement and generally a nice feature - except it's just a distant cousin of scene or camera ray-tracing and not the 'ultimate dream'. Yeah, RTX looks great on Minecraft and Quake 2 - but please find an easier example for a derivative of ray-tracing and you'll be rewarded tenfold. Low poly-count, flat surfaces... Why not even Quake 3? Why not the new Wolfenstein? Errr...
So, now he is speeding up M.2 storage? Yes, sure, why not?
I'll believe in all those things *WHEN* I see them at work, on a real (and normal) system with more general benchmarks, not a by-NVIDIA-for-NVIDIA set of 2...
Also, I think your claim about the RDNA teams is fundamentally flawed. AMD post-RTG is a much more integrated company than previously. And while there are obviously things worked on within some parts of the company that the other parts don't know about, a new major cross-platform storage API provided by an outside vendor (Microsoft) is not likely to be one of these things.
Because that would limit support to boards with free PCIe slots, excluding ITX entirely, require an expensive NVLink bridge, limit support to the 3090, etc. This would likely work just as "well" over a PCIe 4.0 x16 slot, but that would of course limit support to HEDT platforms. Besides, we saw how well dedicated coprocessor AICs worked in the market back when PhysX launched. I.e. not at all.
Is there actually anything proprietary here though? Isn't this just a hardware implementation of DirectStorage? Nvidia doesn't like to say "we support standards", after all, they have to give them a new name, presumably to look cooler somehow.
What review sites do you know of that systematically only tests games in RT mode? Sure, RT benchmarks will become more of a thing this generation around, but I would be shocked if that didn't mean additional testing on top of RT-off testing. And comparing RT-on vs. RT-off is obviously not going to happen (that would make the RT-on GPUs look terrible!).
Horizon Zero Dawn: 72 GB
Mount & Blade 2: 51 GB
Red Dead Redemption 2: 110 GB
Star Citizen: 60 GB