
NVIDIA GeForce RTX 4090 PCI-Express Scaling

If I were to use two Gen 4 NVMe SSDs in RAID 0 on the M.2 slots that are wired to the chipset of a high-end Z790 board like the ASUS ROG Maximus Z790 Extreme, would I get the full PCIe x16 for the graphics card and get speeds matching a single Gen 5 NVMe SSD installed in the Gen 5 slot?
For sequential R/W speeds and access times you would get the 2x improvement; for 4K random you'd still be limited by a single SSD.
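As a rough sketch of that expectation (the per-drive figures below are assumptions for illustration, not measurements of any particular SSD):

```python
# Rough RAID 0 expectation sketch; per-drive figures are illustrative assumptions.
single_gen4 = {"seq_read_MBps": 7000, "seq_write_MBps": 5000, "rand_4k_q1_MBps": 80}

def raid0_estimate(drive, n=2):
    return {
        # Sequential transfers stripe across both drives, so they roughly double.
        "seq_read_MBps": drive["seq_read_MBps"] * n,
        "seq_write_MBps": drive["seq_write_MBps"] * n,
        # Low-queue-depth 4K random I/O mostly hits one drive per request,
        # so it stays close to single-drive performance.
        "rand_4k_q1_MBps": drive["rand_4k_q1_MBps"],
    }

print(raid0_estimate(single_gen4))
# {'seq_read_MBps': 14000, 'seq_write_MBps': 10000, 'rand_4k_q1_MBps': 80}
```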
 
So you didn't want to change the benchmark system right before having to change it again for Raptor Lake? That makes sense.
Retesting everything is a 2-3 week task, full-time, doing nothing but running benchmarks all day. I'm sure you are aware of all the launches that happened in the last few weeks; there is no way to do this kind of retesting in that time.

Maybe other sites that have 4 benchmarks and 5 comparison cards can do it, and they probably still recycle results .. I have 25 games, 30+ cards, 3 resolutions

@W1zzard Wouldn't the most correct test methodology be pairing the 4090 with a 7700X?
The 5800X cannot extract the full potential of this GPU.
Only a 12900K, or better, a 13900K.
 
@W1zzard Wouldn't the most correct test methodology be pairing the 4090 with a 7700X?
The 5800X cannot extract the full potential of this GPU.
I think there are three valid CPU options for GPU testing right now: the 5800X3D, Alder Lake, and Zen 4. Each of them has its pros and cons performance-wise. I would say Zen 4, as even if Raptor Lake is (let's say) 10% faster in gaming, Zen 4 X3D will most likely crush Raptor Lake in gaming. If you use base Zen 4, you can just drop in a Zen 4 X3D CPU while keeping the motherboard and RAM the same.
 
only for top hardware - when they castrate lower-end cards, the drop from 3.0 to 2.0 is almost 20%
I said you don't need PCIe 4.0 or more even for the 4090... 3.0 seems enough.
 
Retesting everything is a 2-3 week task, full-time, doing nothing but running benchmarks all day. I'm sure you are aware of all the launches that happened in the last few weeks; there is no way to do this kind of retesting in that time.

Maybe other sites that have 4 benchmarks and 5 comparison cards can do it, and they probably still recycle results .. I have 25 games, 30+ cards, 3 resolutions


Only a 12900K, or better, a 13900K.
3% for me is margin of error. This end result can change entirely depending on the games used, whether in favor of the i9 or the Ryzen 7xxx; the point is that this is the only PCIe 5 platform available.

Yeah, the i9 would also bring better results than the current CPU.
 
I said you don't need PCIe 4.0 or more even for the 4090... 3.0 seems enough.
The issue is that we can't say for sure yet. There are other bottlenecks before PCIe bandwidth. Maybe even if we remove said bottlenecks, nothing will change, but maybe it will be a 10% difference.
 
Curious if PCIe 4.0 vs 3.0 matters if you are recording using the AV1 encoder at the same time? Would it cause it to saturate a bit more?
 
Curious if PCIe 4.0 vs 3.0 matters if you are recording using the AV1 encoder at the same time? Would it cause it to saturate a bit more?
Shouldn't ... moving the input frames to the encoder is free, because it happens within the GPU. The encoded result that goes over the bus is tiny in file size, just a few MB/s.
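A quick back-of-the-envelope check (the 50 Mbps recording bitrate is an assumed example, not a measured figure):

```python
# How much of the PCIe link does the encoded stream actually use?
# The 50 Mbps recording bitrate is an assumed example value.
record_bitrate_mbps = 50
encoded_MBps = record_bitrate_mbps / 8            # ~6.25 MB/s over the bus

pcie3_x16_GBps = 16 * 128 / 130                   # ~15.75 GB/s usable per direction
share = encoded_MBps / (pcie3_x16_GBps * 1000)
print(f"{encoded_MBps:.2f} MB/s is about {share:.4%} of a PCIe 3.0 x16 link")
# 6.25 MB/s is about 0.0397% of a PCIe 3.0 x16 link
```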
 
The issue is that we can't say for sure yet. There are other bottlenecks before PCIe bandwidth. Maybe even if we remove said bottlenecks, nothing will change, but maybe it will be a 10% difference.
In some cases there is a considerable difference
PCIe.png
 
Shouldn't ... moving the input frames to the encoder is free, because it happens within the GPU. The encoded result that goes over the bus is tiny in file size, just a few MB/s.
I figured that transfer over the bus with lossless quality would be more than a few MB/s. Something like CQP 10 or lower.
 
In some cases there is a considerable difference
I agree, I should have said the average numbers. There are some situations where PCIe doesn't matter. However, as you showed, there are some situations where it matters a ton. Based on what we have from this review/benchmarks, it seems there is only a 2-3% difference between 4.0 and 3.0. I can see it changing to 10% on average or more with updated CPU testing that removes some of the CPU bottlenecks.

Then that raises the question: how realistic is testing PCIe scaling with the same CPU? There shouldn't be any situation where you have PCIe 2.0 or PCIe 1.1 on a modern system.
 
disagree man, look at the gap at 1440p versus the 3080:

relative-performance_2560-1440.png
I have absolutely no idea what you're talking about; there's no 3080 in this chart. A 4090 was used, and PCIe 2.0 is only 4% slower than PCIe 4.0, which is the point I'm making.

If you mean that the 4090 is 50% faster than a 3080, then that's still a CPU bottleneck because the 4090 is twice as fast as a 3080 at 4K - ergo the scaling at lower resolutions is CPU-limited.

You can see many games exceeding 200 fps, which is well into the realm of high-fps gaming that frequently exposes CPU weaknesses:

1665778495883.png

Add that to the problem of 240 Hz 4K displays being vanishingly rare and expensive, and what we have is an average framerate that frequently exceeds the capabilities of most displays and/or CPUs.
 
What software is that? The people you listed don't need such a fast GPU as far as I understand; they only need some basic viewport acceleration.
Workflows such as 3D GPU render engines: Octane, Redshift, V-Ray, Arnold GPU, Cycles, etc. Do we know how much data goes over the PCIe bus in those workflows? I am asking out of curiosity to get an idea of the possible saturation point at x8 Gen4 outside of gaming workflows.

Yeah, so about those PCIe Gen5x16 slots on higher end boards...
You should only buy those boards if you need that much PCIe bandwidth for specific workflows. You can bifurcate the x16 Gen5 lanes into x8/x8, keep the GPU in the first slot and attach another device, such as an NVMe RAID array, to the second x8 Gen5 slot. PCIe 5.0 opens up many possibilities for those who need it, if motherboard vendors wire the slots properly.
 
Workflows such as 3D GPU render engines: Octane, Redshift, V-Ray, Arnold GPU, Cycles, etc. Do we know how much data goes over the PCIe bus in those workflows? I am asking out of curiosity to get an idea of the possible saturation point at x8 Gen4 outside of gaming workflows.
Should be very low, otherwise the GPU cores can't get loaded fully. Any idea if anyone has ever done a PCIe scaling test for these apps?
 
Well done test.

We can conclude that PCIe adoption, even up to 6.0 in a few years, is not really driven by ever more demanding, powerful GPUs, but simply by enterprise markets needing more bandwidth: NVMe storage, NICs and everything else.

When the switch was made from AGP 8x to PCI-E x16, there was barely any difference. It took a few generations of cards to even start benefiting from that extra bandwidth.
 
This article will be important for those deciding to go with AMD X670 or X670E, or B650 or B650E. From your findings, you just need to make sure the M.2 slot has PCIe Gen5 capability, as the GPU will not be able to use it. The maybe 50 bucks saved there could be invested in a slightly faster GPU.
Yes and no, depending on how many PCIe lanes someone needs for their peripherals and workflows.
Both extreme and vanilla boards with two x16 Gen4/5 slots can be bifurcated to run at x8 and x8 Gen4/5. You can keep your GPU, either Nvidia or AMD, in one x8 slot (today's findings show that x8 Gen4 with Nvidia works fine, so an AMD GPU will work well in x8 too), and attach another device, such as an NVMe RAID array, to the second x8 Gen4/5 slot.

Why are you benchmarking GPUs with a 5800X? Shouldn't you be using the best CPUs for benchmarking GPUs? You have three other options for the best gaming CPUs: Alder Lake, Zen 4, and the 5800X3D.
Does it matter? He was testing PCIe bus data exchange.

Hah! PCIe 2.0 still being fine almost 16 years later :)

Sure, if you want that last 5% then you'll need PCIe 4.0 but these articles always prove that the PCIe race really isn't that necessary unless you're already chasing diminishing returns.
It's more about providing more bandwidth over the PCIe bus to give more flexibility to attach various peripherals to fewer lanes, as each PCIe generation increases data bandwidth. You can't do much with x16 PCIe 2.0, but you can with x16 PCIe 4.0 or 5.0. You can bifurcate and trifurcate those lanes into several slots and attach numerous peripherals, such as an AIC network card, an NVMe RAID array, etc.

Considering there's on average a 2% difference between PCIe 3.0 and 4.0, there would be a 0% performance improvement from using PCIe 5.0.
Widening the bandwidth of the PCIe bus is more about providing more flexibility in assigning lanes to several slots for different peripherals. You can do a lot of configurations with 16 Gen5 lanes from the CPU, such as two slots at x8/x8, three slots at x8/x4/x4, etc.

Nowadays, with x16 Gen5 lanes, you need much less space on the motherboard to divide lanes among several peripherals. With Gen3, you would need 64 lanes to provide the same data bandwidth.
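A quick sanity check of that lane math, using rounded per-lane rates (real-world usable throughput is a bit lower after protocol overhead):

```python
# Approximate per-lane bandwidth in GB/s per direction (rounded, ignores overhead).
per_lane_GBps = {"3.0": 1.0, "4.0": 2.0, "5.0": 4.0}

def link_GBps(gen, lanes):
    return per_lane_GBps[gen] * lanes

print(link_GBps("5.0", 16))   # 64.0 GB/s for an x16 Gen5 link
print(link_GBps("3.0", 64))   # 64.0 GB/s -- Gen3 indeed needs 64 lanes to match

# What each slot still delivers if the 16 Gen5 CPU lanes are split up:
for split in [(8, 8), (8, 4, 4)]:
    print(split, [link_GBps("5.0", n) for n in split])
# (8, 8) [32.0, 32.0]
# (8, 4, 4) [32.0, 16.0, 16.0]
```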

If I were to use two Gen 4 NVMe SSDs in RAID 0 on the M.2 slots that are wired to the chipset of a high-end Z790 board like the ASUS ROG Maximus Z790 Extreme, would I get the full PCIe x16 for the graphics card and get speeds matching a single Gen 5 NVMe SSD installed in the Gen 5 slot?
Two NVMe Gen4 drives on the chipset in RAID 0 cannot exceed the speed of the M.2 slot's x4 wiring, so you get 64 Gbps of traffic.
You would need to attach two NVMe Gen4 drives to an AIC in an x8 Gen4 slot to get this unified pool transmitting over eight lanes, which gives 128 Gbps. That would match a single NVMe Gen5 drive's traffic.
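Putting those figures side by side as raw signalling rates (ignoring encoding overhead and whatever the drives themselves can actually sustain):

```python
# Raw signalling rate per lane in Gbps (ignores 128b/130b overhead and SSD limits).
gbps_per_lane = {"gen4": 16, "gen5": 32}

chipset_raid_cap = gbps_per_lane["gen4"] * 4   # RAID 0 behind a single x4 Gen4 link: 64 Gbps
aic_x8_raid      = gbps_per_lane["gen4"] * 8   # two Gen4 x4 drives on an x8 Gen4 AIC: 128 Gbps
single_gen5_m2   = gbps_per_lane["gen5"] * 4   # one Gen5 x4 drive: 128 Gbps

print(chipset_raid_cap, aic_x8_raid, single_gen5_m2)   # 64 128 128
```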

The issue is that testing with the 5800X is disingenuous,
You can test with any recent CPU. The purpose was to see whether PCIe bus throughput was affected by cutting lanes or changing generations. Several CPUs and GPUs should be tested for consistency of the PCIe bus throughput data. This was the first test, but HUB have tested this before in more diverse PCIe configurations; have a watch on YouTube.
 
Does it matter? He was testing PCIe bus data exchange.
It absolutely does. He was testing GPU performance scaling, not bandwidth. Would you say the same if he tested PCIe 3.0 to PCIe 1.1 using an Athlon X4 950 Bristol Ridge? That is a quad-core AM4 Bulldozer CPU. Could you see the same PCIe scaling as the Athlon if he used the 7700X or 12900KS PCIe 3.0 to PCIe 1.1? No.

If you have a limit on CPU performance, you won't be able to see any difference in PCIe scaling or any other changes. If there is a bottleneck, it doesn't matter how fast or how wide the flow is after the bottleneck; it will still be limited by the bottleneck.
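A toy model of that point (all frame-time numbers are made up for illustration):

```python
# Toy pipeline model: the slowest stage sets the frame rate.
# All millisecond figures are illustrative, not measurements.
def fps(cpu_ms, gpu_ms, pcie_ms):
    return 1000 / max(cpu_ms, gpu_ms, pcie_ms)

print(fps(cpu_ms=8.0, gpu_ms=5.0, pcie_ms=1.0))  # 125.0 -- CPU-limited
print(fps(cpu_ms=8.0, gpu_ms=5.0, pcie_ms=2.0))  # 125.0 -- halving PCIe bandwidth changes nothing
print(fps(cpu_ms=4.0, gpu_ms=5.0, pcie_ms=2.0))  # 200.0 -- only a faster CPU exposes the other stages
```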
 
Maybe other sites that have 4 benchmarks and 5 comparison cards can do it, and they probably still recycle results .. I have 25 games, 30+ cards, 3 resolutions
Would you consider breaking down the three resolutions into SDR and HDR gaming performance? I could not find information on the testing page about whether games are tested with SDR or HDR output. I can see "highest quality settings". Does that include HDR being on in each game?

As shown by Tim from HUB, some game engines deal with HDR differently and need more processing when it is enabled, which lowers frame rates on an HDR monitor. This would mean that games need to be tested in both SDR and HDR to show differences in GPU performance in those scenarios. A game achieving, say, 70 fps on a 4K SDR monitor could achieve 63 fps on a 4K HDR monitor. Many games perform similarly in both modes, but it would be useful to identify the outliers, so that published charts are even more accurate.

Yeah, the i9 would also bring better results than the current CPU.
This is irrelevant. He was testing PCIe bus throughput, not which CPU performs better.

Curious if PCIe 4.0 vs 3.0 matters if you are recording using the AV1 encoder at the same time? Would it cause it to saturate a bit more?
Someone would need to test several workflows at the same time to tell us where the saturation point is. This somebody would need to have a lot of time to do that.

In some cases there is a considerable difference
PCIe.png
Please make pictures smaller before posting, as they appear gigantic in the comment section! Try dragging the picture's corner diagonally towards its centre.

Add that to the problem of 240 Hz 4K displays being vanishingly rare and expensive, and what we have is an average framerate that frequently exceeds the capabilities of most displays and/or CPUs.
It depends on what game people play and on what kind of gear. Here, average performance is a meaningless measure. If I only play Flight Simulator and Cyberpunk at ultra settings, which is very demanding on the GPU, I will not need more than a 4K/90 Hz HDR display. Even the mighty 4090 cannot give more than 73 fps in Flight Simulator in this case. Do we need 73 fps in Flight Simulator? Not necessarily. 60 is plenty. It's a slow-paced game mostly, unless someone enjoys fighter jet action. It's all about the use scenario.

The 4090 is a halo product that unlocks higher frame rates on 4K displays and could be purchased specifically for that purpose by those who need it. Anyone playing at 1080p or 1440p should never consider this card, unless they need insane frame rates; it's total overkill. Anyone who needs future-proof DisplayPort 2.0 connectivity should never buy 4000 series cards.
 
Should be very low, otherwise the GPU cores can't get loaded fully. Any idea if anyone has ever done a PCIe scaling test for these apps?
Not to my knowledge; that's why I am asking. I am curious about the saturation point and data rates for different workflows. How much bandwidth do different games use, and how much do those pro workflows send over the bus? I cannot find any chart or table with those GB/s figures.

If we had such data, we would roughly be able to see the saturation points for all generations of PCIe with simultaneous workflows, e.g. gaming while another app performs a render and a third one performs other graphics tasks.

It absolutely does. He was testing GPU performance scaling, not bandwidth. Would you say the same if he tested PCIe 3.0 to PCIe 1.1 using an Athlon X4 950 Bristol Ridge? That is a quad-core AM4 Bulldozer CPU. Could you see the same PCIe scaling as the Athlon if he used the 7700X or 12900KS PCIe 3.0 to PCIe 1.1? No.

If you have a limit on CPU performance, you won't be able to see any difference in PCIe scaling or any other changes. If there is a bottleneck, it doesn't matter how fast or how wide the flow is after the bottleneck; it will still be limited by the bottleneck.
True that. I stand corrected.
With DirectStorage 1.1, will the CPU bottleneck become less relevant in testing PCIe scaling?
 
True that. I stand corrected.
With DirectStorage 1.1, will the CPU bottleneck become less relevant in testing PCIe scaling?
No, it will not. DirectStorage increases the speed at which textures are loaded from the SSD, and instead of decompressing compressed textures and assets on the CPU, they are decompressed on the GPU for even faster loading of data. This decreases loading times whenever new assets are loaded by the game, e.g. in cutscenes, when changing the map or zone, or when streaming assets in an open-world game.
Other operations done by the CPU, such as calculating game logic and issuing GPU draw calls, will still be there and will still tax the CPU.
So no, the CPU won't become less relevant in bottlenecking PCIe scaling. With DirectStorage 1.1 you will see faster loading times and shorter cut scenes, but PCIe scaling will still be the same as before in terms of raw FPS values.
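A tiny illustrative model of why (assumed throughput numbers, not the DirectStorage API itself):

```python
# Illustrative model: GPU decompression shortens asset loads, but per-frame CPU
# work (game logic, draw calls) is untouched, so steady-state FPS -- and PCIe
# scaling measured in FPS -- stays the same. All numbers are assumptions.
def load_time_s(asset_GB, ssd_GBps, decompress_GBps):
    return asset_GB / ssd_GBps + asset_GB / decompress_GBps

print(load_time_s(10, ssd_GBps=7, decompress_GBps=2))    # ~6.4 s with CPU decompression
print(load_time_s(10, ssd_GBps=7, decompress_GBps=20))   # ~1.9 s with GPU decompression

frame_cpu_ms = 6.0   # game logic + draw calls, unchanged by DirectStorage
print(1000 / frame_cpu_ms)   # ~166.7 fps either way once the level is loaded
```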
 
I have always wanted PCIe to not require powers of 2 for the number of lanes enabled. This would be a perfect situation where an x16 slot with 12 enabled lanes would be great.
 