
DirectX 12 API New Feature Set Introduces GPU Upload Heaps, Enables Simultaneous Access to VRAM for CPU and GPU

Joined
Nov 4, 2005
Messages
11,988 (1.72/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400Mhz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs, 24TB Enterprise drives
Display(s) 55" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
I want a GPU with an M.2 cache device like they were planning.
 
Joined
Aug 20, 2007
Messages
21,485 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
I want a GPU with an M.2 cache device like they were planning.
I don't. The latencies would be awful. Last time I checked, caches are supposed to be fast, which flash really isn't (in GPU terms).
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H150i Elite LCD XT White
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB, Toshiba N300 10TB HDD, Seagate Ironwolf 4T HDD
Display(s) Acer Predator X32FP 32in 160Hz 4K FreeSync/GSync DP, LG 32UL950 32in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
Joined
Nov 18, 2010
Messages
7,564 (1.48/day)
Location
Rīga, Latvia
System Name HELLSTAR
Processor AMD RYZEN 9 5950X
Motherboard ASUS Strix X570-E
Cooling 2x 360 + 280 rads. 3x Gentle Typhoons, 3x Phanteks T30, 2x TT T140 . EK-Quantum Momentum Monoblock.
Memory 4x8GB G.SKILL Trident Z RGB F4-4133C19D-16GTZR 14-16-12-30-44
Video Card(s) Sapphire Pulse RX 7900XTX. Water block. Crossflashed.
Storage Optane 900P[Fedora] + WD BLACK SN850X 4TB + 750 EVO 500GB + 1TB 980PRO+SN560 1TB(W11)
Display(s) Philips PHL BDM3270 + Acer XV242Y
Case Lian Li O11 Dynamic EVO
Audio Device(s) SMSL RAW-MDA1 DAC
Power Supply Fractal Design Newton R3 1000W
Mouse Razer Basilisk
Keyboard Razer BlackWidow V3 - Yellow Switch
Software FEDORA 41
Joined
Jan 8, 2017
Messages
9,451 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Blame the TPU database for that.

But in reality it had partial DX11 support.

DX11 has Shader Model 5.0, tessellation (hull & domain shaders), DirectCompute (CS 5.0), 16K textures, BC6H/BC7, extended pixel formats, and all 10_1 features.
 

Mussels

Freshwater Moderator
Joined
Oct 6, 2004
Messages
58,413 (7.93/day)
Location
Oystralia
System Name Rainbow Sparkles (Power efficient, <350W gaming load)
Processor Ryzen R7 5800x3D (Undervolted, 4.45GHz all core)
Motherboard Asus x570-F (BIOS Modded)
Cooling Alphacool Apex UV - Alphacool Eisblock XPX Aurora + EK Quantum ARGB 3090 w/ active backplate
Memory 2x32GB DDR4 3600 Corsair Vengeance RGB @3866 C18-22-22-22-42 TRFC704 (1.4V Hynix MJR - SoC 1.15V)
Video Card(s) Galax RTX 3090 SG 24GB: Underclocked to 1700Mhz 0.750v (375W down to 250W))
Storage 2TB WD SN850 NVME + 1TB Sasmsung 970 Pro NVME + 1TB Intel 6000P NVME USB 3.2
Display(s) Phillips 32 32M1N5800A (4k144), LG 32" (4K60) | Gigabyte G32QC (2k165) | Phillips 328m6fjrmb (2K144)
Case Fractal Design R6
Audio Device(s) Logitech G560 | Corsair Void pro RGB |Blue Yeti mic
Power Supply Fractal Ion+ 2 860W (Platinum) (This thing is God-tier. Silent and TINY)
Mouse Logitech G Pro wireless + Steelseries Prisma XL
Keyboard Razer Huntsman TE ( Sexy white keycaps)
VR HMD Oculus Rift S + Quest 2
Software Windows 11 pro x64 (Yes, it's genuinely a good OS) OpenRGB - ditch the branded bloatware!
Benchmark Scores Nyooom.
I am not sure if the CPU could use the huge bandwidth of the Graphics cards. The latency alone would kill it.
Works fine in current gen consoles.

I want a GPU with an M.2 cache device like they were planning.
That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
Works fine in current gen consoles.
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can make use of it.

Also, they are directly connected to it. They do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
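The 32 GB/s and 64 GB/s figures are rounded theoretical per-direction rates for an x16 link. A minimal Python sketch of where they come from, assuming 128b/130b line encoding and ignoring packet/protocol overhead (so measured throughput will be somewhat lower):

```python
# Theoretical per-direction PCIe bandwidth from the per-lane transfer rate.
# Assumptions: 128b/130b line encoding (PCIe 3.0 through 5.0) and no
# packet/protocol overhead, so measured throughput will be lower.

GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # gigatransfers per second, per lane


def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """One-direction bandwidth in GB/s (10**9 bytes per second)."""
    encoding_efficiency = 128 / 130
    bits_per_second = GT_PER_S[gen] * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    for gen in (4, 5):
        rate = pcie_bandwidth_gbs(gen, 16)
        print(f"PCIe {gen}.0 x16: {rate:.1f} GB/s per direction")
    # PCIe 4.0 x16: 31.5 GB/s per direction
    # PCIe 5.0 x16: 63.0 GB/s per direction
```

The same formula with 4 lanes gives about 7.9 GB/s for PCIe 4.0, which is the ceiling for an x4 NVMe drive.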
 
Joined
Mar 10, 2010
Messages
11,878 (2.21/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
A unified CPU + GPU memory architecture is nothing new; it just hasn't been done for consumer-level software, but you can get unified memory with CUDA or HIP on Linux right now.
Or on PS5 or Xbox; it's just levelling the PC up to the consoles.
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
I still suspect that by the end of this decade, we will have some high-performance chips on PC that are a single SoC à la MI300. At that point, having the GPU and CPU able to use the same memory will just make way more sense and will allow another level of performance.

But with a dedicated GPU, I think it will only be used marginally.
 

Mussels

Freshwater Moderator
Joined
Oct 6, 2004
Messages
58,413 (7.93/day)
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can make use of it.

Also, they are directly connected to it. They do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
They're still x86-64 Zen hardware, meaning AMD could definitely do a Zen4/Zen5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive, by a large margin, even though that bandwidth is shared with other traffic at the same time. It's also definitely going to be a lot faster than anything staged through system RAM, since every step you remove takes out latency, and that latency is the killer.
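To put that bandwidth gap in frame-time terms, here is a rough Python sketch. Both rates are illustrative assumptions (about 7 GB/s for a fast PCIe 4.0 x4 NVMe drive's sequential reads, and the theoretical ~31.5 GB/s of a PCIe 4.0 x16 link); in practice the link is also shared with other traffic.

```python
# Rough time to move a block of data at a given sustained bandwidth.
# The rates below are illustrative assumptions, not measurements.
ASSUMED_RATES_GBS = {
    "PCIe 4.0 x4 NVMe (sequential read)": 7.0,
    "PCIe 4.0 x16 link (theoretical)": 31.5,
}


def transfer_ms(size_gb: float, rate_gbs: float) -> float:
    """Milliseconds needed to move size_gb gigabytes at rate_gbs GB/s."""
    return size_gb / rate_gbs * 1000.0


if __name__ == "__main__":
    frame_ms = 1000.0 / 60  # one frame at 60 fps
    for name, rate in ASSUMED_RATES_GBS.items():
        t = transfer_ms(1.0, rate)  # a hypothetical 1 GB burst of assets
        print(f"{name}: {t:.1f} ms per GB, about {t / frame_ms:.1f} frames at 60 fps")
```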
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Current-gen console CPUs aren't the pinnacle of CPU processing power. Yes, they have access to much more memory bandwidth, but that doesn't mean they can make use of it.

Also, they are directly connected to it. They do not need to go via the PCIe bus. It doesn't matter if the GPU has 1 TB/s of bandwidth when you have to go through a PCIe x16 bus that is limited to 32 GB/s with PCIe 4.0 or 64 GB/s with PCIe 5.0.

You're mostly just saving a copy there.
The consoles' Zen 2 CPU clusters are limited by their internal Infinity Fabric links, despite the high bandwidth of the 256-bit or 320-bit GDDR6-14000 memory. Only the iGPU can fully exploit the system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is above the Xbox 360's 22.4 GB/s, or about half of the Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
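The memory figures in this post follow from bus width times data rate. A small Python sketch reproducing them; the fabric part is a hypothetical illustration, assuming a Zen 2-style CPU link that reads 32 bytes per fabric clock at an assumed 1.6 GHz fabric clock (the consoles' actual fabric clocks are not given in this thread).

```python
def dram_bandwidth_gbs(bus_width_bits: int, data_rate_mts: float) -> float:
    """Peak DRAM bandwidth in GB/s: bytes per transfer x mega-transfers/s."""
    return bus_width_bits / 8 * data_rate_mts / 1000.0


def fabric_read_gbs(bytes_per_clock: int, fclk_ghz: float) -> float:
    """Peak read bandwidth of one CPU-to-IO fabric link in GB/s."""
    return bytes_per_clock * fclk_ghz


MEMORY_CONFIGS = {
    # name: (bus width in bits, effective data rate in MT/s)
    "256-bit GDDR6-14000": (256, 14000),
    "320-bit GDDR6-14000": (320, 14000),
    "Xbox 360, 128-bit GDDR3-1400": (128, 1400),
    "Xbox One, 256-bit DDR3-2133": (256, 2133),
}

# Hypothetical CPU link: 32 bytes/clock read at an assumed 1.6 GHz fabric clock.
ASSUMED_LINK = (32, 1.6)

if __name__ == "__main__":
    for name, (width, rate) in MEMORY_CONFIGS.items():
        print(f"{name}: {dram_bandwidth_gbs(width, rate):.1f} GB/s")
    print(f"Assumed CPU fabric link (read): {fabric_read_gbs(*ASSUMED_LINK):.1f} GB/s")
```

Under these assumptions, a single CPU link sees roughly 11% of what 256-bit GDDR6-14000 can deliver, which is the point about the CPU cluster not exploiting the full memory bandwidth.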
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
The consoles' Zen 2 CPU clusters are limited by their internal Infinity Fabric links, despite the high bandwidth of the 256-bit or 320-bit GDDR6-14000 memory. Only the iGPU can fully exploit the system memory bandwidth.

PCIe 4.0 x16's 32 GB/s in the read direction is above the Xbox 360's 22.4 GB/s, or about half of the Xbox One's 68 GB/s texture memory bandwidth. A PC iGPU is not limited by a PCIe 4.0 x16 link.
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die will have to compete with traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

And also, I wonder how the caches will handle that, since they cache memory lines from main memory.


They're still x86-64 Zen hardware, meaning AMD could definitely do a Zen4/Zen5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive, by a large margin, even though that bandwidth is shared with other traffic at the same time. It's also definitely going to be a lot faster than anything staged through system RAM, since every step you remove takes out latency, and that latency is the killer.
VRAM can't be compared to NVMe, so I'm not sure what you are trying to describe here.

VRAM is temporary storage, same as main RAM, whereas NVMe is long-term storage. The data you would want to access in VRAM is probably not stored on the SSD anyway.

I see this more as a good way to make better use of the GPU. Things like mesh shaders are super powerful, but what you can do with them right now is limited by having to sync data between the GPU and CPU.

For example, I could see very complex mesh shaders that perform complex destruction on meshes that also affects the collision model. With this tech, the CPU could keep the collision model in VRAM.

It will really be about exchanging temporary data, not static data.
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Good point about the Infinity Fabric limits, and that would still be true in our case (traffic to the CPU die will have to compete with traffic from memory). But anyway, I don't really see a scenario where that bandwidth would be used up to that point.

And also, I wonder how the caches will handle that, since they cache memory lines from main memory.
AMD's 4700S recycled the PS5 APU, with 16 GB of GDDR6-14000 memory, for the PC market, and it was benchmarked.

Despite the APU's single-chip design, Infinity Fabric links still exist between the IO and CCX blocks, i.e. AMD's cut-and-paste engineering.

Renoir APU example:

[Image: Renoir APU example]


There's a reason some Epyc SKUs have dual Infinity Fabric links between each CCD and the IO die.

They're still x86-64 Zen hardware, meaning AMD could definitely do a Zen4/Zen5 variant available in the PC market.


32 GB/s is far faster than any current NVMe drive, by a large margin, even though that bandwidth is shared with other traffic at the same time. It's also definitely going to be a lot faster than anything staged through system RAM, since every step you remove takes out latency, and that latency is the killer.
FYI, AMD's 4700S recycled the PS5 APU, with 16 GB of GDDR6-14000 memory, for the PC market, and it was benchmarked.

AMD supplied "Design for Windows" ACPI UEFI firmware for the recycled PS5 APU in the 4700S so that it can boot ACPI HAL-enabled Windows.

A PS5 APU with "Design for Windows" ACPI UEFI firmware is an AMD-based x86-64 PC.

My QNAP NAS has an Intel Haswell Core i7 4770T CPU (45 W), and it can't directly boot Windows since it's missing "Design for Windows" ACPI-enabled UEFI. The same Core i7 4770T CPU was recycled from a slim Windows-based office PC.

AMD designed AM4 cooler mounting holes for the 4800S. AMD is not throwing defective console APUs with working Zen 2 CPUs in the bin, i.e. the BOM cost of these chips needs to be recovered.
 

Mussels

Freshwater Moderator
Joined
Oct 6, 2004
Messages
58,413 (7.93/day)
VRAM can't be compared to NVMe, so I'm not sure what you are trying to describe here.
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than with any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than with any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities. So it's better to just cache it there and let main memory do the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real uses of this technology in gaming, since many use cases don't exist yet: until now it didn't make sense to build them.
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Feeding one to the other as a cache system.

It's a ton faster (latency-wise) to go from NVMe to VRAM than with any of the current methods, so even with bandwidth lower than VRAM speeds, it's going to massively reduce stuttering on low-VRAM cards, for example.
Using system RAM as an art-asset cache didn't stop the stuttering mess in recent games that exceeded 8 GB of VRAM. The 32 GB/s from PCIe 4.0 x16 is about half of the Xbox One's texture bandwidth.

Using system RAM as a landing zone from NVMe adds extra memory-copy latency.

NVIDIA's GPUDirect with CUDA skips the system-memory landing zone and goes directly from NVMe to GPU VRAM.

Microsoft's current DX12U DirectStorage build for PC doesn't skip system memory and has a double data storage issue. PC DirectStorage needs to evolve once DX12U gains an AMD Fusion-like feature, i.e. this topic's DX12U improvements.
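A toy Python model of the three data paths discussed here, counting how many hops touch system RAM and how many need the CPU. The hop lists are a simplification of the paths as described in this thread (legacy I/O with CPU decompression, PC DirectStorage 1.1 with GPU decompression, and a GPUDirect-style peer-to-peer read), not a trace of any real driver.

```python
# Each path is a list of (source, destination, worker) hops.
PATHS = {
    "legacy I/O, CPU decompression": [
        ("NVMe", "system RAM", "DMA"),
        ("system RAM", "system RAM", "CPU decompress"),
        ("system RAM", "VRAM", "DMA"),
    ],
    "DirectStorage 1.1, GPU decompression": [
        ("NVMe", "system RAM", "DMA"),
        ("system RAM", "VRAM", "DMA"),
        ("VRAM", "VRAM", "GPU decompress"),
    ],
    "peer-to-peer, GPUDirect-style": [
        ("NVMe", "VRAM", "DMA"),
    ],
}


def system_ram_hops(path) -> int:
    """Number of hops that read from or write to system RAM."""
    return sum(1 for src, dst, _ in path if "system RAM" in (src, dst))


def cpu_hops(path) -> int:
    """Number of hops where the CPU itself does the work."""
    return sum(1 for _, _, worker in path if worker.startswith("CPU"))


if __name__ == "__main__":
    for name, path in PATHS.items():
        print(f"{name}: {system_ram_hops(path)} system-RAM hops, {cpu_hops(path)} CPU hops")
```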
 

Mussels

Freshwater Moderator
Joined
Oct 6, 2004
Messages
58,413 (7.93/day)
Maybe, but system RAM is generally way cheaper, upgradable, and available in greater quantities. So it's better to just cache it there and let main memory do the caching.

Think more about data that both the CPU and GPU need access to and that needs to be modified on the GPU.

It's quite hard to see the real uses of this technology in gaming, since many use cases don't exist yet: until now it didn't make sense to build them.
You missed the point: latency.
It has to be moved from storage to RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ install sizes right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, vs. the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work up to that point.
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
You missed the point: latency.
It has to be moved from storage to RAM, and no, you can't fit everything into RAM. There are multiple AAA titles out there with 100 GB+ install sizes right now.

If it has to go storage -> RAM -> VRAM, it's got delays every step of the way, vs. the GPU just loading what's needed directly.
DxDiag, for example, now shows a mix of VRAM + system RAM; with DirectStorage your NVMe drive becomes part of that setup too, and the GPU is aware of it, instead of the CPU processing all the work up to that point.
I think you're mixing up DirectStorage and this technology.


What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would be just the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you are describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and also bandwidth by sending the data directly where you want it.
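The latency argument above can be sketched as a simple per-request budget in Python. All three numbers are assumptions chosen only for their order of magnitude (about 80 µs for one NVMe read, about 20 GB/s for a CPU copy into a bounce buffer, about 31.5 GB/s for PCIe 4.0 x16); they are not measurements.

```python
# Per-request latency budget for one chunk read from an SSD and sent to VRAM.
# All figures are order-of-magnitude assumptions, not measurements.
SSD_READ_US = 80.0      # one NVMe read at low queue depth
RAM_COPY_GBS = 20.0     # CPU copy into a system-RAM bounce buffer
PCIE_GBS = 31.5         # PCIe 4.0 x16, one direction, theoretical


def move_us(size_bytes: int, rate_gbs: float) -> float:
    """Microseconds to move size_bytes at rate_gbs GB/s."""
    return size_bytes / (rate_gbs * 1e9) * 1e6


def request_budget_us(size_bytes: int, bounce_through_ram: bool) -> dict:
    """Break one request's latency into its component steps."""
    parts = {
        "ssd read": SSD_READ_US,
        "pcie transfer": move_us(size_bytes, PCIE_GBS),
    }
    if bounce_through_ram:
        parts["ram bounce copy"] = move_us(size_bytes, RAM_COPY_GBS)
    return parts


if __name__ == "__main__":
    chunk = 64 * 1024  # a hypothetical 64 KiB request
    for bounce in (True, False):
        parts = request_budget_us(chunk, bounce)
        total = sum(parts.values())
        share = parts["ssd read"] / total
        print(f"bounce={bounce}: total {total:.1f} us, SSD share {share:.0%}")
```

Under these assumptions the SSD read dominates either way; skipping the bounce copy mainly saves CPU time and memory bandwidth rather than latency.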


This technology is for scenarios where you want to compute on a data set with both the GPU and CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not many in games, since you can't do that today. The things it makes possible will appear as the technology is deployed.
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
I think you're mixing up DirectStorage and this technology.


What you describe is DirectStorage. What this technology allows is for the CPU to edit things in VRAM without having to copy them to local memory and, very importantly, without the GPU losing access to that data. It's true that you could maybe use it for something like DirectStorage, but that would be just the tip of the iceberg. (And mostly, you don't need this technology at all to achieve what you are describing; DirectStorage alone is enough.) Also, latency-wise, the main source of latency will be the SSD access, which is measured in microseconds, while system RAM latency is measured in nanoseconds. But you do save a big copy in RAM, so you save a lot of CPU cycles and also bandwidth by sending the data directly where you want it.


This technology is for scenarios where you want to compute on a data set with both the GPU and CPU, not just for copying stuff into VRAM without passing through system RAM. It has uses outside of games right now, but not many in games, since you can't do that today. The things it makes possible will appear as the technology is deployed.

You argued: "But system RAM is generally way cheaper, upgradable, and available in greater quantities. So it's better to just cache it there and let main memory do the caching."

The current PC DirectStorage implementation uses system memory as a landing zone.

[Image: Microsoft DirectStorage data path]


The current PC DirectStorage implementation has the PC's legacy "double copy" issue.


Before DirectStorage on Windows-based PCs:

[Image: legacy I/O data path]


Meanwhile, NVIDIA's GPUDirect in HPC markets:

[Image: NVIDIA GPUDirect data path]


NVIDIA's RTX IO with the Ampere generation: https://developer.nvidia.com/rtx-io
[Image: NVIDIA RTX IO diagram]


PC's current DirectStorage implementation needs middleware evolution toward a direct NVMe-to-GPU path, i.e. this topic's DX12U improvement direction.

PC's current DirectStorage 1.1 implementation is half-baked. The consoles' DirectStorage model is the destination.
 
Joined
Oct 12, 2005
Messages
708 (0.10/day)
You argued ....
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right. It's actually DMA access from main memory. The CPU doesn't have to intervene there, and the GPU can communicate directly with the memory controller to get the data.

It's somewhat the opposite of the technology described in this news.

In this news, it's the CPU that can directly access the GPU's memory.
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Good point!

GPU decompression was added with DirectStorage 1.1.

I was under the impression that DirectStorage 1.0 went directly from NVMe to the GPU, but you are right. It's actually DMA access from main memory. The CPU doesn't have to intervene there, and the GPU can communicate directly with the memory controller to get the data.

It's somewhat the opposite of the technology described in this news.

In this news, it's the CPU that can directly access the GPU's memory.
[Image: DirectStorage 1.1 data path]

PC's current DirectStorage 1.1 implementation. From https://devblogs.microsoft.com/directx/directstorage-1-1-now-available/

There are the chipset's DMA functions from NVMe to system memory (we're not in PIO mode), and then there are the GPU's DMA functions from system memory to GPU memory.

---
For this topic: Microsoft has announced a new DirectX 12 GPU optimization feature called GPU Upload Heaps, which works in conjunction with Resizable BAR and allows the CPU to have direct, simultaneous access to GPU memory. This can increase performance in DX12 titles and decrease system RAM utilization, since the feature circumvents the need to copy data from the CPU to the GPU.

CPU ping-pong with GPU VRAM is limited by PCIe 4.0 x16 (32 GB/s per direction).
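A rough Python sketch of what GPU Upload Heaps change, under stated assumptions. It compares the common staging pattern (the CPU fills an upload buffer in system RAM, then the GPU copies it into VRAM) with a direct CPU write into ReBAR-visible VRAM, and computes the per-frame ceiling of a ~31.5 GB/s PCIe 4.0 x16 link. The cost model is a simplification for illustration, not D3D12 API code.

```python
from dataclasses import dataclass

PCIE4_X16_GBS = 31.5  # theoretical per-direction rate (128b/130b, no protocol overhead)


@dataclass
class UploadCost:
    sysram_bytes: int    # bytes that must also live in a system-RAM buffer
    gpu_copy_bytes: int  # bytes moved by an extra GPU copy command
    pcie_bytes: int      # bytes that cross the PCIe link


def staging_upload(size: int) -> UploadCost:
    """Staging path: CPU fills an upload buffer in system RAM, then the
    GPU copies it over PCIe into a VRAM-resident buffer."""
    return UploadCost(sysram_bytes=size, gpu_copy_bytes=size, pcie_bytes=size)


def gpu_upload_heap(size: int) -> UploadCost:
    """ReBAR path: the CPU writes straight into CPU-visible VRAM, so no
    system-RAM copy and no separate GPU copy are needed."""
    return UploadCost(sysram_bytes=0, gpu_copy_bytes=0, pcie_bytes=size)


def max_bytes_per_frame(fps: float, link_gbs: float = PCIE4_X16_GBS) -> float:
    """Upper bound on CPU->VRAM writes per frame if the link did nothing else."""
    return link_gbs * 1e9 / fps


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # a hypothetical 64 MiB per-frame update
    print(staging_upload(size))
    print(gpu_upload_heap(size))
    print(f"Link ceiling at 60 fps: {max_bytes_per_frame(60) / 1e6:.0f} MB per frame")
```

Both paths still cross PCIe once; in this model the saving is the extra system-RAM buffer and the GPU copy, while the link ceiling stays the same.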

[Image: Killzone Shadow Fall CPU/GPU storage example]


Using the PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other side; e.g. the CPU should not be interested in the GPU's framebuffer and texture processing activities.
 
Joined
Mar 22, 2020
Messages
27 (0.02/day)
Works fine in current gen consoles.


That's just DirectStorage. The textures/files are pre-compiled to run instantly, so it's literally an NVMe cache for the GPU.
It works fine on current consoles because of unified memory. On PC you cannot use the VRAM for the CPU as you would system RAM, and vice versa; the PCIe link latency is too high.

Using the PS4's Killzone Shadow Fall as an example, the shared CPU-GPU data storage is usually small, i.e. the bulk of the CPU and GPU data sets doesn't need to be known by the other side; e.g. the CPU should not be interested in the GPU's framebuffer and texture processing activities.
Yes! That's because:
- the CPU and GPU do very different things with different data, and
- sharing data between the GPU and CPU is very costly even with unified memory because of memory coherency: if the GPU and CPU want to modify the same memory range, they have to be synchronized and their caches have to be flushed, which destroys performance. The same issue occurs with atomic operations inside a multi-core GPU, and it can destroy performance even on a single chip with access to a common cache!
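The coherency cost described above can be illustrated with a toy write-invalidate model in Python. It is not any real CPU's or GPU's coherency protocol: it only counts how often an assumed 64-byte cache line has to change owner when two agents write to it.

```python
from typing import Iterable, Tuple

LINE_SIZE = 64  # bytes per cache line (assumed)


def ownership_transfers(writes: Iterable[Tuple[str, int]]) -> int:
    """Count cache-line ownership changes under a simple write-invalidate
    model: a write to a line last written by another agent forces the
    line to be invalidated and handed over."""
    owner = {}
    transfers = 0
    for agent, address in writes:
        line = address // LINE_SIZE
        previous = owner.get(line)
        if previous is not None and previous != agent:
            transfers += 1
        owner[line] = agent
    return transfers


if __name__ == "__main__":
    # CPU and GPU alternately update fields that share one cache line.
    shared = [("cpu", 0), ("gpu", 8)] * 1000
    # Same number of writes, but each agent owns its own region.
    partitioned = [("cpu", 0), ("gpu", 4096)] * 1000
    print("shared line:", ownership_transfers(shared))        # 1999
    print("partitioned:", ownership_transfers(partitioned))   # 0
```

Keeping the genuinely shared data small, as in the Killzone Shadow Fall example, keeps the first case rare.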
 