
AMD Radeon RX Vega Preview

Joined
Jun 10, 2014
Nvidia's drivers allow for multithreaded draw calls to be decoupled from what the API does. AMD's do not.

You can argue with me all day; this is a well-known fact: AMD's poor performance over the last couple of years had everything to do with a lack of multithreaded drivers.
That makes no sense. While the internal processing in the driver may utilize multiple threads, a single render pass executes its API calls from a single thread, and the internal queue ends up as a linear stream of native operations. There is no difference between AMD and Nvidia here. Lack of "multithreading" has never been the problem for GCN.
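For illustration, here is a minimal sketch assuming D3D11 (the helper name RecordRenderPass and its parameters are hypothetical, not code from either driver): the pass is recorded through the single immediate context, so whatever threading the driver uses internally, it receives the pass as one ordered stream of calls.

```cpp
// Minimal D3D11 sketch: a render pass is recorded on one thread through the
// immediate context, so the driver sees a linear stream of native operations.
#include <d3d11.h>

// Hypothetical helper; device/swap-chain setup and shaders are omitted.
void RecordRenderPass(ID3D11DeviceContext* immediate,
                      ID3D11RenderTargetView* rtv,
                      ID3D11Buffer* vb, UINT stride, UINT vertexCount)
{
    const float clearColor[4] = { 0.0f, 0.0f, 0.0f, 1.0f };
    UINT offset = 0;

    // These calls execute on the calling (render) thread and are queued
    // strictly in submission order.
    immediate->OMSetRenderTargets(1, &rtv, nullptr);
    immediate->ClearRenderTargetView(rtv, clearColor);
    immediate->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    immediate->IASetVertexBuffers(0, 1, &vb, &stride, &offset);
    immediate->Draw(vertexCount, 0);
}
```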
 
Joined
Jun 10, 2014
No, I actually understand this, and I don't reference random things from the Internet that I don't comprehend.
As mentioned in #124, you can have multiple threads building a queue, but it comes at the cost of synchronization overhead. Rendering is a pipelined process of steps: you can parallelize inside each step, but the steps still have to be executed in a serial manner. So if a rendering pass consists of steps a) -> b) -> c) -> d) -> …, you can use deferred contexts to have four threads submitting commands to the queue, but you can't have one thread working on c) while another is working on a). Synchronizing CPU threads is very expensive, and doing it many times during a single frame will cost milliseconds. It only makes sense when the overhead of the rendering thread (engine overhead, not API overhead) is greater than the synchronization overhead, which is unusual. The code example shows a simple scene with just a few cubes, while rendering in games is much more complex, so applying this is much more challenging. This technique is therefore only applicable to certain scenarios and edge cases.

As is evident from the code example you clearly don't understand, this has to be designed into the rendering engine. The feature works around CPU overhead in the rendering engine itself, not in the driver. I can guarantee that this is not what makes Pascal and Maxwell outperform GCN, since it is in the rendering engine's realm and outside Nvidia's control. And using deferred contexts makes no difference from the GPU's side; this is purely a rendering engine optimization.
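For reference, a minimal sketch of the deferred-context pattern under discussion, assuming D3D11; the helpers RecordChunk and SubmitPassWithWorkers are hypothetical and error handling is omitted. It shows both the parallel recording and the serial execution, including the join that the synchronization cost comes from.

```cpp
// D3D11 deferred-context sketch: workers record command lists in parallel,
// but the lists are still executed serially, in a fixed order, on the
// immediate context.
#include <d3d11.h>
#include <thread>
#include <vector>

void RecordChunk(ID3D11DeviceContext* deferred)  // hypothetical per-worker work
{
    // ... state setup and Draw() calls for this worker's share of the pass ...
}

void SubmitPassWithWorkers(ID3D11Device* device, ID3D11DeviceContext* immediate)
{
    const int workerCount = 4;
    std::vector<ID3D11DeviceContext*> deferred(workerCount, nullptr);
    std::vector<ID3D11CommandList*>   lists(workerCount, nullptr);
    std::vector<std::thread>          workers;

    for (int i = 0; i < workerCount; ++i)
        device->CreateDeferredContext(0, &deferred[i]);

    // Parallel part: each worker records its own command list.
    for (int i = 0; i < workerCount; ++i)
        workers.emplace_back([&, i] {
            RecordChunk(deferred[i]);
            deferred[i]->FinishCommandList(FALSE, &lists[i]);
        });

    // Synchronization point: nothing reaches the GPU until every worker joins.
    for (auto& t : workers)
        t.join();

    // Serial part: the immediate context replays the lists in order, so the
    // driver still ends up with one linear stream for the pass.
    for (int i = 0; i < workerCount; ++i) {
        immediate->ExecuteCommandList(lists[i], FALSE);
        lists[i]->Release();
        deferred[i]->Release();
    }
}
```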
 
Joined
Jan 8, 2017

That piece of documentation was for you to understand how this concept works. They implemented a form of this optimization as an automated feature done by the driver shortly after the launch of Kepler; they even made a big deal out of how they suddenly got xx% more performance. There are also a ton of tests done on DX11 games that confirm AMD's drivers hammer just one core/thread at a time, while Nvidia hardware gets a more balanced load across cores/threads.

Here: https://developer.nvidia.com/dx12-dos-and-donts

I'll just pick out some key points (a sketch of the pattern follows the quoted list):

  • Consider a ‘Master Render Thread’ for work submission with a couple of ‘Worker Threads’ for command list recording, resource creation and Pipeline State Object (PSO) compilation
    • The idea is to have the worker threads generate command lists and for the master thread to pick those up and submit them
  • Expect to maintain separate render paths for each IHV minimum
    • The app has to replace driver reasoning about how to most efficiently drive the underlying hardware

  • Don’t rely on the driver to parallelize any Direct3D12 work in driver threads
    • On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12
    • While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.
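For concreteness, here is a minimal sketch of that 'master render thread plus worker threads' pattern, assuming D3D12; the helpers RecordWorkerCommands and SubmitFrame are hypothetical, and fencing/allocator reuse is omitted.

```cpp
// D3D12 sketch: worker threads record command lists on the application's own
// threads; the master render thread picks them up and submits them.
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

void RecordWorkerCommands(ID3D12GraphicsCommandList* cl)  // hypothetical
{
    // ... resource barriers, PSO binding and draw calls for this worker ...
}

void SubmitFrame(ID3D12Device* device, ID3D12CommandQueue* queue)
{
    const int workerCount = 4;
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(workerCount);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(workerCount);
    std::vector<std::thread>                       workers;

    for (int i = 0; i < workerCount; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), nullptr,
                                  IID_PPV_ARGS(&lists[i]));
    }

    // Worker threads: under DX12 the driver no longer farms this out, so the
    // application records command lists in parallel itself.
    for (int i = 0; i < workerCount; ++i)
        workers.emplace_back([&, i] {
            RecordWorkerCommands(lists[i].Get());
            lists[i]->Close();
        });
    for (auto& t : workers)
        t.join();

    // Master render thread: pick up the recorded lists and submit them.
    std::vector<ID3D12CommandList*> submit;
    for (auto& l : lists)
        submit.push_back(l.Get());
    queue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());
    // Fencing and allocator reuse omitted for brevity.
}
```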
But hey, this has nothing at all to do with multithreading at the driver level. I mean, Nvidia clearly has no clue what they are talking about.

Look, we're not getting anywhere; you don't want to acknowledge that this is how their drivers work, for one reason or another. Carry on with your belief. In this situation I suggest we drop this discussion; it's way off topic.
 
Joined
Jun 10, 2014
That piece of documentation was for you to understand how this concept works. They implemented a form of this optimization as an automated feature done by the driver shortly after the launch of Kepler
The documentation describes using a feature to have multiple threads dispatch commands.

In #125 you said this:
Nvidia's drivers allow for multithreaded draw calls to be decoupled from what the API does. AMD's do not.

You can argue with me all day; this is a well-known fact: AMD's poor performance over the last couple of years had everything to do with a lack of multithreaded drivers.
Would you please make up your mind? In one instance it's decoupled from the API, and in the next it's inside the driver?

Back to your code example: this has nothing to do with driver implementation or hardware architecture, but simply with how the rendering engine interfaces with the driver. Pascal and Maxwell don't scale better because games interface differently with Nvidia hardware; no, both the render code and the API are in fact the same. All the rendering engine sees are queues of API commands, sometimes multiple queues, even both rendering and compute queues. The driver translates these commands into the GPU's native API. The render engine never does low-level scheduling; it never knows which GPU cluster will do what in which clock cycle, it doesn't do resource dependency analysis and queue read/write operations, it doesn't estimate what will or won't be in L2 cache, etc. All of this is handled by the GPU's internal scheduler.

Modern GPUs are fitted with multiple separate memory controllers. Only one GPU cluster can read from a memory bank at a time, so real-time dependency analysis is done in the GPU scheduler as it receives the queue from the driver. Whenever multiple clusters need the same resource, or resources from the same bank, you get a stall. This is the core of the problem with utilization in GCN. This low-level GPU scheduling is not only impossible from the driver and game engine side; it would also result in single frames taking minutes to render.
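As a minimal sketch of what those queues look like from the engine's side, assuming D3D12 (the helper CreateEngineQueues is hypothetical): the engine creates the queues and hands command lists to them; everything below that level is handled by the driver and the hardware scheduler.

```cpp
// D3D12 sketch: the engine only sees queues it submits command lists to --
// one graphics (direct) queue and, optionally, a compute queue. Scheduling
// work onto GPU clusters and tracking resource dependencies happens below
// this level, in the driver and the GPU's own scheduler.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void CreateEngineQueues(ID3D12Device* device,
                        ComPtr<ID3D12CommandQueue>& graphicsQueue,
                        ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // rendering queue
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // async compute queue
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // From here on the engine records command lists and calls
    // ExecuteCommandLists() on these queues; which cluster runs what, and
    // when, is decided by the GPU's internal scheduler.
}
```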
 
Joined
Jan 8, 2017
Would you please make up your mind?

I don't need to. I think I made it very clear what I had to say, and I brought up enough information. You can look further into this yourself; I can't carry on with this forever. So I say again: we'd better drop this discussion.
 

VSG

Editor, Reviews & News
Staff member
Joined
Jul 1, 2014
I honestly didn't mind this conversation at all, and would not call it way off-topic. You both were courteous to each other and cited sources wherever possible, so props for that. But I do agree it would be better to take this elsewhere, if only so that others coming in to check for updates don't see something they weren't expecting.
 