Monday, August 31st 2015
Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12
It turns out that NVIDIA's "Maxwell" architecture has an Achilles' heel after all, one that tilts the scales in favor of AMD's competing Graphics CoreNext (GCN) architecture as being better prepared for DirectX 12. "Maxwell" lacks support for async compute, one of the three highlight features of Direct3D 12, even as the GeForce driver "exposes" the feature's presence to apps. This came to light when game developer Oxide Games alleged that it was pressured by NVIDIA's marketing department to remove certain features from its "Ashes of the Singularity" DirectX 12 benchmark.
Async compute is a standardized API-level feature added to Direct3D by Microsoft, which allows an app to better exploit the number-crunching resources of a GPU by breaking its rendering workload into parallel queues of graphics and compute work. Since the NVIDIA driver tells apps that "Maxwell" GPUs support it, Oxide Games simply built its benchmark with async compute support, but when it attempted to use it on Maxwell, the result was an "unmitigated disaster." During the course of its developer correspondence with NVIDIA to try and fix the issue, it learned that "Maxwell" doesn't really support async compute at the bare-metal level, and that the NVIDIA driver bluffs its support to apps. NVIDIA instead started pressuring Oxide to remove the parts of its code that use async compute altogether, it alleges. "Personally, I think one could just as easily make the claim that we were biased toward NVIDIA as the only "vendor" specific-code is for NVIDIA where we had to shutdown async compute. By vendor specific, I mean a case where we look at the Vendor ID and make changes to our rendering path. Curiously, their driver reported this feature was functional but attempting to use it was an unmitigated disaster in terms of performance and conformance so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute so I don't know why their driver was trying to expose that. The only other thing that is different between them is that NVIDIA does fall into Tier 2 class binding hardware instead of Tier 3 like AMD which requires a little bit more CPU overhead in D3D12, but I don't think it ended up being very significant. This isn't a vendor specific path, as it's responding to capabilities the driver reports," writes Oxide, in a statement disputing NVIDIA's "misinformation" about the "Ashes of the Singularity" benchmark in its press communications (presumably to VGA reviewers).
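For readers unfamiliar with the feature, here is a minimal sketch (hypothetical, not Oxide's code) of how async compute is expressed in Direct3D 12: the application creates a COMPUTE command queue alongside the usual DIRECT (graphics) queue and submits work to both; whether the two actually execute concurrently is entirely up to the GPU and driver. Error handling is omitted and the snippet assumes the Windows 10 SDK.

// Minimal sketch: a dedicated compute queue next to the graphics (direct)
// queue. Requires the Windows 10 SDK; link against d3d12.lib.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    // Default adapter, minimum feature level D3D12 accepts.
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The regular graphics queue: accepts draw, copy and dispatch commands.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> graphicsQueue;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&graphicsQueue));

    // The "async compute" queue: dispatch-only work that the GPU may (or,
    // as in Maxwell's case, may not) overlap with rendering.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Recorded command lists are then submitted to each queue with
    // ExecuteCommandLists(), typically within the same frame.
    return 0;
}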
Given its growing market share, NVIDIA could use similar tactics to steer game developers away from industry-standard API features that it doesn't support, but which rival AMD does. NVIDIA's drivers tell Windows that its GPUs support DirectX 12 feature-level 12_1. We wonder how much of that support is faked at the driver level, like async compute. The company is already drawing flak for borderline anti-competitive practices with GameWorks, which effectively creates a walled garden of visual effects that only users of NVIDIA hardware get to experience for the same $59 everyone spends on a particular game.
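For reference, this is roughly how an application sees those driver-reported capabilities - a hedged sketch using ID3D12Device::CheckFeatureSupport, with the device assumed to have been created as shown above. It only shows what the driver claims; notably, D3D12 has no capability bit for async compute at all, so a benchmark like Oxide's can only find out by creating a compute queue and measuring.

// Sketch: querying the capabilities a D3D12 driver reports for its GPU.
#include <d3d12.h>
#include <cstdio>

void ReportCaps(ID3D12Device* device)
{
    // Resource binding tier -- the Tier 2 vs. Tier 3 distinction Oxide mentions.
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                &options, sizeof(options));
    std::printf("Resource binding tier: %d\n",
                static_cast<int>(options.ResourceBindingTier));

    // Highest feature level the driver claims, e.g. D3D_FEATURE_LEVEL_12_1.
    const D3D_FEATURE_LEVEL requested[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1
    };
    D3D12_FEATURE_DATA_FEATURE_LEVELS levels = {};
    levels.NumFeatureLevels        = 4;
    levels.pFeatureLevelsRequested = requested;
    device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS,
                                &levels, sizeof(levels));
    std::printf("Max feature level: 0x%x\n",
                static_cast<unsigned>(levels.MaxSupportedFeatureLevel));
}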
Sources:
DSOGaming, WCCFTech
196 Comments on Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12
So while you decry my hostility - which was in fact a simple retort about intellectual deficit (aimed at myself as well, having the IQ of a slug) - why are you not attacking the tone of the post from the FUD-spreader who laughs in my face and calls me a fanboy? I'm not turning the other cheek if someone intentionally offends me.
EDIT: where I come from, my post wasn't even a tickle near hostility.
It's inevitably the case that the most significant feature of a new graphics API will require new hardware to go with it and that's what we have here.
It also doesn't surprise me that NVIDIA would pressure a dev to remove problematic code from a DX12 benchmark in order not to be shown up. :shadedshu:
What should really happen is that the benchmark points out what isn't supported when run on pre-Pascal GPUs (and pre-Fury ones for AMD), but that's not happening, is it? It should then run that part of the benchmark on AMD Fury hardware, since it does support it. However, that part of the benchmark is simply not there at all, and that's the scandal.
If DX12 with the latest games and full GPU DX12 feature support (e.g. Pascal) doesn't have a real wow factor compelling users to upgrade, then this becomes a distinct possibility. :ohwell:
And I think it's not surprising that AMD have the upper hand on async compute. They just have more "muscle" to do it, especially if the game devs spam the GPU with a lot of graphics and compute tasks.
As far as I understand, NVIDIA GPUs will still do async compute, BUT they are limited to 31 command queues to stay effective (a.k.a. not overloading their scheduler), whereas AMD can go up to 64 command queues and still be just as effective.
NVIDIA = 1 graphics engine, 1 shader engine, with a 32-deep command queue (total of 32 queues, 31 of them usable for graphics/compute mode)
AMD = 1 graphics engine, 8 shader engines (which they call ACEs, or Async Compute Engines), each with an 8-deep command queue (total of 64 queues)
So if you spam a lot of graphics and compute commands (in a non-sequential way) at an NVIDIA GPU, you end up overloading its scheduler and it will do a lot of context switching (from graphics commands to compute commands and vice versa), which results in increased latency and hence longer processing times. This is what happened in this specific game demo (Ashes of the Singularity, AoS): it uses our GPU(s) to process the graphics commands (to render all of those little space-ship thingies) AND also the compute commands (the AI for every single space-ship thingy), and the more space-ship thingies there are, the more NVIDIA GPUs will suffer.
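As a rough sketch of the submission pattern being described (hypothetical names, not AoS code), a D3D12 engine kicks compute off on its own queue and joins it back to the graphics queue with a fence; a GPU that can genuinely run the two queues concurrently overlaps them, while one that has to context-switch effectively serializes them.

// Hypothetical sketch of cross-queue submission with a fence. Assumes the
// queues, fence and command lists were created/recorded elsewhere.
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12Fence*        fence,        // created once at startup
                 ID3D12CommandList*  computeWork,  // pre-recorded dispatches
                 ID3D12CommandList*  renderWork,   // pre-recorded draws
                 UINT64              fenceValue)
{
    // 1. Kick off compute work (e.g. per-unit AI / simulation) on its own queue.
    ID3D12CommandList* compute[] = { computeWork };
    computeQueue->ExecuteCommandLists(1, compute);
    computeQueue->Signal(fence, fenceValue);

    // 2. Rendering proceeds on the graphics queue. Ideally the GPU overlaps
    //    both queues; a GPU that must context-switch between graphics and
    //    compute serializes them instead and loses the benefit.
    ID3D12CommandList* render[] = { renderWork };
    graphicsQueue->ExecuteCommandLists(1, render);

    // 3. Any later graphics work that consumes the compute results waits on
    //    the fence -- on the GPU, not the CPU.
    graphicsQueue->Wait(fence, fenceValue);
}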
And you're all thinking: "AMD can only win in DX12, async compute rulezz!" Well, the fact is we don't know yet. We don't know how most game devs will deal with the graphics and compute sides of their games - whether they think it wise to offload most compute tasks to our GPUs (freeing up CPU resources, a.k.a. removing most of the CPU bottleneck) or just let the CPU do the compute tasks (less hassle in coding and especially in synchronizing).
Oh, and UE4's documentation covers the async compute implementation in their engine, here: docs.unrealengine.com/latest/INT/Programming/Rendering/ShaderDevelopment/AsyncCompute/index.html
It would be much appreciated if you could provide one for reading purposes, thank you very much.
What you don't know is that @RejZoR has been a long-time AMD supporter, and only recently got a 980 out of frustration.
Compliant being the key word. They aren't FULLY SUPPORTED. Careful with your words there, man. ;) Indeed. He did it shortly after the "I'll eat my shoes if AMD make the R9 390X GCN 1.1" debacle. He was like our AMD poster boy, only he's proven he'll go green if that's what it takes to get a good game. I wouldn't target him with assumptions, if I were you.
Async compute is a DirectX feature, not an AotS feature.
And I never said it's "exclusive". All I said is that this very specific game has been developed in cooperation with AMD basically since day one, using Mantle. And when Mantle was dropped, it moved to DX12. No one says Project Cars is NVIDIA exclusive, but we all know it was developed with NVIDIA basically since day one and, surprise surprise, it runs far better on NVIDIA than on any AMD card. Wanna call someone an AMD fanboy for that one? A single game doesn't reflect performance in ALL games.
AMD have focused on Mantle to get a better hardware-level implementation to suit their GCN 1.1+ architecture. From this they lit a fire under MS and got DX12 to be closer to the metal. This focus has left Nvidia to keep on top of things at the DX11 level.
Following Kepler, Nvidia have focused on efficiency and performance and Maxwell has brought them that in spades with DX11. Nvidia have effectively taken the opposite gamble of AMD. Nvidia has stuck with DX11 focus and AMD has forged on toward DX12.
So far so neutral.
They have both gambled and they will both win and lose. AMD have gambled that DX12 adoption will be rapid and that this will allow their GCN 1.1+ to provide a massive performance increase, quite likely surpassing Maxwell architecture designs - possibly, in best-case scenarios, with rebranded Hawaii matching top-level Maxwell (bravo AMD). Nvidia have likely figured that DX12 implementation will not happen quickly enough until 2016, so they have settled for Maxwell's DX11 performance efficiency. Nvidia for their part have probably 'fiddled' to pretend they have most awesome DX12 support when in reality it's a driver thing (as AoS apparently shows).
So, if DX12 implementation is slow, Nvidia gamble pays off. If DX12 uptake is rapid and occurs before Pascal, Nvidia lose (and will most definitely cheat with massive developer pressure and incentive). If DX12 comes through in bits and bobs, it will come down to what games you play (as always). However, as a gamer, I'm not upgrading to W10 until MS patches the 'big brother' updating mechanisms I keep reading about.
TL;DR? = Like everyone has been saying - despite AMD's GCN advantage, without a slew of top AAA titles, the hardware is irrelevant. If DX11 games are still being pumped out, GCN won't help. If DX12 comes earlier, AMD win.
The GTX 980 Ti is twice as fast as the Fury X, but only when it is under 31 simultaneous command lists.
The GTX 980 Ti performed roughly equal to the Fury X at up to 128 command lists.
This is why we need to wait for more games to be released before we jump to conclusions.
Maxwell v2 is not capable of concurrent async compute + rendering without incurring context-switch penalties, and it's in this context that Oxide made its remarks.
AMD's ACE units are designed to run concurrently with rendering without context penalties, and they include out-of-order features.
https://forum.beyond3d.com/threads/dx12-performance-thread.57188/page-10
From sebbbi:
The latency doesn't matter if you are using GPU compute (including async) for rendering. You should not copy the results back to the CPU or wait for the GPU on the CPU side. Discrete GPUs are far away from the CPU. You should not expect to see low latency. Discrete GPUs are not good for tightly interleaved mixed CPU->GPU->CPU work.
To see realistic results, you should benchmark async compute in rendering tasks. For example, render a shadow map while you run a tiled lighting compute shader concurrently (for the previous frame). Output the result to the display instead of waiting on the CPU for the compute to finish. For result timing, use GPU timestamps, do not use a CPU timer. CPU-side timing of the GPU results in lots of noise and even false results because of driver-related buffering.
---------------------
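For concreteness, here is a hedged sketch of the GPU-timestamp approach sebbbi describes (hypothetical helper functions built on the standard D3D12 query API): bracket the measured work with timestamp queries, resolve them into a readback buffer, and convert the delta using the queue's timestamp frequency rather than a CPU timer.

// Sketch only: assumes a query heap created with
// D3D12_QUERY_HEAP_TYPE_TIMESTAMP (Count = 2) and a 16-byte READBACK buffer.
#include <d3d12.h>

// Record two timestamps around the GPU work being measured.
void RecordTimestamps(ID3D12GraphicsCommandList* cmdList,
                      ID3D12QueryHeap* heap,
                      ID3D12Resource* readback)
{
    cmdList->EndQuery(heap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    // ... record the dispatches / draws being measured here ...
    cmdList->EndQuery(heap, D3D12_QUERY_TYPE_TIMESTAMP, 1);
    cmdList->ResolveQueryData(heap, D3D12_QUERY_TYPE_TIMESTAMP, 0, 2,
                              readback, 0);
}

// After the command list has executed and its fence has signalled,
// read the two tick values back and convert the delta to milliseconds.
double ReadElapsedMs(ID3D12CommandQueue* queue, ID3D12Resource* readback)
{
    UINT64* ticks = nullptr;
    readback->Map(0, nullptr, reinterpret_cast<void**>(&ticks));
    UINT64 freq = 0;
    queue->GetTimestampFrequency(&freq);   // ticks per second on this queue
    const double ms = 1000.0 * double(ticks[1] - ticks[0]) / double(freq);
    readback->Unmap(0, nullptr);
    return ms;
}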
An AMD APU would be king for tightly interleaved, mixed CPU->GPU->CPU work, e.g. the PS4's APU was designed for exactly this kind of workload.
The PS4 sports the same 8 ACE units as Tonga, Hawaii and Fury. The XBO is the baseline DirectX 12 GPU; it has two ACE units with 8 queues per unit, as per the Radeon HD 7790 (GCN 1.1).
The older GCN 1.0 also has two ACE units, but with 2 queues per unit, and it's less capable than GCN 1.1.
GCN 1.0 parts such as the 7970/R9 280X are still better than Fermi and Kepler in the concurrent async+render category. With Project Cars, AMD's lower DX11 draw-call limit is the problem.
Read the SMS PC Lead's comment on this issue at
forums.guru3d.com/showpost.php?p=5116716&postcount=901
For our mix of DX11 API calls, the API call consumption rate of the AMD driver is the bottleneck.
In Project Cars the number of draw calls per frame varies from around 5-6,000 with everything at Low up to 12-13,000 with everything at Ultra. Depending on the single-threaded performance of your CPU there will be a limit to the number of draw calls that can be consumed and, as I mentioned above, once that is exceeded GPU usage starts to drop. On AMD/Windows 10 this threshold is much higher, which is why you can run with higher settings without FPS loss.
I also mentioned the 'gaps' in the GPU timeline caused by not being able to feed the GPU fast enough - these gaps are why increasing resolution (like to 4K in the Anandtech analysis) makes for a better comparison between GPU vendors... In 4K, the GPU is being given more work to do, and either the gaps get filled by the extra work and become smaller... or the extra work means the GPU is now always running behind the CPU submission rate.
So, on my i7-5960k @ 3.0 GHz the NVIDIA (Titan X) driver can consume around 11,000 draw calls with our DX11 API call mix - the same Windows 7 system with a 290X and the AMD driver is CPU-limited at around 7,000 draw calls. On Windows 10 AMD reaches somewhere around 8,500 draw calls before the limit is hit (I can't be exact, since my Windows 10 box runs on a 3.5 GHz 6-core i7).
In Patch 2.5 (next week) I did a pass to reduce small draw-calls when using the Ultra settings, as a concession to help driver thread limitations. It gains around 8% for NVIDIA and about 15% (minimum) for AMD.
...
For Project Cars the 1040 driver is easily the fastest under Windows 10 at the moment - but my focus at the moment is on the fairly large engineering task of implementing DX12 support...
----------------------------------------
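As a hedged illustration of the kind of draw-call reduction the SMS developer describes (hypothetical function names, not Project Cars code): many identical small meshes issued one API call each versus a single instanced call. Each individual draw costs driver-thread CPU time, which is exactly the per-call consumption rate discussed above.

// Hypothetical D3D11 sketch; assumes vertex/index/instance buffers and
// pipeline state have already been bound on the context.
#include <d3d11.h>

// Naive path: one API call per object. Thousands of these per frame pile up
// on the driver thread long before the GPU itself becomes the limit.
void DrawPropsNaive(ID3D11DeviceContext* ctx, UINT indexCount, UINT objectCount)
{
    for (UINT i = 0; i < objectCount; ++i)
        ctx->DrawIndexed(indexCount, 0, 0);   // (a real renderer would also
                                              //  update per-object constants)
}

// Batched path: the same geometry drawn once, with per-object transforms
// supplied through an instance buffer -- one API call instead of thousands.
void DrawPropsInstanced(ID3D11DeviceContext* ctx, UINT indexCount, UINT objectCount)
{
    ctx->DrawIndexedInstanced(indexCount, objectCount, 0, 0, 0);
}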
Project Cars with DX12 is coming.
Those people who spent top dollar on high-end Nvidia cards recently are going to be disappointed over the coming months as new games are released.
I don't understand why some people are still defending Nvidia, just like they did with the 3.5 GB debacle. Nvidia has been dishonest here; if the game developer has gone public, Nvidia must have been assholes about trying to force them to disable the feature. They wanted to keep the consumer in the dark.
No wonder Nvidia keeps pulling stuff like this - their fanboys will always defend them. Or maybe Nvidia pays a bunch of people to troll the forums and defend them; it wouldn't surprise me. haha