Monday, August 31st 2015
Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12
It turns out that NVIDIA's "Maxwell" architecture has an Achilles' heel after all, one that tilts the scales in favor of AMD's competing Graphics CoreNext (GCN) architecture as the better-prepared design for DirectX 12. "Maxwell" lacks support for async compute, one of the headline features of Direct3D 12, even as the GeForce driver "exposes" the feature to apps. This came to light when game developer Oxide Games alleged that it was pressured by NVIDIA's marketing department to remove certain features from its "Ashes of the Singularity" DirectX 12 benchmark.
Async Compute is a standardized API-level feature added to Direct3D by Microsoft, which allows an app to better exploit the number-crunching resources of a GPU by breaking its rendering workload down into separate graphics and compute tasks. Since the NVIDIA driver tells apps that "Maxwell" GPUs support it, Oxide Games simply built its benchmark with async compute support, but when it attempted to use the feature on "Maxwell," the result was an "unmitigated disaster." During the course of its developer correspondence with NVIDIA to try and fix the issue, Oxide learned that "Maxwell" doesn't really support async compute at the bare-metal level, and that the NVIDIA driver bluffs its support to apps. NVIDIA instead began pressuring Oxide to remove the parts of its code that use async compute altogether, the developer alleges. "Personally, I think one could just as easily make the claim that we were biased toward NVIDIA as the only "vendor" specific-code is for NVIDIA where we had to shutdown async compute. By vendor specific, I mean a case where we look at the Vendor ID and make changes to our rendering path. Curiously, their driver reported this feature was functional but attempting to use it was an unmitigated disaster in terms of performance and conformance so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute so I don't know why their driver was trying to expose that. The only other thing that is different between them is that NVIDIA does fall into Tier 2 class binding hardware instead of Tier 3 like AMD which requires a little bit more CPU overhead in D3D12, but I don't think it ended up being very significant. This isn't a vendor specific path, as it's responding to capabilities the driver reports," writes Oxide, in a statement disputing NVIDIA's "misinformation" about the "Ashes of the Singularity" benchmark in its press communications (presumably to VGA reviewers).
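In concrete API terms, an application opts into async compute by creating a compute-only command queue alongside its usual direct (graphics) queue and submitting dispatches to it; whether the two streams actually overlap on the GPU is up to the hardware and driver. A minimal sketch, assuming an already-initialized ID3D12Device (the MakeQueue helper name is purely illustrative):

```cpp
// Minimal sketch (not Oxide's code): creating a compute-only queue next to the
// usual direct queue in D3D12. Assumes an initialized ID3D12Device*.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> MakeQueue(ID3D12Device* device, D3D12_COMMAND_LIST_TYPE type)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type  = type;  // DIRECT = graphics, COMPUTE = async compute
    desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Usage: record graphics work against the direct queue and compute dispatches
// against the compute queue; whether they run concurrently is up to the
// hardware and the driver.
// auto graphicsQueue = MakeQueue(device, D3D12_COMMAND_LIST_TYPE_DIRECT);
// auto computeQueue  = MakeQueue(device, D3D12_COMMAND_LIST_TYPE_COMPUTE);
```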
Given its growing market share, NVIDIA could use similar tactics to steer game developers away from industry-standard API features that it doesn't support but that rival AMD does. NVIDIA's drivers tell Windows that its GPUs support DirectX 12 feature-level 12_1; we wonder how much of that support is faked at the driver level, like async compute. The company is already drawing flak for borderline anti-competitive practices with GameWorks, which effectively creates a walled garden of visual effects that only users of NVIDIA hardware get to experience for the same $59 everyone spends on a particular game.
Sources:
DSOGaming, WCCFTech
196 Comments on Lack of Async Compute on Maxwell Makes AMD GCN Better Prepared for DirectX 12
source: ka_rf @ beyond3d forum
their marketing department should have a full schedule over the next few days
Interesting that on both architectures, 100 threads appears to be the sweet spot.
Those bastards may even engineer Pascal completely with dx12 in mind.
But seriously, the Maxwell architecture seems to handle concurrency between async tasks themselves just fine (latencies are in line with the 32-deep queue)... the problem is the graphics workload being synchronous against the async compute workload. If there is no architectural reason for it to be that way, this could be solved through a driver update. The troubling thing is, if NVIDIA knew they could fix it with a driver update, they'd have been faster with their response. Maybe Jen-Hsun Huang is writing a heartwarming letter.
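For what it's worth, one rough way to probe that claim (a sketch of the idea, not the Beyond3D tool) is to time a graphics command list and a compute command list submitted to separate queues with no cross-queue fence between them; the queues, recorded command lists, and fresh fences below are assumed to already exist:

```cpp
// Hedged sketch: if the GPU can overlap the two queues, the wall-clock time for
// the pair approaches max(t_graphics, t_compute); if graphics serializes against
// async compute (as alleged for Maxwell), it approaches t_graphics + t_compute.
// Assumes already-created queues, recorded command lists, and fresh fences.
#include <windows.h>
#include <d3d12.h>

double TimeConcurrentSubmit(ID3D12CommandQueue* graphicsQueue,
                            ID3D12CommandQueue* computeQueue,
                            ID3D12CommandList*  gfxList,
                            ID3D12CommandList*  computeList,
                            ID3D12Fence* gfxFence, ID3D12Fence* computeFence)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    // Submit both workloads with no dependency between the queues.
    graphicsQueue->ExecuteCommandLists(1, &gfxList);
    computeQueue->ExecuteCommandLists(1, &computeList);

    // Signal a fence on each queue and busy-wait on the CPU for both to finish.
    graphicsQueue->Signal(gfxFence, 1);
    computeQueue->Signal(computeFence, 1);
    while (gfxFence->GetCompletedValue() < 1 || computeFence->GetCompletedValue() < 1)
        Sleep(0);

    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart); // seconds
}
```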
Yeah, AMD technical marketing might not be your best source for info about competitor products. Combine that with a meltdown from a game dev... Then we have some good old fashioned NVIDIA bashing.
wccftech.com/nvidia-async-compute-directx-12-oxide-games/
"We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute.
"
... meanwhile, NVIDIA's actual response: www.guru3d.com/news-story/nvidia-will-fully-implement-async-compute-via-driver-support.html
By the time 10 DX12 games show up on the market, Pascal will already be more than a year old.
Because of that I don't see a reason for panic; a card supporting DX12 is not the same as a card capable of offering playable fps.
I remember when the 5870 with 2GB showed up; I bought it immediately as the first card with DX11 support.
The card was excellent, but only for DX9 and the few DX10 environments; the first card with playable fps and much better tessellation under DX11 was the GTX 580.
I went through the ATI 5870 and the ATI 6970, but only with the GTX 580 did the situation really improve, and later with Tahiti. In the period between the ATI 5870 and the AMD 7970, AMD didn't improve anything on the DX11 front, and people who waited and upgraded to the GTX 580 played much better, until the HD 7950/HD 7970.
Because of that, there's no reason to panic; NVIDIA will be ready when the time comes...
The only other bad thing is NVIDIA's tendency to write drivers only for its latest architecture.
If they continue to do that, people will turn their backs on them. At least the mid-range segment will.
That's a much bigger reason for worry than Maxwell and DX12. We won't be playing nice DX12 games for at least 2 years.
Maybe some very rich people with multi-GPU setups will. But I'm talking about people who, as always, play games on a single powerful graphics card.
For pure compute, AMD's compute latency (green color areas) rivals NVIDIA's compute latency (refer to the attached file).
www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/1710#post_24368195
Here's what I think they did at Beyond3D:
- They set the number of threads per kernel to 32 (they're CUDA programmers, after all).
- They've bumped the kernel count to up to 512 (16,384 threads total).
- They're scratching their heads wondering why the results don't make sense when comparing GCN to Maxwell 2.
Here's why that's not how you code for GCN.
Why?
- Each CU can have 40 Kernels in flight (each made up of 64 threads to form a single Wavefront).
- That's 2,560 Threads total PER CU.
- An R9 290x has 44 CUs or the capacity to handle 112,640 Threads total.
If you load up GCN with kernels made up of 32 threads, you're wasting resources. If you're not pushing GCN, you're wasting compute potential. Slide number 4 stipulates that latency is hidden by executing overlapping wavefronts. This is why GCN appears to have a high degree of latency, yet you can execute a ton of work on it without affecting that latency. With Maxwell/2, latency rises like a staircase the more work you throw at it. I'm not sure if the folks at Beyond3D are aware of this or not.
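As a quick sanity check on those occupancy numbers (our own back-of-the-envelope arithmetic, using only the figures quoted above):

```cpp
// Back-of-the-envelope occupancy math for the figures quoted above.
#include <cstdio>

int main()
{
    const int wavefrontSize      = 64;  // threads per GCN wavefront
    const int wavesInFlightPerCU = 40;  // kernels/wavefronts in flight per CU (as quoted)
    const int cuCount290X        = 44;  // compute units on an R9 290X

    const int threadsPerCU = wavefrontSize * wavesInFlightPerCU;  // 2,560
    const int threadsTotal = threadsPerCU * cuCount290X;          // 112,640

    // A 32-thread kernel still occupies a full 64-wide wavefront,
    // so half of each wavefront's lanes sit idle.
    const double laneUtilization32 = 32.0 / wavefrontSize;        // 0.5

    std::printf("Threads in flight per CU: %d\n", threadsPerCU);
    std::printf("Threads in flight, R9 290X: %d\n", threadsTotal);
    std::printf("Lane utilization with 32-thread kernels: %.0f%%\n",
                laneUtilization32 * 100.0);
    return 0;
}
```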
Conclusion:
I think they geared this test towards NVIDIA's CUDA architectures and are wondering why their results don't make sense on GCN. If true... DERP! That's why I said the individual latency results don't matter. This test is only good if you're checking async functionality.
GCN was built for parallelism, not serial workloads like NVIDIA's architectures. This is why you don't see GCN taking a hit with 512 kernels.
What did Oxide do? They built two paths: one with shaders optimized for CUDA and the other with shaders optimized for GCN. On top of that, GCN has async compute working. Therefore it's not hard to see why GCN performs so well in Oxide's engine. It's the better architecture if you push it and code for it. If you're only using light compute work, NVIDIA's architectures will be superior.
This means the burden is on developers to ensure they're optimizing for both. In the past, that hasn't been the case. Going forward... I hope they do. As for GameWorks titles, don't count on them being optimized for GCN. That's a given. Oxide played fair; others... might not.
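For reference, the "vendor specific" check Oxide describes in its statement (looking at the Vendor ID and changing the rendering path) typically boils down to something like the following DXGI sketch; the RenderPath names and the ChooseRenderPath helper are illustrative, while 0x10DE and 0x1002 are the standard PCI vendor IDs for NVIDIA and AMD:

```cpp
// Hedged sketch (not Oxide's code): picking a render path from the adapter's
// PCI vendor ID via DXGI. The enum and helper names are illustrative.
#include <dxgi.h>

enum class RenderPath { Generic, GcnOptimized, CudaStyleOptimized };

constexpr UINT VENDOR_NVIDIA = 0x10DE;  // standard PCI vendor ID
constexpr UINT VENDOR_AMD    = 0x1002;  // standard PCI vendor ID

RenderPath ChooseRenderPath(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    if (desc.VendorId == VENDOR_AMD)    return RenderPath::GcnOptimized;
    if (desc.VendorId == VENDOR_NVIDIA) return RenderPath::CudaStyleOptimized;
    return RenderPath::Generic;
}
```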
GCN shows a constant latency, good enough for compute loads made of a small number of async tasks and great for a huge number of async tasks. Additionally, GCN mixes the async compute load and the graphics load in near-perfect parallelism.
Maxwell shows varying latency that is extremely low for a small number of async tasks but climbs past GCN's somewhere above 128 async tasks. What's really bad is that, with current drivers, the async compute load and the graphics load are executed serially.
Mind you, every single async compute task is parallel in itself and can occupy 100% of the GPU if the job is suitable (parallelizable), so in most cases the penalty boils down to how many times, and how much, context switching is done. Maxwell has a nice cache hierarchy to help with that.
GCN should destroy Maxwell in special cases where a huge number of async tasks depend on results calculated by a huge number of other async tasks that vary greatly in computational complexity ;)
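The dependency case described above maps onto D3D12 fences: one queue signals when a batch of results is ready and another queue waits on that value before consuming them. A hedged sketch, assuming the queues, recorded command lists, and fence already exist:

```cpp
// Hedged sketch: chaining dependent async batches with a fence. Wait() blocks
// the queue on the GPU timeline (not the CPU) until the fence reaches the value.
#include <d3d12.h>

void SubmitDependentBatches(ID3D12CommandQueue* producerQueue,
                            ID3D12CommandQueue* consumerQueue,
                            ID3D12CommandList*  producerBatch,
                            ID3D12CommandList*  consumerBatch,
                            ID3D12Fence* fence, UINT64 fenceValue)
{
    producerQueue->ExecuteCommandLists(1, &producerBatch);
    producerQueue->Signal(fence, fenceValue);   // mark producer results as ready

    consumerQueue->Wait(fence, fenceValue);     // stall this queue until the signal
    consumerQueue->ExecuteCommandLists(1, &consumerBatch);
}
```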
To begin with, Cam McRae (Technical Director for the Windows 10 PC version) explained how they’re going to use DirectX 12 and even Async Compute in Gears of War Ultimate.