That claim is presented at the beginning of the article. If you read through to the end, the benchmark proves it is not true (number of queues on the horizontal axis, compute time on the vertical axis; lower is better).
[Attachment 67772: compute time vs. number of queues]
Maxwell is faster than GCN up to 32 queues; beyond that the two even out, with GCN holding the same speed all the way to 128 queues.
It also shows that with async shaders, how they are compiled for each architecture matters enormously.
Good find
@RejZoR
From
https://forum.beyond3d.com/posts/1870374/
For pure compute, AMD's compute latency (the green areas) rivals NVIDIA's (refer to the attached file).
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1710#post_24368195
Here's what I think they did at Beyond3D:
- They set the thread count, per kernel, to 32 (they're CUDA programmers after all).
- They bumped the kernel count up to 512 (16,384 threads total).
- They're scratching their heads wondering why the results don't make sense when comparing GCN to Maxwell 2 (a sketch of that launch pattern follows below).
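For what it's worth, here's a minimal sketch of what that kind of test might look like in CUDA. The kernel body, the 32-stream pool, and the event timing are my own guesses for illustration, not Beyond3D's actual code:

```cuda
// A guessed reconstruction of the test pattern: many tiny 32-thread kernels
// launched across a pool of streams, timing the whole batch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny_kernel(float *out)
{
    // One warp-sized block doing a token amount of work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i) * 2.0f;
}

int main()
{
    const int kernels = 512;   // kernel count ramped up to 512
    const int threads = 32;    // 32 threads per kernel (warp-sized)
    float *out;
    cudaMalloc((void **)&out, kernels * threads * sizeof(float));

    cudaStream_t streams[32];  // a pool of streams so launches can overlap
    for (int s = 0; s < 32; ++s) cudaStreamCreate(&streams[s]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < kernels; ++k)
        tiny_kernel<<<1, threads, 0, streams[k % 32]>>>(out + k * threads);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d kernels x %d threads: %.3f ms\n", kernels, threads, ms);

    for (int s = 0; s < 32; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(out);
    return 0;
}
```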
Here's why that's not how you code for GCN:
- Each CU can have 40 wavefronts in flight (each wavefront made up of 64 threads).
- That's 2,560 threads total PER CU.
- An R9 290X has 44 CUs, or the capacity to handle 112,640 threads total (the arithmetic is spelled out below).
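Just to make the arithmetic explicit, a trivial host-side snippet (values taken straight from the list above):

```cuda
// Back-of-the-envelope occupancy math for an R9 290X
#include <cstdio>

int main()
{
    const int wavefront_size    = 64;  // threads per wavefront on GCN
    const int wavefronts_per_cu = 40;  // in-flight wavefronts per CU
    const int cus               = 44;  // CUs on an R9 290X

    const int threads_per_cu = wavefronts_per_cu * wavefront_size; // 2,560
    const int total_threads  = cus * threads_per_cu;               // 112,640
    printf("per CU: %d threads, whole GPU: %d threads\n",
           threads_per_cu, total_threads);
    return 0;
}
```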
If you load up GCN with kernels made up of 32 threads, you're wasting resources, and if you're not pushing GCN hard, you're wasting compute potential. Slide number 4 stipulates that latency is hidden by executing overlapping wavefronts. This is why GCN appears to have a high degree of latency, yet you can execute a ton of work on it without affecting that latency. With Maxwell/2, latency rises like a staircase the more work you throw at it. I'm not sure if the folks at Beyond3D are aware of this or not.
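By contrast, here's roughly what feeding GCN properly looks like. I'm keeping CUDA syntax for consistency; on GCN the equivalent would be an OpenCL/HIP dispatch with wavefront-sized (64-wide or larger) workgroups. The point is one big saturating launch instead of hundreds of 32-thread ones:

```cuda
// Counterpoint sketch: the same idea dispatched the way GCN likes it.
#include <cuda_runtime.h>

__global__ void busy_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf((float)i) * 2.0f;
}

int main()
{
    const int n = 112640;  // enough threads to fill all 44 CUs at once
    const int block = 256; // a multiple of the 64-thread wavefront
    float *out;
    cudaMalloc((void **)&out, n * sizeof(float));

    // One fat launch: the scheduler can overlap wavefronts within each CU,
    // which is exactly how GCN hides its latency.
    busy_kernel<<<(n + block - 1) / block, block>>>(out, n);
    cudaDeviceSynchronize();

    cudaFree(out);
    return 0;
}
```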
Conclusion:
I think they geared this test towards nVIDIA's CUDA architectures and are wondering why their results don't make sense on GCN. If true... DERP! That's why I said the individual latency results don't matter. This test is only good if you're checking Async functionality.
GCN was built for parallelism, not serial workloads like nVIDIA's architectures. This is why you don't see GCN taking a hit at 512 kernels.
What did Oxide do? They built two paths: one with shaders optimized for CUDA-style architectures, the other with shaders optimized for GCN. On top of that, GCN has Async working. It's therefore not hard to see why GCN performs so well in Oxide's engine. It's the better architecture if you push it and code for it; if you're only doing light compute work, nVIDIA's architectures will be superior.
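In host-code terms, the "two paths" idea boils down to something like this. The PCI vendor IDs are real; the function and shader file names are made up for illustration and are not Oxide's actual code:

```cuda
// Hypothetical sketch: pick a shader variant by GPU vendor at startup.
#include <cstdint>
#include <string>

enum class ShaderPath { CudaStyle, GcnStyle };

// Real PCI vendor IDs; the selection logic itself is the assumption here.
constexpr uint32_t VENDOR_NVIDIA = 0x10DE;
constexpr uint32_t VENDOR_AMD    = 0x1002;

ShaderPath pick_shader_path(uint32_t vendor_id)
{
    // GCN gets the wide, async-heavy shaders; everyone else gets the
    // lighter, warp-oriented path.
    return vendor_id == VENDOR_AMD ? ShaderPath::GcnStyle
                                   : ShaderPath::CudaStyle;
}

std::string shader_file_for(ShaderPath path)
{
    return path == ShaderPath::GcnStyle ? "lighting_gcn.hlsl"
                                        : "lighting_warp32.hlsl";
}
```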
This means the burden is on developers to ensure they're optimizing for both. In the past, that hasn't been the case; going forward, I hope it will be. As for GameWorks titles, don't count on them being optimized for GCN. That's a given. Oxide played fair; others... might not.