Thursday, July 14th 2016

Futuremark Releases 3DMark Time Spy DirectX 12 Benchmark

Futuremark released the latest addition to the 3DMark benchmark suite, the new "Time Spy" benchmark and stress-test. All existing 3DMark Basic and Advanced users get limited access to "Time Spy," while existing 3DMark Advanced users have the option of unlocking its full feature-set with an upgrade key priced at US $9.99. The price of 3DMark Advanced for new users has been revised from $24.99 to $29.99, as new 3DMark Advanced purchases include the fully-unlocked "Time Spy." Futuremark also announced limited-period offers that last until July 23rd, during which the "Time Spy" upgrade key for existing 3DMark Advanced users can be had for $4.99, and the 3DMark Advanced Edition (minus "Time Spy") for $9.99.

Futuremark 3DMark "Time Spy" has been developed with input from AMD, NVIDIA, Intel, and Microsoft, and takes advantage of the new DirectX 12 API. For this reason, the test requires Windows 10. By leveraging the low-overhead features of DirectX 12, the test increases the 3D processing load dramatically over "Fire Strike," presenting a graphically intense test scene that can make any gaming/enthusiast PC of today break a sweat. It can also make use of several beyond-4K display resolutions.

DOWNLOAD: 3DMark with TimeSpy v2.1.2852

91 Comments on Futuremark Releases 3DMark Time Spy DirectX 12 Benchmark

#26
ShurikN
Hood: A middlin' result for my 2-3 year old system (i7-4790K/GTX 780 Ti)
www.3dmark.com/spy/38286
I need a GPU upgrade - seriously considering 980 Ti, prices are around $400 even for hybrid water-cooled models. And they're available right now, unlike the new cards. If I wait a few months, they'll get even lower...
If I had to choose between two NV cards, I would rather go for an air-cooled 1070 than any kind of 980 Ti.
Posted on Reply
#27
Aquinus
Resident Wat-man
efikkan: It's correct that AMD's architecture is vastly different (in terms of queues and scheduling) compared to Nvidia's. But the reason AMD may draw larger benefits from async shaders is that their scheduler is unable to saturate their huge core count. If you compare the GTX 980 Ti to the Fury X, we are talking about:
GTX 980 Ti: 2816 cores, 5632 GFlop/s
Fury X: 4096 cores, 8602 GFlop/s
(The relation is similar with other comparable products with AMD vs. Nvidia)
Nvidia is getting the same performance from far fewer resources using a far more advanced scheduler. In many cases their scheduler achieves more than 95% computational utilization, and since the primary purpose of async shaders is to utilize idle resources for different tasks, there is really very little left over for compute (which mostly utilizes the same resources as rendering). Multiple queues are not overhead-free either, so in order for them to have any purpose there has to be a significant performance gain. This is basically why AotS gave up on Nvidia hardware and just disabled the feature, and their game was fine-tuned for AMD in the first place.

It has very little to do with Direct3D 11 vs 12.


This benchmark proves Nvidia can utilize async shaders, ending the lie about lacking hardware features once and for all.

AMD is drawing larger benefits because they have more idle resources. Remember e.g. Fury X has 53% more Flop/s than 980 Ti, so there is a lot to use.

This benchmark also ends the myth that Nvidia is less fit for Direct3D 12.
Not that I disagree, but a big difference between nVidia and AMD that isn't mentioned often is the size of each CU/SM in terms of shader count, and consequently how many of them each vendor ships. nVidia's SMs each have a lot more shaders, whereas AMD tends to have more CUs with fewer shaders each. It's the same thing they did with their CPUs: they sacrificed some serial throughput to gain more parallel throughput. On top of that, nVidia's GPUs are clocked higher, so if parallel throughput isn't the rendering bottleneck, it comes down to how quick each CU/SM is with any given workload, which will favor nVidia thanks to the beefier SMs and higher clocks.
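
Since the raw GFlop/s figures above are what both sides keep leaning on, it's worth noting they fall straight out of shader count × 2 ops per clock (FMA) × clock speed. A quick back-of-envelope check, using the reference clocks, so treat it as an approximation rather than vendor math:

```cpp
#include <cstdio>

int main()
{
    // Peak FP32 = shaders * 2 (fused multiply-add) * clock in GHz
    const double gtx980ti = 2816 * 2 * 1.000;  // ~5632 GFlop/s at the 1000 MHz base clock
    const double furyX    = 4096 * 2 * 1.050;  // ~8602 GFlop/s at 1050 MHz
    std::printf("980 Ti: %.0f GFlop/s, Fury X: %.0f GFlop/s (+%.0f%%)\n",
                gtx980ti, furyX, (furyX / gtx980ti - 1.0) * 100.0);
    return 0;
}
```

(The 980 Ti's boost clock pushes its real-world number above the 5632 figure, so these are nominal reference-clock values.)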
Posted on Reply
#28
JATownes
The Lurker
"In the case of async compute, Futuremark is using it to overlap rendering passes, though they do note that 'the asynchronous compute workload per frame varies between 10-20%.' " Source: www.anandtech.com/show/10486/futuremark-releases-3dmark-time-spy-directx12-benchmark

It seems no one has noticed this. AMD cards are not shining like they did in the Vulkan Doom patch, because TimeSpy has very limited use of Async workloads. Nvidia cards show less gain than the AMD cards, and that is with very limited usage. Take the use of Async workloads up to 60-70% per frame and the AMD cards would have dramatic increases, just like in the Vulkan and AotS demos.

Correct me if I am misinterpreting the quote, but it appears to me this is why AMD cards are not showing the same dramatic increase we are seeing elsewhere using Async.

JAT
Posted on Reply
#30
Fluffmeister
JATownes: "In the case of async compute, Futuremark is using it to overlap rendering passes, though they do note that 'the asynchronous compute workload per frame varies between 10-20%.' " Source: www.anandtech.com/show/10486/futuremark-releases-3dmark-time-spy-directx12-benchmark

It seems no one has noticed this. AMD cards are not shining like they did in the Vulkan Doom patch, because TimeSpy has very limited use of Async workloads. Nvidia cards show less gain than the AMD cards, and that is with very limited usage. Take the use of Async workloads up to 60-70% per frame and the AMD cards would have dramatic increases, just like in the Vulkan and AotS demos.

Correct me if I am misinterpreting the quote, but it appears to me this is why AMD cards are not showing the same dramatic increase we are seeing elsewhere using Async.

JAT
No, I'd say it's in fact the opposite: async compute is used heavily in Time Spy. It's worth reading the nicely detailed technical guide:

s3.amazonaws.com/download-aws.futuremark.com/3DMark_Technical_Guide.pdf

It's also interesting that it uses FL 11_0 for maximum compatibility.
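
For context on what "FL 11_0" means in practice: the benchmark only asks Direct3D 12 for feature level 11_0 when creating its device, so DX11-class GPUs with a DX12 driver can still run it. A minimal sketch of that device creation (illustrative code, not Futuremark's):

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

bool CreateDevice11_0(ComPtr<ID3D12Device>& device)
{
    ComPtr<IDXGIFactory4> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
        return false;

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        // Requesting FL 11_0 instead of 12_0 maximizes hardware compatibility.
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(),
                                        D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            return true;
    }
    return false;
}
```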
Posted on Reply
#31
JATownes
The Lurker
Fluffmeister: No, I'd say it's in fact the opposite: async compute is used heavily in Time Spy. It's worth reading the nicely detailed technical guide:

s3.amazonaws.com/download-aws.futuremark.com/3DMark_Technical_Guide.pdf

It's also interesting that it uses FL 11_0 for maximum compatibility.
I have read the tech guide, but still do not understand how this is considered "heavy usage". How can a 10-20% async workload be considered "heavy use"?

Please note I am not being argumentative, and will happily concede if it is being "heavily used", but I would like someone to explain how a 10-20% workload is considered "heavy". I would assume that, like most things, even "regular" usage would be around 50%.

JAT
Posted on Reply
#32
Fluffmeister
JATownes: I have read the tech guide, but still do not understand how this is considered "heavy usage". How can a 10-20% async workload be considered "heavy use"?

Please note I am not being argumentative, and will happily concede if it is being "heavily used", but I would like someone to explain how a 10-20% workload is considered "heavy". I would assume that, like most things, even "regular" usage would be around 50%.

JAT
Well, as it shows, a large part of the scene illumination, and in turn things like the ambient occlusion, are all done asynchronously:

For example:
Before the main illumination passes, asynchronous compute shaders are used to cull lights, evaluate illumination from prebaked environment reflections, compute screen-space ambient occlusion, and calculate unshadowed surface illumination. These tasks are started right after G-buffer rendering has finished and are executed alongside shadow rendering.
And other stuff like particles:
Particles are simulated on the GPU using asynchronous compute queue. Simulation work is submitted to the asynchronous queue while G-buffer and shadow map rendering commands are submitted to the main command queue.
Asynchronous compute is therefore fundamental to how the scene is generated and in turn rendered.

The workload is then clearly very high as shown here:


So yeah, basically it's pretty fundamental to the test.
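
For anyone wanting to picture what the guide is describing, the D3D12 pattern is roughly the following: record the compute work (light culling, SSAO, particles) on a queue of type D3D12_COMMAND_LIST_TYPE_COMPUTE, record G-buffer/shadow rendering on the direct queue, and use a fence so the direct queue only waits where it actually consumes the compute results. This is an illustrative sketch with made-up names, not Time Spy's source:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a second, compute-only queue next to the usual direct (3D) queue.
ComPtr<ID3D12CommandQueue> MakeComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // async compute queue
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// One frame: both queues are fed at the same time; only the illumination
// passes (which consume the async results) wait on the fence, GPU-side.
void SubmitFrame(ID3D12CommandQueue* directQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* gbufferAndShadows,
                 ID3D12GraphicsCommandList* asyncCompute,   // culling, SSAO, particles
                 ID3D12GraphicsCommandList* illumination,
                 ID3D12Fence* fence, UINT64& fenceValue)
{
    ID3D12CommandList* gfx[] = { gbufferAndShadows };
    directQueue->ExecuteCommandLists(1, gfx);       // shadow + G-buffer rendering

    ID3D12CommandList* cmp[] = { asyncCompute };
    computeQueue->ExecuteCommandLists(1, cmp);      // runs alongside the above
    computeQueue->Signal(fence, ++fenceValue);

    directQueue->Wait(fence, fenceValue);           // GPU-side wait, CPU not blocked
    ID3D12CommandList* lit[] = { illumination };
    directQueue->ExecuteCommandLists(1, lit);
}
```

How much that overlap actually buys you then depends on how much idle hardware the scheduler can find, which is basically the argument running through this whole thread.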
Posted on Reply
#33
JATownes
The Lurker
Fluffmeister: Well, as it shows, a large part of the scene illumination, and in turn things like the ambient occlusion, are all done asynchronously:

For example:



And other stuff like particles:



Asynchronous compute is therefore fundamental to how the scene is generated and in turn rendered.

The workload is then clearly very high as shown here:


So yeah, basically it's pretty fundamental to the test.
I understand how async works, and what it is being used for. I concede it is being USED, but it appears it is being very UNDER-utilized.

Please explain how a 10-20% async workload is "heavy use". That seems like a really low workload, statistically.
Posted on Reply
#34
Fluffmeister
JATownes: I understand how async works, and what it is being used for. I concede it is being USED, but it appears it is being very UNDER-utilized.

Please explain how a 10-20% async workload is "heavy use". That seems like a really low workload, statistically.
The crossover between async and all the other tasks the GPU deals with is 10-20% per frame; the GPU has other things to deal with, you know. ;)

I get that async is the new buzzword that people cling to, but why do you think it should be 60%+? Clearly workloads can vary from app to app, but what specific compute tasks do you think this benchmark doesn't address?

Are you suggesting this test isn't stressful for modern GPUs?
Posted on Reply
#35
JATownes
The Lurker
Fluffmeister: The crossover between async and all the other tasks the GPU deals with is 10-20% per frame; the GPU has other things to deal with, you know. ;)

I get that async is the new buzzword that people cling to, but why do you think it should be 60%+? Clearly workloads can vary from app to app, but what specific compute tasks do you think this benchmark doesn't address?

Are you suggesting this test isn't stressful for modern GPUs?
I understand that the GPU is busy with other tasks as well; however, even AnandTech implies the low usage: "In the case of async compute, Futuremark is using it to overlap rendering passes, though they do note that 'the asynchronous compute workload per frame varies between 10-20%.'"
Fluffmeister: Particles are simulated on the GPU using asynchronous compute queue. Simulation work is submitted to the asynchronous queue while G-buffer and shadow map rendering commands are submitted to the main command queue.
Doesn't this state that it is not being fully utilized? G-buffer and shadow map rendering commands are able to be executed asynchronously, but are being submitted to the main command queue, and are not being done asynchronously... why?

I am not remotely suggesting that it is not stressful on modern GPUs, but are you saying that 80-90% of all the Compute Units of the GPU are being used 100% of the time during the benchmark, leaving only 10-20% for compute and copy commands over what is being used for 3D rendering commands? I do not believe that is accurate. It simply appears that async commands to the compute units are being under-utilized and limited to particular instructions.

Like I said, maybe I am misinterpreting, but I haven't seen anything showing the contrary. I'm just hoping someone with more knowledge than me can explain it to me.
Posted on Reply
#36
Fluffmeister
I get where you're coming from, but where is the evidence it's too low and relative to what exactly?

Like I said, workloads can vary drastically on an app-by-app basis; there doesn't have to be a right or wrong way. What matters is that there is another baseline to compare against. Futuremark, after all, claim most of the major parties had input into its development, and I'd put more credence in them than in some random game dev known to have pimped one brand or another in the past.
Posted on Reply
#37
JATownes
The Lurker
Fluffmeister: I get where you're coming from, but where is the evidence it's too low and relative to what exactly?

Like I said, workloads can vary drastically on an app-by-app basis; there doesn't have to be a right or wrong way. What matters is that there is another baseline to compare against. Futuremark, after all, claim most of the major parties had input into its development, and I'd put more credence in them than in some random game dev known to have pimped one brand or another in the past.
Hahaha. Truth! I'm stoked about the new bench (scored 6551, woot!); I'm just looking for clarity on how this whole async thing works. Thanks for a little education.
Posted on Reply
#38
evernessince
Fluffmeister: I get where you're coming from, but where is the evidence it's too low and relative to what exactly?

Like I said, workloads can vary drastically on an app-by-app basis; there doesn't have to be a right or wrong way. What matters is that there is another baseline to compare against. Futuremark, after all, claim most of the major parties had input into its development, and I'd put more credence in them than in some random game dev known to have pimped one brand or another in the past.
I trust Futuremark's claims about as much as the Project CARS devs. They have been involved in benchmark fixing in the past using Intel compilers.

Their new benchmark doesn't use Async compute in many scenarios where it should be universally usable in any game. My guess is the "input" they received from Nvidia was to do as little with Async as possible as Nvidia cards only support Async through drivers.

We know that proper use of Async yields a large advantage for AMD cards. Every game that has utilized it correctly has shown so.
Posted on Reply
#39
Fluffmeister
evernessince: I trust Futuremark's claims about as much as the Project CARS devs. They have been involved in benchmark fixing in the past using Intel compilers.

Their new benchmark doesn't use Async compute in many scenarios where it should be universally usable in any game. My guess is the "input" they received from Nvidia was to do as little with Async as possible as Nvidia cards only support Async through drivers.

We know that proper use of Async yields a large advantage for AMD cards. Every game that has utilized it correctly has shown so.
Cool story bro.
Posted on Reply
#40
Prima.Vera
Hood: A middlin' result for my 2-3 year old system (i7-4790K/GTX 780 Ti)
www.3dmark.com/spy/38286
I need a GPU upgrade - seriously considering 980 Ti, prices are around $400 even for hybrid water-cooled models. And they're available right now, unlike the new cards. If I wait a few months, they'll get even lower...
The demo was a slide show for me at 3440x1440, but the score was the same as yours, 3576. I don't think that's that bad...
Posted on Reply
#41
JATownes
The Lurker
Prima.Vera: The demo was a slide show for me at 3440x1440, but the score was the same as yours, 3576. I don't think that's that bad...
Demo was very choppy for me at 3440x1440 as well, but scores were good.
Posted on Reply
#42
efikkan
ShurikN: I think NV uses pre-emption through drivers, AMD uses it through hardware.
As Rejzor stated, NV is doing it with brute force. As long as they can, of course, that's fine. Anand should have shown Maxwell with async on/off for comparison.
Async shaders are a feature of CUDA, and are now also used by a Direct3D 12 benchmark, proving beyond any doubt that they're supported in hardware. Async shaders have been supported since Kepler (in a very limited form), greatly improved on Maxwell, and refined in Pascal. It's a core feature of the architectures; anyone who has read the white papers would know that.
Posted on Reply
#44
Pewzor
Assimilator: This benchmark uses async and GTX 1080, GTX 1070 and GTX Titan X outperform everything from the red camp according to Guru3D.
TweakTown and a few other sources all have the Fury X beating the Titan X.
www.tweaktown.com/articles/7785/3dmark-time-spy-dx12-benchmarking-masses/index3.html
john_: Maybe the first time we see Nvidia cards gaining something from Async. Futuremark will have to give a few explanations to the world if, a year from now, their benchmark is the only thing that shows gains on Pascal cards.
On the other hand this is good news. If that dynamic load balancing that Nvidia cooked up there works, it means that developers will have NO excuse not to use async in their games, which will mean at least 5-10% better performance in ALL future titles.
Nvidia gained sizable performance in Doom's Vulkan API benches as well, particularly for the 1070 and 1080 Pascal cards. That's even without Nvidia's software version of "async compute".

It's just that Maxwell doesn't really do async compute in any meaningful fashion (which nVidia said they could improve with a driver update, about 4 months ago).

It also shows how much better Vulkan is compared to DX12. Too bad, as I doubt 3DMark would make a Vulkan API version of Time Spy.
Posted on Reply
#45
ShurikN
efikkan: Async shaders are a feature of CUDA, and are now also used by a Direct3D 12 benchmark, proving beyond any doubt that they're supported in hardware. Async shaders have been supported since Kepler (in a very limited form), greatly improved on Maxwell, and refined in Pascal. It's a core feature of the architectures; anyone who has read the white papers would know that.
Kepler... Maxwell??
Maxwell cards gain a 0.1% performance increase with async on.
Core feature... Yeah right, don't make me laugh.
Posted on Reply
#46
efikkan
ShurikN: Kepler... Maxwell??
Maxwell cards gain a 0.1% performance increase with async on.
Core feature... Yeah right, don't make me laugh.
Since some of you still don't understand the basics, I'm saying this once again:
- The primary purpose of async shaders is to utilize different resources for different purposes simultaneously.
- Rendering and compute primarily utilize the exact same resources, so an already saturated GPU will only show minor gains.
- The fact that Radeon 200/300/RX400 series shows gains from utilizing the same resources for different tasks is proof that their GPUs are underutilized (which is confirmed by their low performance per GFlop). So it's a problem of their own making, which they have found a way for the game developers to "partially solve". It's a testament to their own inferior architecture, not to Nvidia's "lack of features".


All of this should be obvious. But when you guys can't even be bothered to get a basic understanding of the GPU architectures before you fill the forums with this trash, you have clearly proven yourselves unqualified for a technical discussion.
Posted on Reply
#47
ShurikN
Digital Foundry: Will we see async compute in the PC version via Vulkan?

Billy Khan: Yes, async compute will be extensively used on the PC Vulkan version running on AMD hardware. Vulkan allows us to finally code much more to the 'metal'. The thick driver layer is eliminated with Vulkan, which will give significant performance improvements that were not achievable on OpenGL or DX.

www.eurogamer.net/articles/digitalfoundry-2016-doom-tech-interview
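
For reference, the way an engine gets a dedicated async compute queue under Vulkan is by picking a queue family that advertises compute but not graphics (on GCN those map onto the ACEs) and creating a separate VkQueue from it. A hedged sketch of that lookup, illustrative only and not id Software's actual code:

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

// Returns the index of a compute-only queue family, or -1 if the GPU
// (or driver) doesn't expose one.
int FindAsyncComputeFamily(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i)
    {
        const VkQueueFlags flags = families[i].queueFlags;
        if ((flags & VK_QUEUE_COMPUTE_BIT) && !(flags & VK_QUEUE_GRAPHICS_BIT))
            return static_cast<int>(i);   // dedicated compute family
    }
    return -1;   // no dedicated family found
}
```

If no such family exists, the work simply goes to the graphics queue, which is effectively the "async off" path.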
Posted on Reply
#48
JATownes
The Lurker
efikkan: Since some of you still don't understand the basics, I'm saying this once again:
- The primary purpose of async shaders is to utilize different resources for different purposes simultaneously.
- Rendering and compute primarily utilize the exact same resources, so an already saturated GPU will only show minor gains.
- The fact that Radeon 200/300/RX400 series shows gains from utilizing the same resources for different tasks is proof that their GPUs are underutilized (which is confirmed by their low performance per GFlop). So it's a problem of their own making, which they have found a way for the game developers to "partially solve". It's a testament to their own inferior architecture, not to Nvidia's "lack of features".


All of this should be obvious. But when you guys can't even be bothered to get a basic understanding of the GPU architectures before you fill the forums with this trash, you have clearly proven yourselves unqualified for a technical discussion.
I agree with everything you stated, but draw a different conclusion, and here is why:

- The primary purpose of async shaders is to be able to accept varied instructions from the scheduler for different purposes simultaneously.
- Rendering and compute primarily utilize the exact same resources, so an already saturated scheduler and pipeline will only show minor gains.
- The fact that Radeon 200/300/RX400 series shows gains from utilizing the same resources for different tasks is proof that their GPU scheduler is able to send more instructions to different shaders than the competition, allowing them to work at full capacity (which is confirmed by their higher performance when using a more efficient API and efficiently coded engine). So it's a solution of their own making, which they have found a way for the game developers to fully utilize. It's a testament to their own architecture that multiple generations are getting substantial gains when the market utilizes the given resources correctly.


Now that all of the consoles will be using Compute Units with a scheduler that can make full use of the shaders, I have a feeling most games will start being written to fully utilize them, and NV's arch will have to be reworked to include a larger path for the scheduler. I explained it to my son like this: Imagine a grocery store with a line of people (instructions) waiting to check out, but there is only one cashier (scheduler)... what async does is open other lanes with more cashiers so that more lines of people can get out of the store faster to their cars (shaders). AMD's Async Compute Engine opens LOTS of lanes, while the NV scheduler opens a few to handle certain lines of people (like the express lane in this analogy).

It appears TimeSpy has limited use of Async, as only certain instructions are being routed through the async scheduler, while most are being routed through the main scheduler. A 10-20% async workload is not fully utilizing the scheduler of AMD's cards, even 4 generations back.

My 2 Cents.

JAT
Posted on Reply
#49
efikkan
JATownes: The fact that Radeon 200/300/RX400 series shows gains from utilizing the same resources for different tasks is proof that their GPU scheduler is able to send more instructions to different shaders than the competition,
No, it proves that the GPU was unable to saturate those CUs with a single task.
If parallelizing two tasks requiring the same resources yields a performance increase, then some resources had to be idling in the first place. Any alternative would be impossible.
JATownes: It's a testament to their own architecture that multiple generations are getting substantial gains when the market utilizes the given resources correctly.
When they need a bigger 8602 GFlop/s GPU to match a 5632 GFlop/s GPU, it's clearly an inefficient architecture. If AMD scaled as well as Nvidia, Fury X would outperform GTX 980 Ti by ~53% and AMD would kick Nvidia's ass.
JATownes: Now that all of the consoles will be using Compute Units with a scheduler that can make full use of the shaders, I have a feeling most games will start being written to fully utilize them, and NV's arch will have to be reworked to include a larger path for the scheduler.
Even with the help of async shaders, AMD is still not able to beat Nvidia. When facing an architecture which is ~50% more efficient, it's not going to be enough.
Posted on Reply
#50
JATownes
The Lurker
efikkan: No, it proves that the Scheduler was unable to saturate those CUs with a single task.
If parallelizing two tasks requiring the same resources yields a performance increase, then some resources had to be idling in the first place, because they were unable to get instructions from the Scheduler. Any alternative would be impossible.
The difference is in the way tasks are handed out, and the whole point is to get more instructions to idle shaders. But they are two dramatically different approaches. NVidia does best using limited async, with instructions running in a mostly serial fashion.
So that is the way nVidia approaches multiple workloads. They have very high granularity in when they are able to switch between workloads. This approach bears similarities to time-slicing, and perhaps also SMT, as in being able to switch between contexts down to the instruction-level. This should lend itself very well for low-latency type scenarios, with a mostly serial nature. Scheduling can be done just-in-time.

AMD on the other hand seems to approach it more like a ‘multi-core’ system, where you have multiple ‘asynchronous compute engines’ or ACEs (up to 8 currently), which each processes its own queues of work. This is nice for inherently parallel/concurrent workloads, but is less flexible in terms of scheduling. It’s more of a fire-and-forget approach: once you drop your workload into the queue of a given ACE, it will be executed by that ACE, regardless of what the others are doing. So scheduling seems to be more ahead-of-time (at the high level, the ACEs take care of interleaving the code at the lower level, much like how out-of-order execution works on a conventional CPU).

And until we have a decent collection of software making use of this feature, it’s very difficult to say which approach will be best suited for the real-world. And even then, the situation may arise, where there are two equally valid workloads in widespread use, where one workload favours one architecture, and the other workload favours the other, so there is not a single answer to what the best architecture will be in practice.
Source: scalibq.wordpress.com/

This is why NVidia cards shine so well today: current APIs send out instructions in a mostly serial fashion, where preemption works relatively well. However, the new APIs can be used with inherently parallel workloads, which is where AMD cards shine.

Please bear in mind I am not bashing either approach. NV cards are pure muscle, and I love it, but that also comes at a price. AMD's approach of bringing that kind of power without needing brute force is good for everyone, and is more cost-effective when utilized correctly.
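
To make the "fire-and-forget" part of the quote above concrete, here is a loose CPU-side analogy, with threads and software queues standing in for ACEs (real ACEs are hardware schedulers, so treat this strictly as a mental model): each engine owns its own queue and drains it regardless of what the other engines are doing.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Each "engine" has its own work queue and its own worker draining it.
class Engine {
public:
    Engine() : worker([this] { Run(); }) {}
    ~Engine() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();
    }
    void Submit(std::function<void()> job) {         // drop work in and move on
        { std::lock_guard<std::mutex> lk(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !jobs.empty(); });
                if (done && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();                                    // runs regardless of the other engines
        }
    }
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool done = false;
    std::thread worker;
};

int main() {
    std::vector<Engine> aces(8);                      // GCN parts expose up to 8 ACEs
    aces[0].Submit([] { /* e.g. particle simulation */ });
    aces[1].Submit([] { /* e.g. light culling */ });
    // The submitter doesn't wait; each engine drains its own queue independently.
}
```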
Posted on Reply