Wednesday, October 19th 2016
Closer to the Metal: Shader Intrinsic Functions
Shader intrinsic functions stand as a partial solution for granting developers more control over existing computational resources and how they are leveraged. This capability (much touted by AMD as a performance-enhancing feature on their GCN-based products) essentially exposes features and capabilities that exist on the hardware developers are programming for, but that they wouldn't generally be able to access. This can happen either because those features are abstracted away by a high-level API (Application Programming Interface, such as DX11), or because the API isn't functionally able to access them. To understand why high-level APIs such as DX11 don't usually offer support for a piece of hardware's full feature list, or its full processing capabilities, we must first look at the basic architecture of a given computer system.

As you can see, there are usually multiple layers a given task must go through before it is processed at the hardware level. You might be wondering why we even need so many layers in the first place, and why this wasn't enabled before. There are many technical reasons for this, but one of the strongest is simply the breadth of different hardware available for your buying and assembling pleasure. Unlike the console ecosystem, where hardware is fixed and, as a result, predictable in its performance metrics and command execution, the PC ecosystem is fractured into countless hardware combinations. You may have an AMD, CMT-enabled (Clustered Multi-Threading) FX-8350, an SMT-enabled (Simultaneous Multi-Threading) i7 6700K or anything in between, paired with a GCN RX 480 or a Pascal GTX 1070… And all that hardware has its particularities with regard to how it processes the same task, and the type of commands you need to input to get a given result.
So, DX11, DX12 and Vulkan serve as what we call an abstraction layer.

Abstraction layers essentially simplify the programmer's work: they "hide" and automate a given command's underlying processes, particular implementation and hardware-specific code paths, so that the programmer only has to worry about which commands to use - and voilà. The high-level API converts a given command (let's imagine, for simplicity's sake, "draw frame") into its equivalent, non-abstracted hardware code, and runs it with good-enough optimization on most hardware to deliver those awesome (insert your favorite game here) frames. To elaborate a little: imagine you have a command called "Stack". On a high-level API like DX11, this command will be interpreted and values for its inner workings will be assigned automatically, based on general hardware compatibility: how many levels to stack, when to stack them, and when to stop the operation. But since these aren't optimized, your hardware will take somewhat of a brute-force approach. With a low-level API, developers can instead set the exact values for the "Stack" command's inner workings, optimized for your hardware, so it never goes over budget and none of those sexy stream processors are left idle.
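The "Stack" analogy can be sketched in a few lines of code. This is purely illustrative - the function names (`stack_generic`, `stack_tuned`) and the batch-size parameter are invented for this example and don't correspond to any real graphics API:

```python
# Illustrative sketch only -- not a real graphics API. The idea: a
# high-level API picks safe, generic parameters for a command; a
# low-level path lets the developer supply hardware-tuned ones.

def stack_generic(items):
    """High-level 'Stack': conservative defaults that work everywhere."""
    batch_size = 4  # safe value chosen for broad compatibility
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def stack_tuned(items, batch_size):
    """Low-level 'Stack': the developer supplies the exact batch size
    that keeps their particular hardware fully occupied."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

work = list(range(16))
print(len(stack_generic(work)))   # 4 generic batches
print(len(stack_tuned(work, 8)))  # 2 batches sized for wider hardware
```

The generic path always works, but on hardware that could process eight items at a time it issues twice as many batches as necessary - the "brute-force" behavior described above.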
The problem with the former, high-level approach, of course, is that generalizations and simplifications aren't as efficient as running an optimized, hardware-specific code path, and may sometimes even deny access to hardware features for lack of support from the high-level API. The thing with DX12 and Vulkan's low-level capabilities is that with them, in specific scenarios, developers can mostly ignore abstraction layers (some compiler checks are still used to make sure the code is within expected parameters). This allows them to code so as to take advantage of hardware-specific features, sometimes accelerating workloads by up to 2x compared to the high-level approach. This is the basic principle of low-level APIs, and something that is enabled, at least partially, by shader intrinsic functions.
Going back to the different layers on a system, imagine, for argument's sake, that it takes 5 ms for a task to compute and go through each of the layers until it is executed by the hardware - in the example image given above, that would mean 5 × 5 ms = 25 ms. Now imagine you can effectively avoid going through all those figurative hoops, going straight from the app's hardware processing requirements to the hardware itself. You have reduced your 25 ms computation to a mere 10 ms, which frees up computation time for other tasks. This is what shader intrinsic functions really are: pieces of code that, when recognized by the low-level API, are allowed to move directly to the hardware, bypassing other, time-consuming layers.
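The back-of-the-envelope arithmetic above can be written down directly. The layer names and the 5 ms per-layer figure are the article's illustrative numbers, not measurements of any real system:

```python
# Toy model of the article's example: five layers at 5 ms each versus
# a bypassed path that only touches the application and the hardware.
# All figures are illustrative, not measured.

layers = ["application", "API", "driver", "kernel", "hardware"]
cost_per_layer_ms = 5

full_path_ms = cost_per_layer_ms * len(layers)  # 5 x 5 ms = 25 ms
bypassed_ms = cost_per_layer_ms * 2             # app -> hardware only

print(full_path_ms)                 # 25
print(bypassed_ms)                  # 10
print(full_path_ms - bypassed_ms)   # 15 ms freed for other tasks
```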
The problem with this approach must seem obvious to you by now: while abstraction layers do add overhead to any given computing task, they do so while simplifying, sometimes by orders of magnitude, the coding process. Closer-to-the-metal programming has in its greatest strength what also amounts to its greatest flaw: directly leveraging hardware resources requires specific, time-consuming programming for functions that were largely automatic before. This not only means more developer resources, but also a system that is more prone to errors: debugging five lines of code is very different from debugging fifty. One must also keep in mind that closer-to-the-metal programming, because it targets only a subset of existing hardware, ends up leaving behind users of older, unsupported hardware.

AMD's specific application of shader intrinsic functions in low-level graphics APIs such as Vulkan and DX12 stems from AMD's grasp on the console market (with their CPUs and GPUs powering all three current-generation game consoles), as well as their previous work on Mantle, which went on to become embedded in today's Vulkan library, and arguably gave Microsoft the push it needed to include low-level access in DX12. This means that programmers are already leveraging optimized, feature-specific code paths in their console game implementations, which in turn leads to AMD wanting to give them access to those same features on the PC hardware that supports them, reaping the benefits of hardware-specific optimizations for their GCN architecture. That said, this doesn't mean NVIDIA doesn't have their own shader intrinsic functions that developers can take advantage of: through their GameWorks initiative, NVIDIA allows programmers to add extensions not natively supported by DX's HLSL (High Level Shading Language), while also allowing shader intrinsic functions to be leveraged as part of their CUDA ecosystem.
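To give a concrete flavor of what such intrinsics expose: both GCN and CUDA offer cross-lane operations that let threads in the same wavefront or warp exchange values without a round trip through memory (CUDA's `__shfl_down_sync` is one real example of this family). The Python below only simulates the data movement of a shuffle-down reduction on the CPU; on actual hardware, each step of the loop would be a single cross-lane instruction executed by the whole wavefront:

```python
# CPU-side simulation of a cross-lane "shuffle down" reduction -- the
# kind of data movement that shader intrinsics expose directly on the
# GPU (e.g. CUDA's __shfl_down_sync, or GCN cross-lane swizzles).
# Assumes a power-of-two lane count, as real warps/wavefronts have.

def shuffle_down(lanes, delta):
    """Each lane reads the value held by the lane `delta` positions
    above it; lanes near the top keep their own value."""
    n = len(lanes)
    return [lanes[i + delta] if i + delta < n else lanes[i] for i in range(n)]

def wave_reduce_sum(lanes):
    """Log2(n)-step reduction: after all steps, lane 0 holds the sum."""
    values = list(lanes)
    step = len(values) // 2
    while step >= 1:
        shifted = shuffle_down(values, step)
        values = [a + b for a, b in zip(values, shifted)]
        step //= 2
    return values[0]

print(wave_reduce_sum(list(range(8))))  # 28, i.e. 0+1+...+7
```

After log2(n) steps, lane 0 holds the sum of all lanes; the upper lanes hold partial results that are simply discarded, which is the typical usage pattern for such reductions. The point of exposing this as an intrinsic is exactly the article's: the hardware can already do it in a handful of instructions, but a high-level shading language with no notion of lanes cannot express it.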
An important distinction between the two companies' approaches is that while NVIDIA requires developers to use their specific GameWorks libraries (which are proprietary, and not accessible on AMD's cards), AMD's approach is more open, being accessible in open standards such as GPUOpen and Vulkan's libraries.
Shader intrinsics are just a part of what a low-level API needs to be, and aren't particularly game-changing in and of themselves. That said, shader intrinsics will never be at their best on PC hardware, simply because of how the ecosystem is fractured across countless possible systems, up to date or not. The best part of PC gaming is also, in this case and at this point in time, its greatest drawback towards obtaining perfect performance from any given system. But shader intrinsics are indeed a step forward towards giving developers more control over the features they implement and how they are run, and stand side by side with other technologies which will, in time, steer us towards ever more performant systems.
26 Comments on Closer to the Metal: Shader Intrinsic Functions
Though yes, the term 'lord' is in the name.
It's worth pointing out the difference between the open architecture AMD is offering and the walled garden of Nvidia. One has to wonder how long that garden will stay closed as the move towards a common architecture between both companies and DX12 reveals code paths to developers and many others, as the API becomes an increasingly thin shim between the OS and hardware.
If you guys continue with these kinds of articles, it would be good imo to create a separate section for them with a direct link somewhere in the top bar between Home-Reviews-Forum. It would be a shame for all the effort that goes into such a piece to just get forgotten between news articles.
That is indeed a relevant distinction. I'll try and sprinkle it on the piece :toast:
Normally, they would ignore it and say it's useless and no one will use it. After a few years, they would be dragged kicking and screaming into compliance. I think they know it will be the API, with its universal design and the big names behind it.
Just my 50 cents...
I dont see why we cant benefit and eat the cake too. Nvidia, IMO, is holding us back now.
- APIs such as Direct3D, OpenGL, Vulkan, and vendor specific ones like Mantle and Cg, all provide a set of API calls which serves as an interface between the game and the driver.
- Shader programs are pieces of code executing on the GPU cores. Shader programs are usually written in a high-level language like HLSL or GLSL, converted to an IR for distribution, and compiled to machine code by the driver. Shader intrinsic functions are all about writing hardware-specific shader programs directly in assembly, potentially creating more optimal code. It has nothing to do with potential abstractions in the API calls on the CPU side. This illustration does not match rendering at all; this is 100% wrong.
Rendering is done by using a number of API calls to build and control what we call a pipeline. Traditionally we had what we call a "fixed pipeline", allowing the programmer to only enable and slightly adjust some hardware implemented features. Back then there were no shader programs, everything was done through a huge number of API calls, and yet it was not very flexible.
Shader programs allow the programmer to implement parts of the rendering pipeline themselves. The pipeline is still controlled by API calls, but stages of the pipeline can be customized to a large extent, allowing the developer to implement vertex manipulations, lighting effects, fog, transparency, texture blending, and post-processing effects like blur themselves.
The term "shader program" is actually quite confusing, but in modern rendering it refers to pieces of code executing on the GPU, which can do geometry, compute and more. Initially it was primarily used for creating shading effects, so the name has stuck. (Actually, the term is also used for non-GPU shading code used in various 3D modelling programs dating all the way back to the late 80s.)
Getting back to your claims, the GPU shader code executes directly inside the GPU. The "OS and applications, "kernel", etc. has nothing to do with this. The only abstraction involved is the transition from a high level shading language to assembly.
As a little side note; a customized shader might of course require adjustments in the API calls used.
Despite all the actors describing Direct3D 12 and Vulkan as "low level APIs", it's important to understand what is meant by "low level features". These APIs grant greater control over the internals of how the driver manages the queue, allocations, etc., but they do not grant greater ability to control things on the GPU side, like the pipeline flow, GPU threads/internal scheduling, etc. So in terms of GPU features exposed to shader programs, the new APIs currently bring nothing new. I'm hoping the next iteration of APIs will do this: move more flexibility to the shaders. We all know there can at times be great benefits from optimizing the pipeline and/or shaders for specific hardware, but usually it's a matter of resources. Most game developers don't even prioritize writing a decent pipeline in the first place, so doing these tweaks should not be the primary concern.
Talking of abstractions, most games have a much larger cause of overhead: the engine itself. Let's take a much-hyped game like AofS, using something like 100,000 API calls to render a pretty basic scene. Any graphics programmer would know they could use well-known techniques like instancing and batching to improve the performance by a factor of 10. Rendering with a high number of API calls is certainly the most inefficient way to utilize a GPU; customizing the pipeline or the shaders with such major "defects" in place is basically "putting lipstick on a pig".

As mentioned, that description has nothing to do with how rendering works; tasks do not propagate through the levels as you described. Regardless of which API a game uses, the API is the interface towards the driver, which in turn sends native commands to the GPU. Each command doesn't propagate through the levels causing the program to wait for the result. Ever since conception, both Direct3D and OpenGL have been designed as async APIs*. The game builds what we call a queue (which builds up the pipeline) and dispatches it to the driver; the game then continues to build the queue for the next frame while the driver is feeding the GPU.
*) Not to be confused with the unrelated feature "async compute".
So your calculation of 5 × 5 ms = 25 ms has no relation to reality. And even if a game used 25 ms for compute, the whole frame would probably take more than 100 ms, resulting in less than 10 FPS, so this overhead clearly does not exist as you described. That's quite a few mistakes in a single sentence.
1) Hardware specific shaders and API features have existed for many years, that's not new.
2) Direct3D 12 was in the works since 2010/2011, Mantle originated from early Direct3D 12 work, not the other way around.
3) Vulkan is built on SPIR-V. It got some inspiration from Mantle in terms of the front-end, but the underlying architecture is derived from SPIR.

That's not true at all. Both vendors have open and proprietary parts. GameWorks is mostly open, while some of it requires an NDA. Almost all of it runs on AMD GPUs; claiming otherwise is untrue. They also provide the most extensive collection of examples and best practices for modern graphics development.
And GPUOpen is no "open standard" at all; it's mostly a collection of renamed tools and libraries which have existed for years. And do I need to remind you which vendor was the last to provide Vulkan support? And which still, to this date, fails to provide stable OpenGL support.
No vendor is even close to perfect, but there is no doubt that no one has done more to promote open standards than Nvidia. So please show some professionalism, and stop painting the picture as one being the champion of openness while the other being the evil proprietary one.
That's what I've always wanted on my PC..........the "better" lower FPS.
I really hope AMD drops OpenGL support. OpenGL belongs to the past and needs to die.
If you read the article above, you'd realize you can't really have an issue with something that's "low level," only implementations within it.
I agree about open standards though. Except for the part about dropping OpenGL support. I like my old games to run, yo. That and almost ALL linux ports rely on it. It's hardly obsolete at this point, just being phased out. You're lining the accused up for the firing squad before you've even read them their rights.
That said, you should keep in mind that this isn't supposed to be either a deep dive or a white paper. It is simply trying to explain in some more detail what exactly is meant by these shader intrinsic functions. So, you should look at this piece as an abstraction layer unto itself, not as a be-all-end-all exploration. Some inaccuracies are inevitable.

You are right, of course, and I did mention HLSL, though I'm not familiar with previous implementations of the subject. And while this is, obviously, a PR spin, I think AMD deserves to do it, based on the fact that it is now much more relevant than it was before, simply because of the current architectural proximity between consoles and PCs. You just have to look at the XBOX 360's architecture and compare it to the XBOX One's or PS4's to see that today, GPUs in consoles are much closer to their PC counterparts than ever before. That is why I agree that this subject has more relevance now, and why I accept that AMD spins it that way.

Like I said above, thank you for this. If anyone wants to, they can read this and better understand what the underlying systems are. It isn't 100% wrong, since it isn't meant to match rendering. This is simply so that readers can understand what is meant by layers, and how a "given computer system" operates. I never claimed it to be graphics-related. It just serves to show that there are usually underlying processes between the OS and the hardware executing code.

I am fully aware that 25 ms is impossibly huge, since for VR, for example, a single frame must be rendered in around 11.1 ms to achieve the 90 FPS threshold. Like I said, "imagine, for argument's sake". It's an abstraction. Thank you again for the rest of your write-up, as it goes into more detail than I wanted to in this piece, but is still very much relevant to the subject at hand.

I won't even dignify that with an answer.
Just re-read what I wrote and you'll see how that was completely blown out of proportion and uncalled for.
github.com/GPUOpen-Effects/TressFX/releases/tag/v3.1.1
Holy shit, source code!!!! Free to modify.
VS
docs.nvidia.com/gameworks/content/artisttools/hairworks/HairWorks_sdkSamples.html
path = "NvHairWorksDx11.win64.D.dll"
Yayyy, they give us .dlls... because having the .dll is the same as having the source code, right? Like how MS releases their source code with every OS, and every program doesn't include .dlls? Right?
Cause when you download and agree to
""NVIDIA GameWorks SDK" means the set of instructions for computers, in executable form only and in any media (which may include diskette, CD-ROM, downloadable internet, hardware, or firmware) comprising NVIDIA's proprietary Software Development Kit and related media and printed materials, including reference guides, documentation, and other manuals, installation routines and support files, libraries, sample art files and assets, tools, support utilities and any subsequent updates or adaptations provided by NVIDIA, whether with this installation or as separately downloaded (unless containing their own separate license terms and conditions)."
"
In addition, you may not and shall not permit others to:
I. modify, reproduce, de-compile, reverse engineer or translate the NVIDIA GameWorks SDK; or
II. distribute or transfer the NVIDIA GameWorks SDK other than as part of the NVIDIA GameWorks Application."
"
3. Redistribution; NVIDIA GameWorks Applications. Any redistribution of the NVIDIA GameWorks SDK (in accordance with Section 2 above) or portions thereof must be subject to an end user license agreement including language that
a) prohibits the end user from modifying, reproducing, de-compiling, reverse engineering or translating the NVIDIA GameWorks SDK;
b) prohibits the end user from distributing or transferring the NVIDIA GameWorks SDK other than as part of the NVIDIA GameWorks Application;"
So, take a GTX 580, for example: it supports DX11.0, so I'd expect the GF110 GPU in it to support this and not have any features "left over" - nor, conversely, to fall short of fully supporting all of DX11.0's features.
I get that the hardware could support some unofficial and undocumented features not in the API allowing for "trick shot" special effects, but these shouldn't be significant.
Great article raevenlord. :)