At the heart of the GeForce RTX 4090 is the gigantic AD102 silicon, which we broadly detailed in an older article. Built on the 4 nm silicon fabrication process, this chip measures 608 mm² in die-area, and crams in 76.3 billion transistors. We now have our first look into the silicon-level block diagram of the AD102, including the introduction of several new components.
The AD102 features a PCI-Express 4.0 x16 host interface, and a 384-bit GDDR6X memory interface. The Gigathread Engine acts as the main resource allocation component of the silicon. Ada introduces the Optical Flow Accelerator, a component crucial for DLSS 3 to generate entire frames without involving the graphics rendering machinery. The chip features twice as many media-encoding hardware engines as "Ampere," including hardware-accelerated AV1 encode/decode. Multiple encoders mean that multiple video streams can be transcoded at once (helpful in a media production environment), or that a single stream can be transcoded at twice the frame-rate, with the encoders taking turns at encoding individual frames.
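The "taking turns" idea for a single stream can be sketched as a simple round-robin assignment of frames to encoders. This is only an illustration of the description above; the function names are hypothetical, and NVIDIA's actual scheduling inside the encode hardware and driver is not described in the source.

```python
def assign_frames(num_frames, num_encoders=2):
    """Map each frame index to an encoder index in round-robin order."""
    return [frame % num_encoders for frame in range(num_frames)]


def frames_per_encoder(num_frames, num_encoders=2):
    """Count how many frames each encoder ends up handling."""
    assignment = assign_frames(num_frames, num_encoders)
    return [assignment.count(encoder) for encoder in range(num_encoders)]


if __name__ == "__main__":
    # Six frames split across two encoders: each encoder handles every other frame.
    print(assign_frames(6))
    print(frames_per_encoder(6))
```

With two encoders, each one handles only half the frames of the stream, which is where the claimed doubling of the transcode rate would come from.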
The main graphics rendering components of the AD102 are the GPCs (graphics processing clusters). There are 12 of these, compared to 7 on the previous-generation GA102. Each GPC shares a raster engine and render backends among six TPCs (texture processing clusters). Each TPC packs two SMs (streaming multiprocessors), the indivisible number-crunching machinery of the NVIDIA GPU. The SM is where NVIDIA concentrates most of its architectural innovation. Each SM packs a 3rd generation RT core, a 128 KB L1 cache, and four TMUs, shared among four clusters that each pack 16 FP32 CUDA cores, 16 concurrent FP32+INT32 CUDA cores, 4 load/store units, a tiny L0 cache with warp-scheduler and thread-dispatch, a register file, and the all-important 4th generation Tensor core.
Each SM hence packs a total of 128 CUDA cores, 4 Tensor cores, and an RT core. There are 12 SMs per GPC, which works out to 1,536 CUDA cores, 48 Tensor cores, and 12 RT cores per GPC. Twelve GPCs hence add up to 18,432 CUDA cores, 576 Tensor cores, and 144 RT cores. Each GPC contributes 16 ROPs, so there are a mammoth 192 ROPs on the silicon. An L2 cache serves as a town-square for the various GPCs, memory controllers, and the PCIe host interface to exchange data. NVIDIA didn't mention the size of this L2 cache, but it is said to be significantly larger than that of the previous generation, and plays a major role in lubricating the memory sub-system enough that NVIDIA can retain the same 21 Gbps data-rate over a 384-bit bus as the previous generation.
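The unit counts above follow directly from the hierarchy, and can be checked with a few lines of arithmetic. The per-level figures are taken from the text; the function names are made up for this sketch, and the bandwidth figure is simply the quoted data-rate multiplied by the bus width, converted from bits to bytes.

```python
GPCS = 12
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
CUDA_PER_SM = 4 * (16 + 16)  # four clusters, each with 16 FP32 + 16 FP32/INT32 cores
TENSOR_PER_SM = 4
RT_PER_SM = 1
ROPS_PER_GPC = 16


def ad102_totals():
    """Multiply the per-level counts out to whole-chip totals."""
    sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC
    return {
        "sms": sms,
        "cuda_cores": sms * CUDA_PER_SM,
        "tensor_cores": sms * TENSOR_PER_SM,
        "rt_cores": sms * RT_PER_SM,
        "rops": GPCS * ROPS_PER_GPC,
    }


def memory_bandwidth_gb_s(data_rate_gbps=21, bus_width_bits=384):
    """Peak memory bandwidth in GB/s: per-pin data-rate times bus width, bits to bytes."""
    return data_rate_gbps * bus_width_bits / 8


if __name__ == "__main__":
    print(ad102_totals())
    print(memory_bandwidth_gb_s())
```

These are totals for the full AD102 silicon as described in the block diagram; 21 Gbps over 384 bits works out to 1,008 GB/s of peak memory bandwidth.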
NVIDIA is introducing shader execution reordering (SER), a new technology that reorganizes math workloads so that similar work is grouped together, and is hence more efficiently processed by the SIMD components. This is expected to have a particularly big impact on rendering games with ray tracing. A GPU works best when the same operation can be executed on multiple targets. For example, when rendering a triangle, each pixel runs the same shader in parallel. With ray tracing, each ray can execute a completely different piece of code, because it travels in a slightly different direction. With SER, the GPU will "sort" the operations, to create chunks of identical tasks and execute them in parallel. In Cyberpunk 2077 with its new Overdrive graphics preset that significantly dials up RT calculations per pixel, SER improves performance by up to 44 percent. NVIDIA is developing Portal RTX, a mod for the original game with RTX effects added. Here, SER improves performance by 29 percent. It is also said to improve performance by 20 percent in the Racer RTX interactive tech-demo we'll see this November. NVIDIA commented that there are various SER approaches and the best choice varies by game, so it exposed the shader reordering functionality to game developers as an API, giving them control over how the sorting algorithm works, to best optimize performance.
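The effect of the "sort" step can be illustrated with a toy model. The sketch below assumes a hypothetical warp size of 32 and a made-up divergence metric (the number of distinct shaders each warp has to run); it is not NVIDIA's algorithm or API, only a picture of why grouping identical tasks helps SIMD hardware.

```python
WARP_SIZE = 32  # assumed number of threads that execute in lockstep


def divergence(shader_ids, warp_size=WARP_SIZE):
    """Sum, over all warps, of the number of distinct shaders in that warp.

    A warp that needs k different shaders has to run them one after another,
    so a lower total means better SIMD utilization.
    """
    total = 0
    for start in range(0, len(shader_ids), warp_size):
        total += len(set(shader_ids[start:start + warp_size]))
    return total


def reorder(rays):
    """Group rays so that rays needing the same hit shader sit next to each other."""
    return sorted(rays, key=lambda ray: ray[1])


if __name__ == "__main__":
    # Hypothetical workload: 128 rays as (ray_id, shader_id), cycling through 4 shaders.
    rays = [(ray_id, ray_id % 4) for ray_id in range(128)]
    before = divergence([shader for _, shader in rays])
    after = divergence([shader for _, shader in reorder(rays)])
    print(before, after)
```

In this toy workload every warp initially needs all four shaders; after reordering, each warp runs exactly one. Real games are messier, which is presumably why the sort criteria are left to developers.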
The displaced micro-mesh (DMM) engine is a revolutionary feature introduced with the new 3rd generation RT core, which accelerates displaced micro-meshes. Just as mesh shaders and tessellation have had a profound impact on improving performance with complex raster geometry, allowing game developers to significantly increase geometric complexity, DMMs are a method to reduce the complexity of the bounding-volume hierarchy (BVH) data-structure, which is used to determine where a ray hits geometry. Previously the BVH had to capture even the smallest details, to properly determine the intersection point.
The BVH now needn't have data for every single triangle on an object, but can represent objects with complex geometry as a coarse mesh of base triangles, which greatly simplifies the BVH data structure. A simpler BVH means less memory consumed and helps to greatly reduce ray tracing CPU load, because the CPU only has to generate a smaller structure. With older "Ampere" and "Turing" RT cores, each triangle on an object had to be sampled at high overhead, so the RT core could precisely calculate ray intersection for each triangle. With Ada, the simpler BVH plus the displacement maps can be sent to the RT core, which is now able to figure out the exact hit point on its own. NVIDIA has seen 11:1 to 28:1 compression in total triangle counts. This speeds up BVH builds by 7.6x to over 15x in comparison to the older RT core, and reduces the BVH storage footprint by anywhere between 6.5 and 20 times. DMMs could reduce disk and memory bandwidth utilization, PCIe bus utilization, and CPU utilization. NVIDIA worked with Simplygon and Adobe to add DMM support to their tool chains.
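To get a feel for what the quoted 11:1 to 28:1 triangle compression means for the BVH, here is a rough sketch. It assumes a plain binary BVH with one triangle per leaf (so N leaves give 2N - 1 nodes) and a hypothetical one-million-triangle object; real BVH layouts, and the speedup and storage figures NVIDIA quotes, will not follow this simple model exactly.

```python
import math


def base_triangle_count(full_triangles, compression_ratio):
    """Base triangles the BVH has to store when detail moves into displacement maps."""
    return math.ceil(full_triangles / compression_ratio)


def binary_bvh_nodes(leaf_count):
    """Node count of a full binary tree with one primitive per leaf (assumed layout)."""
    return 2 * leaf_count - 1


if __name__ == "__main__":
    full = 1_000_000  # hypothetical detailed object
    print("no DMM:", binary_bvh_nodes(full))
    for ratio in (11, 28):
        base = base_triangle_count(full, ratio)
        print(f"{ratio}:1 ->", base, "base triangles,", binary_bvh_nodes(base), "nodes")
```

Under these assumptions the tree shrinks roughly in proportion to the triangle compression, which is the structure the CPU no longer has to build and upload in full.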
Opacity Micro Meshes (OMM) is a new feature introduced with Ada to improve ray tracing performance, particularly with objects that have alpha (transparency data). Most low-priority objects in a 3D scene, such as leaves on a tree, are essentially rectangles with textures in which the transparency (alpha) creates the shape of the leaf. RT cores have a hard time intersecting rays with such objects, because they're not really in the shape that they appear (they're really just rectangles with textures that give you the illusion of shape). Previous-generation RT cores had to have multiple interactions with the rendering stage to figure out the shape of a transparent object, because they couldn't test for alpha by themselves.
This has been solved by using OMMs. Just as DMMs simplify geometry by creating meshes of micro-triangles, OMMs create meshes of rectangular textures that align with the parts of the texture that aren't transparent, so the RT core has a better understanding of the geometry of the object, and can correctly calculate ray intersections. This significantly improves shading performance in non-RT applications, too. Practical applications of OMMs aren't just low-priority objects such as vegetation, but also smoke-sprites and localized fog. Traditionally there was a lot of overdraw for such effects, because they layered multiple textures on top of each other, all of which had to be fully processed by the shaders. Now only the non-opaque pixels get executed. OMMs provide a 30 percent speedup in graphics buffer fill-rates, and a 10 percent improvement in frame-rates.
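A minimal sketch of the idea, assuming a binary alpha mask and square tiles for simplicity (the tile shape, state names, and function are illustrative and not NVIDIA's format): each tile of the texture is pre-classified, so a ray hitting a fully opaque or fully transparent tile can be resolved without asking a shader to sample alpha.

```python
def classify_tiles(alpha, tile=2):
    """Label each tile of a 0/1 alpha mask as opaque, transparent, or unknown.

    Only "unknown" tiles (a mix of both) would still need a shader to test alpha.
    """
    rows, cols = len(alpha), len(alpha[0])
    states = []
    for r in range(0, rows, tile):
        row_states = []
        for c in range(0, cols, tile):
            values = [
                alpha[y][x]
                for y in range(r, min(r + tile, rows))
                for x in range(c, min(c + tile, cols))
            ]
            if all(v == 1 for v in values):
                row_states.append("opaque")
            elif all(v == 0 for v in values):
                row_states.append("transparent")
            else:
                row_states.append("unknown")
        states.append(row_states)
    return states


if __name__ == "__main__":
    # Hypothetical 4x4 leaf texture: 1 = leaf, 0 = see-through.
    leaf = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
    ]
    for row in classify_tiles(leaf):
        print(row)
```

In this example three of the four tiles are resolved up front, and only the tile straddling the leaf's edge still needs a per-pixel alpha test.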
DLSS 3 introduces a revolutionary new feature called AI frame-generation, which promises a doubling in frame-rate at comparable quality. While it has all the features of DLSS 2 and its AI super-resolution (scaling up a lower-resolution frame to native resolution with minimal quality loss), DLSS 3 can also generate entire frames using AI alone, without involving the graphics rendering pipeline. Every alternate frame with DLSS 3 is hence AI-generated, without being a replica of the previously rendered frame.
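The frame-rate doubling follows from the alternating pattern alone. The sketch below only models the presentation order; the "generated" entries are placeholders, the function names are hypothetical, and nothing here reflects how DLSS 3 actually synthesizes a frame.

```python
def present_sequence(rendered_ids):
    """Interleave one placeholder generated frame after every rendered frame."""
    sequence = []
    for frame_id in rendered_ids:
        sequence.append(("rendered", frame_id))
        # Stand-in for a frame produced without the graphics rendering pipeline.
        sequence.append(("generated", frame_id))
    return sequence


def displayed_fps(rendered_fps):
    """With every other displayed frame generated, the displayed rate is twice the rendered rate."""
    return rendered_fps * 2


if __name__ == "__main__":
    print(present_sequence([0, 1, 2]))
    print(displayed_fps(60))
```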
This is possible only on the Ada graphics architecture, because of a hardware component called the optical flow accelerator (OFA), which assists in predicting what the next frame could look like by creating what NVIDIA calls an optical flow-field. The OFA ensures that the DLSS 3 algorithm isn't confused by static objects in a rapidly-changing 3D scene (such as a race sim). The process relies heavily on the performance uplift introduced by the FP8 math format of the 4th generation Tensor core.
A third key ingredient of DLSS 3 is Reflex. By reducing the rendering queue to zero, Reflex plays a vital role in ensuring the frame-times with DLSS 3 are at an acceptable level, and a render-queue doesn't confuse the upscaler. A combination of OFA and 4th gen Tensor core is why the Ada architecture is required to use DLSS 3, and why it won't work on older architectures.
View at TechPowerUp Main Site