The core obstacle on the road to photorealistic rendering is that Moore's Law, the observation that transistor density (and with it computational power) doubles roughly every two years, has slowed significantly. New techniques are therefore needed to achieve the next leap in realism. Neural rendering, which has matured rapidly over the past several years, could hold the key.
Neural rendering technologies, like DLSS (Deep Learning Super Sampling), have made significant strides in improving image quality while reducing the computational load. These advancements allow for the generation of high-quality visuals with a fraction of the resources typically required.
To optimize Blackwell for neural rendering, several key design goals were outlined:
Optimization for Neural Networks: Blackwell is built to excel with neural network algorithms, reducing the memory footprint to allow more simultaneous tasks and improved performance.
Quality of Service: With modern workloads running asynchronously (e.g., physics simulations, AI tasks, and rendering), ensuring high performance across these workloads is crucial. Blackwell's architecture ensures that all processes are balanced and efficiently handled.
Energy Efficiency: As power consumption becomes an increasingly critical factor, Blackwell has been designed with energy efficiency in mind, making it suitable for everything from high-performance desktops to energy-sensitive laptops.
These goals are addressed by several new or redesigned hardware blocks:
Fifth-Generation Tensor Cores: The new Tensor Cores are optimized for neural rendering and add a high-speed 4-bit floating-point format, FP4, which significantly boosts throughput while halving memory requirements.
Fourth-Generation RT Cores: The new RT cores focus on handling mega geometry more efficiently. These cores can now process larger and more complex scenes, with better performance in both standard and advanced geometry.
AI Management Processor (AMP): This processor helps in scheduling AI tasks alongside graphics rendering, ensuring smooth, high-performance operation for complex workloads.
Improved Shader Multiprocessors (SM): The Blackwell SM has been optimized for neural shaders, offering twice the bandwidth and improved throughput for handling complex tasks, especially those involving deep learning and neural shading.
GDDR7 Memory: Blackwell adopts GDDR7, a new industry standard that delivers twice the speed of GDDR6 while halving the power consumed per bit. GDDR7 uses PAM3 signaling, which increases noise immunity and allows for higher frequencies at lower power. This translates to higher bandwidth and improved energy efficiency, addressing two key challenges in high-performance graphics.
The Blackwell SM doubles INT32 bandwidth and throughput by giving all shader cores the ability to run either INT32 or FP32, unlike Ada, which supported that on only half the cores. NVIDIA also made the Tensor Cores accessible from shaders through the new DirectX Cooperative Vectors API.
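The INT32 claim can be checked with a toy throughput model. The sketch below is an idealized lower bound, not a description of the real scheduler: it assumes 128 lanes per SM, one operation per lane per cycle, and perfect packing. The function names and workload sizes are made up for illustration.

```python
def cycles_ada_style(fp32_ops: int, int32_ops: int, lanes: int = 128) -> float:
    """Half the lanes are FP32-only; the other half can run FP32 or INT32."""
    flexible_lanes = lanes // 2
    # INT32 work is confined to the flexible half of the lanes.
    int_bound = int32_ops / flexible_lanes
    # All work combined can use every lane at best.
    total_bound = (fp32_ops + int32_ops) / lanes
    return max(int_bound, total_bound)


def cycles_blackwell_style(fp32_ops: int, int32_ops: int, lanes: int = 128) -> float:
    """Every lane can run either FP32 or INT32."""
    return (fp32_ops + int32_ops) / lanes


# Hypothetical instruction mixes: pure INT32, an even split, pure FP32.
for fp, integer in [(0, 1024), (512, 512), (1024, 0)]:
    ada = cycles_ada_style(fp, integer)
    blackwell = cycles_blackwell_style(fp, integer)
    print(f"FP32={fp:4d} INT32={integer:4d}  ada-style={ada:4.1f}  blackwell-style={blackwell:4.1f}")
```

In this model the two designs tie whenever INT32 makes up half the work or less; the 2x gain shows up only for INT32-heavy code, which is where the doubled throughput applies.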
Shader Execution Reordering (SER), which re-sorts divergent shader threads so that similar work executes together, has been improved by a factor of two. This also benefits Work Graphs.
One of the standout features of the Blackwell architecture is the integration of GDDR7 memory, which sets a new industry standard. Developed by multiple vendors, GDDR7 memory offers significant improvements over its predecessor, GDDR6. It is not only twice as fast as GDDR6, but it also consumes half the power per bit of data transferred.
The biggest change is the signaling technology. GDDR6X uses PAM4 signaling, which encodes two bits per symbol using four voltage levels. Packing four levels into the same voltage swing narrows the "data eye," the margin that separates adjacent levels and determines how much noise the signal can tolerate. The larger the data eye, the cleaner and faster the data can be transferred.
GDDR7 moves to PAM3 signaling, which uses three levels of logic instead of four. This shift results in a larger data eye, allowing for higher frequencies and better performance. Because PAM3 can be clocked higher than PAM4, it transmits more data per second despite carrying fewer bits per clock cycle.
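The trade-off between bits per symbol and symbol rate is simple arithmetic. The sketch below assumes GDDR7's encoding of 3 bits across 2 PAM3 symbols (1.5 bits per symbol) against PAM4's 2 bits per symbol. The symbol rates are hypothetical values picked only to show the crossover, not measured figures for any memory part.

```python
import math

# PAM4 maps 2 bits onto one of four levels.
PAM4_BITS_PER_SYMBOL = 2.0
# GDDR7's PAM3 encoding packs 3 bits into 2 three-level symbols,
# slightly below the theoretical log2(3) limit for three levels.
PAM3_BITS_PER_SYMBOL = 3 / 2
PAM3_THEORETICAL_LIMIT = math.log2(3)


def data_rate_gbps(symbol_rate_gbaud: float, bits_per_symbol: float) -> float:
    """Per-pin data rate in Gbit/s for a given symbol rate."""
    return symbol_rate_gbaud * bits_per_symbol


# PAM3 overtakes PAM4 once its symbol rate is more than 2 / 1.5 times higher.
break_even_ratio = PAM4_BITS_PER_SYMBOL / PAM3_BITS_PER_SYMBOL

# Hypothetical symbol rates, chosen only to illustrate the trade-off.
pam4_rate = data_rate_gbps(10.0, PAM4_BITS_PER_SYMBOL)
pam3_rate = data_rate_gbps(16.0, PAM3_BITS_PER_SYMBOL)

print(f"break-even symbol-rate ratio: {break_even_ratio:.2f}")
print(f"PAM4 at 10 GBaud: {pam4_rate:.1f} Gbit/s per pin")
print(f"PAM3 at 16 GBaud: {pam3_rate:.1f} Gbit/s per pin")
```

Put differently, PAM3 wins as soon as the wider data eye lets the link run more than about a third faster than a PAM4 link could.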
Ray tracing (RT) has seen major improvements with Blackwell's fourth-generation RT cores. These cores include a triangle cluster intersection engine designed specifically for handling mega geometry. The integration of a triangle cluster compression format and a lossless decompression engine allows for more efficient processing of complex geometry.
NVIDIA highlighted that these improvements lead to a significant performance boost, with triangle throughput doubling compared to previous generations. This advancement enables much more complex scenes to be rendered, with ray tracing calculations becoming far more efficient.
Blackwell's Tensor Cores support INT4 and FP4, so neural workloads can run in these smaller, lower-precision formats, which roughly doubles throughput and halves memory use compared with 8-bit formats. The drawback is that some precision is lost, which is probably not a big deal for real-time interactive graphics in games.
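To make the memory saving and the precision loss concrete, here is a minimal sketch of symmetric INT4 quantization with two values packed per byte. It illustrates 4-bit storage in general; it is not NVIDIA's FP4 format, which is a floating-point encoding, and the helper names and sample weights are invented for the example.

```python
def quantize_int4(values: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization to signed 4-bit integers in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 for all-zero input
    return [max(-7, min(7, round(v / scale))) for v in values], scale


def pack_int4(codes: list[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed: bytes, count: int) -> list[int]:
    """Recover signed 4-bit codes from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # undo two's complement
    return out[:count]


# Hypothetical weights standing in for a slice of a neural network layer.
weights = [0.7, -0.35, 0.1, 0.0, -0.7, 0.5, 0.25, -0.05]
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)
restored = [c * scale for c in unpack_int4(packed, len(codes))]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"8-bit storage: {len(weights)} bytes, packed 4-bit storage: {len(packed)} bytes")
print(f"largest rounding error: {max_error:.4f} (step size {scale:.4f})")
```

Eight values fit in four bytes instead of eight, and the reconstruction error stays within half a quantization step, which is the precision that is traded away.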
The integration of AI models into gaming presents new challenges in maintaining a smooth and responsive experience. Scheduling becomes critical as both game rendering and AI tasks, like large language models (LLMs) for digital avatars, compete for resources. Delays in AI responses, known as "time to first response," can break immersion, while interruptions in game frame pacing can cause stutter. To address this, the AI Management Processor (AMP) was introduced as a programmable solution. Positioned at the front of the GPU, AMP precisely manages task scheduling, ensuring that AI processes, such as dialogue generation, do not interfere with game rendering, optimizing both smoothness and responsiveness for a seamless user experience.
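The tension between frame pacing and AI response time can be sketched with a toy model. This is not how AMP works internally, which the briefing does not detail; it only compares two hypothetical policies for a single AI job, using made-up timings.

```python
import math


def interleaved(frame_budget_ms: float, render_ms: float, ai_work_ms: float) -> dict:
    """Rendering runs first in every frame; the AI job only uses leftover slack."""
    slack = frame_budget_ms - render_ms
    if slack <= 0:
        raise ValueError("rendering already uses the whole frame budget")
    frames_needed = math.ceil(ai_work_ms / slack)
    return {
        "worst_frame_ms": frame_budget_ms,  # frame pacing is never disturbed
        "ai_latency_ms": frames_needed * frame_budget_ms,  # upper bound
    }


def ai_first(frame_budget_ms: float, render_ms: float, ai_work_ms: float) -> dict:
    """The AI job runs to completion before the next frame is rendered."""
    return {
        "worst_frame_ms": max(frame_budget_ms, ai_work_ms + render_ms),  # one long frame
        "ai_latency_ms": ai_work_ms,
    }


# Hypothetical timings: 16 ms frame budget, 12 ms of rendering, 40 ms of AI work.
smooth = interleaved(16.0, 12.0, 40.0)
stutter = ai_first(16.0, 12.0, 40.0)
print("interleaved:", smooth)
print("AI first:   ", stutter)
```

With these numbers, interleaving keeps every frame at 16 ms but delays the AI response to as much as 160 ms, while running the AI job first answers in 40 ms at the cost of a 52 ms frame. A scheduler has to pick a point between those extremes.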
NVIDIA's Max-Q philosophy focuses on two main objectives: maximizing performance within a defined power budget and efficiently managing power during idle periods. By refining these technologies with each generation, NVIDIA continues to push the boundaries of efficiency.
One notable advancement is DLSS 4, a neural rendering technology that not only accelerates traditional rendering but also improves energy efficiency. By co-designing hardware specifically for DLSS 4, NVIDIA has achieved significant power savings. GDDR7, for example, provides twice the efficiency of GDDR6.
Moreover, Blackwell GPUs take power management a step further. Through enhanced frequency adjustment—over a thousand times faster than previous generations—and the implementation of deeper power states, NVIDIA achieves precise power management. This means that GPUs can enter and exit power-saving states almost instantly, reducing overall energy consumption.
NVIDIA's approach to power gating leverages a multi-tiered strategy. Instead of relying on a single deep power state, GPUs gradually enter progressively deeper states as needed. This method ensures efficient power usage without compromising performance. For example, during idle periods, GPUs can rapidly switch between clock gating and power gating states, shutting down parts of the chip to save energy while still remaining responsive when needed.
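A tiered scheme like this can be pictured as a ladder of states selected by how long the GPU has been idle. The sketch below is an assumption-laden illustration: the state names follow the gating techniques mentioned in the text, but the thresholds are invented and do not reflect NVIDIA's actual policy.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = 0
    CLOCK_GATED = 1   # clocks stopped, state retained, fastest to resume
    POWER_GATED = 2   # parts of the chip switched off
    RAIL_GATED = 3    # supply rail switched off, deepest and slowest to resume


# Hypothetical idle-time thresholds in microseconds, deepest state first.
THRESHOLDS_US = [
    (1_000_000, PowerState.RAIL_GATED),
    (10_000, PowerState.POWER_GATED),
    (100, PowerState.CLOCK_GATED),
]


def state_for_idle_time(idle_us: float) -> PowerState:
    """Pick the deepest state whose threshold the idle time has reached."""
    for threshold_us, state in THRESHOLDS_US:
        if idle_us >= threshold_us:
            return state
    return PowerState.ACTIVE


for idle in (0, 500, 50_000, 5_000_000):
    print(f"idle for {idle:>9} us -> {state_for_idle_time(idle).name}")
```

The shallow states cost little to leave, so short idle gaps can still save power; only long idle periods justify the deeper, slower-to-exit states.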
Additionally, Blackwell features a second voltage rail, allowing the core and memory systems to operate at different voltages for distinct workloads. This separation facilitates better performance within a given power budget while achieving a 15x reduction in the time it takes to rail gate the core, further optimizing battery life for gaming laptops.
An innovative aspect of NVIDIA's technology is accelerated frequency switching. By dynamically adjusting frequencies in real-time, GPUs can adapt to different workloads efficiently. For instance, when handling light tasks like physics simulations, the GPU can boost performance, whereas during heavy workloads involving multiple active cores, frequencies can be adjusted to maintain balance and save power.
This technology enables NVIDIA GPUs to achieve better performance without sacrificing energy efficiency, creating a balance between performance and power consumption.
One of the most notable upgrades is the addition of support for DisplayPort 2.1. This new feature allows users to enjoy high refresh rates on larger displays with just a single cable, significantly enhancing the overall visual experience. DisplayPort 2.1 offers increased bandwidth, enabling smoother, more responsive displays, which are essential for gaming, content creation, and other high-performance tasks.
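A quick calculation shows why the extra bandwidth matters. The sketch below compares the uncompressed data rate of a display mode with the payload capacity of two link rates from the DisplayPort specifications. It ignores blanking intervals and other protocol overhead, and it assumes the highest DisplayPort 2.1 rate (UHBR20); which rate Blackwell implements is not stated here.

```python
def active_pixel_rate_gbps(width: int, height: int, refresh_hz: int,
                           bits_per_channel: int = 10, channels: int = 3) -> float:
    """Uncompressed data rate of the visible pixels in Gbit/s (no blanking)."""
    return width * height * refresh_hz * bits_per_channel * channels / 1e9


# Raw four-lane link rate scaled by line-coding efficiency (approximate payload).
LINK_PAYLOAD_GBPS = {
    "DisplayPort 1.4 (HBR3)": 32.4 * 8 / 10,       # 8b/10b coding
    "DisplayPort 2.1 (UHBR20)": 80.0 * 128 / 132,  # 128b/132b coding
}

# Example mode: 4K at 240 Hz with 10 bits per color channel.
needed = active_pixel_rate_gbps(3840, 2160, 240)
print(f"4K 240 Hz 10-bit needs about {needed:.1f} Gbit/s")
for name, capacity in LINK_PAYLOAD_GBPS.items():
    verdict = "fits uncompressed" if needed <= capacity else "needs compression"
    print(f"{name}: {capacity:.1f} Gbit/s -> {verdict}")
```

Under these assumptions the example mode exceeds what the older link can carry without compression but fits within a single UHBR20 link.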
Another standout feature is high-speed hardware flip metering. This technology is particularly important for DLSS 4, which relies on AI to raise image quality and frame rates. High-speed hardware flip metering optimizes the pacing of frame delivery, ensuring a more efficient and smoother experience in games and applications that use DLSS 4. This helps maintain consistent performance even under demanding conditions.
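The goal of frame pacing can be illustrated with a few lines of arithmetic: given two consecutive rendered frames and some number of generated frames between them, compute evenly spaced display times. This is only a sketch of the target schedule, with hypothetical timings; the hardware mechanism that enforces it is not described here.

```python
def present_times_ms(rendered_at_ms: float, next_rendered_at_ms: float,
                     generated_frames: int) -> list[float]:
    """Evenly spaced display times for one rendered frame plus its generated frames."""
    slots = generated_frames + 1
    interval = (next_rendered_at_ms - rendered_at_ms) / slots
    return [rendered_at_ms + i * interval for i in range(slots)]


# Hypothetical case: rendered frames 40 ms apart with 3 generated frames between them.
schedule = present_times_ms(0.0, 40.0, 3)
print(schedule)
```

Here the four frames are shown 10 ms apart. If the flips drifted from this schedule, the nominal frame rate would be unchanged but motion would look uneven.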
On the encoder and decoder side, the Blackwell architecture introduces several improvements aimed at video encoding and decoding efficiency. Notably, NVIDIA has added support for AV1 Ultra High Quality, which improves the visual fidelity of video streams. The architecture also doubles the throughput of H.264 decoding, a popular video compression standard, allowing for faster and more efficient processing. Furthermore, Blackwell now supports MV-HEVC (multi-view HEVC) and 4:2:2 encode and decode, a key format for video creators, ensuring higher-quality video content while maintaining manageable file sizes.