Thursday, May 11th 2017
NVIDIA GV100 Silicon Detailed
NVIDIA at the GTC 2017 event, announced its next-generation "Volta" GPU architecture. As with its current "Pascal" architecture, "Volta" was unveiled in its biggest, most feature-rich implementation, the Tesla V100 HPC board, driven by the GV100 silicon. Given the HPC applications of NVIDIA's Tesla family of products, the GV100 has certain components that won't make it to the consumer GeForce family. Despite these, the GV100 is the pinnacle of NVIDIA's silicon engineering. According to the GPU block diagram released by the company, the GV100 has a similar component hierarchy to previous-generation NVIDIA chips, with some major changes to its basic number-crunching machinery, the streaming multiprocessor (SM).
The "Volta" streaming multiprocessor (SM) on the GV100 silicon features both FP32 and FP64 CUDA cores. Consumer-graphics implementations of "Volta" driving future GeForce products could lack the specialized FP64 cores. Each SM features 64 FP32 CUDA cores and 32 FP64 cores; the FP32 cores also handle 16-bit (FP16) operations, at twice the FP32 rate. With 80 SMs, the GV100 totals 5,120 FP32 and 2,560 FP64 CUDA cores. In addition, "Volta" introduces a component called Tensor cores, specialized machinery designed to speed up deep-learning training and neural-net building. Each SM has 8 of these, so the GV100 has 640. As with the FP64 cores, Tensor cores may not make it to consumer-graphics implementations. Given its SM count, the GV100 features 320 TMUs. NVIDIA rates the GV100 at a 1,455 MHz boost clock.
The Tesla V100 is advertised to offer up to 50% higher peak FP32 and FP64 performance than the "Pascal" based Tesla P100: its peak FP32 throughput is rated at 15 TFLOP/s, with 7.5 TFLOP/s peak FP64 throughput. The Tensor cores "effectively" run at 120 TFLOP/s at their very specialized task of training deep-learning neural nets. These components implement matrix-matrix multiplication, a key math operation in neural-net training, and NVIDIA claims they accelerate neural-net building/training by 12X.
Built on the new 12 nanometer process, the GV100 is a multi-chip module: a large, 815 mm² GPU die with a gargantuan transistor count of 21.1 billion, neighbored by four 32 Gbit (4 GB) HBM2 memory stacks that make up 16 GB of memory. These stacks interface with the GV100 over a 4096-bit wide memory bus, through a silicon interposer. At 1 GHz, this memory setup could cushion the GV100 with a memory bandwidth of 1 TB/s. HBM2 could remain exclusive to the Tesla family in NVIDIA's product stack, as it continues to be expensive for NVIDIA to implement in the consumer segment.
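The peak-throughput and bandwidth figures above follow directly from the quoted core counts, clock, and bus width. A minimal sketch of that arithmetic (assuming one fused multiply-add, i.e. 2 FLOPs, per core per cycle, and double data rate on the memory bus):

```python
# Back-of-the-envelope math behind the GV100 figures quoted above.
FP32_CORES = 5120      # 80 SMs x 64 FP32 cores
FP64_CORES = 2560      # 80 SMs x 32 FP64 cores
BOOST_GHZ = 1.455      # quoted boost clock

def peak_tflops(cores: int, ghz: float) -> float:
    # One fused multiply-add per core per cycle counts as 2 FLOPs.
    return cores * ghz * 2 / 1000

fp32_tflops = peak_tflops(FP32_CORES, BOOST_GHZ)   # ~14.9, i.e. the rated 15 TFLOP/s
fp64_tflops = peak_tflops(FP64_CORES, BOOST_GHZ)   # ~7.45, i.e. the rated 7.5 TFLOP/s

# HBM2: 4096-bit bus, double data rate, 1 GHz memory clock.
BUS_BITS = 4096
bandwidth_gbs = BUS_BITS / 8 * 2 * 1.0             # ~1024 GB/s, i.e. ~1 TB/s

print(fp32_tflops, fp64_tflops, bandwidth_gbs)
```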
Besides lacking FP64 and Tensor cores, consumer implementations of "Volta" could ship with inexpensive yet suitably fast GDDR6 memory. SK Hynix, one of the pioneering manufacturers of HBM, even demonstrated GDDR6 at GTC, so unless NVIDIA finds itself fighting for its life against AMD on performance, we expect it to stick with GDDR6 in the consumer segment.
The Tesla V100 HPC card will come in two packages: integrated boards with an NVLink interface for high-density server-farm builds, and add-in cards with a PCI-Express interface for workstations. It will be sold through specialized retail channels.
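The matrix-matrix multiplication the Tensor cores accelerate is a fused D = A x B + C operation on small tiles. A hypothetical numerical sketch (the function name and tile size here are ours for illustration; Volta's Tensor cores are widely documented to multiply half-precision inputs while accumulating in full precision):

```python
import numpy as np

def tensor_core_fma(A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Illustrative D = A @ B + C on a 4x4 tile: FP16 inputs, FP32 accumulate."""
    # Promote the half-precision operands before the product so the
    # accumulation happens at full (FP32) precision, as on the hardware.
    return A.astype(np.float32) @ B.astype(np.float32) + C

# Toy tile: all-ones inputs, zero accumulator.
A = np.ones((4, 4), dtype=np.float16)
B = np.ones((4, 4), dtype=np.float16)
C = np.zeros((4, 4), dtype=np.float32)

D = tensor_core_fma(A, B, C)   # each entry is the dot of a row of ones with a column of ones
print(D)
```

One such tile operation fuses 64 multiplies and 64 adds; doing many of them per clock, across 640 Tensor cores, is where the claimed 120 "effective" TFLOP/s comes from.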
23 Comments on NVIDIA GV100 Silicon Detailed
It's not like only gamers are consumers.
Their point makes sense. It's a bit like those concept cars shown at car shows with all kinds of nifty gadgets that never make it into actual production cars. What's the point, then?
If this card actually has features we can't ever get, then again, what's the point of making them or talking about them at all?
... with GDDR6 :)
You all know this is what it's gonna be. Volta will be the usual 30-35% perf bump at each price point within the GeForce stack. From what I could read on GV100, all the new bits are for enterprise, not GFX.
With GDDR6 at up to 16 Gbps they have more than enough headroom to cover that perf bump; they could even stretch it out to a Volta refresh, seeing as going from 10 Gbps to 16 Gbps is +60%.
@Vayra86 I'd love to see a 30-35% performance increase, but my gut feeling tells me Nvidia will try to milk it a little more. I hope I'm wrong.
:lovetpu:
Plus, consumer Pascal doesn't have disabled FP64 units. It's different silicon, built without them. See: forums.anandtech.com/threads/gp100-and-gp104-are-different-architectures.2473319/ (resources were even added to the consumer chip, where it made sense)
GTX**80 - 3584 SP
GTX**70 - 2688 SP
GTX**60 - 1792 SP
Expect more restrictions on clock speed.
EDIT:
GTX**80 Ti will be whatever leftovers NVIDIA can't sell as GV100.
GTX**50 will be something much smaller.. I guess.
says google.
Nvidia may do that for 2 reasons:
- Salvage as much of the yield as possible.
- Capping performance (in case these GPUs can clock really high?).
EDIT: I don't think they are going to leave the GTX 2070 and GTX 2080 that close in configuration (NVIDIA was so bothered by those who overclocked their GTX 970s... remember). If they are going to disable 4 SMs for the GTX 2080, then they are probably going to disable more than 25% for the GTX 2070.
And now, Jensen announces the NVIDIA DGX-1 with eight Tesla V100s. It's labeled on the slide as "the essential instrument of AI research." What used to take a week now takes a shift. It replaces 400 servers. It offers 960 tensor TFLOPS. It will ship in Q3 and cost $149,000. He notes that if you buy one now powered by Pascal, you'll get a free upgrade to Volta.
Turns out, there's also a smaller version of the DGX-1, the DGX Station. Think of it as a personal-sized one. It's liquid-cooled and whisper-quiet. "Every one of our deep learning engineers has one."
It has four Tesla V100s. It’s $69K. Order it now and we’ll deliver it in Q3. “So place your order now,” he avers. via NVIDIA
PS: Speculating on the GTX 1200 series (the probable name, rather than "2080", which would skip 10 generations): I think the new Titan Xv will have 4,096-4,584 or even up to 5,120 cores (fully activated chip). The GTX 1280 could have 3,072 to 3,584 shaders, and 2,560-3,072 should be the new GTX 1270 (partly deactivated chip), making these pretty powerful in the $400-600 range. 4K gaming will be an easy thing by then for a "normal" high-end gamer, and enthusiasts will have over 100 fps at 4K without using SLI.