Thursday, May 14th 2020

NVIDIA GA100 Scalar Processor Specs Sheet Released

NVIDIA today kicked off GTC 2020 as an online event, and the centerpiece of it all is the GA100 scalar processor GPU, which debuts the "Ampere" graphics architecture. Sifting through a mountain of content, we finally found the slide that matters the most - the specifications sheet of GA100. The GA100 is a multi-chip module that has the 7 nm GPU die at the center, and six HBM2E memory stacks flanking it on either side. The GPU die is built on the TSMC N7P 7 nm silicon fabrication process, measures 826 mm², and packs an unfathomable 54 billion transistors - and we're not even counting the transistors on the HBM2E stacks on the interposer.

The GA100 packs 6,912 FP32 CUDA cores, and 3,456 independent FP64 (double-precision) CUDA cores. It has 432 third-generation tensor cores that have FP64 capability. The three are spread across a gargantuan 108 streaming multiprocessors. The GPU has 40 GB of total memory, across a 6144-bit wide HBM2E memory interface, and 1.6 TB/s total memory bandwidth. It has two interconnects: a PCI-Express 4.0 x16 (64 GB/s), and an NVLink interconnect (600 GB/s). Compute throughput values are mind-blowing: 19.5 TFLOPs classic FP32, 9.7 TFLOPs classic FP64, and 19.5 TFLOPs FP64 on the tensor cores; 156 TFLOPs TF32 single-precision (312 TFLOPs with neural net sparsity enabled); 312 TFLOPs BFLOAT16 throughput (doubled with sparsity enabled); 312 TFLOPs FP16; 624 TOPs INT8, and 1,248 TOPs INT4. The GPU has a typical power draw of 400 W in the SXM form-factor. We also found the architecture diagram that reveals GA100 to be two almost-independent GPUs placed on a single slab of silicon. We also have our first view of the "Ampere" streaming multiprocessor with its FP32 and FP64 CUDA cores, and 3rd gen tensor cores. The GeForce version of this SM could feature 2nd gen RT cores.
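
The classic FP32 and FP64 figures can be reproduced from the core counts alone. The sketch below is a rough check rather than an official formula: it assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock, and it assumes a 1410 MHz boost clock, which the spec sheet text above does not state. The names `BOOST_CLOCK_HZ` and `peak_tflops` are made up for illustration.

```python
# Peak throughput = cores x 2 FLOPs per FMA x clock.
# Core counts come from the spec sheet; the boost clock is an ASSUMPTION.
BOOST_CLOCK_HZ = 1410e6  # assumed 1410 MHz boost clock


def peak_tflops(cores: int, clock_hz: float = BOOST_CLOCK_HZ) -> float:
    """Peak TFLOPS if every core retires one FMA (2 FLOPs) per clock."""
    return cores * 2 * clock_hz / 1e12


fp32 = peak_tflops(6912)  # FP32 CUDA cores
fp64 = peak_tflops(3456)  # FP64 CUDA cores

print(f"FP32: {fp32:.1f} TFLOPS")  # FP32: 19.5 TFLOPS
print(f"FP64: {fp64:.1f} TFLOPS")  # FP64: 9.7 TFLOPS
```

Under those assumptions the two results line up with the 19.5 and 9.7 TFLOPs quoted for the classic (non-tensor) paths.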

101 Comments on NVIDIA GA100 Scalar Processor Specs Sheet Released

Vya Domus
MuhammedAbdo said: the new FP64 tensor ops apply to any FP64 workload

OK, we're making shit up ?

Pure FP64 performance sits at 9.7 TFLOPS, Tensor FP64 at 19.5. They are not interchangeable you blithering ... genius. If you write a kernel that adds two FP64 vectors, you'll get 9.7 TFLOPS not 19.5.

What you are talking about applies to the TF32 format which doesn't require any change in code, there is no "TF64" format with reduced precision. However that doesn't apply to any workload either. THEY ARE DIFFERENT EXECUTION UNITS FOR DIFFERENT THINGS. One does scalar FMAs (FP64), one does the two dimensional equivalent of FMAs (Tensor FP64).
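
To make the scalar-versus-tensor distinction concrete, here is a minimal pure-Python sketch of the two kinds of operation. It only illustrates the shape of the work, not how the hardware executes it; `fma` and `mma` are illustrative names, and Python's `a * b + c` rounds twice whereas a hardware FMA rounds once.

```python
def fma(a: float, b: float, c: float) -> float:
    """Scalar multiply-add: three scalars in, one scalar out."""
    return a * b + c  # not truly fused in Python; shown for shape only


def mma(A, B, C):
    """Matrix multiply-accumulate D = A x B + C on nested lists."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) + C[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


# Adding two vectors: one independent scalar operation per element.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
vec_sum = [fma(xi, 1.0, yi) for xi, yi in zip(x, y)]
print(vec_sum)  # [11.0, 22.0, 33.0]

# A matrix multiply-accumulate: every output mixes a whole row and column.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
D = mma(A, B, C)
print(D)  # [[20.0, 23.0], [44.0, 51.0]]
```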

It just gets better and better with you. I am getting tired of lecturing you and making fun of you along the way. It's time to stop; the longer we're at it, the less you seem to know about the subject.
Posted on Reply
Dante Uchiha said: Maybe this will help to understand the relationship between die size, yields and GPU cost.
Dude, are you seriously trying to back up your argument with some random Reddit post ( which is not even close to being accurate to begin with ) ??? Come on now, I thought you were serious !
Dante Uchiha said: How do these huge tensor cores do not take up space and increase the die size ?
For starters, stop repeating the same misinformation that has been debunked. The SM diagram is just a general representation of the architecture and in no way represents the physical size of individual segments, this is public knowledge !

Furthermore, this means you didn't even read my post before hitting the reply button. If Tensor Cores were taking up so much space, how do you explain that the GA100 die size has increased compared to GV100 even though the Tensor Core count has significantly decreased at the same time ???
Posted on Reply
Vya Domus said: Pure FP64 performance sits at 9.7 TFLOPS, Tensor FP64 at 19.5. They are not interchangeable you blithering ... genius. If you write a kernel that adds two FP64 vectors, you'll get 9.7 TFLOPS not 19.5.
You have to read more man:
Third-generation Tensor Cores:
  • Acceleration for all data types, including FP16, BF16, TF32, FP64, INT8, INT4, and Binary.
  • FP64 Tensor Core operations deliver unprecedented double-precision processing power for HPC, running 2.5x faster than V100 FP64 DFMA operations.
To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.
The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100, reducing instruction fetches, scheduling overhead, register reads, datapath power, and shared memory read bandwidth.
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.
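
The "twice the throughput" per SM and the overall "2.5x" can be reconciled with a little arithmetic. The sketch below is a back-of-the-envelope check under stated assumptions: the A100 figures (108 SMs, 64 FP64 FMA per clock per SM) come from the quoted text, while the 1410 MHz A100 clock and the V100 figures (80 SMs, 32 FP64 FMA per clock per SM, 1530 MHz) are assumptions not given in this thread.

```python
def peak_fp64_tflops(sms: int, fma_per_clock_per_sm: int, clock_mhz: int) -> float:
    """Peak TFLOPS, counting each FMA as 2 floating-point operations."""
    return sms * fma_per_clock_per_sm * 2 * clock_mhz * 1e6 / 1e12


# A100: SM count and FMA rate from the quoted text; clock is ASSUMED.
a100 = peak_fp64_tflops(sms=108, fma_per_clock_per_sm=64, clock_mhz=1410)
# V100: all three values are ASSUMED for the comparison.
v100 = peak_fp64_tflops(sms=80, fma_per_clock_per_sm=32, clock_mhz=1530)

ratio = a100 / v100

# The same ratio split into its three contributing factors.
per_sm_factor = 64 / 32       # twice the per-SM throughput
sm_count_factor = 108 / 80    # more SMs
clock_factor = 1410 / 1530    # slightly lower clock

print(f"A100: {a100:.1f} TFLOPS, V100: {v100:.1f} TFLOPS, ratio: {ratio:.2f}x")
print(f"{per_sm_factor:.2f} x {sm_count_factor:.2f} x {clock_factor:.2f} "
      f"= {per_sm_factor * sm_count_factor * clock_factor:.2f}")
```

With these inputs, the doubling per SM combines with the higher SM count and the lower clock to give roughly 2.5x.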

I hope this was educational for you.
Posted on Reply
Vya Domus
MuhammedAbdo said: To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.
Tensor operations

Tensor operations


This is a tensor operation :

Those "FP64" units you see in the SM diagram don't do tensor operations, they just do scalar ops. Different units, for different workloads. "FP64 workload" means scalar, not tensor. "Tensor FP64" means tensors, not scalar. Even a high school senior would understand this.

Tensor FP64 != FP64. A tensor operation is a mixed precision computation, you can't call it "FP64 workload" because it's not pure 64 bit floating point all throughout the computation. You can't do arbitrary computations with those cores like you can with the scalar FP64 ones either.

Let me ask you this, if they mean the same thing why mention two separate figures, 9.7 and 19.5 ? What's the 9.7 for ? Cupcake operations ?

See, one is just "Peak FP64" and the other very clearly says "Peak FP64 Tensor Core". Meaning that big rectangle in the SM diagram. Different units for different workloads.

I am convinced you can't be educated, you are missing both the will and the capacity to understand this. I find it bewildering you can literally post something that proves you wrong and still not admit it. You are something else, I really hope they're paying you for this intellectual suicide.

I have all day by the way, I am your nemesis.
Posted on Reply
Vya Domus said: See, one is just "Peak FP64" and the other very clearly says "Peak FP64 Tensor Core". Meaning that big rectangle in the SM diagram.
It's written right in front of you: Tensor Cores now have the capabilities to accelerate FP64 code, so it's actually faster than the regular CUDA cores in that respect, you are truly hopeless if you can't understand what the whitepaper states.
To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.
The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100,
And thus we get to that result:
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.
1+1=2, get it now?
Posted on Reply
Vya Domus
MuhammedAbdo said: Tensor Cores now have the capabilities to accelerate FP64 code
They have the ability to accelerate tensor data structures not scalar code. That's why they are called tensor cores. Different units for different workloads.
MuhammedAbdo said: Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100.
Yep, those are scalar FP64 units. That's how they get that "Peak FP64" throughput, these would comprise your normal FP64 workloads. There are 32 FP64 units per SM which can do 2 floating point computations per clock cycle (FMA), so with the clock taken as 1400 MHz that would be 32*108*2*1400 = 9 676 800 MFLOPS, or about 9.7 TFLOPS.

MuhammedAbdo said: The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.
Yep, those are tensor cores which do tensor operations, not scalar FP64 FMAs. IEEE compliant just means the elements within the data structure meet that standard, the computation is not your standard FMA. Those are used for tensor workloads not pure FP64 workloads, get it now you highly intelligent individual* ?

* may be subject to sarcasm

Different units for different workloads. You can't add two FP64 vectors with Tensor Cores because you'd get a wrong answer, you can however implement tensor operation with FP64 units.
Posted on Reply
Vya Domus said: Yep, those are scalar FP64 units. That's how they get that "Peak FP64" throughput, these would comprise your normal FP64 workloads. There are 32 FP64 units per SM which can do 2 floating point computations per clock cycle (FMA), so that would be 32*108*2*1400 = 9 676 800 MFLOPS.
No, 128 FP64 operations per SM, multiplied by 108 SMs, multiplied by a 1410 MHz clock speed = 19,491,840 MFLOPS = 19.5 TFLOPS of FP64.
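
Both calculations in this exchange multiply operations per clock by a clock in MHz, so the raw products come out in MFLOPS rather than FLOPS. A small sketch of the unit handling follows; the function names are illustrative only, and the clocks are the ones the two posters plugged in (1400 and 1410 MHz).

```python
def peak_mflops(ops_per_clock_per_sm: int, sms: int, clock_mhz: int) -> int:
    """ops/clock/SM x SMs x MHz = millions of operations per second."""
    return ops_per_clock_per_sm * sms * clock_mhz


def mflops_to_tflops(mflops: float) -> float:
    """1 TFLOPS = 1,000,000 MFLOPS."""
    return mflops / 1e6


at_64_ops = peak_mflops(64, 108, 1400)    # 32 units x 2 ops, 1400 MHz
at_128_ops = peak_mflops(128, 108, 1410)  # 128 ops/clock, 1410 MHz

print(at_64_ops, "MFLOPS =", round(mflops_to_tflops(at_64_ops), 1), "TFLOPS")
# 9676800 MFLOPS = 9.7 TFLOPS
print(at_128_ops, "MFLOPS =", round(mflops_to_tflops(at_128_ops), 1), "TFLOPS")
# 19491840 MFLOPS = 19.5 TFLOPS
```

So a rate of 64 operations per clock per SM lands on the 9.7 TFLOPS figure and 128 lands on the 19.5 TFLOPS figure; the sketch says nothing about which hardware units deliver each rate.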
Vya Domus said: They have the ability to accelerate tensor data structures not scalar code. That's why they are called tensor cores. Different units for different workloads.
No, Tensor cores now accelerate IEEE-compliant FP64 computations, which means regular scalar FP64.
Vya Domus said: Yep, those are tensor cores which do tensor operations not scalar FP64 FMAs. Those are used for tensor workloads not pure FP64 workloads, get it now you highly intelligent individual* ?
You lack knowledge of the subject, you don't read the Ampere whitepaper, you can't comprehend it or do the math correctly, yet you continue with your ignorant sarcasm. I rest my case now.
Posted on Reply
Vya Domus
MuhammedAbdo said: No, Tensor cores now accelerate IEEE-compliant FP64 computations, which means regular scalar FP64.
Tensor != scalar. Otherwise, they wouldn't differentiate between the two.
MuhammedAbdo said: No, 128 FP64 operations per SM, multiplied by 108 SMs, multiplied by a 1410 MHz clock speed = 19,491,840 MFLOPS = 19.5 TFLOPS of FP64.

Here we go again :

There are 8*4=32 FP64 units in each SM, fucking count them, use your fingers like in preschool. Each does 1 FMA (2 ops) per clock, meaning 64 ops in total, so 64*108*1400 = 9 676 800 MFLOPS, i.e. about 9.7 TFLOPS.

WHEN THEY SAY 128 FP64 OPERATIONS PER CLOCK IN ONE SM THEY ARE ALSO COUNTING THE TENSOR CORES ! But those are different units that do different things, they are mixing together different things in order to fool the likes of you. They've done a good job.

Let me get this straight, are you suggesting Nvidia wrote that "Peak FP64" figure ... by mistake ?


It's like you're making up your own reality.
Posted on Reply
Vya Domus said: There are 8*4=32 FP64 units in each SM, fucking count them. Each does 1 FMA meaning 64 in total, 64*108*1400 = 9 676 800 MFLOPS.
There are two types of units that accelerate FP64 computations in A100: regular old FP64 CUDA cores and Tensor Cores.

"The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100." DFMA is Double Precision Fused Multiply-Add. So each matrix multiply-add instruction now replaces 8 FP64 fused multiply-add instructions. 1 replaces 8.
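
One way to read the "replaces eight DFMA instructions" figure, sketched under assumptions that are not stated in this thread: the FP64 matrix instruction is taken to work on an 8x8x4 tile (the m8n8k4 shape) and to be issued once per 32-thread warp. The sketch counts the multiply-adds in such a tile and divides them across the warp.

```python
# ASSUMED tile shape for the FP64 matrix multiply-add: D(8x8) = A(8x4) x B(4x8) + C(8x8)
M, N, K = 8, 8, 4
WARP_SIZE = 32  # threads that cooperatively issue one matrix instruction


def count_fmas(m: int, n: int, k: int) -> int:
    """Count the scalar multiply-adds needed for an m x n x k matrix product."""
    count = 0
    for _row in range(m):
        for _col in range(n):
            for _inner in range(k):
                count += 1  # one d[row][col] += a[row][inner] * b[inner][col]
    return count


fmas_per_instruction = count_fmas(M, N, K)
fmas_per_thread = fmas_per_instruction // WARP_SIZE

print(fmas_per_instruction)  # 256
print(fmas_per_thread)       # 8
```

Under that reading, each thread would otherwise have to issue eight scalar DFMA instructions to do the same amount of multiply-add work, which is where a "1 replaces 8" count could come from.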
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100.
Where do you think those 128 FP64 operations come from?
Vya Domus said: Here we go again :

There are 8*4=32 FP64 units in each SM, fucking count them. Each does 1 FMA meaning 64 in total, 64*108*1400 = 9 676 800 MFLOPS.
Why are you insisting on ignoring everything NVIDIA is saying in their whitepaper? Ego much?
Posted on Reply
Vya Domus
MuhammedAbdo said: Why are you insisting on ignoring everything NVIDIA is saying in their whitepaper?
You're ignoring everything that there is and choosing to adhere to your made up ideas. You can't read, you can't count, you can't differentiate between things, the list goes on. My ego consists in proving you wrong every step along the way, you started this, like you always do, not me. You don't want this, then don't tell me I'm wrong when I'm not, it's that simple.
MuhammedAbdo said: Where do you think those 128 FP64 operations come from?
Read above, they include the tensor ops. 8*4*2 (32 FP64 units) + 8*4*2 (4 Tensor cores) = 128 operations per clock cycle. Tada !

FP64 units can only do 64 ops per SM per clock cycle. That provides your FP64 throughput which is different from FP64 tensor throughput. They are not the same, you can't quantify them all in one group and Nvidia makes this very clear by providing two figures : 9.7 TFLOPS of FP64 and 19.5 TFLOPS of Tensor FP64. It seems like it's you who can't read anything from their white paper.
Posted on Reply
Vya Domus said: Read above, they include the tensor ops. FP64 units can only do 64 ops per SM per clock cycle. That provides your FP64 throughput which is different from FP64 tensor throughput. They are not the same, you can't quantify them all in one group and Nvidia makes this very clear by providing two figures : 9.7 TFLOPS of FP64 and 19.5 TFLOPS of Tensor FP64. It seems like it's you who can't read anything from their white paper.
Wrong, NVIDIA states that tensor cores now accelerate FP64 scalar code, just like they accelerate FP16 scalar code, that's why they compare the tensor FP64 output to the CUDA FP64 output of V100. And that's how they reached 128 FP64 operations per clock per SM, and how they achieved their 2.5X FP64 win.

LEARN to read the whitepaper.
Vya Domus said: My ego consists in proving you wrong every step along the way, you started this not me like you always do.
Your ego is creaking under the immense pressure of the massive amount of BS inside. You can't admit you were wrong, you are a bad loser.
Posted on Reply
Vya Domus
MuhammedAbdo said: Wrong, NVIDIA states that tensor cores now accelerate FP64 scalar code
It doesn't, this is purely in your imagination.
MuhammedAbdo said: Your ego is creaking under the immense pressure of the massive amount of BS inside. You can't admit you were wrong, you are a bad loser.
Only thing that's creaking under pressure must be your cranium. You're running out of steam buddy.
Posted on Reply
Vya Domus said: It doesn't, this is purely your imagination.
To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU
Posted on Reply
Vya Domus said: IEEE compliant just means the elements in the data structure adhere to the FP64 standard. That data structure isn't scalar.
No it means EXACTLY that. Explain the 2.5X increase in output now please.
Posted on Reply
Vya Domus
MuhammedAbdo said: No it means EXACTLY that. Explain the 2.5X increase in output now please.
It's a vector not scalar. Why do you think that one tensor core can do 8 FP64 per clock cycle ? It's a vector format (conceptually it's a matrix) not a scalar, otherwise you'd have 8 independent units and then you'd just have more FP64 units. Just how stubborn and unintelligent can you be for not understanding this ?

Tensor core != FP64 core.
MuhammedAbdo said: Explain the 2.5X increase in output now please.
They've simply increased the amount of units.
Posted on Reply
Vya Domus said: It's a vector not scalar. Why do you think that one tensor core can do 8 FP64 per clock cycle ? It's a vector format (conceptually it's a matrix) not scalar, otherwise you'd have 8 independent units and then you'd just have more FP64 units. Just how stubborn and unintelligent can you be for not understanding this ?
The end result is that tensors in Ampere now accelerate regular FP64 code, amounting to 2.5X Volta.
Vya Domus said: They've simply increased the amount of units.
Which units?
Posted on Reply
Vya Domus
MuhammedAbdo said: The end result is that tensors in Ampere now accelerate regular FP64 code, amounting to 2.5X Volta.
You can't accelerate scalar code with tensors, they are not the same data structure nor can they do the same operations. Show me the exact piece of information which explicitly states this.

Can you branch within a tensor like you can with scalar CUDA code ? No, you can't.
Posted on Reply
Vya Domus said: You can't accelerate scalar code with tensors, they are not the same data structure nor can they do the same operations. Show me the exact piece of information which explicitly states this.
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

And you didn't answer LOL, which units?
Posted on Reply
Vya Domus
MuhammedAbdo said: Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.



Do you just like, filter out the words you don't like ? Show me where it says this : "scalar code can now be accelerated by tensor operations". Stop spamming this shit; I've explained thoroughly that it has nothing to do with scalar code. Nvidia provides two performance metrics, one for scalar FP64 performance (9.7 TFLOPS) and one for FP64 tensor ops (19.5 TFLOPS) because they are different types of operations handled by different units.
MuhammedAbdo said: And you didn't answer LOL, which units?

Let's make a deal: I tell you which units when you tell me why Nvidia provides two different metrics for what you claim is the same thing, because I kept asking this as well and you didn't answer me. OK ?
Posted on Reply
Vya Domus said: Do you just like, filter out the words you don't like ? Show me where it says this : "scalar code can now be accelerated by tensor operations". Stop spamming this shit; I've explained thoroughly that it has nothing to do with scalar code. Nvidia provides two performance metrics, one for scalar FP64 performance (9.7 TFLOPS) and one for FP64 tensor ops (19.5 TFLOPS) because they are different types of operations handled by different units.
Tensor has nothing to do with it, it's just done on Tensor cores, but the computations themselves are not tensor.
peak FP64 throughput of 19.5 TFLOPS
Vya Domus said: Let's make a deal: I tell you which units when you tell me why Nvidia provides two different metrics for what you claim is the same thing, because I kept asking this as well and you didn't answer me. OK ?
Because there are indeed two types of units that can do FP64 on the A100, regular CUDA cores and Tensor Cores.

For example, V100 lacks INT8 tensor hardware; INT8 can be done in software on the CUDA cores, but A100 can do INT8 on its Tensor cores at a massive speed increase, reaching 20X the throughput of V100. Likewise with FP16 code: it can be done with CUDA cores, but the Tensor cores can do it at much faster speeds on both V100 and A100.

NVIDIA reports the peak performance for each unit, but that's irrelevant, because at the end of the day, the A100 can do 2.5X the FP64 output of V100, which was the original point that went over your head.
Posted on Reply
This would be a much better conversation without the barbs...I can only assume the user I don't see is doing the same.
Posted on Reply
Vya Domus
MuhammedAbdo said: Because there are indeed two types of units that can do FP64 on the A100, regular CUDA cores and Tensor Cores.
If they all did just scalar FP64 then they would all be called FP64 units. But they are not, so why in the world would they insist on calling some units differently if they are the same thing ? And why would they use different performance metrics for the two ?

Do you understand how nonsensical this is ?
MuhammedAbdo said: the A100 can do 2.5X the tensor FP64 output of V100
Fixed it
MuhammedAbdo said: NVIDIA reports the peak performance for each unit, but that's irrelevant
It's only irrelevant in your head. Imagine thinking that a billion dollar company trying to sell millions of dollars worth of equipment would just write a bunch of worthless crap in their specifications. :roll:

They are peak performance metrics for two separate things. You can't use Tensor cores to run scalar code, it simply doesn't work like that. The FP64 units can do branching, masking, execute complex mathematical functions, bitwise instructions, etc. Tensor cores can't do any of those things, they just do one single bloody computation : D = A * B + C. Unless you show me where this is explicitly mentioned and explained you are straight up delusional and making shit up. You don't have the slightest clue how these things even work, otherwise it would be painfully obvious to you how dumb what you're saying is.
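
As a rough illustration of the kind of per-element scalar work described above, here is a small Python sketch with masking, a data-dependent branch and a math function. It is a stand-in for a scalar GPU kernel rather than real device code, and `scalar_kernel` is a made-up name.

```python
import math


def scalar_kernel(x: float, active: bool) -> float:
    """Per-element work that a fixed multiply-accumulate has no slot for."""
    if not active:       # masking: inactive elements pass through unchanged
        return x
    if x < 0.0:          # branching: data-dependent control flow
        return 0.0
    return math.sqrt(x)  # a math function beyond multiply and add


data = [4.0, -1.0, 9.0, 16.0]
mask = [True, True, False, True]
out = [scalar_kernel(x, m) for x, m in zip(data, mask)]
print(out)  # [2.0, 0.0, 9.0, 4.0]
```

Each element takes its own path through the function, which is the contrast being drawn with a unit that always evaluates the same matrix multiply-accumulate.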

Anyway to answer your question, the throughput of tensor operations is higher because obviously the throughput of Tensor Cores is higher and there are more SMs. Tensor operations use the tensor units. It's like I am explaining this to a 5 year old.
Posted on Reply
Vya Domus said: Anyway to answer your question, the throughput of tensor operations is higher because obviously the throughput of Tensor Cores is higher and there are more SMs.
Finally you understand now! Bravo!
Vya Domus said: Fixed it
Nope, that's NVIDIA's statement, not mine, am I to understand that you are correcting the words of a billion dollar company trying to sell millions of dollars worth of equipment who would just write a bunch of worthless crap in their arch whitepaper?

In the end, NVIDIA clearly mentions 3 things:

  • Tensor cores now accelerate IEEE-compliant FP64 computations
  • Each FP64 matrix multiply-add op now replaces 8 FP64 FMA operations
  • Meaning each SM is now capable of 128 FP64 ops per clock, which achieves 2.5X the throughput of V100

So A100 has 2.5X the FP64 throughput of V100.

Case closed.
Posted on Reply
Vya Domus
MuhammedAbdo said: Tensor cores now accelerate IEEE-compliant tensor FP64 computations
Fixed it. Tensor cores do tensor operations. That's why they are called tensor cores.
MuhammedAbdo said: Meaning each SM is now capable of 128 FP64 ops per clock, which achieves 2.5X the tensor FP64 throughput of V100
Fixed it. Also, every SM can do 64 FP64 ops per clock, meaning it can do a total of 9.7 TFLOPS of FP64. Also, in your whitepaper (which isn't a whitepaper by the way, it's just a god damn blog post) they specify different performance metrics for the two because they represent different workloads. Are you correcting the words of a billion dollar company trying to sell millions of dollars worth of equipment who would just write a bunch of worthless crap in their arch whitepaper?
MuhammedAbdo said: Case closed.
Case reopened and closed.

You are so wrong, stubborn and unintelligent, you've exceeded all my expectations from past discussions with you. Anyway I thought you "rested" your case many comments ago, why are you still here ?

EgO mUcH ? Remember, if you don't want to deal with me anymore then don't tell me I'm wrong when I'm not. It's that simple, otherwise we can go on forever, I have all day as I said.
Posted on Reply