
NVIDIA GA100 Scalar Processor Specs Sheet Released

Joined
Mar 26, 2009
Messages
179 (0.03/day)
You wrote "FP64 workloads", you genius. That's pure FP64 not tensor ops, you're clueless and stubborn.
If you spent half a minute reading about AI stuff you wouldn't say that. The new FP64 tensor ops apply to any FP64 workload; the user doesn't even have to change the code. Go spend some time reading the Ampere whitepaper and deep dives, and then come back when you have a better understanding of things.

Sheesh.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
the new FP64 tensor ops apply to any FP64 workload

:roll:

OK, we're making shit up ?

Pure FP64 performance sits at 9.7 TFLOPS, Tensor FP64 at 19.5. They are not interchangeable you blithering ... genius. If you write a kernel that adds two FP64 vectors, you'll get 9.7 TFLOPS not 19.5.

What you are talking about applies to the TF32 format which doesn't require any change in code, there is no "TF64" format with reduced precision. However that doesn't apply to any workload either. THEY ARE DIFFERENT EXECUTION UNITS FOR DIFFERENT THINGS. One does scalar FMAs (FP64), one does the two dimensional equivalent of FMAs (Tensor FP64).

It just gets better and better with you. I am getting tired of lecturing you and making fun of you along the way. It's time to stop; the longer we're at it, the less you seem to know about the subject.
 
Joined
Oct 4, 2017
Messages
712 (0.27/day)
Location
France
Processor RYZEN 7 5800X3D
Motherboard Aorus B-550I Pro AX
Cooling HEATKILLER IV PRO , EKWB Vector FTW3 3080/3090 , Barrow res + Xylem DDC 4.2, SE 240 + Dabel 20b 240
Memory Viper Steel 4000 PVS416G400C6K
Video Card(s) EVGA 3080Ti FTW3
Storage XPG SX8200 Pro 512 GB NVMe + Samsung 980 1TB
Display(s) ROG Strix OLED XG27AQDMG
Case NR 200
Power Supply CORSAIR SF750
Mouse Logitech G PRO
Keyboard Meletrix Zoom 75 GT Silver
Software Windows 11 22H2
Maybe this will help to understand the relationship between die size, yields and GPU cost.

Dude, are you seriously trying to back up your argument with some random Reddit post (which is not even close to accurate to begin with)? Come on now, I thought you were serious!

How do these huge tensor cores not take up space and increase the die size?

For starters, stop repeating the same misinformation that has been debunked. The SM diagram is just a general representation of the architecture and in no way represents the physical size of individual segments. This is public knowledge!

Furthermore, this means you didn't even read my post before hitting the reply button. If Tensor Cores were taking up so much space, how do you explain that the GA100 die size has increased compared to GV100 even though the Tensor Core count has decreased significantly at the same time?
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
Pure FP64 performance sits at 9.7 TFLOPS, Tensor FP64 at 19.5. They are not interchangeable you blithering ... genius. If you write a kernel that adds two FP64 vectors, you'll get 9.7 TFLOPS not 19.5.
You have to read more, man:


Third-generation Tensor Cores:
  • Acceleration for all data types, including FP16, BF16, TF32, FP64, INT8, INT4, and Binary.
  • FP64 Tensor Core operations deliver unprecedented double-precision processing power for HPC, running 2.5x faster than V100 FP64 DFMA operations.
To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.

The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100, reducing instruction fetches, scheduling overhead, register reads, datapath power, and shared memory read bandwidth.

Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

I hope this was educational for you.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.

Tensor operations

Tensor operations

Tensor
operations

This is a tensor operation:

1589550038350.png


Those "FP64" units you see in the SM diagram don't do tensor operations, they just do scalar ops. Different units, for different workloads. "FP64 workload" means scalar, not tensor. "Tensor FP64" means tensor, not scalar. Even a high school senior would understand this.

Tensor FP64 != FP64. A tensor operation is a mixed precision computation, you can't call it "FP64 workload" because it's not pure 64 bit floating point all throughout the computation. You can't do arbitrary computations with those cores like you can with the scalar FP64 ones either.

Let me ask you this, if they mean the same thing why mention two separate figures, 9.7 and 19.5 ? What's the 9.7 for ? Cupcake operations ?

k.png


See, one is just "Peak FP64" and the other very clearly says "Peak FP64 Tensor Core". Meaning that big rectangle in the SM diagram. Different units for different workloads.

I am convinced you can't be educated, you are missing both the will and the capacity to understand this. I find it bewildering you can literally post something that proves you wrong and still not admit it. You are something else, I really hope they're paying you for this intellectual suicide.

I have all day by the way, I am your nemesis.
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
See, one is just "Peak FP64" and the other very clearly says "Peak FP64 Tensor Core". Meaning that big rectangle in the SM diagram.
It's written right in front of you: Tensor Cores can now accelerate FP64 code, so they are actually faster than the regular CUDA cores in that respect. You are truly hopeless if you can't understand what the whitepaper states.

To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.

The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100,
And thus we get to that result:

Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

1+1=2, get it now?
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
Tensor Cores now have the capabilities to accelerate FP64 code

They have the ability to accelerate tensor data structures not scalar code. That's why they are called tensor cores. Different units for different workloads.

Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100.

Yep, those are scalar FP64 units. That's how they get that "Peak FP64" throughput; these would comprise your normal FP64 workloads. There are 32 FP64 units per SM, each of which can do 2 floating-point operations per clock cycle (one FMA), so at 1400 MHz that would be 32*108*2*1400 = 9,676,800 MFLOPS, or about 9.7 TFLOPS.
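The arithmetic above can be checked with a short Python sketch. The helper name `peak_tflops` and its parameter names are made up for illustration; the inputs (32 FP64 units per SM, 2 FLOPs per FMA, 108 SMs, 1400 MHz) are the ones quoted in this post. Because the clock is given in MHz, the raw product is in MFLOPS and has to be divided by a million to get TFLOPS.

```python
def peak_tflops(units_per_sm, flops_per_unit_per_clock, sm_count, clock_mhz):
    """Theoretical peak throughput in TFLOPS (hypothetical helper)."""
    # units * FLOPs/clock * SMs * MHz gives millions of FLOPs per second.
    mflops = units_per_sm * flops_per_unit_per_clock * sm_count * clock_mhz
    return mflops / 1e6  # MFLOPS -> TFLOPS


# Numbers quoted in the post: 32 FP64 units/SM, 2 FLOPs per FMA, 108 SMs, 1400 MHz.
scalar_fp64 = peak_tflops(32, 2, 108, 1400)
print(f"{scalar_fp64:.4f} TFLOPS")  # 9.6768 TFLOPS
```

This is a theoretical peak only; it says nothing about what a real kernel would sustain.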

YOU CAN'T OBTAIN THAT 19.5 TFLOP RATING WITH THESE UNITS.

The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

Yep, those are tensor cores, which do tensor operations, not scalar FP64 FMAs. IEEE compliant just means the elements within the data structure meet those standards; the computation is not your standard FMA. Those are used for tensor workloads, not pure FP64 workloads. Get it now, you highly intelligent individual*?

* may be subject to sarcasm

Different units for different workloads. You can't add two FP64 vectors with Tensor Cores because you'd get a wrong answer; you can, however, implement tensor operations with FP64 units.
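The last point, that a tensor operation can be built out of scalar FMAs, can be sketched in plain Python. This is only an illustration under stated assumptions: `fma` and `matrix_multiply_add` are hypothetical names, Python floats stand in for FP64 values, and the stand-in `fma` rounds twice where a real hardware FMA rounds once.

```python
def fma(a, b, c):
    """Stand-in for a scalar FP64 fused multiply-add.

    Python floats are FP64, but this rounds after the multiply and again
    after the add, unlike a true hardware FMA, which rounds once.
    """
    return a * b + c


def matrix_multiply_add(A, B, C):
    """Compute D = A @ B + C using nothing but scalar multiply-adds."""
    rows, inner, cols = len(A), len(B), len(B[0])
    D = [row[:] for row in C]  # each accumulator starts at C[i][j]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                D[i][j] = fma(A[i][k], B[k][j], D[i][j])
    return D


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(matrix_multiply_add(A, B, C))  # [[20.0, 23.0], [44.0, 51.0]]
```

Each output element costs `inner` scalar FMAs here, which is the work a matrix multiply-add instruction bundles into a single operation.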
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
Yep, those are scalar FP64 units. That's how they get that "Peak FP64" throughput, these would comprise your normal FP64 workloads. There are 32 FP64 units per SM which can do 2 floating point computations per clock cycle (FMA), so that would be 32*108*2*1400 = 9,676,800 MFLOPS.
No, 128 FP64 operations per clock per SM, multiplied by 108 SMs, multiplied by a 1410 MHz clock speed = 19,491,840 MFLOPS = 19.5 TFLOPS of FP64.

They have the ability to accelerate tensor data structures not scalar code. That's why they are called tensor cores. Different units for different workloads.
No, Tensor cores now accelerate IEEE-compliant FP64 computations, which means regular scalar FP64.

Yep, those are tensor cores which do tensor operations not scalar FP64 FMAs. Those are used for tensor workloads not pure FP64 workloads, get it now you highly intelligent individual* ?.
You lack knowledge of the subject, you don't read the Ampere whitepaper, and you can't comprehend it or do the math correctly, yet you continue with your ignorant sarcasm. I rest my case.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
No, Tensor cores now accelerate IEEE-compliant FP64 computations, which means regular scalar FP64.

Tensor != scalar. Otherwise, they wouldn't differentiate between the two.

No, 128 FP64 operations per clock per SM, multiplied by 108 SMs, multiplied by a 1410 MHz clock speed = 19,491,840 MFLOPS = 19.5 TFLOPS of FP64.

1589552424244.png


Here we go again :

There are 8*4=32 FP64 units in each SM, fucking count them, use your fingers like in preschool. Each does 1 FMA per clock, meaning 64 operations in total: 64*108*1400 = 9,676,800 MFLOPS, or about 9.7 TFLOPS.

WHEN THEY SAY 128 FP64 OPERATIONS PER CLOCK IN ONE SM THEY ARE ALSO COUNTING THE TENSOR CORES! But those are different units that do different things; they are mixing together different things in order to fool the likes of you. They've done a good job.

Let me get this straight, are you suggesting Nvidia wrote that "Peak FP64" figure ... by mistake ?

:roll:

It's like you're making up your own reality.
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
There are 8*4=32 FP64 units in each SM, fucking count them. Each does 1 FMA meaning 64 in total, 64*108*1400 = 9,676,800 MFLOPS.
There are two types of units that accelerate FP64 computations in A100: regular old FP64 CUDA cores and Tensor Cores.

The new double precision matrix multiply-add instruction on A100 replaces eight DFMA instructions on V100,
DFMA is Double Precision Fused Multiply-Add. So each matrix multiply-add instruction now replaces 8 FP64 fused multiply-add instructions. 1 replaces 8.
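One way to see where a figure of eight could come from is to count FMAs per thread. This is a back-of-the-envelope sketch under assumptions that the quoted whitepaper text does not spell out: that the FP64 matrix instruction works on an 8x8x4 tile (the m8n8k4 shape listed for FP64 `mma` in NVIDIA's PTX documentation) and that one such instruction is shared across a 32-thread warp.

```python
# Assumed FP64 matrix multiply-add tile shape (m8n8k4); not stated in the thread.
M, N, K = 8, 8, 4
WARP_SIZE = 32  # threads cooperating on one warp-wide instruction

# A tile multiply-add performs one FMA per (i, j, k) index triple.
fmas_per_mma = M * N * K
fmas_per_thread = fmas_per_mma // WARP_SIZE

print(fmas_per_mma, fmas_per_thread)  # 256 8
```

Under those assumptions each thread would otherwise have to issue eight scalar DFMAs to do the same work, which lines up with the "1 replaces 8" reading above.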

Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100.
Where do you think those 128 FP64 operations come from?

Here we go again :

There are 8*4=32 FP64 units in each SM, fucking count them. Each does 1 FMA meaning 64 in total, 64*108*1400 = 9,676,800 MFLOPS.
Why are you insisting on ignoring everything NVIDIA is saying in their whitepaper? Ego much?
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
Why are you insisting on ignoring everything NVIDIA is saying in their whitepaper?

You're ignoring everything that there is and choosing to adhere to your made up ideas. You can't read, you can't count, you can't differentiate between things, the list goes on. My ego consists in proving you wrong every step along the way, you started this, like you always do, not me. You don't want this, then don't tell me I'm wrong when I'm not, it's that simple.

Where do you think those 128 FP64 operations come from?

Read above, they include the tensor ops. 8*4*2 (32 FP64 units) + 8*4*2 (4 Tensor cores) = 128 operations per clock cycle. Tada !

FP64 units can only do 64 ops per SM per clock cycle. That provides your FP64 throughput which is different from FP64 tensor throughput. They are not the same, you can't quantify them all in one group and Nvidia makes this very clear by providing two figures : 9.7 TFLOPS of FP64 and 19.5 TFLOPS of Tensor FP64. It seems like it's you who can't read anything from their white paper.
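As a cross-check on the two published figures, here is a sketch that works backwards from a chip-wide peak to the per-SM, per-clock rate it implies. The function name is hypothetical, and the 1410 MHz clock is an assumption taken from the figure used elsewhere in this thread rather than from the spec table itself.

```python
SM_COUNT = 108
BOOST_CLOCK_HZ = 1410e6  # assumed: 1410 MHz, as used elsewhere in the thread


def flops_per_sm_per_clock(peak_tflops):
    """Per-SM, per-clock FLOPs implied by a chip-wide peak figure."""
    return peak_tflops * 1e12 / (SM_COUNT * BOOST_CLOCK_HZ)


for label, tflops in [("Peak FP64", 9.7), ("Peak FP64 Tensor Core", 19.5)]:
    rate = flops_per_sm_per_clock(tflops)
    print(f"{label}: {rate:.1f} FLOPs/clock/SM")
```

With these inputs the first figure works out to roughly 64 FLOPs per clock per SM and the second to roughly 128, so each published number lines up with one of the two per-SM rates quoted in the thread.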
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
Read above, they include the tensor ops. FP64 units can only do 64 ops per SM per clock cycle. That provides your FP64 throughput which is different from FP64 tensor throughput. They are not the same, you can't quantify them all in one group and Nvidia makes this very clear by providing two figures: 9.7 TFLOPS of FP64 and 19.5 TFLOPS of Tensor FP64. It seems like it's you who can't read anything from their white paper.
Wrong. NVIDIA states that tensor cores now accelerate FP64 scalar code, just like they accelerate FP16 scalar code; that's why they compare the tensor FP64 output to the CUDA FP64 output of V100. That's how they reached 128 FP64 operations per clock per SM, and how they achieved their 2.5X FP64 win.

LEARN to read the whitepaper.
My ego consists in proving you wrong every step along the way, you started this not me like you always do.
Your ego is creaking under the immense pressure of the massive amount of BS inside. You can't admit you were wrong, you are a bad loser.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
Wrong, NVIDIA states that tensor cores now accelerate FP64 scalar code

It doesn't, this is purely in your imagination.

Your ego is creaking under the immense pressure of the massive amount of BS inside. You can't admit you were wrong, you are a bad loser.

Only thing that's creaking under pressure must be your cranium. You're running out of steam buddy.
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
It doesn't, this is purely your imagination.

To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)

Man you just don't want to give up spamming the same thing over and over.

IEEE compliant just means the elements in the data structure adhere to the FP64 standard. That data structure isn't scalar, it's a matrix (technically a vector).
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
No it means EXACTLY that. Explain the 2.5X increase in output now please.

It's a vector not scalar. Why do you think that one tensor core can do 8 FP64 per clock cycle ? It's a vector format (conceptually it's a matrix) not a scalar, otherwise you'd have 8 independent units and then you'd just have more FP64 units. Just how stubborn and unintelligent can you be for not understanding this ?

Tensor core != FP64 core.

Explain the 2.5X increase in output now please.

They've simply increased the number of units.
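For anyone who wants to see where a 2.5X ratio can come from, here is a rough sketch. The A100 inputs are the ones quoted in this thread. The V100 inputs (80 SMs, 32 FP64 units per SM, 1530 MHz boost) are assumptions taken from NVIDIA's published V100 specifications, not from this thread, and `peak_tflops_from_rate` is a made-up helper name.

```python
def peak_tflops_from_rate(flops_per_clock_per_sm, sm_count, clock_mhz):
    """Theoretical peak in TFLOPS from a per-SM, per-clock FLOP rate."""
    return flops_per_clock_per_sm * sm_count * clock_mhz / 1e6


# Assumed V100 figures (not from this thread): 80 SMs, 32 FP64 units/SM, 1530 MHz.
v100_fp64 = peak_tflops_from_rate(64, 80, 1530)
# A100 figures quoted in the thread: 108 SMs at 1410 MHz.
a100_fp64 = peak_tflops_from_rate(64, 108, 1410)          # scalar FP64 units
a100_fp64_tensor = peak_tflops_from_rate(128, 108, 1410)  # FP64 Tensor Core path

print(f"V100 FP64:        {v100_fp64:.2f} TFLOPS")
print(f"A100 FP64:        {a100_fp64:.2f} TFLOPS ({a100_fp64 / v100_fp64:.2f}x)")
print(f"A100 FP64 Tensor: {a100_fp64_tensor:.2f} TFLOPS ({a100_fp64_tensor / v100_fp64:.2f}x)")
```

On these numbers the ratio of about 2.5 appears only when A100's Tensor Core FP64 rate is set against V100's scalar FP64 rate; scalar against scalar comes out nearer 1.24x.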
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
It's a vector not scalar. Why do you think that one tensor core can do 8 FP64 per clock cycle ? It's a vector format (conceptually it's a matrix) not scalar, otherwise you'd have 8 independent units and then you'd just have more FP64 units. Just how stubborn and unintelligent can you be for not understanding this ?
The end result is that tensors in Ampere now accelerate regular FP64 code, amounting to 2.5X Volta.
They've simply increased the amount of units.
Which units?
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
The end result is that tensors in Ampere now accelerate regular FP64 code, amounting to 2.5X Volta.

You can't accelerate scalar code with tensors, they are not the same data structure nor can they do the same operations. Show me the exact piece of information which explicitly states this.

Can you branch within a tensor like you can with scalar CUDA code? No, you can't.
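To make the distinction concrete, here is a small Python sketch, not CUDA, with hypothetical function names. The first function is per-element work whose formula depends on the data; the second is the fixed multiply-accumulate pattern behind one output element of a matrix multiply-add.

```python
def scalar_kernel(x):
    """Per-element work with a data-dependent branch (hypothetical example)."""
    if x < 0.0:
        return -x * 0.5   # one formula for negative inputs
    return x * x + 1.0    # a different formula otherwise


def multiply_add(a_row, b_col, c):
    """The fixed pattern behind one matrix multiply-add output element."""
    acc = c
    for a, b in zip(a_row, b_col):
        acc = a * b + acc  # same operation every step, no branching on data
    return acc


print([scalar_kernel(x) for x in [-2.0, 0.0, 3.0]])  # [1.0, 1.0, 10.0]
print(multiply_add([1.0, 2.0], [3.0, 4.0], 0.5))     # 11.5
```

The first pattern needs general-purpose per-element control flow; the second is the same multiply-accumulate repeated over a tile.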
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
You can't accelerate scalar code with tensors, they are not the same data structure nor can they do the same operations. Show me the exact piece of information which explicitly states this.
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

And you didn't answer LOL, which units?
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
Each SM in A100 computes a total of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla V100. The A100 Tensor Core GPU with 108 SMs delivers a peak FP64 throughput of 19.5 TFLOPS, which is 2.5x that of Tesla V100.

Tensor

Tensor

Tensor


Do you just, like, filter out the words you don't like? Show me where it says this: "scalar code can now be accelerated by tensor operations". Stop spamming this shit; I've explained thoroughly that it has nothing to do with scalar code. Nvidia provides two performance metrics, one for scalar FP64 performance (9.7 TFLOPS) and one for FP64 tensor ops (19.5 TFLOPS), because they are different types of operations handled by different units.


And you didn't answer LOL, which units?

Lawl.

Let's make a deal, I tell you which units when you tell me why does Nvidia provide two different metrics for what you claim is the same thing because I kept asking this as well and you didn't answer me. OK ?
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
Do you just like, filter out the words you don't like ? Show me where it says this : "scalar code can now be accelerated by tensor operations". Stop spamming this shit which I've explained thoroughly that it has nothing to do with scalar code. Nvidia provides two performance metrics, one for scalar FP64 performance (9.7 TFLOPS) and one for FP64 tensor ops (19.5 TFLOPS) because they are different types of operations handled by different units.
Tensor has nothing to do with it, it's just done on Tensor cores, but the computations themselves are not tensor.

peak FP64 throughput of 19.5 TFLOPS

Let's make a deal, I tell you which units when you tell me why does Nvidia provide two different metrics for what you claim is the same thing because I kept asking this as well and you didn't answer me. OK ?
Because there are indeed two types of units that can do FP64 on the A100: regular CUDA cores and Tensor Cores.

For example, V100 lacks INT8 hardware, so INT8 has to be done in software on the CUDA cores, but A100 can do INT8 on Tensor Cores at a massive speed increase, 20X that of V100. Likewise with FP16 code: it can be done on CUDA cores, but the Tensor Cores in V100 and A100 can do it at much faster speeds.

NVIDIA reports the peak performance for each unit, but that's irrelevant, because at the end of the day, the A100 can do 2.5X the FP64 output of V100, which was the original point that went over your head.
 
Joined
Dec 31, 2009
Messages
19,375 (3.52/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
This would be a much better conversation without the barbs...I can only assume the user I don't see is doing the same.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
Because there is indeed two types of units that can do FP64 on the A100, regular CUDA cores and Tensor Cores.

If they all did just scalar FP64, then they would all be called FP64 units. But they are not. Why in the world would they insist on naming some units differently if they are the same thing? And why would they use different performance metrics for the two?

Do you understand how nonsensical this is ?

the A100 can do 2.5X the tensor FP64 output of V100

Fixed it

NVIDIA reports the peak performance for each unit, but that's irrelevant

It's only irrelevant in your head. Imagine thinking that a billion dollar company trying to sell millions of dollars worth of equipment would just write a bunch of worthless crap in their specifications. :roll:

They are peak performance metrics for two separate things. You can't use Tensor Cores to run scalar code; it simply doesn't work like that. The FP64 units can do branching, masking, complex mathematical functions, bitwise instructions, etc. Tensor cores can't do any of those things, they just do one single bloody computation: D = A * B + C. Unless you show me where this is explicitly mentioned and explained, you are straight up delusional and making shit up. You don't have the slightest clue how these things even work, otherwise it would be painfully obvious to you how dumb what you're saying is.

Anyway to answer your question, the throughput of tensor operations is higher because obviously the throughput of Tensor Cores is higher and there are more SMs. Tensor operations use the tensor units. It's like I am explaining this to a 5 year old.
 
Joined
Mar 26, 2009
Messages
179 (0.03/day)
Anyway to answer your question, the throughput of tensor operations is higher because obviously the throughput of Tensor Cores is higher and there are more SMs.
Finally you understand now! Bravo!
Nope, that's NVIDIA's statement, not mine. Am I to understand that you are correcting the words of a billion dollar company trying to sell millions of dollars worth of equipment, who would just write a bunch of worthless crap in their arch whitepaper?

In the end, NVIDIA clearly mentions 3 things:

1. Tensor cores now accelerate IEEE-compliant FP64 computations.
2. Each FP64 matrix multiply-add op now replaces 8 FP64 FMA operations.
3. Each SM is therefore capable of 128 FP64 ops per clock, which achieves 2.5X the throughput of V100.

So A100 has 2.5X the FP64 throughput of V100.

Case closed.
 