NVIDIA GA100 Scalar Processor Specs Sheet Released

Vya Domus · May 15, 2020

MuhammedAbdo said:
Tensor cores are now compliant with accelerating IEEE-compliant tensor FP64 computations

Fixed it. Tensor cores do tensor operations. That's why they are called tensor cores.

MuhammedAbdo said:
Meaning each SM is now capable of 128 FP64 op per clock which achieves 2.5X the tensor FP64 throughput of V100

Fixed it. Also, every SM can do 64 FP64 ops per clock, meaning it can do a total of 9.7 TFLOPS of FP64. Also, your whitepaper, which isn't a whitepaper by the way, it's just a goddamn blog post, specifies different performance metrics for the two because they represent different workloads. Are you correcting the words of a billion dollar company trying to sell millions of dollars worth of equipment, who would just write a bunch of worthless crap in their arch whitepaper?
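
As a rough sanity check on the per-SM figures being argued over, here is a minimal sketch of the peak-throughput arithmetic. The SM count (108) and boost clock (about 1.41 GHz) are assumptions taken from NVIDIA's published A100 specs, not from this thread, and `peak_tflops` is only an illustrative helper name.

```python
# Assumed A100 figures (not stated in this thread):
SM_COUNT = 108           # enabled SMs on the A100 product
BOOST_CLOCK_HZ = 1.41e9  # approximate boost clock


def peak_tflops(ops_per_sm_per_clock, sm_count=SM_COUNT, clock_hz=BOOST_CLOCK_HZ):
    """Peak throughput in TFLOPS from a per-SM, per-clock operation count."""
    return ops_per_sm_per_clock * sm_count * clock_hz / 1e12


scalar_fp64 = peak_tflops(64)   # regular FP64 units: 64 ops per SM per clock
tensor_fp64 = peak_tflops(128)  # FP64 through tensor cores: 128 ops per SM per clock

print(f"scalar FP64: {scalar_fp64:.1f} TFLOPS")
print(f"tensor FP64: {tensor_fp64:.1f} TFLOPS")
```

Under those assumptions the two per-SM numbers land on roughly 9.7 and 19.5 TFLOPS, which is why both figures show up as separate peak metrics.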

MuhammedAbdo said:
Case closed.

Case reopened and closed.

You are so wrong, stubborn and unintelligent, you've exceeded all my expectations from past discussions with you. Anyway, I thought you "rested" your case many comments ago, so why are you still here?

EgO mUcH ? Remember, if you don't want to deal with me anymore then don't tell me I'm wrong when I'm not. It's that simple, otherwise we can go on forever, I have all day as I said.

dyonoctis · May 15, 2020

At the end of the day, the only thing that matters is that tensor cores/FP64 are not going to benefit games that don't use DLSS. Ampere for gaming is probably going to be different.

MuhammedAbdo · May 15, 2020

Vya Domus said:
different performance metrics for the two

NVIDIA is directly comparing the V100 FP64 output to the A100 FP64 2.5X output, which means they are directly comparable.

Vya Domus said:
Fixed it. Tensor cores do tensor operations. That's why they are called tensor cores.

Adding imaginary stuff out of your ass isn't fixing anything, it just proves how fragile and flawed your logic is, that you resorted to adding stuff that isn't there to convince yourself you are still right! I pity you.

Dante Uchiha · May 15, 2020

RH92 said:
Dude, are you seriously trying to back up your argument with some random Reddit post (which is not even close to being accurate to begin with)??? Come on now, I thought you were serious!

For starters, stop repeating the same misinformation that has been debunked. The SM diagram is just a general representation of the architecture and in no way represents the physical size of individual segments, this is public knowledge!

Furthermore this means you didn't even read my post before hitting the reply button . If Tensor Cores were taking so much space how do you explain that GA 100 die size has increased compared to GV 100 despite Tensor Core count having significantly decreased at the same time ???

That post on reddit is not far from reality.

You're comparing apples to oranges. You can't use different architectures as a basis... Do I really need to explain the density difference between Volta at 12nm (24 MT/mm²) and Ampere at 7nm (65 MT/mm²)?
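
For anyone who wants to check the density point, a minimal sketch of the arithmetic follows. The transistor counts and die areas are assumptions taken from NVIDIA's publicly quoted figures for GV100 and GA100, not from this thread, and `density_mt_per_mm2` is a hypothetical helper name.

```python
# Assumed public figures (not from this thread):
# (transistors in millions, die area in mm^2)
chips = {
    "GV100 (12nm)": (21_100, 815),
    "GA100 (7nm)": (54_200, 826),
}


def density_mt_per_mm2(transistors_millions, area_mm2):
    """Transistor density in millions of transistors per square millimetre."""
    return transistors_millions / area_mm2


densities = {name: density_mt_per_mm2(*spec) for name, spec in chips.items()}
for name, value in densities.items():
    print(f"{name}: {value:.1f} MT/mm^2")

ratio = densities["GA100 (7nm)"] / densities["GV100 (12nm)"]
print(f"density ratio: {ratio:.2f}x")
```

With these inputs the result is in the same ballpark as the densities quoted in the post, so a similar die area on 7nm can hold roughly two and a half times the transistors.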

Vya Domus · May 15, 2020

MuhammedAbdo said:
NVIDIA is directly comparing the V100 FP64 output to the A100 FP64 2.5X output, which means they are directly comparable.

Nope, they are comparing the FP64 throughput separately from the FP64 tensor throughput because they are different things.

MuhammedAbdo said:
Adding imaginary stuff out of your ass isn't fixing anything

Gaslighting, classic. You've made shit up such as tensor cores running scalar code and claiming that Nvidia said so when they didn't. You claim all metrics are actually the same thing and the people at Nvidia are just a bunch of idiots wasting their time writing irrelevant shit. You live in a parallel world buddy, I think your condition is called cognitive dissonance. For your own mental health, go check a doctor.

I pity you that you pity me.

But you didn't answer, what are you still doing here ? Let the case rest buddy, you said it's settled. Feeling insecure about the nonsense that you wrote ?

dyonoctis · May 15, 2020

To be fair, nvidia seems to have made a few typos in one of their slides where they forgot to add "TC" next to V100, which makes it look like they are pitting FP32/64 TC against classic FP32/64.

MuhammedAbdo · May 15, 2020

Vya Domus said:
You've made shit up such as tensor cores running scalar code and claiming that Nvidia said so when they didn't.

How many phrases should I quote from the whitepaper?

Tensor cores are now compliant with accelerating IEEE-compliant FP64 computations
Each FP64 matrix multiply add op now replaces 8 FMA FP64 operation
Meaning each SM is now capable of 128 FP64 op per clock which achieves 2.5X the throughput of V100

Vya Domus said:
But you didn't answer, what are you still doing here ? Let the case rest buddy, you said it's settled. Feeling insecure about the nonsense that you wrote ?

I am here simply to educate.

dyonoctis said:
To be fair, nvidia seems to have made a few typos in one of their slides where they forgot to add "TC" next to V100, which makes it look like they are pitting FP32/64 TC against classic FP32/64.

There is no typo, they did the same on this official slide:

Vya Domus · May 15, 2020

MuhammedAbdo said:
How many phrases should I quote from the whitepaper?

It's not a whitepaper, stop saying this. You're not citing some sort of scientific paper buddy, it's a damn blog post on their website. And your "whitepaper", by the way, agrees with me not you.

MuhammedAbdo said:
Tensor cores are now compliant with accelerating IEEE-compliant FP64 computations

It doesn't mean anything if they are compliant; they are distinct units, different from the normal FP64 units, as the SM diagram clearly shows, because they do different computations.

MuhammedAbdo said:
Meaning each SM is now capable of 128 FP64 op per clock which achieves 2.5X the tensor FP64 throughput of V100

Ｆｉｘｅｄ　ｉｔ．

Don't worry, just as you can spam the same incorrect statements a million times I can also correct you every time.

MuhammedAbdo said:
I am here simply to educate.

I did not need or request your worthless education. I mean for one thing you are absolutely clueless, who do you think you are, a scholar ? On some random tech forum wasting your time spamming the same shit over and over ?

Wake up to the real world buddy, you ain't educating anyone. :roll:

MuhammedAbdo · May 15, 2020

Vya Domus said:
It doesn't mean anything if they are compliant; they are distinct units, different from the normal FP64 units, as the SM diagram clearly shows, because they do different computations.

Figuring out the double precision floating point performance boost moving from Volta to Ampere is easy enough. Paresh Kharya, director of product management for datacenter and cloud platforms, said in a prebriefing ahead of the keynote address by Nvidia co-founder and chief executive officer Jensen Huang announcing Ampere that peak FP64 performance for Ampere was 19.5 teraflops (using Tensor Cores), 2.5X larger than for Volta. So you might be thinking that the FP64 unit counts scaled with the increase of the transistor density, more or less. But actually, the performance of the raw FP64 units in the Ampere GPU only hits 9.7 teraflops, half the amount running through the Tensor Cores (which did not support 64-bit processing in Volta.)

Nvidia Unifies AI Compute With “Ampere” GPU (www.nextplatform.com)

Sucks to be you I guess.

As already mentioned, the A100 is 2.5x more efficient in accelerating FP64 workloads compared to the V100. This was achieved by replacing the traditional DFMA instructions with FP64 based matrix multiply-add. This reduces the scheduling overhead and shared memory bandwidth requirement by cutting down on instruction fetches.

NVIDIA Ampere Architectural Analysis: A Look at the A100 Tensor Core GPU | Hardware Times (www.hardwaretimes.com)

With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores.

Double-Precision Tensor Cores Speed High-Performance Computing, Agenparl (agenparl.eu)

Ouch, suck for you again!

Vya Domus · May 15, 2020

MuhammedAbdo said:
Paresh Kharya, director of product management for datacenter and cloud platforms, said in a prebriefing ahead of the keynote address by Nvidia co-founder and chief executive officer Jensen Huang announcing Ampere that peak FP64 performance for Ampere was 19.5 teraflops (using Tensor Cores), 2.5X larger than for Volta. So you might be thinking that the FP64 unit counts scaled with the increase of the transistor density, more or less. But actually, the performance of the raw FP64 units in the Ampere GPU only hits 9.7 teraflops, half the amount running through the Tensor Cores (which did not support 64-bit processing in Volta.)

Paresh Kharya put it exactly how it is: that's tensor performance from tensor cores. What he says agrees with me, that sucks for you I guess! Be it TF32 or tensor FP32/FP64, it's without question tensor performance, not scalar; these workloads aren't interchangeable. No one thinks FP64 should scale specifically with transistors, maybe apart from you; it almost never does. What it does usually scale well with is shader count.

MuhammedAbdo said:
Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

Read that again, slowly and carefully, "a new mode that accelerates double-precision matrix multiply-accumulate operations". That's that A * B + C thing I mentioned back a while ago, that's all that these tensor cores do, they can't execute scalar code, FP64 units can do a lot more. So, I am right again.
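
To make the "A * B + C" point concrete, here is a minimal sketch of a matrix multiply-accumulate written out as the scalar multiply-adds a general-purpose unit would have to issue one by one. The 2x2 shape is chosen only to keep the example small and is not a claim about the tile shape the hardware uses; `matmul_accumulate` is an illustrative name.

```python
def matmul_accumulate(a, b, c):
    """Return (D, fma_count) where D = A*B + C, built from scalar multiply-adds."""
    m, k, n = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from C so every step is a multiply-add
    fma_count = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]  # one scalar multiply-add
                fma_count += 1
    return d, fma_count


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

D, fmas = matmul_accumulate(A, B, C)
print(D)     # result of A*B + C
print(fmas)  # number of scalar multiply-adds issued
```

A matrix unit performs this whole pattern as one operation, which is where the instruction savings come from. What it does not offer is the branching, masking or arbitrary math inside the loop body that a general FP64 unit can execute.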

Imagine this, everything you post agrees with me not with you. Sucks to be you I guess !

By the way, can we like schedule these. Like for instance, let's post one comment every half an hour or something ?

Breit · May 15, 2020

Is there anything we can do to help you guys find an end to your discussion? It starts to get boring. Just saying...

Jinxed · May 16, 2020

Fixed function hardware will always be more space efficient than a general compute unit. A general compute unit has to perform many types of operations, lots of different instructions, and all those need some transistor allocations in the design. A fixed function unit like a Tensor Core only performs a very limited set of operations or a single one. In case of Tensor Cores that operation is called FMA (https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add). Fixed function units therefore need only a fraction of the transistor allocations in the design compared to a general compute unit, because they only ever need to perform a fraction of the functionality. In effect, to achieve the same performance for a specific operation, the space of a fixed function unit could be much smaller on the chip. You could also form the fixed function units into larger groups, sharing some common resources. In that case, you could have a group of fixed function units as big as a general compute unit, providing many times more performance compared to the general compute unit, but only for that small set of operations (like FMA for Tensor Cores). It's essentially an optimization. Sacrificing a more universal approach in favor of performance. You could also form very large groups of these fixed function units, provided it makes sense in terms of sharing common resources like cache and work schedulers. Those may be much larger than general compute units, but would also provide even more significant performance (an optimization of an optimization).

And in fact the GA100 is extremely efficient in what it was designed for - AI training/inference. GV100 only supports accelerated tensor operations for the FP16 format, so that is the best base comparison, comparing tensor operations on GV100 with tensor operations on the GA100. All the other types of operations on a GV100, like FP32, INT8, etc. fall back to general compute units (they are not accelerated by Tensor Cores). FP16 performance of a GV100 is 125 TOPS. For a GA100, that is 310 TOPS baseline (2.5x better), 625 TOPS (5x better) with the sparse feature on (but that is a logical optimization, not raw performance). So we have 125 TOPS for GV100 at 250W and 310 TOPS for GA100 at 400W. With basic math skills you can easily see that Ampere's energy efficiency (performance per watt) is actually 55% better compared to GV100. That's raw FMA performance. With the optimization features on, it can actually reach up to 200% better energy efficiency.
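
The perf-per-watt arithmetic above can be reproduced with a few lines. This is only a sketch using the throughput and power figures exactly as quoted in the post (including the 250W figure for GV100); `perf_per_watt` and `gain` are illustrative helper names.

```python
def perf_per_watt(tops, watts):
    """Throughput per watt for a given peak figure and power budget."""
    return tops / watts


def gain(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.55 means +55%."""
    return new / old - 1.0


gv100 = perf_per_watt(125, 250)         # FP16 tensor, figures as quoted above
ga100_dense = perf_per_watt(310, 400)   # FP16 tensor, dense
ga100_sparse = perf_per_watt(625, 400)  # FP16 tensor, sparsity feature on

dense_gain = gain(ga100_dense, gv100)
sparse_gain = gain(ga100_sparse, gv100)

print(f"dense:  {dense_gain:+.1%}")
print(f"sparse: {sparse_gain:+.1%}")
```

With those inputs the dense case comes out at about +55% and the sparse case at a little over +200%, matching the claims in the post. The result is only as good as the quoted power and throughput numbers.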

I can understand why AMD fanboys like Vya Domus are bitter. AMD's "AI accelerators" offer only a tiny fraction of performance compared to Ampere. They are not really AI accelerators - you can get orders of magnitude better performance from hardware from Google, Nvidia and other companies. Can you run AI training/inferencing on the AMD cards? Yes, but you can do that on any x86 CPU as well. Would doing so on an AMD card make any sense? No, just like it doesn't make sense on a CPU anymore. Fixed function hardware like Nvidia's Tensor Cores on the GA100 or Google's Tensor Processing Unit are way better for this task.

Also Vya Domus, unfortunately for you MuhammedAbdo is generally correct. Nvidia compares the FP32 tensor performance on Ampere with FP32 non-tensor performance on Volta, simply because Volta does not support FP32 tensor operations (falls back to general compute units) and Ampere does. The only issue with Muhammed's statement is that he also reversed the implication, which is of course incorrect. Tensor Cores cannot perform the full set of FP32 operations that a general compute unit can. However the rest of your statements, Vya Domus, are incorrect and show that you have very little understanding of the technology.

Vya Domus · May 16, 2020

Jinxed said:
Tensor Cores cannot perform the full set of FP32 operations that a general compute unit can.

That was the only point I've ever made, I couldn't care less about Nvidia's comparison. They've compared performance only in the context of tensor ops and not general compute; anyone with an ounce of intelligence understood that, apart from your mate muhamed whatever who is still convinced this GPU runs normal generic CUDA code on Tensor cores. Of course, being a colossal fanboy yourself, you couldn't help but automatically assume I was trying to offend your beloved brand in some way. Nope, I was just explaining thoroughly why our friend doesn't know what planet he is on.

Jinxed said:
The only issue with Muhammed's statement is that he also reversed the implication

Which shows he was just regurgitating copy pasted information with no basic understanding of how these things work, otherwise he would have caught onto the fact that what he was claiming is physically impossible.

Jinxed said:
However the rest of your statements, Vya Domus, are incorrect and show that you have very little understanding of the technology.

Funny how I can understand on a low level why these units can perform a limited set of instructions which makes them incapable of running normal CUDA code but ultimately I have very little understanding of how these technologies work. Kinda bizarre isn't it ? Don't worry we can embark on an epic comment chain like above and we can see how little my understanding is. Don't get your hopes up though, I wrote enough CUDA and OpenCL to know my way around.

Also, nice new account with posts only about calling people AMD fanboys bro. Welcome to TPU.

Jinxed · May 17, 2020

And that is exactly why I post. Only when there's a bitter AMD fanboy that doesn't have a clue what he's saying.

As for your "that was my only statement", let me recap:

Vya Domus said:
Sad reacts only, all those "RTX 3060 as fast as a 2080ti" seem out of this world right now.

Here you're missing the fact that GA100 is focused entirely on AI (it even has it in the full name - Nvidia A100 Tensor Core GPU) and its performance has nothing to do with how games are going to perform on other Ampere chips.

Vya Domus said:
But this one has an entire GPC disabled due to horrendous yields, I presume, and probably because it would throw even that eye watering 400W TDP out the window.

Here you fail to understand that the large chips are actually designed with fabrication errors in mind from the start and where you miss that the 400W TDP still translates to 55% increase in energy efficiency compared to Volta.

Vya Domus said:
Comparing SM counts and power is a totally legit way of inferring efficiency, how else would you do it?

By actually measuring the performance and then dividing that by power consumption. As has been done here on TPU for years. But you're the expert. You tell the big boss here that all his reviews were wrong and that he should've inferred efficiency from SM counts and all those perf/power measurements were useless. Go ahead.

Vya Domus said:
In other words if let's say we have a GPU with N/2 shaders at 2 Ghz it will generally consume more power than a GPU with N shaders at 1 Ghz.

Looking at Pascal vs Polaris/Vega/GCN in general - Pascal with much smaller chips and much higher frequencies did have lower power consumption.

Vega 64 balanced with standard BIOS @ 1274 MHz - 292W, 1080Ti standard @ 1481 MHz 231W

AMD Radeon RX Vega 64 8 GB Review (www.techpowerup.com)

While at the same time 1080ti has +30% to +40% more performance, 1080ti is a 471 mm² chip at 16nm, while Vega 64 is a 486 mm² chip at 14nm. AMD = larger, slower, power hungry. And it's been the same story throughout the Polaris/Vega/Maxwell/Pascal/Turing generations. So I'm curious - what kind of data are you basing your statement on?

Vya Domus said:
GA100 has 20% more shaders compared to V100 but also consumes 60% more power. It doesn't take much to see that efficiency isn't that great. It's not that hard to infer these things, don't overestimate their complexity.

Here you are comparing general compute unit performance, ignoring the fact the GA100 chip design invested heavily into fixed function units (tensor cores) and actually achieves 2.5x raw performance increase, with +55% energy efficiency compared to Volta.

Vya Domus said:
Those "FP64" units you see in the SM diagram don't do tensor operation, they just do scalar ops. Different units, for different workloads.

This is perhaps the most startling showcase of how you have no clue what you're talking about. Of course it is possible to do tensor operations on general compute units (SMs). And in fact that is what Volta was doing for anything else besides FP16 tensor ops and it is what even AMD GPUs are doing. Radeons do not have tensor cores, yet it's no problem to run let's say Google's TensorFlow on that hardware. Why? 3D graphics is actually mostly about matrix and vector multiplications, dot products etc., so general compute units are quite good at it - much better than CPUs, not as good as fixed function units like Tensor Cores.

Vya Domus said:
I wrote enough CUDA and OpenCL to know my way around.

It's very obvious to anyone by now that you have not. You are lacking the very essentials required to do that.

To quote your own evaluation of the other guy, "I am convinced you can't be educated, you are missing both the will and the capacity to understand this." It fits you better than it fits him.

Vya Domus · May 17, 2020

Jinxed said:
Here you're missing the fact that GA100 is focused entirely on AI (it even has it in the full name - Nvidia A100 Tensor Core GPU) and its performance has nothing to do with how games are going to perform on other Ampere chips.

Ampere will be used in consumer gaming products : https://www.techpowerup.com/267090/nvidia-ampere-designed-for-both-hpc-and-geforce-quadro

This means it's totally reasonable to look at this chip and infer future performance in a consumer GPU. The number of SMs, clock speeds and power envelope will vary but the architecture won't. Of course if you don't know much it's going to seem like you can't extrapolate performance, that's not surprising.

Jinxed said:
Here you fail to understand that the large chips are actually designed with fabrication errors in mind from the start and where you miss that the 400W TDP still translates to 55% increase in energy efficiency compared to Volta.

18% of the shaders are disabled, that's a huge amount, that's not meant to improve redundancy. You add one, maybe two SMs for that, not 20, almost a fifth of the total SMs. They made a chip too large to be viable fully enabled on this current node. Don't be a bitter fanboy and look at things objectively and pragmatically.

V100 which was almost as large was fully enabled from day one, guess that one never had any fabrication errors right ? Nah, more like your explanation is just wrong.

Jinxed said:
So I'm curious - what kind of data are you basing your statement on?

Just raw FP32 performance. You could factor in FP16/FP64 performance and then the Pascal equivalents would look orders of magnitude less efficient. But none of that matters because I was speaking purely from the perspective of how ICs behave: power increases linearly with frequency but with the square of voltage. Therefore, as a general rule, a chip twice as large but running at half the frequency (and therefore requiring lower voltage) would be more efficient simply as a matter of physics. Maybe this was too complicated for you to understand. Don't push yourself too hard.
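
The scaling rule being referred to is the usual first-order model for dynamic power, P ≈ C · V² · f per switching unit. A minimal sketch of the "wide and slow" versus "narrow and fast" comparison follows. The unit counts, voltages and the `dynamic_power` helper are hypothetical values picked for illustration, and the model ignores static leakage and any real voltage/frequency curve.

```python
def dynamic_power(units, voltage, freq_ghz, cap_per_unit=1.0):
    """First-order dynamic power in arbitrary units: units * C * V^2 * f."""
    return units * cap_per_unit * voltage ** 2 * freq_ghz


# Two hypothetical chips with the same nominal throughput (units * frequency).
# The assumption is that the faster chip needs a higher supply voltage.
wide_slow = dynamic_power(units=4096, voltage=0.8, freq_ghz=1.0)
narrow_fast = dynamic_power(units=2048, voltage=1.0, freq_ghz=2.0)

ratio = wide_slow / narrow_fast
print(f"wide/slow draws {ratio:.2f}x the power of narrow/fast")
```

In this toy model the unit count and frequency cancel out, so the power ratio is just the square of the voltage ratio. How much voltage a real chip can shed at the lower clock decides how large the gap actually is.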

Jinxed said:
AMD = larger, slower, power hungry.

OK fanboy. That's what this is all about, isn't it ? You're just a bitter Nvidia fanboy that has nothing better to do, you don't want to discuss anything, you just want to bash a brand. That's sad and pathetic.

Jinxed said:
Here you are comparing general compute unit performance, ignoring the fact the GA100 chip design invested heavily into fixed function units (tensor cores) and actually achieves 2.5x raw performance increase, with +55% energy efficiency compared to Volta.

It achieves 2.5X more performance and 55% better efficiency in some workloads, not all. You're just starting to regurgitate the same stuff over and over, a lot like your friend. Well, I am fairly convinced this is just an alt account. Hi there buddy.

Jinxed said:
This is perhaps the most startling showcase of how you have no clue what you're talking about. Of course it is possible to do tensor operations on general compute units (SMs). And in fact that is what Volta was doing for anything else besides FP16 tensor ops and it is what even AMD GPUs are doing. Radeons do not have tensor cores, yet it's no problem to run let's say Google's TensorFlow on that hardware. Why? 3D graphics is actually mostly about matrix and vector multiplications, dot products etc., so general compute units are quite good at it - much better than CPUs, not as good as fixed function units like Tensor Cores.

What's startling is that even though you're trying to scour through my old comments like some creepy detective wanna be, I made myself very clear that those units are general purpose and can execute any sort of code. It's obvious I was referring to native tensor ops using hardware, but you are so caught up in your bitter fanboy rampage you are desperately trying to find anything to quote me on. Sad, really fucking sad.

They are peak performance metrics for two separate things. You can't use Tensor cores to run scalar code, it simply doesn't work like that. The FP64 units can do branching, masking, execute complex mathematical functions, bit wise instructions, etc. Tensor cores can't do any of those things, they just do one single bloody computation : A * B + C. Unless you show me where this is explicitly mentioned and explained you are straight up delusional and making shit up. You don't have the slightest clue how these things even work, otherwise it would be painfully obvious to you how dumb what you're saying is.

Jinxed said:
It's very obvious to anyone by now that you have not.

Try me, or you're too scared of showing us how little you know ? Don't be, you've already shown that, might as well go all in.

Jinxed said:
It fits you better than it fits him.

"him", riiiiight

MuhammedAbdo · May 17, 2020

Jinxed said:
This is perhaps the most startling showcase of how you have no clue what you're talking about. Of course it is possible to do tensor operations on general compute units (SMs). And in fact that is what Volta was doing for anything else besides FP16 tensor ops and it is what even AMD GPUs are doing. Radeons do not have tensor cores, yet it's no problem to run let's say Google's TensorFlow on that hardware. Why? 3D graphics is actually mostly about matrix and vector multiplications, dot products etc., so general compute units are quite good at it - much better than CPUs, not as good as fixed function units like Tensor Cores.

Ouch, that's gotta hurt.

Vya Domus said:
It achieves 2.5X more performance and 55% better efficiency in some workloads, not all. You're just starting to regurgitate the same stuff over and over, a lot like your friend. Well, I am fairly convinced this is just an alt account. Hi there buddy.

Ooh, butt hurt much?!

Fact is you really have no clue do you? Only a rabid AMD fanboy would focus on traditional FP32 for an AI chip, especially when the new TF32 format is 20 times higher than previous gen.
And only a rabid AMD fanboy would lack the imagination that NVIDIA will cut tensor core count to a 1/4 (as they are now miles faster than before), cut the HPC stuff out, remove NVLink, clock the chip higher and achieve a solid gaming GPU with at least 50% better power efficiency than previous gen and the competition (it's already higher than 50% efficiency, 54 billion transistors running at 400W, compared to 10 billion in the 5700XT running at 225W)!

Vya Domus · May 17, 2020

MuhammedAbdo said:
Only a rabid AMD fanboy would focus on traditional FP32 for an AI chip, especially when the new TF32 format is 20 times higher than previous gen.
And only a rabid AMD fanboy would lack the imagination that NVIDIA will cut tensor core count to a 1/4 (as they are now miles faster than before), cut the HPC stuff out, remove NVLink, clock the chip higher and achieve a solid gaming GPU with at least 50% better power efficiency than previous gen and the competition (it's already higher than 50% efficiency, 54 billion transistors running at 400W, compared to 10 billion in the 5700XT running at 225W)!

You getting heated up buddy ? Chill out, drink some of that sweet, sweet Nvidia kool-aid and post for the millionth time the same braindead shit. Oh, and don't forget to make another alt while you're at it just so you can post the same crap all over again. It's like you're stuck in a hellish loop where you're forced to post the same "Nvidia X% better" over and over, that has to take its toll on your sanity even if you're an avid Nvidia fanboy such as yourself.

Is someone making you do this at gun point ? Should we inform the authorities ? Write us an SOS message or something.

Fiendish · May 18, 2020

Vya Domus said:
V100 which was almost as large was fully enabled from day one, guess that one never had any fabrication errors right ? Nah, more like your explanation is just wrong.

According to the Volta whitepaper and other documentation, the full GV100 GPU had 84 SMs and no product was ever released with more than 80 SMs enabled, which means they were NEVER able to achieve a fully enabled implementation.

Bytales · May 18, 2020

theoneandonlymrk said:
From others this is the A100 not GA100

The GA100 is the full fat 8192 GPU.

I want the full FAT 8192 GPU, with 6x16 HBM memory chips, 600 watts, bumped clocks, triple 8 power connector standard, all for 81920 cent-dollars. 10 cents per "whatever the heck it's called that it has 8192 of them" seems fair to me.

TheoneandonlyMrK · May 18, 2020

Bytales said:
I want the full FAT 8192 GPU, with 6x16 HBM memory chips, 600watts bumped clocks, triple 8 power connector standard, all for 800 dollars.

Well in five or ten years, you might get one on Craigslist for $800, ain't no one getting it sooner for that price bro.

Jinxed · May 18, 2020

Vya Domus said:
Ampere will be used in consumer gaming products : https://www.techpowerup.com/267090/nvidia-ampere-designed-for-both-hpc-and-geforce-quadro

This means it's totally reasonable to look at this chip and infer future performance in a consumer GPU.

What you fail to mention is that Huang specifically stated during the pre-GTC call that the gaming Ampere GPUs will have a different configuration. The ratios of on-chip resources will be very different. We can expect similar changes here as with Volta -> Turing. FP64 will be removed, Tensor Core count reduced, RT cores added, SM units will be optimized for FP16/32 and INT ops. There is absolutely nothing reasonable about inferring gaming performance of Ampere from GA100 which is an AI-focused design.

Vya Domus said:
18% of the shaders are disabled, that's a huge amount, that's not meant to improve redundancy.

The exact value is 15.6%, but thanks for showing even more how desperate you are to twist the truth. Chip designers have been designing chips with yields in mind for many years. Why would that change now? The whole idea of being able to disable parts of the chip is about this and it's been with us for a very long time.
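
For reference, the disabled fraction can be worked out from numbers that appear in this thread: a full GA100 with 8192 shader units, 20 SMs disabled, and a full GV100 with 84 SMs of which 80 were enabled. The only extra assumption in this sketch is 64 shader units per SM, which gives 128 SMs for the full GA100; `disabled_fraction` is an illustrative helper name.

```python
def disabled_fraction(total_sms, enabled_sms):
    """Share of SMs that are fused off in the shipping product."""
    return (total_sms - enabled_sms) / total_sms


UNITS_PER_SM = 64                      # assumption, not stated in this thread
ga100_total = 8192 // UNITS_PER_SM     # 128 SMs on the full die
ga100_enabled = ga100_total - 20       # 20 SMs disabled, as stated above

ga100 = disabled_fraction(ga100_total, ga100_enabled)
gv100 = disabled_fraction(84, 80)      # figures from the Volta whitepaper post

print(f"GA100 disabled: {ga100:.1%}")
print(f"GV100 disabled: {gv100:.1%}")
```

Under that assumption the GA100 figure is 20 out of 128 SMs, or about 15.6%, against roughly 4.8% for the shipping GV100.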

Vya Domus said:
Just raw FP32 performance. You could factor in FP16/FP64 performance and then the Pascal equivalents would look orders of magnitude less efficient.

First, there's nothing wrong with FP16 performance; second, why would I care for FP64 on a gaming GPU? No. AMD has for a long time been making claims about their so called "raw performance", which was of course never delivered. It's about the frames per second a GPU can provide versus the energy it consumes doing so. You still haven't explained why here on TPU and everywhere else the metric is perf (in frames per second)/watt, while you're suggesting otherwise. Tell us.

The key is efficiency here. Efficiency is about how well the GPU scheduler can deliver work to the existing resources of a GPU, in other words how well it can keep the resources busy. And it's not a choice on AMD's side to over-provision the compute cores (your very theoretical "raw performance"). Their scheduler architectures are inferior to Nvidia's, so they must provide more compute units in order to compete, since they cannot keep them all busy and therefore lack efficiency. They also do the same with clock frequencies. Polaris/Vega were designed for much lower optimal frequencies (perf/power), but because of the leaps in performance Maxwell/Pascal made, AMD had to set the core clocks on their Polaris/Vega/RDNA architectures way past the optimal point. Again the reason is to remain at least a little bit competitive. That's the main reason for the terrible power efficiency AMD has. And even with RDNA it's still there. It's just hidden by the 7nm node improvements. Let's not forget that 7nm RDNA GPUs are barely catching up to 12nm Turing chips. Just the process difference alone is almost 4 times the MTMM. Efficiency is simply not there, even with RDNA. We'll see that with Ampere gaming GPUs. Then we'll have a reasonable comparison (almost, since the RDNA GPUs lack raytracing and many other features).

Vya Domus said:
OK fanboy. That's what this is all about, isn't it ? You're just a bitter Nvidia fanboy that has nothing better to do, you don't want to discuss anything, you just want to bash a brand. That's sad and pathetic.

Looks like my original remark hurt you more than I expected. Good that you are trying to repeat it. Imitation is the highest form of flattery.

Vya Domus said:
It achieves 2.5X more performance and 55% better efficiency in some workloads, not all.

It achieves 2.5x more performance and 55% better efficiency in the workloads for which it was designed. What you're trying to do, and trust me that everyone here sees your funny attempt, is evaluate a car based on how well it can fly, then say it's not a good car because it doesn't fly well. (Although, amusingly, in this case even at flying the "car" would still perform much better than the competition's best attempt, i.e. the GA100's classic compute is still much better than that of AMD's compute cards.)

Vya Domus said:
What's startling is that even though you're trying to scour through my old comments like some creepy detective wannabe, I made myself very clear that those units are general purpose and can execute any sort of code.

Hey, don't be butthurt about saying you never made any other statements and then being slapped in the face with said "non-existent" statements with gusto. And your statement regarding the alleged inability of general compute units to execute tensor ops is quite clear. Let me repost it here:

Those "FP64" units you see in the SM diagram don't do tensor operation, they just do scalar ops. Different units, for different workloads.

How does "they just do scalar ops" mean "can execute any sort of code"? Hilarious.

Vya Domus said:
Try me, or you're too scared of showing us how little you know ? Don't be, you've already shown that, might as well go all in.

Please do.

Vya Domus said:
V100 which was almost as large was fully enabled from day one, guess that one never had any fabrication errors right ? Nah, more like your explanation is just wrong.

Fiendish said:
According to the Volta whitepaper and other documentation, the full GV100 GPU had 84 SMs and no product was ever released with more than 80 SMs enabled, which means they were NEVER able to achieve a fully enabled implementation.

I intentionally omitted that in my original reply, because you, sir, rule and deserve a quote!
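For reference, the SM counts in the quote above can be turned into an enabled fraction with a few lines of Python. This is only a sketch of the arithmetic: the 84 and 80 figures come from the quoted post, while the function and variable names are made up for illustration.

```python
def enabled_fraction(enabled_units: int, total_units: int) -> float:
    """Return the share of a chip's units that are enabled in a shipping product."""
    if total_units <= 0 or not 0 <= enabled_units <= total_units:
        raise ValueError("need 0 <= enabled_units <= total_units and total_units > 0")
    return enabled_units / total_units


# Figures taken from the quoted post: full GV100 die vs. the most any product shipped with.
GV100_FULL_SMS = 84
V100_ENABLED_SMS = 80

frac = enabled_fraction(V100_ENABLED_SMS, GV100_FULL_SMS)
disabled_pct = (1 - frac) * 100

print(f"V100 enabled SMs: {frac:.1%}, disabled: {disabled_pct:.2f}%")
```

So even the "fully enabled from day one" V100 shipped with a bit under 5% of its SMs turned off.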

Gmr_Chick · May 19, 2020

OK, it's been...interesting seeing @Vya Domus, @MuhammedAbdo, and @Jinxed engage in a fruitless battle of wits, but seriously now gentlemen. Take your battle to a PM conversation. You can argue to your heart's content there.

EarthDog · May 19, 2020

Where is the staff? I reported this days ago... insults slung L and R... lol...

It's good info, but the barbs just sully the conversation.

Jinxed · May 19, 2020

Gmr_Chick said:
OK, it's been...interesting seeing @Vya Domus, @MuhammedAbdo, and @Jinxed engage in a fruitless battle of wits, but seriously now gentlemen. Take your battle to a PM conversation. You can argue to your heart's content there.

I'm hardly engaging in their battle. 3 posts, compared to their multiple-page rant, is nothing. And you're right, it seems fruitless at this point.

tajoh111 · May 21, 2020

Vya Domus said:
But this one has an entire GPC disabled due to horrendous yields, I presume, and probably because it would throw even that eye watering 400W TDP out the window. There has to be one fully enabled chip right ? One would assume there would be different 100s.

To be honest this is borderline Thermi 2.0, a great compute architecture that can barely be implemented in actual silicon due to power and yields. These aren't exactly Nvidia's brightest hours in terms of chip design, it seems like they bit off more than they could chew, the chip was probably cut down in a last minute decision.

Suffice it to say I doubt we'll see the full 8192 shaders in any GPU this generation, I doubt they could realistically fit that in a 250W power envelope and it seems like GA100 runs at 1.4 GHz, no change from Volta nor from Turing probably. Let's see: 35% more shaders than Volta but 60% more power and same clocks. It's not shaping up to be the "50% more efficient and 50% faster per SM" some hoped for.

I hope you made the same comments about Radeon VII.

In shipping form, Radeon VII and Radeon MI50 do not come fully enabled (6% disabled), and they only increase FP32 by 9% in the move to 7nm while using the same power as Vega 64 (an inefficient chip to begin with). In addition, Vega 20 is not remotely as ambitious a leap as Nvidia's A100, as it is less than half the size and has about a quarter of the transistors.
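The back-of-the-envelope numbers traded in this exchange can be written out as a rough sketch. It assumes that raw FP32 throughput scales linearly with shader count at equal clocks, which is a simplification and exactly the kind of "raw" metric being disputed in this thread. The ratios are the ones quoted in the posts above; the function name is hypothetical.

```python
def relative_perf_per_watt(throughput_ratio: float, power_ratio: float) -> float:
    """Perf/W of the new chip relative to the old one (1.0 means unchanged)."""
    if throughput_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return throughput_ratio / power_ratio


# Quoted post: 35% more shaders than Volta, same clocks, 60% more power.
# Assumption: raw FP32 throughput ~ shaders x clock, so throughput ratio = 1.35.
a100_vs_v100 = relative_perf_per_watt(1.35, 1.60)

# Reply: Radeon VII gains 9% FP32 over Vega 64 at the same power.
radeon_vii_vs_vega64 = relative_perf_per_watt(1.09, 1.00)

print(f"A100 vs V100 raw FP32 per watt:       {a100_vs_v100:.3f}x")
print(f"Radeon VII vs Vega 64 FP32 per watt:  {radeon_vii_vs_vega64:.3f}x")
```

Neither figure says anything about tensor throughput or real workloads; it only reproduces the raw-FP32 arithmetic both posters are using.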

One more reason why Nvidia has to disable quite a bit of the chip is pure volume.

Do you know how much Nvidia's data center + professional visualization revenue is? Last quarter it was 1.3 billion dollars. That is near the revenue of AMD's CPU and graphics division, which produced 1.43 billion dollars in Q1. Nvidia's financials tomorrow will likely show a figure equal to that.

Considering this, Nvidia must deliver enormous volume, which means yields have to take a hit to deliver the volume its customers want, and this will continue to be a problem in the future.

Analysts are predicting Nvidia's data center revenue to grow from 5.5 billion annually to 20 billion, which is the revised prediction after A100 was released. Nvidia's market capitalization is close to Intel's right now. The reason is that the data center market is growing at an explosive speed and products from Intel and AMD are not perceived as a threat in the near future (analysts already know what next-gen AMD and Intel products look like, as they are well connected). From the tone of your post, it seems you perceive A100 as a failure, but your own bias is blinding you to what should be obvious. Look at the reaction from the markets (NVDA stock has grown 15% since the A100 release and analysts have revised their Nvidia stock price target from $275 to $420) and at the tangible benefits of A100 to the data center market, and you will realize how blind you were to the success of this product. Scoffing at the A100's prowess just shows your own ignorance.
