Not long ago, Intel's Raja Koduri claimed that the Xe HP "Ponte Vecchio" silicon was the "big daddy" of Xe GPUs and the "largest chip co-developed in India," larger than the 35 billion-transistor Xilinx VU19P FPGA co-developed in the country. It turns out that NVIDIA is in the mood for setting records. The "Ampere" A100 silicon crams 54 billion transistors into a single 7 nm die (not counting the transistors of the HBM2E memory stacks).
NVIDIA claims a 20× boost in both AI inference and single-precision (FP32) performance over the chip's "Volta"-based predecessor, the Tesla V100, along with a 2.5× gain in FP64 performance. NVIDIA has also introduced a new number format for AI compute, called TF32 (tensor float 32). TF32 combines the 10-bit mantissa of FP16 with the 8-bit exponent of FP32, resulting in a new, efficient format, and NVIDIA attributes much of its claimed 20× gain over "Volta" to it. The third-generation Tensor Cores introduced with Ampere also support FP64 natively. Another key design focus for NVIDIA is leveraging "sparsity" in neural nets to reduce their size and improve performance.
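To make the format concrete, here's a minimal Python sketch that emulates TF32-style rounding on the CPU: it keeps an FP32 value's 8-bit exponent but rounds the 23-bit mantissa down to 10 bits. The round-to-nearest behavior is an assumption (NVIDIA hasn't spelled out the hardware's rounding here), and NaN/Inf edge cases are ignored:

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate a TF32-like value: FP32's 8-bit exponent is kept,
    but only the top 10 of FP32's 23 mantissa bits survive."""
    # Reinterpret the FP32 bit pattern as a 32-bit unsigned integer
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    drop = 23 - 10                  # mantissa bits to discard
    bits += 1 << (drop - 1)         # round to nearest (ties round magnitude up)
    bits &= ~((1 << drop) - 1)      # zero out the discarded bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(to_tf32(3.14159265))  # 3.140625 -- roughly 3 decimal digits survive,
                            # but the full FP32 exponent range is intact
```

The upshot is that TF32 trades FP32's precision for FP16's, while keeping FP32's dynamic range, which is why FP32 code can often run on the tensor cores without the overflow headaches of pure FP16.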
A new HPC-relevant feature introduced with the A100 is Multi-Instance GPU (MIG), which lets multiple complex applications run on the same GPU without contending for resources such as memory bandwidth. The user can partition a physical A100 into up to seven virtual GPUs of varying specs, and ensure that an application running on one of the vGPUs doesn't eat into the resources of the others. As for real-world performance, NVIDIA claims the A100 beats the V100 by a factor of 7 at BERT.
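As a rough illustration of how such a partition is set up, here's a small Python wrapper around the nvidia-smi MIG commands from NVIDIA's public MIG user guide. The profile ID below is an assumption that matches the 40 GB A100 (IDs vary by model and driver), and the commands require root:

```python
import subprocess

def nvsmi(*args: str) -> None:
    """Run an nvidia-smi command, echoing it and failing loudly on error."""
    cmd = ["nvidia-smi", *args]
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

nvsmi("-i", "0", "-mig", "1")      # enable MIG mode on GPU 0
nvsmi("mig", "-lgip")              # list the GPU instance profiles supported
nvsmi("mig", "-cgi", "9,9", "-C")  # create two 3g.20gb GPU instances
                                   # (profile ID 9 on A100-40GB), each with
                                   # a default compute instance (-C)
```

Applications then target a specific instance by its MIG device UUID (for example via CUDA_VISIBLE_DEVICES), so one tenant's workload stays walled off from the others.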
The DGX A100 system crams 5 petaflops of compute performance onto a single "graphics card" (a single node), and starts at $199,000 apiece.
View at TechPowerUp Main Site