NVidia A100 (the $10,000 server card) is only 19.5 FP32 TFlops: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf
And only 9.7 FP64 TFlops.
The Tensor-FLOPS figures are inflated numbers that only deep-learning folk care about (and apparently not all deep-learning folk actually use those tensor cores). Achieving ~20 FP32 TFLOPS in general-purpose code is basically the best available today (the MI100 is a little faster, but without as much of that NVLink connectivity).
So 45 TFLOPS of FP32 is pretty huge by today's standards. However, Intel will be competing against next-generation products, not the A100. I'm sure NVidia's numbers are going to grow too, but 45 TFLOPS per card will probably still be competitive.
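For context, these headline figures are just theoretical peaks: shader (or FP64) unit count x boost clock x 2 FLOPs per fused multiply-add. A quick back-of-the-envelope check, using the published core counts and boost clocks (real workloads land below these numbers):

```python
# Rough sanity check of the headline numbers quoted in this thread.
# Peak TFLOPS = unit count x boost clock (GHz) x 2 FLOPs per FMA / 1000.
# Core counts and boost clocks are published specs, not measured throughput.

def peak_tflops(units: int, boost_ghz: float) -> float:
    return units * boost_ghz * 2 / 1000  # 2 FLOPs per fused multiply-add

print(f"A100 FP32:      {peak_tflops(6912, 1.41):.1f} TFLOPS")   # ~19.5
print(f"A100 FP64:      {peak_tflops(3456, 1.41):.1f} TFLOPS")   # ~9.7 (half-rate FP64 units)
print(f"RTX 3090 FP32:  {peak_tflops(10496, 1.695):.1f} TFLOPS") # ~35.6
print(f"RTX A6000 FP32: {peak_tflops(10752, 1.80):.1f} TFLOPS")  # ~38.7
```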
Nope, the RTX 3090 has ~36 TFLOPS of FP32. The Tensor-TFLOPS figures are for lower-precision formats like INT4 or INT8; obviously the A100 is designed for a different type of workload that doesn't depend so much on FP32 or FP64. The workstation Ampere RTX A6000 has ~40 TFLOPS of FP32. I guess Nvidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
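To put rough numbers on that: the A100 datasheet linked above lists dense Tensor Core throughput that scales up as the precision drops, which is where the much larger "Tensor" figures come from. A quick summary of the datasheet's non-sparsity numbers:

```python
# Dense (no structured sparsity) A100 peak throughput, per the linked datasheet.
# The big "Tensor" numbers correspond to lower-precision formats, not general FP32 code.
a100_peak = {
    "FP64 (CUDA cores)":   "9.7 TFLOPS",
    "FP32 (CUDA cores)":   "19.5 TFLOPS",
    "TF32 (Tensor Cores)": "156 TFLOPS",
    "FP16 (Tensor Cores)": "312 TFLOPS",
    "INT8 (Tensor Cores)": "624 TOPS",
    "INT4 (Tensor Cores)": "1248 TOPS",
}
for fmt, rate in a100_peak.items():
    print(f"{fmt:22s} {rate}")
```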