
Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

Joined
Nov 11, 2016
Messages
3,417 (1.16/day)
System Name The de-ploughminator Mk-III
Processor 9800X3D
Motherboard Gigabyte X870E Aorus Master
Cooling DeepCool AK620
Memory 2x32GB G.SKill 6400MT Cas32
Video Card(s) Asus RTX4090 TUF
Storage 4TB Samsung 990 Pro
Display(s) 48" LG OLED C4
Case Corsair 5000D Air
Audio Device(s) KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply Corsair HX850
Mouse Razer DeathAdder v3
Keyboard Razer Huntsman V3 Pro TKL
Software win11
The NVidia A100 (the $10,000 server card) puts out only 19.5 TFLOPS of FP32: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf

And only 9.7 TFLOPS of FP64.

The Tensor FLOPS are an inflated number that only deep-learning folk care about (and apparently not all deep-learning folk use the tensor cores anyway). Achieving ~20 FP32 TFLOPS on general-purpose code is basically the best available today (the MI100 is a little faster, but without as much of the NVLink connectivity).

So 45 TFLOPS of FP32 is pretty huge by today's standards. However, Intel is going to be competing against next-generation products, not the A100. I'm sure NVidia is going to grow too, but 45 TFLOPS per card is probably going to be competitive.

Nope, the RTX 3090 has ~36 TFLOPS of FP32. Tensor TFLOPS are quoted for low-precision formats like INT4 or INT8; obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess Nvidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
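
For anyone wanting to sanity-check where those headline numbers come from, here's a quick back-of-the-envelope in Python (the A100 figures are the public datasheet ones; the Ponte Vecchio lane count is just back-solved from 45 TFLOPS at 1.37 GHz, so treat it as an estimate, not a confirmed spec):

Code:
# Peak FP32 = FP32 lanes x 2 FLOPs per FMA x clock
def peak_fp32_tflops(lanes, clock_ghz):
    return lanes * 2 * clock_ghz / 1000.0  # lanes * 2 * GHz gives GFLOPS; /1000 -> TFLOPS

# A100: 108 SMs x 64 FP32 cores per SM at ~1.41 GHz boost (datasheet figures)
print(round(peak_fp32_tflops(108 * 64, 1.41), 1))   # -> 19.5 TFLOPS

# Ponte Vecchio: back-solve the FP32 lane count from 45 TFLOPS at 1.37 GHz
print(round(45e3 / (2 * 1.37)))                     # -> ~16400 lanes (an estimate)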
 
Joined
Jan 4, 2013
Messages
1,183 (0.27/day)
Location
Denmark
System Name R9 5950x/Skylake 6400
Processor R9 5950x/i5 6400
Motherboard Gigabyte Aorus Master X570/Asus Z170 Pro Gaming
Cooling Arctic Liquid Freezer II 360/Stock
Memory 4x8GB Patriot PVS416G4440 CL14/G.S Ripjaws 32 GB F4-3200C16D-32GV
Video Card(s) 7900XTX/6900XT
Storage RIP Seagate 530 4TB (died after 7 months), WD SN850 2TB, Aorus 2TB, Corsair MP600 1TB / 960 Evo 1TB
Display(s) 3x LG 27gl850 1440p
Case Custom builds
Audio Device(s) -
Power Supply Silverstone 1000watt modular Gold/1000Watt Antec
Software Win11 Pro / Win10 Pro / Win10 Home / Win7 / Vista 64-bit and XP Pro
Nice to see that Intel's finest wine only needs water cooling... but hey, any GPU news of new cards is good news :)
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much


The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top-of-the-top, the $10,000+ NVidia GPU for serious work. It's NVidia's best card.

I don't know where you're getting the bullshit numbers claiming the 3090 is faster than an A100, but... it's just not true. Under any reasonable benchmark, like Linpack, the A100 does something like 10-ish TFLOPS double-precision and 20-ish TFLOPS single-precision.
 
Joined
Nov 11, 2016
Messages
3,417 (1.16/day)
System Name The de-ploughminator Mk-III
Processor 9800X3D
Motherboard Gigabyte X870E Aorus Master
Cooling DeepCool AK620
Memory 2x32GB G.SKill 6400MT Cas32
Video Card(s) Asus RTX4090 TUF
Storage 4TB Samsung 990 Pro
Display(s) 48" LG OLED C4
Case Corsair 5000D Air
Audio Device(s) KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply Corsair HX850
Mouse Razer DeathAdder v3
Keyboard Razer Huntsman V3 Pro TKL
Software win11
The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top-of-the-top, the $10,000+ NVidia GPU for serious work. It's NVidia's best card.

Yeah, you just proved my point: the A100 is designed for deep learning and not just FP32/64 performance.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
Yeah, you just proved my point: the A100 is designed for deep learning and not just FP32/64 performance.

The A100 has 108 SMs; the 3090 only has 82. (128 SMs is the full GA100 die, but it seems like NVidia expects some area of the die to be defective.)

The A100 is in a completely different class than the 3090, even in FP32 performance. It's not even close, and that's before you factor in the 2 TB/s of bandwidth from the 80 GB of HBM2e sitting on the package.

------

Note that the 3090 runs FP64 at a lol-worthy 1/64 rate. It's terrible at scientific compute. The A100 runs double precision at full rate (well, 1/2 rate, ~10 TFLOPS).
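
To put rough numbers on those rate ratios (a quick sketch using the published 1/64 and 1/2 FP64 rates and stock peak FP32 figures):

Code:
# FP64 throughput as a fraction of peak FP32, per the published rates
rtx3090_fp32 = 35.6   # TFLOPS at stock boost
a100_fp32    = 19.5   # TFLOPS

print(round(rtx3090_fp32 / 64, 2))  # 1/64 rate on GA102 -> ~0.56 TFLOPS FP64
print(round(a100_fp32 / 2, 2))      # 1/2 rate on GA100  -> 9.75 TFLOPS FP64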
 
Joined
Nov 11, 2016
Messages
3,417 (1.16/day)
System Name The de-ploughminator Mk-III
Processor 9800X3D
Motherboard Gigabyte X870E Aorus Master
Cooling DeepCool AK620
Memory 2x32GB G.SKill 6400MT Cas32
Video Card(s) Asus RTX4090 TUF
Storage 4TB Samsung 990 Pro
Display(s) 48" LG OLED C4
Case Corsair 5000D Air
Audio Device(s) KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply Corsair HX850
Mouse Razer DeathAdder v3
Keyboard Razer Huntsman V3 Pro TKL
Software win11
The A100 has 108 SMs; the 3090 only has 82. (128 SMs is the full GA100 die, but it seems like NVidia expects some area of the die to be defective.)

The A100 is in a completely different class than the 3090, even in FP32 performance. It's not even close, and that's before you factor in the 2 TB/s of bandwidth from the 80 GB of HBM2e sitting on the package.

------

Note that the 3090 runs FP64 at a lol-worthy 1/64 rate. It's terrible at scientific compute. The A100 runs double precision at full rate (well, 1/2 rate, ~10 TFLOPS).
[Attached screenshot: gpgpu.png, GPGPU benchmark results]

My overclocked 3090 gets 40.6 TFLOPS of single precision (FP32) and only 660 GFLOPS of FP64.
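
Those measured numbers line up with the paper specs if you back-solve the clock (a rough check, assuming the 3090's 10496 CUDA cores and the 1/64 FP64 rate):

Code:
cores = 10496             # CUDA cores enabled on the RTX 3090 (82 SMs x 128)
measured_fp32 = 40.6e12   # FLOP/s, from the benchmark screenshot above

clock_ghz = measured_fp32 / (cores * 2) / 1e9   # 2 FLOPs per FMA per core per clock
print(round(clock_ghz, 2))                      # -> ~1.93 GHz effective boost clock

print(round(measured_fp32 / 64 / 1e9))          # -> ~634 GFLOPS expected FP64 at 1/64 rate,
                                                #    in the same ballpark as the measured 660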

The A100 SM is different from the GA102 SM, as the A100 is more focused on Tensor performance.
GA102 SM vs A100 SM
[Attached diagrams: ga102.png and a100.png, SM block diagrams]
 
Joined
Sep 1, 2020
Messages
2,354 (1.52/day)
Location
Bulgaria
Nope, the RTX 3090 has ~36 TFLOPS of FP32. Tensor TFLOPS are quoted for low-precision formats like INT4 or INT8; obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess Nvidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
Please stop writing fake info! If the 3090 really had 36 TF, it would have to get more than double the FPS in games of an RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
 
Joined
Jan 8, 2017
Messages
9,438 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.

Please stop writing fake info! If the 3090 really had 36 TF, it would have to get more than double the FPS in games of an RX 6800 (16.2 TF).

It can peak at 36 TF, but in practice, for real-world workloads, that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900 XT.
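
One way to picture that, as a toy model rather than a measurement (it just assumes each INT32 op in the instruction mix displaces one FP32 issue slot on Ampere's shared FP32/INT32 datapath):

Code:
# Toy model: the ~36 TF peak assumes every issue slot is an FP32 op.  In real
# shaders a chunk of the instruction stream is INT32 (addressing, loop counters,
# logic), and on Ampere those ops share a datapath with half the FP32 lanes.
def effective_fp32_tflops(peak_tflops, int_fraction):
    # crude assumption: each INT32 op displaces one FP32 issue slot
    return peak_tflops * (1.0 - int_fraction)

print(round(effective_fp32_tflops(35.6, 0.30), 1))  # -> ~24.9 TF with a ~30% INT mix
print(round(effective_fp32_tflops(35.6, 0.50), 1))  # -> ~17.8 TF with a heavier mix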
 
Joined
Mar 10, 2014
Messages
1,793 (0.46/day)
Please stop writing fake info! If the 3090 really had 36 TF, it would have to get more than double the FPS in games of an RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
To be fair, an RTX 3090 would get those numbers on the same clpeak benchmark that Intel uses. So the numbers are apples to apples; the product audience is just wholly different, and thus comparing them is mostly academic.
 

iO

Joined
Jul 18, 2012
Messages
529 (0.12/day)
Location
Germany
Processor R7 5700x
Motherboard MSI B450i Gaming
Cooling Accelero Mono CPU Edition
Memory 16 GB VLP
Video Card(s) AMD RX 6700 XT Accelero Mono
Storage P34A80 512GB
Display(s) LG 27UM67 UHD
Case none
Power Supply Fractal Ion 650 SFX
True competition will be Hopper and MI200.
I mean, it's a marvel of engineering and 45 TFLOPS is a lot, but the A100 was announced in May 2020. Beating a two-year-old card when this finally gets released sometime in 2022 sounds somewhat less impressive...
 
Joined
Apr 10, 2020
Messages
504 (0.30/day)
Off the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co. are really worried about Intel, as its investment suggests it is serious about competing on the GPU front in the long run. Allegedly Intel now invests several times more in dGPU R&D than Nvidia and AMD do combined. All this money should show some decent results, if not yet in Arc then maybe down the road in next-gen products.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
Off the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co. are really worried about Intel, as its investment suggests it is serious about competing on the GPU front in the long run. Allegedly Intel now invests several times more in dGPU R&D than Nvidia and AMD do combined. All this money should show some decent results, if not yet in Arc then maybe down the road in next-gen products.

It's like the old joke goes: what one engineer can accomplish in 1 month, two engineers can accomplish in 2 months.

The primary constraint of engineering is time. Money allows for a bigger product to be built, but not necessarily a better product. Given the timeline here, I'm sure the Intel GPU won't be the best. Something weird will happen and not work as expected.

What I'm looking for is a "good first step", not necessarily the best, but a product that shows that Intel knows why NVidia and AMD GPUs have done so well in supercomputing circles. Maybe generation 2 or 3 will be actually competitive.
 
Joined
Nov 23, 2010
Messages
317 (0.06/day)
Specs and PowerPoint slides are great, but let's see what happens when Intel actually ships the silicon...
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
The A100 SM is different from the GA102 SM, as the A100 is more focused on Tensor performance.

Point taken. I accept your benchmark, but note that it's somewhat impractical. I forgot that Ampere cores do the "two FP32 instructions per clock tick" trick (similar to the Pentium's dual-pipeline design way back in the day). That offers a paper gain of 2x peak FP32 FLOPS, but in practice, most code can't take full advantage of it.

Though in practice I still assert that the A100 is superior (again, 80 GB of 2 TB/s HBM2e RAM, ~10 TFLOPS of double-precision performance, etc.). Almost any GPU programmer would rather have twice the SMs/cores than double the resources spent per SM.

In any case, the A100 is still the king of NVidia's lineup. It's a few years old, however.
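
A concrete way to see the "resources per core" point is bytes of memory bandwidth per peak FP32 FLOP (a quick sketch using the published bandwidth and peak figures):

Code:
# Bytes of DRAM bandwidth per peak FP32 FLOP -- a rough proxy for how well each
# chip can feed bandwidth-bound HPC kernels
cards = {
    "A100 80GB (HBM2e)": (2039e9, 19.5e12),   # (bytes/s, FLOP/s)
    "RTX 3090 (GDDR6X)": (936e9, 35.6e12),
}
for name, (bw, flops) in cards.items():
    print(f"{name}: {bw / flops:.3f} bytes per FLOP")
# -> ~0.105 for the A100 vs ~0.026 for the 3090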
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H150i Elite LCD XT White
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB, Toshiba N300 10TB HDD, Seagate Ironwolf 4T HDD
Display(s) Acer Predator X32FP 32in 160Hz 4K FreeSync/GSync DP, LG 32UL950 32in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
AMD's Instinct MI200 'Aldebaran' is already being shipped to customers. LOL. The MI200 does more than 50 TFLOPS of FP32.

45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.



It can peak at 36 TF, but in practice, for real-world workloads, that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900 XT.
The ray-tracing denoise compute shader runs on the CUDA cores, hence the RTX 3090's TFLOPS advantage shows up there.

DirectStorage decompression on PC is done via the Compute Shader (GPGPU) path.

Mesh Shaders (similar to Compute Shaders) run on the CUDA cores, hence the RTX 3090's advantage shows up there.

Please stop writing fake info! If the 3090 really had 36 TF, it would have to get more than double the FPS in games of an RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
The 3090's TFLOPS are real via the Compute Shader (GPGPU) path. The Pixel Shader path is bottlenecked by the raster hardware.

Nope, the RTX 3090 has ~36 TFLOPS of FP32. Tensor TFLOPS are quoted for low-precision formats like INT4 or INT8; obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess Nvidia doesn't care about FP64 performance anymore after the Titan X Maxwell.
Tensor cores are for packed math (INT4, INT8, INT16, and FP16 with FP32 results) and are less flexible than CUDA cores.
 
Joined
Sep 17, 2014
Messages
22,472 (6.03/day)
Location
The Washing Machine
Processor 7800X3D
Motherboard MSI MAG Mortar b650m wifi
Cooling Thermalright Peerless Assassin
Memory 32GB Corsair Vengeance 30CL6000
Video Card(s) ASRock RX7900XT Phantom Gaming
Storage Lexar NM790 4TB + Samsung 850 EVO 1TB + Samsung 980 1TB + Crucial BX100 250GB
Display(s) Gigabyte G34QWC (3440x1440)
Case Lian Li A3 mATX White
Audio Device(s) Harman Kardon AVR137 + 2.1
Power Supply EVGA Supernova G2 750W
Mouse Steelseries Aerox 5
Keyboard Lenovo Thinkpad Trackpoint II
Software W11 IoT Enterprise LTSC
Benchmark Scores Over 9000
45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.



It can peak at 36 TF, but in practice, for real-world workloads, that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900 XT.

Well put :D Raja definitely has the biggest one now.
 
Joined
May 2, 2017
Messages
7,762 (2.80/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
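
Putting rough numbers on the efficiency argument (a sketch; the 300 W figures are the published board powers for the PCIe A100 and the MI100, and 600 W is the rumored figure for this early Ponte Vecchio silicon):

Code:
# Peak FP32 per watt, from published board powers and the rumored 600 W figure
parts = {
    "Ponte Vecchio (early silicon, rumored)": (45.0, 600),   # (TFLOPS, watts)
    "A100 80GB PCIe":                         (19.5, 300),
    "MI100":                                  (23.1, 300),
}
for name, (tflops, watts) in parts.items():
    print(f"{name}: {1000 * tflops / watts:.0f} GFLOPS per watt")
# -> ~75 vs ~65 vs ~77 GFLOPS/W -- not much of a step in efficiency terms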
 
Joined
May 3, 2018
Messages
2,881 (1.20/day)
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
I think that's the point. Intel just releasing a competitive product is good news for the market. Trying to beat others in every metric in their first go is a recipe for unending delays. Worry about power on the next iteration.

I'm hoping Intel DG2 is 3070-level performance at a 15% lower price. Even if it uses more power, it will sell well if they can build reasonable supply and the drivers are stable and updated regularly.
 
Joined
Jun 5, 2021
Messages
284 (0.22/day)
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
Intel can reach 500 teraflops by stacking even more... Nvidia has no chance if they don't go MCM or stacking.
 
Joined
May 2, 2017
Messages
7,762 (2.80/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
Intel can reach 500 teraflops by stacking even more... Nvidia has no chance if they don't go MCM or stacking.
Lol, at what power draw? I mean, sure, we'll probably get there eventually, but that's years and years in the future. And everyone is going MCM in the next generation or two.
 
Joined
Jan 4, 2021
Messages
1 (0.00/day)
Re: comparing FP32 on GA100 (TSMC, A100) vs GA102 (Samsung, 3090/3080 Ti): the folks who buy A100s for single-precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 FP32 TFLOPS (on the A100) comes from its "legacy" (general-purpose) FP32 compute cores.

The 3090 runs its FP32 cores at higher clock speeds and can use its 32-bit integer path to double up the old (Turing) FP32 rate, hence that 36 TFLOPS. But as dragontamer5788 notes, to feed those cores you need bandwidth. The A100 gets it from its magical but pricey HBM2e; the 3090 brute-forces the issue with GDDR6X on a 384-bit bus.

PS: if you think 3090s cost a lot, try finding the price of an A100!
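
For reference, the bandwidth gap being described works out like this (simple arithmetic from the published bus widths and per-pin data rates):

Code:
# DRAM bandwidth = bus width (bits) / 8 * per-pin data rate (Gbit/s)
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(bandwidth_gb_s(384, 19.5))   # RTX 3090: 384-bit GDDR6X @ 19.5 Gbps -> 936 GB/s
print(bandwidth_gb_s(5120, 3.2))   # A100 80GB: 5120-bit HBM2e @ ~3.2 Gbps -> ~2048 GB/s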
 
Joined
Jan 8, 2017
Messages
9,438 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
the folks who buy A100s for single-precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 FP32 TFLOPS (on the A100) comes from its "legacy" (general-purpose) FP32 compute cores.

"Would/should be using" doesn't make sense, there are many workloads that cannot be accelerated by tensor ops.
 