Friday, August 20th 2021

Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

Intel, in its 2021 Architecture Day presentation, put out fine-grained technical details of its Xe HPC Ponte Vecchio accelerator, including some very preliminary performance claims for its current A0-silicon-based prototype. The prototype operates at just 1.37 GHz, yet achieves at least 45 TFLOPs of FP32 throughput. We calculated the clock speed with some simple math. Intel obtained the 45 TFLOPs number on a machine running a single Ponte Vecchio OAM (one MCM with two stacks) and a Xeon "Sapphire Rapids" CPU. At 45 TFLOPs, the processor already beats the advertised 19.5 TFLOPs of the NVIDIA "Ampere" A100 Tensor Core 40 GB processor. AMD isn't faring any better, with its production Instinct MI100 processor offering only 23.1 TFLOPs FP32.
"A0 silicon" refers to the first batch of chips that comes back from the foundry after tape-out. It's a prototype that is likely circulated internally within Intel, and to a very exclusive group of ISVs and industry partners under very strict NDAs. It is common practice to ship prototypes, at least to ISVs, at significantly lower clock speeds than the silicon is capable of, so they can test for functionality and begin developing software for it.
Our math for the clock speed is as follows. Intel, in the presentation, mentions that each package (OAM) puts out a throughput of 32,768 FP32 ops per clock cycle. It also says that a 2-stack (one package) amounts to 128 Xe-cores, and that each Xe HPC core's Vector Engine offers 256 FP32 ops per clock cycle. These multiply out to 32,768 FP32 ops/clock for one package (a 2-stack). From here, 45,000 GFLOPs (measured in clpeak, by the way) divided by 32,768 FP32 ops/clock works out to a 1,373 MHz clock speed. A production stepping will likely have higher clock speeds, and throughput scales linearly with clock, but even 1.37 GHz seems like a number Intel could finalize on, given the sheer size and "weight" (power draw) of the silicon (rumored to be 600 W for A0). All this power comes at great thermal cost, with Intel requiring liquid cooling for the OAMs. If these numbers make it into the final product, then Intel has broken into the HPC space in a big way.
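The back-of-envelope math above can be reproduced in a few lines of Python; the core counts and the 45,000 GFLOPs clpeak figure come from Intel's presentation, while the variable names are ours:

```python
# Ponte Vecchio clock-speed estimate from Intel's published figures.
xe_cores_per_stack = 64          # 128 Xe-cores per 2-stack package
stacks_per_package = 2
fp32_ops_per_core_clock = 256    # per Xe HPC core Vector Engine

ops_per_clock = xe_cores_per_stack * stacks_per_package * fp32_ops_per_core_clock
assert ops_per_clock == 32_768   # matches Intel's per-package number

measured_gflops = 45_000         # clpeak FP32 result for one OAM
clock_mhz = measured_gflops * 1e9 / ops_per_clock / 1e6
print(round(clock_mhz))          # ~1373 MHz, i.e. 1.37 GHz
```

The same formula run in reverse (ops/clock × clock) is how peak-TFLOPs marketing numbers for any GPU are derived.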

48 Comments on Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

#26
jesdals
Nice to see that Intel's finest wine only needs water cooling... but hey, any GPU news of new cards is good news :)
Posted on Reply
#27
dragontamer5788
nguyenobviously A100 is designed for different type of workload that don't depend on FP32 or FP64 so much
lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/

The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top of the top, the $10,000+ NVidia GPU for serious work. It's NVidia's best card.

I don't know where you're getting the bullshit numbers that the 3090 is faster than an A100, but... it's just not true. Under any reasonable benchmark, like Linpack, the A100 is something like 10-ish TFlops double-precision and 20-ish TFlops single-precision.
Posted on Reply
#28
nguyen
dragontamer5788The A100 beats the 3090 in pretty much every benchmark across the board by a significant margin. The A100 is the top of the top, the $10,000+ NVidia GPU for serious work. It's NVidia's best card.
Yeah you just proved my point, A100 is designed for deep learning and not just FP32/64 performance
Posted on Reply
#29
dragontamer5788
nguyenYeah you just proved my point, A100 is designed for deep learning and not just FP32/64 performance
A100 has 108 SMs, the 3090 only has 84 SMs. (128 is the full die, but it seems like NVidia expects some area of the die to be defective.)

The A100 is in a completely different class than the 3090, even in FP32 performance. It's not even close, before you even factor in the 2 TB/s 80 GB HBM2e sitting on the die.

------

Note that the 3090 runs FP64 at a laughable 1/64th rate. It's terrible at scientific compute. The A100 is full speed (well, 1/2 speed, 10 TFlops) at double precision.
Posted on Reply
#30
nguyen
dragontamer5788A100 has 108 SMs, the 3090 only has 84 SMs. (128 is the full die, but it seems like NVidia expects some area of the die to be defective.)

The A100 is in a completely different class than the 3090, even in FP32 performance. It's not even close, before you even factor in the 2 TB/s 80 GB HBM2e sitting on the die.

------

Note that the 3090 runs FP64 at a laughable 1/64th rate. It's terrible at scientific compute. The A100 is full speed (well, 1/2 speed, 10 TFlops) at double precision.

My overclocked 3090 is getting 40.6 TFLOPS of single precision (FP32) and only 660 GFLOPS of FP64

A100 SM is different from GA102 SM as A100 is more focused on Tensor performance
GA102 SM vs A100 SM
Posted on Reply
#31
TumbleGeorge
nguyenNope, RTX 3090 has ~36TFLOPS of FP32, Tensor TFLOPS is something like INT4 or INT8, obviously A100 is designed for different type of workload that don't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32, I guess Nvidia doesn't care about FP64 performance anymore after Titan X Maxwell
Please stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games of the RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
Posted on Reply
#32
Vya Domus
45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.
TumbleGeorgePlease stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games of the RX 6800 (16.2 TF).
It can peak at 36TF but in practice for real world workloads that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900XT.
Posted on Reply
#33
jabbadap
TumbleGeorgePlease stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games of the RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
To be fair, rtx 3090 would get those numbers on this same clpeak benchmark, which Intel uses. So numbers are apples to apples, product audience is just wholly different and thus comparing them is just academic.
Posted on Reply
#34
iO
True competition will be Hopper and MI200.
I mean, it's a marvel of engineering and 45 TFlops is a lot, but the A100 was announced in May 2020. Beating a 2-year-old card when this finally gets released sometime in 2022 sounds somewhat less impressive...
Posted on Reply
#35
RedelZaVedno
Off the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co. are really worried about Intel, as its investments suggest it is serious about competing on the GPU front in the long run. Allegedly Intel now invests a couple of times more in dGPU R&D than Nvidia and AMD combined. All this money should show some decent results, if not yet in Arc, then maybe down the road in next-gen products.
Posted on Reply
#36
dragontamer5788
RedelZaVednoOff the record, coming from RedGamingTech's Ngreedia insider leaks... Jensen & Co. are really worried about Intel, as its investments suggest it is serious about competing on the GPU front in the long run. Allegedly Intel now invests a couple of times more in dGPU R&D than Nvidia and AMD combined. All this money should show some decent results, if not yet in Arc, then maybe down the road in next-gen products.
It's like the old joke goes: what an engineer can accomplish in 1 month, two engineers can accomplish in 2 months.

The primary constraint of engineering is time. Money allows for a bigger product to be built, but not necessarily a better product. Given the timeline here, I'm sure the Intel GPU won't be the best. Something weird will happen and not work as expected.

What I'm looking for is a "good first step", not necessarily the best, but a product that shows that Intel knows why NVidia and AMD GPUs have done so well in supercomputing circles. Maybe generation 2 or 3 will be actually competitive.
Posted on Reply
#37
chodaboy19
Specs and powerpoint slides are great, but let's see what happens when intel actually ships the silicon...
Posted on Reply
#38
dragontamer5788
nguyenA100 SM is different from GA102 SM as A100 is more focused on Tensor performance
Point. I accept your benchmark, but note that it's somewhat impractical. I forgot that Ampere cores do "two FP32 instructions per clock tick" (similar to the Pentium's dual-pipeline design way back in the day). That offers a paper gain of 2x peak FP32 flops, but in practice, most code can't take advantage of it entirely.

Though in practice I still assert that the A100 is superior (again, 80 GB of 2 TB/s HBM2e RAM, 10 TFlops of double-precision performance, etc. etc.). Almost any GPU programmer would rather have twice the SMs / cores rather than double the resources spent per SM.

In any case, the A100 is still king of NVidia's lineup. It's a few years old, however.
Posted on Reply
#39
ValenOne
AMD's Instinct MI200 "Aldebaran" is already being shipped to customers. LOL. The MI200 does more than 50 TFlops FP32.
Vya Domus45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.

It can peak at 36TF but in practice for real world workloads that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900XT.
Raytracing denoise compute shaders run on CUDA cores, hence the RTX 3090's TFLOPS advantage shows.

DirectStorage decompression on PC is done via the Compute Shader (GPGPU) path.

Mesh Shaders (similar to Compute Shaders) run on CUDA cores, hence the RTX 3090's advantage shows there too.
TumbleGeorgePlease stop writing fake info! If the 3090 really had 36 TF, it would have to deliver more than double the FPS in games of the RX 6800 (16.2 TF). That is not the case. Please stop the green fakery!
The 3090's TF is real via the Compute Shader (GPGPU) path; the Pixel Shader path is bottlenecked by raster hardware.
nguyenNope, RTX 3090 has ~36TFLOPS of FP32, Tensor TFLOPS is something like INT4 or INT8, obviously A100 is designed for different type of workload that don't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32, I guess Nvidia doesn't care about FP64 performance anymore after Titan X Maxwell
Tensor cores are for packed math (INT4, INT8, INT16, and FP16 with FP32 results), and they're less flexible than CUDA cores.
Posted on Reply
#40
Vayra86
Vya Domus45 TFLOPs is a lot but hardly impressive for something the size of a slice of ham.

It can peak at 36TF but in practice for real world workloads that number is probably close to half of that because of the way each SM works, hence why it's not much faster than a 6900XT.
Well put :D Raja definitely has the biggest one now.
Posted on Reply
#41
Richards
nguyenhuh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
It beats the A100.
Posted on Reply
#42
Minus Infinity
btarunrFP64 is 1:1 FP32. So its FP64 throughput is identical.

It has all the ingredients to be a cloud gaming GPU.
Wow, very impressive. Thanks for the heads up.
Posted on Reply
#43
Valantar
While it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
Posted on Reply
#44
Minus Infinity
ValantarWhile it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
I think that's the point. Intel just releasing a competitive product is good news for the market. Trying to beat others in every metric in their first go is a recipe for unending delays. Worry about power on the next iteration.

I'm hoping Intel DG2 offers 3070-level performance at a 15% lower price, even if it uses more power. It will sell well if they can build reasonable supply and the drivers are stable and updated regularly.
Posted on Reply
#45
Richards
ValantarWhile it's no doubt impressive that Intel is actually looking competitive (considering they literally started from nothing a few years ago), the PR spin in this needs some correcting. "Our yet-to-be-launched product at 600W beats competitors' already launched 300W products by ~1.9-2.1x" is hardly that impressive, even if this is at preproduction clocks, especially when said competitors are likely to launch significantly faster solutions in the same time frame. If clocks go up, power does too, and 600W is already insane for a single package, mostly negating the compute density advantage due to the size and complexity of cooling needed.

Still, competition is always good. It'll be interesting to see how this plays out.
Intel can reach 500 teraflops by stacking even more.. Nvidia has no chance if they don't go MCM or stacking
Posted on Reply
#46
Valantar
RichardsIntel can reach 500 teraflops by stacking even more.. Nvidia has no chance if they don't go MCM or stacking
Lol, at what power draw? I mean, sure, we'll probably get there eventually, but that's years and years in the future. And everyone is going MCM in the next generation or two.
Posted on Reply
#47
nnunn
Re: comparing fp32 on GA100 (TSMC A100) vs GA102 (Samsung 3090/3080 Ti), the folks who buy A100's for single precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 fp32 TFLOPS (on the A100) comes from its "legacy" (general purpose) fp32 compute cores.

The 3090 runs its fp32 cores at higher clock speeds, and can use its 32-bit integer path to double up the old (Turing) fp32 rate, hence that 36 TFLOPS. But as dragontamer5788 notes, to feed these cores, you need bandwidth. The A100 gets this from its magical but pricey HBM2e; the 3090 forces things by using GDDR6X on a 384-bit bus.
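The ~36 TFLOPS figure for the 3090 falls out of the same ops/clock × clock arithmetic. A hypothetical back-of-envelope sketch, using publicly listed spec numbers as assumptions rather than measured values:

```python
# Rough peak-FP32 estimate for the RTX 3090 from public specs.
cuda_cores = 10_496        # GA102 shader count as shipped on the 3090
                           # (already includes the doubled FP32 path)
flops_per_core_clock = 2   # a fused multiply-add counts as 2 FLOPs
boost_ghz = 1.7            # approximate real-world boost clock

peak_tflops = cuda_cores * flops_per_core_clock * boost_ghz / 1000
print(round(peak_tflops, 1))  # ≈ 35.7 TFLOPS, in line with the ~36 quoted above
```

As noted above, this is a paper peak; sustaining it requires enough memory bandwidth to keep the cores fed.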

PS: if you think 3090s cost a lot, try to find the price for an A100!
Posted on Reply
#48
Vya Domus
nnunnthe folks who buy A100's for single precision work would/should be using the 156 TFLOPS from its matrix engines. That 19.5 fp32 TFLOPS (on the A100) comes from its "legacy" (general purpose) fp32 compute cores.
"Would/should be using" doesn't make sense, there are many workloads that cannot be accelerated by tensor ops.
Posted on Reply