Tuesday, August 18th 2020

Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU

Aug 18th, 2020 01:04

Raja Koduri, Intel's chief architect and senior vice president of its discrete graphics division, gave a talk today at Hot Chips 32, an online conference held in 2020 that showcases the latest architectural advancements in the semiconductor industry. Intel prepared two talks, one about Ice Lake-SP server CPUs and one about its upcoming graphics card launch. So what has Intel been working on this whole time? In his talk, Koduri presented benchmarks of the upcoming GPU that show how much raw power it possesses, possibly reaching into the PetaFLOPs range.

When Mr. Koduri began his talk, he pulled the 4-tile Xe HP GPU out of his pocket and showed for the first time what the chip looks like. And it is one big chip. Featuring 4 tiles, the GPU represents Intel's fastest and biggest variant of the Xe HP GPUs. The benchmark Intel ran was designed to show off scaling on the Xe architecture, and how increasing the number of tiles produces a near-proportional increase in performance. Running on a single tile, the GPU delivered 10588 GFLOPs or around 10.588 TeraFLOPs. With two tiles, performance scales almost perfectly to 21161 GFLOPs (21.161 TeraFLOPs), a 1.999X improvement. At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32.
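The scaling claim can be checked directly from the quoted throughput numbers. The short Python sketch below divides each multi-tile result by the single-tile result and also reports scaling efficiency (speedup divided by tile count). The function name `scaling` and the dictionary layout are illustrative choices, not anything Intel published.

```python
# Throughput figures exactly as quoted in the article (GFLOPs, FP32).
measured_gflops = {1: 10588, 2: 21161, 4: 41908}


def scaling(measured, baseline_tiles=1):
    """Return {tiles: (speedup, efficiency)} relative to the baseline entry."""
    base = measured[baseline_tiles]
    result = {}
    for tiles, gflops in sorted(measured.items()):
        speedup = gflops / base
        # Efficiency is 1.0 when performance grows exactly with tile count.
        efficiency = speedup / (tiles / baseline_tiles)
        result[tiles] = (speedup, efficiency)
    return result


for tiles, (speedup, eff) in scaling(measured_gflops).items():
    print(f"{tiles} tile(s): {speedup:.3f}x speedup, {eff:.1%} efficiency")
```

With the figures as quoted, the two-tile ratio comes out at about 1.999, matching the article, while the four-tile ratio comes out nearer 3.96 than the quoted 3.993, so one of the two four-tile numbers may have been transcribed imprecisely. Either way, the efficiency stays at roughly 99% or better.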

Mr. Koduri mentioned that the 4-tile chip is capable of "PetaFLOPs performance", which means that the GPU is going to be incredibly fast for tasks like machine learning and AI. Given that the GPU supports tensor cores, we can make a rough estimate: if it has 2048 execution units (EUs), each capable of performing 128 operations per cycle (128 TOPs), and there are about 2 FMA (Fused Multiply-Add) units, that equals about 524,288 FLOPs of AI power per clock cycle (2048 × 128 × 2). This means that the GPU needs to be clocked at roughly 2 GHz to achieve the PetaFLOP performance target (524,288 × 2 GHz ≈ 1.05 PetaFLOPs), or have more than 128 TOPs of computing ability.
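The back-of-the-envelope estimate above is easy to reproduce. The Python sketch below uses only the article's own assumptions (2048 EUs, 128 operations per EU per cycle, a factor of 2 for FMA); none of these are confirmed Intel specifications, and the function names are hypothetical.

```python
# Assumptions taken from the article's estimate, NOT from Intel specifications.
EXECUTION_UNITS = 2048       # EUs across all four tiles
OPS_PER_EU_PER_CYCLE = 128   # low-precision operations per EU per clock
FMA_FACTOR = 2               # the article's factor of 2 for FMA


def ops_per_cycle(eus=EXECUTION_UNITS, ops=OPS_PER_EU_PER_CYCLE, fma=FMA_FACTOR):
    """Operations the whole GPU would complete in one clock cycle."""
    return eus * ops * fma


def required_clock_hz(target_ops_per_second, per_cycle):
    """Clock frequency needed to reach a given throughput target."""
    return target_ops_per_second / per_cycle


per_cycle = ops_per_cycle()
print(f"operations per cycle: {per_cycle:,}")
print(f"throughput at 2 GHz: {per_cycle * 2e9 / 1e15:.3f} PetaOPs")
print(f"clock needed for 1 PetaOP: {required_clock_hz(1e15, per_cycle) / 1e9:.3f} GHz")
```

Under these assumptions the chip would need a clock of about 1.9 GHz to cross the one-PetaFLOP line, which is where the article's "at least 2 GHz" figure comes from.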

Source: Tom's Hardware

32 Comments on Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU

laszlo

Didn't know Intel started to make tiles; roof covering is quite a good business now, depending on how long their tiles will last..

Vya Domus

10 TFLOPS per tile is rather unimpressive, you could have gotten that performance from a single GPU 4 years ago. It's kind of a waste for a MCM design, that should be reserved for when the independent GPUs are already as fast as possible.

HD64G

So, GPU compute workloads are parallel and when many GPU cores are combined, the resulting performance is almost perfectly proportional to their number. Who knew?

Metroid

Vya Domus: 10 TFLOPS per tile is rather unimpressive, you could have gotten that performance from a single GPU 4 years ago. It's kind of a waste for a MCM design, that should be reserved for when the independent GPUs are already as fast as possible.

en.wikipedia.org/wiki/Ampere_(microarchitecture)

Architecture	FP32 CUDA Cores	Boost Clock	Memory Clock	Memory Bus Width	Memory Bandwidth	VRAM	Single Precision	Double Precision	INT8 Tensor	FP16 Tensor	bfloat16 Tensor	TensorFloat-32(TF32) Tensor	FP64 Tensor	Interconnect	GPU	GPU Die Size	Transistor Count	TDP	Manufacturing Process
Ampere	6912	1410MHz	2.4Gbps HBM2	5120-bit	1555GB/sec	40GB	19.5 TFLOPs	9.7 TFLOPs	624 TOPs	312 TFLOPs	312 TFLOPs	156 TFLOPs	19.5 TFLOPS	600GB/sec	GA100	826mm2	54.2B	400W	TSMC 7nm N7
Volta	5120	1530MHz	1.75Gbps HBM2	4096-bit	900GB/sec	16GB/32GB	15.7 TFLOPs	7.8 TFLOPs	N/A	125 TFLOPs	N/A	N/A	N/A	300GB/sec	GV100	815mm2	21.1B	300W/350W	TSMC 12nm FFN
Pascal	3584	1480MHz	1.4Gbps HBM2	4096-bit	720GB/sec	16GB	10.6 TFLOPs	5.3 TFLOPs	N/A	N/A

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.
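As a rough cross-check of the single-precision column in the table quoted above, peak FP32 throughput is commonly estimated as cores × 2 × boost clock, counting one fused multiply-add as two operations. The Python sketch below applies that rule of thumb to the core counts and clocks from the table; the function name `peak_fp32_tflops` is hypothetical.

```python
def peak_fp32_tflops(cuda_cores, boost_mhz):
    """Theoretical peak FP32 TFLOPs: cores x 2 ops (one FMA) x clock."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


# (FP32 CUDA cores, boost clock in MHz) as listed in the table above.
parts = {
    "Ampere": (6912, 1410),
    "Volta": (5120, 1530),
    "Pascal": (3584, 1480),
}

for name, (cores, mhz) in parts.items():
    print(f"{name}: {peak_fp32_tflops(cores, mhz):.2f} TFLOPs")
```

The results round to the 19.5, 15.7, and 10.6 TFLOPs listed in the table, so those figures are theoretical peaks rather than measured results like Intel's.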

Vayra86

HD64G: So, GPU compute workloads are parallel and when many GPU cores are combined, the resulting performance is almost perfectly proportional to their number. Who knew?

But look at it scale man! Never mind the fact power consumption is obviously also x2

Regardless, Intel does seem to have an MCM solution running, so they're doing something.

ZoneDymo

"10588 GFLOPs or around 10.588 TeraFLOPs. "

this is just funny

It manages about 1km or about 1000m or about 100000cm

Vayra86

Metroid: en.wikipedia.org/wiki/Ampere_(microarchitecture)

Architecture	FP32 CUDA Cores	Boost Clock	Memory Clock	Memory Bus Width	Memory Bandwidth	VRAM	Single Precision	Double Precision	INT8 Tensor	FP16 Tensor	bfloat16 Tensor	TensorFloat-32(TF32) Tensor	FP64 Tensor	Interconnect	GPU	GPU Die Size	Transistor Count	TDP	Manufacturing Process
Ampere	6912	1410MHz	2.4Gbps HBM2	5120-bit	1555GB/sec	40GB	19.5 TFLOPs	9.7 TFLOPs	624 TOPs	312 TFLOPs	312 TFLOPs	156 TFLOPs	19.5 TFLOPS	600GB/sec	GA100	826mm2	54.2B	400W	TSMC 7nm N7
Volta	5120	1530MHz	1.75Gbps HBM2	4096-bit	900GB/sec	16GB/32GB	15.7 TFLOPs	7.8 TFLOPs	N/A	125 TFLOPs	N/A	N/A	N/A	300GB/sec	GV100	815mm2	21.1B	300W/350W	TSMC 12nm FFN
Pascal	3584	1480MHz	1.4Gbps HBM2	4096-bit	720GB/sec	16GB	10.6 TFLOPs	5.3 TFLOPs	N/A	N/A

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.

Dunno. Nvidia dominates the space not just because of raw TFLOPs; it does so because it provides software frameworks for its hardware as well. There's a lot more to this than just dropping a number on the floor and saying 'look, it's fast'. Yeah... theoretically. Nvidia has been doing more with less for decades now. In addition, do you know the floor plan for this Xe GPU? It sure as hell is large. 4 tiles = 4 perfect dies.

In addition, between this and their IGPs I'm not seeing how Xe is going to make waves for gaming. This is certainly not the trimmed down gaming die we'll get... but what dó we get? 16 tiles of IGP? :p

All in all, cool test, but pretty pointless and a great way of telling us nothing.

Metroid

Vayra86: Dunno. Nvidia dominates the space not just because of raw TFLOPs; it does so because it provides software frameworks for its hardware as well. There's a lot more to this than just dropping a number on the floor and saying 'look, it's fast'. Yeah... theoretically. Nvidia has been doing more with less for decades now. In addition, do you know the floor plan for this Xe GPU? It sure as hell is large. 4 tiles = 4 perfect dies.

The impressive part of my post still stands. Turning that into a gaming machine that will rival AMD and Nvidia is a totally different thing. I do hope it happens; Nvidia wanting to charge $2k for their 3090 is enough. More competition, better prices, we all win.

To add to my post: last I heard, 4 tiles would be around 500 watts. Imagine cooling that thing down, hehe. I mean, cooling a Threadripper is hard enough, and TR is around 500 watts. I have no idea what they will do. I think they will release 2-tile and 1-tile GPUs, or take a more manufacturing-friendly approach with only a 2-tile GPU and price it lower than the competition. We will see. I hope they get it right, we need this, but knowing Intel, they always price their things very high compared to the competition.

Vya Domus

Metroid: Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.

In other words what Intel can do with 2 GPUs Nvidia can with 1. That's not impressive no matter how you spin it, Nvidia ships 8 GPUs per board I believe, I can't imagine Intel shipping more than 2 per board so in the end Nvidia is still going to have the performance density advantage.

#10

Metroid

Vya Domus: In other words what Intel can do with 2 GPUs Nvidia can with 1. That's not impressive no matter how you spin it, Nvidia ships 8 GPUs per board I believe, I can't imagine Intel shipping more than 2 per board so in the end Nvidia is still going to have the performance density advantage.

I have no idea if 1 Intel tile = Ampere, I mean in size; I do think Ampere matches 2 Intel tiles in size. So in single precision, 2 Intel tiles and Nvidia Ampere are pretty much matched in that sense, and I do find that impressive; this is Intel's first attempt.

#11

stimpy88

More smoke and mirrors from Mr Koduri... The usual promises of greatness, yet when launched, will most likely be an expensive turd for niche use cases.

#12

Fluffmeister

Yeah kinda has a whiff of the Polaris launch about it, why buy a single fast efficient GTX 1080 when you can buy TWO RX 480's instead! I mean look how well they do in everyone's favourite game!

#13

ExcuseMeWtf

Metroid: en.wikipedia.org/wiki/Ampere_(microarchitecture)

Architecture	FP32 CUDA Cores	Boost Clock	Memory Clock	Memory Bus Width	Memory Bandwidth	VRAM	Single Precision	Double Precision	INT8 Tensor	FP16 Tensor	bfloat16 Tensor	TensorFloat-32(TF32) Tensor	FP64 Tensor	Interconnect	GPU	GPU Die Size	Transistor Count	TDP	Manufacturing Process
Ampere	6912	1410MHz	2.4Gbps HBM2	5120-bit	1555GB/sec	40GB	19.5 TFLOPs	9.7 TFLOPs	624 TOPs	312 TFLOPs	312 TFLOPs	156 TFLOPs	19.5 TFLOPS	600GB/sec	GA100	826mm2	54.2B	400W	TSMC 7nm N7
Volta	5120	1530MHz	1.75Gbps HBM2	4096-bit	900GB/sec	16GB/32GB	15.7 TFLOPs	7.8 TFLOPs	N/A	125 TFLOPs	N/A	N/A	N/A	300GB/sec	GV100	815mm2	21.1B	300W/350W	TSMC 12nm FFN
Pascal	3584	1480MHz	1.4Gbps HBM2	4096-bit	720GB/sec	16GB	10.6 TFLOPs	5.3 TFLOPs	N/A	N/A

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.

Fair enough.

We will see if latency in whatever interconnect implementation they use won't be an issue, though I presume Intel engineers thought of that and more already and certainly know better than all of us here.

#14

Mescalamba

The beauty of nVidia being able to power all sorts of processing is that CUDA and similar is actually possible to implement and use, sometimes even for a daily noob user. AMD has mostly "theoretical" power, but actually using it is a lot different, and sadly when actually used in a comparable scenario, they're not where they should be based on pure power. I can't say if AMD is overstating its performance or if just the software part isn't up to it.

Over a long time period I always had the "feeling" AMD (ATi) could do a ton better if they actually got their software side together.. Even games, when tuned right, show that sometimes. It's there, just mostly out of reach. :/

Intel seems to be a bit late to the party again, with the same issues their CPUs have. Too big, too hot. And probably too expensive.

#15

T4C Fantasy

CPU & GPU DB Maintainer

I'm excited for all new gpus coming out, architecture is really intriguing no matter how inferior or superior it is.

#16

Vayra86

Mescalamba: The beauty of nVidia being able to power all sorts of processing is that CUDA and similar is actually possible to implement and use, sometimes even for a daily noob user. AMD has mostly "theoretical" power, but actually using it is a lot different, and sadly when actually used in a comparable scenario, they're not where they should be based on pure power. I can't say if AMD is overstating its performance or if just the software part isn't up to it.

Over a long time period I always had the "feeling" AMD (ATi) could do a ton better if they actually got their software side together.. Even games, when tuned right, show that sometimes. It's there, just mostly out of reach. :/

Intel seems to be a bit late to the party again, with the same issues their CPUs have. Too big, too hot. And probably too expensive.

What strikes me with Intel in all of their new developments is the lack of focus on scalability in terms of yields. Nowhere can we see a straight copy of the idea of chiplets that are as small as possible. They're still trying to make big complicated stuff. Even these tiled GPUs are humongous. They're also differentiating everything all over the place with a myriad of product lines and tweaks... it's like they literally don't WANT to make an efficient, single product stack and derive new products from it - they just build a whole new one for every little segment. The wide variety of core configurations alone... wtf.

Looks like old ideas desperately trying to keep themselves relevant, despite ever increasing foundry challenges. It's like they love to repeat 10nm. Intel seems to be adamant that extreme specialization and tweaking is the way forward... but isn't that a dead end, ultimately, and probably pretty soon?

#17

Caring1

More flops from Intel, I bet no one saw that coming. ;)

#18

DeathtoGnomes

Caring1: More flops from Intel, I bet no one saw that coming. ;)

I did. [I see what you did there!]

I mean, it is a low entry-level design, so for a first attempt we can say "it has potential".

#19

Steevo

A demonstration of a simulation of the possible power of what could be?

Sounds about right for this guy.

#20

Metroid

Steevo: A demonstration of a simulation of the possible power of what could be?
Sounds about right for this guy.

He likes to brag, hehe. I expected no less from Mr Raja Koduri, hehe.

He is under a lot of pressure, I tell you, hehe. Nothing better to show the people at Intel who do not like him that he is doing his job.

#21

PowerPC

Raja Koduri's eyes look like he should have stayed at AMD.

#22

stimpy88

PowerPC: Raja Koduri's eyes look like he should have stayed at AMD.

Him leaving AMD was the best thing that has happened to them since Lisa Su and the Zen architecture.

#23

Zareek

At least this is starting to get interesting! Some more competition in graphics would be really nice...

#24

Blueberries

Having linear scalability is WILD, and 10.5TFLOPS on a single chipset is nothing to scoff at.

I'll rehash what I said when Xe was announced: if Intel doesn't provide a competitive product with their initial launch, they absolutely will with their third or fourth generation.

#25

Steevo

Blueberries: Having linear scalability is WILD, and 10.5TFLOPS on a single chipset is nothing to scoff at.

I'll rehash what I said when Xe was announced: if Intel doesn't provide a competitive product with their initial launch, they absolutely will with their third or fourth generation.

Also a broken clock is right twice a day.
