Tuesday, August 18th 2020
Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU
Raja Koduri, Intel's chief architect and senior vice president of its discrete graphics division, today held a talk at Hot Chips 32, this year's edition of the conference held online, showcasing the latest architectural advancements in the semiconductor industry. Intel prepared two talks, one covering Ice Lake-SP server CPUs and one covering its efforts toward the upcoming graphics card launch. So what has Intel been working on all this time? Raja Koduri took the stage to show benchmarks of the upcoming GPU and reveal just how much raw compute power it possesses, a figure that may reach into the PetaFLOPs.
When Mr. Koduri got to talk, he pulled the 4-tile Xe HP GPU out of his pocket and showed for the first time what the chip looks like. And it is one big chip. Featuring four tiles, the GPU represents Intel's biggest and fastest Xe HP variant. The benchmark Intel ran was designed to show off scaling on the Xe architecture, demonstrating how increasing the number of tiles translates into a near-linear increase in performance. Running on a single tile, the GPU delivered 10,588 GFLOPs, or around 10.588 TeraFLOPs. With two tiles, performance scales almost perfectly to 21,161 GFLOPs (21.161 TeraFLOPs), a 1.999X improvement. With four tiles, the GPU achieves 3.993X scaling and scores 41,908 GFLOPs, or 41.908 TeraFLOPs, all measured in single-precision FP32.

Mr. Koduri mentioned that the 4-tile chip is capable of "PetaFLOPs performance," which suggests the GPU is going to be incredibly fast for tasks like machine learning and AI. Given that the GPU supports tensor (matrix) operations, if we assume it has 2048 Execution Units (EUs), each capable of 128 operations per cycle, and roughly two FMA (Fused Multiply-Add) pipes per EU, that works out to about 524,288 operations per clock cycle. This means the GPU would need to be clocked at around 2 GHz to reach the PetaFLOP-scale target, or offer even higher per-cycle throughput.
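As a quick sanity check on that back-of-the-envelope math, here is a minimal sketch assuming the figures above (2048 EUs, 128 ops per cycle, two FMA pipes per EU) are the right way to read the slide; these are assumptions for illustration, not confirmed Xe HP specifications:

```python
# Back-of-the-envelope check of the "PetaFLOPs scale" claim.
# All inputs are assumptions taken from the paragraph above, not confirmed Intel specs.

EUS = 2048            # assumed EU count of the 4-tile Xe HP part
OPS_PER_CYCLE = 128   # assumed ops per cycle per FMA pipe
FMA_PIPES = 2         # assumed FMA pipes per EU

ops_per_clock = EUS * OPS_PER_CYCLE * FMA_PIPES      # 524,288 ops per clock
clock_for_one_petaop = 1e15 / ops_per_clock          # clock (Hz) needed for 1 PetaOP/s

print(f"ops per clock:        {ops_per_clock:,}")
print(f"clock for 1 PetaOP/s: {clock_for_one_petaop / 1e9:.2f} GHz")
# -> 524,288 ops per clock, ~1.91 GHz, i.e. roughly the 2 GHz mentioned above
```

At that roughly 2 GHz clock, the same assumptions yield about 1.05 quadrillion operations per second, which is where the "PetaFLOPs scale" wording would come from.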
Source: Tom's Hardware
32 Comments on Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU
I hope the industry sees a comeback until the score is settled...
Intel's main advantage is EMIB vs. TSV-based architectures. Intel's is clearly better, though AMD has taken great strides and knows the ins and outs of the technology very well. AMD can rain on Intel's parade anytime an opportunity presents itself.
www.techpowerup.com/245521/on-the-coming-chiplet-revolution-and-amds-mcm-promise
I only said that his expression in this picture says it all about how he probably feels now about this move. He left way before it was clear that Intel was going under and AMD was rising above them. Kinda strange that you have to point out something from my post that I never argued.
I mean, heck, Xe could arguably be the continuation of Larrabee / Xeon Phi, since it's simply Intel's next coprocessor. Granted, they're starting over from scratch on this one (or at least, building on their Gen11 architecture), but this isn't the first time Intel has tried to enter the high-end coprocessor market.
It's basically looking like Kodak, which tried really hard to pi** against the wind, only to capitulate later; the train had already left the station, and Kodak left the building a bit later too..
Trying to force any market to do whatever you want is a really, really stupid idea, much like mankind trying to do the same with nature. It has never worked and never will. And it always comes back and bites anyone who tries it.
However, 4-bit neural networks do exist. If the GPU provides 8x FMA instructions per FP32 unit, that's 128 4-bit tensor ops per cycle, leading to... roughly 1 PetaFLOP. Except it's not a "FLOP", it's an "IOP" (integer op), and only 4-bit at that (unless there's some 4-bit floating-point format I haven't heard of before...). It's a stretch for sure, but neural nets are popular enough that it might be realistic for one or two customers out there...
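For what it's worth, that reading lines up with the article's own arithmetic if each EU's two FMA pipes are counted as the "FP32 units." A minimal sketch under that assumption (2048 EUs, two pipes per EU, 128 packed 4-bit ops per pipe per cycle, a clock near 2 GHz; none of these are confirmed Xe HP specs):

```python
# Rough INT4 throughput estimate under the commenter's assumptions.
# Every number here is an assumption for illustration, not a confirmed spec.

EUS = 2048               # assumed EU count of the 4-tile part
PIPES_PER_EU = 2         # assumed FMA pipes per EU (the "FP32 units" above)
INT4_OPS_PER_PIPE = 128  # assumed packed 4-bit tensor ops per pipe per cycle
CLOCK_HZ = 2.0e9         # assumed ~2 GHz clock

int4_ops_per_second = EUS * PIPES_PER_EU * INT4_OPS_PER_PIPE * CLOCK_HZ
print(f"{int4_ops_per_second / 1e15:.2f} peta-ops/s of 4-bit integer throughput")
# -> ~1.05 peta-ops/s, i.e. "roughly 1 Peta" -- but of IOPs, not FLOPs
```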