Monday, January 8th 2024
AVX-512 Doubles Intel 5th Gen "Emerald Rapids" Xeon Processor Performance, Up to 10x Improvement in AI Workloads
According to the latest round of tests by Phoronix, we now have proof of the substantial performance gains Intel's 5th Gen Xeon "Emerald Rapids" server CPUs deliver when employing AVX-512 vector instructions. Enabling AVX-512 doubled throughput on average across a range of workloads, with specific AI tasks accelerating more than 10x versus having it disabled. Running on the top-end 64-core Platinum 8592+ SKU, the benchmarks saw minimal clock frequency differences between the AVX-512 on and off states. The specialized 512-bit vector processing nevertheless unlocked dramatic speedups, exemplified in the OpenVINO AI framework: weld porosity detection, a real-world inspection workload, showed the biggest gains. Power draw also increased moderately - the expected tradeoff for such a performance upside.
With robust optimizations, the potential of the vector engine has now been demonstrated in full. Workloads spanning AI, visualization, simulation, and analytics could multiply their speed by upgrading to Emerald Rapids, although the developer implementation work remains non-trivial. For the data center applications that can take advantage, AVX-512 lets Intel partially close the raw throughput gap versus AMD's core-count leadership; whether those targeted acceleration gains offset EPYC's wider general-purpose value depends on customer workloads. But with tests proving dramatic upside, Intel is betting big on vector acceleration as its ace card - even though AMD also supports the AVX-512 instruction set. Below, you can find the geometric mean of all test results, and check the review with benchmarks here.
Source:
Phoronix
30 Comments on AVX-512 Doubles Intel 5th Gen "Emerald Rapids" Xeon Processor Performance, Up to 10x Improvement in AI Workloads
no one bloody cares
Get to actually matching AMD where it matters, Intel. You're slacking.
It's a "don't talk about the competition if you can't compete" sort of scenario, so you big yourself up against previous/last-gen stuff instead.
The fact that EMR is, on average, worse than equivalent Ryzen Threadripper Pro and EPYC CPUs still stands, however.
Seriously, did anyone carefully read the article and figure out whether the "disabled/off" scenario means falling back to AVX2/AVX/SSE, or disabling all vectorization sets? The compiler flags aren't telling us much.
I'm serious, who does? Is it because of easier scheduling and dividing hardware among different users?
AMD also supports the AVX-512 instructions, but splits some of them up to run on its 256-bit AVX units.
AMD's implementation of AVX-512 is not as simple as splitting it to 256-bit operations. Otherwise the performance increase would not have been as high. Chips and Cheese posted a good analysis of it.
Nice. Intel all but kills AVX-512 because it creates too much heat, plus not much uses it anyway.
Now it's needed for an AI boost?
www.makeuseof.com/what-is-avx-512-why-intel-killing-it/
www.tomshardware.com/news/intel-nukes-alder-lake-avx-512-now-fuses-it-off-in-silicon
Sapphire Rapids uses the same Golden Cove P-cores as Alder Lake, while Emerald Rapids uses Raptor Cove. The decision to remove AVX-512 was made from a segmentation perspective.
The main reason for removing AVX-512 was the inability of E-cores to execute those instructions. I'm not aware of any mainstream operating system that can deal with non-uniform instruction sets, so P-cores had to match them.
12th gen desktop CPUs were also capable of running AVX-512 by disabling the E-cores, but that possibility got taken away with a microcode update, and by physical modification in later steppings/generations.
I'm fairly sure that Linux can, but not Windows - and I don't think that's specifically the reason why, or they could simply have let us use it with the E-cores turned off. Intel has never been a stranger to market segmentation; they've always done it, even when AMD was exerting full pressure on them. There was no physical change to later batches of Alder Lake - and Raptor Lake "Refresh", aka the 14th Gen scam, has no physical differences whatsoever from 13th Gen.
Dang, I thought they killed it because it was pushing core temperatures and voltages too high compared to AMD :p
My bad :slap:
It would have been very difficult to properly support mixed configurations, both from end user and professional perspectives.
"AVX-512 is supported**,
** - but only if you disable some cores"
- would not have been good PR, and a source of constant support issues ;)
Compilers contain architectural optimizations, so it would require maintaining two different optimization levels for this setup. Doable, but I don't think Intel wanted to commit to such complexity. Are you certain enough to provide an example of such a computer? I am not aware of any.
I would also like for Intel to allow this, but I understand the burdens it would bring. I'm just not convinced they explicitly wanted to segment AVX-512 out after investing so much time and money into mainstream hardware and software enablement in their awesome libraries.
I think both sides of this argument make good points. By removing this from chips such as the i9-13900K, they give the 8- and 10-core SPR and EMR CPUs a reason to exist while playing it safe on the compatibility front. Compiler-wise, though, I don't think it'd be much of a burden - the configuration already exists on Xeons anyway.
Android phones, and most other ARM SoCs, can have cores based on differing microarchitectures - for example, Snapdragon 8 Gen 3 has ARM Cortex-X4, Cortex-A720, and Cortex-A520 cores - but they all adhere to the same AArch64 specification level, ARMv9.2-A, so the kernel can freely move processes between them. I'm not aware of any that mix levels, though I haven't looked very hard. The smaller Xeon SKUs would have existed anyway, since "bigger" CPUs bring more RAM channels and PCIe lanes along with core count increases; Intel did that in the past as well - Ice Lake was shared by the mobile, workstation, and server segments, including lower-core-count SKUs.
As for compilers, it's true that they contain optimizations for Xeons, but the cores do differ in significant ways - the "big" Golden Cove server cores have an additional full 512-bit AVX-512 execution port (which matters for fine-tuning instruction ordering), and also have SGX and TSX, which were removed from desktop/mobile processors.