Monday, January 8th 2024
AVX-512 Doubles Intel 5th Gen "Emerald Rapids" Xeon Processor Performance, Up to 10x Improvement in AI Workloads
According to the latest round of tests by Phoronix, we now have proof of the substantial performance gains Intel's 5th Gen Xeon "Emerald Rapids" server CPUs deliver when employing AVX-512 vector instructions. Enabling AVX-512 doubled throughput on average across a range of workloads, with specific AI tasks accelerating more than 10x versus having it disabled. Running on the top-end 64-core Platinum 8592+ SKU, the benchmarks saw minimal clock frequency differences between the AVX-512 on and off states. The specialized 512-bit vector processing nevertheless unlocked dramatic speedups, exemplified in the OpenVINO AI framework: weld porosity detection, a real-world inspection workload, showed the biggest gains. Power draw also increased moderately - the expected tradeoff for such a performance upside.
With robust optimizations, the potential of the vector engine has now been demonstrated in full. Workloads spanning AI, visualization, simulation, and analytics could multiply their speed by upgrading to Emerald Rapids, although the developer implementation work remains non-trivial. For the data center applications that can take advantage, AVX-512 lets Intel partially close the raw throughput gap versus AMD's core-count leadership; whether those targeted acceleration gains offset EPYC's wider general-purpose value depends on customer workloads. But with tests proving dramatic upside, Intel is betting big on vector acceleration as its ace card - even though AMD also supports the AVX-512 instruction set. Below, you can find the geometric mean of all test results, and check the review with benchmarks here.
Source:
Phoronix
30 Comments on AVX-512 Doubles Intel 5th Gen "Emerald Rapids" Xeon Processor Performance, Up to 10x Improvement in AI Workloads
no one bloody cares
Get to actually matching AMD where it matters, Intel. You're slacking.
It's a "don't talk about the competition if you can't compete" sort of scenario, so you big yourself up against previous/last-gen stuff instead.
The fact that EMR is, on average, worse than equivalent Ryzen Threadripper Pro and EPYC CPUs still stands, however.
Seriously, did anyone carefully read the article and figure out whether the "disabled/off" scenario means falling back to AVX2/AVX/SSE, or disabling all vectorization sets? The compiler flags aren't telling us much.
I'm serious, who does? Is it because of easier scheduling and dividing hardware among different users?
AMD also supports the AVX-512 instructions, but splits some of them up to run on its 256-bit AVX units.
AMD's implementation of AVX-512 is not as simple as splitting it to 256-bit operations. Otherwise the performance increase would not have been as high. Chips and Cheese posted a good analysis of it.
Nice. Intel all but kills AVX-512 because it creates too much heat, plus not much uses it anyway.
Now it's needed for an AI boost?
www.makeuseof.com/what-is-avx-512-why-intel-killing-it/
www.tomshardware.com/news/intel-nukes-alder-lake-avx-512-now-fuses-it-off-in-silicon
Sapphire Rapids uses the same Golden Cove P-cores as Alder Lake, while Emerald Rapids uses Raptor Cove. The decision to remove AVX-512 was made from a segmentation perspective.
The main reason for removing AVX-512 was the inability of E-cores to execute those instructions. I'm not aware of any mainstream operating system that can deal with non-uniform instruction sets, so P-cores had to match them.
12th gen desktop CPUs were also capable of running AVX-512 by disabling the E-cores, but that possibility got taken away with a microcode update, and by physical modification in later steppings/generations.
I'm fairly sure that Linux can, but not Windows - and I don't think that's specifically the reason why, or they could simply have let us use it with the E-cores turned off. Intel has never been a stranger to market segmentation; they've always done it, even when AMD was exerting full pressure on them. There was no physical change to later batches of Alder Lake - and Raptor Lake "Refresh", aka the 14th Gen scam, has no physical differences whatsoever from 13th Gen.
Dang, I thought they killed it because it was pushing core temperatures and voltages too high compared to AMD :p
My bad :slap:
It would have been very difficult to properly support mixed configurations, both from end user and professional perspectives.
"AVX-512 is supported**,
** - but only if you disable some cores"
- would not have been good PR, and a source of constant support issues ;)
Compilers contain architectural optimizations, so it would require maintaining two different optimization levels for this setup. Doable, but I don't think Intel wanted to commit to such complexity. Are you certain enough to provide an example of such a computer? I am not aware of any.
I would also like for Intel to allow this, but I understand the burdens it would bring. I'm just not convinced they explicitly wanted to segment AVX-512 out after investing so much time and money into mainstream hardware and software enablement in their awesome libraries.
I think both sides of this argument make good points. By removing this from chips such as the i9-13900K, they give the 8- and 10-core SPR and EMR CPUs a reason to exist while playing it safe on the compatibility front. Compiler-wise, though, I don't think it'd be much of a burden - the configuration already exists on Xeons anyway.
Android phones, and most other ARM SoCs, can have cores based on differing microarchitectures - for example, Snapdragon 8 Gen 3 has ARM Cortex-X4, Cortex-A720, and Cortex-A520 cores - but they all adhere to the same AArch64 specification level, ARMv9.2-A, so the kernel can freely move processes between them. I'm not aware of any that mix levels, though I haven't looked very hard. The smaller Xeon SKUs would have existed anyway, since "bigger" CPUs bring more RAM channels and PCIe lanes along with core count increases; Intel did that in the past as well - Ice Lake was shared by the mobile, workstation, and server segments, including lower-core-count SKUs.
As for compilers, it's true that they contain optimizations for Xeons, but the cores do differ in significant ways - the "big" Golden Cove server cores have an additional full 512-bit AVX-512 execution port (which matters for fine-tuning instruction ordering), and also have SGX and TSX, which were removed from desktop/mobile processors.