AMD Zen 5 Execution Engine Leaked, Features True 512-bit FPU

SL2 · Apr 5, 2024

bug said:
I'd rather do without and have CPUs that are 20-30% cheaper instead.

You mean due to smaller die? Yeah, I don't think that's gonna happen.

I mean, of course AMD could lower the price for various reasons, but the reason being smaller die size alone isn't very likely I'm afraid.

bug · Apr 5, 2024

SL2 said:
You mean due to smaller die? Yeah, I don't think that's gonna happen.

I mean, of course AMD could lower the price for various reasons, but the reason being smaller die size alone isn't very likely I'm afraid.

Die size has the biggest impact on the retail price of a CPU. Wafers come in predetermined sizes and cost the same to make. The more chips you get out of each one, the lower the price per chip.
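
As a rough illustration of the arithmetic behind that point, here is a minimal Python sketch using a common first-order dies-per-wafer approximation. The 300 mm wafer diameter, the two die areas, and the wafer cost are placeholder values chosen for illustration, not figures from AMD or TSMC, and the estimate ignores yield, scribe lines, and edge exclusion.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """First-order estimate of gross dies per wafer.

    Wafer area divided by die area, minus a correction for the partial
    dies lost around the circular edge.
    """
    radius = wafer_diameter_mm / 2
    wafer_area = math.pi * radius ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


def cost_per_die(wafer_cost: float, wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Wafer cost spread evenly over the gross die count (no yield model)."""
    return wafer_cost / dies_per_wafer(wafer_diameter_mm, die_area_mm2)


if __name__ == "__main__":
    WAFER_COST = 10_000.0  # hypothetical price per wafer, arbitrary currency
    for area in (70.0, 90.0):  # hypothetical die areas in mm^2
        count = dies_per_wafer(300.0, area)
        cost = cost_per_die(WAFER_COST, 300.0, area)
        print(f"{area:.0f} mm^2: {count} dies, {cost:.2f} per die")
```

Whether a lower cost per die actually shows up in the retail price is a separate question that this formula says nothing about.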

Philaphlous · Apr 5, 2024

I'm sure the shutdown at TSMC from the earthquakes will definitely impact AMD...Delay or reduced shipments if delivered on time...

529th · Apr 5, 2024

Wonder if this will be a compelling upgrade for Zen3 gamers.

windwhirl · Apr 5, 2024

Denver said:
I'd just like to see more mainstream consumer applications using such an instruction set.

There are some mainstream uses, such as Blender and some image/video encoding/decoding libraries, but not much else. Maybe RPCS3 if you want to consider PS3 emulation as "mainstream"

529th said:
Wonder if this will be a compelling upgrade for Zen3 gamers.

Gotta change board and RAM for this, at least, so it'd probably need some impressive numbers (+20% over Zen4).

Redwoodz · Apr 5, 2024

Philaphlous said:
I'm sure the shutdown at TSMC from the earthquakes will definitely impact AMD...Delay or reduced shipments if delivered on time...

I'm sure everyone will be impacted, nothing different about AMD.

evernessince · Apr 5, 2024

bug said:
If run locally, maybe. But currently most models worth anything are too big to run a consumer PC. And that's not going to change: no matter how capable PCs will grow, the cloud will always be better.

This is simply not true. You have large models like Llama2, Mistral, etc. with a massive number of parameters working well on regular desktop PCs. You also have Stable Diffusion XL and the upcoming Stable Diffusion 3 models. There are also plenty of AI models that don't require much to run, like AI voice enhancers, voice isolation, layer isolation, etc. You are assuming that every AI model worth having is super big and resource intensive, but you can see from things like DLSS and SDXL Lightning that AI can be a powerful tool without needing a massive amount of resources. These smaller models can be extremely handy and light on resources.

ScaLibBDP · Apr 5, 2024

Here are a couple of comments...

- A source for that leak is Very Questionable

- Intel AVX-512 ISA is a Complete Tech Disaster ( * )

( * ) It is based on my experience using an Intel Xeon Phi server. We reached its performance limitations in less than 4 weeks after a project was started.

ncrs · Apr 5, 2024

bug said:
I'm a bit confused. A few years ago we were burning Intel to the stake for AVX-512 (https://linuxiac.com/linus-torvalds-criticizes-intel-avx-512/, but not only). Now we're cheering for the same AVX-512?

We were burning Intel at the stake because their implementation was subpar. Engaging early AVX-512 implementations caused severe downclocking for the entire CPU even if only a single core was using it. The same issue affected AVX2 to a lesser extent. This made using AVX-512 a hazard for normal CPU operations, often resulting in performance significantly worse than AVX/AVX2 versions.
Since then Intel designs have reduced the penalty and almost eliminated it altogether for Sapphire Rapids.
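
To show why a clock penalty can wipe out the benefit of wider vectors, here is a minimal Python sketch of a simple Amdahl-style model. It assumes the baseline (AVX2) build runs at full clock, that the AVX-512 build speeds up only the vectorizable fraction of the work, and that the downclock applies to the whole run. The 2x vector speedup and the 0.8 clock ratio are illustrative assumptions, not measurements of any real CPU.

```python
def relative_runtime(vector_fraction: float, vector_speedup: float, clock_ratio: float) -> float:
    """Runtime of the AVX-512 build relative to the baseline build.

    1.0 means the same speed, above 1.0 means slower than the baseline.
    vector_fraction: share of baseline runtime spent in vectorizable code.
    vector_speedup:  speedup of that share at equal clocks.
    clock_ratio:     AVX-512 clock divided by baseline clock (below 1.0 = downclock).
    """
    scalar_part = 1.0 - vector_fraction
    vector_part = vector_fraction / vector_speedup
    return (scalar_part + vector_part) / clock_ratio


if __name__ == "__main__":
    for fraction in (0.1, 0.4, 0.8):
        ratio = relative_runtime(fraction, vector_speedup=2.0, clock_ratio=0.8)
        print(f"vectorizable fraction {fraction:.0%}: relative runtime {ratio:.3f}")
```

Under these assumptions, a program that spends only a small share of its time in vector code ends up slower overall, which matches the hazard described above.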

bug said:
Thermals have certainly improved, but the discussion was more about the large amount of die space being used for specialized purposes. That's still the case. Considering the increased competition for fab capacity, you'd think "wasted" transistors are more of a problem today than they were 4 years ago.

Even with an older Skylake-X implementation that contained 2 AVX-512-capable units (one created by combining two 256-bit units, and one dedicated) the difference isn't as big, since only the red part is "dedicated" for AVX-512. Obviously there's other parts of the CPU that need to be extended for it as well.

Source

bug said:
I'm a bit more in the other camp: if it only benefits like 10% of the typical workloads, I'd rather do without and have CPUs that are 20-30% cheaper instead.

At the same time, I realize this is basically a chicken-and-egg problem: if AVX-512 isn't available, apps that use it won't be either.

Current Intel desktop/mobile P-cores contain the transistors for one AVX-512 unit (the combined 2x256-bit), and the miscellaneous stuff all over the core. The server parts extend this base core with a second dedicated 512-bit unit, more cache, a mesh agent and an AMX unit, among other things we can't be sure of just from die shots.
Meteor Lake is also built on the same principle using Redwood Cove cores. It would be prohibitively expensive for Intel to design a special version of the core without them when the combined unit is used for AVX2 anyway. All that makes the E-core business even more controversial.
I doubt purging AVX-512 completely would result in 20-30% less area.

Gains from AVX-512 can be significant: some benchmarks on Phoronix show up to a 20x improvement using AVX-512-FP16, though most are not as drastic. Another recent example is a 10x gain in LLM prompt evaluation speed. We're starting to see some Linux distributions compiling software specifically for the x86-64-v4 target, which includes AVX-512. It's not only about the vector length, since AVX-512 contains other general improvements usable even by strictly integer-based software.
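
For anyone wondering whether their own machine would qualify for such builds, here is a minimal Python sketch that checks for the AVX-512 subsets the x86-64-v4 level adds on top of x86-64-v3, using the flag names Linux prints in /proc/cpuinfo. It only checks those five flags and does not verify the lower levels. The function name and the sample strings are made up for illustration.

```python
# AVX-512 subsets required by x86-64-v4, as spelled in Linux /proc/cpuinfo.
X86_64_V4_FLAGS = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}


def missing_v4_flags(cpuinfo_text: str) -> set[str]:
    """Return the x86-64-v4 AVX-512 flags absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return X86_64_V4_FLAGS - set(value.split())
    # No 'flags' line at all: treat every flag as missing.
    return set(X86_64_V4_FLAGS)


if __name__ == "__main__":
    # Hypothetical sample line; on Linux you could pass the contents of /proc/cpuinfo.
    sample = "flags\t\t: fpu sse2 avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
    missing = missing_v4_flags(sample)
    print("missing:", sorted(missing) if missing else "none")
```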

JohH · Apr 5, 2024

In znver5 FP store ports are fused for 512-bit operations but can be used separately for 256-bit operations. In some AVX(2) workloads this will improve performance as well.

Code:

(define_reservation "znver5-fp-store256" "znver5-fp-store0|znver5-fp-store1")
(define_reservation "znver5-fp-store-512" "znver5-fp-store0+znver5-fp-store1")
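
To make the two reservations concrete, here is a toy Python model (not GCC's scheduler, and not a claim about the real hardware). It assumes two store ports per cycle, that a store of 256 bits or less takes either port, that a 512-bit store takes both ports in the same cycle, and that stores issue strictly in order. The function name is hypothetical.

```python
def store_cycles(store_widths: list[int]) -> int:
    """Count cycles needed to issue stores in order on two store ports."""
    cycles = 0
    free_ports = 0  # ports still unused in the current cycle
    for width in store_widths:
        # "store0|store1" needs one port, "store0+store1" needs both.
        needed = 2 if width > 256 else 1
        if needed > free_ports:
            cycles += 1      # open a new cycle with both ports free
            free_ports = 2
        free_ports -= needed
    return cycles


if __name__ == "__main__":
    print("four 256-bit stores:", store_cycles([256, 256, 256, 256]), "cycles")
    print("four 512-bit stores:", store_cycles([512, 512, 512, 512]), "cycles")
```

In this model the store bandwidth per cycle is the same either way; what 256-bit code gains is the ability to issue two stores per cycle.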

Daven · Apr 5, 2024

bug said:
Die size has the biggest impact on the retail price of a CPU. Wafers come in predetermined sizes and cost the same to make. The more chips you get out of each one, the lower the price per chip.

Don’t forget the law of mass production where reductions in cost can be achieved at scale. It’s cheaper to make millions of a single complex, large core design than a much smaller volume of a few simpler, smaller cores. That’s why AMD has the same chiplet for both Epyc and Ryzen.

R-T-B · Apr 5, 2024

bug said:
I'm a bit confused. A few years ago we were burning Intel to the stake for AVX-512 (https://linuxiac.com/linus-torvalds-criticizes-intel-avx-512/, but not only). Now we're cheering for the same AVX-512?

The criticism was due to the product segmentation not the product.

Denver · Apr 5, 2024

Redwoodz said:
I'm sure everyone will be impacted, nothing different about AMD.

It shouldn't; the fab that produces 5nm chips was not impacted. TSMC also left its financial guidance unchanged.

529th · Apr 5, 2024

windwhirl said:
Gotta change board and RAM for this, at least, so it'd probably need some impressive numbers (+20% over Zen4).

At least 20%, even in the worst-performing cases!

bug · Apr 5, 2024

R-T-B said:
The criticism was due to the product segmentation not the product.

You didn't even open the link I provided, did you?

Panther_Seraphin · Apr 6, 2024

bug said:
You didn't even open the link I provided, did you?

Read what he says

He complained at the time that Intel was trying to market AVX512 as the magic bullet to solve all problems, when in actual fact, if you used it, it was horrible.

Run AVX512 code on Alder Lake and you're down in 3.5 GHz territory when the turbos were 5 GHz+ for most other things. It also meant the P-cores were physically larger per core for near-zero benefit in most workloads, whereas a 10-12 core design with only AVX2 would have been better for most use cases. And the other half of your die was completely useless for AVX512 workloads, since you had to disable your E-cores to use it effectively.

AMD at the time was giving him everything he wanted: more cores, decent power consumption per core, and no gimmicky tools needed to extract extra performance. As he stated at the time, AVX512 should have stayed in HPC/server areas; the desktop had little to no benefit from it then.

R-T-B · Apr 6, 2024

bug said:
You didn't even open the link I provided, did you?

I've read it before. I know what Torvalds argues.

Have a quote:

He also cautioned against placing too much weight on floating-point performance benchmarks. Especially those that take advantage of exotic new instruction sets that have a fragmented and varied implementation across product lines.

user556 · Apr 6, 2024

Oh, man, what a huge let-down. I had my hopes up it was the general instruction pipeline that was up by 40%. But alas, it seems not.

Zen4 AVX512 is already a huge winner the way it is. It single-handedly turned the AVX512 ship around. It didn't need any measurable extra power to do amazing amounts of work. I now fear Zen5 is not going to be that good.

JohH · Apr 6, 2024

user556 said:
Oh, man, what a huge let-down. I had my hopes up it was the general instruction pipeline that was up by 40%. But alas, it seems not.

Zen4 AVX512 is already a huge winner the way it is. It single-handedly turned the AVX512 ship around. It didn't need any measurable extra power to do amazing amounts of work. I now fear Zen5 is not going to be that good.

Because of a fake slide?
The way Zen 5 implements 512-bit operations is not yet clear. It may simply be fusing ports fp0/fp1, like they do for stores, in one cycle instead of doing it sequentially. It wouldn't take much extra area. Nor extra power compared to a dense AVX2 loop.

And the evidence we do have, from Zen 5 changes to Linux and GCC, suggests general pipeline improvements too: 8-wide dispatch from the micro-op cache, 6 ALUs and 4 AGUs. The only confirmed change for FP is a second FP store unit, which does suggest improved throughput for AVX2 and AVX512 programs.

And where did you get the idea it'd be 40% faster? Discredited RDNA3 hypebeasts on twitter?

Onasi · Apr 6, 2024

JohH said:
And where did you get the idea it'd be 40% faster? Discredited RDNA3 hypebeasts on twitter?

Yeah, this should obviously be ridiculous - there hasn’t been a gen on gen improvement this massive… in a while. Not solely from the general instructions. Definitely not between generations of the same architecture. Otherwise we would be talking about a jump in overall performance that would be the biggest for AMD since Zen 1 when compared to Bulldozer and its derivatives. CPUs simply don’t increase in performance this drastically. Even the leaks and estimates for Zen 5 go for saner numbers like 10-15% IPC improvement (plausible) and 20-30% overall performance uplift compared to Zen 4 (again, tracks pretty well with what we’ve seen with previous gen increases, Zen+ aside for obvious reasons).
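
As a quick sanity check on how those two ranges relate, here is a minimal Python sketch assuming the simplified model that single-thread performance scales as IPC times clock frequency. That model ignores core counts, memory, and workload mix, so treat the result as a rough bound rather than a prediction. The function name is hypothetical.

```python
def implied_clock_gain(ipc_gain: float, overall_gain: float) -> float:
    """Clock uplift needed for a given IPC gain to reach a given overall gain.

    Assumes performance = IPC * clock, so
    (1 + overall) = (1 + ipc) * (1 + clock).
    """
    return (1 + overall_gain) / (1 + ipc_gain) - 1


if __name__ == "__main__":
    # Corners of the rumored ranges quoted above: 10-15% IPC, 20-30% overall.
    for ipc, overall in ((0.15, 0.20), (0.10, 0.30)):
        clock = implied_clock_gain(ipc, overall)
        print(f"IPC +{ipc:.0%}, overall +{overall:.0%}: clock +{clock:.1%}")
```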

efikkan · Apr 6, 2024

stimpy88 said:
The low L2 cache size is an obvious planned mistake and low hanging fruit for Zen 6 to fix, we know AMD were experimenting with larger L2 cache sizes, and that 2MB was the sweet spot, and 3MB offering only slight low single-digit uplift in perf over 2MB. One of the reasons for the infamous "AMD dip".

Even though we know the slide is fake, I just want to point out that no one, including the best engineers, can precisely assess the effect of a cache change without evaluating it on a specific microarchitecture. A change in cache size on one microarchitecture might not translate to the same proportional change on another. L1 and L2 especially are closely tied to how the pipeline works, which is why the cache configuration can change a lot between generations. And contrary to what most people believe, they don't design the microarchitecture around the cache; it's the other way around. If throwing in another MB or so would bring a huge benefit, I'm sure they would do it. They simulate all kinds of core configurations before tapeout, so they have quite likely already simulated what a larger L2 cache would do, and whichever they pick is the overall best performing within the constraints of the architecture and node.

Also, keep in mind there are many more attributes than just size, like latency, number of banks, bandwidth, etc. If the next generation is moved to a new node with different characteristics, a larger cache may be achievable without worsening the latency significantly.
Additionally, many heavy AVX workloads are more sensitive to bandwidth than cache size.
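
The size-versus-latency trade-off can be sketched with the textbook average memory access time formula (hit latency plus miss rate times miss penalty). All latencies and miss rates below are invented placeholders, not measurements of any Zen core, and the function name is hypothetical.

```python
def amat(hit_latency: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in cycles for a single cache level."""
    return hit_latency + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical smaller L2: faster hits, more misses.
    small_l2 = amat(hit_latency=14, miss_rate=0.10, miss_penalty=50)
    # Hypothetical larger L2: fewer misses, but slower hits.
    large_l2 = amat(hit_latency=17, miss_rate=0.05, miss_penalty=50)
    print(f"smaller L2: {small_l2:.1f} cycles, larger L2: {large_l2:.1f} cycles")
```

With these made-up numbers the larger cache halves the miss rate and still comes out slightly slower on average, which is the kind of outcome that only simulation on the actual design can settle.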

stimpy88 said:
And it's also borderline criminal AMD do not rectify the L3 cache starvation issue without the "3D cache band-aid" cash grab. Even a better memory controller would help in this regard.

I've often criticized the large L3, as it's a very "brute force" attempt to make up for shortcomings in the architecture, a sort of "band-aid" like you rightfully call it. But if Zen 5 is significantly better, especially in the front-end and scheduling of instructions, the usefulness of extra L3 may be actually reduced.
There will obviously still be the edge-case scenarios where the extra L3 shines (mostly very bloated code), but the overall gain is close to negligible, and it's such a waste of silicon for most uses.

Wirko said:
AVX512 is for integer and bitwise operations too, not only for FP. That's where SPEC-int gains, purportedly very big, come from.

AVX certainly supports integer operations too, as you say, but I suspect SPECint isn't compiled to use it, although I haven't checked thoroughly. Even so, modern CPUs do auto-vectorize in some cases, but I don't know if the front-end will be fast enough to vectorize more than 4 64-bit or 8 32-bit ops (per vector unit, so 2x) per clock. I suspect it will be very underutilized in reality, but still, in the worst case, with AMD having their vector units on separate execution ports, it will allow each vector unit to work as a single ALU. Or probably split, so each FMA pair works as ALU+MUL. (Whether it's worth it in power draw is uncertain.)

mahoney · Apr 6, 2024

You are a bunch of idiots for taking MLID's leaks as truth
When the prick has a long track record of deleting all his BS leak videos when they don't come true.

Daven · Apr 6, 2024

I for one am glad the nonsense of a one-year cadence between Zen 4 and Zen 5 is dead. So many were asking why buy Zen 4 when Zen 5 would come a year later. AMD processor architectures are on a two-year cadence, just like GPUs. It's possible a release could be up to six months early or up to six months late as circumstances dictate, but never more than that for a major release.

Longer cadence with more features and performance on the same established platform as the last gen. This is a big reason I buy AMD.

evernessince · Apr 6, 2024

user556 said:
Oh, man, what a huge let-down. I had my hopes up it was the general instruction pipeline that was up by 40%. But alas, it seems not.

Zen4 AVX512 is already a huge winner the way it is. It single-handedly turned the AVX512 ship around. It didn't need any measurable extra power to do amazing amounts of work. I now fear Zen5 is not going to be that good.

I'll assume based on your reaction here that you are not into tech news enough to know that a single slide cannot contain all the details of a given chip. Typically the press is given a deck of slides, not just a single slide, when a company releases a new CPU or GPU.

Never mind that the slide turned out to be fake; you are drawing a conclusion based on wholly incomplete information. As usual with these kinds of rumors and "leaked" slides, they are designed to generate clicks and engagement like what you've provided here. Don't fall for it; wait for official info to draw an informed conclusion.

kondamin · Apr 6, 2024

SL2 said:
You mean due to smaller die? Yeah, I don't think that's gonna happen.

I mean, of course AMD could lower the price for various reasons, but the reason being smaller die size alone isn't very likely I'm afraid.

AMD does chiplets, they can just cut back the number of cores per chiplet and have small dies.
