AMD "Zen 2" IPC 29 Percent Higher than "Zen"

Joined
Jun 5, 2016
Messages
69 (0.02/day)
This sounds very impressive, 29% IPC for integer workloads... but that is one specific workload type, not a general use scenario with a 29% improvement, so don't get too hyped. And for those trying to call this out: it's pretty honest in its information, but only if your workload is integer heavy. Overall, hopefully they can get a 10%+ improvement in IPC, with clocks going up as well.


It's a mixed floating point and integer workload with what should be a pretty good amount of hits through the L2. It tells us what the core can do on its own in a roughly ideal situation to extract IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown of performance improvements for this workload probably looks something like this:
  • Fetch: 0-5% (from L2/L3/IMC)
  • Dispatch: 30~35% (next instruction counter, larger uop cache, wider dispatch width)
  • ALU: 5-15% (instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
  • FPU: 15-33% (non-AVX workload, advantage comes from load bandwidth doubling)
  • Retire: 70~80% (from doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)

These values would average together to become the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.
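As a rough check on the arithmetic, the per-stage ranges in the list above can simply be averaged. This is only a sketch: it assumes a plain unweighted mean across the five stages, which is one reading of "average together" and not anything AMD has published, and the names `STAGE_RANGES` and `mean_improvement` are made up for illustration.

```python
# Per-stage improvement ranges (percent), copied from the estimates above.
STAGE_RANGES = {
    "fetch": (0, 5),
    "dispatch": (30, 35),
    "alu": (5, 15),
    "fpu": (15, 33),
    "retire": (70, 80),
}


def mean_improvement(ranges):
    """Return (low, mid, high) unweighted means across all stages, in percent."""
    n = len(ranges)
    low = sum(lo for lo, _ in ranges.values()) / n
    high = sum(hi for _, hi in ranges.values()) / n
    return low, (low + high) / 2, high


low, mid, high = mean_improvement(STAGE_RANGES)
print(f"low {low:.1f}%  mid {mid:.1f}%  high {high:.1f}%")
```

Under that assumption the midpoint comes out at 28.8%, with the range running from 24.0% to 33.6%. A real workload would weight the stages by how much time it actually spends in each, so treat this as a sanity check and nothing more.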

___________________

Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see branch misprediction penalties drop close to zero: we could execute both possibilities for a branch outcome, then just move over each stage's results once the branch outcome is resolved, removing the instruction bubble with a single cycle of latency and resulting in nearly perfect prediction performance. This is insane in a world where power consumption is important... you would be executing (partly or in full) nearly every instruction in a program, even for branches not taken... we're talking about potentially more than doubling how much is executed every clock cycle. Still, this would be something like a 50% IPC increase.
 
Joined
May 2, 2017
Messages
7,762 (2.80/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
Clarified? They have never claimed otherwise.
 

bug

Joined
May 22, 2015
Messages
13,779 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Joined
Oct 5, 2017
Messages
595 (0.23/day)
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
 

bug

Joined
May 22, 2015
Messages
13,779 (3.96/day)
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
Yeah, well, posting in a hurry between two compiles will do that ;)

Edit: fixed
 
Joined
Jun 5, 2016
Messages
69 (0.02/day)
Maybe "put an end to speculations" would have been clearer?

They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
 

bug

Joined
May 22, 2015
Messages
13,779 (3.96/day)
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
At least it puts an upper bound on expectations, so I'm good.
 
Joined
Jun 5, 2016
Messages
69 (0.02/day)
At least it puts an upper bound on expectations, so I'm good.

That it does - it looks like around 30% is going to be the max for any real-world task. It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read 29% improvement rather than 100% improvement... especially when some programs will see a 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores for this since Cinebench relies much more heavily on branch prediction and prefetch, but we can guess it will be at least 10%, but no more than 30% - and very very unlikely to see 30%. But that's where we were before, except with a lower upper bound. I still think it will be about 15% on average.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth slightly vary between CPU architectures, this principle should largely be the same for any pipelined architecture.

That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.
Judging by that image, there is no issue with L3 cache at all.

Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...
Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing for Zen needing better cache. I'm the one pointing out that Zen has a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.
This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel sees no significant difference with speeds above 2666 MHz; even memory with slightly better timings has virtually no impact.

Data sharing is insanely common in multi-threaded programs.
Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
 
Joined
Jun 5, 2016
Messages
69 (0.02/day)
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?

It actually doesn't, though that used to be the case. Today, cores fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result. Intel uses a bus they call, unimaginatively, but accurately, the "result bus." This bus is fed by the store pipelines and each execution pipeline. The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.
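To put rough numbers on why this matters, here's a toy model of that chain. It is purely a sketch under stated assumptions: a 1-cycle add, and either the result being forwarded straight to the next add or the dependent add stalling for the "up to 18 cycles" quoted above. The function name and parameters are made up, and this is not a model of any real core.

```python
def chain_cycles(num_operands, add_latency, wait_per_dependency):
    """Cycles to finish a serial chain like ((A + B) + C) + D.

    Each add depends on the previous result, so the adds cannot overlap.
    wait_per_dependency is how long a dependent add stalls before it can
    start; 0 models the result being forwarded straight to the next add.
    """
    adds = max(num_operands - 1, 0)
    stalls = max(adds - 1, 0)  # the first add has nothing to wait for
    return adds * add_latency + stalls * wait_per_dependency


# Assumption: result forwarded directly, no stall between dependent adds.
forwarded = chain_cycles(4, add_latency=1, wait_per_dependency=0)
# Assumption: every dependent add waits 18 cycles for the previous result.
drained = chain_cycles(4, add_latency=1, wait_per_dependency=18)
print(forwarded, drained)
```

With forwarding, the three adds finish in 3 cycles; if each dependent add really had to wait 18 cycles, the same chain would take 39.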

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places. You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there's no need for an execution pipeline to read from the result bus)... it just depends on how dependably the execution will occur within a given time frame.

Judging by that image, there is no issue with L3 cache at all.

That image is using a 256-byte stride which hides the full effect of the random-access issue until it exceeds about the cache size as Ryzen can predict the access pattern well enough.



You can see the (unsurprisingly) excellent sequential performance (which is only 2.8ns on my 2700X)... and then the abysmal random access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen - and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly often once a single core uses more than 4MiB of data. By 6MiB you have a ~50% miss rate that results in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal. My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).
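For what it's worth, that 20~30ns guess falls out of a very simple weighted average. This is a back-of-the-envelope sketch only: it assumes every access is either an L3 hit or a trip to memory, it reuses the ~50% miss rate from the 6MiB case as a stand-in (a bigger L3 would change it), and it ignores TLB and page effects. `avg_latency_ns` is just an illustrative name.

```python
def avg_latency_ns(l3_ns, mem_ns, miss_rate):
    """Average access latency if each access either hits L3 or goes to memory."""
    return (1.0 - miss_rate) * l3_ns + miss_rate * mem_ns


# Figures from my 2700X: 9.5ns to L3, 61.9ns to memory, ~50% miss rate.
today = avg_latency_ns(l3_ns=9.5, mem_ns=61.9, miss_rate=0.5)
# Hypothetical Zen 2 case: same L3 latency, memory latency down to 40ns.
hoped = avg_latency_ns(l3_ns=9.5, mem_ns=40.0, miss_rate=0.5)
print(f"today: {today:.2f}ns  hoped: {hoped:.2f}ns")
```

That gives roughly 36ns with today's numbers and roughly 25ns with 40ns memory latency, which is where the 20~30ns estimate lands.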

Intel is much better at utilizing their cache through a better front-end.

Zen's front end is extremely good. As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6. Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6. That's a potential 16.7% advantage to Intel.
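The 16.7% is just the ratio of the two peak dispatch widths. A minimal sketch, assuming peak widths only (sustained throughput will be lower on both sides) and with a made-up helper name:

```python
def relative_advantage(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100.0


# Peak dispatch: 7 uops for Intel (with fusion) versus 6 for Zen.
print(f"{relative_advantage(7, 6):.1f}%")
```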

The next Intel advantage is in their unified scheduler - which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple of cycles - and AMD makes up for it by having four full-featured ALU pipelines and 6 independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results could need to be fetched from the L1D (3 cycle penalty) more often.

Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

If by "between L3 caches" you mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, and the LLC acts as the insulator to main memory. However, cross communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core to core communication (simple test - fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - time, accessing core, and a mutex... each core that gains the mutex records the time difference since the last access, which core made that access, updates the struct, and moves on). This showed that handling the mutex could, at PEAK, take only 10ns between the CCXes (this could even be a timing mechanism inaccuracy, since this all reanalysis), but usually took way more... strong clusters at 20ns, 40~50ns, and a good half at 100ns or more (which means out to main memory).
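The structure of that test can be sketched roughly as below. Big caveats: this is not the code that produced the numbers above. It is a Python outline of the same idea (shared record, mutex, log who had it last and how long ago), it does not pin threads to cores, and the interpreter's own locking will swamp any nanosecond-level effect, so it only shows the shape of the measurement. All names are hypothetical.

```python
import threading
import time


def run_handoff_test(num_workers=2, iterations=1000):
    """Each worker repeatedly takes a mutex, logs the gap since the last
    access and who made it, then updates the shared record."""
    lock = threading.Lock()
    shared = {"time_ns": None, "owner": None}  # stands in for the simple struct
    samples = []  # (previous_owner, current_owner, delta_ns)

    def worker(worker_id):
        for _ in range(iterations):
            with lock:
                now = time.perf_counter_ns()
                if shared["time_ns"] is not None:
                    samples.append(
                        (shared["owner"], worker_id, now - shared["time_ns"])
                    )
                shared["time_ns"] = now
                shared["owner"] = worker_id

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return samples


samples = run_handoff_test()
# Hand-offs where the mutex moved from one worker to another.
cross = [delta for prev, cur, delta in samples if prev != cur]
print(len(samples), "accesses logged,", len(cross), "changed hands")
```

To get meaningful latency clusters you would want native code with threads pinned to specific cores on different CCXes, then bucket the hand-off deltas into a histogram.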

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are all something the CPU can figure out with ease, so optimizing for those, at the very least, has been done.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.

Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.
 
Joined
Oct 21, 2005
Messages
7,061 (1.01/day)
Location
USA
System Name Computer of Theseus
Processor Intel i9-12900KS: 50x Pcore multi @ 1.18Vcore (target 1.275V -100mv offset)
Motherboard EVGA Z690 Classified
Cooling Noctua NH-D15S, 2xThermalRight TY-143, 4xNoctua NF-A12x25,3xNF-A12x15, 2xAquacomputer Splitty9Active
Memory G-Skill Trident Z5 (32GB) DDR5-6000 C36 F5-6000J3636F16GX2-TZ5RK
Video Card(s) ASUS PROART RTX 4070 Ti-Super OC 16GB, 2670MHz, 0.93V
Storage 1x Samsung 970 Pro 512GB NVMe (OS), 2x Samsung 970 Evo Plus 2TB (data), ASUS BW-16D1HT (BluRay)
Display(s) Dell S3220DGF 32" 2560x1440 165Hz Primary, Dell P2017H 19.5" 1600x900 Secondary, Ergotron LX arms.
Case Lian Li O11 Air Mini
Audio Device(s) Audiotechnica ATR2100X-USB, El Gato Wave XLR Mic Preamp, ATH M50X Headphones, Behringer 302USB Mixer
Power Supply Super Flower Leadex Platinum SE 1000W 80+ Platinum White, MODDIY 12VHPWR Cable
Mouse Zowie EC3-C
Keyboard Vortex Multix 87 Winter TKL (Gateron G Pro Yellow)
Software Win 10 LTSC 21H2
'The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,' AMD's clarification continues. 'We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.'

https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/
 
Joined
Jul 14, 2008
Messages
872 (0.15/day)
Location
Copenhagen, Denmark
System Name Ryzen/Laptop/htpc
Processor R9 3900X/i7 6700HQ/i7 2600
Motherboard AsRock X470 Taichi/Acer/ Gigabyte H77M
Cooling Corsair H115i pro with 2 Noctua NF-A14 chromax/OEM/Noctua NH-L12i
Memory G.Skill Trident Z 32GB @3200/16GB DDR4 2666 HyperX impact/24GB
Video Card(s) TUL Red Dragon Vega 56/Intel HD 530 - GTX 950m/ 970 GTX
Storage 970pro NVMe 512GB,Samsung 860evo 1TB, 3x4TB WD gold/Transcend 830s, 1TB Toshiba/Adata 256GB + 1TB WD
Display(s) Philips FTV 32 inch + Dell 2407WFP-HC/OEM/Sony KDL-42W828B
Case Phanteks Enthoo Luxe/Acer Barebone/Enermax
Audio Device(s) SoundBlasterX AE-5 (Dell A525)(HyperX Cloud Alpha)/mojo/soundblaster xfi gamer
Power Supply Seasonic focus+ 850 platinum (SSR-850PX)/165 Watt power brick/Enermax 650W
Mouse G502 Hero/M705 Marathon/G305 Hero Lightspeed
Keyboard G19/oem/Steelseries Apex 300
Software Win10 pro 64bit
well... based on their recent track record it's quite plausible that with Zen 2 they will reach and maybe even overtake (by a small margin) Intel on single core IPC. Their only disadvantage/problem is basically clocks, and that can be fixed relatively easily with a smaller node.
 
Joined
Jan 29, 2016
Messages
128 (0.04/day)
System Name Ryzen 5800X-PC / RyzenITX (2nd system 5800X stock)
Processor AMD Ryzen 7 5800X (atx) / 5800X itx (soon one pc getting 5800X3D upgrade! ;)
Motherboard Gigabyte X570 AORUS MASTER (ATX) / X570 I Aorus Pro WiFi (ITX)
Cooling AMD Wrath Prism Cooler / Alphenhone Blackridge (ITX)
Memory OLOY 4000Mhz 16GB x 2 (32GB) DDR4 4000 Mhz CL18, (22,22,22,42) 1.40v AT & ITX PC's (2000 Fclk)
Video Card(s) AMD Radeon RX 6800 XT (ATX) /// AMD Radeon RX 6700 XT 12GB GDDR6 (ITX)
Storage (Sys)Sammy 970EVO 500GB & SabrentRocket 4.0+ 2TB (ATX) | SabrentRocket4.0+ 1TB NVMe (ITX)
Display(s) 30" Ultra-Wide 21:9 200Hz/AMD FREESYNC 200hz/144hz LED LCD Montior Connected Via Display Port (x2)
Case Lian Li Lancool II Mesh (ATX) / Velkase Velka 7 (ITX)
Audio Device(s) Realtek HD ALC1220 codec / Onboard HD Audio* (BOTH) w/ EQ settings
Power Supply 850w (Antec High-Current Gamer) HC-850 PSU (80+ gold certified) ATX) /650Watt Thermaltake SFX (ITX)
Mouse Logitech USB Wireless KB & MOUSE (Both Systems)
Keyboard Logitech USB Wireless KB & MOUSE (Both Systems)
VR HMD Oculus Quest 2 - 128GB - Standalone + Oculus link PC
Software Windows 10 Home x64bit 2400 /BOTH SYSTEMS
Benchmark Scores CPUZ - ATX-5800X (ST:670) - (MT: 6836.3 ) CPUZ - ITX -5800X (ST:680.2) - (MT: 7015.2) ??? same CPU?
AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.

Lisa Su is very careful about the guidance she puts out.



actually.... it was an IPC uplift of 52% from "excavator"

if this is another 29% MORE IPC Over zen 1, then Intel is done.....
 