AMD "Zen 2" IPC 29 Percent Higher than "Zen"

looncraz · Nov 14, 2018

bucketface said:
This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.

It's a mixed floating point and integer workload with what should be a pretty good amount of hits through the L2. It tells us what the core can do on its own in a roughly ideal situation to extract IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown in performance improvements. for this workload, probably looks something such as this:

Fetch: . . . . . . 0-5% . . . . . . . .(from L2/L3/IMC)
Dispatch: . . . 30~35% . . . . . (next instruction counter, larger uop cache, wider dispatch width)
ALU: . . . . . . . 5-15% . . . . . . . (instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
FPU: . . . . . . . 15-33% . . . . . . (non-AVX workload, advantage comes from load bandwidth doubling).
Retire: . . . . . .70~80% . . . . . .(from doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)

These values would average together to become the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.

___________________

Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see cache miss penalties drop close to zero as we could execute both possibilities for a branch outcome then just move over each stage results after a branch prediction is shown true - removing the instruction bubble with a single cycle latency and resulting in nearly perfect prediction performance. This is insane in the world where power consumption is important... you will be executing (partly or in full) nearly every instruction in a program - even for branches not taken... we're talking about potentially more than doubling how much is executed for every clock cycle. Still, this would be something like a 50% IPC increase.

Valantar · Nov 14, 2018

bug said:
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.

Clarified? They have never claimed otherwise.

bug · Nov 14, 2018

Valantar said:
Clarified? They have never claimed otherwise.

Maybe "put an end to speculations" would have been clearer?

GlacierNine · Nov 14, 2018

bug said:
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.

I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.

bug · Nov 14, 2018

GlacierNine said:
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.

Yeah, well, posting in a hurry between two compiles will do that

Edit: fixed

looncraz · Nov 14, 2018

bug said:
Maybe "put an end to speculations" would have been clearer?

They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.

bug · Nov 14, 2018

looncraz said:
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.

At least it puts an upper bound on expectations, so I'm good.

looncraz · Nov 14, 2018

bug said:
At least it puts an upper bound on expectations, so I'm good.

That it does - it looks around 30% for any real world task is going to be max. It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read 29% improvement rather than 100% improvement.. especially when some programs will see 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores for this since Cinebench relies much more heavily on branch prediction and prefetch, but we can guess it will be at least 10%, but no more than 30% - and very very unlikely to see 30%. But that's where we were before, except with a lower upper bound. I still think it will be about 15% on average.

efikkan · Nov 14, 2018

looncraz said:
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.

You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth slightly vary between CPU architectures, this principle should largely be the same for any pipelined architecture.

looncraz said:
That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.

Judging by that image, there is no issue with L3 cache at all.

looncraz said:
Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...

Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing for Zen needing better cache. I'm the one pointing out that Zen have a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.

looncraz said:
Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.

This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel see no significant difference with speeds above 2666 MHz, even memory with slightly better timings have virtually no impact.

looncraz said:
Data sharing is insanely common in multi-threaded programs.

Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.

looncraz · Nov 14, 2018

efikkan said:
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?

It actually doesn't, though that used to be the case. Today, fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result. Intel uses a bus they call, unimaginatively, but accurately, the "result bus." This bus is fed by the store pipelines and each execution pipeline. The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places. You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there's no need for an execution pipeline to read from the result bus..)... it just depends on how dependably the execution while occur within a given time frame.

efikkan said:
Judging by that image, there is no issue with L3 cache at all.

That image is using a 256-byte stride which hides the full effect of the random-access issue until it exceeds about the cache size as Ryzen can predict the access pattern well enough.

You can see the (unsurprising) excellent sequential performance (which is only 2.8ns on my 2700X) ... and them the abysmal random access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen - and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly more after a single core uses more than 4MiB of data. By 6MiB you have a ~50% miss rate that result in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal. My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).

efikkan said:
Intel is much better at utilizing their cache through a better front-end.

Zen's front end is extremely good. As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6. Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6. That's a potential 16.7% advantage to Intel.

The next Intel advantage is in their unified scheduler - which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple cycles - and AMD makes up for it by having four ALU full featured pipelines and 6 independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results could need to be fetched from the L1D (3 cycle penalty) more often.

efikkan said:
Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

If you mean between L3 caches to mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, the LLC acts as the insulator to main memory. However, cross communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core to core communication (simple test - fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - time, accessing core, and a mutex... each core that gains the mutex records the time difference between the last access, which core made that access, updates the struct, and moves on). This showed that handling the mutex could, PEAK, take only 10ns between the CCXes (this could even be a timing mechanism inaccuracy, since this all reanalysis), but usually took way more... strong clusters at 20ns, 40~50ns, and a good half at 100ns or more (which means out to main memory).

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are all something the CPU can figure out with ease, so optimizing for those, in very least, has been done.

efikkan said:
Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.

Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.

Vario · Nov 15, 2018

'The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,' AMD's clarification continues. 'We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.'

https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/

$ReaPeR$ · Nov 18, 2018

well.. based on their recent track record its quite plausible that with zen 2 they will reach and maybe even overtake (by a small margin) intel on the single core ipc, their only disadvantage/problem is basically clocks and that can be fixed relatively easily with a smaller node.

Adam Krazispeed · Nov 21, 2018

btarunr said:
AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.

Lisa Su is very careful about the guidance she puts out.

actually.... it was an IPC uplift of 52% from "excavator"

if this is another 29% MORE IPC Over zen 1, then Intel is done.....

System Name	Hotbox
Processor	AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard	ASRock Phantom Gaming B550 ITX/ax
Cooling	LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory	32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s)	PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage	2TB Adata SX8200 Pro
Display(s)	Dell U2711 main, AOC 24P2C secondary
Case	SSUPD Meshlicious
Audio Device(s)	Optoma Nuforce μDAC 3
Power Supply	Corsair SF750 Platinum
Mouse	Logitech G603
Keyboard	Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software	Windows 10 Pro

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Processor	AMD Ryzen 9 5900X \|\|\| Intel Core i7-3930K
Motherboard	ASUS ProArt B550-CREATOR \|\|\| Asus P9X79 WS
Cooling	Noctua NH-U14S \|\|\| Be Quiet Pure Rock
Memory	Crucial 2 x 16 GB 3200 MHz \|\|\| Corsair 8 x 8 GB 1333 MHz
Video Card(s)	MSI GTX 1060 3GB \|\|\| MSI GTX 680 4GB
Storage	Samsung 970 PRO 512 GB + 1 TB \|\|\| Intel 545s 512 GB + 256 GB
Display(s)	Asus ROG Swift PG278QR 27" \|\|\| Eizo EV2416W 24"
Case	Fractal Design Define 7 XL x 2
Audio Device(s)	Cambridge Audio DacMagic Plus
Power Supply	Seasonic Focus PX-850 x 2
Mouse	Razer Abyssus
Keyboard	CM Storm QuickFire XT
Software	Ubuntu

System Name	Computer of Theseus
Processor	Intel i9-12900KS: 50x Pcore multi @ 1.18Vcore (target 1.275V -100mv offset)
Motherboard	EVGA Z690 Classified
Cooling	Noctua NH-D15S, 2xSF MegaCool SF-PF14, 4xNoctua NF-A12x25, 3xNF-A12x15, AquaComputer Splitty9Active
Memory	G-Skill Trident Z5 (32GB) DDR5-6000 C36 F5-6000J3636F16GX2-TZ5RK
Video Card(s)	ASUS PROART RTX 4070 Ti-Super OC 16GB, 2670MHz, 0.93V
Storage	1x Samsung 990 Pro 1TB NVMe (OS), 2x Samsung 970 Evo Plus 2TB (data), ASUS BW-16D1HT (BluRay)
Display(s)	Dell S3220DGF 32" 2560x1440 165Hz Primary, Dell P2017H 19.5" 1600x900 Secondary, Ergotron LX arms.
Case	Lian Li O11 Air Mini
Audio Device(s)	Audiotechnica ATR2100X-USB, El Gato Wave XLR Mic Preamp, ATH M50X Headphones, Behringer 302USB Mixer
Power Supply	Super Flower Leadex Platinum SE 1000W 80+ Platinum White, MODDIY 12VHPWR Cable
Mouse	Zowie EC3-C
Keyboard	Vortex Multix 87 Winter TKL (Gateron G Pro Yellow)
Software	Win 10 LTSC 21H2

System Name	Ryzen/Laptop/htpc
Processor	R9 3900X/i7 6700HQ/i7 2600
Motherboard	AsRock X470 Taichi/Acer/ Gigabyte H77M
Cooling	Corsair H115i pro with 2 Noctua NF-A14 chromax/OEM/Noctua NH-L12i
Memory	G.Skill Trident Z 32GB @3200/16GB DDR4 2666 HyperX impact/24GB
Video Card(s)	TUL Red Dragon Vega 56/Intel HD 530 - GTX 950m/ 970 GTX
Storage	970pro NVMe 512GB,Samsung 860evo 1TB, 3x4TB WD gold/Transcend 830s, 1TB Toshiba/Adata 256GB + 1TB WD
Display(s)	Philips FTV 32 inch + Dell 2407WFP-HC/OEM/Sony KDL-42W828B
Case	Phanteks Enthoo Luxe/Acer Barebone/Enermax
Audio Device(s)	SoundBlasterX AE-5 (Dell A525)(HyperX Cloud Alpha)/mojo/soundblaster xfi gamer
Power Supply	Seasonic focus+ 850 platinum (SSR-850PX)/165 Watt power brick/Enermax 650W
Mouse	G502 Hero/M705 Marathon/G305 Hero Lightspeed
Keyboard	G19/oem/Steelseries Apex 300
Software	Win10 pro 64bit

System Name	Ryzen 5800X-PC / RyzenITX (2nd system 5800X stock)
Processor	AMD Ryzen 7 5800X (atx) / 5800X itx (soon one pc getting 5800X3D upgrade! ;)
Motherboard	Gigabyte X570 AORUS MASTER (ATX) / X570 I Aorus Pro WiFi (ITX)
Cooling	AMD Wrath Prism Cooler / Alphenhone Blackridge (ITX)
Memory	OLOY 4000Mhz 16GB x 2 (32GB) DDR4 4000 Mhz CL18, (22,22,22,42) 1.40v AT & ITX PC's (2000 Fclk)
Video Card(s)	AMD Radeon RX 6800 XT (ATX) /// AMD Radeon RX 6700 XT 12GB GDDR6 (ITX)
Storage	(Sys)Sammy 970EVO 500GB & SabrentRocket 4.0+ 2TB (ATX) \| SabrentRocket4.0+ 1TB NVMe (ITX)
Display(s)	30" Ultra-Wide 21:9 200Hz/AMD FREESYNC 200hz/144hz LED LCD Montior Connected Via Display Port (x2)
Case	Lian Li Lancool II Mesh (ATX) / Velkase Velka 7 (ITX)
Audio Device(s)	Realtek HD ALC1220 codec / Onboard HD Audio* (BOTH) w/ EQ settings
Power Supply	850w (Antec High-Current Gamer) HC-850 PSU (80+ gold certified) ATX) /650Watt Thermaltake SFX (ITX)
Mouse	Logitech USB Wireless KB & MOUSE (Both Systems)
Keyboard	Logitech USB Wireless KB & MOUSE (Both Systems)
VR HMD	Oculus Quest 2 - 128GB - Standalone + Oculus link PC
Software	Windows 10 Home x64bit 2400 /BOTH SYSTEMS
Benchmark Scores	CPUZ - ATX-5800X (ST:670) - (MT: 6836.3 ) CPUZ - ITX -5800X (ST:680.2) - (MT: 7015.2) ??? same CPU?