• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.
  • The forums have been upgraded with support for dark mode. By default it will follow the setting on your system/browser. You may override it by scrolling to the end of the page and clicking the gears icon.

AMD "Zen 2" IPC 29 Percent Higher than "Zen"

This sounds very impressive, 29% ipc for integer workloads... but that is one specific workload type, this is not a general use scenario with 29% improvement so dont get too hype and also for those trying to call this out, well it's pretty honest in it's information, but only if your workload is integer heavy. Overall hopefully they can get a 10% + improvement on ipc and clocks go up as well.


It's a mixed floating point and integer workload with what should be a pretty good amount of hits through the L2. It tells us what the core can do on its own in a roughly ideal situation to extract IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.

From what we know, the breakdown in performance improvements. for this workload, probably looks something such as this:
  • Fetch: . . . . . . 0-5% . . . . . . . .(from L2/L3/IMC)
  • Dispatch: . . . 30~35% . . . . . (next instruction counter, larger uop cache, wider dispatch width)
  • ALU: . . . . . . . 5-15% . . . . . . . (instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
  • FPU: . . . . . . . 15-33% . . . . . . (non-AVX workload, advantage comes from load bandwidth doubling).
  • Retire: . . . . . .70~80% . . . . . .(from doubling of retirement bandwidth - 128-bit to 256-bit - not 100% because of naturally imperfect scaling)

These values would average together to become the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU... with some major caveats - such as the fetch and ALU performance not being well represented in this workload - and the dispatch and retire ruling the day.

___________________

Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.

If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.

Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see cache miss penalties drop close to zero as we could execute both possibilities for a branch outcome then just move over each stage results after a branch prediction is shown true - removing the instruction bubble with a single cycle latency and resulting in nearly perfect prediction performance. This is insane in the world where power consumption is important... you will be executing (partly or in full) nearly every instruction in a program - even for branches not taken... we're talking about potentially more than doubling how much is executed for every clock cycle. Still, this would be something like a 50% IPC increase.
 
Last edited:
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
Clarified? They have never claimed otherwise.
 
If it's too good to be true, it usually is.
AMD have clarified in the meantime 29% was based on one particular benchmark. So we're basically back to square one.
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
 
I like that even when you're quoting a decades old, common saying, you still manage to get it completely wrong by omitting the word "seems" and replacing it with a contraction that makes your sentence appear to read "If it is too good to be true, it usually is".

Congratulations on your newfound grasp of this tautology.
Yeah, well, posting in a hurry between two compiles will do that ;)

Edit: fixed
 
Maybe "put an end to speculations" would have been clearer?

They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
 
They wanted to make crystal clear that the benchmark wasn't designed to be a representative workload.

It's like using Cinebench as your only performance metric... not such a good idea unless all you do is run Cinema4D.
At least it puts an upper bound on expectations, so I'm good.
 
At least it puts an upper bound on expectations, so I'm good.

That it does - it looks around 30% for any real world task is going to be max. It's kind of a way to temper the whole "we doubled the FPU" thing, I think.

Much better for headlines to read 29% improvement rather than 100% improvement.. especially when some programs will see 5% benefit and others 20%, with a few as high as 30%.

Sadly, it's hard to predict Cinebench scores for this since Cinebench relies much more heavily on branch prediction and prefetch, but we can guess it will be at least 10%, but no more than 30% - and very very unlikely to see 30%. But that's where we were before, except with a lower upper bound. I still think it will be about 15% on average.
 
Bandwidth/latency between the integer and floating point PRFs, muxes, L1D, DTLB, load buffer, etc...

It can be a little easy to forget that Zen's FPU is a dedicated unit that has to have specific points of communication with the integer+memory complex whereas Intel's floating point units are on the same pipelines as their integer units.
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?
Let's say you write A + B + C + D in your code,
this will be executed as ((A + B) + C) + D,
and while each addition should only take a few cycles, the CPU would have to wait up to 18 cycles before the next operation can even start. While the exact timing and bandwidth slightly vary between CPU architectures, this principle should largely be the same for any pipelined architecture.

That's not the issue... the L2 is really good... it's when we get inside the L3 that issues begin... and they explode once we hit the IMC.
256stride.jpg
Judging by that image, there is no issue with L3 cache at all.

Intel's main advantage is a tightly coupled low latency IMC... AMD's game is more than on point until it hits the IMC (see above graphic), which happens at any access above 8MiB...
Sure, the memory latency can be much worse due to Zen's core structure. But you were the one arguing for Zen needing better cache. I'm the one pointing out that Zen have a decent cache on paper, but Intel is much better at utilizing their cache through a better front-end.

Games, for their part, often do a pretty good job of imitating a random memory access pattern... which is why Ryzen's game performance can jump 15% or more with overclocked memory. Give Zen the same memory latency as Intel cores have and I think the Ryzen 2700X would be the gaming king per clock.
This has to do with AMD's Infinity Fabric being tied to the memory speed.
Intel see no significant difference with speeds above 2666 MHz, even memory with slightly better timings have virtually no impact.

Data sharing is insanely common in multi-threaded programs.
Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.
 
You do know that any data coming out of an ALU or FPU needs to finish the pipeline before it can be fed again?

It actually doesn't, though that used to be the case. Today, fetch and decode chunks of instructions and then determine dependencies. We tag instructions as dependent upon others and try to reroute non-dependent instructions around them before then processing the dependent instruction.

For most instructions, the core knows (within reason) how long it should take to execute and get the result back, so we don't wait for the result - we schedule the instruction so that it is already ready to be fed into a pipeline when the result is available... we want the decoded instruction tagged and sitting in the scheduler - and we want the instruction whose result we need to carry a matching tag.

An add instruction takes a single cycle, but it takes time to get the result. Intel uses a bus they call, unimaginatively, but accurately, the "result bus." This bus is fed by the store pipelines and each execution pipeline. The load and execution pipelines can read results directly off this bus if the timing works out correctly.

So, A + B + C + D would execute as mov A, result; add B, result; add C, result; add D, result;.

One trick here, which is by no means obvious or necessarily done, would be to keep the instructions in two places. You keep a copy in the scheduler, tagged, as you send the dependent instruction down the pipelines one after the other, so the next instruction can get its result from the previous instruction from the result bus.

The way Intel describes what they do (despite admitting that the execution pipelines can read from the result bus) is to send the result back to the scheduler, where the dependent instruction is waiting for the data. I genuinely suspect they do both (otherwise there's no need for an execution pipeline to read from the result bus..)... it just depends on how dependably the execution while occur within a given time frame.

Judging by that image, there is no issue with L3 cache at all.

That image is using a 256-byte stride which hides the full effect of the random-access issue until it exceeds about the cache size as Ryzen can predict the access pattern well enough.

amd-ryzen-mem-data-lat.png


You can see the (unsurprising) excellent sequential performance (which is only 2.8ns on my 2700X) ... and them the abysmal random access performance.

Intel's in-page random access performance is several times better. This is the cache performance issue that is hurting Ryzen - and it relates to how often the L3 prefetch ends up hitting the IMC instead of being able to stay within the L3. This happens increasingly more after a single core uses more than 4MiB of data. By 6MiB you have a ~50% miss rate that result in hitting memory latency.

My 2700X results are better - because my IMC latency is only 61.9ns with 3600MT/s memory.

If Zen 2 can bring that down by another 20ns while increasing how much L3 each core can access, it's going to be a big deal. My 2700X only has 9.5ns latency to the L3 - if I had 40ns latency to main memory and 16MiB of cache to access, in-page random access should fall to the 20~30ns region (depending on page size).

Intel is much better at utilizing their cache through a better front-end.

Zen's front end is extremely good. As is Intel's.

Zen's has higher throughput potential (8 uops vs 4 uops), but Intel has fusion - so that 4 uops is sometimes 7 uops... and Zen's 8 uops is sometimes 4 uops...
Intel's branch predictor seems to be better, but that's about it.

The first Zen bottleneck (if you want to call it that) is when Intel can dispatch 7 uops and Zen can only dispatch 6. Intel isn't always able to dispatch 7 uops, but Zen can never exceed 6. That's a potential 16.7% advantage to Intel.

The next Intel advantage is in their unified scheduler - which allows accessing results without going back to the scheduler. Zen, AFAICT, needs to send results back to the forwarding muxes or the PRF. This is only a couple cycles - and AMD makes up for it by having four ALU full featured pipelines and 6 independent schedulers. Being only 14 deep keeps things simple to manage, but it may mean results could need to be fetched from the L1D (3 cycle penalty) more often.

Data sharing between L3 caches, which was what we were talking about, is very rare. The lifecycle of any piece of data in cache is in microseconds or less. Cache is just a streaming-buffer for RAM, it's not like the "most important stuff" stays there. Cache is usually completely overwritten thousands of times per second.

If you mean between L3 caches to mean between each CCX or die - yes, that's true. Everything is always referenced as a memory address, the LLC acts as the insulator to main memory. However, cross communication does occur for certain volatile memory. This seems to happen via the command bus, but it also happens via the data fabric. This is probably magic that happens through the IMC without going to system memory, which would explain the latency results with core to core communication (simple test - fixed affinity, each core accessing the same memory addresses, just reading and updating a simple struct - time, accessing core, and a mutex... each core that gains the mutex records the time difference between the last access, which core made that access, updates the struct, and moves on). This showed that handling the mutex could, PEAK, take only 10ns between the CCXes (this could even be a timing mechanism inaccuracy, since this all reanalysis), but usually took way more... strong clusters at 20ns, 40~50ns, and a good half at 100ns or more (which means out to main memory).

Multi-threaded apps share data across cores, it's as simple as that, and mutexes and volatile memory are all something the CPU can figure out with ease, so optimizing for those, in very least, has been done.

Zen is however penalized when having to access a different memory controller through the Infinity Fabric, which of course is common in multithreaded workloads.

Yes, it will be extremely interesting to see how the newly unified IMC that's spread far across the IO die will work to solve some of these issues.
 
Last edited:
'The data in the footnote represented the performance improvement in a microbenchmark for a specific financial services workload which benefits from both integer and floating point performance improvements and is not intended to quantify the IPC increase a user should expect to see across a wide range of applications,' AMD's clarification continues. 'We will provide additional details on "Zen 2" IPC improvements, and more importantly how the combination of our next-generation architecture and advanced 7nm process technology deliver more performance per socket, when the products launch.'

https://bit-tech.net/news/tech/cpus/amd-downplays-29-percent-zen-2-ipc-boost-reports/1/
 
well.. based on their recent track record its quite plausible that with zen 2 they will reach and maybe even overtake (by a small margin) intel on the single core ipc, their only disadvantage/problem is basically clocks and that can be fixed relatively easily with a smaller node.
 
AMD's "59% higher" claims for Zen1 over Excavator invited the same ridicule.

Lisa Su is very careful about the guidance she puts out.



actually.... it was an IPC uplift of 52% from "excavator"

if this is another 29% MORE IPC Over zen 1, then Intel is done.....
 
Back
Top