Thursday, September 15th 2022

IPC Comparisons Between Raptor Cove, Zen 4, and Golden Cove Spring Surprising Results

OneRaichu, who has access to engineering samples of both the AMD "Raphael" Ryzen 7000-series and Intel 13th Gen Core "Raptor Lake," performed IPC comparisons between the two by disabling the E-cores on "Raptor Lake," fixing the clock speeds of both chips to 3.60 GHz, and testing them across a variety of DDR5 memory configurations. The IPC testing was done with SPEC, a mostly enterprise-relevant benchmark, but one that could prove useful in predicting where moderately-clocked enterprise processors such as EPYC "Genoa" and Xeon Scalable "Sapphire Rapids" land in the performance charts. OneRaichu also threw in scores obtained from a 12th Gen Core "Alder Lake" processor for this reason, as its "Golden Cove" P-core powers "Sapphire Rapids" (albeit with more L2 cache).

With DDR5-4800 memory, testing SPEC CPU2017 Rate-1 at 3.60 GHz, the AMD "Zen 4" core ends up with the highest score in SPECint, topping even the "Raptor Cove" P-core: it scores 6.66, compared to 6.63 for "Raptor Cove" and 6.52 for "Golden Cove." In the SPECfp tests, however, the "Zen 4" core falls behind "Raptor Cove." Here, it scores 9.99, compared to 9.91 for "Golden Cove" and 10.21 for "Raptor Cove." Things get interesting at DDR5-6000, a frequency AMD considers its "sweet spot." The 13th Gen "Raptor Cove" P-core tops SPECint at 6.81, compared to 6.77 for "Zen 4" and 6.71 for "Golden Cove." SPECfp sees "Zen 4" fall behind even "Golden Cove" at 10.04, compared to 10.20 for "Golden Cove" and 10.46 for "Raptor Cove."
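Expressed as percentages, the gaps between these quoted scores are small; a quick sketch using only the figures above (no other data assumed):

```python
# Relative spread of the SPEC CPU2017 Rate-1 scores quoted above,
# all cores locked at 3.60 GHz.
scores = {
    "DDR5-4800 int": {"Zen 4": 6.66, "Raptor Cove": 6.63, "Golden Cove": 6.52},
    "DDR5-4800 fp":  {"Zen 4": 9.99, "Raptor Cove": 10.21, "Golden Cove": 9.91},
    "DDR5-6000 int": {"Zen 4": 6.77, "Raptor Cove": 6.81, "Golden Cove": 6.71},
    "DDR5-6000 fp":  {"Zen 4": 10.04, "Raptor Cove": 10.46, "Golden Cove": 10.20},
}

for test, cores in scores.items():
    leader = max(cores, key=cores.get)
    spread = (cores[leader] / min(cores.values()) - 1) * 100
    print(f"{test}: {leader} leads, best-to-worst spread {spread:.1f}%")
```

Even the largest gap (SPECfp at DDR5-6000) stays in the low single digits.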
The big surprise here is just how good the "Gracemont" E-cores are in SPECint. OneRaichu made a distinction between the "Gracemont" E-cores of "Alder Lake" (GLC-12) and those of "Raptor Lake" (GLC-13), as the latter have double the shared L2 cache per E-core cluster. The E-core is fast approaching IPC levels comparable to those of "Skylake," which is precisely Intel's calculation in giving its processors a large number of E-cores next to a small number of P-cores. The idea is that the E-cores will soak up the moderately-intensive compute workloads and background processes, keeping the P-cores free for gruelling compute-heavy tasks.
Source: OneRaichu (Twitter)

34 Comments on IPC Comparisons Between Raptor Cove, Zen 4, and Golden Cove Spring Surprising Results

#26
JustBenching
progsteThe cores per CCD are the same (8); if anything the 5950x should put out more heat since it has two of them, but the opposite is true, as AMD refined their manufacturing process.
By "hard to cool" everyone compares similar coolers in similar use conditions, not watt for watt.
Also, if the TDP is 170 W, it will do at most that at stock.
You don't understand the fundamental parts of thermodynamics. The 5950x is power-limited to the same wattage as the 5800x, but that wattage is spread out over double the die area, which is why it's easier to cool.

Zen 4 has an even smaller die size but even higher power draw, which will make it way harder to cool than Zen 3. On the other hand, Raptor will have a bigger die than Alder Lake but similar power draw, which makes it easier. Assuming the Zen 4 rumors are true and the 7950x draws north of 200 W, it will be way harder to cool than the 13900k at 250 W. That's just physics.
#27
progste
fevgatosYou don't understand the fundamental parts of thermodynamics. The 5950x is power-limited to the same wattage as the 5800x, but that wattage is spread out over double the die area, which is why it's easier to cool.

Zen 4 has an even smaller die size but even higher power draw, which will make it way harder to cool than Zen 3. On the other hand, Raptor will have a bigger die than Alder Lake but similar power draw, which makes it easier. Assuming the Zen 4 rumors are true and the 7950x draws north of 200 W, it will be way harder to cool than the 13900k at 250 W. That's just physics.
The die size is the same; the 5950x uses two 8-core chiplets while the 5800x uses one of them.
We'll see once the chips are out, but something tells me the 13900k will be another miniature stove while the 7950x will be reasonable.
#28
JustBenching
progsteThe die size is the same; the 5950x uses two 8-core chiplets while the 5800x uses one of them.
We'll see once the chips are out, but something tells me the 13900k will be another miniature stove while the 7950x will be reasonable.
You are confusing the IHS with the die. The IHS is the same, yes; the die isn't. The 5950x has two CCDs of 85 mm² each. The 5800x has one.
#29
Operandi
RichardsRaptor Cove is still superior on an older node... Intel's architecture is more advanced
But AMD is matching Intel's performance using significantly fewer transistors, so clearly AMD is still superior.

The reality is that they are both very different, and it looks like both are good designs; AMD and Intel will be pretty much directly competing overall.
AlwaysHopeThose scores are pretty close, if not within the margin of error. It's like splitting hairs here... I also think BIOS immaturity with RPL could be a handicap.
It's just one test, but it is pretty insane just how close these very different architectures perform when normalized to the same clock; I would not have expected that at all.
#30
efikkan
agent_x007IPC is a constant (and depends on task), and it is independent of core frequency (which is why you multiply both together to approximate performance, FYI).

The higher the core frequency, the more it will be skewed by bus/IMC/DRAM performance, and the higher the chance of throttling based on cooling/power requirements …
This is a typical misconception.
Real IPC is a constant given by the architectural design: it's the architecture's ability to process instructions across "any" workload, measured in instructions per clock. Real IPC isn't possible for us to measure, so we approximate it by locking the clock speed far below any throttling point, choosing memory hopefully fast enough not to cause a bottleneck, and hopefully selecting a good mix of workloads able to saturate a single core. What we get is a relative IPC, which is an approximation, and the quality of this approximation depends on the aforementioned factors, which will affect the benchmark scores.
#31
Wirko
efikkanThis is a typical misconception.
Real IPC is a constant given by the architectural design: it's the architecture's ability to process instructions across "any" workload, measured in instructions per clock. Real IPC isn't possible for us to measure, so we approximate it by locking the clock speed far below any throttling point, choosing memory hopefully fast enough not to cause a bottleneck, and hopefully selecting a good mix of workloads able to saturate a single core. What we get is a relative IPC, which is an approximation, and the quality of this approximation depends on the aforementioned factors, which will affect the benchmark scores.
How do you account for the fact that different instructions take different numbers of cycles to execute, from zero (sometimes, if the front end manages to fuse two instructions into one micro-op) to several tens (division, whose execution time also depends on the actual data being divided)?
How do you account for the fact that, as an example, a Skylake core can do four non-vector additions at the same time (they probably execute in one cycle, but I haven't checked) but only one division (which, again, takes many cycles to execute)?
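The point behind these questions can be made concrete with a toy throughput model: assume, purely hypothetically, a core that can retire four adds per cycle but only one divide every 20 cycles, and watch how the instruction mix moves the achievable IPC. None of these numbers are measured figures for any real core:

```python
# Toy model: per-cycle retire throughput for two instruction classes.
# 4 adds/cycle and 1 divide per 20 cycles are illustrative assumptions.
THROUGHPUT = {"add": 4.0, "div": 1 / 20}  # instructions per cycle

def effective_ipc(mix: dict) -> float:
    """Estimate IPC for a mix, treating each class as throughput-limited
    and summing the cycles serially (a pessimistic simplification)."""
    cycles = sum(count / THROUGHPUT[op] for op, count in mix.items())
    return sum(mix.values()) / cycles

print(effective_ipc({"add": 1000, "div": 0}))    # pure adds: IPC of 4
print(effective_ipc({"add": 1000, "div": 100}))  # a few divides crater it
```

A 10% sprinkling of divides drags the toy IPC from 4 down to below 0.5, which is exactly why measured IPC is a property of the workload as much as of the core.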
#32
Steevo
WirkoHow do you account for the fact that different instructions take different numbers of cycles to execute, from zero (sometimes, if the front end manages to fuse two instructions into one micro-op) to several tens (division, whose execution time also depends on the actual data being divided)?
How do you account for the fact that, as an example, a Skylake core can do four non-vector additions at the same time (they probably execute in one cycle, but I haven't checked) but only one division (which, again, takes many cycles to execute)?
Because that is the actual real-world effect of architecture on IPC in real-world software at a set frequency, so we can determine the efficiency of an architecture at a given task.


I seriously don't know how that is so hard to understand by so many.

Architecture A may be great at software X, while architecture Y may excel with software Z, and it's a balancing act to make one great at everything. That is also why an architecture great at in-order execution has a long, deep pipeline, but an out-of-order architecture must have either a shallow pipeline, or a great branch-prediction unit and lots of cache.


Why are Arm CPUs so good in phones and closed environments? They have a closed environment and can be optimized for typical handheld devices. The same program can run significantly faster on a desktop CPU through an emulator, though, so which architecture is superior? Which has the higher IPC?



#33
progste
Arm is built on a RISC architecture, which means it has a simpler and smaller instruction set, which means less die area and lower power.
x86 is a CISC architecture, which means it has a wider set of instructions, some of which are very complex and take a lot of hardware and power to implement.

The advantage of RISC is efficiency on small tasks; the advantage of CISC is performance on highly complex tasks. Neither is superior in absolute terms.
In other words, the x86 CPU can do the same thing with fewer instructions, so this doesn't really reflect IPC.
#34
Wirko
efikkanReal IPC is a constant and is given by the architectural design, it's the architecture's ability to process instructions across "any" workload
So the real IPC of the Haswell or Skylake architecture is 6, is that what you mean? It's been calculated by people who seem to know the architecture well enough.
stackoverflow.com/questions/37041009/what-is-the-maximum-possible-ipc-can-be-achieved-by-intel-nehalem-microarchitect
btarunrThe big surprise here is just how good the "Gracemont" E-cores are in SPECint. OneRaichu made a distinction between the "Gracemont" E-cores of "Alder Lake" (GLC-12) and those of "Raptor Lake" (GLC-13), as the latter have double the shared L2 cache per E-core cluster. The E-core is fast approaching IPC levels comparable to those of "Skylake," which is precisely Intel's calculation in giving its processors a large number of E-cores next to a small number of P-cores. The idea is that the E-cores will soak up the moderately-intensive compute workloads and background processes, keeping the P-cores free for gruelling compute-heavy tasks.
This was single-threaded benchmarking. While it does reveal a lot, it would have been great if it had also been done with two threads and four threads.

2 threads on a single P core vs. 2 threads on the same E-core cluster: each thread's performance on P should drop sharply (by 35% or so), but what about E?

4 threads on two P cores vs. 4 threads on the same E-core cluster: similar, but the E-cores would be even more constrained because they share L2 and access to the L3 and bus.

There may be optimisations (or regressions, for that matter) in how a P core handles SMT, and such benchmarking would have exposed that.
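This thought experiment is easy to put in numbers: if each of two SMT threads on a P core runs at 65% of single-thread speed (the ~35% drop mentioned above), the core's aggregate throughput is 1.30x, and a hypothetical E-core-cluster penalty can be compared the same way. All the per-thread scaling figures below are assumptions, not measurements:

```python
def smt_aggregate(threads: int, per_thread_scale: float) -> float:
    """Aggregate throughput relative to one thread running alone."""
    return threads * per_thread_scale

# Assumed per-thread slowdowns (illustrative only):
p_core_smt = smt_aggregate(2, 0.65)   # 2 SMT threads, each at 65% -> 1.30x
e_pair     = smt_aggregate(2, 0.90)   # 2 E-cores sharing L2, each at 90%
print(f"P core with SMT: {p_core_smt:.2f}x, E-core pair: {e_pair:.2f}x")
```

Under these made-up numbers the shared E-core cluster would come out well ahead per pair of threads, which is precisely what multi-thread benchmarking would have confirmed or refuted.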