I remain cautiously optimistic about AMD finding average IPC increases for Zen 5 and other chips in the pipeline. AMD, unlike Intel and Apple, isn't focused on single-threaded performance über alles. Their focus is on server performance, and since they use a single design for servers and desktops, they have opted for a small core that offers acceptable single-threaded performance while delivering leading multi-threaded performance for servers. They appear to have matched Alder Lake's single-threaded performance with a core that has a much smaller out-of-order window, by focusing on front-end and memory-level parallelism (MLP, i.e. load/store) improvements. They can still use all the tricks that Intel has used for Golden Cove, but that would require a larger core.
This table comparing the out-of-order engines of Zen 3 and Alder Lake is reproduced from Chips and Cheese's excellent overview of Alder Lake/Golden Cove. I've changed the order of various entries and eliminated some columns for brevity. The most important column is the last one: most structures in the out-of-order engine of Golden Cove are much larger than their counterparts in Zen 3, and even Sunny Cove (Ice Lake) has larger structures than Zen 3. Despite all that, Zen 3 beats all Intel architectures before Golden Cove.
Structure | Instruction affected if it... | Golden Cove Capacity | Zen 3 Capacity | Golden Cove vs Zen 3 |
Reorder Buffer (ROB) | Is waiting to retire (all) | 512 | 256 | 2x |
Load Queue | Reads from memory | 192 | 116 | 1.65x |
Store Queue | Writes to memory | 114 | 64 | 1.78x |
Branch Order Buffer | Affects control flow | 128 | 48 Taken, 117 Not Taken | Complicated, approximately 1.33x |
Integer Register File | Writes to an integer register | 280 (~248+32) | 192 (173 measured+32?) | 1.45x |
Flags Register File | Sets flags (often tied to integer registers on x86) | 248 | 121 | 2.04x |
Floating Point/Vector Register File | Writes to a fp/vector register | 332 (300+32) | 160 (139 measured+32?) | 2.07x |
Total Scheduler Capacity | Is waiting on an execution unit | 205 | 160 | 1.28x |
Fill Buffer | Misses L1D | 16 | 24 | 0.66x |
Superqueue | Misses L2 | 48 | 64? | 0.75x? |
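
For anyone who wants to check the last column, here is a minimal Python sketch that recomputes each ratio as Golden Cove capacity divided by Zen 3 capacity. The numbers are copied from the table; the names `CAPACITIES`, `capacity_ratio` and `ratios` are mine, purely for illustration. I've left out the Branch Order Buffer because Zen 3 has separate taken and not-taken figures, and the Zen 3 superqueue value is the uncertain "64?" from the table, so treat that ratio as tentative.

```python
# Capacities copied from the table above, as (Golden Cove, Zen 3).
# Branch Order Buffer is omitted: Zen 3 lists two figures (taken / not taken),
# so a single ratio is not well defined.
CAPACITIES = {
    "Reorder Buffer (ROB)": (512, 256),
    "Load Queue": (192, 116),
    "Store Queue": (114, 64),
    "Integer Register File": (280, 192),
    "Flags Register File": (248, 121),
    "Floating Point/Vector Register File": (332, 160),
    "Total Scheduler Capacity": (205, 160),
    "Fill Buffer": (16, 24),
    "Superqueue": (48, 64),  # Zen 3 figure is marked "64?" in the table
}


def capacity_ratio(golden_cove: int, zen3: int) -> float:
    """Return the Golden Cove capacity as a multiple of the Zen 3 capacity."""
    return golden_cove / zen3


def ratios(capacities: dict) -> dict:
    """Map each structure name to its Golden Cove / Zen 3 capacity ratio."""
    return {name: capacity_ratio(gc, z3) for name, (gc, z3) in capacities.items()}


if __name__ == "__main__":
    for name, ratio in ratios(CAPACITIES).items():
        print(f"{name:<38} {ratio:.2f}x")
```

A ratio above 1 means the Golden Cove structure is larger; only the fill buffer and (probably) the superqueue come out below 1, which matches the point that Zen 3 invests in the memory side rather than in a bigger out-of-order window.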