I can't "see" any big IPC improvements from here on. 10% max from gen to gen is my prediction. Zen 3 made a huge jump. Clocks and efficiency will determine progress until new transistor materials arrive that allow big clock jumps (graphene, anyone?).
Upcoming Sapphire Rapids / Golden Cove is a major architectural overhaul, greatly extending the CPU front-end beyond Sunny Cove. I would be surprised if AMD didn't attempt something comparable.
AMD expects clock speeds to decrease over the next nodes, so don't expect much there.
But even if materials allowed significantly higher clock speeds, the current bottlenecks would just become more apparent. Pretty much all non-SIMD workloads scale towards cache misses and branch mispredictions. Cache misses have a nearly fixed time cost (the memory latency), so increasing the CPU clock actually increases the relative cost of a cache miss. The cost of a branch misprediction in isolation is fixed in clock cycles, but mispredictions can cause secondary cache misses. My point is, if we don't reduce these issues, performance would be greatly hindered even if we were able to increase clock speeds significantly.
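To make the point concrete, here's a toy model of that effect. All the numbers (80 ns DRAM latency, 1000 compute cycles and 2 misses per unit of work) are made up for illustration, but the shape of the result doesn't depend on them: the memory-latency term is fixed in nanoseconds, so it eats a growing share of the runtime as the clock rises.

```python
# Toy model: fixed memory latency vs. clock-scaled compute time.
# Assumed numbers, chosen only to illustrate the argument above.
MEM_LATENCY_NS = 80.0    # DRAM access latency (fixed in time, not cycles)
COMPUTE_CYCLES = 1000.0  # cycles of actual work per unit of the workload
MISSES_PER_UNIT = 2.0    # cache misses per unit of work

def unit_time_ns(clock_ghz):
    """Time for one unit of work: compute scales with clock, misses don't."""
    cycle_ns = 1.0 / clock_ghz
    return COMPUTE_CYCLES * cycle_ns + MISSES_PER_UNIT * MEM_LATENCY_NS

t4 = unit_time_ns(4.0)  # 250 ns compute + 160 ns misses = 410 ns
t8 = unit_time_ns(8.0)  # 125 ns compute + 160 ns misses = 285 ns
print(f"speedup from doubling 4 -> 8 GHz: {t4 / t8:.2f}x")  # ~1.44x, not 2x
```

At 4 GHz the two misses cost 640 cycles; at 8 GHz the same 160 ns costs 1280 cycles, so doubling the clock buys well under a 2x speedup.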
The changes we know Intel will bring are more of the same as what both Sunny Cove and Zen 3 brought us: larger instruction windows, larger uop caches, larger register files, and possibly more execution ports. IPC gains are certainly still possible; I believe further generational jumps in the 20-30% range are achievable. But these changes will obviously have diminishing returns, and there are limits to how much parallelization can be extracted without rewriting the code and/or extending the ISA. I know Intel is working on something called "threadlets", which may solve a lot of the pipeline flushes and stalls. If that is successful, we could easily be looking at a 2-3x performance increase.
So you're saying Apple has magic technology that makes general purpose code run on fixed-function hardware accelerators? Or did they tune their chip specifically for GeekBench?
Geekbench showcases Apple's accelerators. It is a benchmark of specific implementations, not something that translates into generic performance. Geekbench is useless for anything but showcasing special features.
IPC is instructions per clock. If the performance of the new M1 in some application or benchmark, say darktable or SPEC, is the same as that of the latest Intel Tiger Lake processor, the instruction counts of the executables are similar, and the M1's clock speed is lower than Tiger Lake's, then the M1's IPC is higher than Tiger Lake's.
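That back-calculation can be sketched in a few lines. The instruction count, runtimes, and clocks below are hypothetical placeholders (only the M1's ~3.2 GHz and Tiger Lake's ~4.7 GHz boost clock reflect the actual chips); the point is just the arithmetic: equal time at a lower clock implies higher IPC.

```python
# Hypothetical run: both chips execute a binary with a similar dynamic
# instruction count and finish in the same wall-clock time.
INSTRUCTIONS = 50e9  # assumed dynamic instruction count (illustrative)

def ipc(runtime_s, clock_ghz):
    """Instructions per clock = instructions / total cycles elapsed."""
    cycles = runtime_s * clock_ghz * 1e9
    return INSTRUCTIONS / cycles

ipc_m1 = ipc(runtime_s=10.0, clock_ghz=3.2)   # M1 Firestorm clock
ipc_tgl = ipc(runtime_s=10.0, clock_ghz=4.7)  # Tiger Lake boost clock
print(ipc_m1 > ipc_tgl)  # same time, lower clock -> higher IPC
```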
What am I missing here?
edit: for example take these results:
View attachment 183964
<snip>
And to clarify, the above test makes no use of fixed function accelerators.
You are right that this doesn't use special acceleration, but you're missing that this is a "pure" ALU and FPU benchmark. There is nothing preventing an ARM design from having comparable ALU or FPU performance. This only stresses a small part of the CPU, while the front-end and the caches are not stressed at all. Such benchmarks are fine for showcasing various strengths and weaknesses of architectures, but they don't translate into real-world performance. The best benchmark is always real workloads, and even if synthetic workloads are used, there should be a large variety of them, preferably algorithms, not single unrealistic workloads.
The M1 has 50% higher IPC than the latest Ryzen, can you see that bound? There is no way the next Ryzen will have more than 4 decoders, and no way it will gain 50% IPC, can you see that bound? This is the main change the M1 brings.
Comparing IPC across ISAs means you don't even know what IPC means.
I don't believe there is anything preventing microarchitectures from Intel or AMD from having more than 4 decoders; you can pipeline many of them, and there is a good chance the upcoming Sapphire Rapids / Golden Cove actually will.