Just a note: In most situations, the overall utilization of ALUs in a modern CPU is fairly low (for example, below 50%). The typical average IPC of most x86 applications is still in the range 1.0 - 2.0. This means that, in theory, 1 or 2 ALUs of a given type are sufficient and having 3 or more ALUs of a given type is a waste of silicon. Applications with an IPC of 4.0 are very rare. The main drivers of CPU core performance in recent years have been OoO (out-of-order) logic improvements, larger internal CPU buffers and queues, and branch prediction improvements. ALUs are relatively cheap in terms of silicon area and relatively easy to replicate on a chip; OoO logic isn't cheap and is a much harder problem to crack than ALUs.
Going "deeper" (OoO) has certainly been one of the main performance drivers since the Core 2 family, arguably even long before that. But we must not forget that going "wider" (more execution ports) goes along with it. Together with balancing the right execution units across the execution ports, power gating and so on, this is how they achieve good utilization of the execution ports and of all the resources that feed them, even though the individual execution units probably have fairly low utilization rates.
Back with Sandy Bridge, Intel had 3 execution ports to do integer or vector operations.
In Haswell they added a fourth with an ALU.
In Sunny Cove (Ice Lake/Rocket Lake) they added more execution units on the fourth port.
On Alder Lake (Golden Cove) they added the fifth execution port for int/vec, this time with an ALU and LEA unit (similar to Haswell). (Still, only 3 of the execution ports contain vector units.)
But still, there are more minor changes which add up to significant performance gains. In Sunny Cove, for example, Intel brought significantly faster integer multiplication and division. More such improvements will be possible as they move to more advanced nodes. I don't think the ALUs themselves can get much faster (they are down to a few clock cycles anyway), and those are, as you said, very cheap, but the other units probably can.
So will we see Intel going even wider? Probably, but I don't see them going straight to 8 ALUs, as it wouldn't be worth the scheduling etc. to manage it before the rest of the front-end can feed it. But as you know, at some point diminishing returns set in (the CPU front-ends are already huge), unless something changes on the software side. And I don't just mean the quality of software, but also ISA changes and compiler improvements. There could be a lot of efficiency gains if the cost of mispredictions is reduced (like a partial flush). And I'm sure both companies have a lot coming that I'm not aware of.
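To put a rough number on the misprediction point, here is a back-of-the-envelope sketch using the textbook stall model (effective CPI = base CPI + branch fraction x mispredict rate x penalty). The function name and every input value below are made up for illustration; they are not measurements of any real CPU, and "partial flush" is modelled simply as a smaller penalty.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, penalty_cycles):
    """Textbook stall model: add the average misprediction cost to the base CPI."""
    base_cpi = 1.0 / base_ipc
    # Average cycles lost per instruction due to mispredicted branches.
    stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical workload: 20% branches, 5% of them mispredicted.
full_flush = effective_ipc(2.0, 0.20, 0.05, penalty_cycles=20)
partial_flush = effective_ipc(2.0, 0.20, 0.05, penalty_cycles=10)  # assumed cheaper recovery

print(f"{full_flush:.2f}")     # 1.43
print(f"{partial_flush:.2f}")  # 1.67
```

Under these assumed inputs, halving the penalty buys more IPC than an extra ALU plausibly would, which is the kind of gain I mean.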
BTW, lots of interesting discussions here.
But why is there a large number of execution units in x86 processors? To improve SMT performance?
No, at least not the way current x86 microarchitectures implement it. (Currently they only switch between two threads.)
Multiple execution ports (each of which can hold multiple execution units) allow what we call instruction level parallelism (worth reading), which basically means that whenever the CPU finds multiple calculations that are independent of each other, it might as well execute them in parallel, and there are huge savings whenever prefetching or branching needs the result before continuing.
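A toy model may make this concrete. The sketch below is not how any real scheduler works; it assumes every operation takes one cycle, any port can run any operation, and the hypothetical `cycles_needed` helper simply starts as many ready operations per cycle as there are ports.

```python
def cycles_needed(deps, ports):
    """Count cycles to run unit-latency ops on `ports` identical ports.

    deps[i] lists the ops that op i must wait for (no cycles allowed);
    ports is how many ops can start in the same cycle (must be >= 1).
    """
    finish = {}                      # op index -> cycle in which it executed
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        cycle += 1
        # An op is ready once all of its inputs executed in an earlier cycle.
        ready = [i for i in pending
                 if all(finish.get(d, cycle) < cycle for d in deps[i])]
        for i in ready[:ports]:      # at most one op per port per cycle
            finish[i] = cycle
            pending.remove(i)
    return cycle


chain = [[]] + [[i - 1] for i in range(1, 8)]   # each op needs the previous result
independent = [[] for _ in range(8)]            # no op depends on another

print(cycles_needed(chain, ports=4))        # 8
print(cycles_needed(independent, ports=4))  # 2
```

The dependent chain takes 8 cycles no matter how many ports there are, while the independent operations finish in 2 cycles on 4 ports. Extra ports only help when the code gives the CPU independent work to find.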
We actually got this feature very early on. Back with the 80486 we got pipelining, and already with the following Pentium we got two execution ports. Pentium Pro/II added out-of-order execution. Even though these implementations were very simple compared to current designs, these concepts have evolved over decades and have been a core part of the performance gains over these years.
Current designs from Intel (Golden Cove) have 5 ports for int/vec operations + 7 ports for memory operations.
Zen has a different configuration and keeps its integer and vector engines separate. If I read the schematics correctly: 8 ports for integer and memory operations combined (4 of which have ALUs) and 6 ports for vector operations (4 for calculations, which can be fused together for FMA, and 2 for load/store). So in theory, Zen 4 is in a way "wider" than Golden Cove, but this doesn't tell us all the finer details that make up the complete picture. And then Zen 4 can seemingly only issue 6 operations/clock across its 14 ports.
If you take away one lesson for today, it's that the utilization of these ports makes up a large part of the performance characteristics of an architecture, and is a good part of the explanation why an AMD CPU can win massively in one workload, while Intel wins in another.
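As a quick sanity check on what those numbers imply, the sketch below computes the best-case average port utilization from the port count and the operations started per clock. The helper name is made up, and it assumes one operation occupies one port for one cycle and treats instructions and micro-ops as the same thing, which real CPUs do not.

```python
def max_avg_port_utilization(ops_per_clock, ports):
    """Upper bound on average port utilization if each op occupies one port for one cycle."""
    return min(1.0, ops_per_clock / ports)


# Figures from the post: 6 ops/clock issued across 14 ports.
print(f"{max_avg_port_utilization(6, 14):.0%}")    # 43%
# A typical application at IPC 2.0 on the same 14 ports.
print(f"{max_avg_port_utilization(2.0, 14):.0%}")  # 14%
```

So even at peak issue rate, the average port is idle more than half the time; the ports are there to cover different mixes of work, not to be busy all at once.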