I've already answered why: the scalability of something like that would be atrocious. A single core with dozens of ALUs and FPUs running one thread will be significantly slower than an 8-core CPU; there is simply not enough parallelism that can be extracted out of a single thread of instructions to keep such a wide core busy.
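To illustrate the point (a rough sketch, not modeled on any particular CPU): in the first loop below every iteration depends on the result of the previous one, so extra ALUs sit idle no matter how many you add; in the second, the partial sums are independent and a wide out-of-order core can actually overlap them.

```cpp
#include <cstdint>

// Loop-carried dependency: iteration i needs the result of iteration i-1,
// so the chain runs one multiply-add at a time no matter how many ALUs exist.
uint64_t serial_chain(uint64_t x, int n) {
    for (int i = 0; i < n; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; // each step waits on the last
    return x;
}

// Independent work: the four partial sums have no dependencies between them,
// so a wide core can issue them in parallel (and the work also splits
// trivially across real cores).
uint64_t independent_sums(const uint64_t* a, int n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```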
It is also the reason SMT exists, and has existed since the days of the Pentium 4 with Hyper-Threading.
There's only so far a stream of instructions can be divided and parallelized before hitting a conditional instruction that splits the logic onto unknown paths (in some cases somewhat known, thanks to even more transistors called the branch predictor).
The problem with branch predictors is that in many cases they get it wrong, and the CPU then has to throw away the speculated work and start over from the last point that wasn't a guess.
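The classic way to see that cost (just a sketch, exact numbers vary by CPU): summing only the values above a threshold is much faster on sorted data, because the branch becomes predictable; on shuffled data the predictor is wrong about half the time and the pipeline keeps getting flushed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Sum only elements >= 128. The branch is unpredictable on shuffled data
// and nearly perfectly predictable on sorted data.
static uint64_t sum_above_threshold(const std::vector<uint8_t>& v) {
    uint64_t sum = 0;
    for (uint8_t x : v)
        if (x >= 128)          // the predictor must guess this every iteration
            sum += x;
    return sum;
}

static double time_ms(const std::vector<uint8_t>& v) {
    auto t0 = std::chrono::steady_clock::now();
    volatile uint64_t sink = sum_above_threshold(v);
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<uint8_t> data(1 << 24);
    std::mt19937 rng(42);
    for (auto& x : data) x = static_cast<uint8_t>(rng());

    printf("random data: %.1f ms\n", time_ms(data)); // lots of mispredictions
    std::sort(data.begin(), data.end());
    printf("sorted data: %.1f ms\n", time_ms(data)); // branch is now predictable
}
```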
Basically, in CPU design all the low-hanging fruit was plucked a long, long time ago, and today all designers can do is shave data-movement latency a bit here and a bit there, extracting that last 2% of performance.
There is just no way to make x86-based CPUs faster, or any CPU for that matter (CISC or RISC), when talking about IPC for a single thread.
And of course frequency (clock speed) can still rise as materials get better and manufacturing tech improves, but only a little bit... electricity (and light) has a maximum speed, and it cannot be made to travel faster, not in this universe.
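A quick back-of-the-envelope on why (assuming a 5 GHz clock just as an example): one cycle is 0.2 ns, and even at the vacuum speed of light a signal covers only about 6 cm in that time, and real on-chip signals are considerably slower than that.

```cpp
#include <cstdio>

int main() {
    const double c = 299792458.0;   // speed of light in vacuum, m/s
    const double freq_hz = 5.0e9;   // 5 GHz clock, picked as an example
    const double cycle_s = 1.0 / freq_hz;

    // Upper bound only: real on-chip signals travel at a fraction of c.
    printf("cycle time: %.3g ns, max signal travel per cycle: %.1f cm\n",
           cycle_s * 1e9, c * cycle_s * 100.0);
    // Prints roughly: cycle time: 0.2 ns, max signal travel per cycle: 6.0 cm
}
```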
The ONLY way forward for more performance is at the software level, when the software itself is designed to run in parallel on multiple cores.
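A minimal sketch of what "designed to run in parallel" looks like in practice (a toy example; a real program would use a thread pool, OpenMP, TBB and so on): each thread gets its own independent slice of the work, so throughput scales with core count instead of with single-thread IPC.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<uint64_t> data(1 << 24, 1);
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<uint64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;

    // Each thread sums its own disjoint slice: no shared mutable state,
    // so the work scales with the number of cores, not with IPC.
    size_t chunk = data.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == n_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, uint64_t{0});
        });
    }
    for (auto& w : workers) w.join();

    uint64_t total = std::accumulate(partial.begin(), partial.end(), uint64_t{0});
    printf("sum = %llu using %u threads\n", (unsigned long long)total, n_threads);
}
```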
It's probably possible to scale core counts into the millions over the next 100 years, even on silicon, by making three-dimensional CPUs, assuming the software is capable of dividing its work that finely and the tech finds a way to cool such 3D CPUs (maybe via micro water channels running through the chip? IBM has done it, experimentally...).
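Whether millions of cores ever pay off comes down to Amdahl's law: speedup = 1 / (s + (1 - s) / N), where s is the fraction of the program that stays serial. A quick calculation (just a sketch) shows that even 1% of serial code caps a million-core machine at roughly 100x, so the "dividing itself that much" part really is the hard bit.

```cpp
#include <cstdio>

// Amdahl's law: speedup = 1 / (serial + (1 - serial) / cores)
static double amdahl(double serial_fraction, double cores) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main() {
    const double cores[]  = {8, 64, 1024, 1e6};
    const double serial[] = {0.10, 0.01, 0.001};
    for (double s : serial) {
        printf("serial fraction %.1f%%:", s * 100.0);
        for (double n : cores)
            printf("  %8.0f cores -> %6.1fx", n, amdahl(s, n));
        printf("\n");
    }
    // With 1% serial code, a million cores give only about a 100x speedup;
    // the software has to divide almost perfectly for huge core counts to matter.
}
```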