As some have pointed out, IPC is a per-thread (single-thread) metric. What you probably meant is saturation of the core's resources, but it's important to understand that SMT, even in perfect conditions, never exceeds the performance of a single "optimal" thread. It's simply a way to let other threads use the resources the first thread leaves idle, scaling towards the throughput of one "optimal" thread.
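To make that concrete, here is a minimal sketch (assuming Linux, g++ and -pthread) that pins work to one physical core and compares one thread against two SMT siblings running the same dependency-limited loop: the pair roughly doubles combined throughput by filling idle issue slots, but neither thread ever beats a single thread's own peak. The sibling CPU id (1) is an assumption; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine.

```cpp
// Sketch only: pin threads to the two logical CPUs of one physical core and
// compare one thread vs. two SMT siblings on a dependency-limited loop.
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// A long dependency chain: each step waits on the previous result, so a
// single thread leaves most issue slots idle and SMT has room to help.
static uint64_t dependent_work(uint64_t iters) {
    uint64_t x = 1;
    for (uint64_t i = 0; i < iters; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

static void run_pinned(int cpu, uint64_t iters, uint64_t* sink) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    *sink = dependent_work(iters);
}

int main() {
    const uint64_t iters = 200'000'000;
    uint64_t a = 0, b = 0;

    auto t0 = std::chrono::steady_clock::now();
    std::thread one(run_pinned, 0, iters, &a);   // single thread on CPU 0
    one.join();
    auto t1 = std::chrono::steady_clock::now();

    std::thread ta(run_pinned, 0, iters, &a);    // logical CPU 0
    std::thread tb(run_pinned, 1, iters, &b);    // assumed SMT sibling of CPU 0
    ta.join();
    tb.join();
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("1 thread:  %lld ms for %llu iterations\n",
                (long long)ms(t1 - t0), (unsigned long long)iters);
    std::printf("2 threads: %lld ms for %llu iterations total (same core)\n",
                (long long)ms(t2 - t1), (unsigned long long)(2 * iters));
    return (int)(a ^ b);  // keep the results live
}
```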
There are several factors that impact IPC. One is to add more execution resources (ALUs, FPUs, AGUs, etc.), which raises peak performance but can leave those resources unsaturated. The other is front-end, latency and cache improvements, which improve the utilization of the execution resources you already have. Since SMT relies on exploiting a core's idle resources for other threads, the ever-increasing efficiency of CPU architectures is actually making SMT less and less useful for generic tasks, as efficiency gains in the front-end and caches will ultimately consume the "gains" of SMT.
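As a rough illustration of what "unsaturated execution resources" means in practice (a sketch, not any vendor's benchmark; compile with something like g++ -O2, and results will vary by compiler and core): the same additions written as one dependency chain versus four independent accumulators. The second version keeps several FP pipes busy per cycle; the headroom the first version leaves behind is exactly what SMT feeds on.

```cpp
#include <chrono>
#include <cstdio>

// Single dependency chain: every add must wait for the previous result.
static double chained(const double* v, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i) s += v[i];
    return s;
}

// Four independent accumulators: several adds can be in flight at once.
static double unrolled(const double* v, long n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i + 3 < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main() {
    const long n = 1 << 15;      // ~256 KB of doubles: stays cache-resident
    const int reps = 4096;
    double* v = new double[n];
    for (long i = 0; i < n; ++i) v[i] = 1.0 / (i + 1);

    auto time = [&](double (*f)(const double*, long), const char* name) {
        double s = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r) s += f(v, n);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%-9s sum=%.3f  %lld ms\n", name, s,
                    (long long)std::chrono::duration_cast<
                        std::chrono::milliseconds>(t1 - t0).count());
    };
    time(chained, "chained");
    time(unrolled, "unrolled");
    delete[] v;
}
```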
SMT was introduced at a time when single-core CPUs were mostly idle due to stalls in the pipeline, and the cost of implementing SMT in silicon was minuscule. But these days, as the gains from SMT are shrinking and its security implications keep driving the silicon cost up, it's actually time to drop it, not extend it further with 4-way or even 8-way SMT. Today, SMT only really makes sense for server workloads where latency is irrelevant and total throughput of massive amounts of requests (or work items) is the primary goal. SMT is really a relic of the past, and 2020 is not the year to push it further.
While future gains in CPU performance won't get close to the improvements we saw in the 80s and 90s, it's important to remember that the "stagnant" single-thread performance of the last ~4+ years is not due to any theoretical limit on IPC. Obviously we have hit a "clock wall" for the current type of semiconductors, but the primary reason for Intel's stagnant CPU lineup is the node problems that delayed Ice Lake (Sunny Cove) by two years, which they claim offers an 18% IPC gain. Both Intel and AMD have their next 2-3 architectures lined up, and it is absolutely possible in theory to achieve ~50% better IPC over Skylake just by continuing to add execution resources, improving caches, reducing latencies and improving the front-end.
But even beyond that, single-thread performance will not hit a wall any time soon. Quite the opposite: we are on the verge of the largest single-thread gain since the 90s. Since the Pentium (1993), x86 CPUs have become increasingly superscalar, which obviously does wonders for peak performance, but it also keeps widening the gap between minimum/average and peak performance, as the CPU becomes more dependent on the code keeping its resources saturated. As anyone familiar with machine code knows, there are two major causes of this lack of saturation: cache misses and branch mispredictions. Optimizing for cache misses can be done fairly efficiently, but branch mispredictions are harder to deal with. Largely it's about removing bloat, but you will usually still have enough branching left to hold back performance.

And within the scope of even a single function, most branching only has local effects, but the CPU can't know that, so when a branch mispredicts it has to flush the pipeline, even if some of the in-flight calculations are still "good". This happens because a lot of context is lost between your high-level code and the machine code, and even the best prediction models will only get you so far without some extra "help". I know Intel is researching a solution to this problem, where these dependencies between branches are made explicit in the machine code (e.g. this branch only affects this code over here, not the bigger flow of the program). I believe they call it "threadlets" or something similar, and it would probably be done by having chains of instructions that are independent of branching in others, like a sort of "thread" that only exists virtually for a few dozen instructions. While this would at least require recompiling software, it would greatly improve the front-end's ability to reason about true dependencies between calculations, instead of having to assume the whole pipeline needs to be flushed. Gains in single-threaded performance of 2-3x should not be unreasonable. While what I'm describing here may seem a little out of scope, it's actually not, as this would practically eliminate the need for SMT. But don't expect it in shipping products yet; it's still experimental, and I would expect it 5-10 years down the road.
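For the branch-misprediction part specifically, a toy example (illustrative only; an optimizing compiler may if-convert both loops and erase the difference): summing random values above a threshold with an unpredictable branch versus a branchless select.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

static int64_t sum_branchy(const std::vector<int>& v) {
    int64_t s = 0;
    for (int x : v)
        if (x >= 128)             // ~50% taken on random bytes: mispredicts a lot
            s += x;
    return s;
}

static int64_t sum_branchless(const std::vector<int>& v) {
    int64_t s = 0;
    for (int x : v)
        s += (x >= 128) ? x : 0;  // typically compiles to a conditional move/select
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> v(1 << 24);
    for (int& x : v) x = rng() & 255;

    auto time = [&](auto f, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        int64_t s = f(v);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%-12s sum=%lld  %lld ms\n", name, (long long)s,
                    (long long)std::chrono::duration_cast<
                        std::chrono::milliseconds>(t1 - t0).count());
    };
    time(sum_branchy, "branchy");
    time(sum_branchless, "branchless");
}
```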
Actually, you've got this the wrong way around. In ideal conditions, SMT would not be needed at all; the only reason there are gains from SMT is that threads don't saturate the CPU enough. When you have ideal software as you said, branch- and cache-optimized, it will saturate the CPU very well.
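A quick sketch of the cache half of that claim (sizes are arbitrary; the effect only shows once the matrix exceeds the last-level cache): summing the same matrix row-by-row versus column-by-column. The sequential walk keeps the core fed; the strided walk mostly stalls on memory.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;                        // 4096x4096 doubles = 128 MB
    std::vector<double> m(static_cast<size_t>(N) * N, 1.0);

    auto time = [&](bool row_major, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                // Row-major walk is sequential and prefetch-friendly;
                // column-major walk strides by a full row per access.
                s += row_major ? m[static_cast<size_t>(i) * N + j]
                               : m[static_cast<size_t>(j) * N + i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%-10s sum=%.0f  %lld ms\n", name, s,
                    (long long)std::chrono::duration_cast<
                        std::chrono::milliseconds>(t1 - t0).count());
    };
    time(true, "row-major");
    time(false, "col-major");
}
```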
SMT is mostly useful for server workloads where you have an "endless" supply of work chunks that can be done in parallel, which is very typical for a server running worker threads for Java code or scripts. That kind of code can't really be cache optimized and is heavily abstracted, so the CPU more or less constantly stalls. This is where 4-way and even 8-way SMT makes sense (like in POWER CPUs), and even then the execution part of the core is largely idle; the bottleneck becomes the front-end and the caches, otherwise you could make a 32-way SMT CPU and keep scaling.
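Something like this toy model of that workload shape (names and sizes are mine, not from any real server): an "endless" queue of independent work items drained by one worker per logical CPU, where each item chases pointers through a buffer far larger than cache, so the workers spend most of their time stalled, which is exactly the idle time an SMT sibling can soak up.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

// Pointer-chasing over a shuffled index array: almost every step misses cache.
static uint32_t chase(const std::vector<uint32_t>& next, uint32_t start, int steps) {
    uint32_t i = start;
    for (int s = 0; s < steps; ++s) i = next[i];
    return i;
}

int main() {
    const uint32_t n = 1u << 24;               // 16M entries, ~64 MB
    std::vector<uint32_t> next(n);
    for (uint32_t i = 0; i < n; ++i) next[i] = i;
    std::shuffle(next.begin(), next.end(), std::mt19937(1));

    std::atomic<int> remaining{4096};          // the "endless" supply of work items
    std::atomic<uint64_t> sink{0};

    auto worker = [&](uint32_t seed) {
        while (remaining.fetch_sub(1) > 0)     // grab the next work chunk
            sink += chase(next, seed % n, 10'000);
    };

    // One worker per logical CPU, SMT siblings included.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker, w * 2654435761u);
    for (auto& t : pool) t.join();

    std::printf("%u workers drained the queue (sink=%llu)\n",
                workers, (unsigned long long)sink.load());
}
```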
Oh, there can be so many, too many to discuss here. It depends on how many threads you spawn, how they are synchronized and, of course, how much your application is "disturbed" by background threads.