Sure, down to single instructions can be slightly faster or slower on various architectures. In my tests, I've seen some cases where Haswell is slower than Sandy Bridge, but in most cases it's faster. The problem here is that this is a benchmark of a single operation in a loop, this is a synthetic test case which will exaggerate the real world difference. The reason why he runs the loop 1.000.000.000.000 times is to get a measurable difference. Also, it's not like these operations are different alternatives to solve the same problem. It's not unlikely that you can find older architectures which can do certain simple operations like this faster, while modern architectures are optimized for saturating several execution ports and doing a mix of various types of operations. This is why such benchmarks can be very misguiding.
When doing real optimization of code, it's common to benchmark whole algorithms or larger pieces of code to see the real world difference of different approaches. It's very rare that you'll find a larger piece of code that performs much better on Skylake and a competing alternative which performs much better on let's say Zen 2. Any difference that you'll find for single instructions will be less important than the overall improvements of the architecture. And it's not like there will be an "Intel optimization", Intel has changed the resource balancing for every new architecture, so has AMD.
Interestingly the sample code in that video scales poorly with many cores, but should be able to scale nearly linearly if the work queue is implemented smarter.
Instruction level parallelism is already heavily used, there is no need to spread the ALUs, FPUs, etc. across several cores, the distance would make a synchronization nightmare. We should expect future architectures to continue to scale their superscalar abilities. But I don't doubt that someone will find a clever way to utilize "idle transistors" in some of these by manipulating clock cycles etc.
The problem with superscalar scaling is keeping execution units fed. Both Intel and AMD currently have four integer pipelines. Integer pipelines are cheap (both in transistors and power usage), so why not double or quadruple them? Because they would struggle to utilize them properly. Both of them have been increasing instruction windows with every generation to try to exploit more parallelism, and Intel's next gen Sapphire Rapids/Golde Cove is allegedly featuring a massive 800 entry instruction window (Skylake has 224, Sunny Cove 352 for comparison). And even with these massive CPU front-ends, execution units are generally under-utilized due to branch mispredictions and cache misses. Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.