The attacker thread in a side-channel attack is not directly reading the cache line; it merely probes the cache to infer which lines the victim thread has used. Access to a shared cache is all that's needed to carry out such an attack, but on the flip side, the chance of getting useful information out of it is absurdly small.
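To make the probing concrete, here is a minimal sketch of the reload step of a Flush+Reload attack, assuming x86, a cache line physically shared with the victim (e.g. in a shared library), and a calibrated timing threshold; the function names and the 80-cycle value are placeholders:

```c
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtscp */

/* Time a single access to `addr`. A fast reload (tens of cycles) means the
 * line was in the cache, i.e. the victim touched it since the last flush;
 * a slow reload means it was not. */
static uint64_t time_access(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                             /* the timed load */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* THRESHOLD is machine-specific and must be calibrated; 80 is a placeholder. */
#define THRESHOLD 80

/* One Flush+Reload round: evict the shared line, let the victim run,
 * then time the reload to learn whether the victim touched it. */
static int victim_touched_line(volatile uint8_t *shared_line)
{
    _mm_clflush((const void *)shared_line); /* evict the line */
    _mm_mfence();
    /* ... give the victim a chance to execute ... */
    return time_access(shared_line) < THRESHOLD;
}
```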
Unless you are talking about extracting metadata here, cached data is not a problem.
Even non-speculative execution has sensitive data sitting in L1/L2/L3 all the time, as the CPU constantly does context switches without flushing the caches.
The issue with speculative execution is that sensitive data gets loaded into registers, etc., or whole instructions are even executed, before the speculation is discarded, and some of that data can be extracted before it is cleaned up (or overwritten). Implementing all instructions with proper safeguards in place would eliminate this problem (and all Spectre-class bugs). That will certainly create design constraints, but speculative execution as a whole is not fundamentally flawed like many seem to think.
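For illustration, this is the canonical bounds-check-bypass gadget shape from the original Spectre (v1) paper; it leaks even though the speculative work is architecturally discarded:

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 512];
size_t  array1_size = 16;

/* After training the branch predictor with in-bounds values of x, an
 * out-of-bounds x makes the core speculatively read array1[x] and use the
 * secret byte as an index into array2. The architectural results are
 * discarded, but the touched array2 line stays cached, where a probe like
 * the one sketched above can recover the byte. */
void victim_function(size_t x)
{
    if (x < array1_size) {
        volatile uint8_t tmp = array2[array1[x] * 512];
        (void)tmp;
    }
}
```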
The more resources a core has inside it, the more likely it is that a single thread lacks the instruction-level parallelism to utilise them all, as the sketch below illustrates. Digging too hard for ILP results in bloated cores, since growing the out-of-order window drives up the complexity of the core exponentially.
SMT no longer exists just to cover stalling pipelines; today it is there to soak up execution resources that a single thread cannot fill.
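A sketch of the ILP point, with hypothetical loop shapes: a single dependency chain caps throughput at one add per add-latency no matter how wide the core is, while independent chains let a wide core actually do more per cycle:

```c
/* One dependency chain: each add must wait for the previous one, so the
 * core completes at most one add per FP-add latency, no matter how many
 * FP units it has. */
double sum_serial(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the out-of-order core can keep four adds in
 * flight at once, roughly quadrupling the usable ILP of the same loop. */
double sum_four_chains(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```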
You are forgetting that modern microarchitectures use power gating quite heavily and put multiple different execution units on a single execution port. For computing in general (desktop usage, workstations, etc.), the execution ports are usually well saturated while the pipeline isn't stalled, so unless it stalls there are rarely many idle execution ports to delegate to another thread. This is why x86 SMT implementations only execute one thread at a time.
On the other hand, POWER has an "impressive" 8-way SMT that can execute two threads simultaneously. It is intended for specific web server/enterprise workloads where the performance of a single thread matters less than total throughput, and the threads are mostly stalled anyway. Such a CPU design would result in a horrible user experience on a desktop.
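For contrast, a hypothetical sketch of the kind of latency-bound code those designs target: every iteration stalls on memory, so the execution ports sit idle and sibling SMT threads can use them:

```c
/* Pointer chasing: every load depends on the previous one, so the thread
 * spends most of its time stalled waiting on memory while the execution
 * ports sit idle. This is exactly where wide SMT pays off: sibling
 * threads can use those ports while this one waits. */
struct node { struct node *next; };

struct node *chase(struct node *p, long steps)
{
    while (steps--)
        p = p->next;   /* serial dependency, no work to overlap */
    return p;          /* keep the chain live so it isn't optimised out */
}
```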
Statically scheduled VLIW code can never be scheduled efficiently for every use case on a general-purpose processor.
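A hypothetical example of why: the compiler has to commit to one bundle schedule at build time, but the latency it must schedule around is only known at run time:

```c
/* The latency of table[idx[i]] is a few cycles on an L1 hit but hundreds
 * on a DRAM miss. A VLIW compiler must bake one assumption into the
 * bundle schedule: pad for the miss (wasting slots on hits) or schedule
 * for the hit (stalling whole bundles on misses). An out-of-order core
 * resolves this dynamically, per access. */
long gather_sum(const long *table, const int *idx, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += table[idx[i]];
    return sum;
}
```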
True, at least as far as we know. A new paradigm would be required to change this.