Oh, I guess I confused them with the x87 ones that take forever, but I was mostly talking about the complex AVX ones, not just add/multiply; I'm pretty sure there are a few that take 20-40 cycles.
There are many instructions that are very slow, although they are usually a tiny fraction of the workload, if they appear at all.
Interestingly enough, Ice Lake/Rocket Lake brought the legacy FMUL down from 5 to 4 cycles, and integer division (IDIV) down from 97 to 18 cycles.
For comparison, Intel's current CPUs have 4 cycles for floating-point multiplication, 11 cycles for fp32 division using AVX, and 5 cycles for integer multiplication using AVX (official spec).
As for "worst case" performers of legacy x87 instructions: examples are FSQRT(square root) at 14-21 cycles, sin/cos/tan ~50-160 cycles and the most complex; FBSTP at 264, but this one is probably not very useful today. FDIV is 14-16 cycles (so slightly slower than its AVX counterpart). And for comparison, in Zen 4, legacy x87 instructions seems to be overall lower latency than Intel. All of these figures are from
agner.org and are benchmarked, so a grain of salt, but they are probably good approximations.
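If you want to reproduce numbers like these yourself, the usual trick is to time a long dependency chain of the instruction with rdtsc and divide by the iteration count. Here's a rough sketch in that spirit (not agner.org's actual harness; it assumes x86-64 with GCC/Clang inline asm, and the instruction, divisor and iteration count are just placeholder choices):

```c
// Rough latency microbenchmark sketch (not agner.org's actual test harness).
// Each divsd depends on the previous result, so the loop time is dominated by
// the instruction's latency rather than its throughput.
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc()

#define ITERS 100000000ULL

int main(void) {
    const double d = 1.000000001;  // divisor close to 1.0 so the chain never under/overflows
    double x = 1.0;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        __asm__ volatile("divsd %1, %0" : "+x"(x) : "x"(d));  // x = x / d, serially dependent
    uint64_t end = __rdtsc();
    printf("x=%f  ~%.1f ref-cycles per divsd\n", x, (double)(end - start) / ITERS);
    return 0;
}
```

Compile with something like `gcc -O2 lat.c`; rdtsc counts reference cycles rather than core cycles, so pin the clock near base frequency (or scale the result) before comparing against published latencies.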
Many think "legacy" instructions are holding back the performance of modern x86 CPUs, but that's not true. Since the mid 90s, they've all translated the x86 ISA to their own specific micro-operations, and this is also how they support x87/MMX/SSE/AVX through the same execution ports; the legacy instructions are translated to micro-ops anyways. This allows them to design the CPUs to be as efficient as possible with the new features, yet support the old ones. If the older ones happens to have worse latency, it's usually not an issue, as applications that rely on those are probably very old. One thing of note is that x87 instructions are rounding off differently than normal IEEE 754 fp32/fp64 does.
L3 victim cache*
A bit pedantic given how almost all (if not all) CPUs use the L3 as a victim cache, but I think it's important to explain that that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
It's not pedantic at all, you missed the point.
The prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 and then eventually evicted to L3, where it remains for a short window before being evicted from there too. Adding a lot of extra L3 only means the "garbage dump" will be larger, while adding just a tiny bit more L2 would allow the prefetcher to work differently. In other words, a larger L3 doesn't mean you can prefetch a lot more; it just means that the data you've already prefetched anyway stays around a little longer.
Secondly, as I said, the stream of data flowing through the L3 all comes from memory->L2, so the overall bandwidth here is limited by memory, even though the small portion you read back from L3 will have higher burst speed.
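If you want to see that layering for yourself, a pointer chase over a randomly shuffled buffer makes the tiers visible: once the working set no longer fits in L2 the average load latency jumps to L3 speed, and past the L3 it jumps again to DRAM speed. A rough sketch (it assumes Linux and 64-byte cache lines; the buffer sizes and hop count are placeholders to tune for your machine):

```c
// Rough pointer-chase sketch: average load latency vs. working-set size.
// Assumes Linux (clock_gettime) and 64-byte cache lines; sizes and hop counts
// below are placeholders.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64

static double chase_ns(size_t bytes, size_t hops) {
    size_t n = bytes / LINE;
    char *buf = aligned_alloc(LINE, n * LINE);
    size_t *perm = malloc(n * sizeof *perm);

    // Random permutation so the hardware prefetchers can't follow the pattern.
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    // Each line stores the byte offset of the next line, forming one big cycle.
    for (size_t i = 0; i < n; i++)
        *(size_t *)(buf + perm[i] * LINE) = perm[(i + 1) % n] * LINE;

    size_t p = perm[0] * LINE;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++)
        p = *(size_t *)(buf + p);   // every load depends on the previous one
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p > bytes) puts("impossible");  // keep p live so the loop isn't optimized out
    free(perm);
    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)hops;
}

int main(void) {
    for (size_t kib = 64; kib <= 64 * 1024; kib *= 2)
        printf("%7zu KiB: %6.2f ns/load\n", kib, chase_ns(kib * 1024, 10 * 1000 * 1000));
    return 0;
}
```

Compile with `gcc -O2 chase.c`; the knees in the output should roughly line up with your L2 and L3 sizes.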
The software that becomes more demanding in the coming years will mostly be more computationally intensive, so over the long term faster CPUs will be favored over those with more L3 cache; the workloads that are very L3-sensitive will remain outliers.