I skimmed through the article and got to the conclusion, where the author seems at a loss as to why SMT behaves this way, with no word from AMD about SMT changes to explain it.
I haven't read this whole forum thread, so maybe someone has already pointed this out, but AMD's press releases did hint at the SMT changes, if you looked hard enough and thought about it.
The key is the dual branch predictors and decoders, new to Zen5.
Admittedly, not much is said about it in the official releases, but it is mentioned and shown in the diagrams.
A video VERY much worth watching is the one from Chips and Cheese, where he goes deep into the new Zen5 architecture changes with an AMD engineer.
Specifically, he asks at one point whether single-threaded (1T) loads can make full use of all the core's front-end resources (predictors, decoders, etc.), and the answer is YES.
So if you disable SMT, you force 1T mode on every core, and each thread gets two branch predictors and decoders instead of one.
I would say the benchmarks showing the biggest gains with SMT disabled are the scenarios where that extra branch prediction and/or decode muscle kicks in, either saving the CPU from stalls on mispredictions or simply keeping the core more fully fed.
In SMT mode, in those same scenarios, the threads are actually a little predictor- or decoder-starved!
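To illustrate the kind of workload I mean, here's a minimal sketch of my own (not from the article or the video): a loop dominated by a data-dependent, hard-to-predict branch. On Linux you could run two copies pinned to the two sibling threads of one core with SMT on, then one copy per core with SMT off (echo off > /sys/devices/system/cpu/smt/control), and compare per-thread throughput; perf stat -e branch-misses should confirm the branch really is mispredicting.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 20)   /* 1M elements, 4 MB of data */
#define ITERS 200          /* repeat passes so the run lasts a few seconds */

int main(void)
{
    static uint32_t data[N];

    /* Pseudo-random data makes the branch below effectively a coin flip. */
    srand(12345);
    for (size_t i = 0; i < N; i++)
        data[i] = (uint32_t)rand();

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint64_t sum = 0;
    for (int iter = 0; iter < ITERS; iter++) {
        for (size_t i = 0; i < N; i++) {
            /* Data-dependent branch, ~50% taken: a predictor's nightmare.
             * Note: an optimizing compiler may turn this into a branchless
             * cmov, so check the assembly or the branch-miss counters if
             * the numbers look suspiciously flat. */
            if (data[i] & 1)
                sum += data[i];
            else
                sum -= data[i];
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("checksum %llu, %.2f s, %.1f M iterations/s\n",
           (unsigned long long)sum, secs, (double)N * ITERS / secs / 1e6);
    return 0;
}
```

Compile with something like gcc -O2 bench.c, pin the copies with taskset, and compare per-copy throughput in the two configurations. This is only a rough illustration of a front-end-bound workload, not a reproduction of TPU's benchmarks.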
Interesting results, keep up the good work TPU!
Moment in the video here: