
We found the Missing Performance: Zen 5 Tested with SMT Disabled

Joined
May 3, 2018
Messages
2,683 (1.17/day)
This basically just shows you how stupid the Windows scheduler actually is. It makes no sense to assign a heavy workload to the virtual sibling of an already fully occupied physical core. Microsoft should detect the difference between a physical core and a virtual one, or at least make this an option in the power settings or something.

I can't help but wonder if all this anti-SMT stuff is a result of Intel's push to remove SMT from their CPUs, and Microsoft is deliberately nerfing performance to help make a case in the minds of consumers to get rid of it.

But another thing wouldn't surprise me: AMD knows their architecture is cache-starved, and that enabling SMT puts even more pressure on the tiny L2 cache. 1MB is a joke.
AMD apparently tested 2MB and 3MB cache versions, and according to insiders that MLiD talked to, the improvements were something like 4% and 7% on average, so they decided it wasn't worth it.
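The scheduler complaint above can be sketched as a toy policy: prefer an idle physical core before doubling up on an SMT sibling. This is purely illustrative (the function and the assumption that the siblings of core c are numbered 2c and 2c+1 are mine), not how the Windows scheduler actually works:

```python
# Toy scheduling policy: prefer an empty physical core over the SMT
# sibling of a busy one. Assumes logical CPUs are numbered so that the
# siblings of physical core c are (2c, 2c+1) -- common, but not guaranteed.

def pick_logical_cpu(busy: set[int], n_physical: int) -> int:
    """Return a logical CPU id, preferring a fully idle physical core."""
    # First pass: a core where neither sibling is busy.
    for core in range(n_physical):
        a, b = 2 * core, 2 * core + 1
        if a not in busy and b not in busy:
            return a
    # Second pass: fall back to any free sibling (SMT doubling-up).
    for core in range(n_physical):
        for cpu in (2 * core, 2 * core + 1):
            if cpu not in busy:
                return cpu
    raise RuntimeError("all logical CPUs busy")

# On a 4-core/8-thread CPU with logical CPU 0 busy, a naive scheduler
# might pick CPU 1 (CPU 0's sibling); this policy picks an idle core.
print(pick_logical_cpu({0}, 4))  # 2, not 1
```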
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
You're taking the words of that con artist on face value, who talks from both sides of his mouth :wtf:
 
Joined
May 3, 2018
Messages
2,683 (1.17/day)
I dunno, my work PC with all those E cores sure doesn't seem to load them sometimes. I wish I could go back to the quad core that was all one type, TBH.

I doubt it's intentional, but I don't doubt that Intel has offered a lot of assistance with getting P+E support into Windows. It's probably not as simple as "if Windows detects P+E, use this type of scheduler; if just P cores, use the classic one." Intel is still the market-share leader, and they also offer more compiler tools and support than AMD. Head over to Linux, and I think we see a more even approach to support. This is only further reinforced by the 10-15% performance gains that Zen 5 is showing over Zen 4 there. No need to disable anything.

Isn't it interesting that when Qualcomm designed Snapdragon X, they made it with up to 12 P cores and no E cores? Qualcomm has been producing P+E Arm chips for about a decade now, yet they skipped that design choice entirely for their Windows entry.
But E cores on a phone are there precisely for power savings and to get decent battery life, the most critical thing on a phone. It's important on laptops too, but they have much larger batteries and can be plugged in for use. They decided they needed full-fat cores to compete against Apple as much as x86, and they seem to have done a decent job, as battery life looks good.

I wonder if next year Nvidia's and MediaTek's Arm SoCs for Windows will use E cores?

You're taking the words of that con artist on face value, who talks from both sides of his mouth :wtf:
I get he talks a lot of shit, but he also gets a lot right. I don't doubt he has contacts inside AMD and Intel.
 
Joined
Nov 18, 2009
Messages
8 (0.00/day)
System Name Gaming rig
Processor i7 6950K
Motherboard Asus X99-Deluxe
Cooling Thermaltake Water 3.0 Riing RGB 240
Memory 32GB DDR4-3000
Video Card(s) Titan X (Pascal)
Storage 500GB 950 Pro, 500GB 850 Evo, 2x5GB HDD RAID1
Display(s) Dell U3011
Case Jonsbo UMX4 Windowed (Silver)
Audio Device(s) Creative Soundblaster Z
Power Supply Thermaltake 1050W RGB
Software Windows 10
Benchmark Scores 23407 - Firestrike (better than 99% of all results!) https://www.3dmark.com/fs/10511898
Skimmed through the article and got to the conclusion where the author seems at a loss as to why SMT behavior is like this with no word from AMD about SMT changes to explain why.
Haven't read this whole forum thread, maybe someone has already pointed this out, but AMD in its press releases did hint at SMT improvements, if you looked hard enough and thought about it.
The key is the dual branch predictors and decoders, new to Zen5.
While not much admittedly is said of it in the official releases, it is mentioned and shown in diagrams.
A video VERY much worth watching is from Chips and Cheese, where he goes into the depths of the new Zen 5 architecture changes with an AMD engineer.
Specifically, he asks at one point if 1T loads can make full use of all the core front-end resources (predictors, decoders etc.), and the answer is YES.
So, disable SMT and you're forcing 1T mode per core; thus each thread gains 2 branch predictors and decoders instead of 1.
I would say that the benchmarks with the biggest performance gains with SMT disabled are scenarios where the extra branch prediction and/or decoder muscle is kicking in to save the CPU from stalls of failed predictions or is simply keeping the core more fully fed.
In SMT mode, in those scenarios, they're actually a little predictor or decoder-starved!
Interesting results, keep up the good work TPU!

Moment in the video here:
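For anyone who wants to feel the branch-predictor effect described above, here is a rough, hypothetical sketch: the same loop over sorted vs unsorted data differs mainly in how predictable its branch is. CPython overhead masks most of the hardware effect, and none of this relates to Zen 5 measurements specifically; it only shows the shape of the experiment:

```python
# Branch-predictability demo: summing values above a threshold over
# unsorted vs sorted data. In a compiled language the sorted pass is
# dramatically faster because the branch becomes predictable; in Python
# the gap is small and noisy, so treat the timings as illustrative only.
import random
import timeit

data = [random.randrange(256) for _ in range(100_000)]
sorted_data = sorted(data)

def sum_over_threshold(values, threshold=128):
    total = 0
    for v in values:
        if v >= threshold:   # the branch the predictor must guess
            total += v
    return total

t_unsorted = timeit.timeit(lambda: sum_over_threshold(data), number=20)
t_sorted = timeit.timeit(lambda: sum_over_threshold(sorted_data), number=20)
print(f"unsorted: {t_unsorted:.3f}s, sorted: {t_sorted:.3f}s")
```

Both passes compute the same total; only the branch pattern differs.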
 
Joined
Apr 30, 2011
Messages
2,670 (0.55/day)
Location
Greece
Processor AMD Ryzen 5 5600@80W
Motherboard MSI B550 Tomahawk
Cooling ZALMAN CNPS9X OPTIMA
Memory 2*8GB PATRIOT PVS416G400C9K@3733MT_C16
Video Card(s) Sapphire Radeon RX 6750 XT Pulse 12GB
Storage Sandisk SSD 128GB, Kingston A2000 NVMe 1TB, Samsung F1 1TB, WD Black 10TB
Display(s) AOC 27G2U/BK IPS 144Hz
Case SHARKOON M25-W 7.1 BLACK
Audio Device(s) Realtek 7.1 onboard
Power Supply Seasonic Core GC 500W
Mouse Sharkoon SHARK Force Black
Keyboard Trust GXT280
Software Win 7 Ultimate 64bit/Win 10 pro 64bit/Manjaro Linux
Maybe a new chipset driver combined with a new AGESA will do the trick and get Zen 5 working as planned.
 
Joined
Jan 14, 2019
Messages
11,019 (5.39/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2, 4 + 8 TB Seagate Barracuda 3.5"
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Windows 10 Pro
It seems like AMD needs to implement a program similar to Intel's APO.
Or OSes / game engines need to be more aware and make better use of SMT. The technology has been with us since Pentium 4, so it's not some kind of revolutionary new thing that one can't write code for.
 
Joined
Apr 19, 2018
Messages
1,220 (0.53/day)
Processor AMD Ryzen 9 5950X
Motherboard Asus ROG Crosshair VIII Hero WiFi
Cooling Arctic Liquid Freezer II 420
Memory 32Gb G-Skill Trident Z Neo @3806MHz C14
Video Card(s) MSI GeForce RTX2070
Storage Seagate FireCuda 530 1TB
Display(s) Samsung G9 49" Curved Ultrawide
Case Cooler Master Cosmos
Audio Device(s) O2 USB Headphone AMP
Power Supply Corsair HX850i
Mouse Logitech G502
Keyboard Cherry MX
Software Windows 11
AMD apparently tested 2MB and 3MB cache versions, and according to insiders that MLiD talked to, the improvements were something like 4% and 7% on average, so they decided it wasn't worth it.
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
 
Last edited:

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
27,425 (3.70/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
with an AMD engineer
Mike is much more than just an engineer, but yeah, really good interview. Unfortunately AMD's marketing/PR team is afraid of proactively sharing these details, so we only get a fairly high-level overview like you see in the slides, without much explanation on the reasoning behind them and I'm only allowed to ask so many questions. I submitted 22, got 3 answers after like a week. In the case of Zen 5, after a lot of press complained, they actually had a follow up call to the LA event where they finally shared more info, instead of just talking about AI AI AI

But without much more insight into the machine (that nobody but AMD has), while I like your hypothesis (it's good), I don't think it's better than many others and neither should be published as "answer" in the original article--no doubt, some tech media would do that and sell it as their invention for more clicks
 
Last edited:
Joined
May 3, 2018
Messages
2,683 (1.17/day)
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
Strix Halo is getting a new 3nm I/O die, which is why it's about a year late. I would hope Zen 6, which one would presume is on N3P at worst, gets a new I/O die with better capabilities. They are supposed to be fixing the latency issues with dual CCDs too, as well as the bandwidth issues.
 
Joined
Jun 17, 2019
Messages
4 (0.00/day)
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
The chip size is fine; they have plenty of space to work with. This is definitely not an issue for them (and moving from 5nm to 4nm allowed for a 20-30% density increase).

I do agree that Zen 5 should have gotten an I/O die upgrade, something that would allow for at least 6800 on desktop (UCLK 1:1). You can currently run 6400 1:1 if you OC it, but it's not guaranteed.
 
Joined
Apr 19, 2018
Messages
1,220 (0.53/day)
Processor AMD Ryzen 9 5950X
Motherboard Asus ROG Crosshair VIII Hero WiFi
Cooling Arctic Liquid Freezer II 420
Memory 32Gb G-Skill Trident Z Neo @3806MHz C14
Video Card(s) MSI GeForce RTX2070
Storage Seagate FireCuda 530 1TB
Display(s) Samsung G9 49" Curved Ultrawide
Case Cooler Master Cosmos
Audio Device(s) O2 USB Headphone AMP
Power Supply Corsair HX850i
Mouse Logitech G502
Keyboard Cherry MX
Software Windows 11
Strix Halo is getting a new 3nm I/O die, which is why it's about a year late. I would hope Zen 6, which one would presume is on N3P at worst, gets a new I/O die with better capabilities. They are supposed to be fixing the latency issues with dual CCDs too, as well as the bandwidth issues.
Do you think they will update the IO die for the last generation of Zen on AM5? Or do you think AMD are going to stick with AM5 for longer? I just can't see them doing a new IO die for just 1 generation, and won't that also require a new MB?
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
Looking at the review numbers, I'm not really seeing anything out of the ordinary with SMT on vs off.
E.g. older CPUs, even in thread-sensitive benchmarks that benefited greatly from more cores/threads without HT/SMT (e.g. Cinebench), would struggle to scale their multi-thread performance linearly with the number of physical CPU cores available - HT/SMT would help out a lot in those scenarios:
AMD Phenom:
1100T (six-core) ST:MT ratio: 5.23

Sandy-Bridge:
i5 (quad-core) ST:MT ratio: 3.67 (no-HT)
i7 (quad-core) ST:MT ratio: 4.33 (HT)

AMD 9700X (8-core) using numbers taken from this review:
SMT on ST:MT ratio: 8.95
SMT off ST:MT ratio: 7.03
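For clarity, the ST:MT ratios above are just the multi-thread score divided by the single-thread score; a quick sketch (the scores below are made up for illustration, not the review's numbers):

```python
# ST:MT scaling ratio: how many "single-thread equivalents" a chip
# delivers when all cores/threads are loaded.

def scaling_ratio(st_score: float, mt_score: float) -> float:
    """Multi-thread score expressed in multiples of the single-thread score."""
    return mt_score / st_score

# e.g. a hypothetical 8-core part scoring 2,000 ST and 17,900 MT:
print(round(scaling_ratio(2000, 17900), 2))  # 8.95 -> scales past 8x via SMT
```

A ratio above the physical core count means SMT is adding throughput beyond what the cores alone deliver.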

Basically, in any scenario with thread/resource parallelisation where each thread isn't part of a massive memory-hungry process and has no strict timing requirements (e.g. one thread must finish before another is allowed to start), SMT enabled easily sees off SMT disabled, as seen in the review's Server/Workstation, AI, and File Encryption/Compression benchmarks.
The odd outlier is some of the Office productivity application results (Excel especially).

As has always been the case since Intel debuted Hyper-Threading to overcome the Pentium 4's long pipeline and potential stalling, games are not an ideal candidate and usually take a small hit, because game devs don't/can't optimise as easily: game engines usually have distinctly different tasks executing in different threads. It's good to see some notable exceptions where devs / game engines can leverage HT/SMT for a little extra boost, e.g. Cyberpunk, Elden Ring, Starfield.

What is more interesting / surprising is how little difference it makes to some applications / games in the real world (i.e. not in synthetic benchmarks) - core scaling / utilisation seems reasonably decent in scenarios without SMT/HT being available - how much credit goes to application (or shared development library) developers versus AMD/Intel is debatable.
Having the option set either way isn't the issue it used to be in terms of sacrificing / gaining performance. Back with the P4, or even the first-gen Core i3 (where we had a bunch of laptops that could only run WinXP with HT disabled until a BIOS fix was available), not having HT enabled was very obvious - but of course those were single/dual-core CPUs, so a lack of threading would stand out.

Do you think they will update the IO die for the last generation of Zen on AM5? Or do you think AMD are going to stick with AM5 for longer? I just can't see them doing a new IO die for just 1 generation, and won't that also require a new MB?
Not necessarily - AM4 went through 2 different IO dies, and at least 3 generations of monolithic dies with markedly different IO/system-logic capabilities and fabrication processes, and the right boards can work with all of them.
 
Last edited:
Joined
Jan 3, 2021
Messages
3,095 (2.34/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.
There's another size limit that AMD has to consider: as many whole CCDs as possible must fit into the 33 x 26 mm rectangle (reticle size) to optimise the use of costly lithography machines.
I didn't do much calculation; the die size could be around 9.3 x 7.6 mm according to available die shots (or rather drawings), and there must be a small gap to allow cutting the dies apart. Maybe, just maybe, this is what stopped AMD from adding another ~400M transistors (8 cores x 8 Mbit/core x 6 transistors/bit) to the 8.3B already on the die.
And AMD loves proper binary numbers, unlike Intel, who doesn't mind odd cache sizes like 1.875 MB or 2.5 MB.
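Sanity-checking the ~400M figure from the formula above (8 cores x 8 Mbit/core x 6 transistors/bit):

```python
# 1 MB of extra L2 per core = 8 Mbit; each bit is a 6T SRAM cell.
bits_per_core = 8 * 1024 * 1024        # 8 Mbit per core
transistors = 8 * bits_per_core * 6    # 8 cores x bits x 6 transistors/bit
print(f"{transistors:,} transistors (~{transistors / 1e6:.0f}M)")
# 402,653,184 transistors (~403M)
```

So the back-of-the-envelope estimate checks out, ignoring tag arrays and other cache overhead.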
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
And AMD loves proper binary numbers, unlike Intel, who doesn't mind odd cache sizes like 1.875 MB or 2.5 MB.

Those aren't necessarily 'odd' cache sizes - you're looking at a base10 (decimal) scaled measurement of something which logically is designed for base2 (binary) maths.
Alder Lake with its 1.25MB cache sizes, assuming Intel are using normal MB notation, would be 1280KB, which in normal binary terms is a nice number.
1.875MB would be 1920KB.

There are many places on the web which list the size without it being scaled to decimal MB numbers.
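The conversion is easy to check: a fractional-looking MB figure is a perfectly round KB figure (illustrative snippet):

```python
# Cache sizes that look fractional in MB are round numbers of KB.
def mb_to_kb(mb: float) -> int:
    return int(mb * 1024)

print(mb_to_kb(1.25))   # 1280 (Alder Lake L2)
print(mb_to_kb(1.875))  # 1920
print(mb_to_kb(2.5))    # 2560
```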
 
Joined
Jan 3, 2021
Messages
3,095 (2.34/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Looking at the review numbers, I'm not really seeing anything out of the ordinary with SMT on vs off.
E.g. older CPUs, even in thread-sensitive benchmarks that benefited greatly from more cores/threads without HT/SMT (e.g. Cinebench), would struggle to scale their multi-thread performance linearly with the number of physical CPU cores available - HT/SMT would help out a lot in those scenarios:
AMD Phenom:
1100T (six-core) ST:MT ratio: 5.23

Sandy-Bridge:
i5 (quad-core) ST:MT ratio: 3.67 (no-HT)
i7 (quad-core) ST:MT ratio: 4.33 (HT)

AMD 9700X (8-core) using numbers taken from this review:
SMT on ST:MT ratio: 8.95
SMT off ST:MT ratio: 7.03

Basically, in any scenario with thread/resource parallelisation where each thread isn't part of a massive memory-hungry process and has no strict timing requirements (e.g. one thread must finish before another is allowed to start), SMT enabled easily sees off SMT disabled, e.g. Server/Workstation, AI, File Encryption/Compression.
The odd outlier is some of the Office productivity application results (Excel especially).

As has always been the case since Intel debuted Hyper-Threading to overcome the Pentium 4's long pipeline and potential stalling, games are not an ideal candidate and usually take a small hit, because game devs don't/can't optimise as easily: game engines usually have distinctly different tasks executing in different threads. It's good to see some notable exceptions where devs / game engines can leverage HT/SMT for a little extra boost, e.g. Cyberpunk, Elden Ring, Starfield.

What is more interesting / surprising is how little difference it makes to some applications / games in the real world (i.e. not in synthetic benchmarks) - core scaling / utilisation seems reasonably decent in scenarios without SMT/HT being available - how much credit goes to application (or shared development library) developers versus AMD/Intel is debatable.
Having the option set either way isn't the issue it used to be in terms of sacrificing / gaining performance. Back with the P4, or even the first-gen Core i3 (where we had a bunch of laptops that could only run WinXP with HT disabled until a BIOS fix was available), not having HT enabled was very obvious - but of course those were single/dual-core CPUs, so a lack of threading would stand out.
SMT on x86/x64 has a problem that everyone here seems to overlook: the two threads that run on the same core have equal priorities, and OS and applications can't change that. If a single thread can use 100% of the core performance without HT, two will run at about 65% + 65%, with unpredictable variations, with HT. Not 100% + 30% or something. Disable HT, and the same two threads will have 70% + 30% minus context switching, with less variability because OS preemptive multitasking does its job.

A good way around that would be to identify the main, time-critical thread of a game (or Excel, for that matter) and let it have a core for itself for as long as possible. Kind of über-affinity. An application can't do that without support from the OS, but Windows has no such feature - or am I wrong here?

On top of that, the system of priorities on x86/x64 is insufficient in at least one other way: DRAM access is not prioritised.
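For what it's worth, a user-space approximation of that "über-affinity" idea does exist via CPU affinity, though it only restricts your own process; the OS can still put other processes' threads on the same core, which is exactly the gap described above. The sketch below uses the Linux-only os.sched_setaffinity; on Windows you'd need psutil or SetProcessAffinityMask instead:

```python
# Pin the current process to a single logical CPU so its time-critical
# work stays on one core. Affinity only constrains THIS process; the OS
# scheduler may still place other processes on the same core.
import os

def pin_to_cpu(cpu: int) -> set[int]:
    """Restrict the calling process to one logical CPU (Linux-only)."""
    os.sched_setaffinity(0, {cpu})   # 0 = the calling process
    return os.sched_getaffinity(0)   # confirm the new mask

if hasattr(os, "sched_setaffinity"):
    first = min(os.sched_getaffinity(0))  # pick a CPU we're allowed to use
    print(pin_to_cpu(first))
```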

Those aren't necessarily 'odd' cache sizes - you're looking at a base10 (decimal) scaled measurement of something which logically is designed for base2 (binary) maths.
Alder Lake with its 1.25MB cache sizes, assuming Intel are using normal MB notation, would be 1280KB, which in normal binary terms is a nice number.
1.875MB would be 1920KB.

There are many places on the web which list the size without it being scaled to decimal MB numbers.
These "odd" numbers are still very much round in binary notation, I understand that. I (and you too) should have used MiB and KiB here.
 
Last edited:
Joined
Jun 14, 2019
Messages
4 (0.00/day)
Yeah, it's almost like this was super rushed for reasons that make sense only to AMD... There was no reason to rush it ahead of motherboards; it clearly could have used more time in the oven so AMD could fix issues, do more internal testing, and write reviewer guides with actually good guidelines on how to get the most out of the chip. Seems like recent bad Radeon habits are spilling over to Ryzen. Plus, they could have optimised power use far better, so there'd be an actual performance gain. Yes, they wouldn't win on power consumption, but almost no one would care. And if they'd done that and announced X3D, they'd have had a win on their hands. But nope, AMD can't escape the underdog mentality.
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
Yeah, it's almost like this was super rushed for reasons that make sense only to AMD... There was no reason to rush it ahead of motherboards; it clearly could have used more time in the oven so AMD could fix issues, do more internal testing, and write reviewer guides with actually good guidelines on how to get the most out of the chip. Seems like recent bad Radeon habits are spilling over to Ryzen. Plus, they could have optimised power use far better, so there'd be an actual performance gain. Yes, they wouldn't win on power consumption, but almost no one would care. And if they'd done that and announced X3D, they'd have had a win on their hands. But nope, AMD can't escape the underdog mentality.
I don't think 'more time in the oven' would have helped. What we're seeing is definitely a 'server first' design (which, let's not kid ourselves, has been the case for decades with Intel and AMD), and boy will those efficiency numbers for certain types of task look very nice.

At the end of the day, AMD don't want to be making more than x number of products at any one time and also not have specialist product lines with limited returns... so these will filter to the mainstream.
X3D chips always follow later, probably because the validation / production process is a bit more complex and not really utilised outside of the desktop PC space, meaning it will always lag physical development and manufacturing vs just the core CCD dies - why hold back one product and build up masses of inventory just to release a halo product which will never make up more than a small share of your sales? Who do you think they are... Apple/Intel?

SMT on x86/x64 has a problem that everyone here seems to overlook: the two threads that run on the same core have equal priorities, and OS and applications can't change that. If a single thread can use 100% of the core performance without HT, two will run at about 65% + 65%, with unpredictable variations, with HT. Not 100% + 30% or something. Disable HT, and the same two threads will have 70% + 30% minus context switching, with less variability because OS preemptive multitasking does its job.

A good way around that would be to identify the main, time-critical thread of a game (or Excel, for that matter) and let it have a core for itself for as long as possible. Kind of über-affinity. An application can't do that without support from the OS, but Windows has no such feature - or am I wrong here?

On top of that, the system of priorities on x86/x64 is insufficient in at least one other way: DRAM access is not prioritised.


These "odd" numbers are still very much round in binary notation, I understand that. I (and you too) should have used MiB and KiB here.

As a non-developer I can't answer that question. I don't think there were great mechanisms before Windows 11 for the CPU to push the OS scheduler into making informed decisions about which CPU core to use for certain tasks - and to what extent that can be done is unknown (i.e. is it just for 'performance/economy', or can it be informed about utilising certain cores / resources for lower latency, etc.). I don't follow Linux kernel updates, so no idea what the capabilities are there, but there would need to be some interface / metric provided by the CPU to inform the OS scheduler about how to run something efficiently, and I'm not sure there is such a thing in place.

Would be great if someone could actually provide some insight in to that. It seems HT/SMT and the OS schedulers are still basically 'hoping for the best' in terms of managing processes and threads generated.
 
Joined
Mar 16, 2017
Messages
1,942 (0.72/day)
Location
Tanagra
System Name Budget Box
Processor Xeon E5-2667v2
Motherboard ASUS P9X79 Pro
Cooling Some cheap tower cooler, I dunno
Memory 32GB 1866-DDR3 ECC
Video Card(s) XFX RX 5600XT
Storage WD NVME 1GB
Display(s) ASUS Pro Art 27"
Case Antec P7 Neo
But E cores on a phone are there precisely for power savings and to get decent battery life, the most critical thing on a phone. It's important on laptops too, but they have much larger batteries and can be plugged in for use. They decided they needed full-fat cores to compete against Apple as much as x86, and they seem to have done a decent job, as battery life looks good.

I wonder if next year Nvidia's and MediaTek's Arm SoCs for Windows will use E cores?


I get he talks a lot of shit, but he also gets a lot right. I don't doubt he has contacts inside AMD and Intel.
Yet Apple has P+E on all their devices: phones, tablets, and desktop PCs. When it comes to epic battery life, Mac is where it's at. The SD-X stuff is an improvement, but we're still looking at hours of difference, and active cooling is required. I bet E-cores would have at least helped with the former. I just think you don't need as many E cores as Intel puts in there: 4E should cover the basics, but instead they call a 2P+8E an i7, when oftentimes it performs more like an i5 from 2010, in my experience.
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
but we're still looking at hours of difference, and active cooling is required.
You can easily make a lot of x86 chips passively cooled as well by lowering Tjmax (on Intel?), and that would also increase battery life. A MacBook Air running at 90C or above is not ideal, at least for me!
 
Joined
Mar 16, 2017
Messages
1,942 (0.72/day)
Location
Tanagra
System Name Budget Box
Processor Xeon E5-2667v2
Motherboard ASUS P9X79 Pro
Cooling Some cheap tower cooler, I dunno
Memory 32GB 1866-DDR3 ECC
Video Card(s) XFX RX 5600XT
Storage WD NVME 1GB
Display(s) ASUS Pro Art 27"
Case Antec P7 Neo
You can easily make a lot of x86 chips passively cooled as well by lowering Tjmax (on Intel?), and that would also increase battery life. A MacBook Air running at 90C or above is not ideal, at least for me!
Temps are what they are, IMO. Most of the time, a MBA won't even hit 50C under everyday tasks. Load it down and it will certainly blow past 90C--it's actually closer to 105C, but that's the trade-off. A Mac with active cooling won't get near that hot though. Don't most modern GPUs and CPUs also flirt with these temps under load, even with active cooling? It's part of the design, as thermal density is high, but the chips have temp sensors all over to throttle hotspots.
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
I like active cooling unless you're just aiming for higher battery life benchmarks for some reason? It keeps the performance consistent & the overall system/laptop cooler. No reason to avoid that even if it nets you an hour or two extra.
 
Joined
May 22, 2010
Messages
367 (0.07/day)
Processor R7-7700X
Motherboard Gigabyte X670 Aorus Elite AX
Cooling Scythe Fuma 2 rev B
Memory no name DDR5-5200
Video Card(s) Some 3080 10GB
Storage dual Intel DC P4610 1.6TB
Display(s) Gigabyte G34MQ + Dell 2708WFP
Case Lian-Li Lancool III black no rgb
Power Supply CM UCP 750W
Software Win 10 Pro x64
Would you be able to test a couple of the benchmarks that showed the most difference in Windows 10? This looks to be another nail in the Win11 coffin.
 
Joined
Apr 30, 2020
Messages
919 (0.58/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 16Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
Skimmed through the article and got to the conclusion, where the author seems at a loss as to why SMT behaves like this, with no word from AMD about SMT changes to explain it.
Haven't read this whole forum thread, maybe someone has already pointed this out, but AMD in its press releases did hint at SMT improvements, if you looked hard enough and thought about it.
The key is the dual branch predictors and decoders, new to Zen 5.
While admittedly not much is said of them in the official releases, they are mentioned and shown in diagrams.
A video VERY much worth watching is the one from Chips and Cheese, where he goes into the depths of the new Zen 5 architecture changes with an AMD engineer.
Specifically, he asks at one point whether 1T loads can make full use of all the core's front-end resources (predictors, decoders, etc.), and the answer is YES.
So, disable SMT and you force 1T mode per core; each thread then gets 2 branch predictors and decoders instead of 1.
I would say that the benchmarks with the biggest performance gains with SMT disabled are scenarios where the extra branch-prediction and/or decoder muscle kicks in to save the CPU from stalls on failed predictions, or simply keeps the core more fully fed.
In SMT mode, in those scenarios, they're actually a little predictor- or decoder-starved!
Interesting results, keep up the good work TPU!

Moment in the video here:

I'm pretty sure I already mentioned this somewhere else on here, in another thread.
They're going to have to add a lot more predictors, either by going quad or hexa or by simply enlarging them by 100%, as a baseline for a new architecture. I believe Zen 4 is at the maximum IPC of its design; that's why this change was made. You need a starting point close to your last design in IPC while leaving room to increase it.
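As a side note for anyone who wants to approximate the SMT-off scheduling behaviour without a trip to the BIOS: on Linux you can instead pin a workload to one logical CPU per physical core. A minimal sketch, assuming the usual `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` format; the helper name and the sample 8-core/16-thread layout below are made up for illustration:

```python
def physical_core_representatives(sibling_lists):
    """Given thread_siblings_list strings (one per logical CPU, as read
    from sysfs), return one logical CPU per physical core -- the set you
    would pin to with taskset to emulate SMT-off scheduling."""
    reps = []
    seen = set()
    for s in sibling_lists:
        # entries look like "0,8" or "0-1"; normalise to a sorted tuple
        cpus = []
        for part in s.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.extend(range(int(lo), int(hi) + 1))
            else:
                cpus.append(int(part))
        key = tuple(sorted(cpus))
        if key not in seen:
            seen.add(key)
            reps.append(key[0])  # first sibling represents the core
    return reps

# Hypothetical layout where logical CPU n is paired with CPU n+8
sample = ["%d,%d" % (i % 8, i % 8 + 8) for i in range(16)]
print(physical_core_representatives(sample))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

On a real system you'd read those strings from sysfs and then launch the game with something like `taskset -c 0-7` (for this layout) so the scheduler only ever sees one thread per core.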
 
Last edited:
Joined
Oct 30, 2020
Messages
169 (0.12/day)

It's an interesting take, but something still feels off. Disabling SMT shouldn't give that much of a performance increase in games. I get that SMT off gives each thread 2x the BPs and decoders, while SMT on gives each thread access to one BP and decoder. But even with just one each, SMT on is the same as the previous gen, yet the average uplift seems to be around 5%, whereas single-threaded floating-point IPC increased by a good 18% even without AVX-512 workloads, and games should definitely benefit from that.

I think we are memory-limited in that scenario, but your explanation is plausible in terms of Zen 5 SMT being different: unlike previous generations, you now actually get a marked increase in branch predictors and decoders for a single thread if you turn SMT off. Previous generations simply stopped sharing the same number of BPs/decoders with SMT off.

Now there's yet another data point for the 9700X. With PBO, you can either have a 15% ST, 10% MT and 5% gaming performance increase, or a 20% ST, -5% MT and 10% gaming performance increase. I expect all those numbers to improve with memory tuning, as Zen 5 should be more sensitive to it.
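(For anyone wondering how "average uplift" figures like these are typically computed: reviewers usually take the geometric mean of the per-benchmark ratios rather than the arithmetic mean, since ratios compose multiplicatively. A minimal sketch; the per-game ratios here are made up purely for illustration:)

```python
import math

def geomean_uplift(ratios):
    """Geometric mean of per-benchmark performance ratios (new/old).
    Returns the average uplift as a fraction, e.g. 0.05 for +5%."""
    g = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return g - 1.0

# Hypothetical per-game ratios: some titles regress, some gain
ratios = [1.12, 0.98, 1.07, 1.03]
print("average uplift: %.1f%%" % (100 * geomean_uplift(ratios)))
```

The geometric mean is the fairer summary because a 2x gain in one title and a 0.5x regression in another cancel out exactly, which an arithmetic mean would not show.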

Also, Mike Clark isn't just an AMD engineer but THE AMD Zen engineer. Hats off to Chips and Cheese, too: what a way to start your YouTube channel, with an interview with Mike Clark himself. Happy that they are finally getting the recognition they deserve; they churn out pretty impressive deep dives. Looking forward to their article. I want to know just how AMD managed to cram this many execution units, BPs and the like into the same footprint. Some consumers seem disappointed that their favourite application isn't accelerated as much as they would've liked, but the server guys are seeing massive performance increases, and AMD is probably laughing all the way to the bank.

Thanks w1zzard for the tests, much appreciated!
 