
We found the Missing Performance: Zen 5 Tested with SMT Disabled

Joined
May 3, 2018
Messages
2,683 (1.17/day)
This basically just shows you how stupid the Windows scheduler actually is. It makes no sense to assign a heavy workload to the virtual sibling of an already fully occupied physical core. Microsoft should detect the difference between a physical core and a virtual one, or at least make this an option in the power settings or something.

I can't help but wonder if all this anti-SMT stuff is a result of Intel's push to remove SMT from their CPUs, and Microsoft is deliberately nerfing performance to help make a case in the minds of consumers to get rid of it.

But another thing wouldn't surprise me: AMD knows their architecture is cache-starved, and that enabling SMT puts even more pressure on the tiny L2 cache. 1MB is a joke.
AMD apparently tested 2MB and 3MB cache versions, and according to insiders that MLiD talked to, the improvements were something like 4% and 7% on average, so they decided it wasn't worth it.
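The scheduler complaint above can be sketched as a toy policy: prefer an idle physical core before doubling up on an SMT sibling. This is purely illustrative (the function and the assumption that the siblings of core c are numbered 2c and 2c+1 are mine), not how the Windows scheduler actually works:

```python
# Toy scheduling policy: prefer an empty physical core over the SMT
# sibling of a busy one. Assumes logical CPUs are numbered so that the
# siblings of physical core c are (2c, 2c+1) -- common, but not guaranteed.

def pick_logical_cpu(busy: set[int], n_physical: int) -> int:
    """Return a logical CPU id, preferring a fully idle physical core."""
    # First pass: a core where neither sibling is busy.
    for core in range(n_physical):
        a, b = 2 * core, 2 * core + 1
        if a not in busy and b not in busy:
            return a
    # Second pass: fall back to any free sibling (SMT doubling-up).
    for core in range(n_physical):
        for cpu in (2 * core, 2 * core + 1):
            if cpu not in busy:
                return cpu
    raise RuntimeError("all logical CPUs busy")

# On a 4-core/8-thread CPU with logical CPU 0 busy, a naive scheduler
# might pick CPU 1 (CPU 0's sibling); this policy picks an idle core.
print(pick_logical_cpu({0}, 4))  # 2, not 1
```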
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
You're taking the words of that con artist on face value, who talks from both sides of his mouth :wtf:
 
Joined
May 3, 2018
Messages
2,683 (1.17/day)
I dunno, my work PC with all those E cores sure doesn't seem to load them sometimes. I wish I could go back to the quad core that was all one type, TBH.

I doubt it's intentional, but I don't doubt that Intel has offered a lot of assistance with getting P+E support into Windows. It's probably not as simple as "if Windows detects P+E, use this type of scheduler; if just P cores, use the classic one." Intel is still the market-share leader, and they also offer more compiler tools and support than AMD. Head over to Linux, and I think we see a more even approach to support. This is only further reinforced by the 10-15% performance gains that Zen 5 is showing over Zen 4 there. No need to disable anything.

Isn't it interesting that when Qualcomm designed Snapdragon X, they made it with up to 12 P cores and no E cores? Qualcomm has been producing P+E Arm chips for about a decade now, yet they skipped that design choice entirely for their Windows entry.
But E cores on a phone are there precisely for power savings and to get decent battery life, the most critical thing on a phone. It's important on laptops too, but they have much larger batteries and can be plugged in for use. They decided they needed full-fat cores to compete against Apple as much as x86, and they seem to have done a decent job, as battery life looks good.

I wonder if next year Nvidia's and MediaTek's Arm SoCs for Windows will use E cores?

You're taking the words of that con artist on face value, who talks from both sides of his mouth :wtf:
I get he talks a lot of shit, but he also gets a lot right. I don't doubt he has contacts inside AMD and Intel.
 
Joined
Nov 18, 2009
Messages
8 (0.00/day)
System Name Gaming rig
Processor i7 6950K
Motherboard Asus X99-Deluxe
Cooling Thermaltake Water 3.0 Riing RGB 240
Memory 32GB DDR4-3000
Video Card(s) Titan X (Pascal)
Storage 500GB 950 Pro, 500GB 850 Evo, 2x5GB HDD RAID1
Display(s) Dell U3011
Case Jonsbo UMX4 Windowed (Silver)
Audio Device(s) Creative Soundblaster Z
Power Supply Thermaltake 1050W RGB
Software Windows 10
Benchmark Scores 23407 - Firestrike (better than 99% of all results!) https://www.3dmark.com/fs/10511898
Skimmed through the article and got to the conclusion where the author seems at a loss as to why SMT behavior is like this with no word from AMD about SMT changes to explain why.
Haven't read this whole forum thread, maybe someone has already pointed this out, but AMD in its press releases did hint at SMT improvements, if you looked hard enough and thought about it.
The key is the dual branch predictors and decoders, new to Zen5.
While not much admittedly is said of it in the official releases, it is mentioned and shown in diagrams.
A video VERY much worth watching is from Chips and Cheese, where he goes into the depths of the new Zen 5 architecture changes with an AMD engineer.
Specifically, he asks at one point if 1T loads can make full use of all the core front-end resources (predictors, decoders etc.), and the answer is YES.
So, disable SMT and you're forcing 1T mode per core; thus each thread gains 2 branch predictors and decoders instead of 1.
I would say that the benchmarks with the biggest performance gains with SMT disabled are scenarios where the extra branch prediction and/or decoder muscle is kicking in to save the CPU from stalls of failed predictions or is simply keeping the core more fully fed.
In SMT mode, in those scenarios, they're actually a little predictor or decoder-starved!
Interesting results, keep up the good work TPU!

Moment in the video here:
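For anyone who wants to feel the branch-predictor effect described above, here is a rough, hypothetical sketch: the same loop over sorted vs unsorted data differs mainly in how predictable its branch is. CPython overhead masks most of the hardware effect, and none of this relates to Zen 5 measurements specifically; it only shows the shape of the experiment:

```python
# Branch-predictability demo: summing values above a threshold over
# unsorted vs sorted data. In a compiled language the sorted pass is
# dramatically faster because the branch becomes predictable; in Python
# the gap is small and noisy, so treat the timings as illustrative only.
import random
import timeit

data = [random.randrange(256) for _ in range(100_000)]
sorted_data = sorted(data)

def sum_over_threshold(values, threshold=128):
    total = 0
    for v in values:
        if v >= threshold:   # the branch the predictor must guess
            total += v
    return total

t_unsorted = timeit.timeit(lambda: sum_over_threshold(data), number=20)
t_sorted = timeit.timeit(lambda: sum_over_threshold(sorted_data), number=20)
print(f"unsorted: {t_unsorted:.3f}s, sorted: {t_sorted:.3f}s")
```

Both passes compute the same total; only the branch pattern differs.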
 
Joined
Apr 30, 2011
Messages
2,670 (0.55/day)
Location
Greece
Processor AMD Ryzen 5 5600@80W
Motherboard MSI B550 Tomahawk
Cooling ZALMAN CNPS9X OPTIMA
Memory 2*8GB PATRIOT PVS416G400C9K@3733MT_C16
Video Card(s) Sapphire Radeon RX 6750 XT Pulse 12GB
Storage Sandisk SSD 128GB, Kingston A2000 NVMe 1TB, Samsung F1 1TB, WD Black 10TB
Display(s) AOC 27G2U/BK IPS 144Hz
Case SHARKOON M25-W 7.1 BLACK
Audio Device(s) Realtek 7.1 onboard
Power Supply Seasonic Core GC 500W
Mouse Sharkoon SHARK Force Black
Keyboard Trust GXT280
Software Win 7 Ultimate 64bit/Win 10 pro 64bit/Manjaro Linux
Maybe a new chipset driver combined with a new AGESA will do the trick and get Zen 5 working as planned.
 
Joined
Jan 14, 2019
Messages
11,019 (5.39/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2, 4 + 8 TB Seagate Barracuda 3.5"
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Windows 10 Pro
It seems like AMD needs to implement a program similar to Intel's APO.
Or OSes / game engines need to be more aware and make better use of SMT. The technology has been with us since Pentium 4, so it's not some kind of revolutionary new thing that one can't write code for.
 
Joined
Apr 19, 2018
Messages
1,220 (0.53/day)
Processor AMD Ryzen 9 5950X
Motherboard Asus ROG Crosshair VIII Hero WiFi
Cooling Arctic Liquid Freezer II 420
Memory 32Gb G-Skill Trident Z Neo @3806MHz C14
Video Card(s) MSI GeForce RTX2070
Storage Seagate FireCuda 530 1TB
Display(s) Samsung G9 49" Curved Ultrawide
Case Cooler Master Cosmos
Audio Device(s) O2 USB Headphone AMP
Power Supply Corsair HX850i
Mouse Logitech G502
Keyboard Cherry MX
Software Windows 11
AMD apparently tested 2MB and 3MB cache versions, and according to insiders that MLiD talked to, the improvements were something like 4% and 7% on average, so they decided it wasn't worth it.
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
 
Last edited:

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
27,425 (3.70/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
with an AMD engineer
Mike is much more than just an engineer, but yeah, really good interview. Unfortunately AMD's marketing/PR team is afraid of proactively sharing these details, so we only get a fairly high-level overview like you see in the slides, without much explanation on the reasoning behind them and I'm only allowed to ask so many questions. I submitted 22, got 3 answers after like a week. In the case of Zen 5, after a lot of press complained, they actually had a follow up call to the LA event where they finally shared more info, instead of just talking about AI AI AI

But without much more insight into the machine (that nobody but AMD has), while I like your hypothesis (it's good), I don't think it's better than many others and neither should be published as "answer" in the original article--no doubt, some tech media would do that and sell it as their invention for more clicks
 
Last edited:
Joined
May 3, 2018
Messages
2,683 (1.17/day)
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
Strix Halo is getting a new 3nm I/O die, which is why it's about a year late. I would hope Zen 6, which one would presume is on N3P at worst, gets a new I/O die with better capabilities. They are supposed to be fixing the latency issues with dual CCDs too, as well as the bandwidth issues.
 
Joined
Jun 17, 2019
Messages
4 (0.00/day)
Think of what it would do to SMT and 1% lows. A lot more than +-5% with Zen5.

AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.

The wish list for Zen 6 is long, and I don't think they'll do much beyond giving it a 2MB L2 cache (1.5MB would not surprise me in the least... drip... drip...) and fixing whatever low-hanging fruit they deliberately didn't fix. The biggest problem they have is the IO die, which is holding AM5 back with awful memory support, as well as its physical size; but I very much doubt we'll see a new IO die in Zen 6, unless they're planning a Zen 7 on AM5.
The chip size is fine; they have plenty of space to work with. This is definitely not an issue for them (and moving from 5nm to 4nm allowed for a 20-30% density increase).

I do agree that Zen 5 should have gotten an I/O die upgrade, something that would allow for at least 6800 on desktop (UCLK 1:1). You can currently run 6400 1:1 if you OC it, but it's not guaranteed.
 
Joined
Apr 19, 2018
Messages
1,220 (0.53/day)
Processor AMD Ryzen 9 5950X
Motherboard Asus ROG Crosshair VIII Hero WiFi
Cooling Arctic Liquid Freezer II 420
Memory 32Gb G-Skill Trident Z Neo @3806MHz C14
Video Card(s) MSI GeForce RTX2070
Storage Seagate FireCuda 530 1TB
Display(s) Samsung G9 49" Curved Ultrawide
Case Cooler Master Cosmos
Audio Device(s) O2 USB Headphone AMP
Power Supply Corsair HX850i
Mouse Logitech G502
Keyboard Cherry MX
Software Windows 11
Strix Halo is getting a new 3nm I/O die, which is why it's about a year late. I would hope Zen 6, which one would presume is on N3P at worst, gets a new I/O die with better capabilities. They are supposed to be fixing the latency issues with dual CCDs too, as well as the bandwidth issues.
Do you think they will update the IO die for the last generation of Zen on AM5? Or do you think AMD are going to stick with AM5 for longer? I just can't see them doing a new IO die for just 1 generation, and won't that also require a new MB?
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
Looking at the review numbers, I'm not really seeing anything out of the ordinary with SMT on vs off.
E.g. older CPUs, even in thread-sensitive benchmarks that benefited greatly from more cores/threads without HT/SMT (e.g. Cinebench), would struggle to scale their multi-thread performance linearly with the number of physical CPU cores available - HT/SMT would help out a lot in those scenarios:
AMD Phenom:
1100T (six-core) ST:MT ratio: 5.23

Sandy-Bridge:
i5 (quad-core) ST:MT ratio: 3.67 (no-HT)
i7 (quad-core) ST:MT ratio: 4.33 (HT)

AMD 9700X (8-core) using numbers taken from this review:
SMT on ST:MT ratio: 8.95
SMT off ST:MT ratio: 7.03
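For clarity, the ST:MT ratios above are just the multi-thread score divided by the single-thread score; a quick sketch (the scores below are made up for illustration, not the review's numbers):

```python
# ST:MT scaling ratio: how many "single-thread equivalents" a chip
# delivers when all cores/threads are loaded.

def scaling_ratio(st_score: float, mt_score: float) -> float:
    """Multi-thread score expressed in multiples of the single-thread score."""
    return mt_score / st_score

# e.g. a hypothetical 8-core part scoring 2,000 ST and 17,900 MT:
print(round(scaling_ratio(2000, 17900), 2))  # 8.95 -> scales past 8x via SMT
```

A ratio above the physical core count means SMT is adding throughput beyond what the cores alone deliver.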

Basically, in any scenario with thread/resource parallelisation where each thread isn't part of a massive memory-hungry process and has no strict timing requirements (e.g. one thread must finish before another is allowed to start), SMT enabled easily sees off SMT disabled, as seen in the review's Server/Workstation, AI, and File Encryption/Compression benchmarks.
The odd outlier is some of the Office productivity application results (Excel especially).

As has always been the case since Intel debuted Hyper-Threading to overcome the Pentium 4's long pipeline and potential stalling, games are not an ideal candidate and usually take a small hit, because game devs don't/can't optimise as easily: game engines usually have distinctly different tasks executing in different threads. It's good to see some notable exceptions where devs / game engines can leverage HT/SMT for a little extra boost, e.g. Cyberpunk, Elden Ring, Starfield.

What is more interesting / surprising is how little difference it makes to some applications / games in the real world (i.e. not in synthetic benchmarks) - core scaling / utilisation seems reasonably decent in scenarios without SMT/HT being available - how much credit goes to application (or shared development library) developers versus AMD/Intel is debatable.
Having the option set either way isn't the issue it used to be in terms of sacrificing / gaining performance. Back with the P4, or even the first-gen Core i3 (where we had a bunch of laptops that could only run WinXP with HT disabled until a BIOS fix was available), not having HT enabled was very obvious - but of course those were single/dual-core CPUs, so a lack of threading would stand out.

Do you think they will update the IO die for the last generation of Zen on AM5? Or do you think AMD are going to stick with AM5 for longer? I just can't see them doing a new IO die for just 1 generation, and won't that also require a new MB?
Not necessarily - AM4 went through 2 different IO dies, and at least 3 generations of monolithic dies with markedly different IO/system-logic capabilities and fabrication processes, and the right boards can work with all of them.
 
Last edited:
Joined
Jan 3, 2021
Messages
3,095 (2.34/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
AMD designed themselves into a corner when they released AM5 by keeping cooler compatibility. The chip is physically too small, and the same goes for their EPYC lineup, so the CCD cannot grow beyond a certain size. AMD could have given each core the 2MB L2 cache it obviously needs, but sticking with 4nm meant hitting that size limit; it should have been a 3nm design. Instead AMD took the easy route with 4nm, then made the tiny 1MB L2 as fast as possible to mitigate the problem. It didn't, hence the bad SMT performance and falling short of the promised 16% IPC gain in most non-math-intensive applications. The L3 cache, likewise, is half the size it should be, due to AMD's slow IF and memory controller.
There's another size limit that AMD has to consider: as many whole CCDs as possible must fit into the 33 x 26 mm rectangle (reticle size) to optimise the use of costly lithography machines.
I didn't do much calculation; the die size could be around 9.3 x 7.6 mm according to available die shots (or rather drawings), and there must be a small gap to allow cutting the dies apart. Maybe, just maybe, this is what stopped AMD from adding another ~400M transistors (8 cores x 8 Mbit/core x 6 transistors/bit) to the 8.3B already on the die.
And AMD loves proper binary numbers, unlike Intel, who doesn't mind odd cache sizes like 1.875 MB or 2.5 MB.
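Sanity-checking the ~400M figure from the formula above (8 cores x 8 Mbit/core x 6 transistors/bit):

```python
# 1 MB of extra L2 per core = 8 Mbit; each bit is a 6T SRAM cell.
bits_per_core = 8 * 1024 * 1024        # 8 Mbit per core
transistors = 8 * bits_per_core * 6    # 8 cores x bits x 6 transistors/bit
print(f"{transistors:,} transistors (~{transistors / 1e6:.0f}M)")
# 402,653,184 transistors (~403M)
```

So the back-of-the-envelope estimate checks out, ignoring tag arrays and other cache overhead.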
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
And AMD loves proper binary numbers, unlike Intel, who doesn't mind odd cache sizes like 1.875 MB or 2.5 MB.

Those aren't necessarily 'odd' cache sizes - you're looking at a base10 (decimal) scaled measurement of something which logically is designed for base2 (binary) maths.
Alder Lake with its 1.25MB cache sizes, assuming Intel are using normal MB notation, would be 1280KB, which in normal binary terms is a nice number.
1.875MB would be 1920KB.

There are many places on the web which list the size without it being scaled to decimal MB numbers.
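The conversion is easy to check: a fractional-looking MB figure is a perfectly round KB figure (illustrative snippet):

```python
# Cache sizes that look fractional in MB are round numbers of KB.
def mb_to_kb(mb: float) -> int:
    return int(mb * 1024)

print(mb_to_kb(1.25))   # 1280 (Alder Lake L2)
print(mb_to_kb(1.875))  # 1920
print(mb_to_kb(2.5))    # 2560
```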
 
Joined
Jan 3, 2021
Messages
3,095 (2.34/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Looking at the review numbers, I'm not really seeing anything out of the ordinary with SMT on vs off.
E.g. older CPUs, even in thread-sensitive benchmarks that benefited greatly from more cores/threads without HT/SMT (e.g. Cinebench), would struggle to scale their multi-thread performance linearly with the number of physical CPU cores available - HT/SMT would help out a lot in those scenarios:
AMD Phenom:
1100T (six-core) ST:MT ratio: 5.23

Sandy-Bridge:
i5 (quad-core) ST:MT ratio: 3.67 (no-HT)
i7 (quad-core) ST:MT ratio: 4.33 (HT)

AMD 9700X (8-core) using numbers taken from this review:
SMT on ST:MT ratio: 8.95
SMT off ST:MT ratio: 7.03

Basically, in any scenario with thread/resource parallelisation where each thread isn't part of a massive memory-hungry process and has no strict timing requirements (e.g. one thread must finish before another is allowed to start), SMT enabled easily sees off SMT disabled, e.g. Server/Workstation, AI, File Encryption/Compression.
The odd outlier is some of the Office productivity application results (Excel especially).

As has always been the case since Intel debuted Hyper-Threading to overcome the Pentium 4's long pipeline and potential stalling, games are not an ideal candidate and usually take a small hit, because game devs don't/can't optimise as easily: game engines usually have distinctly different tasks executing in different threads. It's good to see some notable exceptions where devs / game engines can leverage HT/SMT for a little extra boost, e.g. Cyberpunk, Elden Ring, Starfield.

What is more interesting / surprising is how little difference it makes to some applications / games in the real world (i.e. not in synthetic benchmarks) - core scaling / utilisation seems reasonably decent in scenarios without SMT/HT being available - how much credit goes to application (or shared development library) developers versus AMD/Intel is debatable.
Having the option set either way isn't the issue it used to be in terms of sacrificing / gaining performance. Back with the P4, or even the first-gen Core i3 (where we had a bunch of laptops that could only run WinXP with HT disabled until a BIOS fix was available), not having HT enabled was very obvious - but of course those were single/dual-core CPUs, so a lack of threading would stand out.
SMT on x86/x64 has a problem that everyone here seems to overlook: the two threads that run on the same core have equal priorities, and OS and applications can't change that. If a single thread can use 100% of the core performance without HT, two will run at about 65% + 65%, with unpredictable variations, with HT. Not 100% + 30% or something. Disable HT, and the same two threads will have 70% + 30% minus context switching, with less variability because OS preemptive multitasking does its job.

A good way around that would be to identify the main, time-critical thread of a game (or Excel, for that matter) and let it have a core for itself for as long as possible. Kind of über-affinity. An application can't do that without support from the OS, but Windows has no such feature - or am I wrong here?

On top of that, the system of priorities on x86/x64 is insufficient in at least one other way: DRAM access is not prioritised.
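For what it's worth, a user-space approximation of that "über-affinity" idea does exist via CPU affinity, though it only restricts your own process; the OS can still put other processes' threads on the same core, which is exactly the gap described above. The sketch below uses the Linux-only os.sched_setaffinity; on Windows you'd need psutil or SetProcessAffinityMask instead:

```python
# Pin the current process to a single logical CPU so its time-critical
# work stays on one core. Affinity only constrains THIS process; the OS
# scheduler may still place other processes on the same core.
import os

def pin_to_cpu(cpu: int) -> set[int]:
    """Restrict the calling process to one logical CPU (Linux-only)."""
    os.sched_setaffinity(0, {cpu})   # 0 = the calling process
    return os.sched_getaffinity(0)   # confirm the new mask

if hasattr(os, "sched_setaffinity"):
    first = min(os.sched_getaffinity(0))  # pick a CPU we're allowed to use
    print(pin_to_cpu(first))
```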

Those aren't necessarily 'odd' cache sizes - you're looking at a base10 (decimal) scaled measurement of something which logically is designed for base2 (binary) maths.
Alder Lake with its 1.25MB cache sizes, assuming Intel are using normal MB notation, would be 1280KB, which in normal binary terms is a nice number.
1.875MB would be 1920KB.

There are many places on the web which list the size without it being scaled to decimal MB numbers.
These "odd" numbers are still very much round in binary notation, I understand that. I (and you too) should have used MiB and KiB here.
 
Last edited:
Joined
Jun 14, 2019
Messages
4 (0.00/day)
Yeah, it's almost like this was super rushed for reasons that make sense only to AMD... There was no reason to rush it ahead of motherboards; it clearly could have used more time in the oven so AMD could fix issues, do more internal testing, and write reviewer guides with actually good guidelines on how to get the most out of the chip. Seems like recent bad Radeon habits are spilling over to Ryzen. Plus, they could have optimised power use far better, so there'd be an actual performance gain. Yes, they wouldn't win on power consumption, but almost no one would care. And if they'd done that and announced X3D, they'd have had a win on their hands. But nope, AMD can't escape the underdog mentality.
 
Joined
Jun 20, 2024
Messages
151 (2.52/day)
Yeah, it's almost like this was super rushed for reasons that make sense only to AMD... There was no reason to rush it ahead of motherboards; it clearly could have used more time in the oven so AMD could fix issues, do more internal testing, and write reviewer guides with actually good guidelines on how to get the most out of the chip. Seems like recent bad Radeon habits are spilling over to Ryzen. Plus, they could have optimised power use far better, so there'd be an actual performance gain. Yes, they wouldn't win on power consumption, but almost no one would care. And if they'd done that and announced X3D, they'd have had a win on their hands. But nope, AMD can't escape the underdog mentality.
I don't think 'more time in the oven' would have helped. What we're seeing is definitely a 'server first' design (which, let's not kid ourselves, has been the case for decades with Intel and AMD), and boy will those efficiency numbers for certain types of task look very nice.

At the end of the day, AMD don't want to be making more than x number of products at any one time and also not have specialist product lines with limited returns... so these will filter to the mainstream.
X3D chips always follow later, probably because the validation / production process is a bit more complex and not really utilised outside of the desktop PC space, meaning it will always lag physical development and manufacturing vs just the core CCD dies - why hold back one product and build up masses of inventory just to release a halo product which will never make up more than a small share of your sales? Who do you think they are... Apple/Intel?

SMT on x86/x64 has a problem that everyone here seems to overlook: the two threads that run on the same core have equal priorities, and OS and applications can't change that. If a single thread can use 100% of the core performance without HT, two will run at about 65% + 65%, with unpredictable variations, with HT. Not 100% + 30% or something. Disable HT, and the same two threads will have 70% + 30% minus context switching, with less variability because OS preemptive multitasking does its job.

A good way around that would be to identify the main, time-critical thread of a game (or Excel, for that matter) and let it have a core for itself for as long as possible. Kind of über-affinity. An application can't do that without support from the OS, but Windows has no such feature - or am I wrong here?

On top of that, the system of priorities on x86/x64 is insufficient in at least one other way: DRAM access is not prioritised.


These "odd" numbers are still very much round in binary notation, I understand that. I (and you too) should have used MiB and KiB here.

As a non-developer I can't answer that question. I don't think there were great mechanisms before Windows 11 for the CPU to push the OS scheduler into making informed decisions about which CPU core to use for certain tasks - and to what extent that can be done is unknown (i.e. is it just for 'performance/economy', or can it be informed about utilising certain cores / resources for lower latency, etc.). I don't follow Linux kernel updates, so no idea what the capabilities are there, but there would need to be some interface / metric provided by the CPU to inform the OS scheduler about how to run something efficiently, and I'm not sure there is such a thing in place.

Would be great if someone could actually provide some insight in to that. It seems HT/SMT and the OS schedulers are still basically 'hoping for the best' in terms of managing processes and threads generated.
 
Joined
Mar 16, 2017
Messages
1,942 (0.72/day)
Location
Tanagra
System Name Budget Box
Processor Xeon E5-2667v2
Motherboard ASUS P9X79 Pro
Cooling Some cheap tower cooler, I dunno
Memory 32GB 1866-DDR3 ECC
Video Card(s) XFX RX 5600XT
Storage WD NVME 1GB
Display(s) ASUS Pro Art 27"
Case Antec P7 Neo
But E cores on a phone are there precisely for power savings and to get decent battery life, the most critical thing on a phone. It's important on laptops too, but they have much larger batteries and can be plugged in for use. They decided they needed full-fat cores to compete against Apple as much as x86, and they seem to have done a decent job, as battery life looks good.

I wonder if next year Nvidia's and MediaTek's Arm SoCs for Windows will use E cores?


I get he talks a lot of shit, but he also gets a lot right. I don't doubt he has contacts inside AMD and Intel.
Yet Apple has P+E on all their devices: phones, tablets, and desktop PCs. When it comes to epic battery life, Mac is where it's at. The SD-X stuff is an improvement, but we're still looking at hours of difference, and active cooling is required. I bet E-cores would have at least helped with the former. I just think you don't need as many E cores as Intel puts in there: 4E should cover the basics, but instead they call a 2P+8E an i7, when oftentimes it performs more like an i5 from 2010, in my experience.
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
but we're still looking at hours of difference, and active cooling is required.
You can easily make a lot of x86 chips passively cooled as well by lowering Tjmax (on Intel?), and that would also increase battery life. A MacBook Air running at 90C or above is not ideal, at least for me!
 
Joined
Mar 16, 2017
Messages
1,942 (0.72/day)
Location
Tanagra
System Name Budget Box
Processor Xeon E5-2667v2
Motherboard ASUS P9X79 Pro
Cooling Some cheap tower cooler, I dunno
Memory 32GB 1866-DDR3 ECC
Video Card(s) XFX RX 5600XT
Storage WD NVME 1GB
Display(s) ASUS Pro Art 27"
Case Antec P7 Neo
You can easily make a lot of x86 chips passively cooled as well by lowering Tjmax (on Intel?), and that would also increase battery life. A MacBook Air running at 90C or above is not ideal, at least for me!
Temps are what they are, IMO. Most of the time, a MBA won't even hit 50C under everyday tasks. Load it down and it will certainly blow past 90C--it's actually closer to 105C, but that's the trade-off. A Mac with active cooling won't get near that hot though. Don't most modern GPUs and CPUs also flirt with these temps under load, even with active cooling? It's part of the design, as thermal density is high, but the chips have temp sensors all over to throttle hotspots.
 
Joined
Apr 12, 2013
Messages
7,244 (1.75/day)
I like active cooling unless you're just aiming for higher battery life benchmarks for some reason? It keeps the performance consistent & the overall system/laptop cooler. No reason to avoid that even if it nets you an hour or two extra.
 
Joined
May 22, 2010
Messages
367 (0.07/day)
Processor R7-7700X
Motherboard Gigabyte X670 Aorus Elite AX
Cooling Scythe Fuma 2 rev B
Memory no name DDR5-5200
Video Card(s) Some 3080 10GB
Storage dual Intel DC P4610 1.6TB
Display(s) Gigabyte G34MQ + Dell 2708WFP
Case Lian-Li Lancool III black no rgb
Power Supply CM UCP 750W
Software Win 10 Pro x64
Would you be able to test a couple of the benchmarks that showed the most difference in Windows 10? This looks to be another nail in the Win11 coffin.
 
Joined
Apr 30, 2020
Messages
919 (0.58/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 16Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
Skimmed through the article and got to the conclusion, where the author seems at a loss as to why SMT behaves like this, with no word from AMD about SMT changes to explain it.
Haven't read this whole forum thread, maybe someone has already pointed this out, but AMD in its press releases did hint at SMT improvements, if you looked hard enough and thought about it.
The key is the dual branch predictors and decoders, new to Zen 5.
While admittedly not much is said of them in the official releases, they are mentioned and shown in diagrams.
A video VERY much worth watching is the one from Chips and Cheese, where he goes into the depths of the new Zen 5 architecture changes with an AMD engineer.
Specifically, he asks at one point whether 1T loads can make full use of all the core's front-end resources (predictors, decoders, etc.), and the answer is YES.
So, disable SMT and you force 1T mode per core; each thread then gets 2 branch predictors and decoders instead of 1.
I would say that the benchmarks with the biggest performance gains with SMT disabled are scenarios where the extra branch-prediction and/or decoder muscle kicks in to save the CPU from stalls on failed predictions, or simply keeps the core more fully fed.
In SMT mode, in those scenarios, they're actually a little predictor- or decoder-starved!
Interesting results, keep up the good work TPU!

Moment in the video here:

I'm pretty sure I already mentioned this somewhere else on here, in another thread.
They're going to have to add a lot more predictors, either by going quad or hexa or by simply enlarging them by 100%, as a baseline for a new architecture. I believe Zen 4 is at the maximum IPC of its design; that's why this change was made. You need a starting point close to your last design in IPC while leaving room to increase it.
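As a side note for anyone who wants to approximate the SMT-off scheduling behaviour without a trip to the BIOS: on Linux you can instead pin a workload to one logical CPU per physical core. A minimal sketch, assuming the usual `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` format; the helper name and the sample 8-core/16-thread layout below are made up for illustration:

```python
def physical_core_representatives(sibling_lists):
    """Given thread_siblings_list strings (one per logical CPU, as read
    from sysfs), return one logical CPU per physical core -- the set you
    would pin to with taskset to emulate SMT-off scheduling."""
    reps = []
    seen = set()
    for s in sibling_lists:
        # entries look like "0,8" or "0-1"; normalise to a sorted tuple
        cpus = []
        for part in s.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.extend(range(int(lo), int(hi) + 1))
            else:
                cpus.append(int(part))
        key = tuple(sorted(cpus))
        if key not in seen:
            seen.add(key)
            reps.append(key[0])  # first sibling represents the core
    return reps

# Hypothetical layout where logical CPU n is paired with CPU n+8
sample = ["%d,%d" % (i % 8, i % 8 + 8) for i in range(16)]
print(physical_core_representatives(sample))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

On a real system you'd read those strings from sysfs and then launch the game with something like `taskset -c 0-7` (for this layout) so the scheduler only ever sees one thread per core.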
 
Last edited:
Joined
Oct 30, 2020
Messages
169 (0.12/day)

It's an interesting take, but something still feels off. Disabling SMT shouldn't give that much of a performance increase in games. I get that SMT off gives each thread 2x the BPs and decoders, while SMT on gives each thread access to one BP and decoder. But even with just one each, SMT on is the same as the previous gen, yet the average uplift seems to be around 5%, whereas single-threaded floating-point IPC increased by a good 18% even without AVX-512 workloads, and games should definitely benefit from that.

I think we are memory-limited in that scenario, but your explanation is plausible in terms of Zen 5 SMT being different: unlike previous generations, you now actually get a marked increase in branch predictors and decoders for a single thread if you turn SMT off. Previous generations simply stopped sharing the same number of BPs/decoders with SMT off.

Now there's yet another data point for the 9700X. With PBO, you can either have a 15% ST, 10% MT and 5% gaming performance increase, or a 20% ST, -5% MT and 10% gaming performance increase. I expect all those numbers to improve with memory tuning, as Zen 5 should be more sensitive to it.
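(For anyone wondering how "average uplift" figures like these are typically computed: reviewers usually take the geometric mean of the per-benchmark ratios rather than the arithmetic mean, since ratios compose multiplicatively. A minimal sketch; the per-game ratios here are made up purely for illustration:)

```python
import math

def geomean_uplift(ratios):
    """Geometric mean of per-benchmark performance ratios (new/old).
    Returns the average uplift as a fraction, e.g. 0.05 for +5%."""
    g = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return g - 1.0

# Hypothetical per-game ratios: some titles regress, some gain
ratios = [1.12, 0.98, 1.07, 1.03]
print("average uplift: %.1f%%" % (100 * geomean_uplift(ratios)))
```

The geometric mean is the fairer summary because a 2x gain in one title and a 0.5x regression in another cancel out exactly, which an arithmetic mean would not show.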

Also, Mike Clark isn't just an AMD engineer but THE AMD Zen engineer. Hats off to Chips and Cheese, too: what a way to start your YouTube channel, with an interview with Mike Clark himself. Happy that they are finally getting the recognition they deserve; they churn out pretty impressive deep dives. Looking forward to their article. I want to know just how AMD managed to cram this many execution units, BPs and the like into the same footprint. Some consumers seem disappointed that their favourite application isn't accelerated as much as they would've liked, but the server guys are seeing massive performance increases, and AMD is probably laughing all the way to the bank.

Thanks w1zzard for the tests, much appreciated!
 