AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products

Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
There is no such thing as single threaded software these days, practically everything is written to use multiple threads.

Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

So even though 3d rendering is multithreaded, a lot of the "bone scripts" that 3d modelers write in Python (or whatever scripting language your 3d program supports) are single-thread bound. They could be written multithreaded, but these are 3d artists who are writing a lot of this stuff, not necessarily expert programmers. Rigging, import/export scripts, game animations, etc. etc. A lot of these things end up on a single thread.

People want both: single thread and multithreaded power.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
That article did help me understand the actual difference you're trying to convey, thanks.
However, I don't think the industry is doing it wrong. I agree that single threaded benchmarks cannot fully utilize the cores by AMD and Intel, and are thus not a completely accurate way of showing per core performance, while the M1 doesn't suffer from the same problem. Single threaded software doesn't care about this though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.
Whoever wrote that article doesn't know much about how CPUs work, let's dissect it;

extremetech article said:
SMT-enabled CPUs are able to schedule work from more than one thread for execution in the same clock cycle.
SMT in x86 CPUs switches between two threads, not executes them simultaneously.
The point of SMT is to utilize idle clock cycles, mostly due to cache misses, and let another thread utilize it.
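A toy model of the "fill the idle cycles" idea, as a sketch only. It assumes a core with a single issue slot per cycle and made-up traces where each number is how many cycles a thread stalls after issuing an instruction. The function name `simulate` and the trace format are illustrative, not taken from any real simulator, and real cores are superscalar, so this says nothing about whether a given x86 design issues from both threads in the same cycle.

```python
def simulate(traces):
    """Return (cycles, issued) for threads sharing one issue slot per cycle.

    traces[t][i] is the number of stall cycles (e.g. a cache miss) that
    thread t must wait after issuing its i-th instruction.
    """
    n = len(traces)
    pos = [0] * n        # index of the next instruction for each thread
    ready_at = [0] * n   # first cycle at which each thread may issue again
    cycle = 0
    issued = 0
    last = -1            # thread that issued most recently (round-robin)
    while any(pos[t] < len(traces[t]) for t in range(n)):
        for offset in range(1, n + 1):
            tid = (last + offset) % n
            if pos[tid] < len(traces[tid]) and ready_at[tid] <= cycle:
                stall = traces[tid][pos[tid]]
                pos[tid] += 1
                ready_at[tid] = cycle + 1 + stall
                issued += 1
                last = tid
                break  # only one instruction may issue per cycle
        cycle += 1
    return cycle, issued


# Hypothetical workload: three cheap instructions, then one that misses cache.
trace = [0, 0, 0, 8] * 2

for label, threads in (("one thread", [trace]), ("two threads", [trace, trace])):
    cycles, count = simulate(threads)
    print(label, "cycles:", cycles, "issued:", count,
          "slot utilization:", round(count / cycles, 2))
```

In this model the second thread only ever uses slots the first one leaves empty, so total throughput rises while neither thread gets faster on its own.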

extremetech article said:
Modern x86 CPUs from AMD and Intel take advantage of SMT to improve performance by an average of 20-30 percent at a fraction of the cost or power that would be required to build an entire second core.
<snip>
Apple’s 8-wide M1 doesn’t have this problem. The front-end of a RISC CPU allows generally higher efficiency in terms of instructions decoded per single thread.
This is completely untrue.
There isn't a decoding bottleneck on x86 designs, and if there were, adding SMT would only make it worse, as this makes two threads share the same front-end and cache.
RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.

extremetech article said:
An x86 CPU achieves much higher overall efficiency when you run two threads through a single core, partly because they’ve been explicitly designed and optimized for it, and partly because SMT helps CPUs with decoupled CISC front-ends achieve higher IPC overall.
Nonsense.
SMT doesn't improve IPC.
 
Joined
Mar 10, 2010
Messages
11,878 (2.21/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Whoever wrote that article doesn't know much about how CPUs work, let's dissect it;


SMT in x86 CPUs switches between two threads, not executes them simultaneously.
The point of SMT is to utilize idle clock cycles, mostly due to cache misses, and let another thread utilize it.


This is completely untrue.
There isn't a decoding bottleneck on x86 designs, and if there were, adding SMT would only make it worse, as this makes two threads share the same front-end and cache.
RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.


Nonsense.
SMT doesn't improve IPC.
SMT isn't implemented the same way by everyone; AMD and, I think, Intel have advanced it beyond such simplicity.
"
AMD Zen microarchitecture has 2-way SMT.

VISC architecture uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End which splits instructions into virtual hardware threadlets which are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic on a near-single cycle latency level (1–4 cycles depending on the change in allocation), depending on individual application needs. Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where."

Taken from the wiki.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
There isn't a decoding bottleneck on x86 designs

Laughs in Bulldozer.

But you're right. I'm just poking fun at your phrasing. Seriously though: Bulldozer had exactly the problem you're describing, so it's a good realistic example of what you're talking about.
 
Joined
Jan 8, 2017
Messages
9,438 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

There is no application which only has a UI. Of course a lot of sub-components are single threaded, but even the simplest of applications use more threads.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
There is no application which only has a UI. Of course a lot of sub-components are single threaded, but even the simplest of applications use more threads.

In practice, most code is multithreaded.

But in practice, most code is single-thread bound, due to Amdahl's law. That means the code gets faster when single-thread performance goes up, while the gain from more multithread performance is capped by the serial portion, which is the nature of Amdahl's law.

There are exceptions: 3d Modeling renders are closer to Gustafson's law. That is: people aren't primarily interested in rendering times per se. A 3d Render is "set" at 8-hours or ~72 hours per frame (in the case of Marvel / Pixar movies), which is the largest practical time for their workflow. What 3d modelers want is a better image at the end of those 72 hours, which follows Gustafson's law (you can do more work / more detailed modeling in the same timeframe).

Video games are often multithread-programmed but single-thread bound on the physics thread. AI, sound, even graphical effects can all complete nearly immediately. But the physics processing (collision detection, bullet detection, object-per-object updates) takes the most time, and is often only written in a single thread for maximum consistency. (It is hard to make every machine in a multiplayer game update its physics identically unless they all do it in a single thread, in a well-defined order and with well-defined floating-point rounding.)
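The floating-point part of that is easy to see in isolation. A minimal sketch, with arbitrary values standing in for per-object contributions: the same three numbers summed in a different grouping give a different double, which is why update order has to be pinned down if every machine must reach the same state.

```python
# Floating-point addition is not associative, so applying the same
# contributions in a different order can change the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one update order
other_order = a + (b + c)     # same values, different grouping

print(left_to_right, other_order, left_to_right == other_order)
```

If worker threads accumulate in whatever order they happen to finish, the grouping is effectively random from run to run.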

------

Same game at higher FPS: Amdahl's law and single-thread bound.

Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.

Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.
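To put rough numbers on the two laws, here is a small sketch. The 50% parallel fraction and the core counts are made-up inputs for illustration, and the function names are mine. Note that the fraction means something slightly different in each law: for Amdahl it is the share of the original fixed workload, for Gustafson it is the share of time spent in parallel code on the scaled-up run.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Fixed workload: how much faster does the same job finish?"""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Fixed time budget: how much more work fits in the same time?"""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * cores


for cores in (2, 4, 8, 16):
    print(cores, "cores:",
          "Amdahl", round(amdahl_speedup(0.5, cores), 2),
          "Gustafson", round(gustafson_speedup(0.5, cores), 2))
```

With half the work serial, the Amdahl figure can never reach 2x no matter how many cores are added, while the Gustafson figure keeps growing because the extra cores are spent on extra work.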

RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.
POWER9 has pipeline stalls because it was designed with (lol) 2-cycle latency on XOR and Add instructions.

The other "RISC" processors (and I hate that word...), ARM and RISC-V, do not suffer from this behavior. I think IBM intended for POWER9 to be a 4-way or 8-way SMT design from the start. When you consider that most business-class code (e.g. databases) is sitting around waiting on cache stalls, it makes sense to go for higher SMT and higher-latency cores.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Same game at higher FPS: Amdahl's law and single-thread bound.
Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.
Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.
(Real time) games are very latency sensitive workloads, which makes them extra challenging to scale across multiple threads.
The game simulation ("game loop" / "game physics" (not effects simulation)) itself should run at a constant tick rate, and should be decoupled from the rendering. Quite often, this simulation is only parallel on a small scale, but sequential on the larger scale, which makes splitting it over multiple threads difficult without running into synchronization issues. If you have ever seen games where the physics go mad and elements accelerate like crazy, it's probably because of timing issues causing incorrect calculations.
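A minimal sketch of that decoupling, using the common accumulator pattern. Everything here is illustrative: `step_physics` is a stand-in for a real simulation step, the 60 Hz default is an assumed tick rate, and rendering is only marked by a comment.

```python
def step_physics(state, dt):
    """Hypothetical simulation step: move position by velocity * dt."""
    pos, vel = state
    return (pos + vel * dt, vel)


def run_frames(frame_times, tick=1.0 / 60.0, state=(0.0, 1.0)):
    """Feed variable frame times in, but only simulate in tick-sized steps."""
    accumulator = 0.0
    ticks = 0
    for frame_dt in frame_times:
        accumulator += frame_dt
        # Catch up with as many fixed steps as the elapsed time covers.
        while accumulator >= tick:
            state = step_physics(state, tick)
            accumulator -= tick
            ticks += 1
        # A render call would go here, once per frame, at whatever
        # rate frames happen to arrive.
    return state, ticks


# Uneven frame times, same fixed step size for every simulation update.
final_state, tick_count = run_frames([0.03125, 0.0234375, 0.0078125],
                                     tick=1.0 / 64.0)
print(final_state, tick_count)
```

Because the simulation never sees the raw frame time, a long frame results in more steps of the usual size rather than one oversized step, which is the kind of thing that makes physics "go mad".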

Game rendering itself can use multiple threads, but not like how most people imagine it. Independent render passes can easily be split into separate queues, and particle simulation and asset loading too. But splitting up a single render pass among several threads will in all normal situations cause significant overhead. If your rendering thread is somehow CPU bound, I'm willing to bet that overhead has to do with your way of coding and little to do with lack of multithreading.

So there are limits to how far workloads can be parallelized, no matter how well it is done. It all depends on synchronization and dependencies. This is where Amdahl's law comes in, but since there are many levels of parallelization, the principle has to be applied on multiple levels, like when to use multithreading, SIMD, GPUs, etc…

I think IBM intended for POWER9 to be a 4-way or 8-way SMT design from the start. When you consider that most business-class code (e.g. databases) is sitting around waiting on cache stalls, it makes sense to go for higher SMT and higher-latency cores.
Yes, it seems like Power is dominated by database and Java workloads, which typically are stalled >90% of the time. I don't know if they designed the CPU with these workloads in mind, or if the workloads found the CPU.
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H150i Elite LCD XT White
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB, Toshiba N300 10TB HDD, Seagate Ironwolf 4T HDD
Display(s) Acer Predator X32FP 32in 160Hz 4K FreeSync/GSync DP, LG 32UL950 32in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
The industry is doing it wrong or completely missing the nuance. You cannot compare single-threaded performance directly because of the architectural differences. In any case it will be unfair to one or the other, but in the race to get those clicks, the truth gets thrown to the wayside.


Whoever wrote that article doesn't know much about how CPUs work, let's dissect it;


SMT in x86 CPUs switches between two threads, not executes them simultaneously.
The point of SMT is to utilize idle clock cycles, mostly due to cache misses, and let another thread utilize it.


This is completely untrue.
There isn't a decoding bottleneck on x86 designs, and if there were, adding SMT would only make it worse, as this makes two threads share the same front-end and cache.
RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.


Nonsense.
SMT doesn't improve IPC.
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store units.
 
Joined
Apr 24, 2020
Messages
2,710 (1.61/day)
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store units.

Au contraire, Apple M1 has 3 load + 2 store units per core and no SMT at all.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick. CPUs are simply following suit with today's compilers (and vice versa: compilers are compiling code to automatically take advantage of the large number of load/store units on modern CPUs).
 
Joined
Apr 30, 2020
Messages
986 (0.59/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 32Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
IPC fluctuates according to architecture, in fact it even fluctuates within the same architecture. A processor never has a constant IPC, that's quite literally impossible.

You can come up with an "average IPC" but that wouldn't mean much either.

Yeah, the branch predictor almost never makes the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
As far as I know, only Sandy Bridge had some way of spotting repeating/looping code and getting to the answer for it without doing the work.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store units.
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are designed to utilize mostly idle clock cycles (which are plentiful), and are very simple compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick.
They are, and this is achieved without any superscalar features exposed through the ISAs.
 
Joined
Jan 8, 2017
Messages
9,438 (3.27/day)
Yeah, the branch predictor almost never makes the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
As far as I know, only Sandy Bridge had some way of spotting repeating/looping code and getting to the answer for it without doing the work.

It's not just the branch prediction. The same number of instructions never has the same dependencies, and therefore the subset of them that can be executed in parallel always varies.
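A toy scheduler makes the point concrete. This is a sketch under simplifying assumptions that match no real core: every instruction takes one cycle, up to `width` instructions may issue per cycle, and `deps[i]` lists the earlier instructions that instruction `i` waits on. The function name and both example sequences are made up.

```python
def cycles_needed(deps, width):
    """Cycles to run all instructions, honoring dependencies and issue width."""
    done_cycle = {}                      # instruction index -> cycle it issued in
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every dependency finished in an earlier cycle.
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
        cycle += 1
    return cycle


# Two sequences of eight instructions each on the same 4-wide core.
independent = [[] for _ in range(8)]                        # no dependencies
chain = [[]] + [[i - 1] for i in range(1, 8)]               # each needs the previous

for name, deps in (("independent", independent), ("chain", chain)):
    cycles = cycles_needed(deps, width=4)
    print(name, "cycles:", cycles, "IPC:", len(deps) / cycles)
```

Same instruction count, same core, very different IPC, purely because of how the instructions depend on each other.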
 
Joined
Nov 3, 2011
Messages
695 (0.15/day)
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are designed to utilize mostly idle clock cycles (which are plentiful), and are very simple compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.


They are, and this is achieved without any superscalar features exposed through the ISAs.
Your SMT side-channel vulnerability argument is flawed, as I also use an AMD Zen 2 CPU, which I plan to upgrade to Zen 3. Do not apply Intel's SMT side-channel vulnerabilities to AMD Zen CPUs.

From https://www.zdnet.com/article/arm-cpus-impacted-by-rare-side-channel-attack/
Arm CPUs impacted by rare side-channel attack

Intel's side-channel issues are worse than AMD's.


AMD's Zen has a two-way SMT.
 