AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products

There is no such thing as single threaded software these days, practically everything is written to use multiple threads.

Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

So even though 3d rendering is multithreaded, a lot of the "bone scripts" that 3d modelers write in Python (or whatever scripting language your 3d program supports) are single-thread bound. They could be written multithreaded, but these are 3d artists who are writing a lot of this stuff, not necessarily expert programmers. Rigging, import/export scripts, game animations, etc. etc. A lot of these things end up on a single thread.
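To make that concrete, here's a rough Python sketch of what a typical artist-written rig pass tends to look like. The names are made up for illustration (this is not any real 3d package's API), but the shape is the same: one big loop on the main thread, so only single-thread speed makes it faster.

Code:
# Hypothetical sketch (made-up names, not any real 3d package's API):
# a typical artist-written rig pass is one big loop on the main thread,
# so only single-thread performance speeds it up.
import math

def solve_bone(bone_index):
    # stand-in for per-bone IK / constraint math
    acc = 0.0
    for i in range(50_000):
        acc += math.sin(bone_index + i * 0.001)
    return acc

def rig_pass(bone_count=100):
    # runs entirely on the main thread, like most artist-written tools;
    # it could be spread over cores with multiprocessing.Pool, but in
    # practice that rarely happens.
    return [solve_bone(b) for b in range(bone_count)]

if __name__ == "__main__":
    rig_pass()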

People want both: single thread and multithreaded power.
 
That article did help me understand the actual difference you're trying to convey, thanks.
However, I don't think the industry is doing it wrong. I agree that single threaded benchmarks cannot fully utilize the cores from AMD and Intel, and are thus not a completely accurate way of showing per-core performance, while the M1 doesn't suffer from the same problem. Single threaded software doesn't care about this though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.
Whoever wrote that article doesn't know much about how CPUs work, so let's dissect it:

extremetech article said:
SMT-enabled CPUs are able to schedule work from more than one thread for execution in the same clock cycle.
SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
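As a toy illustration of that "fill the stall cycles" idea (purely a made-up model, nothing like how real hardware actually schedules), consider:

Code:
# Toy model only: two threads share one "core"; whenever the active thread
# stalls on a pretend cache miss, the other thread gets the idle cycles.
def run_smt(threads, cycles):
    # threads: list of (miss_every, miss_penalty) tuples
    done = [0] * len(threads)           # instructions completed per thread
    stalled_until = [0] * len(threads)  # cycle at which each thread is ready again
    for cycle in range(cycles):
        for t, (miss_every, miss_penalty) in enumerate(threads):
            if cycle >= stalled_until[t]:
                done[t] += 1
                if done[t] % miss_every == 0:
                    stalled_until[t] = cycle + miss_penalty  # pretend cache miss
                break  # only one thread issues per cycle in this toy model
    return done

# A single thread wastes its stall cycles; two threads soak up each other's.
print(run_smt([(10, 20)], cycles=10_000))
print(run_smt([(10, 20), (10, 20)], cycles=10_000))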

extremetech article said:
Modern x86 CPUs from AMD and Intel take advantage of SMT to improve performance by an average of 20-30 percent at a fraction of the cost or power that would be required to build an entire second core.
<snip>
Apple’s 8-wide M1 doesn’t have this problem. The front-end of a RISC CPU allows generally higher efficiency in terms of instructions decoded per single thread.
This is completely untrue.
There isn't a decoding bottleneck on x86 designs, and if there were, adding SMT would only make it worse, as this makes two threads share the same front-end and cache.
RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.

extremetech article said:
An x86 CPU achieves much higher overall efficiency when you run two threads through a single core, partly because they’ve been explicitly designed and optimized for it, and partly because SMT helps CPUs with decoupled CISC front-ends achieve higher IPC overall.
Nonsense.
SMT doesn't improve IPC.
 
SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
SMT doesn't improve IPC.
SMT isn't done the same way by everyone; AMD, and I think Intel, have advanced it beyond such simplicity.
"
AMD Zen microarchitecture has 2-way SMT.

VISC architecture[11][12][13] uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End which splits instructions into virtual hardware threadlets which are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic on a near-single cycle latency level (1–4 cycles depending on the change in allocation), depending on individual application needs. Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where."

Taken from the wiki.
 
There isn't a decoding bottleneck on x86 designs

Laughs in Bulldozer.

But you're right. I'm just poking fun at your phrasing. Seriously though: Bulldozer had exactly the problem you're describing, so it's a good real-world example of what you're talking about.
 
Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

There is no application which only has a UI. Of course a lot of subcomponents are single threaded, but even the simplest of applications use more threads.
 
There is no application which only has a UI. Of course a lot of subcomponents are single threaded, but even the simplest of applications use more threads.

In practice, most code is multithreaded.

But in practice, most code is single-thread bound, due to Amdahl's law. Which means the code gets faster when you get +single-thread performance. +Multithread performance gains are minimized due to the nature of Amdahl's law.

There are exceptions: 3d modeling renders are closer to Gustafson's law. That is: people aren't primarily interested in rendering times per se. A 3d render is "set" at 8 hours or ~72 hours per frame (in the case of Marvel / Pixar movies), which is the largest practical time for their workflow. What 3d modelers want is a better image at the end of those 72 hours, which follows Gustafson's law (you can do more work / more detailed modeling in the same timeframe).
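For anyone who wants to see the difference in numbers, here's a quick Python sketch of the two textbook formulas (nothing workload-specific, just the math):

Code:
# Quick sketch of the two laws (textbook formulas, nothing workload-specific).
def amdahl_speedup(parallel_fraction, cores):
    # total speedup is capped by the serial part, no matter how many cores
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def gustafson_speedup(parallel_fraction, cores):
    # scaled speedup: keep the runtime fixed and grow the problem instead
    return (1.0 - parallel_fraction) + parallel_fraction * cores

# A render that is 95% parallel: Amdahl caps the win, Gustafson keeps scaling.
for cores in (4, 16, 64):
    print(cores, round(amdahl_speedup(0.95, cores), 1),
          round(gustafson_speedup(0.95, cores), 1))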

Video games are often multithread-programmed but single-thread bound on the physics thread. AI, sound, even graphical effects can all complete nearly immediately. But the physics simulation (collision detection, bullet detection, object-per-object updates) takes the most time, and is often written as a single thread for maximum consistency. (It is hard to make every client in a multiplayer game update its physics identically unless they all do it on a single thread, in a well-defined order, with well-defined floating-point rounding.)
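The floating-point part is easy to demonstrate: addition isn't associative, so two clients summing the same forces in a different order can slowly drift apart.

Code:
# Floating-point addition is not associative, so summing the same forces in a
# different order on two clients gives slightly different physics state.
forces = [0.1, 0.2, 0.3]
left_to_right = (forces[0] + forces[1]) + forces[2]
right_to_left = forces[0] + (forces[1] + forces[2])
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6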

------

Same game at higher FPS: Amdahl's law and single-thread bound.

Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.

Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.

RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.
POWER9 has pipeline stalls because it was designed with a laughable 2-cycle latency on XOR and add instructions.

The other "RISC" processors (and I hate that word...): ARM / RISC-V, do not suffer from this behavior. I think IBM intended for POWER9 to be a 4-way or 8-way SMT from the start. When you consider that most business class code (ie: databases) are sitting around waiting for cache-stalls, it makes sense to go higher SMT and higher-latency cores.
 
Same game at higher FPS: Amdahl's law and single-thread bound.
Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.
Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.
(Real time) games are very latency sensitive workloads, which makes them extra challenging to scale across multiple threads.
The game simulation ("game loop" / "game physics" (not effects simulation)) itself should run at a constant tick rate, and should be decoupled from the rendering. Quite often, this simulation is only parallel on a small scale, but sequential on the larger scale, which makes splitting it over multiple threads difficult without running into synchronization issues. If you have ever seen games where the physics go mad and elements accelerate like crazy, it's probably because of timing issues causing incorrect calculations.
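For reference, the usual way that decoupling is done is a fixed-timestep loop. Below is a minimal Python sketch with made-up update/render stand-ins (not any particular engine): the simulation always advances in constant ticks, no matter how fast or slow frames come out.

Code:
# Minimal fixed-tick game loop sketch (illustrative names, no real engine):
# the simulation advances in constant steps regardless of frame rate,
# which is the decoupling described above.
import time

TICK = 1.0 / 60.0  # constant simulation tick rate

def update(state, dt):
    state["t"] += dt          # stand-in for the real game/physics update
    return state

def render(state):
    pass                      # stand-in for drawing a frame

def game_loop(run_seconds=1.0):
    state = {"t": 0.0}
    accumulator = 0.0
    previous = time.perf_counter()
    end = previous + run_seconds
    while time.perf_counter() < end:
        now = time.perf_counter()
        accumulator += now - previous
        previous = now
        while accumulator >= TICK:   # catch up in fixed steps, never variable ones
            state = update(state, TICK)
            accumulator -= TICK
        render(state)                # rendering runs as fast as it can

game_loop()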

Game rendering itself can use multiple threads, but not like how most people imagine it. Independent render passes can easily be split into separate queues, and particle simulation and asset loading too. But splitting up a single render pass among several threads will in all normal situations cause significant overhead. If your rendering thread is somehow CPU bound, I'm willing to bet that overhead has to do with your way of coding and little to do with lack of multithreading.

So there are limits to how far workloads can be parallelized, no matter how well they are done. It all depends on synchronization and dependencies. This is where Amdahl's law comes in, but since there are many levels of parallelization, the principle has to be applied on multiple levels, like when to use multithreading, SIMD, GPUs, etc…

I think IBM intended POWER9 to be 4-way or 8-way SMT from the start. When you consider that most business-class code (i.e. databases) is sitting around waiting on cache stalls, it makes sense to go with higher SMT and higher-latency cores.
Yes, it seems like Power is dominated by database and Java workloads, which typically are stalled >90% of the time. I don't know if they designed the CPU with these workloads in mind, or if the workloads found the CPU, though.
 
The industry is doing it wrong, or completely missing the nuance. You cannot compare single-threaded performance directly because of their architectural differences. In any case it will be unfair to one or the other, but in the race to get those clicks, the truth gets thrown by the wayside.


SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
SMT doesn't improve IPC.
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.
 
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.

Au contraire, Apple M1 has 3 load + 2 store units per core and no SMT at all.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick. CPUs are simply following suit with today's compilers (and vice versa: compilers are compiling code to automatically take advantage of the large number of load/store units on modern CPUs).
 
IPC fluctuates according to architecture; in fact, it even fluctuates within the same architecture. A processor never has a constant IPC; that's quite literally impossible.

You can come up with an "average IPC" but that wouldn't mean much either.

Yeah, the branch predictors almost never use the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
I know only Sandy Bridge had some way of recognizing repeating/looping code to get to the answer without redoing the work.
 
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are fairly simple and designed to utilize mostly idle clock cycles (which are plentiful), especially compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick.
They are, and this is achieved without any superscalar features exposed through the ISAs.
 
Yeah, the branch predictors almost never use the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
I know only Sandy Bridge had some way of recognizing repeating/looping code to get to the answer without redoing the work.

It's not just the branch prediction; the same number of instructions never has the same dependencies, and therefore the subset of them that can be executed in parallel always varies.
 
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are fairly simple and designed to utilize mostly idle clock cycles (which are plentiful), especially compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.


They are, and this is achieved without any superscalar features exposed through the ISAs.
Your SMT side-channel vulnerabilities argument is flawed; I use an AMD Zen 2 CPU, which I plan to upgrade to Zen 3. Do not apply Intel's SMT side-channel vulnerabilities to AMD Zen CPUs.

From https://www.zdnet.com/article/arm-cpus-impacted-by-rare-side-channel-attack/
Arm CPUs impacted by rare side-channel attack

Intel's side-channel issues are worse than AMD's.


AMD's Zen has two-way SMT.
 