AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products

There is no such thing as single threaded software these days, practically everything is written to use multiple threads.

Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

So even though 3d rendering is multithreaded, a lot of the "bone scripts" that 3d modelers write in Python (or whatever scripting language your 3d program supports) are single-thread bound. They could be written multithreaded, but these are 3d artists who are writing a lot of this stuff, not necessarily expert programmers. Rigging, import/export scripts, game animations, etc. etc. A lot of these things end up on a single thread.
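To make that concrete, here's a rough Python sketch of what a typical artist-written rig pass tends to look like. The names are made up for illustration (this is not any real 3d package's API), but the shape is the same: one big loop on the main thread, so only single-thread speed makes it faster.

Code:
# Hypothetical sketch (made-up names, not any real 3d package's API):
# a typical artist-written rig pass is one big loop on the main thread,
# so only single-thread performance speeds it up.
import math

def solve_bone(bone_index):
    # stand-in for per-bone IK / constraint math
    acc = 0.0
    for i in range(50_000):
        acc += math.sin(bone_index + i * 0.001)
    return acc

def rig_pass(bone_count=100):
    # runs entirely on the main thread, like most artist-written tools;
    # it could be spread over cores with multiprocessing.Pool, but in
    # practice that rarely happens.
    return [solve_bone(b) for b in range(bone_count)]

if __name__ == "__main__":
    rig_pass()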

People want both: single thread and multithreaded power.
 
That article did help me understand the actual difference you're trying to convey, thanks.
However, I don't think the industry is doing it wrong. I agree that single threaded benchmarks cannot fully utilize the cores from AMD and Intel, and are thus not a completely accurate way of showing per-core performance, while the M1 doesn't suffer from the same problem. Single threaded software doesn't care about this though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.
Whoever wrote that article doesn't know much about how CPUs work, so let's dissect it:

extremetech article said:
SMT-enabled CPUs are able to schedule work from more than one thread for execution in the same clock cycle.
SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
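As a toy illustration of that "fill the stall cycles" idea (purely a made-up model, nothing like how real hardware actually schedules), consider:

Code:
# Toy model only: two threads share one "core"; whenever the active thread
# stalls on a pretend cache miss, the other thread gets the idle cycles.
def run_smt(threads, cycles):
    # threads: list of (miss_every, miss_penalty) tuples
    done = [0] * len(threads)           # instructions completed per thread
    stalled_until = [0] * len(threads)  # cycle at which each thread is ready again
    for cycle in range(cycles):
        for t, (miss_every, miss_penalty) in enumerate(threads):
            if cycle >= stalled_until[t]:
                done[t] += 1
                if done[t] % miss_every == 0:
                    stalled_until[t] = cycle + miss_penalty  # pretend cache miss
                break  # only one thread issues per cycle in this toy model
    return done

# A single thread wastes its stall cycles; two threads soak up each other's.
print(run_smt([(10, 20)], cycles=10_000))
print(run_smt([(10, 20), (10, 20)], cycles=10_000))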

extremetech article said:
Modern x86 CPUs from AMD and Intel take advantage of SMT to improve performance by an average of 20-30 percent at a fraction of the cost or power that would be required to build an entire second core.
<snip>
Apple’s 8-wide M1 doesn’t have this problem. The front-end of a RISC CPU allows generally higher efficiency in terms of instructions decoded per single thread.
This is completely untrue.
There isn't a decoding bottleneck on x86 designs, and if there were, adding SMT would only make it worse, as this makes two threads share the same front-end and cache.
RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.

extremetech article said:
An x86 CPU achieves much higher overall efficiency when you run two threads through a single core, partly because they’ve been explicitly designed and optimized for it, and partly because SMT helps CPUs with decoupled CISC front-ends achieve higher IPC overall.
Nonsense.
SMT doesn't improve IPC.
 
SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
SMT doesn't improve IPC.
SMT isn't done the same way by everyone; AMD, and I think Intel, have advanced it beyond such simplicity.
"
AMD Zen microarchitecture has 2-way SMT.

VISC architecture[11][12][13] uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End which splits instructions into virtual hardware threadlets which are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic on a near-single cycle latency level (1–4 cycles depending on the change in allocation), depending on individual application needs. Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where."

Taken from the wiki.
 
There isn't a decoding bottleneck on x86 designs

Laughs in Bulldozer.

But you're right. I'm just poking fun at your phrasing. Seriously though: Bulldozer had exactly the problem you're describing, so it's a good real-world example of what you're talking about.
 
Pretty much all GUI-applications run their scripts / etc. etc. on the main thread though.

There is no application which only has a UI. Of course a lot of subcomponents are single threaded, but even the simplest of applications use more threads.
 
There is no application which only has a UI. Of course a lot of subcomponents are single threaded, but even the simplest of applications use more threads.

In practice, most code is multithreaded.

But in practice, most code is single-thread bound, due to Amdahl's law. Which means the code gets faster when you get +single-thread performance. +Multithread performance gains are minimized due to the nature of Amdahl's law.

There are exceptions: 3d modeling renders are closer to Gustafson's law. That is: people aren't primarily interested in rendering times per se. A 3d render is "set" at 8 hours or ~72 hours per frame (in the case of Marvel / Pixar movies), which is the largest practical time for their workflow. What 3d modelers want is a better image at the end of those 72 hours, which follows Gustafson's law (you can do more work / more detailed modeling in the same timeframe).
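For anyone who wants to see the difference in numbers, here's a quick Python sketch of the two textbook formulas (nothing workload-specific, just the math):

Code:
# Quick sketch of the two laws (textbook formulas, nothing workload-specific).
def amdahl_speedup(parallel_fraction, cores):
    # total speedup is capped by the serial part, no matter how many cores
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def gustafson_speedup(parallel_fraction, cores):
    # scaled speedup: keep the runtime fixed and grow the problem instead
    return (1.0 - parallel_fraction) + parallel_fraction * cores

# A render that is 95% parallel: Amdahl caps the win, Gustafson keeps scaling.
for cores in (4, 16, 64):
    print(cores, round(amdahl_speedup(0.95, cores), 1),
          round(gustafson_speedup(0.95, cores), 1))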

Video games are often multithread-programmed but single-thread bound on the physics thread. AI, sound, even graphical effects can all complete nearly immediately. But the physics simulation (collision detection, bullet detection, object-per-object updates) takes the most time, and is often written as a single thread for maximum consistency. (It is hard to make every client in a multiplayer game update its physics identically unless they all do it on a single thread, in a well-defined order, with well-defined floating-point rounding.)
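The floating-point part is easy to demonstrate: addition isn't associative, so two clients summing the same forces in a different order can slowly drift apart.

Code:
# Floating-point addition is not associative, so summing the same forces in a
# different order on two clients gives slightly different physics state.
forces = [0.1, 0.2, 0.3]
left_to_right = (forces[0] + forces[1]) + forces[2]
right_to_left = forces[0] + (forces[1] + forces[2])
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6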

------

Same game at higher FPS: Amdahl's law and single-thread bound.

Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.

Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.

RISC is if anything more prone to pipeline stalls, which is actually the reason for Power having 4-way and 8-way SMT.
POWER9 has pipeline stalls because it was designed with a laughable 2-cycle latency on XOR and add instructions.

The other "RISC" processors (and I hate that word...): ARM / RISC-V, do not suffer from this behavior. I think IBM intended for POWER9 to be a 4-way or 8-way SMT from the start. When you consider that most business class code (ie: databases) are sitting around waiting for cache-stalls, it makes sense to go higher SMT and higher-latency cores.
 
Same game at higher FPS: Amdahl's law and single-thread bound.
Different game with more effects at the same FPS: Gustafson's law, probably can take advantage of multicore more.
Two different programming styles, two different results. It depends on the game engine, the game programming team and their philosophy with regards to high-performance programming.
(Real time) games are very latency sensitive workloads, which makes them extra challenging to scale across multiple threads.
The game simulation ("game loop" / "game physics" (not effects simulation)) itself should run at a constant tick rate, and should be decoupled from the rendering. Quite often, this simulation is only parallel on a small scale, but sequential on the larger scale, which makes splitting it over multiple threads difficult without running into synchronization issues. If you have ever seen games where the physics go mad and elements accelerate like crazy, it's probably because of timing issues causing incorrect calculations.
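For reference, the usual way that decoupling is done is a fixed-timestep loop. Below is a minimal Python sketch with made-up update/render stand-ins (not any particular engine): the simulation always advances in constant ticks, no matter how fast or slow frames come out.

Code:
# Minimal fixed-tick game loop sketch (illustrative names, no real engine):
# the simulation advances in constant steps regardless of frame rate,
# which is the decoupling described above.
import time

TICK = 1.0 / 60.0  # constant simulation tick rate

def update(state, dt):
    state["t"] += dt          # stand-in for the real game/physics update
    return state

def render(state):
    pass                      # stand-in for drawing a frame

def game_loop(run_seconds=1.0):
    state = {"t": 0.0}
    accumulator = 0.0
    previous = time.perf_counter()
    end = previous + run_seconds
    while time.perf_counter() < end:
        now = time.perf_counter()
        accumulator += now - previous
        previous = now
        while accumulator >= TICK:   # catch up in fixed steps, never variable ones
            state = update(state, TICK)
            accumulator -= TICK
        render(state)                # rendering runs as fast as it can

game_loop()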

Game rendering itself can use multiple threads, but not like how most people imagine it. Independent render passes can easily be split into separate queues, and particle simulation and asset loading too. But splitting up a single render pass among several threads will in all normal situations cause significant overhead. If your rendering thread is somehow CPU bound, I'm willing to bet that overhead has to do with your way of coding and little to do with lack of multithreading.

So there are limits to how far workloads can be parallelized, no matter how well they are done. It all depends on synchronization and dependencies. This is where Amdahl's law comes in, but since there are many levels of parallelization, the principle has to be applied on multiple levels, like when to use multithreading, SIMD, GPUs, etc…

I think IBM intended POWER9 to be 4-way or 8-way SMT from the start. When you consider that most business-class code (i.e. databases) is sitting around waiting on cache stalls, it makes sense to go with higher SMT and higher-latency cores.
Yes, it seems like Power is dominated by database and Java workloads, which typically are stalled >90% of the time. I don't know if they designed the CPU with these workloads in mind, or if the workloads found the CPU, though.
 
The industry is doing it wrong, or completely missing the nuance. You cannot compare single-threaded performance directly because of their architectural differences. In any case it will be unfair to one or the other, but in the race to get those clicks, the truth gets thrown by the wayside.


SMT in x86 CPUs switches between two threads rather than executing them simultaneously.
The point of SMT is to utilize clock cycles that would otherwise sit idle, mostly due to cache misses, and let another thread use them.
SMT doesn't improve IPC.
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.
 
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.

Au contraire, Apple M1 has 3 load + 2 store units per core and no SMT at all.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick. CPUs are simply following suit with today's compilers (and vice versa: compilers are compiling code to automatically take advantage of the large number of load/store units on modern CPUs).
 
IPC fluctuates according to architecture; in fact, it even fluctuates within the same architecture. A processor never has a constant IPC; that's quite literally impossible.

You can come up with an "average IPC" but that wouldn't mean much either.

Yeah, the branch predictors almost never use the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
I know only Sandy Bridge had some way of recognizing repeating/looping code to get to the answer without redoing the work.
 
Your SMT definition is from the Pentium 4 era. A modern x86 core is wide enough to support two concurrent threads, which results in a performance increase.

A Ryzen Zen 3 core has 3 load and 2 store units.
A Ryzen Zen 2 core has 2 load and 1 store unit.
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are fairly simple and designed to utilize mostly idle clock cycles (which are plentiful), especially compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.

Modern code, even on x86 and ARM, is designed to run multiple loads/stores per clock tick.
They are, and this is achieved without any superscalar features exposed through the ISAs.
 
Yeah, the branch predictors almost never use the same predictions. It's sort of crazy, because one of those predictions may have been the fastest. How are you going to stop the predictor from doing that?
I know only Sandy Bridge had some way of recognizing repeating/looping code to get to the answer without redoing the work.

It's not just the branch prediction; the same number of instructions never has the same dependencies, and therefore the subset of them that can be executed in parallel always varies.
 
It's not a question of having enough execution ports, it's a question of whether the complexity needed to execute two threads within the same clock is worth it. The x86 implementations of SMT are fairly simple and designed to utilize mostly idle clock cycles (which are plentiful), especially compared to e.g. Power CPUs.

With the kind of CPU bugs we've seen exposed over the past few years, SMT seems like less and less of a good idea. It adds a lot of complexity and needs more and more safeguards to prevent timing attacks, data leakage etc. We will soon be approaching a point where these transistors are better spent in other ways.


They are, and this is achieved without any superscalar features exposed through the ISAs.
Your SMT side-channel vulnerabilities argument is flawed; I use an AMD Zen 2 CPU, which I plan to upgrade to Zen 3. Do not apply Intel's SMT side-channel vulnerabilities to AMD Zen CPUs.

From https://www.zdnet.com/article/arm-cpus-impacted-by-rare-side-channel-attack/
Arm CPUs impacted by rare side-channel attack

Intel's side-channel issues are worse than AMD's.


AMD's Zen has two-way SMT.
 