
First Signs of AMD Zen 3 "Vermeer" CPUs Surface, Ryzen 7 5800X Tested

Joined
Apr 24, 2020
Messages
2,709 (1.62/day)
Of course not, you will never be able to do that, that's not what I meant.
I was thinking of branching logic inside a single scope, like a lot of ifs in a loop. Compilers already turn some of these into branchless alternatives, but I'm sure there is more potential here, especially if the ISA could express dependencies so the CPU could do things out of order more efficiently and hopefully some day limit the stalls in the CPU. As you know, with ever more superscalar CPUs, the relative cost of a cache miss or branch misprediction is growing.
Ideally code should be free of unnecessary branching, and there are a lot of clever tricks with and without AVX, which I believe we have discussed previously.
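To make the "lots of ifs in a loop" case concrete, here is a minimal sketch in C of the same loop written with and without a data-dependent branch. The function names are made up for illustration, and a good compiler may well emit identical branchless code for both versions, which is exactly the transformation mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one data-dependent branch per element.
   On unpredictable data, every misprediction flushes the pipeline. */
size_t count_below_branchy(const int32_t *v, size_t n, int32_t limit) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit) {
            count++;
        }
    }
    return count;
}

/* Branchless version: the comparison yields 0 or 1 and is added
   unconditionally, so there is nothing for the predictor to guess. */
size_t count_below_branchless(const int32_t *v, size_t n, int32_t limit) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(v[i] < limit);
    }
    return count;
}

/* Branchless select using a mask: returns a if cond is nonzero, else b.
   The conversion back to int32_t is implementation-defined in C for
   values above INT32_MAX, but wraps as expected on mainstream compilers. */
int32_t select_i32(int cond, int32_t a, int32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0); /* all ones or zero */
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}
```

Whether the branchless form is actually faster depends on how predictable the data is; with a well-predicted branch the branchy loop can be just as quick.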

Possibly, I think we have talked about this issue before.

Dependency management on today's CPUs and compilers is a well-solved problem: "xor rax, rax" cuts a dependency, allocates a new register from the reorder buffer, and starts a parallel calculation that takes advantage of superscalar CPUs. It's a dirty hack, but it works, and it works surprisingly well. I'm not convinced that a new machine-code format (with more explicit dependency matching) is needed for speed.

I think the main advantage to a potential "dependency-graph representation" would be power-consumption and core-size. Code with more explicit dependencies encoded could have smaller decoders that use less power, leveraging information that the compiler already calculated (instead of re-calculating it from scratch, so to speak).

Modern ROBs already hold 200+ entries, meaning the CPU can search ~200 instructions for instruction-level parallelism, and these reorder buffers are apparently only going to get bigger (300+ for Ice Lake).
 
Joined
Jan 27, 2015
Messages
1,715 (0.48/day)
System Name Legion
Processor i7-12700KF
Motherboard Asus Z690-Plus TUF Gaming WiFi D5
Cooling Arctic Liquid Freezer 2 240mm AIO
Memory PNY MAKO DDR5-6000 C36-36-36-76
Video Card(s) PowerColor Hellhound 6700 XT 12GB
Storage WD SN770 512GB m.2, Samsung 980 Pro m.2 2TB
Display(s) Acer K272HUL 1440p / 34" MSI MAG341CQ 3440x1440
Case Montech Air X
Power Supply Corsair CX750M
Mouse Logitech MX Anywhere 25
Keyboard Logitech MX Keys
Software Lots
What the architects are supposed to do and hence probably do, is measure what instructions are being run and how long they run for typical workloads. Then they optimize the instructions that are taking the longest amount of time.

The differences are probably in what a "typical workload" is. Intel is not slower in all non-gaming workloads, that's a false narrative.

As an example, look at the 10900K review on this site under "Science and Research" and you'll find it is a clean sweep for Intel. Look under "Office and Productivity" and it's the same clean sweep for Intel. Web Browsing, Development, and server/workstation are a mixed bag between Intel and AMD. Only under Rendering and Compression/Encryption does AMD take a clear win. If rendering and compression/encryption are your main form of productivity, AMD is your clear choice, but I would say that does not describe how the majority of people use their PCs.

 
Joined
Apr 24, 2020
Messages
2,709 (1.62/day)
What the architects are supposed to do and hence probably do, is measure what instructions are being run and how long they run for typical workloads. Then they optimize the instructions that are taking the longest amount of time.

Virtually all instructions these days have a throughput of 1x per clock (or better: 4x ADD per clock for AMD, 3x ADD per clock for Intel), and comparable latencies (1x clock latency for Adds, 5x clock latency for multiply). This is true for AMD, Intel, ARM, and more.

It's no longer a matter of "speeding up instructions", but instead a matter of "keeping the pipelines full, so that as many instructions can be executed in parallel as possible". It's been like this since 1995, when the Pentium Pro introduced out-of-order execution to the consumer (and several other computers had OoO execution in the decades before the Pentium Pro).
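A small sketch of what "keeping the pipelines full" means in practice, assuming the kind of numbers quoted above (multi-cycle multiply latency, one multiply issued per clock). The function names are hypothetical. The first loop is one long dependency chain, so each multiply waits for the previous one; the second splits the work into four independent chains that the out-of-order core can overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every multiply depends on the previous result,
   so the loop runs at multiply latency, not multiply throughput. */
uint64_t product_chain(const uint64_t *v, size_t n) {
    uint64_t acc = 1;
    for (size_t i = 0; i < n; i++) {
        acc *= v[i];
    }
    return acc;
}

/* Four accumulators: four independent dependency chains, so several
   multiplies can be in flight at once. Unsigned multiplication wraps
   modulo 2^64 and is associative, so the result is identical. */
uint64_t product_split(const uint64_t *v, size_t n) {
    uint64_t a0 = 1, a1 = 1, a2 = 1, a3 = 1;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 *= v[i];
        a1 *= v[i + 1];
        a2 *= v[i + 2];
        a3 *= v[i + 3];
    }
    for (; i < n; i++) { /* leftover elements */
        a0 *= v[i];
    }
    return (a0 * a1) * (a2 * a3);
}
```

No individual instruction got faster here; the second version simply gives the scheduler more independent work per cycle.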
 
Joined
Jun 10, 2014
Messages
2,986 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Couldn't AMD take chip dies and use the I/O die to modulate them, much like system memory does for double data rate or quadruple data rate, to speed up single-thread performance?
So you're saying increase memory bandwidth and potentially L3 bandwidth to increase single threaded performance?
You already have Threadripper and Epyc with more memory bandwidth, and you can see it only helps select workloads which are very data intensive, not single threaded performance in general, which is mostly limited by memory latency and pipeline flushes.

They'd each retain their own cache, so that itself is a perk of modulating between them in a synchronized way, controlled through the I/O die, to complete a single-thread task load. For all intents and purposes the CPU would behave as if it's a single faster chip.
To the extent that I can understand what you're saying, caches are already sort of doing this. Caches are divided into banks, and the CPU cycles through them to read/write, in order to maximize bandwidth. Memory works in a very similar way.

The cache misses definitely are harsh when they happen, but wouldn't automatically cycle-modulating the individual L1/L2/L3 caches in different chip dies through the I/O die get around that? Cycle between the ones available, basically. Perhaps they'd only do it with the larger L2/L3 caches, though; I mean, maybe it doesn't make enough practical sense with the L1 cache being so small, and switch times and such. Perhaps in a future design at some level or another, I don't know.
Cache misses occur because something is not in cache, and it stalls the thread until the required data is retrieved. CPUs currently use SMT to "work around" this, by executing something else during idle cycles, but it doesn't help single threaded performance. I don't understand how modulating caches could help either, so please elaborate.

Some things that could help, though (on the hardware side):
- Larger caches - for things that are pushed out, but not things that are unpredictable.
- Larger instruction window (OoO window) - hide some latency by executing some things earlier, but it can't solve everything.
- Reducing cache latency
- Redesigning the cache hierarchy
- Increasing the CPU's ability to execute some things during stalls - making them more partial stalls, beyond just increasing the OoO window. I'm curious to see what the research into "threadlets" will lead to.
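To illustrate why a larger OoO window hides some latency but can't solve everything, here is a rough sketch contrasting two access patterns. The function names and data layout are invented for the example. In the first loop each load address comes from the previous load, so a cache miss stalls the whole chain; in the second, all addresses are known up front, so several misses can be in flight at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Pointer chasing: the index of each load is the result of the previous
   load. No instruction window, however large, can start load k+1 before
   load k has returned, so every miss is paid in full. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    for (size_t s = 0; s < steps; s++) {
        i = next[i];
    }
    return i;
}

/* Independent loads: idx[] is known ahead of time, so the loads of
   data[idx[k]] do not depend on each other and misses can overlap. */
uint64_t sum_indexed(const uint32_t *data, const size_t *idx, size_t n) {
    uint64_t total = 0;
    for (size_t k = 0; k < n; k++) {
        total += data[idx[k]];
    }
    return total;
}
```

Linked lists and tree walks look like the first function, which is why they gain little from extra memory bandwidth and are mostly bound by latency.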
 
Joined
Jan 27, 2015
Messages
1,715 (0.48/day)
Virtually all instructions these days have a throughput of 1x per clock (or better: 4x ADD per clock for AMD, 3x ADD per clock for Intel), and comparable latencies (1x clock latency for Adds, 5x clock latency for multiply). This is true for AMD, Intel, ARM, and more.

It's no longer a matter of "speeding up instructions", but instead a matter of "keeping the pipelines full, so that as many instructions can be executed in parallel as possible". It's been like this since 1995, when the Pentium Pro introduced out-of-order execution to the consumer (and several other computers had OoO execution in the decades before the Pentium Pro).


That sounds nice and all, except that it is mostly false, which makes me wonder why you said it.

See appendix D:

 
Joined
Jun 10, 2014
Messages
2,986 (0.78/day)
There are certainly many instructions requiring multiple clock cycles. One of the few improvements on the execution side in Sunny Cove is precisely that, for integer mul/div. His underlying point about keeping pipelines full is still true, though.
 
Joined
Apr 24, 2020
Messages
2,709 (1.62/day)
That sounds nice and all, except that it is mostly false, which makes me wonder why you said it.

1601585174183.png


IMUL: 1x per clock. ADD/SUB is 3x to 4x per clock cycle. LEA is 2x per clock cycle throughput.

Look at the instructions that take a long time: none of them are expected to be used in the inner loop. "lock cmpxchg" is an inter-thread synchronization instruction, likely to be limited by L3 cache or DDR4 RAM. CPUID is never going to be called in the inner loop. CLFLUSH (cacheline flush) is also a memory-ordering / inter-thread synchronization thing, and I don't expect it to be in anyone's inner loop.

The only instruction that may need improvement is DIV / IDIV. I'll talk about that later.

---------

Let's check the 256-bit AVX instructions.

1601585268578.png


Oh wow, virtually everything on this page has a reciprocal throughput of 0.33, 0.5, or 1 (3x per clock, 2x per clock, and 1x per clock cycle). This includes the ever-important 256-bit VPMADD instruction (vectorized multiply-and-add instruction), with a throughput of either 2x per clock (0.5) or 1x per clock.

For every single performance-critical instruction, the pattern continues. Intel, AMD, and ARM have optimized these instructions to an incredible, and ridiculous degree.

----------

IDIV and DIV, which handle not only division but also modulo (%) operations, are maybe your only exception. But it seems like most performance-critical code avoids division and favors either multiplication by the inverse (with the inverse figured out by the compiler), the use of bitshifts, or in some rare cases, the use of floating-point division (which is much faster).


See how "divide by 31" becomes an imul instruction in most compilers (multiplication by the integer multiplicative inverse), avoiding the use of division. I'm constantly surprised at how good compilers are at removing spurious divisions or modulo operations from my code (and a good old subtract loop handles most "dynamic division" cases in my experience).
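For anyone who wants to see the trick without reading compiler output, here is a hand-written sketch of unsigned 32-bit division by 31 using a multiply and shifts. The function name is made up, and the constant is my own derivation (ceil(2^37 / 31) - 2^32 = 0x08421085), not copied from any particular compiler; real compilers may pick a different but equivalent sequence, and signed division needs an extra sign fix-up.

```c
#include <stdint.h>

/* Unsigned x / 31 without a divide instruction.
   The full multiplier M = ceil(2^37 / 31) = 2^32 + 0x08421085 needs
   33 bits, so the 2^32 part is applied separately as "+ x":
     t      = floor(x * 0x08421085 / 2^32)
     result = floor((x + t) / 32) = floor(x * M / 2^37)
   The (x - t) / 2 + t form computes (x + t) / 2 without overflowing
   32 bits; t <= x always holds, so x - t cannot wrap. */
uint32_t div31(uint32_t x) {
    uint32_t t = (uint32_t)(((uint64_t)x * 0x08421085u) >> 32);
    return (((x - t) >> 1) + t) >> 4;
}

/* Remainder follows from the quotient with one more multiply. */
uint32_t mod31(uint32_t x) {
    return x - div31(x) * 31u;
}
```

This only works because 31 is known at compile time; for a divisor that is only known at run time, the constant would have to be computed first, which is where the actual DIV instruction (or a subtract loop for small quotients) comes back in.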
 
Joined
Jun 10, 2014
Messages
2,986 (0.78/day)
Well, one of my computers just decided to retire…
I guess it saw this news article. :p
 