
I have a question about caches in CPU cores.

Joined
Nov 27, 2010
Messages
924 (0.18/day)
System Name future xeon II
Processor DUAL SOCKET xeon e5 2686 v3 , 36c/72t, hacked all cores @3.5ghz, TDP limit hacked
Motherboard asrock rack ep2c612 ws
Cooling case fans,liquid corsair h100iv2 x2
Memory 96 gb ddr4 2133mhz gskill+corsair
Video Card(s) 2x 1080 sc acx3 SLI, @STOCK
Storage Hp ex950 2tb nvme+ adata xpg sx8200 pro 1tb nvme+ sata ssd's+ spinners
Display(s) philips 40" bdm4065uc 4k @60
Case silverstone temjin tj07-b
Audio Device(s) sb Z
Power Supply corsair hx1200i
Mouse corsair m95 16 buttons
Keyboard microsoft internet keyboard pro
Software windows 10 x64 1903 ,enterprise
Benchmark Scores fire strike ultra- 10k time spy- 15k cpu z- 400/15000
True, although gaming is gradually becoming multi-threaded, my server chips are on fire in gaming: any game, any refresh rate.
 
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
Well, servers are big and slow, but do a lot of work. That's why you see 32 core EPYC (server) chips. By contrast, desktops have smaller, faster cores. It's like comparing a Mack truck to a Ferrari. You don't use a fleet of Ferraris to haul cargo (like lots of web traffic), and you don't take an 18 wheeler to a race track (like running your favorite game at 165hz).
This is a good way of putting it :) but honestly, if we all could, we would all have the 18 wheeler doing Ferrari speed and efficiency haha. But yeah...

Part of me does kinda want a 9980XE, waterchiller, and 4.6+ all core OC for this very fact :3

True, although gaming is gradually becoming multi-threaded, my server chips are on fire in gaming: any game, any refresh rate.
this too^.^ i hope to see the 2600X's of the world maybe giving the 9600k's a run for their money one day. But probably not :(
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
And it shows the 7700K much faster; faster, I think, than the clock speed increase alone would allow, honestly. The 8700K seems to do much better. What is the all-core boost for the 7800X? The 8700K is 4.3 afaik and the 7700K is 4.4. Not sure about the 7800X.
According to this it's 4.0 GHz, but I haven't verified.
i7-7700K is 4.5 GHz (1 core), 4.4 GHz (2-4 cores).

Zen has 512 KB of L2 though. I wonder why Skylake client can get away with 50% of the L2 cache? A more efficient prefetcher? Well, actually, looking at Skylake server with 1 MB of L2 per core, I'm not sure it makes a huge difference for gaming.

Cache is expensive, in every sense of the word. It's expensive to produce, sucks down power and kicks out a lot of heat. You don't want more cache than you need.<snip>
Why Zen has more cache than Skylake, I'm not sure. Maybe there's more "stuff" in the Zen cores than Skylake cores, which warrants having more cache?
To both;
It's easy to become blind on specs. Heck, even the old 80486 supported at least 512 kB of L2 cache (off-die). L1 and L2 are closely tied to the microarchitecture, which is probably why Intel and AMD tweak the config more or less every generation. Heat is not the primary concern, but the size on the die certainly is, since it needs to be connected in the ideal spot. Moving it slightly might cause higher latency, and with higher clock speeds this is more sensitive than ever.

So back to the subject you both were mentioning: why is Skylake-S more efficient with half the L2 cache of Zen? It comes down to how the cache is used. The front-end/prefetcher operates on an instruction window, does OoOE, predicts branches, etc. While Skylake has a slightly larger instruction window than Zen (224 vs. 192), Zen has other advantages like a larger micro-op cache (2048 vs. 1536), more L2 cache and more execution ports. So on paper Zen looks fairly strong, but that still doesn't answer the question.

When it comes to prefetching, you might think more is better, right? Wrong. Each cache line you write to L2 kicks something else out, so if you cache "useless" stuff, it might evict more useful stuff, and you'll end up hurting performance. So the most important thing of all is the prediction algorithm, and while it may not be visible in the tech specs, it's more important than the size of the L2 etc. My impression is that AMD's approach here might be a bit more "brute force" than Intel's. This is one of the reasons why I keep saying good benchmarks are what matters, not buying CPUs or GPUs based on a "limited" understanding of what the specs even mean.
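The eviction effect can be sketched with a toy fully-associative LRU cache. Everything here (the capacity, the addresses, the "over-eager prefetcher") is a made-up illustration, not a model of any real L2:

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully-associative LRU cache -- illustrative only."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)     # mark most recently used
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

# Hot working set of 8 lines, accessed repeatedly.
hot = list(range(8))

# No pollution: the hot set fits, so only the first pass misses.
clean = LRUCache(capacity=8)
for _ in range(10):
    for a in hot:
        clean.access(a)

# Same cache, but "prefetched" junk lines between passes evict the hot set.
polluted = LRUCache(capacity=8)
for i in range(10):
    for a in hot:
        polluted.access(a)
    for junk in range(100 + i * 8, 108 + i * 8):
        polluted.access(junk)

print(clean.misses, polluted.misses)
```

With pollution, every hot access misses again on the next pass, so a "bigger appetite" for prefetching actively hurts.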

Well, servers are big and slow, but do a lot of work. That's why you see 32 core EPYC (server) chips. By contrast, desktops have smaller, faster cores. It's like comparing a Mack truck to a Ferrari. You don't use a fleet of Ferraris to haul cargo (like lots of web traffic), and you don't take an 18 wheeler to a race track (like running your favorite game at 165hz).
It also comes down to types of workloads.
Many server workloads are typically async, which means they will scale nearly perfectly across any core count, so all that really matters then is the balance between performance and total efficiency.
Synchronized multithreaded workloads on the other hand tend to get diminishing returns with increasing core count, so balancing core count and core speed is more important for typical end-user workloads.
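The scaling difference between the two workload types can be sketched with Amdahl's law; the parallel fractions below are illustrative numbers, not measurements:

```python
def speedup(cores, parallel_fraction):
    """Amdahl's law: the serial fraction limits scaling."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Near-async server workload (~99% parallel) vs a synchronized
# end-user workload (~80% parallel) -- assumed fractions.
for cores in (4, 8, 32):
    print(cores, round(speedup(cores, 0.99), 1), round(speedup(cores, 0.80), 1))
```

At 32 cores the async-style workload still gets over a 24x speedup, while the synchronized one stalls below 4.5x, which is why core count pays off so much more on servers.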
 
Last edited:
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
According to this it's 4.0 GHz, but I haven't verified.
i7-7700K is 4.5 GHz (1 core), 4.4 GHz (2-4 cores).




To both;
It's easy to become blind on specs. Heck, even the old 80486 supported at least 512 kB of L2 cache (off-die). L1 and L2 are closely tied to the microarchitecture, which is probably why Intel and AMD tweak the config more or less every generation. Heat is not the primary concern, but the size on the die certainly is, since it needs to be connected in the ideal spot. Moving it slightly might cause higher latency, and with higher clock speeds this is more sensitive than ever.

So back to the subject you both were mentioning: why is Skylake-S more efficient with half the L2 cache of Zen? It comes down to how the cache is used. The front-end/prefetcher operates on an instruction window, does OoOE, predicts branches, etc. While Skylake has a slightly larger instruction window than Zen (224 vs. 192), Zen has other advantages like a larger micro-op cache (2048 vs. 1536), more L2 cache and more execution ports. So on paper Zen looks fairly strong, but that still doesn't answer the question.

When it comes to prefetching, you might think more is better, right? Wrong. Each cache line you write to L2 kicks something else out, so if you cache "useless" stuff, it might evict more useful stuff, and you'll end up hurting performance. So the most important thing of all is the prediction algorithm, and while it may not be visible in the tech specs, it's more important than the size of the L2 etc. This is one of the reasons why I keep saying good benchmarks are what matters, not buying CPUs or GPUs based on a "limited" understanding of what the specs even mean.


It also comes down to types of workloads.
Many server workloads are typically async, which means they will scale nearly perfectly across any core count, so all that really matters then is the balance between performance and total efficiency.
Synchronized multithreaded workloads on the other hand tend to get diminishing returns with increasing core count, so balancing core count and core speed is more important for typical end-user workloads.
I'm loving reading your replies. I know I said thank you before, but I'll say it again, because right now I'm like a kid in a candy store with this, haha. About Zen: I heard Skylake has a much better branch predictor (also, what exactly does the BP do, and does it also rely on the cache?), so Zen may be wider, but Skylake is currently smarter? Also on this subject: from Fritzchens Fritz's Flickr I grabbed the Zen+ die shot, and using something I read I cropped a core and labeled some of the bits. I also tried to highlight the execution engine logic, I mean the bits that do the number calculations: ALU, FPU, etc. I may be wrong here, but I notice the amount of die space in the CPU core dedicated to actually working with the numbers is much, much smaller than the bits dedicated to making it work with the numbers effectively. What do you think?

ZenPlusCoreAnnotated2.jpg


Please correct me if I'm wrong, thanks ^^

I hear Zen 1 is front-end limited, in that the EUs are not being fully realized in performance due to scheduling and the like. And that Zen 2 brings a completely redesigned front end. Maybe this will make Zen finally smarter than Skylake. :)

Also, if the 7800X is 4 GHz all-core, then yeah, being 10% lower is going to explain some of the performance delta. But in some games it is a whopping 30-50% slower. Just seems weird to me when the 8700K has no such penalty. It must be something to do with the SKL-X architecture.

Zenpluscore.jpg

All credit for these AWESOME die shots goes to Fritzchens Fritz, here
 
Last edited:
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
… so Zen may be wider, but Skylake is currently smarter?
For non-AVX workloads, yes!
Skylake has 4 shared execution ports hooked up to ALUs, FPUs etc.
Zen (to my understanding) has 4 for INT and 2 for floats; the two for floats are fused into one when doing AVX, but work as two separate 128-bit FPUs otherwise.

Also on this subject: from Fritzchens Fritz's Flickr I grabbed the Zen+ die shot, and using something I read I cropped a core and labeled some of the bits. I also tried to highlight the execution engine logic, I mean the bits that do the number calculations: ALU, FPU, etc. I may be wrong here, but I notice the amount of die space in the CPU core dedicated to actually working with the numbers is much, much smaller than the bits dedicated to making it work with the numbers effectively. What do you think?
Yes, your observation is correct.
The infrastructure to feed the execution engine is much bigger than the execution engine itself.
And if you removed the AVX part of the execution engine, it would be much reduced. Single FPUs, and especially single ALUs, are surprisingly tiny.

And that Zen2 brings a completely redesigned front end. maybe this will make Zen finally smarter than skylake. :)
:)
I've said it many times, the key to matching or exceeding Intel's performance lies in the front-end.
AMD still has a sizable gap to close, but my guess is that Zen 2 will be much closer to Skylake.
Since I don't have any inside information, and the final products don't exist yet, I will refrain from making precise guesses about performance, since I know it will be pointless. The truth will be revealed in benchmarks.

But that being said, I wouldn't be too surprised if Zen 2 is faster than Skylake in some workloads, but still slower in others. I'm talking per core, of course.

Also, if the 7800X is 4 GHz all-core, then yeah, being 10% lower is going to explain some of the performance delta. But in some games it is a whopping 30-50% slower. Just seems weird to me when the 8700K has no such penalty. It must be something to do with the SKL-X architecture.
Could be, but it's hard to conclude without more details.
BTW; are you sure it was a proper comparison? Some motherboards auto-overclock etc.
 
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
According to WikiChip, Zen has four FP pipes, 4x 128-bit: two for multiplication and two for addition. So the overall vector width is the same as Skylake client (512-bit), but with higher granularity. I am not 100% sure of this, but it seems a pretty reliable source. This also means, surely, Zen has a better FPU for SMT due to the granularity.

According to AMD, Zen 2 is 4x float throughput per socket vs Zen 1: Rome vs Naples, so that's 64 vs 32 cores. 2x of that is from the doubling of cores and 2x from the doubling of the FPU, right?

So my theory is Zen 2 has 4x 256-bit FPU pipes; this potentially means Zen 2 can do AVX-512 ops the same way Zen 1 can do AVX2.
:D
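The arithmetic behind that "4x per socket" claim can be sanity-checked; the figures below are the ones assumed in the post (core counts and per-pipe widths), not confirmed specs:

```python
# Hedged arithmetic behind the "4x floats per socket" claim (assumed figures):
naples_cores, rome_cores = 32, 64
zen1_fp_width, zen2_fp_width = 128, 256   # assumed per-pipe width, in bits

core_factor = rome_cores // naples_cores          # 2x from doubled core count
width_factor = zen2_fp_width // zen1_fp_width     # 2x from wider FP pipes
print(core_factor * width_factor)                 # 4x total
```

So a 2x core count times a 2x FPU width is enough to reach 4x per socket, without needing four 256-bit pipes per core.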

Edit: fixed my typo of 1024-bit. Apparently I can't add XD
 
Last edited:
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
According to WikiChip, Zen has four FP pipes, 4x 128-bit: two for multiplication and two for addition. So the overall vector width is the same as Skylake client (512-bit), but with higher granularity. I am not 100% sure of this, but it seems a pretty reliable source. This also means, surely, Zen has a better FPU for SMT due to the granularity.

According to AMD, Zen 2 is 4x float throughput per socket vs Zen 1: Rome vs Naples, so that's 64 vs 32 cores. 2x of that is from the doubling of cores and 2x from the doubling of the FPU, right?

So my theory is Zen 2 has 4x 256-bit FPU pipes; this potentially means Zen 2 can do AVX-512 ops the same way Zen 1 can do AVX2.
I was counting complete sets of ADD + MUL, but sure, by your terms Zen technically has "four" 128-bit units.

Zen 2 will have 2 complete 256-bit sets. I haven't seen any info so far if they can be fused to one 512-bit unit or not.
 
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
I was counting complete sets of ADD + MUL, but sure, by your terms Zen technically has "four" 128-bit units.

Zen 2 will have 2 complete 256-bit sets. I haven't seen any info so far if they can be fused to one 512-bit unit or not.
Ah right, sorry, I misunderstood. Btw, is the separation of the MUL and ADD units better than combined ones? I mean, surely AMD had a reason for it. I always assumed this approach is better for SMT when running float-heavy code, as with the 4 pipes it can be shared between the 2 threads in the core better?
 
Joined
Mar 23, 2016
Messages
4,844 (1.52/day)
Processor Core i7-13700
Motherboard MSI Z790 Gaming Plus WiFi
Cooling Cooler Master RGB something
Memory Corsair DDR5-6000 small OC to 6200
Video Card(s) XFX Speedster SWFT309 AMD Radeon RX 6700 XT CORE Gaming
Storage 970 EVO NVMe M.2 500GB,,WD850N 2TB
Display(s) Samsung 28” 4K monitor
Case Phantek Eclipse P400S
Audio Device(s) EVGA NU Audio
Power Supply EVGA 850 BQ
Mouse Logitech G502 Hero
Keyboard Logitech G G413 Silver
Software Windows 11 Professional v23H2
I hear Zen 1 is front-end limited, in that the EUs are not being fully realized in performance due to scheduling and the like. And that Zen 2 brings a completely redesigned front end. Maybe this will make Zen finally smarter than Skylake. :)
It was covered here on TPU back in November:
The front-end of "Zen" and "Zen+" cores are believed to be refinements of previous-generation architectures such as "Excavator." Zen 2 gets a brand-new front-end that's better optimized to distribute and collect workloads between the various on-die components of the core.
https://www.techpowerup.com/249450/amd-zen-2-ipc-29-percent-higher-than-zen
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Ah right, sorry, I misunderstood. Btw, is the separation of the MUL and ADD units better than combined ones? I mean, surely AMD had a reason for it. I always assumed this approach is better for SMT when running float-heavy code, as with the 4 pipes it can be shared between the 2 threads in the core better?
It depends on how many execution ports are hooked up to them.
Some AVX instructions, like FMA, require both MUL and ADD at the same time, either on the same execution port or on multiple ports fused.

I wouldn't mix threads (SMT) into this; the threads are not executing at the same time, the core is switching between them.
 
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
It depends on how many execution ports are hooked up to them.
Some AVX instructions, like FMA, require both MUL and ADD at the same time, either on the same execution port or on multiple ports fused.

I wouldn't mix threads (SMT) into this; the threads are not executing at the same time, the core is switching between them.
I thought SMT uses thread-level parallelism to allow resources in the core to be used concurrently by two threads to increase utilisation? :s Surely at some point, for example, an integer operation from thread 1 is being executed at the same time as, say, a float operation from thread 2? This is where my understanding really is quite low, getting into things like instruction-level parallelism and such.

IDK, but surely these threads can use the FP engine concurrently some of the time, if for example thread 1 needs only a 128-bit ADD and thread 2 needs a 128-bit MUL?

Edit: sorry if I'm being dumb, I'm learning all the time ^^

edit 2: fixed typo
 
Last edited:
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
I thought SMT uses thread-level parallelism to allow resources in the core to be used concurrently by two threads to increase utilisation?
It's fine to ask. It depends on the implementation. To my understanding, Intel's implementation is mostly about utilizing idle cycles (caused by cache misses, flushes from branch mispredictions, etc.).

I forgot to answer this one:
also what exactly does BP do, and does it also rely on the cache?
I don't know the exact details of the algorithm, but I have a general understanding of how it works, based on reading documentation from Intel. The branch predictor basically keeps a list of recent conditionals and some kind of statistics on how often they are true or false. This list is not stored in L1; there is a separate specialized bank of memory for it.

Let's say the CPU iterates a loop, and at a specific address it runs into a conditional. Every time it meets the same conditional from the list, it uses the statistics to guess true or false, and every time it's done executing, it feeds the updated statistics back.

It's worth mentioning that this list is not large, and probably just contains the most recent conditionals. So it's not like it contains everything in a program, and it's not stored to benefit the next time you run the program; it's more like a "short-term memory" of the last few sections of code.
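That "guess from statistics, then feed the outcome back" loop can be sketched with the classic textbook 2-bit saturating counter scheme. This is a common teaching model, not Intel's or AMD's actual algorithm, and the branch address below is made up:

```python
class TwoBitPredictor:
    """Toy 2-bit saturating counter per branch address.

    Counter 0..3: >=2 predicts "taken". Real predictors are far more
    sophisticated (global history, multiple tables, etc.)."""
    def __init__(self):
        self.table = {}  # branch address -> counter state

    def predict(self, addr):
        return self.table.get(addr, 2) >= 2  # default: weakly taken

    def update(self, addr, taken):
        c = self.table.get(addr, 2)
        self.table[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch: taken 99 times, then not taken once at loop exit.
bp = TwoBitPredictor()
correct = 0
outcomes = [True] * 99 + [False]
for taken in outcomes:
    if bp.predict(0x401000) == taken:
        correct += 1
    bp.update(0x401000, taken)  # feed the real outcome back
print(correct, "/", len(outcomes))
```

For loop-like branches the statistics converge almost immediately, which is exactly the "repeated conditionals" case described above; only the final exit is mispredicted.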
 
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
Cache size versus efficiency is an issue. That is why Intel had less L2 relative to L3, for efficiency; recently they too opted for faster lower-level caches.
There was a reference I cannot recall right now, but if you can allocate more data in a higher-level cache, you save on fetches, which ends up making Ryzen quite a bit more efficient with its slower but larger and more efficient SRAM cells.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,171 (2.80/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
It's fine to ask. It depends on the implementation. To my understanding, Intel's implementation is mostly about utilizing idle cycles (caused by cache misses, flushes from branch mispredictions, etc.).

I forgot to answer this one:

I don't know the exact details of the algorithm, but I have a general understanding of how it works, based on reading documentation from Intel. The branch predictor basically keeps a list of recent conditionals and some kind of statistics on how often they are true or false. This list is not stored in L1; there is a separate specialized bank of memory for it.

Let's say the CPU iterates a loop, and at a specific address it runs into a conditional. Every time it meets the same conditional from the list, it uses the statistics to guess true or false, and every time it's done executing, it feeds the updated statistics back.

It's worth mentioning that this list is not large, and probably just contains the most recent conditionals. So it's not like it contains everything in a program, and it's not stored to benefit the next time you run the program; it's more like a "short-term memory" of the last few sections of code.
Just to add a little background on this, I want to answer the question that wasn't asked: why do we need branch prediction (not just how it works)? Branch prediction is important because otherwise the pipeline in a superscalar CPU would stall every time a conditional was encountered, because you don't know what the next instruction will be if the conditional hasn't been evaluated yet. This is the same reason why speculative execution exists, but it takes a different approach to keeping the pipeline filled, by executing both code paths (as much as possible) in order to mitigate the impact of a stall should the branch be predicted incorrectly. Both of these techniques are designed to prevent or mitigate pipeline stalls, which are more costly the longer the pipeline is.

Edit: The thing about branch prediction also is that the data that determines the condition may have already been calculated, so that nothing still in the pipeline is required to determine the branch; in this case, the CPU can accurately say, "I already know what the result of this is going to be, even though the instruction hasn't executed yet." This is far harder to solve when the last instruction alters the data used for the condition.
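The "more costly the longer the pipeline is" point is easy to put in numbers with a back-of-envelope cost model; the pipeline depth and accuracies below are illustrative, not measured figures for any real CPU:

```python
def avg_penalty_per_branch(accuracy, flush_penalty):
    """Expected extra cycles per branch: correct predictions cost ~0,
    mispredictions flush the pipeline (illustrative model only)."""
    return (1.0 - accuracy) * flush_penalty

# e.g. an assumed ~15-stage pipeline, at 95% vs 99% prediction accuracy:
print(round(avg_penalty_per_branch(0.95, 15), 2))  # 0.75
print(round(avg_penalty_per_branch(0.99, 15), 2))  # 0.15
```

With branches making up a large share of typical code, shaving accuracy from 95% to 99% cuts the average branch cost by 5x, which is why predictor quality matters more than most headline specs.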
 
Last edited:
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
Just to add a little background on this, I want to answer the question that wasn't asked: why do we need branch prediction (not just how it works)? Branch prediction is important because otherwise the pipeline in a superscalar CPU would stall every time a conditional was encountered, because you don't know what the next instruction will be if the conditional hasn't been evaluated yet. This is the same reason why speculative execution exists, but it takes a different approach to keeping the pipeline filled, by executing both code paths (as much as possible) in order to mitigate the impact of a stall should the branch be predicted incorrectly. Both of these techniques are designed to prevent or mitigate pipeline stalls, which are more costly the longer the pipeline is.

Edit: The thing about branch prediction also is that the data that determines the condition may have already been calculated, so that nothing still in the pipeline is required to determine the branch; in this case, the CPU can accurately say, "I already know what the result of this is going to be, even though the instruction hasn't executed yet." This is far harder to solve when the last instruction alters the data used for the condition.
What I've noticed is that the opposite is the case with GPUs. That makes me believe the caches far outweigh the execution resources in the power budget of CPUs. I still wish I understood GPU caching and pipeline scalarization in general.
 
Last edited:
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Branch prediction is important because otherwise the pipeline in a superscalar CPU would stall every time a conditional was encountered because you don't know what the next instruction will be if the conditional hasn't been evaluated yet. This is the same reason why speculative execution exists, but takes a different approach to keep the pipeline filled, by executing both code paths (as much as possible,) in order to mitigate the impact of a stall should the branch be predicted incorrectly. Both of these techniques are designed to prevent or mitigate pipeline stalls which are more costly the longer the pipeline is.
You are mixing the terms a bit here. Both predictive execution and "eager execution" (executing both branches of a conditional) are types of speculative execution. Each strategy has its advantages and disadvantages. Most notable is that executing both branches creates an exponential problem and doesn't scale well across multiple conditionals. Predictive execution works fairly well with repeated conditionals (with the same outcome) in a loop, but everywhere else it has a 50/50 chance of success per branch. It's worth noting that both AMD and Intel currently rely on predictive execution.
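The exponential problem with eager execution is just path counting: with several unresolved conditionals in flight at once, every extra branch doubles the number of speculative paths the core would have to track, while a predictor always follows one.

```python
# Paths an eager-execution scheme must track with n unresolved branches
# in flight, vs a predictor that follows a single predicted path.
def eager_paths(pending_branches):
    return 2 ** pending_branches

for n in (1, 2, 5, 10):
    print(n, eager_paths(n))  # 2, 4, 32, 1024
```

Even a modest window of 10 in-flight branches would mean over a thousand simultaneous paths, which is why eager execution is only practical for very shallow speculation.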
 
Joined
Jan 8, 2017
Messages
9,504 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
I thought SMT uses thread-level parallelism to allow resources in the core to be used concurrently by two threads to increase utilisation?

Resources are used concurrently anyway, because all modern CPUs are superscalar and out of order; the deal with SMT is somewhat more complex. A core may have something like 3 ALUs and 2 FPUs, for example; depending on which instructions are issued from one thread, different execution units are in use and some remain idle. This happens because dependencies exist between instructions, and simply because you may have a sequence of 1000 instructions where not a single one needs a floating-point calculation.

Having more than one thread from which you can reorder instructions means more opportunities to use the available execution units, hence the usefulness of SMT.
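A toy issue model makes this concrete. The unit counts are the hypothetical ones from the example above (3 ALUs, 2 FPUs), and the workloads are deliberately extreme (one all-INT, one all-FP) to show the idle-unit effect:

```python
# Toy greedy issue model: each cycle, up to int_units INT ops and
# fp_units FP ops can issue. Dependencies are ignored -- this only
# illustrates how a one-sided instruction mix leaves units idle.
def cycles_needed(int_ops, fp_ops, int_units=3, fp_units=2):
    cycles = 0
    while int_ops > 0 or fp_ops > 0:
        int_ops = max(0, int_ops - int_units)  # INT units drain their queue
        fp_ops = max(0, fp_ops - fp_units)     # FP units drain theirs
        cycles += 1
    return cycles

# Two workloads run back-to-back on a single thread: the FP units sit
# idle during the first, the ALUs during the second...
serial = cycles_needed(300, 0) + cycles_needed(0, 200)
# ...vs co-scheduled as two SMT threads sharing the same units.
smt = cycles_needed(300, 200)
print(serial, smt)
```

In this idealized mix the co-scheduled threads finish in half the cycles, purely by filling units the single thread left idle; real SMT gains are far smaller because threads also compete for caches and shared front-end bandwidth.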

What I've noticed is, the opposite is the case with gpus. That makes me believe the caches far outweigh the execution resource power budget in cpus. I still wish I understood gpu caching and pipeline scalarization in general.

Modern GPU cores are based on SIMT architectures: Single Instruction, Multiple Threads. This means you simply can't have fine-grained control over which thread does what and which doesn't; the way GPUs handle conditional instructions is that both paths are executed and a mask is applied to filter out the threads that took the wrong branch. This is wasteful, but that's why GPUs are designed to have up to 64 threads per core.

As far as caches are concerned, their effectiveness isn't as relevant, because most of the latency caused by cache misses is hidden by other instructions that are already scheduled to be executed.
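The execute-both-paths-then-mask behaviour can be sketched in a few lines. This is a conceptual model of SIMT divergence (here computing an absolute value across "lanes"), not any vendor's actual hardware:

```python
# Toy SIMT execution of `v = -v if v < 0 else v`: all lanes step through
# BOTH sides of the branch, and a per-lane mask decides which result commits.
def simt_abs(values):
    cond = [v < 0 for v in values]       # per-lane predicate
    then_res = [-v for v in values]      # "then" path, executed by all lanes
    else_res = [v for v in values]       # "else" path, executed by all lanes
    # The mask merges the two paths per lane; no lane ever skips work.
    return [t if c else e for c, t, e in zip(cond, then_res, else_res)]

print(simt_abs([-3, 5, -1, 0]))  # [3, 5, 1, 0]
```

Every lane pays for both paths regardless of its predicate, which is exactly the waste described above, and why divergent branches are expensive on GPUs.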
 
Last edited:
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
Btw, I used to write scripts for Fallout 3 and New Vegas. So I ask, is this a "conditional":

Code:
scn myscript

short var1

begin onActivate
    if var1 == 1 ; is this the conditional check?
        ;do this operation if var1 returns a value of 1
    elseif var1 == 2
        ;do this operation if var1 returns a value of 2
    else
        ; do this if var1 is not those 2
    endif

end

Btw, it's been YEARS since I did scripts for this engine, so I may have made a mistake, but I think that's how it goes. Even then, you get the idea XD

Actually, I think I would have done it this way:

Code:
scn myscript

short var1

begin onActivate
    if var1 == 0 ; is this the conditional check?
        enable ;turn my light on
        set var1 to 1
    endif
    if var1 == 1
        disable ;turn my light off
        set var1 to 0
    endif

end

Rip. I could have used the getEnabled check. LOL, ok, now I want to install F3 again and make a mod.........
 
Joined
Jan 8, 2017
Messages
9,504 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Yeah that would be an example of instruction branching.
 
Joined
Jun 10, 2014
Messages
2,995 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
An if statement is a type of conditional, which is a logical branch in control flow. If statements are the most common type, but there are also others, such as ternary operators, switch, etc. "Conditional" is the generic term.
 