
Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

Joined May 10, 2023
Messages 304 (0.52/day)
Location Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
The prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 and then eventually evicted to L3, where it remains for a short window before being evicted from there too. Adding a lot of extra L3 only means the "garbage dump" will be larger.
Yeah, and this behavior is exactly because the L3 in most current designs is set as a victim cache.
Your idea would not hold for inclusive cache designs, like older Intel generations, where prefetching is way easier and can feed both the L2 and the L3.

This now leads me to wonder how well an inclusive design would fare on modern CPUs whose L3 is multiple times bigger than the L2. Latency, at least, could be really improved, and the space wasted on duplicated data becomes kinda irrelevant given the total amount of cache available.
Not sure how well it'd fare in terms of side-channel attacks tho.
 
Joined Jun 10, 2014
Messages 2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Keeping data resident in cache longer only helps hit rates assuming latency remains constant.
As long as the latency is lower than memory, it's fine. Every tiny bit helps to save wasted clock cycles on a cache miss.
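To put some rough, purely illustrative numbers on that (the latencies here are assumptions, not measurements of any specific CPU): say an L3 hit costs ~40 cycles and a trip to DRAM ~300 cycles. A workload that hits L3 90% of the time averages 0.9×40 + 0.1×300 = 66 cycles per access, while one that hits only 80% averages 0.8×40 + 0.2×300 = 92 cycles. Even a modest bump in hit rate from a bigger cache saves a lot of stalled cycles.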

I'm sure there is a tipping point: in a relatively tight loop, if you have too much data in cache, you'll be evicting things before the loop starts from the beginning again, and a little more cache might get over that hurdle. An example of this might be a tight rendering loop in a game.
In a sense, you're on to something, but let me first make sure you understand the premise. (Then I'll go back and address your question about a rendering loop.)

Firstly, I'm sure you know that L1 is split into two separate caches: L1I (instructions) and L1D (data). This is not just because they feed into different parts of the pipeline; it's also to accommodate different access patterns (it would be stupid if the pipeline constantly stalled because data cache lines kept flushing out instruction cache lines, or vice versa).
So while you could argue that anything is actually data, in this context we have to be a bit more specific to make sense of it.

Also, you'll see me use the term cache line, which is a block of 64 bytes. If you access any byte within a cache line, the whole line is cached. If your data or instructions are more spread out, it takes more cache lines to store "the same" amount of information.
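To put that in code, here's a minimal sketch of what "spread out" means for cache lines (the 64-byte line size matches typical current CPUs; the layouts are made up for illustration):

```cpp
#include <cstddef>

// Assumed 64-byte cache lines (typical for current x86 CPUs).
constexpr std::size_t kCacheLine = 64;

// Packed layout: 16 floats share one 64-byte line, so summing N of them
// touches roughly N/16 cache lines.
float sum_packed(const float* values, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += values[i];
    return sum;
}

// Spread-out layout: each float sits alone in a 64-byte slot, so the same
// sum touches N cache lines -- 16x the cache traffic for the same useful data.
struct alignas(kCacheLine) Padded { float value; };

float sum_padded(const Padded* values, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += values[i].value;
    return sum;
}
```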

For L2 and L3, instructions and data cache lines are competing for the same cache. Let's say you have a heavy math workload and an algorithm churning through several gigabytes of data fairly sequentially. Then it should be obvious that the size of the data is much greater than the instructions, and that you access the same instruction cache lines many times throughout the workload, while the data cache lines are accessed only a few times. Okay so far?

And here comes the crucial point to understand: while both data and instruction cache lines can have cache misses, the total cost of an instruction cache miss can be several times worse (I'll explain why in a moment, 1*). But first you need to understand that if the code is more bloated, it not only takes up more cache space, each of its cache lines is also more likely to be "outcompeted" by data cache lines, which is why I keep saying that sensitivity to L3 is usually a sign of bloated code. So if you manage to reduce the size of the performance-critical part of your algorithm, the code takes fewer cache lines but gets more accesses per cache line, increasing the chance of it "surviving" longer in cache.

Optimization of both instructions and data is important for performance scaling on modern hardware, and it's only getting more important (2*). Both types of optimization go hand in hand, and some techniques actually do both. (Link #1 is great if you want to know more, but requires a good understanding of programming.)

1) The reason instruction cache misses can be so severe is that the prefetcher looks at the upcoming instructions to determine which data and instruction cache lines to fetch next, so if a misprediction means it doesn't have the right upcoming instructions in time, you get a stall. And in theory you can get a chain of cache misses, if the CPU needs a little code to execute just to find out where to fetch next, and only then which data to fetch. Or sometimes instructions -> data -> more instructions -> more data. So a tiny bit of code can cause 2-4 cache misses, each adding ~400-500 clock cycles, which can severely affect overall performance. The main culprits for this in "modern" code are abstractions and branching.
Let's say your code is iterating over a long list of "objects" and executing a function process() on each, but there is a huge inheritance hierarchy, so for each object it has to fetch a large amount of code and metadata just to figure out where that object's process() function is (and similar bloat for child functions and objects). If this is performance-critical code, then pretty much any reduction in this complexity will unlock magnitudes more performance, as it makes the program much less sensitive to cache (especially L3). If you grasp this, you understand the essence of memory optimization.
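Roughly what that pattern looks like, versus a flattened alternative. This is only a sketch with made-up class names, not anyone's real engine code:

```cpp
#include <memory>
#include <vector>

// The pattern described above: a deep hierarchy where each call to process()
// goes through a vtable. Fetching the object, its vtable, and then the
// function's code can mean several dependent cache misses per object.
struct Object        { virtual ~Object() = default; virtual void process() = 0; };
struct TerrainObject : Object {};
struct Foliage       : TerrainObject {};
struct Bush final    : Foliage {
    float growth = 0.0f;
    void process() override { growth += 0.1f; }
};

void process_all(const std::vector<std::unique_ptr<Object>>& objects) {
    for (const auto& obj : objects) obj->process();   // indirect call per object
}

// Flattened alternative: the hot state is contiguous and the loop body is a
// few instruction cache lines that stay resident while the data streams through.
void process_bushes(std::vector<float>& growth) {
    for (float& g : growth) g += 0.1f;
}
```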

2) While CPUs are getting better front-ends and more efficient caches over time, which makes them better at handling "bad code" than ever, the performance you could have is even greater. I want to make this point clear, as even many programmers think optimizing doesn't matter any more, when on the contrary it matters more than ever. It's also a requirement for getting any gains from SIMD (like AVX), and a great benefit for multithreading. Cache optimizations generally unlock major performance gains, since they eliminate (a lot of) CPU pipeline stalls; in extreme cases we can see 10-50x performance gains, so we're not talking about a 0.01% difference here. And even if you have a large, poorly written program, you might be tempted to split an algorithm across e.g. 4 threads and perhaps gain 40% performance, but why start there when optimizing the code can yield 10x? As a bonus, the removal of complexity makes adding multithreading so much easier afterwards.
A few years ago I was optimizing some simple math code for rendering engines (just basic matrix, vector and quaternion stuff, in C), and I benchmarked various implementations of my functions. Even though my baseline implementations were very clean, tiny tweaks, sometimes just a keyword in the function header, were enough to gain an extra 10% or even 50%, especially on newer CPUs (I tested Sandy Bridge, Haswell, Skylake and Zen 3). This goes to show that when an optimization enables the compiler to reduce the code even further, newer CPUs are able to scale much further. I'm looking forward to being able to test this on newer CPUs. So by now you'll probably understand that I'm not against cache (unlike some of the immature responses on page 1). And even though we often see people complaining about a new generation only yielding ~5-20% more IPC, the theoretical capabilities are much greater than that. Don't blame Intel and AMD, blame bad programmers.
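As an illustration of the kind of function-header tweak meant above (the exact keyword isn't named, so __restrict is just one plausible guess; it's a common compiler extension, spelled restrict in C99):

```cpp
// Illustrative guess at a "keyword in the function header" tweak; the exact
// keyword used in the original benchmarks isn't stated above.
// __restrict promises the compiler that the pointers never alias, so it can
// keep values in registers and vectorize the loop instead of reloading
// memory after every store.
static void vec4_add(float* __restrict out,
                     const float* __restrict a,
                     const float* __restrict b) {
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}
```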

Now, finally getting back to your core question:
If a rendering loop gains performance from extra L3, it's mostly due to instruction cache lines, not data cache lines, as the code repeats much more than the data does. If the code is then optimized, it becomes less likely to be evicted from cache, so the resulting performance increases while the sensitivity to cache shrinks. Most people would naively think that's a bad thing, assuming that when software shows gains from some hardware "improvement", the software must somehow be better.

To take things further, you should take a look at link #2, which is specifically about game engine design and how a difference in approach yields completely different performance. (Link #1 is more about caches in general and requires extensive programming knowledge to grasp, but link #2 should be understandable to anyone with a decent understanding of tech.)
It talks a bit about how modern programmers typically introduce vast complexity into their code through "world modelling": thinking the code structure should mimic the physical objects of real life, and that all code associated with those "objects" should belong together (e.g. creating a "bush" object with functions for both rendering and simulation grouped together, and a huge hierarchy of such objects, e.g. generic object -> terrain object -> foliage -> bush -> bush#54). Yet a game engine typically has two main data flows, the game loop (simulation) and the render loop, and strictly speaking only a tiny bit of information is shared between them: a snapshot of the game at a point in time. As you can probably imagine, the state each "object" uses during the game loop is quite different from the state required for each "object" during rendering.

Not to mention that the game loop and render loop are async, so some poor game engine designs which use shared state have glitches in rendering (the game loop moving objects in the middle of rendering), despite having "mitigations" for this. The proper way is obviously to separate the two and share a snapshot, which eliminates this type of glitching. Additionally, it lets the render loop traverse the data differently, possibly grouping together objects that are distinct on the game loop side but, from the renderer's perspective, are just another mesh with a texture. With modern Vulkan and DirectX 12 (and even recent OpenGL features) making it possible to pass state information with very little effort, it's actually puzzling to me that some game engines still do this so inefficiently (I'm fully aware it's because of bloated generic code), when by now we should be seeing more game engines that aren't CPU-bottlenecked by cache in this way. We should of course expect games to become more demanding over time, but they should scale with computational performance, not cache. :)
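To make the snapshot idea concrete, here is a minimal sketch (the names and the per-frame copy are assumptions for illustration, not how any particular engine does it): the simulation keeps whatever rich object state it wants and only publishes a flat, render-friendly snapshot at the end of each tick.

```cpp
#include <mutex>
#include <vector>

// Hypothetical names; a minimal sketch of "share only a snapshot".
// The renderer never touches live simulation objects, so the game loop can't
// move things mid-draw, and the snapshot can be laid out however rendering
// likes (just meshes and textures, regardless of what the objects are).
struct RenderItem { float x, y, z; int mesh_id, texture_id; };
struct Snapshot   { std::vector<RenderItem> items; };

class SnapshotExchange {
public:
    // Game loop: publish the finished frame state at the end of each tick.
    void publish(Snapshot s) {
        std::lock_guard<std::mutex> lock(m_);
        latest_ = std::move(s);
    }
    // Render loop: take a private copy and render from it at its own pace.
    Snapshot take() const {
        std::lock_guard<std::mutex> lock(m_);
        return latest_;
    }
private:
    mutable std::mutex m_;
    Snapshot latest_;
};
```

(A real engine would likely avoid the per-frame copy with double or triple buffering, but the separation of simulation state from render state is the point here.)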

Further information:
#1 Scott Meyers: CPU Caches and Why You Care
#2 Mike Acton: Data-Oriented Design and C++
Thanks for asking good questions and contributing to an interesting discussion.
 