Firstly, I'm sure you know that L1 is split into two separate caches: L1I (instructions) and L1D (data). This is not just because they're fed into different parts of the pipeline, it's also to accommodate different access patterns (it would be stupid if the pipeline constantly stalled because data cache lines kept evicting instruction cache lines or vice versa).
So while you could argue that anything is actually data, in this context we must be a bit more specific to make sense of it.
Also, you'll see me use the term cache line, which is a block of 64 bytes (on typical x86 CPUs). If you access any byte within a cache line, the whole line is cached. If your data or instructions are more spread out, it takes more cache lines to store "the same" amount of information.
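To make that concrete, here's a tiny sketch of my own (the struct names are made up) showing how layout alone changes how many cache lines the same amount of useful data occupies:

```cpp
#include <cstddef>
#include <cstdio>

// The same useful payload (one float per element) stored two ways.
struct Bloated {            // 4 bytes we care about + 60 bytes of other members/padding
    float value;
    char  other_stuff[60];
};

struct Packed {             // just the 4 bytes we actually iterate over
    float value;
};

int main() {
    const std::size_t N = 1000;
    // A 64-byte cache line holds 1 Bloated element but 16 Packed elements,
    // so a sequential pass over N values touches ~16x more cache lines
    // with the bloated layout.
    std::printf("Bloated: ~%zu cache lines\n", (N * sizeof(Bloated) + 63) / 64);
    std::printf("Packed:  ~%zu cache lines\n", (N * sizeof(Packed)  + 63) / 64);
}
```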
For L2 and L3, instruction and data cache lines are competing for the same cache. Let's say you have a heavy math workload, with an algorithm churning through several gigabytes of data fairly sequentially. Then it should be obvious that the data is much larger than the instructions, and that you access the same instruction cache lines many times throughout the workload, while each data cache line is accessed only a few times. Okay so far?
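As a simplified illustration of that asymmetry (my own example): the loop below compiles to a handful of instructions that fit in one or two instruction cache lines and are reused on every single iteration, while each 64-byte data cache line holds 16 floats and is needed exactly once as the data streams through:

```cpp
#include <cstddef>
#include <vector>

// The loop body is a few instructions (hot, reused billions of times);
// every data cache line (16 floats) is loaded once and then never again.
double sum(const std::vector<float>& data) {
    double total = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
        total += data[i];
    return total;
}
```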
And here comes the crucial point to understand: while both data and instruction cache lines can have cache misses, the total cost of an instruction cache miss can be several times worse (I'll explain why in a moment, 1*). But first you need to understand that if the code is more bloated, it will not only take up more cache space, but each of its cache lines is more likely to be "outcompeted" by data cache lines, which is why I keep saying that sensitivity to L3 is usually a sign of bloated code. So if you manage to reduce the size of the performance-critical part of your algorithm, the code will take fewer cache lines but have more accesses per cache line, increasing the chance of it "surviving" longer in cache. Optimization of both instructions and data is important for performance scaling on modern hardware, and it's only getting more important (2*). Both types of optimization go hand in hand, and some techniques actually do both. (Link #1 is great if you want to know more, but requires a good understanding of programming.)
1) The reason why instruction cache misses can be very severe is that the prefetcher looks at the upcoming instructions to determine which data and instruction cache lines to fetch next, but if a misprediction means it doesn't have the right upcoming instructions in time, you get a stall. And in theory you can have a chain of cache misses, if it needs a little code to execute just to find out where to fetch next, and then finally which data to fetch. Or sometimes instructions -> data -> more instructions -> more data. So a tiny bit of code can cause 2-4 cache misses, each adding roughly 400-500 clock cycles, which can severely affect overall performance. The main culprits of this in "modern" code are abstractions and branching.
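To make the chain concrete, here's a hypothetical illustration of my own (not from the links) of how a single innocent-looking call can turn into several dependent misses, because each address is only known once the previous load has completed:

```cpp
// Hypothetical sketch of a dependent miss chain in "modern" abstracted code.
struct Entity {
    virtual void process() = 0;   // dispatched through a vtable
    virtual ~Entity() = default;
};

void update(Entity** entities, int count) {
    for (int i = 0; i < count; ++i) {
        Entity* e = entities[i];  // (1) data miss: load the pointer itself
        // (2) data miss: load e's vtable pointer (start of *e)
        // (3) data miss: load the vtable entry for process()
        // (4) instruction miss: fetch the code of the concrete process()
        // (5) data miss: process() then touches the object's own members
        e->process();
        // None of these addresses are known until the previous load finishes,
        // so the prefetcher can't hide them and each one can stall the pipeline.
    }
}
```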
Let's say your code is iterating over a long list of "objects" and executing a function process() on each, but you have a huge inheritance hierarchy, where for every object it has to fetch a large amount of code and data just to figure out where that object's process() function is (and similar bloat for child functions and objects). If this is performance-critical code, then pretty much any reduction in this complexity will unlock magnitudes more performance, as it makes the program much less sensitive to cache (especially L3). If you grasp this, then you understand the essence of memory optimization.
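To sketch the contrast (a simplified example of my own, not taken from the links): version A below drags the whole hierarchy and a vtable lookup into every iteration, while version B keeps the hot state in a flat array processed by one tiny loop, so the same work touches far fewer instruction and data cache lines:

```cpp
#include <cstddef>
#include <vector>

// Version A: deep hierarchy, one virtual call per object. Every iteration
// chases a pointer, reads a vtable, and may jump to different code per type.
struct Object        { virtual void process(float dt) = 0; virtual ~Object() = default; };
struct TerrainObject : Object        { /* ... */ };
struct Foliage       : TerrainObject { /* ... */ };
struct Bush          : Foliage {
    float sway = 0.0f;
    void process(float dt) override { sway += dt; }
};

void update_a(const std::vector<Object*>& objects, float dt) {
    for (Object* o : objects) o->process(dt);   // scattered code + scattered data
}

// Version B: the hot state of all bushes lives in one flat array and is
// processed by a few instruction cache lines reused for every element,
// with the data read sequentially.
struct BushState { std::vector<float> sway; };

void update_b(BushState& bushes, float dt) {
    for (std::size_t i = 0; i < bushes.sway.size(); ++i)
        bushes.sway[i] += dt;
}
```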
2) While CPUs are getting better front-ends and more efficient caches over time, which makes them better at handling "bad code" than ever, the performance you could have is even greater. I want to make this point clear, as even many programmers think optimizing doesn't matter any more, when on the contrary it matters more than ever. It's also a requirement for getting any real gains from SIMD (like AVX), and it's a great benefit for multithreading too. Cache optimizations generally unlock major performance gains, since they eliminate (a lot of) CPU pipeline stalls; in extreme cases we can see 10-50x performance gains, so we're not talking about a 0.01% difference here. And even if you have a large, poorly written program, you might be tempted to split an algorithm across e.g. 4 threads and perhaps gain 40% performance, but why start there when optimizing the code can yield 10x? And as a bonus, the removal of complexity makes adding multithreading so much easier afterwards.
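On the SIMD point specifically: auto-vectorization (and hand-written AVX) wants the values it operates on to sit next to each other in memory. Here's a hedged sketch of the usual array-of-structs to struct-of-arrays change (names made up):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: consecutive x values are 16 bytes apart, so the vector
// unit would have to gather them and the compiler often won't vectorize.
struct ParticleAoS { float x, y, z, mass; };

void move_aos(std::vector<ParticleAoS>& ps, float dx) {
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].x += dx;
}

// Struct-of-arrays: all x values are packed together, so one AVX register
// loads 8 of them at a time and every fetched cache line is fully used.
struct ParticlesSoA { std::vector<float> x, y, z, mass; };

void move_soa(ParticlesSoA& ps, float dx) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) ps.x[i] += dx;
}
```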
A few years ago I was optimizing some simple math code for rendering engines (just basic matrix, vector and quaternion stuff, in C), and I benchmarked various implementations of my functions. Even though my baseline implementations were already very clean, tiny tweaks, and sometimes even a single keyword in the function header, were enough to gain an extra 10% or even 50%, especially on newer CPUs (I tested Sandy Bridge, Haswell, Skylake and Zen 3). This goes to show that when an optimization enables the compiler to reduce the code even further, newer CPUs are able to scale much further with it. I'm looking forward to being able to test this on newer CPUs. So by now you'll probably understand that I'm not against cache (unlike some of the immature responses on page 1). And even though we often see people complaining about a new generation only yielding ~5-20% more IPC, the theoretical capabilities are much greater than that. Don't blame Intel and AMD, blame bad programmers.
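I don't have that benchmark code at hand any more, but the kind of one-keyword change I mean is along these lines (restrict is standard C; most C++ compilers accept __restrict as an extension). It tells the compiler that the output doesn't alias the inputs, so it can keep values in registers and vectorize instead of conservatively reloading from memory after every store:

```cpp
// Hypothetical sketch; the real functions and gains depend on the compiler and CPU.
void vec4_add(float* __restrict out,
              const float* __restrict a,
              const float* __restrict b) {
    // With the restrict qualifiers the compiler may emit a single 128-bit
    // load/add/store; without them it must assume 'out' might overlap 'a' or 'b'.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}
```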
Now, finally getting back to your core question:
If a rendering loop gains performance from extra L3, then it's mostly due to instruction cache lines, not data cache lines, as the code is repeated much more often than the data. If the code is then optimized, it will be less likely to be evicted from cache, and the resulting performance will increase while the sensitivity to cache is reduced. Most ignorant people would think this is a bad thing, assuming that when software shows gains thanks to some hardware "improvement" it means the software is somehow better.
To take things further, you should take a look at link #2, which is specifically about game engine design and how a difference in approach yields completely different performance. (Link #1 is more about caches in general and requires extensive programming knowledge to grasp, but link #2 is probably understandable to anyone with a decent understanding of tech.)
It talks a bit about how modern programmers, through "world modelling", typically introduce vast complexity into their code, thinking the code structure should mimic the physical objects of real life, and that all code associated with those "objects" should belong together (e.g. creating a "bush" object with functions for both rendering and simulation grouped together, and a huge hierarchy of such objects: generic object -> terrain object -> foliage -> bush -> bush#54). Yet a game engine typically has two main data flows, the game loop (simulation) and the render loop, and strictly speaking there is only a tiny bit of information shared between them: a snapshot of the game at a point in time. As you can probably imagine, the state each "object" uses during the game loop is quite different from the state required for that "object" during rendering.

Not to mention that the game loop and render loop are asynchronous, so some poor game engine designs which use a shared state have glitches in rendering (the game loop moving objects in the middle of rendering), despite having "mitigations" for this. The proper way is obviously to separate the two and share a snapshot, which eliminates this type of glitching. Additionally, it enables the render loop to traverse the data differently, possibly grouping together objects that may be different on the game loop side but which, from the renderer's perspective, are just another mesh with a texture.

Modern Vulkan and DirectX 12 (and even recent OpenGL features) let you pass this state information with very little effort, so it's actually puzzling to me that some game engines still do this so inefficiently (I'm fully aware it's because of bloated generic code), when we should by now see more game engines basically not being CPU-bottlenecked by cache in this way. We should of course expect games to become more demanding over time, but it should scale with computational performance, not cache.
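A minimal sketch of what I mean by sharing only a snapshot (all names made up, and a real engine would double-buffer or otherwise synchronize the handoff):

```cpp
#include <cstdint>
#include <vector>

// Simulation-side state: laid out for the game loop's access pattern.
struct SimObject {
    float pos[3];
    float vel[3];
    // ... gameplay-only fields (health, AI state, etc.)
};

// The only thing the renderer ever sees: a flat list of "a mesh with a
// texture at a transform", grouped however suits the renderer.
struct RenderItem {
    std::uint32_t mesh_id;
    std::uint32_t texture_id;
    float         transform[16];   // column-major 4x4 matrix
};

struct Snapshot {
    std::vector<RenderItem> items;
};

// At the end of a simulation tick, publish a snapshot. The render loop reads
// only this, so the game loop can keep mutating SimObjects without causing
// mid-frame glitches.
Snapshot make_snapshot(const std::vector<SimObject>& sim) {
    Snapshot snap;
    snap.items.reserve(sim.size());
    for (const SimObject& o : sim) {
        RenderItem item{};                                 // zero-initialize
        item.mesh_id    = 0;                               // looked up elsewhere in a real engine
        item.texture_id = 0;
        item.transform[0] = item.transform[5] =
        item.transform[10] = item.transform[15] = 1.0f;    // identity rotation/scale
        item.transform[12] = o.pos[0];                     // translation
        item.transform[13] = o.pos[1];
        item.transform[14] = o.pos[2];
        snap.items.push_back(item);
    }
    return snap;
}
```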
Further information:
#1
Scott Meyers: CPU Caches and Why You Care
#2
Mike Acton: Data-Oriented Design and C++