Monday, August 14th 2023
Intel Arrow Lake-S to Feature 3 MB of L2 Cache per Performance Core
Intel's next-generation designs are nearing launch, and information about the upcoming generations is already surfacing. Today we learn that Arrow Lake-S, the desktop/client implementation of the Arrow Lake family, will feature as much as 3 MB of level-two (L2) cache per performance core. Currently, Intel's 13th-generation Raptor Lake and 14th-generation Raptor Lake Refresh processors feature 2 MB of L2 cache per performance core. However, the 15th-generation Arrow Lake, scheduled for launch in 2024, will bump that up by 50% to 3 MB. Given that the P-cores are getting a capacity boost, we expect the E-cores to get one as well, though of a smaller size.
Arrow Lake will utilize Lion Cove P-core microarchitecture, while the E-core design will be based on Skymont. Intel plans to use a 20A node for this CPU, and more details will be presented next year.
Source: via VideoCardz
36 Comments on Intel Arrow Lake-S to Feature 3 MB of L2 Cache per Performance Core
Caches are usually organized in banks, which increases bandwidth substantially and offsets latency, but decreases overall cache efficiency (per cache line).
New node improvements may also lead to latency decreases.
And so on. Both AMD and Intel have increased and decreased their L1D/L1I and L2 caches over various generations; it all depends on the cache design and the priorities of the architecture. Comparing a cache across CPU architectures solely based on size is nearly pointless. And as I often say, performance is what ultimately matters.
Pretty much all current CPU architectures cache memory in fixed-size regions called "cache lines", currently 64 bytes on most x86 and ARM architectures.
I do expect them to move to 128 bytes eventually, as this would greatly benefit dense data accesses (which is where you get good hit rates anyway). Implementing e.g. 3 MB of L2 with 128-byte cache lines would cost nowhere near 50% more die space than 2 MB of L2 with 64-byte lines, and would have roughly the same latency, so it's a big win in hit rates. This would also allow for 1024-bit SIMD, which is probably coming "soon".
As you might already know, L3 is a spillover cache, which means it only contains cache lines evicted from L2. L3 is also shared across cores, which is why it has some effect on multithreaded workloads. A tremendous amount of data flows through the caches constantly, including lots of prefetched data that ultimately goes unused. In terms of cache lines, the largest volume is data and a smaller volume is instructions, but the chance of a single cache line being needed again before it is evicted from L3 is much higher for instructions, especially from other cores. (The chance of another core needing much of the same data within nanoseconds is slim, except for explicit synchronization.) This is why CPUs need such large L3 caches before they start to matter; in most cases where we see sensitivity to L3, it's due to instruction cache lines being shared, not data. But we usually don't see significant gains from huge L3 caches in most computationally intense tasks, even though they churn through large amounts of data. This is because such applications are cache optimized, which is one of the most important kinds of low-level optimization. As any low-level programmer can tell you, sensitivity to L3 usually means the code is too large, bloated and unpredictable, which is why the CPU evicts it from cache.
Even though huge L3s make an appreciable difference in some games and select applications, I don't believe it's a good direction for CPU development. It costs a tremendous amount of die space and doesn't yield meaningful gains for most heavy workloads. That die space and development effort could be spent on much more useful improvements, which would benefit most workloads. But I guess this is what we get when people are more focused on synthetic benchmarks than real-world results. Just think about it: slapping a whole extra cache die on the CPU makes less of a difference than a minor architectural upgrade (~10% IPC gains). That's a crude brute-force approach that extracts very little overall. And this is why I'm not for L4; its usefulness would be even smaller, especially with a larger L3. I do believe there is one way L3 could become more cost-effective, though: splitting instructions and data. Then a much smaller L3 pool could have the same effect as 100 MB or so, at a small cost.
I'm much more excited about real architectural improvements, such as much wider execution. The difference between well-written and poorly written software will only become clearer over time, as well-written software will continue to scale.