- Joined
- Jun 10, 2014
- Messages
- 3,016 (0.78/day)
Processor | AMD Ryzen 9 5900X ||| Intel Core i7-3930K |
---|---|
Motherboard | ASUS ProArt B550-CREATOR ||| Asus P9X79 WS |
Cooling | Noctua NH-U14S ||| Be Quiet Pure Rock |
Memory | Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz |
Video Card(s) | MSI GTX 1060 3GB ||| MSI GTX 680 4GB |
Storage | Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB |
Display(s) | Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24" |
Case | Fractal Design Define 7 XL x 2 |
Audio Device(s) | Cambridge Audio DacMagic Plus |
Power Supply | Seasonic Focus PX-850 x 2 |
Mouse | Razer Abyssus |
Keyboard | CM Storm QuickFire XT |
Software | Ubuntu |
Sure, but let me clarify why I believe an inclusive L3 cache isn't beneficial to even some real world multithreaded workloads;2. It's true and I can't deny it The trend is towards Epyc / Xeon and there is important scaling and the highest possible performance of a single core thanks to the large L2. The relationship between the threads goes to the background, which is confirmed by leaks about the Alderlake system, i.e. GoldenCove. GoldenCove has the same cache type and capacity as WillowCove, ie L2 1.25MB and L3 3MB non-inclusive. You can see that Intel will no longer develop two separate microarchitecture in the era of more and more cores.
When a cache line is prefetched into L2, an inclusive cache would immediately duplicate this into L3. Then when the cache line is evicted from L2, it would normally be copied to L3, but instead it's already there. The only advantage of having a copy in L3 is in case another core needs it within that very short window before it's moved there anyway, I believe this window isn't many thousands of clock cycles, way too short for any software to be "optimized" for it. The disadvantages are numerous; more wasted die space, increased complexity to maintain integrity with higher core counts, L3 waste increases if L2 increases in size, etc.
I would argue there are other ways to make the cache more efficient, if the concern is sharing cache between cores. As you know, L1 is split into two separate caches; instruction and data. If L2 were split similarly, the design could take advantage of the different usage patterns. Data cache lines have often higher throughput and is rarely shared between threads, while instruction cache lines are often used repeatedly and by many cores within a short time. If L3 were split, L3I could be tightly interconnected, and L3D could be larger and more off-die. That's just an idea, but there might be design considerations I'm not aware of.