Then there's the issue of latency, which is already not very good with the CCX design. We'll see though. It would make some amount of sense, as caches are very costly in terms of die area and don't benefit much from node shrinks, and I suppose having all the CCXs use a single unified L3 could be considered an architectural advancement.
Well, cache is one of the things that benefits the most from die shrinks. Cache is on the least thermally intensive end of the scale, which means it can be packed tighter than FPUs, ALUs and register files, which are on the opposite end. Packing cache tightly comes down more to placement: latency, and where it sits relative to the other parts of the design that need to interact with it. So cache has traditionally been challenging to place due to its size and increasing core complexity, but shrinks should generally help with this.
I'm not sure making a unified L3 cache for several chiplets is a good idea in general, not primarily because of latency, as Darmok N Jalad mentioned, but because of the way L3 works. As you said, L3 is a victim cache, and in most designs it's an inclusive cache. There is a reason why Skylake-X changed this: an inclusive L3 has to keep a copy of everything held in L2, which is a very inefficient use of die space, and it also means that increasing L2 decreases the efficiency of L3. As you probably know, modern CPUs typically split L1 cache into instruction and data caches, while L2 and L3 hold both. And while L3 cache is shared between multiple cores, the actual sharing is commonly very minimal. The entire cache is overwritten every few microseconds, so the chance of two cores needing data from the same cache line is very small, because when you have multiple threads working, they have to use separate data, otherwise they would stall all the time. So the only thing that is generally shared between cores is instructions, provided the cores are executing the same part of the code of course. And the few times the L3 victim cache is useful for data, it's usually for the same core that evicted it. So to sum up, L3 is largely wasteful in its current application, and only gives minor benefits.
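To put rough numbers on the inclusive L3 point, here is a minimal sketch in Python. It assumes a deliberately simplified model: every core's L2 is full of distinct lines, and L1 and lines shared between cores are ignored. The function name and the 8-core cache sizes are made up for illustration and don't describe any specific CPU.

```python
KIB = 1024
MIB = 1024 * KIB


def unique_l3_bytes(l3_total, l2_per_core, cores, inclusive):
    """Estimate how much of L3 holds data that is not already in some L2.

    Simplified model (an assumption, not a measurement): an inclusive L3
    keeps a copy of every line in every L2, so that space is duplicated.
    A non-inclusive L3 is treated as holding no duplicates at all.
    """
    if inclusive:
        return max(l3_total - l2_per_core * cores, 0)
    return l3_total


# Hypothetical 8-core chip with 16 MiB of L3 (illustrative numbers only).
cores = 8
l3 = 16 * MIB

small_l2_inclusive = unique_l3_bytes(l3, 256 * KIB, cores, inclusive=True)
big_l2_inclusive = unique_l3_bytes(l3, 1 * MIB, cores, inclusive=True)
big_l2_non_inclusive = unique_l3_bytes(l3, 1 * MIB, cores, inclusive=False)

for label, value in [
    ("256 KiB L2, inclusive L3", small_l2_inclusive),
    ("1 MiB L2, inclusive L3", big_l2_inclusive),
    ("1 MiB L2, non-inclusive L3", big_l2_non_inclusive),
]:
    print(f"{label}: {value / MIB:.0f} MiB of non-duplicated L3")
```

Under those assumptions, growing L2 from 256 KiB to 1 MiB per core shrinks the non-duplicated part of an inclusive L3 from 14 MiB to 8 MiB, which is the effect I mean when I say a bigger L2 makes an inclusive L3 less efficient.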
I think it's time to re-evaluate L3 cache's role, and the changes Intel made in Skylake-X are probably just the beginning. Perhaps a split L3 cache, or an instruction-only L3 cache? Perhaps L3 shouldn't be shared and should be data only, with L4 being instruction-only and shared?