Sunday, November 25th 2018
AMD Doubles L3 Cache Per CCX with Zen 2 "Rome"
A SiSoft SANDRA results database entry for a 2P AMD "Rome" EPYC machine sheds light on the lower levels of the cache hierarchy. Each 64-core EPYC "Rome" processor is made up of eight 7 nm 8-core "Zen 2" CPU chiplets, which converge at a 14 nm I/O controller die that handles the processor's memory and PCIe connectivity. The entry lists the cache hierarchy as 512 KB of dedicated L2 cache per core and "16 x 16 MB L3." Like CPU-Z, SANDRA can report L3 cache by arrangement: for the Ryzen 7 2700X it reads the L3 cache as "2 x 8 MB L3," corresponding to the 8 MB of L3 cache per CCX.
For each 64-core "Rome" processor, there are a total of 8 chiplets. With SANDRA detecting "16 x 16 MB L3" for 64-core "Rome," it becomes highly likely that each of the 8-core chiplets features two 16 MB L3 cache slices, and that its 8 cores are split into two quad-core CCX units with 16 MB L3 cache, each. This doubling in L3 cache per CCX could help the processors cushion data transfers between the chiplet and the I/O die better. This becomes particularly important since the I/O die controls memory with its monolithic 8-channel DDR4 memory controller.
Source:
SiSoft SANDRA Database
For each 64-core "Rome" processor, there are a total of 8 chiplets. With SANDRA detecting "16 x 16 MB L3" for 64-core "Rome," it becomes highly likely that each of the 8-core chiplets features two 16 MB L3 cache slices, and that its 8 cores are split into two quad-core CCX units with 16 MB L3 cache, each. This doubling in L3 cache per CCX could help the processors cushion data transfers between the chiplet and the I/O die better. This becomes particularly important since the I/O die controls memory with its monolithic 8-channel DDR4 memory controller.
24 Comments on AMD Doubles L3 Cache Per CCX with Zen 2 "Rome"
A single die contains 2 quad-core CCXs, like Zen and Zen+ before it.
Those show up as "2 x 8 MB L3" cache in SiSoft SANDRA today (1700, 1700X, 1800X, 2700, 2700X).
This points to Zen 2 having 2x 16 MB L3 cache per die/chiplet.
SiSoft shows the 2P hint before the CPU name, but after that only the single-CPU stats,
resulting in 2P, each with 64 cores and 16x 16 MB (256 MB L3), i.e. 2x 16 MB per 8-core chiplet.
The one I am referring to in the TechSpot article is going to be like having 4x 8700Ks in our PCs for normal usage...
This is getting increasingly interesting from a cache-hierarchy standpoint. 256 MB of L3 means definitely no L4 as LLC on the I/O chip, because the I/O chip is definitely not large enough to cram in 512 MB of L4. So how will they arrange and manage all this L3 cache?
In fact, I also remember prevalent rumors that AMD has completely done away with their current NUMA design, and yet this new architecture is supposed to gain IPC while arguably spreading the resources out more than before. This puzzled me until now.
Doubling the L3 cache might allow them to design all of their product lines (AM4, TR, EPYC) uniformly, so that they effectively work the same way (as opposed to now, where AM4 uses only one die, but TR and EPYC use multiple dies). The exciting prospect of this is that there would no longer be ANY need to "localize" memory for certain games that only use 4 cores, and Threadripper would have the same gaming IPC as AM4 chips. It would just work.
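For context, the current "localized" layout is visible on Linux through sysfs. The minimal sketch below (Linux only, using the standard /sys/devices/system/node entries) just prints which CPUs and how much memory hang off each NUMA node; on a single-die AM4 desktop it shows only node0, while a multi-die Threadripper or EPYC typically shows one node per die.

```python
# Minimal sketch, Linux only: list each NUMA node with its CPUs and memory,
# read from the standard sysfs layout.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    mem = (node / "meminfo").read_text().splitlines()[0].split(":", 1)[1].strip()
    print(f"{node.name}: CPUs {cpus}, MemTotal {mem}")
```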
But I think the OS should get correct reports about the segmentation etc., so it's unlikely but indeed possible.
So there is little chance that every chiplet is one big 8-core CCX with 32 MB L3 cache.
@btarunr "you are very welcome"
How the R/W buffering is maintained on the I/O die will be interesting to see; I don't think there will be any big compromise.
Maybe this whole layout is actually a better fit for something like the Infinity Fabric by default; maybe Zen 1 was only there to see if it's even possible.
AnandTech measured the power-consumption ratio of cores vs. fabric, Intel's mesh vs. Zen.
The charts showed the mesh is better at low load, but gets beaten by EPYC at higher load.
In conclusion, the next battle in servers is not about which core is more efficient; it's the fabric that counts.
Now remember all the possibilities of the Zen 2 layout? Chiplet power-gating, anyone? I/O-die segment power-gating? That's only possible if you have no memory controller or I/O etc. on the chiplets.
----------------------------------------------------------------
The I/O die will be something like this,
without the GPU, multimedia and display blocks.
This thing will be big, smart and very fast.
Remember the 2700X has a base clock of 3.7 GHz. So he thinks 7 nm will allow AMD to just slap 700 MHz on top of that!
A 4.4 GHz base is about 20% higher than the Ryzen 7 2700X's base clock (3.7 GHz), which is not impossible on a high level, especially when we already know that Zen+ can clock up to 4.3-4.4 GHz.
The challenge is to scale max clocks. Also, 14 nm GloFo/Samsung had decent density but didn't scale well at higher voltages. So here you are going from a 14 nm process that has its efficiency sweet spot at lower voltages to a 7 nm high-performance process.
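For what it's worth, a quick check of that "about 20%" figure using the clocks quoted above (the 4.4 GHz base is the rumour being discussed, not a confirmed spec):

```python
# Quick arithmetic check of the claim above; both clocks in GHz.
zen_plus_base = 3.7   # Ryzen 7 2700X base clock
rumored_base = 4.4    # rumoured Zen 2 base clock under discussion

print(f"{(rumored_base / zen_plus_base - 1) * 100:.1f}% higher base clock")
# prints: 18.9% higher base clock
```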