The only question now is whether the 32MB L3 cache per CCX chip will actually be present, as this leak suggests. It is entirely possible that all the L3 cache gets moved to the central controller chip. 32MB of cache on 7nm is a real cost to consider, and making 8 of them shared and coherent is hard AF. If that is the case (and they use it in MSDT), it's screwed.
Nah, SRAM scales really well with process; it's the IMC that doesn't
Thus one of the reasons this is genius
Together with the fact that on desktop and laptop the second (of two) chiplets will be a GPU...
Now I get what you were asking. I don't propose a different CCX at all. 8C/16T CCXs are what AMD is supposed to bring forward with the Zen 2 arch, after all. Layout and packaging are where I think they should differ between EPYC and TR on the one side and desktop Ryzen CPUs on the other. And if AMD wants to keep the price for desktop low enough, they should keep desktop Ryzens to a max of 8C/16T, which can be made using just 1 CCX and thus without any need for IF at all. Latency wouldn't be a problem then. And for the next-gen EPYC and TR CPUs, the IF changes the article refers to should make the latency penalty smaller than the existing one.
for entry-level, mainstream and mobile they'll keep 8C/16T, make it one chiplet (no CCX), interface to the NB and combine those 2 with a GPU
It's a solution that creates more problems
actually, see above, it creates all the solutions...chiplet will be 8 cores with lower internal latency
plus:
-latency to RAM will be even for all MCM solutions
-it's easy to combine with a GPU for mainstream and mobile
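To put rough numbers on the "even latency to RAM" point, here is a minimal sketch comparing a layout with a memory controller on every die against a single central controller. The function names and all latency figures are made-up placeholders for illustration, not measurements of any AMD part, and it assumes accesses are spread uniformly across dies.

```python
# Toy model: average DRAM latency seen by a core under two layouts.
# All latency figures are hypothetical placeholders, not measurements.

def numa_avg_latency(local_ns, remote_ns, num_dies):
    """Controllers on every die: with accesses spread uniformly,
    1/num_dies of them are local and the rest cross to another die."""
    local_share = 1.0 / num_dies
    return local_share * local_ns + (1.0 - local_share) * remote_ns


def hub_avg_latency(hub_ns):
    """One central controller: every access pays the same hop."""
    return hub_ns


if __name__ == "__main__":
    local_ns, remote_ns, hub_ns = 80.0, 140.0, 100.0  # hypothetical values
    for dies in (1, 2, 4, 8):
        numa = numa_avg_latency(local_ns, remote_ns, dies)
        hub = hub_avg_latency(hub_ns)
        print(f"{dies} dies: per-die controllers avg {numa:.1f} ns, central hub {hub:.1f} ns")
```

With these placeholder values the central hub loses on a single die but stays flat as dies are added, while the per-die average drifts toward the remote latency. Whether real hardware behaves this way depends entirely on the actual numbers.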
Hrm... I don't think we've really had multi die chips since Core 2... and since then, the northbridge has moved off the board onto the chip. Still, creating a separate design for EPYC (or even some Threadripper chips) to work around that performance penalty kinda ruins the scalability of the Zen architecture, and may not perform all that well anyway... cause now you've got X amount of dies trying to communicate with the same northbridge, and thereby the rest of the system, at the same time...
it will be a single design for all Zen, consisting of CPU^8 or CPU+GPU, it's kinda brilliant really
these days virtually no app is optimised for more than 8 cores, so the chiplet unit will have 8 cores, all with low-latency communication via a 32MB L3
Wouldn't it be much better to just make the memory controller modular? Just thinking out loud.
I'm just saying this because I'm not sure more than one memory controller is beneficial at all when you have a multi-CPU setup...
I know... it's a bit out of the box, but yeah
you're right...it will be a single memory controller on the Northbridge that feeds all the CPU chiplets -and that's the beauty of it, the same latency to all chiplets plus a massive L4 cache...
Imho, this type of connectivity between CCXs is only meant for the next EPYC and Threadripper. And for this type of usage it is excellent and ingenious indeed. For Desktop Ryzens my opinion is that they will just improve the already existing connectivity. It is more than enough. And with 8C/16T CCX, most Ryzens will have just one CCX which means no added latency from the IF.
i'd wager the CCX is going to go away altogether and allow each chiplet to have 8 low-latency cores
after all the Northbridge will do most of the memory work
Fabric solutions always create more problems than they solve once they become this complex. The ring bus approach may be simpler and offer more throughput and lower latency if they can get it wide or fast enough.
AMD brought most of this on themselves. Technical issues with Zen, Bulldozer, and other designs, and latency to cache and memory, have never truly been solved over the years, and "add more cores" has always been the solution. They need to build a memory controller for an 8-core part that can be expanded to these insane core and thread counts, where a little added latency in a server workload can be masked by custom software that is aware of the penalties and handles the threads accordingly.
they will have a ring-bus but only for each 8-core chiplet...makes perfect sense, solve the latency issue for what is the standard number of cores whilst keeping it standard
then scale it OR +GPU it depending on platform
completely solves the Threadripper 32 core problems...
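For a feel of why a ring limited to 8 cores keeps latency down, here is a minimal sketch of hop counts on a ring. It assumes a bidirectional ring with one stop per core, which is my assumption for illustration and not something the leak confirms; the function names are hypothetical.

```python
def ring_hops(src, dst, n):
    """Shortest hop count between two stops on a bidirectional ring of n stops."""
    d = abs(src - dst) % n
    return min(d, n - d)


def avg_ring_hops(n):
    """Mean shortest-path hops from one stop to every other stop."""
    if n < 2:
        return 0.0
    return sum(ring_hops(0, dst, n) for dst in range(1, n)) / (n - 1)


if __name__ == "__main__":
    for stops in (4, 8, 16, 32):
        print(f"{stops} stops: worst case {stops // 2} hops, "
              f"average {avg_ring_hops(stops):.2f} hops")
```

The average and worst-case hop counts grow roughly linearly with the number of stops, which is the usual argument for keeping a ring small and scaling out some other way.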
There are two different situations. First, inter-core communication between cores on different dies will require a third die in between. Second, single-threaded performance would be lower because the memory controller won't be on-die; that is why AMD implemented the new Dynamic Local Mode.
every chiplet from Ryzen to EPYC will be the same
8-cores, ringbus, no CCX
massive L3 cache to offset memory latency, likely together with an even more massive L4 cache on the memory controller
as to Threadripper, Dynamic Local Mode goes out the window as the OS just sees 4 equally-balanced CPU NUMA domains
the end result will be similar to the IBM approach except with another 2 layers of cache hierarchy to hide latency...L3 for the chiplets plus an L4 for the Northbridge
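A back-of-the-envelope way to check the "extra cache layers hide latency" idea is the standard average memory access time (AMAT) formula. The sketch below is only that: the `amat` helper is hypothetical, and the latencies and hit rates are placeholders, not figures for any real or rumoured chip.

```python
def amat(levels, dram_ns):
    """Average memory access time for a chain of caches.

    levels: list of (lookup_latency_ns, hit_rate), nearest cache first.
    A miss at one level falls through to the next; misses at the last
    level go to DRAM and pay dram_ns on top.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency_ns, hit_rate in levels:
        total += reach * latency_ns  # everything reaching this level pays its lookup
        reach *= (1.0 - hit_rate)    # only the misses continue outward
    return total + reach * dram_ns


if __name__ == "__main__":
    dram_ns = 100.0  # hypothetical
    l3_only = amat([(10.0, 0.5)], dram_ns)
    with_good_l4 = amat([(10.0, 0.5), (40.0, 0.5)], dram_ns)
    with_poor_l4 = amat([(10.0, 0.5), (40.0, 0.1)], dram_ns)
    print(f"L3 only: {l3_only:.1f} ns")
    print(f"L3 + L4 (50% L4 hits): {with_good_l4:.1f} ns")
    print(f"L3 + L4 (10% L4 hits): {with_poor_l4:.1f} ns")
```

Under these placeholder numbers an L4 only pays off if it hits often enough; with a low hit rate the extra lookup makes the average worse than having no L4 at all.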