AMD Could Solve Memory Bottlenecks of its MCM CPUs by Disintegrating the Northbridge

HTC · Nov 2, 2018

This post from the AnandTech forums needs to be here, as it refers to several AMD patents relating to the supposed Zen 2 chiplet design:

https://forums.anandtech.com/thread...cture-overview？.2554453/page-14#post-39633456

HD64G · Nov 2, 2018

HD64G said:
By separately I mean they have different packaging and layout maybe. And that's exactly the difference between the supposed new layout of the upcoming EPYC and a normal Ryzen if that stays the same in its layout aspect.

Now I get what you were asking. I don't propose different CCX at all. 8C/16T are the ones AMD is supposed to bring forward with the Zen2 arch after all. Layout and packaging is the difference I think they should have between EPYC and TR on the one side and desktop Ryzen CPUs on the other. And if AMD wanted to keep the price for desktop low enough, they should keep desktop Ryzens to max 8C/16T, which can be made using just 1 CCX and thus without any need for IF at all. Latency wouldn't be a problem then. And for the next-gen EPYC and TR CPUs, the small latency cost is reduced versus the existing one by the new IF changes the article refers to.

David Fallaha · Nov 2, 2018

First Strike said:
The only question now is whether the 32MB L3 cache per CCX chip will be present as this leak suggests. It is totally possible that L3 cache all get dumped to the center controller chip. 32MB cache in 7nm is really some cost to consider. And making 8 of them shared and coherent is hard AF. If this is the case (and they use it in MSDT), it's screwed.

Nah, SRAM scales really well with process, it's IMC that doesn't

Thus one of the reasons this is genius

Together with the fact that on desktop and laptop the second (of two) chiplets will be a GPU...

HD64G said:
Now I get what you were asking. I don't propose different CCX at all. 8C/16T are the ones AMD is supposed to bring forward with the Zen2 arch after all. Layout and packaging is the difference I think they should have between EPYC and TR on the one side and desktop Ryzen CPUs on the other. And if AMD wanted to keep the price for desktop low enough, they should keep desktop Ryzens to max 8C/16T, which can be made using just 1 CCX and thus without any need for IF at all. Latency wouldn't be a problem then. And for the next-gen EPYC and TR CPUs, the small latency cost is reduced versus the existing one by the new IF changes the article refers to.

for entry-level, mainstream and mobile they'll keep 8C/16T, make it one chiplet (no CCX), interface to the NB and combine those 2 with a GPU

XiGMAKiD said:
It's a solution that creates more problems

actually, see above, it creates all the solutions...chiplet will be 8 cores with lower internal latency

plus:
-latency to RAM will be even for all MCM solutions
-it's easy to combine with a GPU for mainstream and mobile

hat said:
Hrm... I don't think we've really had multi die chips since Core 2... and since then, the northbridge has moved off the board onto the chip. Still, creating a separate design for EPYC (or even some Threadripper chips) to work around that performance penalty kinda ruins the scalability of the Zen architecture, and may not perform all that well anyway... cause now you've got X amount of dies trying to communicate with the same northbridge, and thereby the rest of the system, at the same time...

it will be a single design for all Zen, consisting of CPU^8 or CPU+GPU, it's kinda brilliant really

these days virtually no app is optimised for more than 8 cores thus the chiplet unit will have 8 cores all with low-latency communications via a 32MB L3

Aomine_Law said:
Wouldn't it be much better to just make the memory controller modular? Just thinking out loud.

I'm just saying this because I'm not sure if more than one memory controller is beneficial at all when you have a multi-CPU setup...

I know... it's a bit out of the box but yeah

you're right...it will be a single memory controller on the Northbridge that feeds all the CPU chiplets -and that's the beauty of it, the same latency to all chiplets plus a massive L4 cache...

HD64G said:
Imho, this type of connectivity between CCXs is only meant for the next EPYC and Threadripper. And for this type of usage it is excellent and ingenious indeed. For Desktop Ryzens my opinion is that they will just improve the already existing connectivity. It is more than enough. And with 8C/16T CCX, most Ryzens will have just one CCX which means no added latency from the IF.

i'd wager the CCX is going to go altogether and allow each chiplet to have 8 low-latency cores

after all the Northbridge will do most of the memory work

Steevo said:
Fabric solutions always create more problems than they solve once they become this complex. The ring bus approach may be simpler and offer more throughput and lower latency if they can get it wide or fast enough.

AMD brought most of this on themselves: technical issues with Zen, Bulldozer, and other designs, and latency to cache and memory, have never truly been solved for years, and "add more cores" has always been the solution. They need to build a memory controller for an 8-core that can be expanded to these insane core and thread counts, where a little added latency in a server workload can be masked by penalty-aware software handling the threads.

they will have a ring-bus but only for each 8-core chiplet...makes perfect sense, solve the latency issue for what is the standard number of cores whilst keeping it standard

then scale it OR +GPU it depending on platform

completely solves the Threadripper 32 core problems...

WikiFM said:
There are two different situations. First, inter-core communication between cores in different dies will require a third die in between. Second, single-threaded performance would be lower because the memory controller won't be on-die, which is why AMD implemented the new Dynamic Local Mode.

every chiplet from Ryzen to EPYC will be the same

8-cores, ringbus, no CCX

massive L3 cache to offset memory latency, likely together with an even more massive L4 cache on the memory controller

as to Threadripper, Dynamic Local Mode goes out the window as the OS just sees 4 equally-balanced CPU NUMA domains

the end result will be similar to the IBM approach except with another 2 layers of cache hierarchy to hide latency...L3 for the chiplets plus an L4 for the Northbridge

WikiFM · Nov 2, 2018

David Fallaha said:
-latency to RAM will be even for all MCM solutions

these days virtually no app is optimised for more than 8 cores thus the chiplet unit will have 8 cores all with low-latency communications via a 32MB L3

you're right...it will be a single memory controller on the Northbridge that feeds all the CPU chiplets -and that's the beauty of it, the same latency to all chiplets plus a massive L4 cache...

they will have a ring-bus but only for each 8-core chiplet...makes perfect sense, solve the latency issue for what is the standard number of cores whilst keeping it standard

completely solves the Threadripper 32 core problems...

every chiplet from Ryzen to EPYC will be the same

8-cores, ringbus, no CCX

massive L3 cache to offset memory latency, likely together with an even more massive L4 cache on the memory controller

as to Threadripper, Dynamic Local Mode goes out the window as the OS just sees 4 equally-balanced CPU NUMA domains

First, customers want lower latency, not even latency. Say in TR2 some cores have 2x the latency of others: do you think people will be happy if Zen 2 brings 1.5x latency on all of them? Customers want 1x or lower on all.
Second, interchiplet communications latency is lower in a true 8 core but memory latency will be higher if memory controller is in another die compared to on-die. One step forward, one step back.
Third, that massive L4 cache will be expensive to manufacture, especially harming the price of CPUs with just 1 chiplet.
Fourth, the main problem with 24/32 TR is that 2 of the 4 dies don't have direct access to the memory controller, now imagine none of them.
Fifth, if MCM is a solution for EPYC, it is not for Ryzen, where costs should remain low (no L4) and lightly threaded performance high, so it is better to have different designs.
Sixth, the OS will see X number of equally-slower domains in Zen 2 compared to TR2.
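
To put rough numbers on the first point, here is a minimal sketch, assuming a four-die part where two dies sit at 1x and two at 2x, compared with a hub layout at 1.5x everywhere. The figures are the hypothetical ones from this post, not measurements, and `summarize` is just an illustrative helper.

```python
from statistics import mean

# Relative memory latency per die (1.0 = a die with its own memory controller).
# The 2x and 1.5x figures are the hypothetical numbers from the post above,
# not measurements.
tr2_like = [1.0, 1.0, 2.0, 2.0]      # two dies local to memory, two dies remote
uniform_hub = [1.5, 1.5, 1.5, 1.5]   # every chiplet reaches memory through the hub


def summarize(latencies):
    """Return (best, mean, worst) relative latency for a list of dies."""
    return min(latencies), mean(latencies), max(latencies)


for name, layout in (("TR2-like", tr2_like), ("uniform hub", uniform_hub)):
    best, avg, worst = summarize(layout)
    print(f"{name:12s} best={best:.1f}x mean={avg:.1f}x worst={worst:.1f}x")
```

With these assumed numbers the mean is the same in both layouts; the hub layout gives up the 1x best case in exchange for a lower worst case, which is the trade-off being argued over.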

David Fallaha · Nov 2, 2018

WikiFM said:
First, customers want lower latency, not even latency. Say in TR2 some cores have 2x the latency of others: do you think people will be happy if Zen 2 brings 1.5x latency on all of them? Customers want 1x or lower on all.
Second, interchiplet communications latency is lower in a true 8 core but memory latency will be higher if memory controller is in another die compared to on-die. One step forward, one step back.
Third, that massive L4 cache will be expensive to manufacture, especially harming the price of CPUs with just 1 chiplet.
Fourth, the main problem with 24/32 TR is that 2 of the 4 dies don't have direct access to the memory controller, now imagine none of them.
Fifth, if MCM is a solution for EPYC, it is not for Ryzen, where costs should remain low (no L4) and lightly threaded performance high, so it is better to have different designs.
Sixth, the OS will see X number of equally-slower domains in Zen 2 compared to TR2.

ok, points taken, but here's a couple of key 'buts'

-latency to eDRAM will be very low and give you your x1 for instructions; data can be prefetched, bandwidth not latency is the issue there

-in fact a large eDRAM has been done before and it's not that expensive: see Iris Pro or the GameCube (1T-SRAM); combine yields @ 14nm plus salvage, e.g. 64MB vs 256MB, and what does it really cost?

-adding all that together why still have such a large L4 on mainstream Ryzen? as above, it's to pair up Zen Chiplet with a Vega GPU -but with proper coherency because it's IF not Intel Graphics

-need to keep it really low cost at the low end? then ok, tweak and rebrand the 2700x into the 3600

CheapMeat · Nov 3, 2018

I think part of what the back & forth is getting stuck on is the thought that they'll do the same for Ryzen products. They probably won't. I really doubt people like WikiFM are buying EPYC systems.

R0H1T · Nov 3, 2018

CheapMeat said:
I think part of what the back & forth is getting stuck on is the thought that they'll do the same for Ryzen products. They probably won't. I really doubt people like WikiFM are buying EPYC systems.

They won't, they'll make at least 2 dies IMO. Just like RR the second one will have an IGP, as for mainstream & notebooks it makes more sense though it's possible that they could end up making 3 dies.

David Fallaha · Nov 3, 2018

R0H1T said:
They won't, they'll make at least 2 dies IMO. Just like RR the second one will have an IGP, as for mainstream & notebooks it makes more sense though it's possible that they could end up making 3 dies.

i'm sure you're right at least in the short term (another 14nm APU) but one way or another businesses need a decent 6- and 8-core CPU with built in GFX, perhaps 7nm will leave enough die size for all this but AMD always seems to make GPU-heavy APUs, not a great business product

will be exciting to see where this goes..looking forward to the event on the 6th!

HTC · Nov 6, 2018

Here's another post that deserves to be here: also spotted this @ Anandtech Forums.

Some tidbits:

More info and pics on the post @ Anandtech forums linked above.

There's also this from another post in the same topic:

https://twitter.com/i/web/status/1059374629939081216

sergionography · Nov 7, 2018

Zubasa said:
Currently the Zen die is composed of two quad-core CCXs, and both are connected to the SOC / NB via Infinity Fabric.
So to access the L3 Cache that is on another die it requires 3 hops, first from the CCX to the local SOC, second to SOC of the other die, then from the other SOC to the CCX with the L3.
On this new layout the number of hops is 2, first to the Central Hub, second to the other CCX where the L3 is located.

What this does, though, is avoid the issues with the 2990WX / 2970WX, where some cores need 3 hops to the memory.
First from CCX to local SOC then to the SOC on the IO Die.
Also the 2-Die Threadripper connects to each other via 2 links of Infinity Fabric, and the 4-Die version only has 1 connection to each die, so half the bandwidth.
If each Zen 2 die also keeps its 2 IF links, it would always have as much if not double bandwidth to the memory, if AMD can keep the IF speed the same as Zen 1.
On Zen 2 each CCX is always 1 hop away from memory, meaning it will have consistent latency across all dies.

For gaming isn't it mostly the maximum latency that cause frame-time issues?
After all the 1% and 0.1% lows are measuring the max frame time between each frame, as the minimum frame time aka Max FPS isn't nearly as important.

Oh that makes sense now! Idk why, but for some reason I was under the impression all L3 cache would remain on the central chip, so all access to L3 would require 1 extended hop. But if I understood correctly, each chip will still retain local L3 cache but simply have a max of 2 hops for reaching out to the other die's L3 when needed, instead of 3?
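
The hop counting above can be checked with a toy graph. This is only a sketch: the node names (`ccx_a0`, `soc_a`, `hub`, `chiplet_0` and so on) and the `hops` helper are made up for illustration, the eight-chiplet count is taken from the rumoured EPYC layout, and the memory controller is assumed to live on the hub node.

```python
from collections import deque


def hops(links, src, dst):
    """Breadth-first search: fewest links between two nodes, or None if unreachable."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Zen 1 style: each die has two CCXs behind a local SOC; the SOCs link to each other.
zen1 = [("ccx_a0", "soc_a"), ("ccx_a1", "soc_a"),
        ("ccx_b0", "soc_b"), ("ccx_b1", "soc_b"),
        ("soc_a", "soc_b")]

# Rumoured layout: every chiplet links straight to one central hub.
hub = [(f"chiplet_{i}", "hub") for i in range(8)]

print(hops(zen1, "ccx_a0", "ccx_b0"))        # remote L3 on the other die: 3
print(hops(hub, "chiplet_0", "chiplet_7"))   # remote L3 via the hub: 2
print(hops(hub, "chiplet_0", "hub"))         # memory controller on the hub: 1
```

It only counts links; it says nothing about how long each hop actually takes.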

Zubasa · Nov 8, 2018

sergionography said:
Oh that makes sense now! Idk why, but for some reason I was under the impression all L3 cache would remain on the central chip, so all access to L3 would require 1 extended hop. But if I understood correctly, each chip will still retain local L3 cache but simply have a max of 2 hops for reaching out to the other die's L3 when needed, instead of 3?

Yes.

AMD could potentially get around this as well.
There is some speculation, based on the huge IO die, that there might be a large L4 cache.
If you have an L4 large enough to maintain a copy of all the data in the 8x L3 caches, then instead of getting the data from another die's L3, you can just fetch it from the L4.
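
As a rough sketch of that idea, assuming the 32MB-per-chiplet L3 from the leak discussed earlier in the thread and 8 chiplets (both speculative), and with `lookup` as a purely illustrative helper rather than how any real cache controller works:

```python
L3_PER_CHIPLET_MB = 32   # assumption: the leaked figure discussed earlier in the thread
CHIPLETS = 8             # assumption: the rumoured EPYC chiplet count

# An L4 that mirrors every L3 needs at least the sum of their capacities.
min_inclusive_l4_mb = L3_PER_CHIPLET_MB * CHIPLETS


def lookup(addr, local_l3, l4):
    """Return where a cache line is served from in this toy model."""
    if addr in local_l3:
        return "local L3"
    if addr in l4:          # inclusive L4: holds a copy of every chiplet's L3 lines
        return "L4 on IO die"
    return "DRAM"


# Two chiplets, each caching one line; the L4 holds copies of both.
l3 = [{0x100}, {0x200}]
l4 = set().union(*l3)

print(min_inclusive_l4_mb)            # 256
print(lookup(0x100, l3[0], l4))       # local L3
print(lookup(0x200, l3[0], l4))       # L4 on IO die (1 hop, not 2 to the other L3)
print(lookup(0x300, l3[0], l4))       # DRAM
```

The toy model ignores coherence: a line modified in another chiplet's L3 would still have to be handled somehow before the L4 copy could be trusted.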
