Tuesday, September 7th 2021
"Zen 3" Chiplet Uses a Ringbus, AMD May Need to Transition to Mesh for Core-Count Growth
AMD's "Zen 3" CCD, or compute complex die, the physical building-block of both its client- and enterprise processors, possibly has a core count limitation owing to the way the various on-die bandwidth-heavy components are interconnected, says an AnandTech report. This cites what is possibly the first insights AMD provided on the CCD's switching fabric, which confirms the presence of a Ring Bus topology. More specifically, the "Zen 3" CCD uses a bi-directional Ring Bus to connect the eight CPU cores with the 32 MB of shared L3 cache, and other key components of the CCD, such as the IFOP interface that lets the CCD talk to the I/O die (IOD).
Imagine a literal bus driving around a city block, picking up and dropping off people between four buildings. The "bus" here resembles a strobe, the buildings resemble components (cores, uncore, etc.), and the bus-stops are ring-stops. Each component has its own ring-stops. To disable components (e.g. for product-stack segmentation), SKU designers simply disable their ring-stops, making the component inaccessible. A bi-directional Ring Bus would see two "vehicles" driving in opposite directions around the city block. The Ring Bus topology comes with limitations of scale, mainly resulting from the latency added by too many ring-stops; this is precisely why coaxial ring topologies faded out in networking.
Intel realized in the early 2010s that it could not scale up CPU core counts on its monolithic processor dies beyond a point using a Ring Bus, and had to innovate the Mesh topology. The Mesh is essentially a Ring Bus with additional points of connectivity between components, making it halfway between a Ring Bus and full interconnectivity (in which each component is directly connected to every other, an impractical solution at scale). AMD's recipe for extreme core-count processors, such as the 64-core EPYC, is to use 8-core CCDs (each with an internal bi-directional Ring Bus) that are networked at the sIOD.
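To make the scaling problem concrete, here is a minimal, purely illustrative Python sketch. It is our own simplification, not a model of AMD's actual fabric: it simply treats every hop between ring-stops as one unit of latency and shows how the average distance on a bi-directional ring grows as stops are added.

```python
# Purely illustrative sketch (our own simplification, not AMD's fabric model):
# treat every ring-stop-to-ring-stop hop as one unit of latency and see how the
# average distance on a bi-directional ring grows as stops are added.

def ring_hops(src: int, dst: int, stops: int) -> int:
    """Shortest hop count between two stops on a bi-directional ring."""
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)

def average_hops(stops: int) -> float:
    """Mean hop count over all distinct (source, destination) pairs."""
    pairs = [(s, d) for s in range(stops) for d in range(stops) if s != d]
    return sum(ring_hops(s, d, stops) for s, d in pairs) / len(pairs)

for stops in (8, 16, 32):
    print(f"{stops:2d} ring-stops -> {average_hops(stops):.2f} average hops")
# 8 stops average ~2.29 hops, 32 stops ~8.26 hops: every added stop is added latency.
```

The takeaway is simply that average distance, and therefore latency, grows linearly with the number of ring-stops, which is the scaling limit described above.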
It's interesting to note here that AMD didn't always use a Ring Bus on its CCDs. Older "Zen 2" chiplets with 4-core CCXs (CPU complexes) used full interconnectivity between four components (i.e. four CPU cores and their slices of the shared L3 cache). This is illustrated by the slide in which AMD mentions the "same latency" for a core accessing every other L3 slice (which wouldn't quite be possible even with a bi-directional Ring Bus), and it begins to explain AMD's rationale behind the 4-core CCX. Eventually the performance benefit of a monolithic 8-core CCX interconnected with a bi-directional Ring Bus won out, and AMD went with this approach for "Zen 3."
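A small, hedged illustration of why "same latency" points at full interconnectivity rather than a ring. This is a toy comparison of our own; the topologies and hop counts are assumptions, not AMD's disclosed figures.

```python
# Toy comparison (assumed topologies, not AMD's disclosed design): a fully connected
# 4-node CCX gives every core a one-hop path to every L3 slice (one latency value),
# while a bi-directional 8-stop ring yields several different hop counts per core.

def fully_connected_distances(nodes: int) -> set[int]:
    # every node reaches every other node over a direct link
    return {1 for _ in range(nodes - 1)}

def ring_distances_from_node0(stops: int) -> set[int]:
    # shortest bi-directional ring distance from stop 0 to every other stop
    return {min(d, stops - d) for d in range(1, stops)}

print(fully_connected_distances(4))   # {1}          -> uniform ("same") latency
print(ring_distances_from_node0(8))   # {1, 2, 3, 4} -> several latency tiers
```

With full interconnectivity there is only one distance class, which is consistent with AMD's "same latency" claim for the 4-core CCX; a ring inevitably produces several.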
For the future, AMD might need to let go of the Ring Bus to scale beyond a certain number of CPU cores per CCD, AnandTech postulates, for the same reason Intel ditched the Ring Bus on its high core-count processors: latency. The CCD of the future could be made up of three distinct stacked dies: the topmost die holding the cache, the middle die the CPU cores, and the bottom die a Mesh interconnect. The next logical step would be to scale this interconnect layer into a silicon interposer with several CPU+cache dies stacked on top.
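For a rough sense of why a mesh scales further, here is a small sketch under the same simplifying assumption as before: unit-cost hops on an idealized square mesh, nothing vendor-specific. It compares average hop counts on a ring and on a 2D mesh as node counts grow.

```python
# Rough sketch (idealized square mesh, unit-cost hops; not any vendor's interconnect):
# average hop count on a bi-directional ring vs. a 2D mesh of the same node count.
from itertools import product

def avg_ring_hops(nodes: int) -> float:
    # distances from one stop to all others; symmetry makes this the overall average
    dists = [min(d, nodes - d) for d in range(1, nodes)]
    return sum(dists) / len(dists)

def avg_mesh_hops(side: int) -> float:
    # Manhattan distance between all distinct node pairs on a side x side mesh
    nodes = list(product(range(side), repeat=2))
    dists = [abs(ax - bx) + abs(ay - by)
             for (ax, ay) in nodes for (bx, by) in nodes if (ax, ay) != (bx, by)]
    return sum(dists) / len(dists)

for side in (3, 4, 6):                      # 9, 16 and 36 nodes
    n = side * side
    print(f"{n:2d} nodes: ring {avg_ring_hops(n):.2f} hops, mesh {avg_mesh_hops(side):.2f} hops")
# Ring distances grow linearly with node count; mesh distances grow roughly with its
# square root, which is why meshes tolerate higher core counts.
```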
Source:
AnandTech
26 Comments on "Zen 3" Chiplet Uses a Ringbus, AMD May Need to Transition to Mesh for Core-Count Growth
But then they did, anyway.
Yes, ring buses stop scaling very quickly beyond a few cores. Intel's mesh has one ugly side effect, though: cores get "stranded" in no man's land, much too far away from the memory landings. The mesh has to be insanely fast to compensate, and that means a metric ton of power overhead and increased latency. (Look at how Skylake's HCC die, even with 10 cores disabled, draws more power than the LCC with only two cores disabled, both running at the same TB3.0 frequencies.)
so focus on increased clocks for those 6 to 12 cores and stop acting like a ***** about more cores.
Their EPYC/Threadripper lineup will be affected if they can't scale up the core count, which is one of their strong points against Intel...
Watch Dr. Ian's video; maybe you will get a better idea.
meh. its all good. i just hope i get a next gen cpu and gpu Fall of 2022. :D then i am retiring for prob 10 years. maybe more.
Nvidia and Intel are both bigger than AMD by a fair margin, and they both understood that ML, AI and the data center are where you can make bank. AMD's resources need to be there; the competition is fierce, and their pockets are still smaller than the competition's.
Don't forget that we got DLSS because data center technology found its way into a consumer product. Intel is taking the same approach.
You can compare this to how motorsport technology ends up making better consumer cars. Gaming doesn't have to exist in a bubble to get better; shared development can be good.
It's an illusion that Intel or AMD will ever design a gaming-specific chip; why would they, when CPUs already slaughter gaming loads in real time? Any half-decent midrange CPU saturates any GPU on the market right now, let alone that there is no market for one even setting that aside. The CPU is general-purpose for MSDT.
Unless the software side really demands "more than 8 cores" plus "ultra-low latency between cores" that desperately, AMD won't have to redesign a >8-core CCX in the near future...
Apart from the latency issue, AMD probably has things like a 144-core CPU in its plans, which could be achieved as 18 chiplets × 8 cores, 12 × 12, or 9 × 16. Imagine the monster interconnect, of any topology, on the I/O die that needs to efficiently connect 18 chiplets to each other, as well as to memory, PCIe and other stuff. In that regard, 12 or 9 chiplets is certainly better than 18.
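As a quick back-of-the-envelope check of the arithmetic in the comment above (these are hypothetical configurations, not anything AMD has announced):

```python
# Hypothetical configurations only (nothing announced by AMD): ways to reach 144
# cores as chiplet_count x cores_per_chiplet, keeping chiplets at 8 to 16 cores each.
TARGET_CORES = 144

configs = [(chiplets, TARGET_CORES // chiplets)
           for chiplets in range(1, TARGET_CORES + 1)
           if TARGET_CORES % chiplets == 0 and 8 <= TARGET_CORES // chiplets <= 16]

for chiplets, cores_per_ccd in configs:
    print(f"{chiplets:2d} chiplets x {cores_per_ccd:2d} cores = {chiplets * cores_per_ccd}")
# 18 x 8, 16 x 9, 12 x 12 and 9 x 16 all work; fewer, fatter chiplets mean fewer
# links the I/O die has to route, which is the commenter's point.
```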
Scaling core connectivity is, and always has been, a compromise. Not sure why Ian Cutress dragged this out today, especially since I had to do a double take to confirm this was a current article and not one from Zen 3's launch last year.