yeah, it does sound like they're trying to "fix" a latency problem with bandwidth again, but the truth is I have DDR3 and it gives me 20-25 GB/s, and nothing is remotely limited by bandwidth, and that's with only dual channel; people with quad channel show it gives them nothing (outside benchmarks). what good will 100 GB/s do if nothing is limited by bandwidth anyway? the math does check out, though: more cores, if they're all working, could probably use more memory bandwidth, but that only helps programs that churn through large data sets on many cores at the same time, and games, on the other hand, don't :x it's why the Core 2 Duo was much faster per clock for games: fast access to a large L2 cache (6 MB in 15 cycles), and even to this day nothing beats it at that. Haswell, which is pretty much the same core Intel is using in its latest generations (only with a DDR4 controller), has a 30-cycle penalty to access its 8 MB cache, so to compensate for that very high latency Intel added a dedicated 256 KB cache (at 12 cycles) in the hope that it would help (it probably does, for smaller data sets of course).
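To make the latency-vs-bandwidth point concrete, here's a rough pointer-chasing sketch in C (the sizes and step count are arbitrary illustrative values, not from any of the numbers above): every load depends on the previous one, so nothing can be overlapped, and the measured time per access is pure latency. Run it with growing working-set sizes and you see the L1/L2/L3/DRAM steps this whole discussion is about.

```c
// Rough pointer-chasing sketch: each load depends on the previous one,
// so prefetching and parallel loads can't hide latency. Run it with
// different working-set sizes to see the L1/L2/L3/DRAM latency steps.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    // Working-set size in bytes from argv, default 8 MB (illustrative).
    size_t bytes = argc > 1 ? strtoull(argv[1], NULL, 10) : 8u << 20;
    size_t n = bytes / sizeof(size_t);
    size_t *chain = malloc(n * sizeof(size_t));

    // Sattolo's algorithm: build a single random cycle so the hardware
    // prefetcher can't predict the next address.
    for (size_t i = 0; i < n; i++) chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
    }

    size_t steps = 100 * 1000 * 1000, idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++) idx = chain[idx]; // dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // Printing idx keeps the compiler from optimizing the loop away.
    printf("working set %zu KB: %.2f ns/access (idx=%zu)\n",
           bytes / 1024, ns / steps, idx);
    return 0;
}
```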
if the IF still has high latency, these are processors that are going to be good for heavy-duty work with large data sets; you'll be able to play games on them, but probably not at Intel-level performance (round 2 of low 1080p performance on Zen).
but this is all theory, nothing is known right now, and I also hope that beyond the bandwidth increase, the latency will also be good this time.
it's strange, but it seems to me the whole industry is going in the same direction: DDR4 has higher latency than DDR3, Haswell processors have more latency than the Core 2 Duo, AMD has more latency in cache and memory, etc.
funny thing is that up to 2008 the trend was the reverse: both Intel and AMD developed integrated memory controllers, with Nehalem and Phenom
Check this out if you haven't already. It's perhaps one of my favourite breakdowns of cache memory, and in regards to Zen/Zen+, watch at 23min. But in short, the latency issues are not as bad as people think if you look at this from the right perspective. Basically you need to look at Zen as having 8MB of L3 cache per CCX rather than 16MB of total L3 cache per chip. Oftentimes it's not as big of an issue, because data from main memory is copied to the L1 and L2 caches only, with the L3 acting as a victim (exclusive) cache that only fills when data is evicted from L2. When the L3 cache is full, the CCX goes back to main memory rather than to the other CCX's L3. Normally that's ok, because the L3 works mostly to support the L2, which is local to each core, so the performance impact is hardly a big deal, especially when the OS scheduler is aware of the memory configuration.
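To make that last point about the scheduler concrete, here's a minimal Linux affinity sketch in C. The assumption that cores 0-3 form one CCX is mine, purely for illustration; the real core-to-CCX mapping varies by SKU (check lscpu or hwloc's lstopo).

```c
// Minimal sketch (Linux, GNU extensions): keep a workload on cores 0-3 so
// its hot data stays in one CCX's local L3. Cores 0-3 = one CCX is an
// assumption for illustration; the real mapping varies by SKU.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    // ... workload whose working set should stay in this CCX's L3 ...
    return NULL;
}

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; cpu++)  // assumed: cores 0-3 = one CCX
        CPU_SET(cpu, &set);

    // Pin the main thread; threads created afterwards inherit the mask,
    // so the scheduler never migrates them to the other CCX.
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```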
Also, in response to some of the other comments in this thread: I'm not exactly sure this has anything to do with gaming performance compared to Intel; that is more down to the slight single-core advantage Intel has on its super-high-clocked models. Otherwise we see AMD Ryzen doing rather excellently on multicore performance, which is, in theory, where you would expect to see a shortcoming.
With the latency and cross-migration issues being highlighted, however, we can now speculate on what AMD can do to offset them and how an IO die fits into all of this:
1. An IO die can simply work as a scheduler that stores data addresses to ensure no redundancy takes place when you have multiple cores and data migration, so even if the latency is higher, the communication remains streamlined and manageable. When you have 4 chips on a module like in Threadripper, each memory controller needs to connect to 3 other chips via different IF links, which is probably why, according to the test in the video, we see latency in line with main memory, indicating a fallback to main memory rather than to the other CCX directly. This implementation would still be NUMA, but would work much better than previous iterations of MCM, allowing for some level of L3 utilization/sharing across all chiplets without always resorting to main memory.
2. The IO chip can also include the memory controller rather than just IF interconnects and schedulers. This would mean the chiplet complex wouldn't really need a NUMA configuration and the latencies would be normalized across all chiplets (see the sketch after this list). I can see this having some drawbacks/trade-offs, but also major cost advantages in terms of the chiplet design. This implementation is the most likely one, because AMD already showed a 1-chiplet CPU that had the IO chip as well, which gives the impression that a single chiplet cannot function without the IO chip.
3. AMD can double the L3 cache and retain the higher modularity per CCX. This keeps the older challenges but gives a larger buffer before needing to reach out to main memory or the other CCX's L3.
4. Make the L3 cache shared between the 2 CCXs on each chiplet, at the cost of added design complexity in the case of a one-CCX Zen 2 implementation (unless a one-CCX design retains the same L3 cache size). However, we already saw the Zen APU (2400G) having 4MB of L3 cache for the 1 CCX it has rather than 8MB, so perhaps this is not a big concern for AMD. This implementation could be used to pair 2 CCXs together without needing to redesign the whole CCX into an 8-core, giving one bigger pool of L3 cache per 8 cores. This means apps using up to 8 cores would naturally be less affected by any latency issues of cross-chip/cross-core migration etc. Do note, though, that I'm ignorant of much of the finer technicalities here, so I'd love some input on this area and whether shared L3 cache local to all 8 cores in 2 CCXs is even possible without a major redesign or using IF links.
5. AMD could combine aspects of all the above, which would practically minimize most or all of the latency issues. One thing we can count on for sure is that the IO chip does a better job of connecting the CCXs and chiplets together without resorting to main memory; otherwise AMD would've stuck with the old design. One thing I do worry about is whether other drawbacks get introduced if the IO die has a unified memory controller for all chips: it would fix the old cross-migration problems by offering consistent latencies, but could in turn increase latencies when an application exceeds all L3 cache and is running out of system memory as well.
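To illustrate the NUMA vs. unified-memory-controller distinction from points 1 and 2, here's a minimal libnuma sketch in C (the 64 MB size is just a placeholder, and node layout obviously depends on the part):

```c
// Minimal libnuma sketch (Linux, compile with -lnuma). On a NUMA part like
// Threadripper (point 1), placing memory on the node local to the running
// core matters; with a single IO-die memory controller (point 2) the system
// would present one uniform node and this placement becomes a no-op.
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }
    int node = numa_node_of_cpu(sched_getcpu());  // node owning our core
    printf("NUMA nodes: %d, running on node %d\n", numa_max_node() + 1, node);

    size_t size = 64 * 1024 * 1024;               // 64 MB working set
    char *buf = numa_alloc_onnode(size, node);    // keep memory die-local
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    // ... work on buf from threads pinned near this node ...
    numa_free(buf, size);
    return 0;
}
```

Under option 2 this whole dance goes away, which is exactly the "normalized latencies" upside described above, with the possible downside that nothing is ever closer than the IO die.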