Monday, March 6th 2017

AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

AMD's Ryzen 7's lower-than-expected performance in some applications seems to stem from one particular problem: memory. Even before AMD's Ryzen chips were out, reports pegged AMD as having confirmed that most of the tweaking and programming for the new architecture had gone into maximizing core performance - at the expense of memory compatibility and performance. Apparently, at least until AMD's Ryzen line-up is completed with the upcoming Ryzen 5 and Ryzen 3 processors, the company will be hard at work improving Ryzen's cache handling and memory latency.

Hardware.fr has done a pretty good job of exploring, via AIDA64, the cache and memory subsystem deficiencies in what would otherwise be an exceptional processor design. Namely, the fact that there seems to be some problem with Ryzen's L3 cache and memory subsystem implementation. Paired with the same memory configuration and running at the same 3 GHz clocks, for instance, Ryzen's memory tests show latency results that are up to 30 ns higher (at around 90 ns) than the average latency found on Intel's i7-6900K or even AMD's FX-8350 (both at around 60 ns).
Update: The lack of information regarding the test system may have left some gray areas in the interpretation of the results. Hardware.fr's tests, and the results below, were obtained by setting both 8-core chips to 3 GHz, with SMT and Hyper-Threading deactivated. Memory for the Ryzen and Intel platforms was DDR4-2400 with 15-15-15-35 timings, and memory for the AMD FX platform was DDR3-1600 with 9-9-9-24 timings. Both memory configurations used 4x 4 GB modules, for 16 GB in total.

Further test results show that Intel's L1 cache is still leagues ahead of AMD's implementation; that AMD's L2 is overall faster than Intel's, though it incurs a roughly 2 ns latency penalty; and that AMD's L3 cache trails Intel's in every metric except L3 cache copies, with latency almost 3x higher than on Intel's 6900K.
The problem is revealed by increasing the working-set size. In the case of the 6900K, which has a 32 KB L1 cache, performance is highest up to that working-set size. Larger working sets that don't fit in the L1 cache then "spill" into the 6900K's 256 KB L2 cache; working sets above 256 KB and below 16 MB are served by the 6900K's 20 MB L3 cache, and anything larger forces the processor out to main system memory, with access times climbing until they reach the RAM's ~70 ns.
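To make the mechanism concrete, this is roughly how such a latency sweep can be done: build a randomly ordered pointer chain of a given size, walk it, and divide the elapsed time by the number of hops. The sketch below is a generic illustration in C, not AIDA64's or Hardware.fr's actual code; the buffer sizes, hop count, and the use of clock_gettime() are arbitrary choices.

```c
/* Generic working-set latency sweep (illustrative only, not AIDA64's code).
 * Each load depends on the previous one, so the average time per hop
 * approximates the load-to-use latency of whichever level of the memory
 * hierarchy the working set currently fits in.
 * Build: cc -O2 -o chase chase.c                                          */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(const size_t *chain, size_t hops)
{
    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++)
        idx = chain[idx];                      /* serial dependency chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = idx; (void)sink;    /* keep the loop alive     */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / hops;
}

int main(void)
{
    /* sweep from 16 KB (fits in L1) up to 64 MB (spills to DRAM) */
    for (size_t bytes = 16u << 10; bytes <= (size_t)64 << 20; bytes <<= 1) {
        size_t n = bytes / sizeof(size_t);
        size_t *chain = malloc(n * sizeof(size_t));
        if (!chain) return 1;
        /* Sattolo's algorithm: a random single-cycle permutation, so the
         * walk touches every slot and hardware prefetchers can't help.   */
        for (size_t i = 0; i < n; i++) chain[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
        }
        printf("%8zu KB : %6.1f ns per load\n", bytes >> 10,
               chase(chain, 1u << 24));
        free(chain);
    }
    return 0;
}
```

On a healthy cache hierarchy the printed latency forms clear plateaus at each cache level before settling at DRAM latency; the curves discussed below are essentially this kind of sweep.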
On AMD's Ryzen 7 1800X, however, latency behaves very differently. Everything is fine in the L1 and L2 caches (32 KB and 512 KB, respectively). Moving into the 1800X's 16 MB L3 cache, though, the behavior changes completely. Up to 4 MB of cache utilization we see the expected increase in latency; beyond that, latency goes through the roof well before the chip's 16 MB of L3 is filled. This clearly derives from Ryzen's modularity: each CCX (made up of 4 cores and 8 MB of L3 cache, besides all the other duplicated logic) has fast access only to its own 8 MB slice of L3 at any point in time.
The difference in access speeds between 4 MB and 8 MB working sets can be explained by AMD's own admission that Ryzen's core design incurs different access times depending on which parts of the L3 cache a CCX accesses. The fact that this cache is "mostly exclusive" - meaning it can hold data that isn't of immediate use to the task at hand - can account for some of those accesses on its own. Since the L3 is essentially a victim cache, filled with data that no longer fits in the chip's L1 or L2 levels, this means each CCX can only access up to 8 MB of L3 cache when a workload uses no more than 4 cores from that CCX. And even if we distributed a workload between cores of both CCXs so as to reach the entirety of the 1800X's 16 MB of cache, we would still be constrained by the inter-CCX bandwidth of AMD's Data Fabric interconnect: 22 GB/s, which is much lower than the L3 cache's 175 GB/s and even lower than RAM bandwidth. The fact that the Data Fabric also has to carry traffic from AMD's I/O Hub PCIe lanes potentially eats further into that (already meagre) bandwidth.
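One way to see the cross-CCX penalty directly is to bounce a single cache line between two pinned threads and compare a same-CCX pairing with a cross-CCX pairing. The Linux sketch below is purely illustrative (it is not Hardware.fr's methodology) and assumes that, with SMT off, the OS enumerates CCX0 as cores 0-3 and CCX1 as cores 4-7; the real mapping can differ by board and BIOS.

```c
/* Cache-line ping-pong between two pinned threads (illustrative only).
 * Run it twice: once with both cores on the same CCX, once with the cores
 * on different CCXs; the round-trip time should jump when the line has to
 * cross the Data Fabric.
 * Build: gcc -O2 -pthread -o pingpong pingpong.c                          */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROUNDS 1000000

static _Alignas(64) _Atomic int flag = 0;      /* the bounced cache line */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg)
{
    pin_to_core(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1) ;      /* wait for ping  */
        atomic_store(&flag, 0);                /* send pong back */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int core_a = argc > 1 ? atoi(argv[1]) : 0;
    int core_b = argc > 2 ? atoi(argv[2]) : 4; /* assumed: a core on the other CCX */
    pthread_t t;
    struct timespec t0, t1;

    pthread_create(&t, NULL, pong, &core_b);
    pin_to_core(core_a);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&flag, 1);                /* ping            */
        while (atomic_load(&flag) != 0) ;      /* wait for answer */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cores %d <-> %d : %.1f ns per round trip\n", core_a, core_b, ns / ROUNDS);
    return 0;
}
```

Running it as ./pingpong 0 1 and then ./pingpong 0 4 should show a noticeably higher round-trip time for the second pairing, if the assumed 0-3/4-7 core mapping holds.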

AMD's Zen architecture is certainly an interesting beast, and these results show just how much give-and-take went into designing a cost-effective, scalable, and at the same time performant architecture around its CCX modules. This behavior may also go some way toward explaining Ryzen's lower-than-expected gaming performance, since games are well known to be sensitive to a processor's cache performance profile.
Source: Hardware.fr

120 Comments on AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

#51
Grings
I'm seeing too many badly theorycrafted reasons for that bad gaming performance (which disabling SMT fixes).
Posted on Reply
#53
geon2k2
r9Missing the point there. It can be 2 threads and still bottleneck if the software tries to move a thread from CCX0 to CCX1.
Which is something that games and OSes do quite often to balance load among cores.
Doing that means the data has to move from CCX0's L3 cache to CCX1's L3 cache, which causes the bottleneck because of the slow inter-CCX interconnect.
The solution should be in sight: they just need to make the Windows scheduler aware of the design and keep a thread on the CCX it originates from.
That way it eliminates moving data between the L3 caches of the two modules.

This could hopefully be confirmed by benching a game that doesn't use more than 4 threads, with SMT and one of the CCXs on the Ryzen 7 disabled.
That eliminates all the above scenarios.
If this is the case, why on earth didn't AMD just send an email to Microsoft to modify the scheduler the way they wanted before the launch, or better yet, why didn't they release a driver? In the old days there was a driver for the Athlon X2 called the Dual-Core Optimizer.
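Until the scheduler learns about CCXs, that experiment can also be approximated from user space by restricting a process to one CCX's logical processors. A minimal Win32 sketch, assuming the first CCX shows up as logical processors 0-7 with SMT on (0-3 with SMT off); the real enumeration may differ:

```c
/* Restrict the current process to the logical processors of one CCX so the
 * scheduler cannot migrate its threads across the Data Fabric.
 * The 0xFF mask assumes CCX0 = logical processors 0-7 with SMT on;
 * use 0x0F for 4 cores with SMT off. The actual mapping may differ.       */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR ccx0_mask = 0xFF;

    if (!SetProcessAffinityMask(GetCurrentProcess(), ccx0_mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Process pinned to CCX0 (mask 0x%llx)\n", (unsigned long long)ccx0_mask);

    /* ... launch or run the workload from here ... */
    return 0;
}
```

The same restriction can be applied without any code from Task Manager's "Set affinity" dialog, or with start /affinity FF game.exe from a command prompt.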
Posted on Reply
#54
Kanan
Tech Enthusiast & Gamer
geon2k2If this is the case, why on earth didn't AMD just send an email to Microsoft to modify the scheduler the way they wanted before the launch, or better yet, why didn't they release a driver? In the old days there was a driver for the Athlon X2 called the Dual-Core Optimizer.
Too busy optimizing Ryzen. ;)
Posted on Reply
#55
mouacyk
KananToo busy optimizing Ryzen. ;)
And not enough time beta-testing their new CPU with production software/games to see how they perform in the real world? Engineers and their clean rooms... puhhh!
Posted on Reply
#56
TheLaughingMan
geon2k2If this is the case, why on earth didn't AMD just send an email to Microsoft to modify the scheduler the way they wanted before the launch, or better yet, why didn't they release a driver? In the old days there was a driver for the Athlon X2 called the Dual-Core Optimizer.
Well, that was different. That driver was created to work around an issue with older game titles that were programmed to run on one thread only but left the scheduling up to Windows. *My memory is a little fuzzy*, but I recall that without the optimizer the two cores would occasionally get into a race condition that would stall the game and sometimes crash it. The optimizer fixed that issue and allowed games and programs that didn't care whether there was more than one core to run smoothly.

As I understand it**

Now the issue seems to be that tasks migrating from core to core lose their data locality. If a task was on core 1 of CCX 1 and gets picked up by core 7 on CCX 2, all of the data related to that task in the first L3 cache is effectively gone. At that point the chip can either copy the data over from CCX 1's L3 through the Data Fabric (22 GB/s, shared with the memory controller), or let it build up naturally as the task starts from scratch, filling the L1 and L2 and finally the L3 as needed. Now multiply that happening a thousandfold, because a game doesn't care which core is handling the math. This is what is potentially causing the drop in performance in some games.

What they have to do is either limit which logical cores can pick up a task or rework how the CPU handles these kinds of data shifts. Improving the Data Fabric bandwidth, unifying the L3 into a true single 16 MB cache, or fixing the memory latency issue would have to wait for future generations of the chip.
Posted on Reply
#57
Ramin Rostami
The above is wrong and fake.

The architecture and design of the Ryzen CPU are completely different from Intel's.

There are many reasons, for example:

The L3 cache in the 6900 is one unit, but Ryzen's L3 cache is 16 units (one for each block).
Posted on Reply
#58
TheGuruStud
This stuff is cute. I already saw a cache benchmark where ryzen annihilated Intel's speeds (almost 2x sometimes). There was some latency issue, but it looked nothing like this goofball crap.
Posted on Reply
#59
Steevo
Ramin RostamiThe above is wrong and fake.

The architecture and design of the Ryzen CPU are completely different from Intel's.

There are many reasons, for example:

The L3 cache in the 6900 is one unit, but Ryzen's L3 cache is 16 units (one for each block).
Please don't repost. Also, making your post bold will do nothing. Source of your information would be good, otherwise it will be ignored and continued reposting may anger the ban stick gods.


Lastly, why would AMD make their cache non-contiguous when that would create extra overhead and slow down read and write speeds, since it would have to wait for the data lines to be cleared (grounded or terminated to make sure no residual voltage creates a false bit) between each segment being read or written? Cache works on the same physics as all other memory systems.
Posted on Reply
#60
nem..
*Article translated

Nothing better than getting up on a Sunday and continuing to talk about Ryzen. We already know the platform is plagued with RAM problems: manufacturers have released their motherboards with beta BIOSes that are causing numerous issues with RAM, and we now know that these problems greatly affect CPU performance.

These problems, present across all board makers such as MSI, Gigabyte, ASRock and Asus, have to do with the RAM, which struggles to run at its rated speed; depending on the motherboard, it may only work at 1866 MHz or 2400 MHz (as in our case) instead of the 3400 MHz the memory is rated for. And memory speed has a big influence on final performance: in Geekbench's single-core test, going from 2133 MHz to 3466 MHz memory brings no less than 10 percent extra performance.

The same thing happens in games. In The Witcher 3, with a GeForce GTX 1080, Full HD resolution and maximum graphics quality (Hairworks off), the game reaches 92.5 FPS with DDR4 RAM @ 2133 MHz, while at 3200 MHz we gain 14.9 FPS and reach 107.4 FPS.

This is why the first Ryzen reviews all score about the same when using the same boards or memory modules; we won't see the actual performance of these CPUs until motherboard manufacturers solve these problems. Unlike with Intel, memory speed does make a large difference in performance.

thnx google translate xd
Source:
www.eteknix.com/memory-speed-large-impact-ryzen-performance/
elchapuzasinformatico.com/2017/03/la-velocidad-la-ram-fuerte-impacto-rendimiento-amd-ryzen/ ..
Posted on Reply
#61
Grings
Ramin RostamiThe above (high latency on Ryzen's L3 cache) is 100% wrong and fake.

The architecture and design of the Ryzen CPU are completely different from Intel's.

There are many reasons, for example:

The L3 cache in the 6900 is one unit, but Ryzen's L3 cache is 16 units (one for each block).
This is a forum, not a scrolling live chat; we can see your first (and second) posts already.
Posted on Reply
#62
Steevo
Ramin RostamiSorry, I am new here.

In Ryzen the L3 cache is not shared as one unit (like Intel's).
In Ryzen the L3 cache is 16 units, so reading from 16 different units takes longer than reading from one.
But the total size on Ryzen is 16 times more than on the Intel 6900.

I hope I can explain and you can understand the reason for the above issue.
I already explained it. Regardless of the memory technology, what matters is:

1) Data lines (the physical traces or wires that allow data to be sent or received) and how many there are - and we have no idea, as the architecture hasn't been detailed to the public.
2) Clock rate (how fast the data is accessed) - again, we don't know the specifics, but by extrapolating data and using exploratory data sets we can gain some insight.
3) Latency (how many clock cycles pass between a read request and the actual data showing up on the data lines) - we can figure this out if we know other things about the processor, or by letting a program copy data sets of differing sizes into a core or cache, reading them back, and counting the clock cycles in between.


What we don't have access to yet are a few key components and some software control that would let us force data into one cache and copy between caches, to determine the extra latency introduced by the transfer.

1 or 16 does NOT matter.
Posted on Reply
#63
ender79
OK, so in fact the Ryzen R7 is AMD's vision of an Intel Core 2 Quad.
You ask AMD for an octa-core processor and what we get is a dual quad-core.
The design compromise of two CCXs interconnected through an internal bus is a crime for gaming; maybe 10 years ago it would have been a clear innovation.

Anyway, this design can't be fixed by a BIOS update. Even with more accurate values from the AIDA64 beta and lower latency, gaming on the 8-core Ryzen will suffer. Games can be patched to run the main thread on one CCX and the AI and other threads on the second CCX, but that won't fix the gaming experience 100%.

The 4-core Ryzen probably won't suffer from the same issue (even though, if I read the graph correctly, the problem also appears above 4 MB of L3), and the 4-core Ryzen will be the best option for budget gamers.
Posted on Reply
#64
geon2k2
My understanding from hardware.fr is that the CCX interconnect runs at the same frequency as the memory, and somehow the bandwidth is shared between inter-module communication and memory access.

This is the reason higher memory frequency provides much better results: the bandwidth for inter-module communication increases with frequency. From 2133 to 3200, the bandwidth for internal communication rises from 34 GB/s to 51 GB/s, and that's why the Witcher 3 benchmark posted above scales so well. It's not necessarily the faster memory itself, which by itself has little impact as we've seen numerous times, but the fact that communication between the modules speeds up drastically with higher memory frequency.

I'm waiting with more interest for the R5 1400X. The 1600X (6-core) will probably be plagued by the same issue.
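(For what it's worth, those two figures are consistent with a link moving 32 bytes per fabric clock, with the fabric clock equal to the memory clock, i.e. half the DDR4 transfer rate: 2133 / 2 ≈ 1066 MHz × 32 B ≈ 34 GB/s, and 3200 / 2 = 1600 MHz × 32 B ≈ 51 GB/s. The 32-byte width is an inference from those numbers rather than something AMD has documented here.)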
Posted on Reply
#65
qubit
Overclocked quantum bit
RaevenlordNamely, the fact that there seems to be some problem with Ryzen's L3 cache and memory subsystem implementation.
This seems to be the core of the problem and I'm not sure if a microcode update can fix this. If not, then a Ryzen version 2 silicon will be needed, which is a shame since it gives Intel the chance to stay one step ahead of them again. The best competition will happen if AMD can leapfrog Intel and that hasn't happened since 2005 with the Athlon64.

I hope AMD give customers a generous trade-in program if such a revision is released. This will significantly boost confidence in the brand and make customers feel well looked after.
Posted on Reply
#66
Steevo
Without further testing (what's the penalty matrix for a thread changing cores?*) we are all guessing; educated guesses for the most part, but guessing.

Thread handling is primarily the job of the OS, unless executables are updated to handle their own scheduling; most just check how many cores are available to decide how many threads to spawn and go no further. But as with Bulldozer, if the OS assigns threads with the CCX architecture in mind there will be little or no penalty, as a thread won't move unless it can benefit from the move.

* For example: a CCX0 C1 --> CCX1 C2 move costs 46 cycles of wait time, but CCX0 C0 to CCX0 C1 only costs 22 cycles. How many wasted cycles are there between every CCX and every core in every CCX? Does the penalty increase or stay the same depending on whether other CCXs/cores are busy?


More information is needed before we can make broad statements; a better and more efficient thread handling/dispatching algorithm may improve performance a lot.
Posted on Reply
#67
TheLaughingMan
These are not the chips for pure gamers either way. We need to wait and see what the Ryzen 5 1400 and 1500 have to offer as the 4-core / 8-thread chips. Will they clock higher? Will they scale better in performance? They will also have the benefit of being released on a platform that has had some time to mature and work out some of these kinks.
Posted on Reply
#68
mastrdrver
qubitThis seems to be the core of the problem and I'm not sure if a microcode update can fix this. If not, then a Ryzen version 2 silicon will be needed, which is a shame since it gives Intel the chance to stay one step ahead of them again. The best competition will happen if AMD can leapfrog Intel and that hasn't happened since 2005 with the Athlon64.

I hope AMD give customers a generous trade-in program if such a revision is released. This will significantly boost confidence in the brand and make customers feel well looked after.
And Windows apparently has problems allocating system resources correctly for Ryzen:

forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-2#post-38770630
Posted on Reply
#69
Renald
To add some info on this test: Hardware.fr says in the article that they know perfectly well that AIDA64 is not yet suited to Ryzen, but they ran it anyway to have a baseline and to show the real problem, which is demonstrated in a graph later in the article.

After confirming that there was a problem with the L3 cache, they ran a custom-made program intended to increase cache usage incrementally and measure latency. To be clear, they stated that the program's creator had not validated it on Ryzen before it was used.

The results in ns shown in green come from this custom-made test.
To sum up: as soon as a thread is moved from one core to another, its cached data is immediately pushed behind this high-latency path, creating a huge drop in performance. The CCX architecture is not helping, and neither is the bandwidth between the CCXs.

Since then, they have been trying to isolate this behavior to prove Ryzen's sensitivity to core parking (which is still not proven, or at least not "that simple"). Not an easy task, it seems, because of the Windows 10 scheduler. On top of that, many manufacturers are still not ready and are messing around with unsuitable drivers and patches... Kinda annoying :/

Sorry for my poor English, and HFR team, just whip me if I'm saying something wrong. I'm still (and you too, as I understand it) not quite sure what the real underlying problem is in the end ;)
Posted on Reply
#70
Wark0
nem..*Article translated

Nothing better than getting up on a Sunday and continuing to talk about Ryzen. We already know the platform is plagued with RAM problems: manufacturers have released their motherboards with beta BIOSes that are causing numerous issues with RAM, and we now know that these problems greatly affect CPU performance.

These problems, present across all board makers such as MSI, Gigabyte, ASRock and Asus, have to do with the RAM, which struggles to run at its rated speed; depending on the motherboard, it may only work at 1866 MHz or 2400 MHz (as in our case) instead of the 3400 MHz the memory is rated for. And memory speed has a big influence on final performance: in Geekbench's single-core test, going from 2133 MHz to 3466 MHz memory brings no less than 10 percent extra performance.

The same thing happens in games. In The Witcher 3, with a GeForce GTX 1080, Full HD resolution and maximum graphics quality (Hairworks off), the game reaches 92.5 FPS with DDR4 RAM @ 2133 MHz, while at 3200 MHz we gain 14.9 FPS and reach 107.4 FPS.

This is why the first Ryzen reviews all score about the same when using the same boards or memory modules; we won't see the actual performance of these CPUs until motherboard manufacturers solve these problems. Unlike with Intel, memory speed does make a large difference in performance.

thnx google translate xd
Source:
www.eteknix.com/memory-speed-large-impact-ryzen-performance/
elchapuzasinformatico.com/2017/03/la-velocidad-la-ram-fuerte-impacto-rendimiento-amd-ryzen/ ..
Oh no... not this thing again... these figures are from MSI, which reused the same benchmark from an older MSI slide with the benchmark run on... Z270... :lol:

Posted on Reply
#72
Renald
mastrdrverAnd Windows apparently has problems allocating system resources correctly for Ryzen:

forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-2#post-38770630
(Sorry for double post)
They looked into it at Hardware.fr and said the guy's claim isn't accurate, and that cores 0/1/2/3 share 8 MB and cores 4/5/6/7 share another 8 MB.
www.hardware.fr/articles/956-1/amd-ryzen-7-1800x-test-retour-amd.html ==> Page 80 if you speak french

This line : "****---- Unified Cache 1, Level 3, 8 MB, Assoc 16" Says : Core 0/1/2/3 share 8MB of L3
Or here "**------ Unified Cache 0, Level 2, 512 KB, Assoc 8" Says : Core 0/1 share 512KB of L2
and so on.

So it's a crappy analysis, and they are pretty technical guys, so they wouldn't say it's crap unless it really is.
Posted on Reply
#73
Wark0
RenaldSince then, they have been trying to isolate this behavior to prove Ryzen's sensitivity to core parking (which is still not proven, or at least not "that simple"). Not an easy task, it seems, because of the Windows 10 scheduler. On top of that, many manufacturers are still not ready and are messing around with unsuitable drivers and patches... Kinda annoying :/
Ryzen with SMT is not more sensitive to core parking than Intel with SMT. Neither likes a scheduler with core parking.

The fact is that on Windows 10, core parking is OFF in Balanced mode with an Intel CPU (BDW-E or SKL, for example), and ON with Ryzen. So you have to switch it off manually in order to compare apples to apples.
Posted on Reply
#74
C_Wiz
hardware.fr
RaevenlordI can compare them between your own results, which were all done with the same configuration between the 6900K and the 1800X, right? That's what I compare in the article.
Yes, our values are comparable; you got that right, don't worry.

I was talking to someone else at the time who was comparing that value (98 ns) to AIDA64 stock values, which was not OK because the clocks were different ;).
RaevenlordTime isn't as we would like, hence why only now I'm here and improving the article.
Yeah we've been working non-stop on Ryzen post launch, trying to figure out the issues we were seeing, time is way too short and sleep levels way too low :)
RaevenlordFor me, that was the whole point of the post. AIDA 64 is a benchmarking utility, but until it has been "fixed", as in, properly optimized for Ryzen, I think it presents itself as a great opportunity to see Ryzen's behavior on non-optimized workloads (ie, what all games currently are).
Yeah, again, to be 100% clear: we double-checked with FinalWire what was and wasn't accurate according to them. Our deep dive is exactly about trying to work out why some of the reported values aren't accurate, and going from there. It's by starting to track this down that we got to the CCX interconnect bandwidth limitation, the way the split caches are handled in single-threaded work, etc.

Thanks for fixing your summary !

G.
Posted on Reply
#75
C_Wiz
hardware.fr
GringsI'm seeing too many badly theorycrafted reasons for that bad gaming performance (which disabling SMT fixes).
We actually covered that in an update on Sunday, if you are interested: www.hardware.fr/articles/956-8/retour-smt-mode-high-performance.html

Long story short, some Windows 10 Anniversary Update scheduler settings aren't set the same way for Ryzen and Intel CPUs. We tested that and updated our article accordingly.
geon2k2My understanding from hardware.fr is that the CCX interconnect runs at the same frequency as the memory, and somehow the bandwidth is shared between inter-module communication and memory access.

This is the reason higher memory frequency provides much better results: the bandwidth for inter-module communication increases with frequency. From 2133 to 3200, the bandwidth for internal communication rises from 34 GB/s to 51 GB/s, and that's why the Witcher 3 benchmark posted above scales so well. It's not necessarily the faster memory itself, which by itself has little impact as we've seen numerous times, but the fact that communication between the modules speeds up drastically with higher memory frequency.
Actually, the Witcher 3 "bench" is from an MSI/Intel advert (if I remember correctly).

But your overall point is exactly correct: the Data Fabric clock is set with the memory clock (so DDR4-2400 = a 1200 MHz clock for that bus), so if you are limited there, you'll see a compounding effect from pushing memory higher.
Posted on Reply