AMD's Ryzen Cache Analyzed - Improvements; Improveable; CCX Compromises

Raevenlord · Mar 6, 2017

AMD's Ryzen 7 lower-than-expected performance in some applications seems to stem from one particular problem: memory. Before AMD's Ryzen chips were even out, reports had AMD confirming that most of the tweaks and programming for the new architecture had been done to maximize core performance - at the expense of memory compatibility and performance. Apparently, until the Ryzen line-up is completed with the upcoming Ryzen 5 and Ryzen 3 processors, the company will be hard at work on improving Ryzen's cache handling and memory latency.

Hardware.fr has done a pretty good job of exploring, through the use of AIDA64, the cache and memory subsystem deficiencies in what would otherwise be an exceptional processor design. Namely, there seems to be some problem with Ryzen's L3 cache and memory subsystem implementation. Paired with the same memory configuration and at the same 3 GHz clocks, for instance, Ryzen's memory tests show latency results that are up to 30 ns higher (at 90 ns) than the latency found on Intel's i7 6900K or even AMD's FX 8350 (both at around 60 ns).

Update: The lack of information regarding the test system may have left some gray areas in the interpretation of the results. Hardware.fr's tests, and the results below, were obtained by setting the 8-core chips to 3 GHz, with SMT and HT deactivated. Memory for the Ryzen and Intel platforms was DDR4-2400 with 15-15-15-35 timings, and memory for the AMD FX platform was DDR3-1600 operating at 9-9-9-24 timings. Both memory configurations were set at 4x 4 GB, totaling 16 GB of memory.

From some more testing results, we see that Intel's L1 cache is still leagues ahead of AMD's implementation; that AMD's L2 is overall faster than Intel's, though it does incur a roughly 2 ns latency penalty; and that AMD's L3 cache is very much behind Intel's in all metrics but L3 cache copies, with latency being almost 3x greater than on Intel's 6900K.

The problem is revealed through an increasing workload size. In the case of the 6900K, which has a 32 KB L1 cache, performance is highest for workloads up to that size. Larger workloads that don't fit in the L1 cache then "spill" into the 6900K's 256 KB L2 cache; workloads larger than 256 KB and smaller than 16 MB are served by the 6900K's 20 MB L3 cache, and any workload larger than 16 MB forces the processor to access main system memory, with access latency increasing until it reaches the RAM's ~70 ns.

However, on AMD's Ryzen 1800X, latency is a wholly different beast. Everything is fine in the L1 and L2 caches (32 KB and 512 KB, respectively). However, when moving into the 1800X's 16 MB L3 cache, the behavior is completely different. Up to 4 MB of cache utilization, we see the expected increase in latency; however, latency goes through the roof well before the chip's 16 MB of L3 cache is completely filled. This clearly derives from Ryzen's modularity, with each CCX (made up of 4 cores and 8 MB of L3 cache, besides all the other duplicated logic) being able to access only 8 MB of L3 cache at any point in time.
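The "spill" behavior described above can be sketched as a simple capacity lookup. This is only a toy model under stated assumptions: it uses the cache sizes quoted in the article, treats the 1800X's L3 as the 8 MB slice local to one CCX, and ignores exclusivity, associativity and prefetching. The names `serving_level`, `I7_6900K` and `RYZEN_1800X_ONE_CCX` are made up for illustration and do not come from Hardware.fr or AIDA64.

```python
KB = 1024
MB = 1024 * KB

# Capacities as quoted in the article. The 1800X L3 entry is the 8 MB
# slice local to one CCX (an assumption of this model), not the 16 MB total.
I7_6900K = [("L1", 32 * KB), ("L2", 256 * KB), ("L3", 20 * MB)]
RYZEN_1800X_ONE_CCX = [("L1", 32 * KB), ("L2", 512 * KB), ("L3", 8 * MB)]


def serving_level(working_set, hierarchy):
    """Return the first cache level large enough to hold the working set."""
    for name, capacity in hierarchy:
        if working_set <= capacity:
            return name
    # Nothing local fits: the access goes to a remote CCX or to RAM.
    return "beyond L3"


if __name__ == "__main__":
    for size in (16 * KB, 384 * KB, 6 * MB, 12 * MB):
        print(
            f"{size // KB:>6} KB  6900K: {serving_level(size, I7_6900K):<10}"
            f" 1800X (one CCX): {serving_level(size, RYZEN_1800X_ONE_CCX)}"
        )
```

In this model a 12 MB working set still fits in the 6900K's L3 but already falls outside a single CCX's 8 MB slice, which is the region where the article reports latency climbing on the 1800X.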

The difference in access speeds between 4 MB and 8 MB workloads can be explained through AMD's own admission that Ryzen's core design incurs different access times depending on which parts of the L3 cache are accessed by the CCX. The fact that this cache is "mostly exclusive" - which means that other information may be stored in it that's not of immediate use to the task at hand - can be responsible for some memory accesses on its own. Since the L3 cache is essentially a victim cache, meaning that it is filled with the information that doesn't fit in the chip's L1 or L2 cache levels, a workload that uses no more than the 4 cores of a single CCX can only access up to 8 MB of L3 cache. However, even if we were to distribute the workload across cores from both CCXs, so as to access the entirety of the 1800X's 16 MB of cache, we'd still be constrained by the inter-CCX bandwidth of AMD's Data Fabric interconnect: 22 GB/s, which is much lower than the L3 cache's 175 GB/s - and even lower than RAM bandwidth. That the Data Fabric interconnect also has to carry data from AMD's IO Hub PCIe lanes potentially interferes further with the (already meagre) available bandwidth.
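To put the two bandwidth figures side by side, here is a back-of-the-envelope sketch. It assumes both numbers are sustained rates in decimal GB/s and ignores latency and contention entirely; the helper name `transfer_time_us` is hypothetical.

```python
MIB = 1024 * 1024

# Figures quoted in the article, assumed to be decimal GB/s.
DATA_FABRIC_GBPS = 22
L3_GBPS = 175


def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Idealized time in microseconds to move num_bytes at the given rate."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


if __name__ == "__main__":
    payload = 8 * MIB  # one CCX's worth of L3
    fabric = transfer_time_us(payload, DATA_FABRIC_GBPS)
    local = transfer_time_us(payload, L3_GBPS)
    print(f"via Data Fabric: {fabric:.1f} us")
    print(f"via local L3:    {local:.1f} us")
    print(f"ratio:           {fabric / local:.2f}x")
```

Under these assumptions, moving the same data across the Data Fabric takes roughly 8 times as long as reading it from the local L3, before any PCIe traffic sharing the link is taken into account.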

AMD's Zen architecture is surely an interesting beast, and these kinds of results really go to show the amount of work, of give-and-take design, that AMD had to go through in order to achieve a cost-effective, scalable, and at the same time performant architecture through its CCX modules. However, this kind of behavior may even go so far as to give us some answers with regard to Ryzen's lower-than-expected gaming performance, since games are well-known to be sensitive to a processor's cache performance profile.

Camm · Mar 6, 2017

One does wonder if the 4 core parts will suffer the same fate since it will be one straight core complex.

medi01 · Mar 6, 2017

Raevenlord said:
with latency being almost 3x greater than on Intel's 6900K.

Huh?
69.3 vs 98 is... 3 times?

PS
Are they testing the "core from the left quad accessing L3 of the right quad" scenario? (CCX in the title hints at that, but nothing in the chaotic text of the OP talks about it.)

londiste · Mar 6, 2017

hasn't amd repeatedly said that aida64 does not know how to properly test ryzen cache?

Aenra · Mar 6, 2017

Dumb question! What is this QC/DC next to the broadwell?

R0H1T · Mar 6, 2017

Aenra said:
Dumb question! What is this QC/DC next to the broadwell?

Quad vs Dual channel; the first test results are for memory, i.e. simply RAM.

Xzibit · Mar 6, 2017

londiste said:
hasn't amd repeatedly said that aida64 does not know how to properly test ryzen cache?

AIDA64 tweeted

AIDA64 said:
AMD hadn't sent us a Ryzen before launch. As soon as we can get one, we will fix the L2+L3 benchmarks

Kind of hard to have a working AIDA64 for Ryzen when the company tweets, the same day that article is published, that it can't fix it until they get a Ryzen chip.

the54thvoid · Mar 6, 2017

So...... Is this AMD's equivalent to Nvidia not doing Async? And can software coding help address this?

Aenra · Mar 6, 2017

R0H1T said:
Quad vs Dual channel; the first test results are for memory, i.e. simply RAM.

O.K., so it was a dumb question. Can be smart like that, that's me. Thanks for replying

Camm · Mar 6, 2017

the54thvoid said:
So...... Is this AMD's equivalent to Nvidia not doing Async? And can software coding help address this?

I think I would want to see some true benchmarks on this first before I drew conclusions. However, if I had to, a more aware scheduler could stop or at least reduce those painfully slow inter-fabric cache calls. But yes, much like Nvidia's async problem, ultimately I think it's an architectural limitation.

the54thvoid · Mar 6, 2017

Camm said:
I think I would want to see some true benchmarks on this first before I drew conclusions. However, if I had to, a more aware scheduler could stop or at least reduce those painfully slow inter-fabric cache calls. But yes, much like Nvidia's async problem, ultimately I think it's an architectural limitation.

I thought so; it can be addressed, though. Nvidia has asynchronous warp schedulers, it's just more restrictive than GCN's implementation of it. But where coded properly, it shouldn't cause too much detriment.
I think caching could surely be coded 'sympathetically' to the Ryzen architecture. Then again, I know nothing about coding and I am probably talking out my ass.

theGryphon · Mar 6, 2017

All this makes the current Ryzen performance even more impressive. I mean, it's a chip with basically a handicapped cache/memory implementation but it still trades blows with Intel chips clock-to-clock. This actually makes me think that the real Ryzen IPC (how it handles the instructions) is significantly better than Intel's.

At the end, this is good news for AMD: they have a clear improvement path --> Lower those L3 and system memory latency figures!

It's clear that the CCX design relies on the interconnect bandwidth, so AMD has a few paths going forward: 1) find a way to increase that bandwidth for a truly scalable architecture, 2) go Intel's route and design a chip that uses a larger CCX (with 16 cores), or 3) do both.

It seems to me AMD should really do both if they want to also become a player in the server market again. 32-core (2 x CCX), 4-chip configurations with up to 128 cores/system is not too much to ask in the server business...

Or (totally fantasizing now, or am I?), they could truly innovate and ditch the multi-chip system designs but rather build up on the scalability idea to come up with 16-core CCX's that can do up to 8-way (on-chip) interconnects, yielding a full chip with 128 cores. Think about the implications for business clients: a single 128-core chip on a small board, meaning much-easier-to-deal-with systems with much lower power utilization (4 chips on a huge board means huge power overhead). Then, similar to what they do in GPUs, they can trim it down to create a product line-up. I have a feeling this is AMD's way (vision), but it's a goal that's a long way off at the moment...

R0H1T · Mar 6, 2017

Anyone with a Ryzen willing to test this out ~ change the affinity of AIDA64 to first four cores plus SMT (just select CPU affinity from 0 to 7) using process hacker or process explorer. Just a quick glance at these results might give us some answers.
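For anyone who prefers to script this rather than click through a process tool, here is a minimal sketch of the affinity mask that the suggestion implies. It assumes, as the post does, that logical CPUs 0 to 7 are the four cores of the first CCX plus their SMT siblings; the actual numbering depends on the OS and firmware, and `affinity_mask` is a made-up helper name.

```python
def affinity_mask(cpus):
    """Build a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# Assumption taken from the post: logical CPUs 0-7 = first CCX with SMT.
FIRST_CCX = range(0, 8)

if __name__ == "__main__":
    print(hex(affinity_mask(FIRST_CCX)))  # prints 0xff
```

On Windows, a hexadecimal mask like this is what `start /affinity FF <program>` expects, which has the same effect as ticking CPUs 0 to 7 in Process Hacker or Process Explorer.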

Deeveo · Mar 6, 2017

Camm said:
One does wonder if the 4 core parts will suffer the same fate since it will be one straight core complex.

With only one CCX unit 4 core cpus shouldn't have the same problem.

asH9 · Mar 6, 2017

OK, Sooooo Why do HEDT professional programs/benchmarks (Blender...) that are 'Numa aware' (hint hint) run just as well on RyZen as they do on 6900, but gaming benchmarks between the 2 are different (cough HT proprietary cough) ???

niboar · Mar 6, 2017

Hi, the memory latency is in "ns" (nano) =1/1000000000 second not "ms" 1/1000 second.

Vlada011 · Mar 6, 2017

If the Skylake-E and Kaby Lake-E samples are finished, I don't know how much Intel can change to improve its tragic position, where its $1700 CPU loses to a $500 AMD part with 2 fewer cores and much lower power consumption, almost half.
Even if Intel catches AMD, that would be with 8- and 10-core processors at 150W power consumption.
Because of that, upgrading to AMD is a good choice at the moment.
Especially if someone wants a small PC: mATX mobo, fanless 500W PSU and RX 580 + 1800X.

I don't want to comment at all on rumors about some strange lags and some hidden problems of AMD.
Their CPU shines on paper, the numbers are fantastic. If powerful Intel has fallen so low that it needs to justify its presence with the i7-7700K at 4.5GHz in games locked to 2 and 4 cores, and in that way distract customers from AMD, then there are really no words. No one will help you except the i7-7700K.
Everyone who sabotages the real picture of the AMD processor is an enemy of enthusiasts and improvements, and is shooting themselves in the foot.
Because AMD gives you a CPU capable of beating the i7-6950X on LN2 for $500; you can buy a world-record holder for $500, with 2 fewer cores and far lower power consumption.

In Windows 10 and DX12 people could get far better performance than on Intel Broadwell-E. But Intel didn't do anything to provide that. We hear non-stop about some walls and no room for improvements. There is no room to milk the same architecture for 5 years; everything they did with X79 and X99 could fit in a single socket, but there is room for new generations.

PiotrekDG · Mar 6, 2017

niboar said:
Hi, the memory latency is in "ns" (nano) =1/1000000000 second not "ms" 1/1000 second.

So much YES, that's a millionfold difference. See what difference 30 ns makes, now imagine a million times slower memory.
And it's not a typo, it appears 5 times in the text, while "ns" never appears.

C_Wiz · Mar 6, 2017

Author of the article here, I know the language barrier doesn't make things easy but there are a few inaccuracies here in this summary. Some quick points on what we found:

- Memory latency (not L3) is higher (and ns, not ms)
- L3 is split in half and communication between the two CCX is thru the same link that links the CCX to the memory controller, PCIe, etc, at a much lower speed.

Plus many other things regarding CCX etc. I don't know how good a job Google Translate does of our article but I'd suggest people interested give it a shot (page 22/23 maybe 24 [we found another issue with game performance that's linked to Windows 10] is what you're looking for).

To answer another question, yes, L3 readings are inaccurate in Aida (that's why we show them in orange in the table). We do use another test (a beta benchmark from Aida, too) to check latency at different block sizes, that one is the basis of our analysis.

G.

EarthDog · Mar 6, 2017

I wonder if aida64 was updated... we were told directly from FinalWire not to use it for data until they updated it... AMD didn't send them ryzen pre launch...

uuuaaaaaa · Mar 6, 2017

C_Wiz said:
Author of the article here, I know the language barrier doesn't make things easy but there are a few inaccuracies here in this summary. Some quick points on what we found:

- Memory latency (not L3) is higher (and ns, not ms )
- L3 is split in half and communication between the two CCX is thru the same link that links the CCX to the memory controller, PCIe, etc, at a much lower speed.

Plus many other things regarding CCX etc. I don't know how good a job Google Translate does of our article but I'd suggest people interested give it a shot (page 22/23 maybe 24 [we found another issue with game performance that's linked to Windows 10] is what you're looking for).

To answer another question, yes, L3 readings are inaccurate in Aida (that's why we show them in orange in the table). We do use another test (a beta benchmark from Aida, too) to check latency at different block sizes, that one is the basis of our analysis.

G.

Thank you for the clarifications!

RejZoR · Mar 6, 2017

Also be aware that Intel makes one of the best L caches. After all, they have the foundries and both teams working together. AMD doesn't have that luxury so slightly higher latency isn't something strange. And it's not even that horrible to be honest. If it was, then multi-threaded benchmarks would suffer horrendously once L3 gets thrashed by HT cache misses. But it doesn't.

lexluthermiester · Mar 6, 2017

Raevenlord said:
AMD's Ryzen 7 lower than expected performance in some applications seems to stem from a particular problem: memory latency. Before AMD's Ryzen chips were even out, reports pegged AMD as having confirmed that most of the tweaks and programming for the new architecture had been done in order to improve core performance to its max - at the expense of memory compatibility and performance. Apparently, and until AMD's entire Ryzen line-up is completed with the upcoming Ryzen 5 and Ryzen 3 processors, the company will be hard at work on improving Ryzen's cache handling and memory latency.

Hardware.fr has done a pretty good job in exploring Ryzen's cache and memory subsystem deficiencies through the use of AIDA 64, in what would otherwise be an exceptional processor design. Namely, the fact that there seems to be some problem with Ryzen's L3 implementation, in that it produces latency results that are up to 30 ns higher than the average, at 90 ns, than the L3 latency found on Intel's i7 6900K or even AMD's FX 8350 (both with latency around 60 ns).

From some more testing results, we see that Intel's L1 cache is still leagues ahead from AMD's implementation; that AMD's L2 is overall faster than Intel's, though it does incur on average a roughly 2 ns latency penalty; and that AMD's L3 memory is very much behind Intel's offerings in all metrics but L3 cache copies, with latency being almost 50% greater than on Intel's 6900K.

The problem is revealed through an increasing work size. In the case of the 6900K, which has a 32 KB L1 cache, performance is greatest until that workload size; higher-sized workloads that don't fit on the L1 cache then "spill" towards the 6900K's 256 KB L2 cache; workloads higher than 256 KB and lower than 16 MB are then submitted to the 6900K's 20 MB L3 cache, with any workloads higher than 16 MB in size then forcing the processor to access the main system memory, with increasing latency in access times until it reaches the RAM's ~70 ns access times.

However, on AMD's Ryzen 1800X, latency times are a wholly different beast. Everything is fine in the L1 and L2 caches (32 KB and 512 KB, respectively). However, when moving towards the 1800X's 16 MB L3 cache, the behavior is completely different. Up to 4 MB cache utilization, we see an expected increase in latency; however, latency goes through the roof way before the chip's 16 MB of L3 cache is completely filled. This clearly derives from AMD's Ryzen modularity, with each CCX complex (made up of 4 cores and 8 MB L3 cache, besides all the other duplicated logic) being able to access only 8 MB of L3 cache at any point in time.

The difference in access speeds between 4 MB and 8 MB workloads can be explained through AMD's own admission that Ryzen's core design incurs different access times depending on which parts of the L3 cache are accessed by the CCX. Since the L3 cache is essentially a victim cache, meaning that it is filled with the information that isn't able to fit onto the chips' L1 or L2 cache levels, this would mean that each CCX can only access up to 8 MB of L3 cache if any given workload uses no more than 4 cores from a given CCX. However, even if we were to distribute workload in-between two different cores from each CCX, so as to be able to access the entirety of the 1800X's 16 MB cache... we'd still be somewhat constrained by the inter-CCX bandwidth achieved by AMD's Data Fabric interconnect... 22 GB/s, which is much lower than the L3 cache's 175 GB/s - and even lower than RAM bandwidth.

AMD's Zen architecture is surely an interesting beast, and these kinds of results really go to show the amount of work, of give-and-take design that AMD had to go through in order to achieve a cost-effective, scalable, and at the same time performant architecture through its CCX modules. However, this kind of behavior may even go so far as to give us some answers with regards to Ryzen's lower than expected gaming performance, since games are well-known to be sensitive to a processor's cache performance profile.

Source: Hardware.fr

There were a few problems with this article. The use of "ms"(milliseconds) instead of "ns"(nanoseconds) was fairly glaring. CPU operating reaction speeds have not been measured in "ms" since the early 80's. There were also a few grammatical errors which have been fixed. You're welcome.

fynxer · Mar 6, 2017

Hmmm, is this a permanent design flaw or is this fixable somehow?

ssdpro · Mar 6, 2017

I had wondered when someone would start expanding on the memory latency issues. The 90+ ns latency on these is like an old Core 2 / P35 from 2007. In the AIDA64 memory latency list you have to scroll down to find the poor 1800X... just below a P4 from 2004.
