Latency Increase from Larger L2 Cache on Intel "Raptor Cove" P-core Well Contained: Report

btarunr · Aug 24, 2022

According to an investigative report by "Chips and Cheese," the larger L2 caches in Intel's 13th Gen Core "Raptor Lake-S" don't come with a proportionate increase in cache latency, and Intel seems to have contained the latency increase well. "Raptor Lake-S" significantly increases L2 cache sizes over the previous generation. Each of its 8 "Raptor Cove" P-cores has 2 MB of dedicated L2 cache, compared to the 1.25 MB of the "Golden Cove" P-cores powering the current-gen "Alder Lake-S," which amounts to a 60 percent increase in size. Each "Gracemont" E-core cluster (a group of four E-cores) sees a doubling of the L2 cache shared among its four cores, from 2 MB in "Alder Lake" to 4 MB. The last-level L3 cache, shared among all P-cores and E-core clusters, sees a less remarkable increase in size, from 30 MB to 36 MB.
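The size increases quoted above are simple ratios. A minimal sketch that recomputes them from the megabyte figures in the article (the dictionary keys and the helper name `percent_increase` are made up for illustration):

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0

# Cache sizes in MB as quoted in the article: (Alder Lake, Raptor Lake)
cache_sizes_mb = {
    "P-core L2 (per core)": (1.25, 2.0),
    "E-core cluster L2": (2.0, 4.0),
    "Shared L3": (30.0, 36.0),
}

increases = {
    name: percent_increase(old, new)
    for name, (old, new) in cache_sizes_mb.items()
}

for name, pct in increases.items():
    print(f"{name}: +{pct:.0f}%")
```

This gives 60 percent for the P-core L2, 100 percent for the E-core cluster L2, and 20 percent for the L3, which is why the L3 bump reads as the least remarkable of the three.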

Larger caches have a direct impact on performance, as more data is available close to the CPU cores, sparing them lengthy fetch/store operations to main memory (RAM). However, making caches larger doesn't just cost die area, transistor count, and power/heat, but also latency, even though L2 cache is an order of magnitude faster than the L3 cache, which in turn is significantly faster than DRAM. Chips and Cheese tracked and tabulated the L2 cache latencies of past Intel client microarchitectures, and found a generational increase in latencies with increasing L2 cache sizes, leading up to "Alder Lake." This increase has somehow tapered off with "Raptor Lake."

The report says that the 4-way associative 256 KB dedicated L2 cache of "Skylake" (through "Comet Lake") CPU cores has a latency of 12 cycles. "Sunny Cove" and "Cypress Cove" cores increase this to 512 KB, and the latency rises to 13 cycles. "Willow Cove" and "Golden Cove" (powering "Tiger Lake" and "Alder Lake," respectively) see a further increase in size, to 1.25 MB. While "Willow Cove" uses a 20-way associative cache, "Golden Cove" uses 10-way. The latency goes up from 13 cycles to 14 cycles. The upcoming "Raptor Cove" P-core comes with 2 MB of 16-way L2 cache, but here, the latency is contained to 15 cycles. This indicates that "Raptor Lake" has undergone some serious rework of its power management as well as its cache design to reach its cache latency target. Bear in mind that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."
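To make the "not proportionate" point concrete, here is a small sketch that compares L2 size growth with latency growth relative to "Skylake." It uses only the size and cycle figures as quoted in the paragraph above, and the function name `growth_vs_baseline` is made up for illustration:

```python
# (core, L2 size in KB, L2 latency in cycles), as quoted in the article above
l2_by_core = [
    ("Skylake", 256, 12),
    ("Sunny Cove", 512, 13),
    ("Golden Cove", 1280, 14),
    ("Raptor Cove", 2048, 15),
]

def growth_vs_baseline(rows):
    """Return (core, size multiple, latency multiple) relative to the first row."""
    _, base_size, base_latency = rows[0]
    return [
        (name, size / base_size, latency / base_latency)
        for name, size, latency in rows
    ]

growth = growth_vs_baseline(l2_by_core)

for name, size_x, latency_x in growth:
    print(f"{name:12s} size x{size_x:.2f}  latency x{latency_x:.2f}")
```

With these inputs, "Raptor Cove" ends up with 8 times the L2 capacity of "Skylake" at 1.25 times the cycle count.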


dj-electric · Aug 24, 2022

Chips are in the hands of people who can test them. Those people might be impressed with some of the data-heavy things those chips can do. The cache restructure seems to have improved performance in some of the weaker spots found on ADL-S previously.

That's all.

Tropick · Aug 24, 2022

Great to see them improving their cache design but the "serious power management rework" they had to do makes me concerned that the power draw needed to keep the latencies down is significant.

Bloax · Aug 24, 2022

Personally, I'm curious whether the DDR4 memory controller has improved, and whether or not this is related to the magic "dec_tCWL" register that's exposed in a silly program (ASUS/ROG MemTweakIt) going from 3 to 1 or 0.

As tCWL is good for one thing on AM4; causing instability - "nooo i don't wanna do tCL 13 with tCWL 12 that's too fast" meanwhile tCWL > tCL is hella unstable (e.g. tCL 13 + tCWL 14 POSTs, unlike tCWL 12) - and generally being a waste of time to mess with.

Conspicuously then tCWL can be run above tCL on Alder Lake, up to around +3.. Hmm......

oh right, and I should leave my obligatory "these processors run Cool & Quiet just fine if you dial in the RAM settings real hard" comment too, i guess
bah

DeathtoGnomes · Aug 24, 2022

Bear in mind, that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."

Well now... What's there to say?

Tropick · Aug 24, 2022

birdie said:
This can't be true considering RPL retains the same TDP while offering significantly higher frequencies, more cache and E-cores.

I think it's definitely possible, I mean look what they were able to squeeze out of 14nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores and the TDP remained pretty much the same if not lower between parts with the same core count. A 14nm 4.0GHz quad core i7-6700k has a TDP of 91w whereas a 14nm 4.4GHz quad core i3-10320 has a TDP of 65w. I have no doubt they'd be able to do the same on 10nm.

AnotherReader · Aug 24, 2022

Impressive latency for a large and high clocked L2 cache. Before AMD moved to FinFET based nodes, Intel typically had larger and faster caches than them. Now that both of them use comparable nodes, it is interesting to see how they approach cache design.

phanbuey · Aug 24, 2022

Tropick said:
I think it's definitely possible, I mean look what they were able to squeeze out of 14nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores and the TDP remained pretty much the same if not lower between parts with the same core count. A 14nm 4.0GHz quad core i7-6700k has a TDP of 91w whereas a 14nm 4.4GHz quad core i3-10320 has a TDP of 65w. I have no doubt they'd be able to do the same on 10nm.

I think there was a lot of Alder Lake variation and inconsistency between bins / clocks. Some ADL hit 5.5 GHz easily while others can't get beyond 5.1 without going full reactor.

The current chip I have can hit 5.4 with no issues and it was a first batch 12600K bin. I think they sacrificed clocks for volumes in the initial process, and if you think about the top 12900K bins, the jump from 5.4/5.5 GHz to 5.6 GHz is much less impressive.

hs4 · Aug 25, 2022

The Alder lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, the Raptor lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.

Rather, Zen4 is clocking up enough to offset the power efficiency improvements due to node advancements, and the power efficiency of RPL and Zen4 will be similar.

efikkan · Aug 26, 2022

AnotherReader said:
Impressive latency for a large and high clocked L2 cache.

You can see they went from a 4-way 256 kB cache in Skylake, to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB, and these actually work as separate caches. The benefits of adding more banks are maintaining latency and adding total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected. It's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will be dropping off at some point. Future nodes may make it feasible though to increase the bank size instead.
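The per-way figures quoted above (64 kB and 128 kB) follow from dividing capacity by associativity. A minimal sketch that reproduces them and also derives the number of sets, assuming 64-byte cache lines (the line size is not stated in the thread, and `way_geometry` is a made-up helper name):

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, not stated in the thread

def way_geometry(size_kb: int, ways: int, line_bytes: int = LINE_BYTES):
    """Return (KB per way, number of sets) for a set-associative cache."""
    kb_per_way = size_kb / ways
    sets = size_kb * 1024 // (ways * line_bytes)
    return kb_per_way, sets

# (total L2 size in KB, associativity) as given in the post above
l2_configs = {
    "Skylake": (256, 4),
    "Sunny Cove": (512, 8),
    "Golden Cove": (1280, 10),
    "Raptor Cove": (2048, 16),
}

for core, (size_kb, ways) in l2_configs.items():
    kb_per_way, sets = way_geometry(size_kb, ways)
    print(f"{core:12s} {ways:2d} x {kb_per_way:.0f} kB, {sets} sets")
```

Under that assumption, Golden Cove and Raptor Cove come out with the same per-way capacity and the same set count, so the extra capacity comes entirely from added ways.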

hs4 said:
The Alder lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, the Raptor lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.

Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.

hs4 · Aug 27, 2022

efikkan said:
You can see they went from a 4-way 256 kB cache in Skylake, to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB, and these actually work as separate caches. The benefits of adding more banks are maintaining latency and adding total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected. It's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will be dropping off at some point. Future nodes may make it feasible though to increase the bank size instead.

Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.

What I consider to be improvements to "Alder lake", not only the Golden Cove, are

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ringbus (@OneRaichu already showed difference in core-to-core latency matrix between ADL and RPL)
- Optimization on clock-speed bottleneck
  - e.g. logically same and electrically different things, such as adjusting the distance between components

For example, ADL had significantly slower decompression than compression with 7zip (compared to trends on other CPUs), but that problem has been eliminated with RPL. There are quite a few reports of algorithm-dependent differences in performance improvement between ADL and RPL, and I suspect that this is probably the effect of eliminating the data transfer bottleneck.

efikkan · Aug 27, 2022

hs4 said:
What I consider to be improvements to "Alder lake", not only the Golden Cove, are

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ringbus (@OneRaichu already showed difference in core-to-core latency matrix between ADL and RPL)
- Optimization on clock-speed bottleneck
  - e.g. logically same and electrically different things, such as adjusting the distance between components

For example, ADL had significantly slower decompression than compression with 7zip (compared to trends on other CPUs), but that problem has been eliminated with RPL. There are quite a few reports of algorithm-dependent differences in performance improvement between ADL and RPL, and I suspect that this is probably the effect of eliminating the data transfer bottleneck.

I was referring to "The Alder lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.

hs4 · Aug 27, 2022

efikkan said:
I was referring to "The Alder lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.

I understand what you mean. I wanted to say this: x86 hybrids have not yet had enough time to mature.

It was pointed out right after its launch that Alder Lake had data transfer problems. I first saw this in an article in my native language, but the following English article, for example, is probably a typical evaluation.

"Alder Lake – E-Cores, Ring Clock, and Hybrid Teething Troubles" by Chips and Cheese, December 16, 2021

But first attempts at new things often encounter teething problems. Alder Lake seems to be no exception. The software side has been well covered, but the hardware side is not flawless either. We expect Intel to improve on their hybrid architecture going forward as they get more experience and improving on this aspect with Alder Lake’s follow on, codename Raptor Lake, seems to be a goal for Intel as the leaked Raptor Lake slides indicated.

kinz · Apr 9, 2023

This is mostly incorrect.

Latency, in clock cycles -- is only directly comparable with equal frequencies.

Cache speeds have increased significantly.

For example, looking at Rocket Lake running a ring bus speed (cache frequency) of 5 GHz, we have:

16 clocks @ 5 GHz = 16 x (1 / 5,000,000,000) s = 3.2 ns
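The conversion above generalizes to any cycle count and clock. A minimal sketch, with `cycles_to_ns` as a made-up helper name; the second call pairs the 12-cycle Skylake figure from the article with a 4.0 GHz clock purely as an illustrative assumption, not a measurement:

```python
def cycles_to_ns(cycles: int, freq_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    if freq_ghz <= 0:
        raise ValueError("frequency must be positive")
    return cycles / freq_ghz

# The example from the post above: 16 cycles at 5 GHz
print(f"{cycles_to_ns(16, 5.0):.1f} ns")

# Illustrative assumption only: 12 cycles at a 4.0 GHz clock
print(f"{cycles_to_ns(12, 4.0):.1f} ns")
```

The two calls give 3.2 ns and 3.0 ns, which shows how a higher cycle count at a higher clock can land at nearly the same absolute latency.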

There is more to the story than simply the frequency and that one timing, when it comes to real world latency, however.
Regardless, if you look at the measured L2 cache latency in the AIDA64 memory benchmark, for example, you'll see that it hasn't changed much at all since Skylake.
