
Latency Increase from Larger L2 Cache on Intel "Raptor Cove" P-core Well Contained: Report

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,291 (7.53/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
According to an investigative report by "Chips and Cheese," the larger L2 caches in Intel's 13th Gen Core "Raptor Lake-S" don't come with a proportionate increase in cache latency; Intel seems to have contained the latency increase well. "Raptor Lake-S" significantly increases L2 cache sizes over the previous generation. Each of its 8 "Raptor Cove" P-cores has 2 MB of dedicated L2 cache, compared to 1.25 MB for the "Golden Cove" P-cores powering the current-gen "Alder Lake-S," which amounts to a 60 percent increase in size. Each "Gracemont" E-core cluster (a group of four E-cores) sees a doubling in the size of the L2 cache shared among its four cores, from 2 MB in "Alder Lake" to 4 MB. The last-level L3 cache, shared among all P-cores and E-core clusters, sees a less remarkable increase in size, from 30 MB to 36 MB.

Larger caches have a direct impact on performance, as more data is available close to the CPU cores, sparing them a lengthy fetch/store operation to the main memory (RAM). However, making caches larger costs not just die area, transistor count, and power/heat, but also latency, even though L2 cache is an order of magnitude faster than the L3 cache, which in turn is significantly faster than DRAM. Chips and Cheese tracked and tabulated the L2 cache latencies of past Intel client microarchitectures, and found a generational increase in latencies with increasing L2 cache sizes, leading up to "Alder Lake." This increase appears to have tapered off with "Raptor Lake."



The report says that the 4-way associative 256 KB dedicated L2 cache of "Skylake" (through "Comet Lake") CPU cores has a latency of 12 cycles. "Sunny Cove" and "Cypress Cove" cores increase the L2 cache to 512 KB, with the latency rising to 13 cycles. "Willow Cove" and "Golden Cove" (powering "Tiger Lake" and "Alder Lake," respectively) see a further increase in size. While "Willow Cove" uses a 20-way associative cache, "Golden Cove" uses 10-way. The latency goes up from 13 cycles to 14 cycles. The upcoming "Raptor Cove" P-core comes with 2 MB of 16-way L2 cache, but here the latency is contained to 15 cycles. This suggests that "Raptor Lake" has undergone some serious rework of its power management as well as its cache design to reach its cache latency target. Bear in mind that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."
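
To put the article's figures side by side, here is a minimal Python sketch that computes how much L2 capacity and L2 latency grew between the generations it cites. The capacities and cycle counts are the ones quoted above; the names `l2_caches`, `percent_increase`, and `generational_growth` are made up for this illustration.

```python
# L2 capacity (KB) and L2 latency (cycles) as quoted in the article above.
# Willow Cove is left out because the article gives it no separate latency figure.
l2_caches = [
    ("Skylake", 256, 12),
    ("Sunny Cove", 512, 13),
    ("Golden Cove", 1280, 14),
    ("Raptor Cove", 2048, 15),
]


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


def generational_growth(caches):
    """Return (from, to, size growth %, latency growth %) per consecutive pair."""
    rows = []
    for (prev_name, prev_kb, prev_cyc), (name, kb, cyc) in zip(caches, caches[1:]):
        rows.append(
            (
                prev_name,
                name,
                percent_increase(prev_kb, kb),
                percent_increase(prev_cyc, cyc),
            )
        )
    return rows


if __name__ == "__main__":
    for prev_name, name, size_pct, lat_pct in generational_growth(l2_caches):
        print(f"{prev_name} -> {name}: size +{size_pct:.0f}%, latency +{lat_pct:.1f}%")
```

With these inputs, each step adds a single cycle while the capacity grows by a much larger fraction, which is the "not proportionate" point the article makes.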

 
Joined
Aug 13, 2010
Messages
5,478 (1.05/day)
Chips are in the hands of people who can test them. Those people might be impressed with some of the heavy data-related things those chips can do. The cache restructure seems to have improved performance in some of the weaker spots previously found on ADL-S.

That's all.
 
Joined
Jun 8, 2022
Messages
388 (0.42/day)
Location
Ohio, USA
System Name Trackstar
Processor AMD Ryzen 7 5800X3D -30 All Core CO (on Corsair XC5 block)
Motherboard Gigabyte B550 AORUS Elite V2 Rev 1.0 (F17 BIOS)
Cooling Corsair XD5 pump / Corsair XR5 1x 360mm (front) + 1x 420mm (top) rads
Memory 32GB G.Skill DDR4-3600 CL14 1:1 (F4-3600C14Q-32GVKA kit)
Video Card(s) ASRock RX 6950XT OC Formula (on Bykski A-AR6900XTOCF-X block)
Storage WD_BLACK SN850X 2TB w/HS (FW ver. 620361WD)
Display(s) Dell S3222DGM 32" 1440p/165Hz FreeSync
Case Fractal Design Meshify S2
Audio Device(s) Realtek ALC1200 Integrated Audio
Power Supply Super Flower Leadex Platinum SE 1200W on Liebert GXT4-1500RT120 UPS
Mouse Corsair Nightsword RGB
Keyboard Corsair K60 RGB PRO
VR HMD N/A
Software Windows 11 Pro 23H2 (Build 22631.3958)
Benchmark Scores https://www.3dmark.com/sw/1131940 https://www.3dmark.com/fs/29315810
Great to see them improving their cache design but the "serious power management rework" they had to do makes me concerned that the power draw needed to keep the latencies down is significant.
 
Joined
Feb 19, 2022
Messages
59 (0.06/day)
Personally, I'm curious whether the DDR4 memory controller has improved, and whether or not this is related to the magic "dec_tCWL" register that's exposed in a silly program (ASUS/ROG MemTweakIt) going from 3 to 1 or 0.

As tCWL is good for one thing on AM4; causing instability - "nooo i don't wanna do tCL 13 with tCWL 12 that's too fast" meanwhile tCWL > tCL is hella unstable (e.g. tCL 13 + tCWL 14 POSTs, unlike tCWL 12) - and generally being a waste of time to mess with.

Conspicuously then tCWL can be run above tCL on Alder Lake, up to around +3.. Hmm......


oh right, and I should leave my obligatory "these processors run Cool & Quiet just fine if you dial in the RAM settings real hard" comment too, i guess
bah
 
Joined
Jul 16, 2014
Messages
8,216 (2.16/day)
Location
SE Michigan
System Name Dumbass
Processor AMD Ryzen 7800X3D
Motherboard ASUS TUF gaming B650
Cooling Artic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 32gb DDR5 6000
Video Card(s) GreenTeam 4070 ti super 16gb
Storage Samsung EVO 500gb & 1Tb, 2tb HDD, 500gb WD Black
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse Steeseries Esports Wireless
Keyboard Corsair K100
Software windows 10 H
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
Bear in mind, that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."
Well now... What's there to say?
 
Joined
Jun 8, 2022
Messages
388 (0.42/day)
Location
Ohio, USA
System Name Trackstar
Processor AMD Ryzen 7 5800X3D -30 All Core CO (on Corsair XC5 block)
Motherboard Gigabyte B550 AORUS Elite V2 Rev 1.0 (F17 BIOS)
Cooling Corsair XD5 pump / Corsair XR5 1x 360mm (front) + 1x 420mm (top) rads
Memory 32GB G.Skill DDR4-3600 CL14 1:1 (F4-3600C14Q-32GVKA kit)
Video Card(s) ASRock RX 6950XT OC Formula (on Bykski A-AR6900XTOCF-X block)
Storage WD_BLACK SN850X 2TB w/HS (FW ver. 620361WD)
Display(s) Dell S3222DGM 32" 1440p/165Hz FreeSync
Case Fractal Design Meshify S2
Audio Device(s) Realtek ALC1200 Integrated Audio
Power Supply Super Flower Leadex Platinum SE 1200W on Liebert GXT4-1500RT120 UPS
Mouse Corsair Nightsword RGB
Keyboard Corsair K60 RGB PRO
VR HMD N/A
Software Windows 11 Pro 23H2 (Build 22631.3958)
Benchmark Scores https://www.3dmark.com/sw/1131940 https://www.3dmark.com/fs/29315810
This can't be true considering RPL retains the same TDP while offering significantly higher frequencies, more cache and E-cores.

I think it's definitely possible; I mean, look what they were able to squeeze out of 14 nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores, and the TDP remained pretty much the same, if not lower, between parts with the same core count. A 14 nm 4.0 GHz quad-core i7-6700K has a TDP of 91 W, whereas a 14 nm 4.4 GHz quad-core i3-10320 has a TDP of 65 W. I have no doubt they'd be able to do the same on 10 nm.
 
Joined
Nov 26, 2021
Messages
1,702 (1.52/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
Impressive latency for a large and high clocked L2 cache. Before AMD moved to FinFET based nodes, Intel typically had larger and faster caches than them. Now that both of them use comparable nodes, it is interesting to see how they approach cache design.
 
Joined
Nov 13, 2007
Messages
10,827 (1.73/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6400 1:1 CL30-36-36-76 FCLK 2200
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
I think it's definitely possible; I mean, look what they were able to squeeze out of 14 nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores, and the TDP remained pretty much the same, if not lower, between parts with the same core count. A 14 nm 4.0 GHz quad-core i7-6700K has a TDP of 91 W, whereas a 14 nm 4.4 GHz quad-core i3-10320 has a TDP of 65 W. I have no doubt they'd be able to do the same on 10 nm.

I think there was a lot of Alder Lake variation and inconsistency between bins/clocks. Some ADL chips hit 5.5 GHz easily, while others can't get beyond 5.1 GHz without going full reactor.

The current chip I have can hit 5.4 GHz with no issues, and it was a first-batch 12600K bin. I think they sacrificed clocks for volume in the initial process, and if you think about the top 12900K bins, the jump from 5.4/5.5 GHz to 5.6 GHz is much less impressive.
 

hs4

Joined
Feb 15, 2022
Messages
106 (0.10/day)
Alder Lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, Raptor Lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.

Rather, Zen 4 is being clocked high enough to offset the power efficiency gains from its node advancement, so the power efficiency of RPL and Zen 4 will likely be similar.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
Impressive latency for a large and high clocked L2 cache.
You can see they went from a 4-way 256 kB cache in Skylake to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB, which actually work as separate caches. The benefits of adding more banks are maintained latency and added total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected; it's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will drop off at some point. Future nodes may make it feasible, though, to increase the bank size instead.
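
As a rough aid for the way/size arithmetic in this post: in the usual set-associative model, capacity divided by the number of ways gives the per-way size, and dividing that by the line size gives the number of sets. The sketch below assumes 64-byte cache lines, which is not stated anywhere in this thread, and the function name `cache_geometry` is hypothetical. It only reproduces the 4x64 kB, 8x64 kB, 10x128 kB, and 16x128 kB splits quoted above; it says nothing about how the caches are physically banked.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, not stated in the thread


def cache_geometry(size_kb, ways, line_bytes=LINE_BYTES):
    """Return (per-way size in kB, number of sets) for a set-associative cache."""
    total_bytes = size_kb * 1024
    if total_bytes % (ways * line_bytes) != 0:
        raise ValueError("capacity must divide evenly into ways and lines")
    way_kb = total_bytes / ways / 1024
    num_sets = total_bytes // (ways * line_bytes)
    return way_kb, num_sets


if __name__ == "__main__":
    # Sizes and associativities as quoted in the post above.
    for name, size_kb, ways in [
        ("Skylake", 256, 4),
        ("Sunny Cove", 512, 8),
        ("Golden Cove", 1280, 10),
        ("Raptor Cove", 2048, 16),
    ]:
        way_kb, num_sets = cache_geometry(size_kb, ways)
        print(f"{name}: {ways} ways x {way_kb:.0f} kB, {num_sets} sets")
```

Under the 64-byte-line assumption, Golden Cove and Raptor Cove end up with the same per-way size and the same number of sets; only the number of ways differs.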

Alder Lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, Raptor Lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.
Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.
 

hs4

Joined
Feb 15, 2022
Messages
106 (0.10/day)
You can see they went from a 4-way 256 kB cache in Skylake to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB, which actually work as separate caches. The benefits of adding more banks are maintained latency and added total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected; it's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will drop off at some point. Future nodes may make it feasible, though, to increase the bank size instead.


Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.
What I consider to be improvements to Alder Lake, not only to Golden Cove, are:

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ring bus (@OneRaichu already showed the difference in the core-to-core latency matrix between ADL and RPL)
- Optimization of clock-speed bottlenecks
  - e.g. logically identical but electrically different changes, such as adjusting the distance between components

For example, ADL had significantly slower decompression than compression in 7-Zip (compared to trends on other CPUs), but that problem has been eliminated with RPL. There are quite a few reports of algorithm-dependent differences in performance improvement between ADL and RPL, and I suspect that this is probably the effect of eliminating the data transfer bottleneck.
 
Joined
Jun 10, 2014
Messages
2,987 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
What I consider to be improvements to Alder Lake, not only to Golden Cove, are:

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ring bus (@OneRaichu already showed the difference in the core-to-core latency matrix between ADL and RPL)
- Optimization of clock-speed bottlenecks
  - e.g. logically identical but electrically different changes, such as adjusting the distance between components

For example, ADL had significantly slower decompression than compression in 7-Zip (compared to trends on other CPUs), but that problem has been eliminated with RPL. There are quite a few reports of algorithm-dependent differences in performance improvement between ADL and RPL, and I suspect that this is probably the effect of eliminating the data transfer bottleneck.
I was referring to "Alder Lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.
 

hs4

Joined
Feb 15, 2022
Messages
106 (0.10/day)
I was referring to "Alder Lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.
I understand what you mean. I wanted to say this: x86 hybrids have not yet had enough time to mature.

It was pointed out right after its launch that Alder Lake had data transfer problems. I first saw an article about it in my native language, but the following English article, for example, is probably a typical evaluation:

"Alder Lake – E-Cores, Ring Clock, and Hybrid Teething Troubles" by Chips and Cheese, December 16, 2021

But first attempts at new things often encounter teething problems. Alder Lake seems to be no exception. The software side has been well covered, but the hardware side is not flawless either. We expect Intel to improve on their hybrid architecture going forward as they get more experience and improving on this aspect with Alder Lake’s follow on, codename Raptor Lake, seems to be a goal for Intel as the leaked Raptor Lake slides indicated.
 

kinz

New Member
Joined
Jan 20, 2023
Messages
3 (0.00/day)
This is mostly incorrect.

Latency in clock cycles is only directly comparable at equal frequencies.

Cache speeds have increased significantly.

For example, looking at Rocket Lake running a ring bus speed (cache frequency) of 5 GHz, we have:

16 clocks @ 5 GHz = 16 x (1 / 5,000,000,000 Hz) = 3.2 ns

When it comes to real-world latency, however, there is more to the story than just the frequency and that one timing.
Regardless, if you look at the measured L2 cache latency in the AIDA64 memory benchmark, for example, you'll see that it hasn't changed much at all since Skylake.
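
The cycles-to-time conversion in this post can be sketched in a few lines of Python. The 16-cycle, 5 GHz case is the one worked out above; the 12-cycle, 4 GHz case is a made-up comparison point, and the function name `cycles_to_ns` is hypothetical.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds at the given clock (GHz).

    One cycle at f GHz lasts 1/f nanoseconds, so latency = cycles / f.
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return cycles / clock_ghz


if __name__ == "__main__":
    # The case from the post: 16 clocks at 5 GHz.
    print(f"16 cycles @ 5.0 GHz = {cycles_to_ns(16, 5.0):.2f} ns")
    # Hypothetical comparison: fewer cycles at a lower clock.
    print(f"12 cycles @ 4.0 GHz = {cycles_to_ns(12, 4.0):.2f} ns")
```

A higher cycle count at a higher clock can therefore land close to a lower cycle count at a lower clock in absolute time, which is why cycle counts alone do not settle the comparison.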
 