Wednesday, August 24th 2022

Latency Increase from Larger L2 Cache on Intel "Raptor Cove" P-core Well Contained: Report

Aug 24th, 2022 09:38

According to an investigative report by "Chips and Cheese," the larger L2 caches in Intel's 13th Gen Core "Raptor Lake-S" don't come with a proportionate increase in cache latency; Intel seems to have contained the latency increase well. "Raptor Lake-S" significantly increases L2 cache sizes over the previous generation. Each of its 8 "Raptor Cove" P-cores has 2 MB of dedicated L2 cache, compared to the 1.25 MB of the "Golden Cove" P-cores powering the current-gen "Alder Lake-S," which amounts to a 60 percent increase in size. Each "Gracemont" E-core cluster (a group of four E-cores) sees a doubling in the size of the L2 cache shared among its four cores, from 2 MB in "Alder Lake" to 4 MB. The last-level L3 cache, shared among all P-cores and E-core clusters, sees a less remarkable increase in size, from 30 MB to 36 MB.
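As a quick check of the size figures quoted above, the percentage increases can be worked out directly. This is only a sketch of the arithmetic; the dictionary and function names are illustrative, and the sizes are taken from the paragraph above.

```python
# Cache sizes in MB as quoted in the article: (Alder Lake, Raptor Lake).
CACHE_SIZES_MB = {
    "P-core L2": (1.25, 2.0),
    "E-core cluster L2": (2.0, 4.0),
    "Shared L3": (30.0, 36.0),
}


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


def size_increases(sizes=CACHE_SIZES_MB):
    """Map each cache level to its generational size increase in percent."""
    return {name: percent_increase(old, new) for name, (old, new) in sizes.items()}


if __name__ == "__main__":
    for name, pct in size_increases().items():
        print(f"{name}: +{pct:.0f}%")
```

The P-core L2 grows by 60 percent and the E-core cluster L2 by 100 percent, while the L3 grows by only 20 percent, which is why the article calls the L3 increase less remarkable.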

Larger caches have a direct impact on performance, as more data is available close to the CPU cores, sparing them a lengthy fetch/store operation to the main memory (RAM). However, making caches larger doesn't just cost die-area, transistor-count, and power/heat, but also latency, even though L2 cache remains several times faster than the L3 cache, which in turn is significantly faster than DRAM. Chips and Cheese tracked and tabulated the L2 cache latencies of past Intel client microarchitectures, and found a generational increase in latencies with increasing L2 cache sizes, leading up to "Alder Lake." With "Raptor Lake," this increase appears to have tapered off.

The report says that the 4-way associative 256 KB dedicated L2 cache of "Skylake" (through "Comet Lake") CPU cores has a latency of 12 cycles. "Sunny Cove" and "Cypress Cove" cores double this to 512 KB, with the latency rising to 13 cycles. "Willow Cove" and "Golden Cove" (powering "Tiger Lake" and "Alder Lake," respectively) see a further increase in size, to 1.25 MB. While "Willow Cove" uses a 20-way associative cache, "Golden Cove" uses 10-way. The latency goes up from 13 cycles to 14 cycles. The upcoming "Raptor Cove" P-core comes with 2 MB of 16-way L2 cache, but here, the latency is contained to 15 cycles. This suggests that "Raptor Lake" has undergone some serious rework of its power management as well as its cache design to reach its cache latency target. Bear in mind that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."
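To put the trend in perspective, the sketch below compares how much the L2 size and the L2 latency grew relative to "Skylake." It takes the sizes and cycle counts quoted in the paragraph above at face value; the table and function names are illustrative only.

```python
# (core, L2 size in KB, L2 latency in cycles), as quoted in the article above.
L2_GENERATIONS = [
    ("Skylake", 256, 12),
    ("Sunny Cove", 512, 13),
    ("Golden Cove", 1280, 14),
    ("Raptor Cove", 2048, 15),
]


def growth_vs_baseline(generations=L2_GENERATIONS):
    """Return (name, size ratio, latency ratio) relative to the first entry."""
    _, base_kb, base_cycles = generations[0]
    return [
        (name, kb / base_kb, cycles / base_cycles)
        for name, kb, cycles in generations
    ]


if __name__ == "__main__":
    for name, size_x, latency_x in growth_vs_baseline():
        print(f"{name:12s} size x{size_x:.2f}  latency x{latency_x:.2f}")
```

With these figures, capacity grows eightfold from "Skylake" to "Raptor Cove" while the cycle count grows by a quarter, which is the sense in which the latency increase is "contained."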

Source: Chips and Cheese


13 Comments on Latency Increase from Larger L2 Cache on Intel "Raptor Cove" P-core Well Contained: Report

dj-electric

Chips are in the hands of people who can test them. Those people might be impressed with some of the heavy data-related things those chips can do. The cache restructure seems to have contributed performance to some of the weaker spots found on ADL-S previously.

That's all.

Tropick

Great to see them improving their cache design but the "serious power management rework" they had to do makes me concerned that the power draw needed to keep the latencies down is significant.

Bloax

Personally then I'm curious whether the DDR4 memory controller has improved, and whether or not this is related to the magic "dec_tCWL" register that's exposed in a silly program (ASUS/ROG MemTweakIt) going from 3 to 1 or 0.

As tCWL is good for one thing on AM4; causing instability - "nooo i don't wanna do tCL 13 with tCWL 12 that's too fast" meanwhile tCWL > tCL is hella unstable (e.g. tCL 13 + tCWL 14 POSTs, unlike tCWL 12) - and generally being a waste of time to mess with.

Conspicuously then tCWL can be run above tCL on Alder Lake, up to around +3.. Hmm......

oh right, and I should leave my obligatory "these processors run Cool & Quiet just fine if you dial in the RAM settings real hard" comment too, i guess
bah

DeathtoGnomes

Bear in mind, that this chip is built on the same 10 nm Enhanced SuperFin (Intel 7) node as "Alder Lake."

Well now... What's there to say?

Tropick

birdie said: This can't be true considering RPL retains the same TDP while offering significantly higher frequencies, more cache and E-cores.

I think it's definitely possible, I mean look what they were able to squeeze out of 14nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores and the TDP remained pretty much the same if not lower between parts with the same core count. A 14nm 4.0GHz quad core i7-6700k has a TDP of 91w whereas a 14nm 4.4GHz quad core i3-10320 has a TDP of 65w. I have no doubt they'd be able to do the same on 10nm.

AnotherReader

Impressive latency for a large and high clocked L2 cache. Before AMD moved to FinFET based nodes, Intel typically had larger and faster caches than them. Now that both of them use comparable nodes, it is interesting to see how they approach cache design.

phanbuey

Tropick said: I think it's definitely possible, I mean look what they were able to squeeze out of 14nm. There's definitely a performance increase going from Skylake cores to Comet Lake cores and the TDP remained pretty much the same if not lower between parts with the same core count. A 14nm 4.0GHz quad core i7-6700k has a TDP of 91w whereas a 14nm 4.4GHz quad core i3-10320 has a TDP of 65w. I have no doubt they'd be able to do the same on 10nm.

I think there was a lot of Alder Lake variation and inconsistency between bins / clocks. Some ADL chips hit 5.5 GHz easily while others can't get beyond 5.1 without going full reactor.

The current chip I have can hit 5.4 with no issues and it was a first batch 12600K bin. I think they sacrificed clocks for volumes in the initial process, and if you think about the top 12900K bins, the jump from 5.4/5.5 GHz to 5.6 GHz is much less impressive.

hs4

The Alder lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, the Raptor lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.

Rather, Zen4 is clocking up enough to offset the power efficiency improvements due to node advancements, and the power efficiency of RPL and Zen4 will be similar.

efikkan

AnotherReader said: Impressive latency for a large and high clocked L2 cache.

You can see they went from a 4-way 256 kB cache in Skylake, to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB; these actually work as separate caches. The benefits of adding more banks are maintaining latency and adding total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected; it's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will be dropping off at some point. Future nodes may make it feasible though to increase the bank size instead.
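The per-way figures in the comment above (4x64 kB, 8x64 kB, 10x128 kB, 16x128 kB) follow from dividing total capacity by associativity. The sketch below reproduces them using the conventional relation size = sets x ways x line size. The 64-byte line size is an assumption that is not stated in the thread, the names are illustrative, and the arithmetic by itself says nothing about whether each way behaves as an independent bank.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, not stated in the thread

# (core, total L2 size in KB, associativity), as quoted in the comment above.
L2_CONFIGS = [
    ("Skylake", 256, 4),
    ("Sunny Cove", 512, 8),
    ("Golden Cove", 1280, 10),
    ("Raptor Cove", 2048, 16),
]


def way_geometry(size_kb, ways, line_bytes=LINE_BYTES):
    """Return (KB per way, number of sets) for a set-associative cache."""
    per_way_kb = size_kb / ways
    sets = (size_kb * 1024) // (ways * line_bytes)
    return per_way_kb, sets


if __name__ == "__main__":
    for name, size_kb, ways in L2_CONFIGS:
        per_way_kb, sets = way_geometry(size_kb, ways)
        print(f"{name:12s} {ways:2d} ways x {per_way_kb:.0f} kB, {sets} sets")
```

Under these assumptions, Golden Cove and Raptor Cove share the same per-way size and set count, so the growth from 1280 kB to 2048 kB comes entirely from the extra ways, which matches the comment's 10x128 kB versus 16x128 kB reading.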

hs4 said: The Alder lake was not quite complete due to many modifications made in a short period of time, and there are still many inefficiencies that need to be removed. As a result, the Raptor lake seems to have made various improvements, including power efficiency, by removing these inefficiencies.

Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.

#10

hs4

efikkan said: You can see they went from a 4-way 256 kB cache in Skylake, to a 10-way 1280 kB cache in Golden Cove and a 16-way 2048 kB cache in Raptor Lake. What e.g. 4-way 256 kB actually means is 4 cache banks of 64 kB; these actually work as separate caches. The benefits of adding more banks are maintaining latency and adding total bandwidth (as cache banks work in parallel); the disadvantages are lower cache efficiency and more transistors. So maintaining latency when moving from Golden Cove's 10x128 kB to Raptor Lake's 16x128 kB is completely expected; it's actually more impressive to see the move from Sunny Cove's 8x64 kB to Golden Cove's 10x128 kB adding just two cycles.

It will be interesting to see if future microarchitectures try to add even more banks, as the efficiency will be dropping off at some point. Future nodes may make it feasible though to increase the bank size instead.

Which modifications are you thinking of?
At least the design process of Golden Cove (the big cores) was very long, >5 years.

What I consider to be improvements to "Alder lake", not only the Golden Cove, are

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ringbus (@OneRaichu already showed difference in core-to-core latency matrix between ADL and RPL)
- Optimization on clock-speed bottleneck
  - e.g. logically same and electrically different things, such as adjusting the distance between components

For example, ADL had significantly slower decompress than compress with 7zip (compared to other CPU trends), but that problem has been eliminated with RPL. There are not a few reports of algorithm-dependent performance improvement differences between ADL and RPL, but I suspect that this is probably the effect of the elimination of the data transfer bottleneck.

#11

efikkan

hs4 said: What I consider to be improvements to "Alder lake", not only the Golden Cove, are

- Data transfer, such as
  - cache (written in the spec sheet)
  - the ringbus (@OneRaichu already showed difference in core-to-core latency matrix between ADL and RPL)
- Optimization on clock-speed bottleneck
  - e.g. logically same and electrically different things, such as adjusting the distance between components

For example, ADL had significantly slower decompress than compress with 7zip (compared to other CPU trends), but that problem has been eliminated with RPL. There are not a few reports of algorithm-dependent performance improvement differences between ADL and RPL, but I suspect that this is probably the effect of the elimination of the data transfer bottleneck.

I was referring to "The Alder lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.

#12

hs4

efikkan said: I was referring to "The Alder lake was not quite complete due to many modifications made in a short period of time." Do you have evidence of this, or is it more an assumption based on the tweaks in Raptor Lake?

It's not unusual to see improvements to a second iteration of a microarchitecture, and it's not uncommon for a design to have unforeseen bottlenecks, imbalances and even the odd regression. All the major design decisions are made long before they get to run the real hardware, and by then they can only do minor tweaks without causing years of delays.

I understand what you mean. I wanted to say this: x86 hybrids have not yet had enough time to mature.

It was pointed out right after its launch that Alder Lake had data transfer problems. I first saw the article in my native language, but the following article in English, for example, might be a typical evaluation.

"Alder Lake – E-Cores, Ring Clock, and Hybrid Teething Troubles" by Chips and Cheese, December 16, 2021

But first attempts at new things often encounter teething problems. Alder Lake seems to be no exception. The software side has been well covered, but the hardware side is not flawless either. We expect Intel to improve on their hybrid architecture going forward as they get more experience and improving on this aspect with Alder Lake’s follow on, codename Raptor Lake, seems to be a goal for Intel as the leaked Raptor Lake slides indicated.

#13

kinz

This is mostly incorrect.

Latency, in clock cycles -- is only directly comparable with equal frequencies.

Cache speeds have increased significantly.

For example, looking at Rocket Lake running a ring bus speed (cache frequency) of 5 GHz, we have:

16 clocks @ 5 GHz = (1 / 5,000,000,000) s x 16 = 3.2 ns

There is more to the story than simply the frequency and that one timing, when it comes to real world latency, however.
Regardless, if you look at the measured L2 cache latency in the AIDA64 memory benchmark, for example, you'll see that it hasn't changed much at all since Skylake.
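The cycle-to-time conversion used in this comment can be sketched as below. The function name is illustrative, and the 4.0 GHz and 5.0 GHz clocks in the second example are hypothetical values chosen only to show how a higher cycle count at a higher clock can take the same absolute time.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds at the given clock.

    One cycle at f GHz lasts 1/f ns, so the latency is cycles / f.
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return cycles / clock_ghz


if __name__ == "__main__":
    # The example from the comment: 16 clocks at 5 GHz.
    print(f"16 cycles @ 5.0 GHz = {cycles_to_ns(16, 5.0):.1f} ns")

    # Hypothetical clocks: different cycle counts, same wall-clock latency.
    print(f"12 cycles @ 4.0 GHz = {cycles_to_ns(12, 4.0):.1f} ns")
    print(f"15 cycles @ 5.0 GHz = {cycles_to_ns(15, 5.0):.1f} ns")
```

This only models the cycle count at a single clock; as the comment notes, real-world latency depends on more than the frequency and that one timing.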
