Tuesday, September 17th 2019

Intel Adds More L3 Cache to Its Tiger Lake CPUs

InstLatX64 has posted a CPUID dump of Intel's next-generation 10 nm CPUs codenamed Tiger Lake. With a CPUID of 806C0, this Tiger Lake chip runs at a 1000 MHz base and 3400 MHz boost clock, lower than the current Ice Lake models, but that is to be expected given that this is likely just an engineering sample, meaning the production/consumer revision should reach higher frequencies.

Perhaps the most interesting finding in this dump is the new L3 cache configuration. Until now, Intel has usually provided 2 MB of L3 cache per core; with Tiger Lake, the plan appears to be to boost the amount of available cache by 50%, resulting in 3 MB per core, or 12 MB in total for this four-core chip. Increased cache capacity can add latency, since data has to travel a greater distance in and out of the cache, but Intel's engineers have presumably mitigated this. Additionally, full AVX-512 support is present, except for AVX512_BF16, which adds the bfloat16 floating-point format found in Cooper Lake Xeons.
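For the curious, the cache topology such dumps report can be read straight from the CPU. Below is a minimal sketch (not InstLatX64's actual tool) that walks CPUID leaf 4, Intel's "deterministic cache parameters", and checks the AVX512_BF16 feature bit in leaf 7, sub-leaf 1; it assumes GCC/Clang on an x86 processor:

```cpp
#include <cpuid.h>
#include <cstdio>

int main() {
    // Walk CPUID leaf 4: one sub-leaf per cache (type 0 ends the list).
    for (unsigned sub = 0; ; ++sub) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
            break;
        unsigned type = eax & 0x1F;              // 1=data, 2=instruction, 3=unified
        if (type == 0)
            break;
        unsigned level      = (eax >> 5) & 0x7;
        unsigned ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned line_size  =  (ebx & 0xFFF)     + 1;
        unsigned sets       =  ecx               + 1;
        unsigned long long kib =
            1ULL * ways * partitions * line_size * sets / 1024;
        std::printf("L%u %-11s %llu KB\n", level,
                    type == 1 ? "data" : type == 2 ? "instruction" : "unified",
                    kib);
    }
    // AVX512_BF16 is reported in leaf 7, sub-leaf 1, EAX bit 5.
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        std::printf("AVX512_BF16: %s\n", (eax & (1u << 5)) ? "yes" : "no");
}
```

On a four-core part configured as described above, the unified L3 line should report 12288 KB.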
Source: InstLatX64

67 Comments on Intel Adds More L3 Cache to Its Tiger Lake CPUs

#26
phanbuey
lynx29: I mean if AMD can't compete single core, they have no reason to push themselves faster do they? Even Ryzen 3000 falls behind in single in a lot of applications, so my guess is Intel knows Ryzen 4000 might be a 5-8% IPC gain over 3000 (if AMD is lucky), and they probably have already done the math knowing they can beat that or tie it on 10nm 4.5ghz. /shrug

Competition is the only way to make bigger gains, and Intel still has none for single core (not to mention you can't really OC Ryzen 3000, and Intel chips OC like a beast even on big air heatsinks like Noctua), further widening the gap.
Yeah - it's one of the reasons I went with the 8700K for now vs the 3600. I think the 4000 series, if they can get clocks up, will be better than current Intel stock. So if Intel has a hard time beating their old SC @ 5 GHz, they will be in a tough spot.
Posted on Reply
#27
HugsNotDrugs
Not only are the silicon costs higher, but there is also a power penalty for using increased amounts of on-die cache memory.

Having said that, I think the world is ready to move beyond 2 MB/core for mobile devices.
Posted on Reply
#28
phanbuey
Whatever happened to the Broadwell design with the L4 cache? That seemed to be useful in gaming and certain other applications.
Posted on Reply
#29
Steevo
eidairaman1: My 8350 runs 5 GHz on air, not exotic either.
On what process node? Higher latency (deep in-order pipelines) is masked by higher frequency, but when the pipeline has to be flushed because out-of-order or speculative execution and branch prediction fail, the stalls mean less IPC.

An architecture designed for a process node is best, which is why trading TDP for cache is a good value on smaller boxes that don't run at as high a frequency.

Intel can't get the frequency they want, so they are spending TDP on more cache to increase IPC instead.
Posted on Reply
#30
eidairaman1
The Exiled Airman
Steevo: On what process node? Higher latency (deep in-order pipelines) is masked by higher frequency, but when the pipeline has to be flushed because out-of-order or speculative execution and branch prediction fail, the stalls mean less IPC.

An architecture designed for a process node is best, which is why trading TDP for cache is a good value on smaller boxes that don't run at as high a frequency.

Intel can't get the frequency they want, so they are spending TDP on more cache to increase IPC instead.
FX-8350.
Posted on Reply
#31
Aquinus
Resident Wat-man
You know, you need more cache to help cancel out the memory latency inherent in MCM designs. Speculation is dangerous, though.
Posted on Reply
#32
Arc1t3ct
lynx29: I mean if AMD can't compete single core, they have no reason to push themselves faster do they? Even Ryzen 3000 falls behind in single in a lot of applications, so my guess is Intel knows Ryzen 4000 might be a 5-8% IPC gain over 3000 (if AMD is lucky), and they probably have already done the math knowing they can beat that or tie it on 10nm 4.5ghz. /shrug

Competition is the only way to make bigger gains, and Intel still has none for single core (not to mention you can't really OC Ryzen 3000, and Intel chips OC like a beast even on big air heatsinks like Noctua), further widening the gap.
Intel can't compete with AMD in IPC. They've lost that battle. The reason you sometimes see higher single-thread results is because those Intel CPUs run at higher frequencies and the software running is probably Intel optimized.

I can’t believe it either!
Posted on Reply
#33
efikkan
micropage7: Pushing L3 cache? To me it looks like a tweak to make it look a little bit better.
Increasing the L3 cache will only give marginal gains (except for edge cases), so I suspect this is partially a marketing decision.

L3 cache is a "spillover" cache, which basically means it holds data "discarded" because it no longer fit in L2. While the L3 has the advantage of being accessible across CPU cores, the Skylake family (excluding Skylake-X/-SP) does this interesting thing where the L3 cache is inclusive, meaning the L3 contains a duplicate of each core's L2 just in case another core wants to access it (which is mostly wasted), so Skylake-family chips effectively have much less L3 cache than you might think.

Speaking of usefulness, each cache line in L2 is obviously many times more useful than each cache line in L3. L2 is where data is prefetched into, while L3 holds data recently discarded from L2. More L2 cache seems like an obvious benefit, but L2 is more "costly" for several reasons: not only does it need more transistors per unit of capacity, it's also more closely connected to the pipeline and the front-end, and is very timing sensitive. This is why it's relatively easy to throw extra L3 into an existing design, while changing L2 requires a redesign.

Sooner or later, more (or smarter) L2 cache will be needed to feed the multiple execution ports and SIMD units in the cores. I would love to see CPU designs with way more L2 cache, like 1 MB or even 2 MB, but even with node shrinks it will be challenging to go much beyond that. I would argue that it may be time to split L2, and possibly even L3, into separate instruction and data caches. This would allow more flexible placement on the die; plus, with the "shared" L3 cache, it's only the instruction cache that is really shared in practice.
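To make the hierarchy above concrete, here is a minimal sketch (assuming GCC/Clang on an x86 machine; the sizes and the 10M iteration count are arbitrary) of a pointer-chasing loop where each load depends on the previous one. Per-access latency jumps as the working set spills out of L2 into L3 and then into DRAM, which is exactly why the extra megabytes matter:

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    for (std::size_t kb : {16, 256, 2048, 8192, 65536}) {
        std::size_t n = kb * 1024 / sizeof(std::size_t);
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});
        // Sattolo's algorithm: a random single-cycle permutation, so the
        // chase visits every slot and the prefetcher can't guess ahead.
        for (std::size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> d(0, i - 1);
            std::swap(next[i], next[d(rng)]);
        }
        std::size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < 10'000'000; ++i)
            idx = next[idx];                 // each load depends on the last
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / 1e7;
        std::printf("%6zu KB: %5.1f ns/access (idx=%zu)\n", kb, ns, idx);
    }
}
```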
phanbuey: Whatever happened to the Broadwell design with the L4 cache? That seemed to be useful in gaming and certain other applications.
I believe it was mostly used by the integrated graphics.
The problem with L4 is generally the same as the problem with L3, just worse; it's a spillover cache, which means it's only useful when it contains data discarded within the last few thousand clock cycles. The cache discards the least recently used data in each cache bank; there is no prioritization beyond that, which means you may need extreme amounts of L4 cache to make a significant difference across different workloads.

If L4 data and instruction caches were separate though (read my paragraph above), I would imagine that just a few MB of it could be useful, as data flows through at the rate of GBs per ms, while instructions will usually jump back and forth within "relatively few" MB.
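As a toy illustration of the LRU policy described above (just the replacement policy, not a model of a real cache; the class and tags here are made up), on a miss with a full set the least recently touched entry is evicted, no matter how useful it might be later:

```cpp
#include <cstddef>
#include <cstdio>
#include <list>
#include <unordered_map>

class LruSet {
    std::size_t capacity_;
    std::list<int> order_;                                   // front = most recent
    std::unordered_map<int, std::list<int>::iterator> pos_;
public:
    explicit LruSet(std::size_t capacity) : capacity_(capacity) {}
    void touch(int tag) {
        if (auto it = pos_.find(tag); it != pos_.end()) {
            order_.erase(it->second);                        // hit: move to front
        } else if (order_.size() == capacity_) {
            std::printf("evict %d\n", order_.back());        // miss: evict LRU
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(tag);
        pos_[tag] = order_.begin();
    }
};

int main() {
    LruSet set(2);
    for (int tag : {1, 2, 1, 3, 2})  // 3 evicts 2 (the LRU), then 2 evicts 1
        set.touch(tag);
}
```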
Arc1t3ct: Intel can't compete with AMD in IPC. They've lost that battle. The reason you sometimes see higher single-thread results is because those Intel CPUs run at higher frequencies and the software running is probably Intel optimized.
Nope, Intel still has the lead in IPC, while AMD manages better multicore clock scaling and individual boosting, plus they have the extra burst boost speed of XFR on top of regular boost.
Software isn't "Intel optimized". This BS needs to end now. They use the same ISA, and we don't have access to their micro-operations, so there is no real way to optimize for it, even if we wanted to.
Posted on Reply
#34
Darmok N Jalad
More cache could act as a thermal buffer on these smaller nodes. The chips can only get so small before you run out of surface area to dissipate heat. Cache could be a fairly simple way to add size without adding more of the heat-producing transistors. It’s also easy to segment off to increase yields. Oh, and you can get a small IPC uplift as well.
Posted on Reply
#35
Steevo
Darmok N Jalad: More cache could act as a thermal buffer on these smaller nodes. The chips can only get so small before you run out of surface area to dissipate heat. Cache could be a fairly simple way to add size without adding more of the heat-producing transistors. It's also easy to segment off to increase yields. Oh, and you can get a small IPC uplift as well.
Cache is a primary power consumer, as it must be constantly powered to keep its data valid. After prediction and caches, it's up to the code quality to determine how efficient execution is, which is why latency matters, and why AMD's Bulldozer ran at high frequency but couldn't keep up with Intel.
Posted on Reply
#36
voltage
Finally close to release. THIS is the CPU I have been waiting for, for many years... I hope I live long enough to be able to buy two (one for a desktop and one in a laptop).
Posted on Reply
#37
cyneater
Need moar cores… I mean cache :P
Posted on Reply
#38
BorgOvermind
This article implies that intel will actually have mass-produced working 10nm CPUs.
Posted on Reply
#39
Arc1t3ct
efikkan: Nope, Intel still has the lead in IPC, while AMD manages better multicore clock scaling and individual boosting, plus they have the extra burst boost speed of XFR on top of regular boost.
Have a look here:

And here: www.anandtech.com/show/14605/the-and-ryzen-3700x-3900x-review-raising-the-bar

efikkan: Software isn't "Intel optimized". This BS needs to end now. They use the same ISA, and we don't have access to their micro-operations, so there is no real way to optimize for it, even if we wanted to.
Have a look here: software.intel.com/en-us/ipp

And here: www.amazon.com/Optimizing-Applications-Multi-Core-Processors-Performance/dp/1934053015/ref=sr_1_1?keywords=Optimizing+Applications+for+Multi-Core+Processors,+Using+the+Intel+Integrated+Performance+Primitives&qid=1568800003&s=gateway&sr=8-1

I'm seriously considering switching from my trusty 4770K to a 3950X for my main system. Hell has frozen over...
Posted on Reply
#40
Octopuss
So Tiger Lake is a successor to Ice Lake CPUs that don't exist either.
Yes, I get it.
Dafuq Intel.
Posted on Reply
#41
HenrySomeone
lynx29: I mean if AMD can't compete single core, they have no reason to push themselves faster do they? Even Ryzen 3000 falls behind in single in a lot of applications, so my guess is Intel knows Ryzen 4000 might be a 5-8% IPC gain over 3000 (if AMD is lucky), and they probably have already done the math knowing they can beat that or tie it on 10nm 4.5ghz. /shrug

Competition is the only way to make bigger gains, and Intel still has none for single core (not to mention you can't really OC Ryzen 3000, and Intel chips OC like a beast even on big air heatsinks like Noctua), further widening the gap.
Precisely, and when you also OC Intel's chips, they lead in all single-thread-heavy applications, full stop. Despite what hordes of ardent AMD fan(boy)s all over the internet would have you believe, the red team is still the one playing catch-up.
Posted on Reply
#42
nurion
fynxer: Holy sh*t, Intel must be really desperate to sacrifice that much silicon real estate to more cache in a bid to catch up with AMD.

A larger silicon die will seriously cut into Intel's profits; at this point Intel is desperate, having realized that 10 nm is not going to save them from AMD's 7 nm+ EUV.

The only thing Intel can do now is continue lying and using inaccurate data in the press to try to hold back AMD from cutting into the big market share they have in notebooks, but rest assured that AMD is coming for that too in a big way next year.
I don't think we're ever going to see 10 nm desktop parts with more than 4 cores.
Intel, despite having so many fabs, goes to Samsung to help them out because the 10 nm node isn't returning the expected core counts, yields, frequencies and voltages, besides the technical difficulties of manufacturing.
So I think Intel goes for 7 nm in mid-'20 to '21.
Posted on Reply
#43
efikkan
Arc1t3ct: Have a look here: software.intel.com/en-us/ipp

And here: www.amazon.com/Optimizing-Applications-Multi-Core-Processors-Performance/dp/1934053015/ref=sr_1_1?keywords=Optimizing+Applications+for+Multi-Core+Processors,+Using+the+Intel+Integrated+Performance+Primitives&qid=1568800003&s=gateway&sr=8-1

I'm seriously considering switching from my trusty 4770K to a 3950X for my main system. Hell has frozen over...
Just linking to some random Intel libraries… yeah, you don't quite get how software development works.
As I said, (normal) software isn't "Intel optimized". To "optimize" for something specific, we would need unique instructions differentiating it from the competition, and we would have to ship multiple compiled versions of the software, but then it would no longer be the same software. As I said, Intel and AMD generally have the same ISA, with the exception of new instructions that one or the other adds, to which the other responds by adding support later. So if you wanted to "optimize for Intel", you would have to look for instructions that AMD doesn't support (yet) and build the software around them using assembly code or intrinsics, not high-level stuff. But if these new instructions are useful, AMD usually adds support shortly after, and then your code is no longer "Intel optimized".
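For what it's worth, a minimal sketch of what "building the software around instructions the other vendor lacks" looks like in practice is runtime feature dispatch. __builtin_cpu_supports is a real GCC/Clang builtin; the kernel functions here are hypothetical stand-ins. Note the check is per ISA feature, not per vendor, so AMD would get the fast path the moment it implements the same instructions:

```cpp
#include <cstdio>

// Hypothetical kernels: a wide-vector path and a baseline x86-64 path.
static void kernel_avx512() { std::puts("AVX-512 path"); /* wide-vector code */ }
static void kernel_sse2()   { std::puts("SSE2 path");    /* baseline x86-64 */ }

int main() {
    // The same binary picks the fastest path the running CPU supports,
    // regardless of who made it.
    if (__builtin_cpu_supports("avx512f"))
        kernel_avx512();
    else
        kernel_sse2();
}
```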

In reality, optimizing code is not about optimizing for Intel or AMD, and even if it were, it would be for features of microarchitectures, not "Intel" or "AMD". The reason a piece of code performs differently on e.g. Skylake, Ice Lake, Zen 1 or Zen 2 is different resource bottlenecks. Intel and AMD keep changing/improving the various resources in their CPUs: prefetching, branch prediction, caches, execution port configuration, ALUs, FPUs, vector units, AGUs, etc. Even if I intentionally or unintentionally optimize my code so that it happens to scale better on Zen 2 than Skylake right now, Ice Lake or the next microarchitecture is likely to change that resource balance and tilt it the other way. When we write software, we can't target the CPU's micro-operations, so we can't truly optimize for a specific microarchitecture; and when we have "optimal" code where one algorithm scales better on Skylake and another scales better on Zen 2, it doesn't mean there is something wrong with either, it just means their workloads happen to be better balanced for those respective CPUs, in things like the mix of integer and floating-point operations, branching, SIMD, etc.

Since the ISA for x86 CPUs is the same, and we can't target any of the underlying microarchitectures, optimization is by design generic. Optimizing code is about removing redundancies, bloat, abstractions and branching, using SIMD, and, often most importantly, optimizing for cache. Optimizations like this will always benefit all modern x86 microarchitectures, and while the relative gain may vary, a good optimization will work for all of them, including future unknown microarchitectures.
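A classic example of the kind of generic cache optimization meant here (a sketch, not anyone's actual code; N is arbitrary) is traversing a row-major matrix in row order instead of column order. The gain comes from cache-line reuse, so it shows up on any modern x86 CPU, Intel or AMD alike:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 4096;
    std::vector<float> m(N * N, 1.0f);
    double sum = 0.0;

    // Slow: a column-major walk touches a new cache line on every access.
    // for (std::size_t j = 0; j < N; ++j)
    //     for (std::size_t i = 0; i < N; ++i)
    //         sum += m[i * N + j];

    // Fast: a row-major walk reuses each 64-byte line for 16 floats.
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            sum += m[i * N + j];

    std::printf("sum = %.0f\n", sum);
}
```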

So no; software like games, photo editors, video editors, CADs, web browsers, office applications, development tools, etc. are not "Intel optimized".
Posted on Reply
#44
ZoneDymo
HenrySomeone: Precisely, and when you also OC Intel's chips, they lead in all single-thread-heavy applications, full stop. Despite what hordes of ardent AMD fan(boy)s all over the internet would have you believe, the red team is still the one playing catch-up.
Are you sure you are not living in a bubble?
Posted on Reply
#45
Octopuss
efikkan: Just linking to some random Intel libraries… yeah, you don't quite get how software development works. [snip]
So why do some software/games run better on Intel, for example?
Posted on Reply
#46
efikkan
Octopuss: So why do some software/games run better on Intel, for example?
As I said, Intel and AMD keep changing/improving the balance of resources in their CPUs: prefetching, branch prediction, caches, execution port configuration, ALUs, FPUs, vector units, AGUs, etc. As of right now, Zen/Zen 2 have ALUs and FPUs/vector units spread across more execution ports, while Skylake has its spread across fewer, more flexible execution ports. So, depending on how the instructions are shuffled, AMD can reach a higher maximum throughput for single (non-vector) operations if they are just the right mix of int and float, while Skylake's more flexible design can reach better average performance across a wider range of workloads, though its maximum throughput for non-vector operations is lower. This is why we see Zen(2) pull ahead with a good margin in a few benchmarks, while Intel does very well on average across more workloads. Intel is of course still helped by a better front-end (especially in games).

Most x86 microarchitectures have, since the early 90s, been using custom "RISC-like" micro-operations. These native, architecture-specific instructions are not available to us software developers, nor would it be feasible to use them, as any code using such instructions would be locked to a specific microarchitecture, and the assembly code would have to be tied to the precise ALU, FPU and register configuration of the superscalar design. There is no direct way to control the micro-operations on the CPU, so we are left with the x86 ISA, which is shared between them. Even if we wanted to, we can't truly optimize for a specific one; we can only change the algorithms/logic and benchmark them to see what performs best.

Very little software these days even uses assembly to optimize code. Most applications you use are written in C++ or even higher-level languages, and low-level optimizations (even generic x86 ones) are very rare in such applications. In fact, most software today is poorly written, rushed, highly abstracted crap, and it's more common for code bases not to be performance-optimized at all.
Even if it were technically possible, most coders are too lazy to conspire to "optimize" for Intel and sabotage AMD.
Posted on Reply
#47
Octopuss
Hm, so laziness is the reason/one of the reasons why we need faster and faster computers to run basically the same stuff? I'm looking at you, Windows.
Posted on Reply
#48
yovi
eidairaman1: Runs fine, no problems from it 24/7, all modules are at 5.0.
Hey, can you help me with OC? My specs



voltage
Posted on Reply
#49
eidairaman1
The Exiled Airman
yovi: Hey, can you help me with OC? My specs



voltage
My rig is not hooked up at the moment and YMMV; @Bones @ShrimpBrime might be able to help.
Posted on Reply
#50
Unregistered
Need a screen shot of load temps.

FX is not limited by anything except temperatures.
If you can cool it, it'll clock higher.