Tuesday, September 17th 2019
Intel Adds More L3 Cache to Its Tiger Lake CPUs
InstLatX64 has posted a CPU dump of Intel's next-generation 10 nm CPUs, codenamed Tiger Lake. Carrying a CPUID of 806C0, this Tiger Lake chip runs at a 1000 MHz base clock and a 3400 MHz boost clock, which is lower than current Ice Lake models. That is to be expected, however, as this is likely just an engineering sample, and the production/consumer revision should reach higher frequencies.
Perhaps the most interesting finding in this dump is the new L3 cache configuration. Until now, Intel has usually provisioned 2 MB of L3 cache per core; with Tiger Lake, the plan seems to be to boost the amount of available cache. We are now looking at 50% more L3 cache, or 3 MB per core and 12 MB in total for this four-core chip. A larger cache can add latency, because data has to travel a greater distance to get in and out of it, but Intel's engineers have presumably addressed that. Additionally, full AVX-512 support is present, with the exception of AVX512_BF16, the bfloat16 floating-point extension found in Cooper Lake Xeons.
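For readers who want to check flags like these on their own machine, here is a minimal sketch, assuming GCC or Clang on x86 and using the compiler's <cpuid.h> helpers. The leaf/bit positions are from Intel's public CPUID documentation; everything else (names, output format) is just for illustration.

```c
/* Minimal sketch: read the CPU signature and the AVX512_BF16 feature bit.
 * Assumes GCC/Clang on x86; build with: cc -o cpufeat cpufeat.c */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1: EAX holds the family/model/stepping signature
     * (the "CPUID 806C0"-style value quoted in dumps like this one). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("CPU signature: %05X\n", eax);

    /* Leaf 7, sub-leaf 1: EAX bit 5 reports AVX512_BF16 support. */
    if (__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        printf("AVX512_BF16: %s\n", (eax & (1u << 5)) ? "yes" : "no");
    else
        printf("CPUID leaf 7 sub-leaf 1 not supported\n");

    return 0;
}
```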
Source:
InstLatX64
67 Comments on Intel Adds More L3 Cache to Its Tiger Lake CPUs
Having said that, I think the world is ready to move beyond 2 MB/core for mobile devices.
An architecture designed for its process node works best, which is why trading TDP for cache is good value on smaller boxes that don't run at as high a frequency.
Intel can't get the frequency they want, so they're spending that TDP on more cache to increase IPC instead.
I can’t believe it either!
L3 cache is a "spillover" cache, which basically means it holds data "discarded" from L2 because it didn't fit there any more. While the L3 has the advantage of being accessible across CPU cores, the Skylake family (excluding Skylake-X/-SP) does this interesting thing where the L3 cache is inclusive, meaning the L3 contains a duplicate of each core's L2 just in case another core wants to access it (which is mostly wasted), so Skylake-family chips effectively have much less L3 cache than you might think.
Speaking of usefulness, each cache line in L2 is obviously many times more useful than each cache line in L3. L2 is where data is prefetched into, while L3 holds data recently discarded from L2. More L2 cache seems like an obvious benefit, but L2 is more "costly" for several reasons: not only does it need more transistors per unit of capacity, it is also more closely connected to the pipeline and the front-end, and it is very timing sensitive. This is why it's relatively easy to throw extra L3 into an existing design, while changing L2 requires a redesign.
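To make the L2-vs-L3 difference concrete, here is a rough pointer-chasing sketch (not a rigorous benchmark: no core pinning, no frequency control, sizes and step counts picked arbitrarily). On a typical Skylake-class part you would expect the ns/load figure to step up once the working set no longer fits in L2, again when it spills out of L3, and again in DRAM.

```c
/* Rough sketch: chase pointers through working sets of growing size to
 * expose the latency steps between L1, L2, L3 and DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t *ring, size_t steps)
{
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = ring[i];                  /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = i; (void)sink;   /* keep the loop from being optimized away */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void)
{
    for (size_t kb = 16; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *ring = malloc(n * sizeof(size_t));
        if (!ring) return 1;

        /* Sattolo's shuffle: builds one big cycle so the chase visits every
         * element and the hardware prefetchers can't guess the next address. */
        for (size_t i = 0; i < n; i++) ring[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
        }

        printf("%6zu KB: %.1f ns/load\n", kb, chase(ring, 10 * 1000 * 1000));
        free(ring);
    }
    return 0;
}
```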
Sooner or later more (or smarter) L2 cache will be needed to feed the multiple execution ports and SIMD units in the cores. I would love to see CPU designs with way more L2 cache, like 1 MB or even 2 MB, but even with node shrinks it will be challenging to go far beyond that. I would argue that it may be time to split L2, and possibly even L3, into separate instruction and data caches. This would allow more flexible placement on the die, plus with the "shared" L3 cache it's only the instruction cache that is really shared in practice. As for the old eDRAM L4, I believe it was mostly used by the integrated graphics.
The problem with L4 is generally the same as the problem with L3, just worse; it's a spillover cache, which means it's only useful when it contains lines discarded within the last few thousand clock cycles. The cache discards the least recently used data in each cache bank, and there is no prioritization beyond that, which means you may need extreme amounts of L4 cache to make a significant difference across different workloads.
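A toy model of the "least recently used" eviction described above, assuming a single fully associative bank with four entries (real caches are set-associative and track recency per set; the access pattern here is made up):

```c
/* Toy LRU bank: on a miss, the entry touched longest ago is discarded,
 * with no further prioritization -- exactly the behaviour described above. */
#include <stdio.h>

#define WAYS 4

struct line { long tag; unsigned long last_used; int valid; };

static struct line bank[WAYS];
static unsigned long tick;

static void access_line(long tag)
{
    tick++;

    /* Hit: just refresh the recency stamp. */
    for (int i = 0; i < WAYS; i++) {
        if (bank[i].valid && bank[i].tag == tag) {
            bank[i].last_used = tick;
            printf("access %ld: hit\n", tag);
            return;
        }
    }

    /* Miss: victim is an empty way if one exists, otherwise the LRU entry. */
    int victim = 0;
    for (int i = 1; i < WAYS; i++) {
        if (!bank[victim].valid)
            break;
        if (!bank[i].valid || bank[i].last_used < bank[victim].last_used)
            victim = i;
    }

    if (bank[victim].valid)
        printf("access %ld: miss, evicting %ld\n", tag, bank[victim].tag);
    else
        printf("access %ld: miss, filling empty way %d\n", tag, victim);

    bank[victim] = (struct line){ .tag = tag, .last_used = tick, .valid = 1 };
}

int main(void)
{
    long pattern[] = { 1, 2, 3, 4, 1, 5, 2, 6 };
    for (unsigned i = 0; i < sizeof pattern / sizeof pattern[0]; i++)
        access_line(pattern[i]);
    return 0;
}
```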
If L4 data and instruction caches were separate though (see my paragraph above), I would imagine that just a few MB of it could be useful, as data streams through at rates of several GB/s, while instructions usually jump back and forth within "relatively few" MB.

Nope, Intel still has the lead in IPC, while AMD manages better multi-core clock scaling and individual boosting, plus they have the extra boost speed of XFR on top of regular boost.
Software isn't "Intel optimized". This BS needs to end now. They use the same ISA, and we don't have access to their micro-operations, so there is no real way to optimize for them even if we wanted to.
And Here: www.anandtech.com/show/14605/the-and-ryzen-3700x-3900x-review-raising-the-bar
Have a look here: software.intel.com/en-us/ipp
And here: www.amazon.com/Optimizing-Applications-Multi-Core-Processors-Performance/dp/1934053015/ref=sr_1_1?keywords=Optimizing+Applications+for+Multi-Core+Processors,+Using+the+Intel+Integrated+Performance+Primitives&qid=1568800003&s=gateway&sr=8-1
I'm seriously considering switching from my trusty 4770K to a 3950X for my main system. Hell has frozen over...
Yes, I get it.
Dafuq Intel.
Intel, despite having so many fabs, is going to Samsung for help because the 10 nm node isn't delivering the expected core counts, yields, frequencies and voltages, on top of the technical difficulties of manufacturing it. So I think Intel will go for 7 nm in mid-2020 to 2021.
As I said, (normal) software isn't "Intel optimized". To "optimize" for something specific, we would need unique instructions differentiating it from the competition, and we would have to ship multiple compiled versions of the software, but then it would no longer be the same software. As I said, Intel and AMD generally share the same ISA, with the exception of new instructions that one or the other adds first, with the other responding by adding support later. So if you wanted to "optimize for Intel", you would have to look for instructions that AMD doesn't support (yet) and build the software around them using assembly code or intrinsics, not high-level stuff. But if those new instructions are useful, AMD usually adds support shortly after, and then your code is no longer "Intel optimized".
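To illustrate the "multiple compiled versions" point: the closest thing that actually happens in practice is runtime dispatch on feature flags, not on vendor. A minimal sketch using GCC/Clang's __builtin_cpu_supports; the builtin and the "avx2"/"avx512f" flag names are real, while the sum_* kernels are hypothetical placeholders:

```c
/* Minimal sketch of runtime dispatch: the code branches on *features*
 * (AVX2, AVX-512F), never on "Intel" or "AMD". */
#include <stdio.h>

static double sum_scalar(const float *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* In a real code base these would be compiled separately with -mavx2 /
 * -mavx512f and use wider loads; kept scalar here for brevity. */
static double sum_avx2(const float *x, int n)   { return sum_scalar(x, n); }
static double sum_avx512(const float *x, int n) { return sum_scalar(x, n); }

int main(void)
{
    float data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double (*sum)(const float *, int) = sum_scalar;

    __builtin_cpu_init();                    /* initialize feature detection */
    if (__builtin_cpu_supports("avx512f"))
        sum = sum_avx512;
    else if (__builtin_cpu_supports("avx2"))
        sum = sum_avx2;

    printf("sum = %.1f\n", sum(data, 8));
    return 0;
}
```

The dispatch picks whichever version the CPU can run, regardless of who made it, which is exactly why this isn't "Intel optimization".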
In reality, optimizing code is not about optimizing for Intel or AMD, and even if it were, it would be for features of microarchitectures, not "Intel" or "AMD". The reason a piece of code performs differently on e.g. Skylake, Ice Lake, Zen 1 or Zen 2 is different resource bottlenecks. Intel and AMD keep changing/improving the various resources in their CPUs: prefetching, branch prediction, caches, execution port configuration, ALUs, FPUs, vector units, AGUs, etc. Even if I intentionally or unintentionally optimize my code so that it happens to scale better on Zen 2 than on Skylake right now, Ice Lake or the next microarchitecture is likely to change that resource balance and tilt it the other way. When we write software, we can't target the CPU's micro-operations, so we can't truly optimize for a specific microarchitecture. And when we have "optimal" code where one algorithm scales better on Skylake and another scales better on Zen 2, it doesn't mean there is something wrong with either; it just means their workloads happen to be better balanced for those respective CPUs, in terms of integer vs. floating-point operations, branching, SIMD, etc.
Since the ISA for x86 CPUs is the same, and we can't target any of the underlying microarchitectures, optimization is by design generic. Optimizing code is about removing redundancies, bloat, abstractions and branching, using SIMD, and, often most importantly, cache optimization. Optimizations like these will always benefit all modern x86 microarchitectures, and while the relative gain may vary, a good optimization will work for all of them, including future unknown microarchitectures.
So no; software like games, photo editors, video editors, CADs, web browsers, office applications, development tools, etc. are not "Intel optimized".
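As a concrete example of the kind of generic, vendor-agnostic cache optimization described above (the matrix size and loop bounds here are arbitrary, just for illustration): walking a matrix in the order it is laid out in memory is faster on every modern x86 core, Intel or AMD alike.

```c
/* Vendor-agnostic cache optimization: summing a matrix row-by-row (the order
 * it is stored in memory) instead of column-by-column. The row-major walk
 * touches contiguous cache lines; the column-major walk strides through
 * memory and thrashes the caches on any microarchitecture. */
#include <stdio.h>
#include <time.h>

#define N 4096
static float m[N][N];              /* ~64 MB, static so it isn't on the stack */

static double seconds(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void)
{
    double sum = 0.0, t;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = (float)(i ^ j);

    /* Cache-hostile: each load jumps N*sizeof(float) bytes ahead. */
    t = seconds();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    printf("column-order: %.3f s\n", seconds() - t);

    /* Cache-friendly: consecutive loads walk contiguous cache lines. */
    t = seconds();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    printf("row-order:    %.3f s  (sum %.0f)\n", seconds() - t, sum);

    return 0;
}
```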
Most x86 microarchitectures have, since the early 90s, been using custom "RISC-like" micro-operations. These native, architecture-specific instructions are not available to us software developers, nor would it be feasible to use them: any code using such instructions would be locked to a specific microarchitecture, and the assembly code would have to be tied to the precise ALU, FPU and register configuration of the superscalar design. Since there is no direct way to control the micro-operations on the CPU, we are left with the x86 ISA, which is shared between them. Even if we wanted to, we can't truly optimize for a specific one; we can only change the algorithms/logic and benchmark them to see what performs best.
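To show where the lowest level we can actually reach sits, here is a small sketch using AVX intrinsics (the function and array names are just examples). The intrinsics map to shared x86 instructions such as VADDPS; how those instructions get cracked into micro-operations on Skylake versus Zen 2 is decided inside each core and is out of our hands.

```c
/* The lowest level a developer can target is the shared x86 ISA, e.g. AVX
 * intrinsics; the micro-operations they become differ per microarchitecture.
 * Build with: cc -mavx -O2 ... */
#include <immintrin.h>
#include <stdio.h>

/* Add two float arrays 8 elements at a time; n must be a multiple of 8 here. */
static void add_f32(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  /* one VADDPS per 8 floats */
    }
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];

    add_f32(a, b, out, 8);
    for (int i = 0; i < 8; i++) printf("%.0f ", out[i]);
    printf("\n");
    return 0;
}
```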
Very little software these days even uses assembly to optimize code. Most applications you use are written in C++ or even higher-level languages, and any low-level optimizations (even generic x86 ones) are very rare in such applications. In fact, most software today consists of poorly written, rushed, highly abstracted pieces of crap, and it's more common that code bases are not performance optimized at all.
Even if it was technically possible, most coders are too lazy to conspire to "optimize" for Intel and sabotage AMD.
FX is not limited by anything except temperatures.
If you can cool it, it'll clock higher.