Friday, November 15th 2024
Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops
Intel is coming around to the idea of large last-level caches on its processors. In an interview with Der8auer and Bens Hardware, Florian Maislinger, a tech communications manager at Intel, revealed that the company is working on augmenting its processors with large shared L3 caches, though it will begin doing so only with its server processors. A new server/workstation processor planned for 2025 will feature cache tiles that expand the shared L3, so that it excels in the kind of workloads AMD's EPYC "Genoa-X" and upcoming "Turin-X" processors are built for: technical computing. On "Genoa-X" processors, each of the up to 12 "Zen 4" CCDs carries stacked 3D V-Cache, which has a profound impact on performance in cache-sensitive applications such as the Ansys suite and OpenFOAM.
The interview reveals that the server processor with a large last-level cache should come out in 2025; there is no comparable effort on the horizon for the company's client processors, such as the Core Ultra "Arrow Lake-S," at least not in 2025. The recently launched "Arrow Lake-S" desktop processors do not deliver a generational gaming performance uplift over the 14th Gen Core "Raptor Lake Refresh." However, Intel claims to have identified certain correctable causes for gaming performance falling below expectations, and hopes to release updates for the processor (possibly as new microcode, or something at the OS-vendor level) that should improve the gaming performance of "Arrow Lake-S."
Sources:
Der8auer (YouTube), VideoCardz, HardwareLuxx.de
77 Comments on Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops
With a product released in 2015 by AMD, with packaging on an interposer?
Same as what Intel later calls "Foveros," just with an active vs. passive interposer.
A bit pedantic, given how almost all (if not all) CPUs use the L3 as a victim cache, but I think it's important to explain that that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
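To make the victim-cache flow concrete, here is a toy model in C. The sizes and the direct-mapped layout are invented for brevity (real caches are set-associative with smarter replacement), so treat it purely as a sketch of the fill/evict path:

```c
#include <stdio.h>
#include <string.h>

/* Toy model: a direct-mapped L2 whose evictions fall into a
 * direct-mapped "victim" L3. Sizes are made up for illustration;
 * real caches are set-associative with smarter replacement. */
#define L2_SETS 4
#define L3_SETS 16
#define INVALID ((long)-1)

static long l2[L2_SETS], l3[L3_SETS];

static void access_line(long addr)
{
    long *l2_slot = &l2[addr % L2_SETS];
    long *l3_slot = &l3[addr % L3_SETS];

    if (*l2_slot == addr) { printf("%3ld: L2 hit\n", addr); return; }

    if (*l3_slot == addr) {                 /* hit in the victim L3 */
        printf("%3ld: L3 hit, promoted back to L2\n", addr);
        *l3_slot = INVALID;                 /* exclusive: line leaves L3 */
    } else {
        printf("%3ld: miss, filled into L2 from memory\n", addr);
    }

    if (*l2_slot != INVALID)                /* the evictee becomes the victim */
        l3[*l2_slot % L3_SETS] = *l2_slot;
    *l2_slot = addr;                        /* new lines always land in L2 */
}

int main(void)
{
    memset(l2, -1, sizeof l2);
    memset(l3, -1, sizeof l3);
    long trace[] = {0, 4, 0, 8, 4, 0};      /* 0 and 4 collide in L2 */
    for (size_t i = 0; i < sizeof trace / sizeof *trace; i++)
        access_line(trace[i]);
    return 0;
}
```

Running it on the access pattern {0, 4, 0, 8, 4, 0} shows lines 0 and 4 bouncing between L2 and L3: they only ever enter L3 by being evicted from L2, which is the behaviour described above.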
And whether it's visible to a human, which is a big topic of discussion: yes vs. no.
I do not see those 45 FPS from my "FreeSync" ASUS PA278QV monitor with a Radeon 7800 XT in Windows 11 Pro.
--
For a few months in 2023 I had a Ryzen 3 3100, at the time I sold my Ryzen 5800X and my B550 mainboard. For daily usage this cheap CPU, bought second-hand for €30 and later sold for €30, was totally fine. I did not see any difference in GNU Gentoo Linux. The software is always compiling alongside my PC usage; it does not really matter if it takes a few more minutes in the background, the Linux kernel handles the load quite well.
--
I see the issue more with badly designed compilers; there should be more processor-specific optimisation. I think only a few packages use the AVX-512 instruction set of my Ryzen 7600X in GNU Gentoo Linux, whether while compiling or while executing the software.
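For what it's worth, a quick runtime check of what the CPU reports can be done with GCC/Clang's __builtin_cpu_supports; a minimal sketch (the strings passed in are the standard GCC feature names):

```c
#include <stdio.h>

/* Runtime feature check via GCC/Clang builtins; this is the same kind
 * of dispatch a package needs in order to use AVX-512 only where the
 * CPU actually has it. */
int main(void)
{
    __builtin_cpu_init();
    printf("avx512f:  %s\n",
           __builtin_cpu_supports("avx512f") ? "yes" : "no");
    printf("avx512vl: %s\n",
           __builtin_cpu_supports("avx512vl") ? "yes" : "no");
    return 0;
}
```

Whether packages then actually use those instructions mostly comes down to CFLAGS such as -march=native, whether each package's build system respects them, and how well the compiler auto-vectorizes.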
Intel Pentium 4 HT Extreme Edition 3.40 Specs | TechPowerUp CPU Database
AMD doesn't copy CUDA, they just want to be able to execute CUDA code.
The 7800X3D is undeniably a much faster gaming CPU and will give you a significantly higher FPS ceiling (and better 1%/0.1% lows) vs. a 9900K @ 5 GHz. There can absolutely be situations where both deliver more than the FPS you target in certain games, or where you are GPU-limited, but that does nothing to tell you the true gaming performance of a CPU.
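As an aside, since the 1%/0.1% lows come up here: they are derived from a frame-time log. A minimal sketch of one common definition, the average FPS of the slowest fraction of frames (capture tools differ in the exact method, and the sample values below are invented):

```c
#include <stdio.h>
#include <stdlib.h>

/* Frame times in milliseconds; a real log would come from a capture
 * tool. These sample values are invented for illustration. */
static double frames_ms[] = { 8.3, 8.4, 8.2, 9.1, 8.3,
                              25.0, 8.4, 8.2, 8.3, 8.5 };

static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);               /* slowest frames first */
}

/* Average FPS of the slowest `fraction` of frames: one common
 * definition of "1% lows"; tools differ in the exact method. */
static double low_fps(const double *ms, size_t n, double fraction)
{
    size_t k = (size_t)(n * fraction);
    if (k == 0) k = 1;                      /* at least one frame */
    double sum = 0;
    for (size_t i = 0; i < k; i++) sum += ms[i];
    return 1000.0 / (sum / k);              /* mean frame time -> FPS */
}

int main(void)
{
    size_t n = sizeof frames_ms / sizeof *frames_ms;
    qsort(frames_ms, n, sizeof *frames_ms, cmp_desc);
    printf("1%% low: %.1f FPS\n", low_fps(frames_ms, n, 0.01));
    return 0;
}
```

The single 25 ms spike dominates the 1% low even though the average FPS looks fine, which is exactly why the lows are reported alongside the average.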
Excellent video covering it here. This literally has absolutely nothing to do with the brand of the CPU.
The i7-5775C was a good CPU, end of story. If someone doesn't know how to configure it properly, or doesn't want to bother, they should just get themselves an iMac this year and the next iMac next year.
www.anandtech.com/show/16195/a-broadwell-retrospective-review-in-2020-is-edram-still-worth-it/12
I also recall W1zzard's 53-game test of the 5800X vs. 5800X3D with a 4090 had some games showing differences at 4K. Very much a large dose of "it depends" on this one, but I do find the math/science of it undeniable; it's more a case of whether that matters to the individual or not.
The top 15 best sellers on Amazon include two Zen parts with 3D V-Cache; the rest are all regular Ryzens, apart from two Intel CPUs.
You can see improvements in any workload that relies heavily on memory, e.g. rendering.
Anyway, this is OT, so let's just stop this, OK?
Rendering is one of the cases where the extra cache makes no difference, as far as I've seen.
The only cases where the extra cache is really worth it are CFD/HPC stuff and some other specific database workloads.
Interestingly enough, Ice Lake/Rocket Lake brought the legacy FMUL down from 5 to 4 cycles, as well as integer division (IDIV) from 97 cycles down to 18 cycles.
For comparison, Intel's current CPUs have 4 cycles for multiplication, 11 cycles for fp32 division using AVX, and 5 cycles for integer multiplication using AVX (per the official spec).
As for the "worst case" performers among legacy x87 instructions: examples are FSQRT (square root) at 14-21 cycles, sin/cos/tan at ~50-160 cycles, and the most complex, FBSTP, at 264 cycles, though that one is probably not very useful today. FDIV is 14-16 cycles (so slightly slower than its AVX counterpart). And for comparison, in Zen 4 the legacy x87 instructions seem to have lower latency overall than on Intel. All of these figures are from agner.org and are benchmarked, so take them with a grain of salt, but they are probably good approximations.
Many think "legacy" instructions are holding back the performance of modern x86 CPUs, but that's not true. Since the mid 90s, they've all translated the x86 ISA into their own specific micro-operations, and this is also how they support x87/MMX/SSE/AVX through the same execution ports; the legacy instructions are translated to micro-ops anyway. This allows them to design the CPUs to be as efficient as possible with the new features, yet still support the old ones. If the older ones happen to have worse latency, it's usually not an issue, as applications that rely on them are probably very old. One thing of note is that x87 instructions round differently than normal IEEE 754 fp32/fp64 does. It's not pedantic at all; you missed the point.
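A quick illustration of that rounding point, assuming x86 GCC or Clang where long double maps to the 80-bit x87 extended format (other ABIs map it differently):

```c
#include <stdio.h>

/* On x86 GCC/Clang, `long double` uses the 80-bit x87 format with a
 * 64-bit mantissa, versus the 53-bit mantissa of a regular double, so
 * the same arithmetic carries extra precision in x87 registers. */
int main(void)
{
    double      d = 1.0 / 3.0;     /* rounded to a 53-bit mantissa */
    long double e = 1.0L / 3.0L;   /* rounded to a 64-bit mantissa */

    printf("double:      %.25f\n", d);
    printf("long double: %.25Lf\n", e);
    return 0;
}
```

The printed values diverge after roughly 16 significant digits; when intermediates live in x87 registers, that extra precision can change the final result of an expression compared to evaluating it strictly in fp64.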
The prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 and then eventually evicted to L3, where it remains for a short window before being evicted from there too. Adding a lot of extra L3 only means the "garbage dump" will be larger, while adding just a tiny bit more L2 would allow the prefetcher to work differently. In other words: a larger L3 doesn't mean you can prefetch a lot more, it just means the data you've already prefetched anyway stays a little longer.
Secondly, as I said, the stream of data flowing through L3 is all coming from memory->L2, so the overall bandwidth here is limited by memory, even though the tiny bit you read back will have higher burst speed.
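A crude way to see both effects is to stream-sum buffers of growing size and watch throughput step down as the working set falls out of L2, then out of L3, then into DRAM. A sketch only; a serious test would pin the thread, defeat the hardware prefetcher with pointer chasing, and average repeated runs:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Crude working-set sweep: stream-sum buffers of growing size and
 * watch throughput step down as the set falls out of L2, then L3,
 * then into DRAM. Illustration only. */
int main(void)
{
    for (size_t kib = 256; kib <= 256 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(long);
        long *buf = malloc(n * sizeof *buf);
        long sum = 0;

        for (size_t i = 0; i < n; i++) buf[i] = i;   /* touch once */

        /* Read ~4 GiB in total regardless of buffer size. */
        size_t passes = (size_t)((4ULL << 30) / (n * sizeof *buf));
        clock_t t0 = clock();
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++) sum += buf[i];
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%7zu KiB: %6.1f GB/s (checksum %ld)\n",
               kib, (double)passes * n * sizeof *buf / s / 1e9, sum);
        free(buf);
    }
    return 0;
}
```

On a typical desktop part, the buffers that fit in cache report several times the throughput of the DRAM-sized ones, which is the burst-speed vs. memory-bound distinction above.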
Software that will be more demanding in the coming years will be more computationally intensive, so over the long term the faster CPUs will be favored over those with more L3 cache. Those that are very L3 sensitive will remain outliers.
Your idea would not hold for inclusive cache designs, like older Intel generations, since prefetching there is way easier and allows feeding both the L2 and the L3.
This now leads me to wonder how well an inclusive design would fare on modern CPUs with L3 caches that are multiple times bigger than the L2. At the least, latency could really improve, and the space wasted on duplicated data becomes kind of irrelevant given the total amount of cache available (see the back-of-envelope numbers below).
Not sure how well it'd fare in terms of side-channel attacks tho.
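For a rough sense of the duplication cost mentioned above, some back-of-envelope numbers, using Zen 4 CCD figures (8 cores x 1 MB L2, 32 MB victim L3) purely as an example:

```c
#include <stdio.h>

/* Back-of-envelope effective capacity, using Zen 4 CCD figures
 * (8 cores x 1 MB L2, 32 MB victim L3) purely as an example. */
int main(void)
{
    double l2_total = 8 * 1.0;   /* MB of L2 across the CCD */
    double l3       = 32.0;      /* MB of shared L3 */

    /* Exclusive/victim: L2 and L3 hold different lines. */
    printf("exclusive: %.0f MB of unique data\n", l2_total + l3);

    /* Inclusive: every L2 line is duplicated in L3, so the L2
     * adds no unique capacity. */
    printf("inclusive: %.0f MB of unique data (%.0f%% of L3 spent on copies)\n",
           l3, 100.0 * l2_total / l3);
    return 0;
}
```

With 96 MB of stacked L3, the duplication overhead would drop to roughly 8%, which is the point about it becoming kind of irrelevant as the L3 grows.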