Friday, November 15th 2024

Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

Intel is coming around to the idea of large last-level caches on its processors. In an interview with Der8auer and Bens Hardware, Florian Maislinger, a tech communications manager for Intel, revealed that the company is working on augmenting its processors with large shared L3 caches, but that it will begin doing so only with its server processors. The company is preparing a server/workstation processor for 2025 with cache tiles that enlarge the shared L3 cache, so that it excels in the kind of workloads AMD's EPYC "Genoa-X" and upcoming "Turin-X" processors are built for—technical computing. On "Genoa-X" processors, each of the up to 12 "Zen 4" CCDs comes with stacked 3D V-Cache, which has a profound impact on performance in cache-sensitive applications such as the Ansys suite, OpenFOAM, and the like.

The interview reveals that the server processor with the large last-level cache should come out in 2025; however, there is no such effort on the horizon for the company's client processors, such as the Core Ultra "Arrow Lake-S," at least not in 2025. The recently launched "Arrow Lake-S" desktop processors do not provide a generational gaming performance uplift over the 14th Gen Core "Raptor Lake Refresh." Intel claims to have identified certain correctable reasons for the gaming performance falling below expectations, and hopes to release updates for the processor (possibly in the form of new microcode, or something at the OS-vendor level). This, the company claims, should improve the gaming performance of "Arrow Lake-S."
Sources: Der8auer (YouTube), VideoCardz, HardwareLuxx.de

77 Comments on Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

#51
marios15
efikkanThat's just plainly wrong.
Most core AVX operations are within 1-5 cycles on recent architectures. Haswell and Skylake did a lot to improve AVX throughput, but there have been several improvements since then too. E.g. add operations are now down from 4 to 2 cycles on Alder Lake and Sapphire Rapids. Shift operations are down to a single cycle. This is as fast as single integer operations. And FYI, all floating point operations go through the vector units, whether it's single operation, SSE or AVX, the latency will be the same. ;)
Oh, I guess I confused them with the x87 ones that take forever. I was mostly talking about the complex AVX instructions, not just add/multiply; I am pretty sure there are a few that take 20-40 cycles.
CraptacularYes, AMD designed a chip around packaging technology that is TSMC's; the 3D packaging technology is TSMC's, not AMD's.

The below is from my second link.

If you want TSMC's official link you can find it here: https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/3DFabric.htm

In fact, here is TSMC's press release introducing the packaging: Introducing TSMC 3DFabric: TSMC's Family of 3D Silicon Stacking, Advanced Packaging Technologies and Services - Taiwan Semiconductor Manufacturing Company Limited

Wasn't Micron with HBM the first one to release 3D stacking (4 stacks IIRC)?

With a product released in 2015 by AMD using packaging on an interposer?
Same as what Intel later calls "Foveros," just with an active vs. passive interposer.
#52
igormp
efikkanYou don't grasp the difference between L2 and L3 caches. L3 only contains data recently discarded by L2
L3 victim cache*
Bit pedantic given how almost all (if not all) CPUs use L3 as a victim cache, but I think it's important to explain that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
#53
_roman_
DaworaYour PC has something wrong, maybe slow RAM?
My second PC with a 9900K and 4090 shows no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU.
Even at 1440p there are no big differences.
That depends on your software.

And whether it's visible to a human, which is a big topic of discussion (yes vs. no).
I do not see those 45 FPS on my "FreeSync" ASUS PA278QV monitor with a Radeon 7800 XT in Windows 11 Pro.

--

For a few months in 2023 I had a Ryzen 3 3100, around the time I sold my Ryzen 5800X and my B550 mainboard. For daily usage this cheap CPU, bought second hand for 30 € and later sold for 30 €, was totally fine. I did not see any difference in GNU Gentoo Linux. Software is always compiling alongside my normal PC usage, and it does not really matter if that takes a few more minutes in the background - the Linux kernel handles the load quite well.

--

I see the bigger issue with badly designed compilers; there should be more optimisation for the specific processor. I think only a few packages use the AVX-512 instruction set of my Ryzen 7600X in GNU Gentoo Linux, either while compiling or while executing the software.
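For illustration, here is a minimal sketch of what processor-specific optimisation means in practice (a hypothetical example, not from the post above; the flags are standard GCC options, and whether AVX-512 actually gets emitted depends on the GCC version and its preferred vector width for the target):

/* saxpy.c - a trivial loop the compiler can auto-vectorize.
 * Generic build:            gcc -O2 -c saxpy.c
 * Build for the local CPU:  gcc -O3 -march=native -fopt-info-vec -c saxpy.c
 * On a Zen 4 part such as the Ryzen 7600X the second command may use
 * AVX-512 instructions; the first one sticks to the baseline x86-64 ISA. */
void saxpy(float a, const float *x, float *restrict y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* simple multiply-add, easy to vectorize */
}

On Gentoo that target normally comes from CFLAGS in /etc/portage/make.conf, but a package only benefits if its hot loops are written (or auto-vectorizable) in a way that lets the compiler use the wider instructions.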
#54
Space Lynx
Astronaut
SteevoIt was based on AMD interposer technology for the first HBM stacks in 2015. That Intel also copied, and Nvidia.

I have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. But if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell, though, if you ask me.
#56
Tropick
luytenI always thought the eDRAM in those Intel processors was for the iGPU...
It typically is but if you disable the iGPU in the BIOS the CPU cores get exclusive access to the eDRAM. I have heard it does shave off another ~2MB L3 cache in order to store the tags necessary to address the 128MB slice though. Funnily enough I've recently been messing around with an old i7-5775C.
#57
Steevo
Space LynxI have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. But if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell, though, if you ask me.
Much like Intel pays AMD for x64, I would wager their engineering costs are covered by royalties. How many years did AMD bleed money? They were propped up at least somewhat by their competitors through licensing fees.
#58
JustBenching
lexluthermiesterWhat a shame! Intel's desktop lineup could really use such a boost.
To be fair, they don't. I checked TPU's latest review: a 13600K can deliver, on average, 2 to 3 times more frames at 720p than the 4090 can do at 4K. It depends on the game of course, but if you include all the games tested in the 9800X3D review, we need much, much faster GPUs for the CPUs to play any important role. There are only two games where a faster CPU than the 13600K would matter, and in both of those the framerate was at 120 and above (BG3 and Starfield).
#59
unwind-protect
Space LynxI have never understood the point of R&D in a company. As you mention here, AMD invested in something, but then Intel and Nvidia just copied it: a win-win for them, saving on R&D costs. But if AMD tries to copy the CUDA programming language, they will die... I will never understand that world. Sounds shady as hell, though, if you ask me.
There is a lot of software written in CUDA, and the quality of the tools for CUDA ensures that this trend won't stop anytime soon.

AMD doesn't copy CUDA, they just want to be able to execute CUDA code.
#60
lexluthermiester
JustBenchingTo be fair, they don't.
Sure they do. Why wouldn't they? Are you kidding? The reason AMD currently has the gaming throne is directly because of the Ryzen CPUs with stacked 3D cache. Without it, well, where are the NON-3D cache AMD CPUs ranked? Right, there's the answer. If Intel were to employ something similar and get it right, their CPU lineup would return rather handily to the top spot.
#61
wolf
Better Than Native
DaworaMy second PC with a 9900K and 4090 shows no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU
This could be for a few reasons, like the games you play, or the FPS you're comfy with or targeting.

The 7800X3D is undeniably a much faster gaming CPU and will give you a significantly higher FPS ceiling (and better 1%/0.1% lows) vs. a 9900K @ 5 GHz. There absolutely can be situations where both provide more than the FPS you target in certain games, or where you are GPU-limited, but that does nothing to tell you the true gaming performance of a CPU.

Excellent video covering it here. This literally has absolutely nothing to do with the brand of the CPU.
#62
darksf
As someone who got the i7-5775C five years ago and the Ryzen 5800X3D two and a half years ago, I still can't understand why there is even a conversation about the big cache. It reminds me of the old debates: is a 2-core CPU better than a higher-clocked single-core CPU, does Hyper-Threading actually do something, why should I use a 64-bit CPU? The benchmarks are clear and there is nothing to argue about with them.

The i7-5775C was a good CPU, end of story. If someone doesn't know how to properly configure it, or doesn't want to bother, just get yourself an iMac this year and the next iMac next year.

www.anandtech.com/show/16195/a-broadwell-retrospective-review-in-2020-is-edram-still-worth-it/12
#63
lexluthermiester
wolfThe 7800X3D is undeniably a much faster gaming CPU and will give you a significantly higher FPS ceiling (and better 1%/0.1% lows) vs. a 9900K @ 5 GHz.
But that's only for non-4K resolutions. At 4K, a 4090 is going to render the same general results with ANY CPU that doesn't bottleneck it, and that's a big damn list. The 7800X3D only counts for resolutions under 4K. The benchmarks here at TPU and elsewhere bear that out.
#64
wolf
Better Than Native
lexluthermiesterBut that's only for non-4K resolutions. At 4K, a 4090 is going to render the same general results with ANY CPU that doesn't bottleneck it, and that's a big damn list. The 7800X3D only counts for resolutions under 4K. The benchmarks here at TPU and elsewhere bear that out.
The video makes some great points, however. Say you're GPU-limited at 70 FPS on all-ultra settings with a 4090 in a given game, and the 9900K can give you 80 FPS: then sure, you're all good. But what if you want 120 FPS and are willing to lower settings, use upscaling, etc. to get there? The 9900K will make your FPS ceiling 80 FPS, and no amount of lowering settings, reducing resolution or using upscaling will improve that. It heavily depends on the game and the user's preferences, but it absolutely applies to 4K too if you want to game at higher FPS. For someone happy with 30/40/60 it matters a fair bit less for sure, but in many games I'm absolutely willing to lower visuals to get the balance of visuals and FPS to my taste, and a CPU absolutely can matter there.
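A toy model of that argument (a sketch with assumed numbers; only the 80 FPS ceiling and the 70 FPS ultra figure come from the example above): the frame rate you actually see is roughly the minimum of the CPU-side and GPU-side limits, and lowering settings only moves the GPU side.

#include <stdio.h>

/* Observed FPS is roughly capped by the slower of the two sides. */
static double observed_fps(double cpu_limit, double gpu_limit)
{
    return cpu_limit < gpu_limit ? cpu_limit : gpu_limit;
}

int main(void)
{
    double cpu_9900k = 80.0;   /* CPU-side ceiling from the example above */
    double cpu_x3d   = 150.0;  /* assumed ceiling for a much faster gaming CPU */
    double gpu_ultra = 70.0;   /* GPU-limited at ultra settings */
    double gpu_lower = 130.0;  /* assumed GPU limit after lowering settings/upscaling */

    printf("slower CPU, ultra:   %.0f FPS\n", observed_fps(cpu_9900k, gpu_ultra)); /* 70, GPU-bound */
    printf("slower CPU, lowered: %.0f FPS\n", observed_fps(cpu_9900k, gpu_lower)); /* 80, stuck at CPU ceiling */
    printf("faster CPU, lowered: %.0f FPS\n", observed_fps(cpu_x3d, gpu_lower));   /* 130, ceiling lifted */
    return 0;
}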

I also recall W1zzard's 4090 test of 53 games on a 5800X vs 5800X3D had some games showing differences at 4K. There's a large dose of 'it depends' on this one, but I do find the math/science of it undeniable; it's more a case of whether that matters to the individual or not.
#65
A Computer Guy
unwind-protectGive us HEDT CPUs with the cache and ECC memory and I'll forget about the desktop. Deal?
And more PCIe lanes
#66
Super XP
Intel's been copying AMD's tech for decades, nothing new here people. :laugh:
lexluthermiesterSure they do. Why wouldn't they? Are you kidding? The reason AMD currently has the gaming throne is directly because of the Ryzen CPUs with stacked 3D cache. Without it, well, where are the NON-3D cache AMD CPUs ranked? Right, there's the answer. If Intel were to employ something similar and get it right, their CPU lineup would return rather handily to the top spot.
Not everybody is looking for the X3D CPUs, though they do provide the best gaming.
The top 15 best sellers on Amazon include 2 Zen CPUs with X3D; the rest are other Ryzens, plus 2 Intel CPUs.
#67
FoulOnWhite
Super XPIntel's been copying AMD's tech for decades, nothing new here people. :laugh:

Not everybody is looking for the X3D CPUs, though they do provide the best gaming.
The top 15 best sellers on Amazon include 2 Zen CPUs with X3D; the rest are other Ryzens, plus 2 Intel CPUs.
Really, so why did Intel thrash AMD for over 10 years then?
#68
JustBenching
FoulOnWhiteReally, so why did Intel thrash AMD for over 10 years then?
They didn't manage to reverse engineer Bulldozer, so they couldn't copy it, that's why.
#69
LittleBro
The idea that X3D cache improves performance only in games is busted with Zen 5.
You can see improvements in any workload that relies heavily on memory, e.g. rendering.
#70
FoulOnWhite
JustBenchingThey didn't manage to reverse engineer Bulldozer, so they couldn't copy it, that's why.
Bulldozer wasn't worth reverse engineering

Anyway, this is OT, so let's just stop this, OK?
#71
sixor
We need moar cache to be the new moar cores and moar ghz
#72
igormp
LittleBroThe idea that X3D cache improves performance only in games is busted with Zen 5.
You can see improvements in any workload that relies heavily on memory, e.g. rendering.
Was it? AFAIK performance is pretty much the same with or without the extra cache for most of those. The difference between a 9800X3D and a 9700X in those tasks was mostly due to the V-Cache model having a higher power limit and clocking a bit higher because of that.
Rendering is one of the cases where the extra cache makes no difference, as far as I've seen.

The only cases where the extra cache is really worth it are CFD/HPC stuff and some specific database workloads.
#73
efikkan
marios15Oh, I guess I confused them with the x87 ones that take forever. I was mostly talking about the complex AVX instructions, not just add/multiply; I am pretty sure there are a few that take 20-40 cycles.
There are many instructions that are very slow, although they are usually a tiny fraction of the workload, if present at all.
Interestingly enough, Ice Lake/Rocket Lake brought the legacy FMUL down from 5 to 4 cycles, and integer division (IDIV) from 97 cycles down to 18 cycles.

For comparison, Intel's current CPUs have 4 cycles for multiplication, 11 cycles for division of fp32 using AVX, and 5 cycles for integer multiplication using AVX. (official spec)

As for the "worst case" performers among legacy x87 instructions: examples are FSQRT (square root) at 14-21 cycles, sin/cos/tan at ~50-160 cycles, and the most complex, FBSTP, at 264, though that one is probably not very useful today. FDIV is 14-16 cycles (so slightly slower than its AVX counterpart). For comparison, on Zen 4 the legacy x87 instructions seem to have overall lower latency than on Intel. All of these figures are from agner.org and are benchmarked, so take them with a grain of salt, but they are probably good approximations.

Many think "legacy" instructions are holding back the performance of modern x86 CPUs, but that's not true. Since the mid-90s, they've all translated the x86 ISA into their own specific micro-operations, and this is also how they support x87/MMX/SSE/AVX through the same execution ports; the legacy instructions are translated to micro-ops anyway. This allows them to design the CPUs to be as efficient as possible with the new features, yet still support the old ones. If the older ones happen to have worse latency, it's usually not an issue, as applications that rely on them are probably very old. One thing of note is that x87 instructions round differently than normal IEEE 754 fp32/fp64 does, since they use 80-bit extended precision internally.
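For anyone who wants to sanity-check latency figures like these, the usual approach is to time a long serial dependency chain of the instruction in question and divide by the iteration count. A rough sketch (hypothetical example; __rdtsc counts reference cycles, so pin the process to one core, keep clocks steady, and treat the output as a ballpark only):

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

/* Latency probe for fp32 division: each divide depends on the previous
 * result, so (elapsed cycles / n) approximates the latency of one divss.
 * Per the discussion above, scalar FP uses the same execution units as the
 * AVX versions, so this doubles as a rough check of the AVX fp32 divide
 * latency. Build with: gcc -O2 */
int main(void)
{
    volatile float seed = 1.000001f;   /* volatile stops constant folding */
    float x = seed;
    const long n = 100000000;

    unsigned long long start = __rdtsc();
    for (long i = 0; i < n; i++)
        x = 1.0f / x;                  /* serial dependency chain */
    unsigned long long end = __rdtsc();

    printf("~%.1f cycles per dependent divide (x = %f)\n",
           (double)(end - start) / (double)n, x);
    return 0;
}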
igormpL3 victim cache*
Bit pedantic given how almost all (if not all) CPUs use L3 as a victim cache, but I think it's important to explain that's what causes the behaviour you mentioned w.r.t. the L2<->L3 relationship.
It's not pedantic at all, you missed the point.
The prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 then eventually evicted to L3, where it remains for a short window before being evicted there too. Adding a lot of extra L3 only means the "garbage dump" will be larger, while adding just a tiny bit more of L2 would allow the prefetcher to work differently. In other words; a larger L3 doesn't mean you can prefetch a lot more, it just means that the data you've already prefetched anyways stays a little longer.
Secondly, as I said, the stream of data flowing through L3 is all coming from memory->L2, so the overall bandwidth here is limited by memory, even though the tiny bit you read back will have higher burst speed.

The software that becomes more demanding in the coming years will be more computationally intensive, so over the long term the faster CPUs will be favored over those with more L3 cache. Workloads that are very L3-sensitive will remain outliers.
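The L2/L3/memory staircase described above is easy to see with a pointer-chasing loop over a growing working set: once the set no longer fits in a given cache level, the time per dependent load jumps. A minimal sketch (hypothetical example; exact sizes and numbers vary per CPU, and a more careful version would use cache-line-sized elements and huge pages):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase pointers around a shuffled ring. Each load depends on the previous
 * one, so the average time per hop approximates the latency of whichever
 * cache level (or DRAM) the working set currently fits in. */
static double chase_ns(size_t bytes, long hops)
{
    size_t n = bytes / sizeof(size_t);
    size_t *ring = malloc(n * sizeof(size_t));
    if (!ring) return 0.0;
    for (size_t i = 0; i < n; i++) ring[i] = i;
    for (size_t i = n - 1; i > 0; i--) {       /* Sattolo shuffle: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t t = ring[i]; ring[i] = ring[j]; ring[j] = t;
    }
    size_t idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long h = 0; h < hops; h++)
        idx = ring[idx];                       /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = idx; (void)sink;    /* keep the loop live */
    free(ring);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)hops;
}

int main(void)
{
    for (size_t kb = 256; kb <= 128 * 1024; kb *= 2)   /* 256 KB .. 128 MB */
        printf("%7zu KB: %5.1f ns/access\n", kb, chase_ns(kb * 1024, 10000000L));
    return 0;
}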
#74
Aquinus
Resident Wat-man
efikkanIn other words; a larger L3 doesn't mean you can prefetch a lot more, it just means that the data you've already prefetched anyways stays a little longer.
First, excellent explanation as a whole. Second, I think the quote above is a key point and likely the biggest reason why a larger L3 cache shows improvement in some workloads. Keeping data resident in cache longer only helps hit rates, assuming latency remains constant. I'm sure there is a tipping point with a relatively tight loop: if the working set is slightly too big for the cache, you'll be evicting things before the loop starts from the beginning again, and a little more cache might get you over that hurdle. An example of this might be a tight rendering loop in a game.
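A toy illustration of that tipping point (hypothetical example; the ~32 MB cache size and buffer sizes are made up): a loop that repeatedly sweeps a buffer slightly larger than the last-level cache has to stream it from memory on every pass, while a buffer that just fits stays resident after the first pass.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* Sweep the buffer front-to-back, touching one element per 64-byte line.
 * With LRU-like replacement, a buffer slightly larger than the cache gets
 * its oldest lines evicted just before the next pass needs them again, so
 * the hit rate collapses; hardware prefetch softens, but doesn't hide, it. */
static double sweep_seconds(size_t bytes, int passes)
{
    size_t lines = bytes / 64;
    int64_t *buf = malloc(lines * 64);
    if (!buf) return 0.0;
    for (size_t i = 0; i < lines; i++)
        buf[i * 8] = (int64_t)i;          /* touch every page so it's really backed */
    int64_t sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < lines; i++)
            sum += buf[i * 8];            /* 64-byte stride: one load per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile int64_t sink = sum; (void)sink;
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* Assume a ~32 MB last-level cache: compare just-under vs. just-over. */
    printf("30 MB working set: %.3f s\n", sweep_seconds(30u << 20, 200));
    printf("36 MB working set: %.3f s\n", sweep_seconds(36u << 20, 200));
    return 0;
}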
#75
igormp
efikkanThe prefetcher only feeds the L2, not the L3, so anything in L3 must first be prefetched into L2 then eventually evicted to L3, where it remains for a short window before being evicted there too. Adding a lot of extra L3 only means the "garbage dump" will be larger
Yeah, and this behavior is exactly because the L3 in most current designs is set up as a victim cache.
Your point would not hold for inclusive cache designs, like older Intel generations, where prefetching is way easier and can feed both the L2 and the L3.

This now leads me to wonder how well an inclusive design would fare on modern CPUs with L3 caches that are multiple times bigger than the L2. At least latency could be really improved, and the wasted space due to duplicated data becomes kind of irrelevant given the total amount of cache available.
Not sure how well it'd fare in terms of side-channel attacks tho.