Friday, November 15th 2024

Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

Intel is coming around to the idea of large last-level caches on its processors. In an interview with Der8auer and Bens Hardware, Florian Maislinger, a tech communications manager at Intel, revealed that the company is working on augmenting its processors with large shared L3 caches; however, it will begin doing so only with its server processors. The company is working on a new server/workstation processor for 2025 that comes with cache tiles augmenting the shared L3 cache, so it can excel in the kind of workloads targeted by AMD's EPYC "Genoa-X" and upcoming "Turin-X" processors: technical computing. On "Genoa-X" processors, each of the up to 12 "Zen 4" CCDs comes with stacked 3D V-Cache, which has a profound impact on performance in cache-sensitive applications such as the Ansys suite and OpenFOAM.

The interview reveals that the server processor with a large last-level cache should come out in 2025; there is no such effort on the horizon for the company's client processors, such as the Core Ultra "Arrow Lake-S," at least not in 2025. The company's recently launched "Arrow Lake-S" desktop processors do not provide a generational gaming performance uplift over the 14th Gen Core "Raptor Lake Refresh"; however, Intel claims to have identified certain correctable causes for the gaming performance falling below expectations, and hopes to release updates for the processor (possibly in the form of new microcode, or a fix at the OS-vendor level). This, the company claims, should improve the gaming performance of "Arrow Lake-S."
Sources: Der8auer (YouTube), VideoCardz, HardwareLuxx.de

77 Comments on Intel Plans to Copy AMD's 3D V-Cache Tech in 2025, Just Not for Desktops

#26
Vayra86
FoulOnWhite: It's not AMD's 3D vcache, it's TSMC's
So it's not Intel's CPU either anymore, then. Neat!

Next time I want a patch for an application, I'll just write to some random factory in China, too.
Posted on Reply
#28
SOAREVERSOR
lexluthermiester: What a shame! Intel's desktop lineup could really use such a boost.
Desktop is by far the least important lineup. Server and mobile are what matter. Desktop is so far behind either of them that it's laughable. They are getting crushed in server but doing OK in mobile.
Posted on Reply
#29
phints
FoulOnWhite: It's not AMD's 3D vcache, it's TSMC's
Bizarre statement; do you really not realize I just modified TPU's post title to make a point?
Posted on Reply
#30
swaaye
There was a Broadwell chip with 60MB L3 cache. They aren't new to big L3. Sapphire Rapids has around 110MB L3 and also optionally a huge L4. More cache is just the natural progression for all these companies because the problems to solve are the same as ever.
Posted on Reply
#31
Steevo
Craptacular: It is not; TSMC owns the 3D cache packaging. It is not an AMD design. AMD simply took advantage of a service that TSMC offered (3D cache) and tried it out on their processors.

TSMC's 3D Stacked SoIC Packaging Making Quick Progress, Eyeing Ultra-Dense 3μm Pitch In 2027

And you have this deck from TSMC back in 2021 regarding 3D stacking: Advanced Technology Leadership
It was based on AMD's interposer technology for the first HBM stacks in 2015, which Intel also copied, as did NVIDIA.

Posted on Reply
#32
unwind-protect
Give us HEDT CPUs with the cache and ECC memory and I'll forget about the desktop. Deal?
Posted on Reply
#33
LabRat 891
Been sayin' that X3D is good for more than gaming...

Intel will have an issue though: for all but its highest-billing, most demanding customers, adding extra cache will 'extend' the usable life of the platform.
I fully expect hardware-level platform locking, and a non-existent second-hand market (in years to come).
Posted on Reply
#34
efikkan
human_error: Even well-optimized games and workloads can benefit if the highly utilized code can be contained in the cache, as it has higher bandwidth and lower latency than a trip to system RAM. Even Factorio, which is an extremely well-optimized game, benefits massively from this, as do many other workloads.
You don't grasp the difference between L2 and L3 caches. L3 only contains data recently evicted from L2, so it holds cache lines that have either been used very recently or, more likely, were prefetched and never used at all. The most data- and computationally intensive workloads see no benefit beyond a decent L3, because such programs are what we call cache optimized, which is a requirement for any performant piece of software. For any such heavy workload, the chance of an L3 hit on a data cache line is extremely low, except for the few times cores are synced. This means the few hits you actually get are likely instruction cache lines, and the rest is just meaningless garbage streaming through the L3. Sensitivity to L3 size is mainly an indicator of bloated, poorly optimized software, and the solution is to reduce that bloat and make the code more computationally dense.

As heavy workloads move more and more towards SIMD (e.g. AVX-512), the amount of data streaming through memory->L2->L3 is greater than ever, and the chances of a hit in the L3 data cache are getting slimmer and slimmer. (Which should be obvious, as the workload needs to be cache optimized, for both instructions and data, otherwise the pipeline would stall.) Data cache lines greatly outnumber instruction cache lines, which is why AMD needed so much of it to make a tiny difference.

While instruction cache lines are comparatively "few" in number and not bottlenecked by memory bandwidth, the cache hierarchy for data cache lines behaves like a "streaming buffer": a continuous stream of data flowing from memory->L2->L3, with all of it being overwritten every few thousand clock cycles, so the bottleneck here would not be L3 bandwidth, but rather memory bandwidth.

It's no accident that CPUs over the past decade or so have continuously increased the bandwidth of both memory and caches, especially for heavy AVX workloads, even prioritizing bandwidth over latency, while cache sizes (L1I, L1D, L2, L3) have remained comparatively stable until the arrival of 3D V-Cache (apart from L3 growing proportionally with core count); otherwise you might have expected a 1 GB L2 cache by now. This "discrepancy" is due to misconceptions about how caches work; as said, the caches are an extremely efficient streaming buffer to keep the execution ports fed (with staggering amounts of data flowing through them), not a hierarchy of data ranked by "importance". :)
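
To make the working-set argument concrete, here is a minimal sketch (plain C; the buffer sizes and iteration counts are arbitrary assumptions, not tied to any specific CPU) that sweeps buffers of increasing size and reports rough throughput. The drop-offs you would typically see as the working set spills out of L1, L2 and L3 are the effect the whole L3/X3D debate hinges on:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep a buffer repeatedly, touching one 64-byte cache line per step.
   Working sets that fit in a given cache level run much faster than ones
   that spill to the next level or to DRAM. */
static double sweep(size_t size)
{
    size_t n = size / sizeof(uint64_t);
    uint64_t *buf = malloc(size);
    for (size_t i = 0; i < n; i++) buf[i] = i;

    size_t iters = (512u << 20) / size;        /* keep total work roughly constant */
    volatile uint64_t sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i += 8)      /* 8 x uint64_t = one cache line */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(buf);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)size * iters / sec / 1e9;   /* GB of address range swept per second */
}

int main(void)
{
    /* Sizes picked to land roughly in L1, L2, L3 and DRAM on a typical desktop part. */
    size_t sizes[] = { 16u << 10, 256u << 10, 8u << 20, 256u << 20 };
    for (int i = 0; i < 4; i++)
        printf("%8zu KiB: %6.1f GB/s swept\n", sizes[i] >> 10, sweep(sizes[i]));
    return 0;
}

Compiled with something like gcc -O2, the reported figure typically steps down at each cache-size boundary, which is exactly the cache-residency effect both sides of this argument are describing.
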
human_error: You may as well say computers don't need more than 64k of RAM and that any applications that do are poorly optimized.
Nice attempt at a straw man argument there, but you are in fact just grasping at straws.
Posted on Reply
#35
ThomasK
FoulOnWhite: It's not AMD's 3D vcache, it's TSMC's
It was engineered by AMD and manufactured by TSMC.

Intel's taking a similar approach, but will call it something else.
Posted on Reply
#36
phanbuey
efikkan: AMD has clearly better efficiency, but it's not due to the large L3 cache.
But it's hard to find something more deserving of the title "waste of sand" than throwing a bunch of L3 cache on a die, as only a tiny subset of very poorly optimized code significantly benefits from it, namely certain outliers in applications and games running at unrealistically low GPU load. It would be much better to have a CPU with 5% more computational power, especially down the road, as future games are likely to become more demanding, so the bottleneck will be computational performance, not "artificial" scenarios running games at hundreds of frames per second.
For CPUs to advance, they should stop focusing on gimmicks and make actual architectural advancements instead. Large L3 caches are a waste of precious development resources as well as production capacity.
Unless your architectural advancements are bottlenecked by memory bandwidth and latency; then that waste of sand turns into out-of-stock products that everyone who runs games wants...
Posted on Reply
#37
DemonicRyzen666
I must be the only person who wants to see AMD try Foveros for dual-CCD CPUs...
Oh well.
Posted on Reply
#38
Dazzm8
Looks like Intel shouldn't have dissed glue so quickly
Posted on Reply
#39
marios15
Caches are usually defined by cycle latencies, not by size or preference.

L1: ~1 ns (4 cycles)
L2: ~3 ns (14 cycles)
L3: ~10 ns (50 cycles)
L4/eDRAM: ~36 ns (140 cycles)
DRAM: 60-100 ns (many cycles)

Guess where X3D stands.
Now, the L4 has what, 50-100 GB/s of bandwidth?
Just for comparison, the first-gen X3D can hit 600 GB/s with 47 cycles of latency.
So it has 6x the bandwidth and 3x faster access times... which is in line with most L3 caches.
Just FYI, simple CPU instructions usually take 1-4 cycles, and more complex ones like AVX might be up to 20-60-100 cycles.
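
Latency figures like those above are typically measured with a dependent pointer chase, where each load's address comes from the previous load, so nothing can overlap. A minimal sketch in plain C follows; the 8 MiB buffer size is an assumption chosen to land roughly in L3 on a typical part, and shrinking or growing it moves the measurement to other levels of the list above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Dependent pointer chase: each load's address is produced by the previous
   load, so elapsed time / steps approximates the load-to-use latency of
   whichever level of the hierarchy the buffer fits in. */
int main(void)
{
    const size_t n = (8u << 20) / sizeof(void *);      /* assumed 8 MiB working set */
    void **buf = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));

    /* Shuffle the indices so the chase is a random cycle the prefetchers can't follow. */
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[order[i]] = &buf[order[(i + 1) % n]];

    const size_t steps = 50000000;
    void **p = &buf[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                               /* fully serialised loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
    printf("%.1f ns per dependent load (chain ended at %p)\n", ns, (void *)p);
    free(order);
    free(buf);
    return 0;
}

With a buffer small enough for L1 you would expect something near the ~1 ns figure, and with one far larger than L3 you head toward the 60-100 ns DRAM numbers.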

I think the reason L1/L2 caches haven't grown is that they're part of the cores: doubling their size means greater area and bigger dies, which means higher latencies. Only recently, thanks to EUV, has density improved enough (die shrinks used to provide 2-3x density) that we've seen some improvement.

In fact, both L1 and L2 have increased in the last few generations, after 20 years of staying between 256-512 KB (not counting halo products like the FX, or the shared L2... but that's a different FX), all without increasing latencies.

L3 is just easier to increase or move onto its own stacked die; there are even rumours that AMD plans to move the L3 cache of the next Zen architecture entirely onto a stacked die.
Posted on Reply
#41
LabRat 891
DemonicRyzen666: I must be the only person who wants to see AMD try Foveros for dual-CCD CPUs...
Oh well.
Who knows what the future may bring?
Posted on Reply
#42
Caring1
Copying could open the door for litigation.
Posted on Reply
#43
Dawora
human_error: I honestly don't understand those people. Benchmarks don't show the periodic stutters you can get in some games, for example, and those are fully eliminated for me. Plus, you do see the better lows and better general performance in benchmarks. If CPUs didn't make a difference we'd all have 4090s paired with ancient processors.

I have my 7800X3D at 40-60W providing a much better, much more consistent experience with the same GPU and screen than my 9900k that was eating 150W.
The CPU is only important now because it's AMD, right?

Your PC has something wrong, maybe slow RAM? On my second PC with a 9900K and a 4090 there is no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU; even at 1440p there are no big differences.

But if I use a GPU like a 4060 then I will see a difference right away, not because of the CPU but because of the slow GPU.

AMD has good CPUs, but in the real world the GPU is much more important. Both Intel and AMD, even with older CPUs, can game just fine.

People are just hyped about the extra % they see in 1080p benchmarks with a 4090.
ThomasK: It was engineered by AMD and manufactured by TSMC.

Intel's taking a similar approach, but will call it something else.
It was engineered by TSMC, not AMD.
Posted on Reply
#44
Nhonho
It looks like (yet another secret) agreement between Intel and AMD, dividing up which market each of the two gets.
Posted on Reply
#45
FoulOnWhite
Dawora: The CPU is only important now because it's AMD, right?

Your PC has something wrong, maybe slow RAM? On my second PC with a 9900K and a 4090 there is no difference in 4K gaming vs. my main system with a 7800X3D and the same GPU; even at 1440p there are no big differences.

But if I use a GPU like a 4060 then I will see a difference right away, not because of the CPU but because of the slow GPU.

AMD has good CPUs, but in the real world the GPU is much more important. Both Intel and AMD, even with older CPUs, can game just fine.

People are just hyped about the extra % they see in 1080p benchmarks with a 4090.


It was engineered by TSMC, not AMD.

The TM says it all. A TSMC invention licensed to AMD, I guess, for their use. I don't think it was originally for memory stacking, was it? AMD just used it that way.


Can't wait to see how Intel does it; surely they can't just copy it, unless they get a secret license from TSMC to use it.
Posted on Reply
#46
efikkan
marios15: Just FYI, simple CPU instructions usually take 1-4 cycles, and more complex ones like AVX might be up to 20-60-100 cycles.
That's just plainly wrong.
Most core AVX operations are within 1-5 cycles on recent architectures. Haswell and Skylake did a lot to improve AVX throughput, and there have been several improvements since then too; e.g., add operations are now down from 4 to 2 cycles on Alder Lake and Sapphire Rapids, and shift operations are down to a single cycle. That is as fast as scalar integer operations. And FYI, all floating-point operations go through the vector units; whether it's a scalar operation, SSE, or AVX, the latency will be the same. ;)
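
For anyone who wants to sanity-check numbers like these, the usual trick is to time a long serial dependency chain of the instruction in question, so each operation has to wait for the previous result. A rough sketch with AVX intrinsics follows (assumes an x86-64 compiler with AVX enabled, e.g. -mavx; it counts TSC ticks via __rdtsc, which only equal core cycles when the two clocks happen to match, so treat the output as approximate):

#include <immintrin.h>
#include <x86intrin.h>
#include <stdio.h>

/* Serial chain of vaddps: every add consumes the previous result, so
   (elapsed ticks / iterations) approximates the add latency rather than
   its (much higher) throughput. */
int main(void)
{
    __m256 v = _mm256_set1_ps(0.0f);
    const __m256 step = _mm256_set1_ps(1.0f);
    const long iters = 100000000L;             /* 1e8 dependent adds */

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        v = _mm256_add_ps(v, step);            /* dependent: no overlap possible */
    unsigned long long t1 = __rdtsc();

    float out[8];
    _mm256_storeu_ps(out, v);                  /* keep the result live */
    printf("~%.2f TSC ticks per dependent vaddps (sink = %.0f)\n",
           (double)(t1 - t0) / iters, out[0]);
    return 0;
}

The loop counter and branch run in parallel with the floating-point chain on an out-of-order core, so the reported figure tends to land close to the documented latency of the add itself.
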
Posted on Reply
#47
LittleBro
Putting a large cache tile on top of the CPU cores was the idea of one person on the AMD team.

TSMC's 3D technology was used to manufacture that idea, and they then decided to further improve the technology.

Don't mix up the general 3D manufacturing process with that tile of extra cache in X3D CPUs.
Posted on Reply
#48
luyten
_roman_: I do not agree with that. Intel already had such a processor with extra "cache": the i7-5775C.

www.intel.com/content/www/us/en/products/sku/88040/intel-core-i75775c-processor-6m-cache-up-to-3-70-ghz/specifications.html

Again, the CPU includes 6MB of L3 cache and 128MB of eDRAM.

www.tomshardware.com/reviews/intel-core-i7-5775c-i5-5675c-broadwell,4169.html

It's up for discussion. I see the 7800X3D's cache as a 4th-level one, like the eDRAM cache of the i7-5775C.
I always thought the eDRAM in those Intel processors was for the iGPU...
Posted on Reply
#49
Punkenjoy
FoulOnWhite: The TM says it all. A TSMC invention licensed to AMD, I guess, for their use. I don't think it was originally for memory stacking, was it? AMD just used it that way.

Can't wait to see how Intel does it; surely they can't just copy it, unless they get a secret license from TSMC to use it.
Well, the chips are made by TSMC, so really it's not an AMD Ryzen but a TSMC Ryzen, right?


Yes, the fabrication technology was researched by TSMC and they are the ones doing it. But guess what, this is normal, as they are the ones making those chips for AMD. AMD is not a fab.


But AMD is the only one using that technology right now, because this isn't just a box you tick when you order wafers from TSMC. It's not "I'll take X chips with more cache". You still have to design a chip that can communicate with the cache die, deliver power to it, etc.

The physical portion of 3D V-Cache is a TSMC technology. This is expected, as AMD is fabless.
The logical portion of 3D V-Cache is an AMD technology. This is expected, as TSMC does not design chips.

In the end, it's a collaboration between both companies.

Also, the added die is indeed L3. There is no separate lookup for that die when checking whether data is in the L3 cache; the whole 96 MB is looked up at once, and there is no penalty for accessing data in the 3D V-Cache die.
Posted on Reply
#50
Craptacular
Steevo: It was based on AMD's interposer technology for the first HBM stacks in 2015, which Intel also copied, as did NVIDIA.

Yes, AMD designed a chip around the packaging technology, but the 3D packaging technology itself is TSMC's. It is not AMD's.


If you want TSMC's official link, you can find it here: https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/3DFabric.htm

In fact, here is TSMC's press release introducing the packaging: Introducing TSMC 3DFabric: TSMC's Family of 3D Silicon Stacking, Advanced Packaging Technologies and Services - Taiwan Semiconductor Manufacturing Company Limited

Posted on Reply