Thursday, December 26th 2024

AMD Ryzen 9 9950X3D Carries 3D V-Cache on a Single CCD, 5.6 GHz Clock Speed, and 170 Watt TDP

Dec 26th, 2024 12:54

Recent engineering samples of AMD's upcoming Ryzen 9 9950X3D reveal what appear to be the finalized specifications of the top-tier AM5 chip. The 16-core, 32-thread processor builds upon the gaming success of the Ryzen 7 9800X3D while addressing its core count limitations. The flagship processor features AMD's refined cache design, pairing 96 MB of L3 on the 3D V-Cache CCD with 32 MB of standard L3 on the second CCD. Unlike its predecessor, the 7950X3D, the new Zen 5 part uses a redesigned CCD stacking method: the CCD now sits above the cache, directly interfacing with the STIM and IHS, eliminating thermal constraints that previously required frequency limitations. Cache distribution across the two CCDs is asymmetric: one die combines 32 MB of base L3 cache with a 64 MB stacked V-Cache layer, while its companion die utilizes a standard 32 MB L3 cache configuration. In total, there is 128 MB of L3 cache, with 16 MB of L2.
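The cache totals above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-CCD figures come from the leak as reported here, while the helper name `l3_layout` and the even 1 MB-per-core L2 split are assumptions made for the example.

```python
# Per-CCD figures as quoted in the article (MB).
BASE_L3_PER_CCD_MB = 32
STACKED_VCACHE_MB = 64
CORES = 16
L2_PER_CORE_MB = 1  # assumption: the 16 MB of L2 is split evenly over 16 cores


def l3_layout(vcache_ccds: int, total_ccds: int = 2) -> list[int]:
    """Return the L3 size of each CCD in MB, V-Cache dies listed first."""
    return [
        BASE_L3_PER_CCD_MB + (STACKED_VCACHE_MB if i < vcache_ccds else 0)
        for i in range(total_ccds)
    ]


layout = l3_layout(vcache_ccds=1)   # the asymmetric layout described above
total_l3 = sum(layout)
total_l2 = CORES * L2_PER_CORE_MB
print(layout, total_l3, total_l2)   # [96, 32] 128 16
```

The "96 MB" figure is therefore the L3 of the V-Cache CCD alone (32 MB base plus 64 MB stacked), not the size of the stacked layer.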

This architectural advancement enables the 9950X3D to achieve a 5.65 GHz boost clock across both CCDs, matching non-X3D variants. The processor maintains a 170 W TDP, suggesting improved thermal efficiency despite the additional cache. AMD's software-based OS scheduler will continue to optimize gaming workloads by directing them to the CCD with 3D V-Cache. Early leaks indicate the 9950X3D matches the base 9950X in Cinebench R23 scores in both single- and multi-threaded tests, a significant improvement over the 7950X3D, which lagged behind its non-X3D counterpart due to frequency limitations. AMD plans to expand the Zen 5 X3D lineup in Q1 2025 with both the 9950X3D and 9900X3D models. Full performance benchmarks and pricing details are expected at CES 2025, where AMD will officially unveil these processors alongside its RDNA 4 GPUs.
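To make the idea of steering a workload to one CCD concrete, here is a rough manual equivalent on Linux using CPU affinity. This is not AMD's scheduler logic, which is not described in the leak. The core numbering (CCD 0 carries the V-Cache, SMT siblings are offset by the core count) and the helper names `vcache_cpus` and `pin_to_vcache` are assumptions for illustration; the real topology should be checked on the actual system.

```python
import os


def vcache_cpus(cores_per_ccd: int = 8, total_cores: int = 16, smt: bool = True) -> set[int]:
    """Logical CPU ids assumed to sit on the V-Cache CCD (taken to be CCD 0)."""
    cpus = set(range(cores_per_ccd))
    if smt:
        # Assumed Linux-style numbering: the sibling thread of core n is n + total_cores.
        cpus |= {n + total_cores for n in range(cores_per_ccd)}
    return cpus


def pin_to_vcache(pid: int = 0) -> set[int]:
    """Restrict a process (0 = the caller) to the assumed V-Cache CCD. Linux only."""
    target = vcache_cpus() & os.sched_getaffinity(pid)
    if not target:
        raise RuntimeError("none of the assumed V-Cache CPUs are available")
    os.sched_setaffinity(pid, target)
    return target
```

Under the same numbering assumption, launching a program with `taskset -c 0-7,16-23` would have a similar effect without any code.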

Sources: @94G8LA, via VideoCardz


109 Comments on AMD Ryzen 9 9950X3D Carries 3D V-Cache on a Single CCD, 5.6 GHz Clock Speed, and 170 Watt TDP

#76

uplink777

Fun fact. Just spent nearly a year on 7950X3D, now I'm on 9800X3D. Can't help it, but for general work, Office 365, Teams, Skype, JiRA in Chrome, Adobe Creative Cloud PS, Ai, Audition, After Effects, Media Encoder, Figma and gaming Drova, Space Marine II and Cyberbug 2077 I have better experience on lower core 9800X3D compared to higher core prev. gen SKU

#77

Buddha666

AusWolf: I'm not saying that these CPUs aren't great, just that they're kind of pointless. Gamers have the 9800X3D. Professionals have the 9950X. Who is the 9950X3D made for exactly? Professionals who also need the last drop of FPS while they're gaming? C'mon...

Exactly, I'm one of them :)

#78

AusWolf

uplink777: Fun fact. Just spent nearly a year on 7950X3D, now I'm on 9800X3D. Can't help it, but for general work, Office 365, Teams, Skype, JiRA in Chrome, Adobe Creative Cloud PS, Ai, Audition, After Effects, Media Encoder, Figma and gaming Drova, Space Marine II and Cyberbug 2077 I have better experience on lower core 9800X3D compared to higher core prev. gen SKU

That's because the 9800 has faster cores, which is exactly what you need for games and your type of work. Some work needs more cores, but apparently not the one you're doing, which is fine. :)

#79

inquisitor1

Cheeseball: @AleksandarK

I probably won't upgrade to this one until after 1 or 2 years. If this was like a 30% performance jump (it's not, comparing 7800X3D to 9800X3D), then sure, but not $700 sure. :laugh:

That's what I always do: wait for prices to drop. I just got the 5950X, so I will probably get this 9950X/3D in 3-4 years for $315.

still will be a kick asz cpu a few years later. I believe in jumping AT LEAST 2 gens to get the best bang for buck

#80

dragontamer5788

AnotherReader: If I recall correctly, inter CCD latencies were corrected after Zen 5 release to be in the same range as Zen 4. At the time of release, worst case latency was just over 210 ns. Now, it should be about the same as Zen 4: 80 ns.

I believed you, but it took me a while to verify it with a link.

www.techpowerup.com/326709/amd-agesa-1-2-0-2-update-fixes-ryzen-9000-series-inter-core-latency-issues?cp=2

I'm finalizing my 5-year-upgrade build, and am planning to go to Microcenter for the Ryzen 9 9900x build soon. So getting 100% proof of this latency issue being fixed was a big priority for my 9900x vs 9800x3d decision.

80ns is within the realm of P-core to P-core latency on the Intel Core Ultra 7 265K. I do think the Core Ultra 7 is underrated, but I'm too much of an AVX-512 fanboy, so Zen 5 wins me over.

Chips-and-cheese core-to-core latency graphs of Arrow Lake: chipsandcheese.com/p/examining-intels-arrow-lake-at-the

#81

Kapone33

AnotherReader: If I recall correctly, inter CCD latencies were corrected after Zen 5 release to be in the same range as Zen 4. At the time of release, worst case latency was just over 210 ns. Now, it should be about the same as Zen 4: 80 ns.

Even if it was 200 nanoseconds, how much would that be in real time?

#82

lexluthermiester

AnotherReader: If I recall correctly, inter CCD latencies were corrected after Zen 5 release to be in the same range as Zen 4. At the time of release, worst case latency was just over 210 ns. Now, it should be about the same as Zen 4: 80 ns.

Those are core to core latency numbers on that graph. There are processing latency penalties for core to inter-ccx-cache transfers and requests.

#83

dragontamer5788

lexluthermiester: Those are core to core latency numbers on that graph. There are processing latency penalties for core to inter-ccx-cache transfers and requests.

You missed the update. It's under 80ns now, as widely reported by many reliable tech discussion sites.

I linked TechPowerup earlier, but here's Chips and Cheese's tests as well:

You were correct at launch. The issue is that AMD released new microcode recently that fixed the 200-to-400 nanosecond latencies and pushed it all the way down to 80 or less.

#84

lexluthermiester

dragontamer5788: You missed the update. It's under 80ns now, as widely reported by many reliable tech discussion sites.

I linked TechPowerup earlier, but here's Chips and Cheese's tests as well:

You were correct at launch. The issue is that AMD released new microcode recently that fixed the 200-to-400 nanosecond latencies and pushed it all the way down to 80 or less.

Context is important. Again those are CORE-TO-CORE latencies for the non-X3D model 9900X. Core-to-Core and Core-to-interCCX cache is different with the X3D versions and comes with additional latency that can not be avoided. Now they may have improved it, I'll concede to that, but it is VERY unlikely they have cracked below 350ns no matter what refinements and optimizations they've made. The 3D cache has an additional die boundary to cross for any process, regardless of type. Now for the actual physical die the 3D cache is mounted on, the latency is as was stated above, 210ns-ish, but for the non-attached CCX, there is additional latency involved depending on the transaction request.

Does that make sense? This is why the 3D cache being mounted to only one CCX is bad for latency dependent tasks, like games for example. AMD needed to divide the 3D cache between both CCXs or just give both the same cache dies and interlink through the I/OD.

#85

ncrs

lexluthermiester: Context is important. Again those are CORE-TO-CORE latencies for the non-X3D model 9900X. Core-to-Core and Core-to-interCCX cache is different with the X3D versions and comes with additional latency that can not be avoided. Now they may have improved it, I'll concede to that, but it is VERY unlikely they have cracked below 350ns no matter what refinements and optimizations they've made. The 3D cache has an additional die boundary to cross for any process, regardless of type. Now for the actual physical die the 3D cache is mounted on, the latency is as was stated above, 210ns-ish, but for the non-attached CCX, there is additional latency involved depending on the transaction request.

Does that make sense? This is why the 3D cache being mounted to only one CCX is bad for latency dependent tasks, like games for example. AMD needed to divide the 3D cache between both CCXs or just give both the same cache dies and interlink through the I/OD.

The additional L3 latency of X3D for 7950X3D vs 7950X is 1.61 ns as tested by Chips and Cheese which for the capacity increase is a terrific achievement. ~~For Zen 5 it is most likely improved further:~~

It's not like AMD hasn't thought about doing X3D on more than one chiplet, in fact they are selling EPYCs like 9684X with 12 CCDs and 1152MB L3.

Edit: Zen 5 X3D latency penalty is about the same:

#86

SunMaster

evernessince: It's not relevant to consumers, it's relevant to enthusiasts. I assume you are not the latter. If you are not interested in the topic, you can kindly avoid butting in. Thank you :)

I think you should contact Intel and, judging by your enthusiast-grade expertise, tell them to only manufacture little cores from now on.

#87

Panther_Seraphin

A Computer Guy: Not if games were optimized for dual CCD operation. I suspect that will never happen even though AMD became a software company.

The problem with most if not all games is that there is a "master" thread that basically everything interacts with, so no matter what the game engine is doing, it will always interact with this master thread on the regular to keep everything timed correctly. This is half the problem with scaling out games to utilise more cores effectively: no matter what, you are still dependent on the main thread to tie everything back together.

Databases etc. can have completely separate threads that do not interact with the master except at the point of creation and completion, so they miss all the "regular" penalties of inter-CCD communication.
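The two threading patterns described here can be sketched in a few lines of Python. This is a toy illustration, not how any particular game engine or database works: the function names, the worker and frame counts, and the dummy workload are all made up, and because of Python's GIL it shows only the synchronization structure, not real multi-core scaling.

```python
import threading

FRAMES = 3
WORKERS = 4


def game_style() -> int:
    """Workers meet the main thread at a barrier once per frame."""
    barrier = threading.Barrier(WORKERS + 1)  # +1 for the main thread
    sync_points = 0

    def worker() -> None:
        for _ in range(FRAMES):
            # Per-frame work (physics, AI, audio) would go here.
            barrier.wait()  # hand results back before the next frame starts

    threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
    for t in threads:
        t.start()
    for _ in range(FRAMES):
        barrier.wait()  # main thread waits for every worker, every frame
        sync_points += 1
    for t in threads:
        t.join()
    return sync_points


def database_style() -> int:
    """Workers touch the main thread only at creation and completion."""
    results = [0] * WORKERS

    def worker(i: int) -> None:
        results[i] = sum(range(1000))  # independent work, no shared state

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the only synchronization with the main thread
    return 1  # one synchronization phase in total


print(game_style(), database_style())
```

In the first pattern the number of synchronization points grows with the frame count, so any cross-CCD hop is paid every frame; in the second it is paid once per job.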

#88

unwind-protect

evernessince: Read the article, the CPU boosts to top clocks on both CCDs this time around. X3D no longer limits frequency or heat.

We only know that the max turbo frequency is the same for both CCDs.

That doesn't necessarily say that they spend the same time at that speed under all conditions.

Anyway, I want one.

#89

Kapone33

lexluthermiester: Context is important. Again those are CORE-TO-CORE latencies for the non-X3D model 9900X. Core-to-Core and Core-to-interCCX cache is different with the X3D versions and comes with additional latency that can not be avoided. Now they may have improved it, I'll concede to that, but it is VERY unlikely they have cracked below 350ns no matter what refinements and optimizations they've made. The 3D cache has an additional die boundary to cross for any process, regardless of type. Now for the actual physical die the 3D cache is mounted on, the latency is as was stated above, 210ns-ish, but for the non-attached CCX, there is additional latency involved depending on the transaction request.

Does that make sense? This is why the 3D cache being mounted to only one CCX is bad for latency dependent tasks, like games for example. AMD needed to divide the 3D cache between both CCXs or just give both the same cache dies and interlink through the I/OD.

Show me the numbers to support your argument.

Panther_Seraphin: The problem with most if not all games is that there is a "master" thread that basically everything interacts with, so no matter what the game engine is doing, it will always interact with this master thread on the regular to keep everything timed correctly. This is half the problem with scaling out games to utilise more cores effectively: no matter what, you are still dependent on the main thread to tie everything back together.

Databases etc. can have completely separate threads that do not interact with the master except at the point of creation and completion, so they miss all the "regular" penalties of inter-CCD communication.

See Cities: Skylines 2 and Space Marine 2.

#90

dragontamer5788

lexluthermiester: Context is important. Again those are CORE-TO-CORE latencies for the non-X3D model 9900X. Core-to-Core and Core-to-interCCX cache is different with the X3D versions and comes with additional latency that can not be avoided. Now they may have improved it, I'll concede to that, but it is VERY unlikely they have cracked below 350ns no matter what refinements and optimizations they've made. The 3D cache has an additional die boundary to cross for any process, regardless of type. Now for the actual physical die the 3D cache is mounted on, the latency is as was stated above, 210ns-ish, but for the non-attached CCX, there is additional latency involved depending on the transaction request.

Does that make sense? This is why the 3D cache being mounted to only one CCX is bad for latency dependent tasks, like games for example. AMD needed to divide the 3D cache between both CCXs or just give both the same cache dies and interlink through the I/OD.

200ns is 5MHz and slower latencies than that I can literally measure with my hobby-grade oscilloscope and Arduino-like AVR microcontrollers. Such latencies are possible on server-class systems where more chips and dies are in play but I'd be surprised to see it on a simpler desktop.

Die-to-die latencies do exist of course but on a scale far smaller than you might imagine. 200ns+ is server-grade equipment latencies, not something I'd expect to see on a desktop system. And that's because server-grade systems have more RAM, more RAM Controllers, more dies and more caches that need to communicate. So everything slows down.

---------

Anyway, 200ns latencies for an on-package SRAM makes no sense. That's slower than DRAM (!!!!) like DDR5 technologies. SRAM always had much smaller latencies than that, and I expect that the x3d caches are made out of the faster SRAM and not the slower DRAM. (also: logic companies like AMD/TSMC can make SRAM more easily than DRAM. DRAM is actually very difficult to make on these processes)
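For a sense of scale, the latency figures traded in this thread can be converted into an equivalent frequency and into core clock cycles. A small sketch: the 80, 200 and 550 ns values are the ones quoted by posters above, the 5.65 GHz clock is the boost figure from the article, and the helper names are made up for the example.

```python
def ns_to_mhz(latency_ns: float) -> float:
    """Frequency (MHz) of a signal whose period equals the given latency."""
    return 1_000.0 / latency_ns  # 1 / ns gives GHz; times 1000 gives MHz


def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core clock cycles that elapse during the given latency."""
    return latency_ns * clock_ghz  # ns * (cycles per ns)


BOOST_GHZ = 5.65  # boost clock quoted in the article

for ns in (80, 200, 550):
    mhz = ns_to_mhz(ns)
    cycles = ns_to_cycles(ns, BOOST_GHZ)
    print(f"{ns} ns = {mhz:.2f} MHz period, about {cycles:.0f} cycles at {BOOST_GHZ} GHz")
```

200 ns works out to a 5 MHz period, matching the oscilloscope comparison above, and to roughly 1,130 core cycles at the quoted boost clock.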

#91

evernessince

unwind-protect: We only know that the max turbo frequency is the same for both CCDs.

That doesn't necessarily say that they spend the same time at that speed under all conditions.

Anyway, I want one.

Non-X3D and X3Ds now perform about the same thermally

Knowing that frequency and thermals are similar, the probability that boosting characteristics will be different enough to make a notable difference in games is near zero.

#92

Wirko

lexluthermiester: Add a zero behind that and you'd be getting closer. The latency difference between one CCX trying to access the 3D cache on the other CCX is at least 550ns(ish). The CCX with the cache has much faster access but that is because it's directly connected and doesn't need to access through the I/OD.

Which review arrived at these results?

#93

lexluthermiester

ncrs: The additional L3 latency of X3D for 7950X3D vs 7950X is 1.61 ns as tested by Chips and Cheese which for the capacity increase is a terrific achievement. ~~For Zen 5 it is most likely improved further:~~

It's not like AMD hasn't thought about doing X3D on more than one chiplet, in fact they are selling EPYCs like 9684X with 12 CCDs and 1152MB L3.

Edit: Zen 5 X3D latency penalty is about the same:

Once again, context is important. That data is for the 9800X3D...

People, learn how to context.

#94

ncrs

lexluthermiester: Once again, context is important. That data is for the 9800X3D...

People, learn how to context.

Still no sources for your claims? I've provided professional measurements for two generations of X3D CPUs with both topologies and neither comes even close to what you suggested the latency impact is.
What is more, your 550ns figure is over twice the time that one EPYC Turin core takes to communicate with a core in another socket. Please explain how you arrived at this figure, and what the "context" is here.

Edit: seems that the context here is trolling, but it's OK - I've refreshed my knowledge a bit by researching this.
For completeness here's the L3 latency plot for an EPYC Milan-X with 8 X3D CCDs where going to another X3D slice has a penalty but overall keeps below DRAM latency:

#95

ValenOne

Vincero: By that same token, even the single die CCD X3D parts are actually not great - outside of gaming and cache limited / memory bandwidth / thread limited scenarios the higher boosting higher TDP normal CCD CPUs win in average productivity.

Anyway, as for the dual X3D part.... Someone will always pay for it even if the benefit is only 1% (or even less) over the next competitive product in the lineup, witness the 14900KS...

9800X3D is faster than 9700X in Blender.

#96

Panther_Seraphin

For the people who keep quoting latency graphs to argue there is no penalty, consider this: the tests you are quoting are one core accessing one core on the second CCD. I have linked a test that goes into further detail, where they load the CCDs from a single thread up to fully loaded, measure the latency, and also split threads across the two CCDs. With Zen 4 it is bad!!!

chipsandcheese.com/p/pushing-amds-infinity-fabric-to-its

Zen 4 has a hardware limitation that would have made a dual X3D setup absolutely HORRENDOUS in performance, as accessing the second CCD's cache would have been only as fast as accessing DRAM in certain worst-case scenarios, and can very easily see 2-3 times the latency penalty, rising to nearly 10 times in the worst case. I suspect Zen 3/5xxx series parts would have seen similar issues due to the design of the IO die etc.

Zen 5 has seemingly fixed this issue, as well as having the high clock speeds due to the relocated X3D. I wonder if AMD is holding back dual X3D parts in case Intel pulls something out of the bag, a la Nvidia's original Ti/Super variants of a few years ago? I mean, the single-CCD parts are completely handing Intel the L in gaming by quite a margin currently.

Also, are they trying to prevent confusion? The dual X3D parts would segregate the market even further, as you would now have 3 different SKUs for each core count, with desktop parts probably pushing up towards the $/£1k mark again for the top-end non-HEDT part. How much would it cut into their lower-end HEDT/workstation sales?

#97

ValenOne

dragontamer5788: SMT isn't magic. SMT works by splitting resources between the two threads.

Primarily, the caches. As we all know, games love cache so I'm not surprised that splitting the cache has a detrimental effect.

But even if a thread isn't loaded, the register files, reorder buffers, decoders and branch predictors remain shared. So SMT will have slightly worse single threaded performance.

SMT is ideal when a 5%ish drop in single threaded is an acceptable tradeoff for +40% multi threaded performance. Games do not work like this.

Intel has changed their design to P-Cores which specialize in singlethread, and E-cores which specialize in multi thread. But this seems like a poor strategy to me for other reasons....

I expect that video is bad for x3d.

The name of the game is fitting in the cache. Video games have lots of stuff that is larger than 32MB but less than 96MB, and the CPU automatically discovers the hot data to share.

Video is not like that. You watch (or encode) one frame and then move into the next one. Nothing will fit in cache. Or at least, nothing extra really fits in the 33rd MB that's worthwhile.

Video and 3D modeling (Blender) usually prefer more cores... While dealing with so much data that the caches are blown over and useless.

Zen 5's 8 decoders are bottlenecked by the Zen 4-era I/O die.

For Blender, the 9800X3D beats the 9700X, and nearly rivals the 16-core 5950X and 12-core 7900. The 9800X3D's SMT is strong relative to Zen 3's SMT.

#98

Random_User

sephiroth117: It's not as simple, there are limitations still on 3D CCD, even if the 2nd gen X3D are much better, heat less etc, they still want one CCD for fast GHz and the other for gaming/3D applications

There's a reason, also with dual CCD, would having 32+32MB be as good as one CCD with 64MB 3D extra L3 cache ? if no, wouldn't 64+64 be too expensive ?

I think there are genuine cost and technological obstacles for dual CCD, it's not just them wanting to add a software director and more complexity, maybe further down the line

This is that simple. These are binned and fused EPYC dies anyways. The waste.
Also, this isn't second-gen X3D, but third. And AMD had a fully production-ready sample of a dual-3D-CCD 5950X3D back in the day, when 3D V-Cache first emerged. There were other reasons.

Zen 4 was perfectly scalable at any wattage/power/thermal envelope. Zen 5 X3D seems to be as good. There are no frequency limits for it, and it works as fast as non-X3D parts.
At this point, non-3D parts have become the "dietetic", budget-oriented/cut-down version. And AMD themselves have created this image.
And there's absolutely no excuse for the 9950X3D to not be dual 3D-CCD. The technology allows this, the cost is already high, and the 3D dies are now not limited by either frequency or power.

Just my thoughts!

#99

Wirko

Random_User: This is that simple. These are binned and fused EPYC dies anyways.

I too think it's not that simple. Only dies that can reach the Ryzen turbo clock of 5.4 or 5.7 GHz qualify for a Ryzen (maybe a little less because turbo clock is not guaranteed). For Epyc, they are binned with regard to power consumption in the relevant range of clocks, which is roughly between 2 and 4 GHz.

#100

Random_User

Panther_Seraphin: For the people who keep quoting latency graphs to argue there is no penalty, consider this: the tests you are quoting are one core accessing one core on the second CCD. I have linked a test that goes into further detail, where they load the CCDs from a single thread up to fully loaded, measure the latency, and also split threads across the two CCDs. With Zen 4 it is bad!!!

chipsandcheese.com/p/pushing-amds-infinity-fabric-to-its

Zen 4 has a hardware limitation that would have made a dual X3D setup absolutely HORRENDOUS in performance, as accessing the second CCD's cache would have been only as fast as accessing DRAM in certain worst-case scenarios, and can very easily see 2-3 times the latency penalty, rising to nearly 10 times in the worst case. I suspect Zen 3/5xxx series parts would have seen similar issues due to the design of the IO die etc.

Zen 5 has seemingly fixed this issue, as well as having the high clock speeds due to the relocated X3D. I wonder if AMD is holding back dual X3D parts in case Intel pulls something out of the bag, a la Nvidia's original Ti/Super variants of a few years ago? I mean, the single-CCD parts are completely handing Intel the L in gaming by quite a margin currently.

Also, are they trying to prevent confusion? The dual X3D parts would segregate the market even further, as you would now have 3 different SKUs for each core count, with desktop parts probably pushing up towards the $/£1k mark again for the top-end non-HEDT part. How much would it cut into their lower-end HEDT/workstation sales?

My bet is that this is just a margin/business thing. They want to milk limited SKUs for as long as possible. This is "reasonable" from the business standpoint, but atrocious from every other, including the amount of e-waste produced for the sole purpose of a very temporary sales boost while Intel has nothing on the table yet. These heterogeneous SKUs might be avoided just as the Zen 4 ones were, even though they could have been a solid solution from the get-go. This hybrid stuff might be eclipsed and avoided again in favor of either the mono-3D, single-CCD 9800X3D or any potential dual-CCD 3D part. Money...
