Wednesday, January 29th 2025

AMD Details DeepSeek R1 Performance on Radeon RX 7900 XTX, Confirms Ryzen AI Max Memory Sizes

AMD today put out detailed guides on how to get DeepSeek R1 distilled reasoning models to run on Radeon RX graphics cards and Ryzen AI processors. The guide confirms that the new Ryzen AI Max "Strix Halo" processors come in hardwired to LPCAMM2 memory configurations of 32 GB, 64 GB, and 128 GB, and there won't be a 16 GB memory option for notebook manufacturers to cheap out with. The guide goes on to explain that "Strix Halo" will be able to locally accelerate DeepSeek-R1-Distill-Llama with 70 billion parameters on the 64 GB and 128 GB configurations of "Strix Halo" powered notebooks, while the 32 GB model should be able to run DeepSeek-R1-Distill-Qwen-32B. Ryzen AI "Strix Point" mobile processors should be capable of running DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Llama-14B on their RDNA 3.5 iGPUs and NPUs, while older processors based on "Phoenix Point" and "Hawk Point" silicon should be capable of running DeepSeek-R1-Distill-Llama-14B. The company recommends running all of the above distills in Q4_K_M quantization.
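For a rough sense of why those memory tiers line up the way they do, a model's Q4_K_M footprint can be estimated from its parameter count. A minimal sketch, assuming ~4.8 bits per weight for Q4_K_M and a small cushion for context buffers (both our assumptions, not AMD's figures):

```python
# Back-of-the-envelope GGUF footprint: parameters x bits-per-weight,
# plus a cushion for runtime buffers and KV cache. The ~4.8 bits/weight
# average for Q4_K_M is an assumption; the exact figure varies by model.
def q4_k_m_footprint_gb(params_billion: float,
                        bits_per_weight: float = 4.8,
                        overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, params in [("Qwen-7B", 7), ("Llama-8B", 8), ("Qwen-14B", 14),
                     ("Qwen-32B", 32), ("Llama-70B", 70)]:
    print(f"DeepSeek-R1-Distill-{name}: ~{q4_k_m_footprint_gb(params):.0f} GB")
```

By that rough measure, the 70B distill lands around 44 GB, out of reach of a 32 GB machine, while the 32B distill at roughly 21 GB fits both the 32 GB "Strix Halo" configuration and the 24 GB RX 7900 XTX.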

Switching gears to the discrete graphics cards, AMD is only recommending its Radeon RX 7000 series for now, since the RDNA 3 graphics architecture introduced AI accelerators. The flagship Radeon RX 7900 XTX is recommended for the DeepSeek-R1-Distill-Qwen-32B distill, while all SKUs with 12 GB to 20 GB of memory (the RX 7600 XT, RX 7700 XT, RX 7800 XT, RX 7900 GRE, and RX 7900 XT) are recommended for models up to DeepSeek-R1-Distill-Qwen-14B. The mainstream RX 7600 with its 8 GB of memory is only recommended up to DeepSeek-R1-Distill-Llama-8B. You will need LM Studio 0.3.8 or later and Radeon Software Adrenalin 25.1.1 beta or later drivers. AMD put out first-party LM Studio 0.3.8 tokens-per-second performance numbers for the RX 7900 XTX, comparing it with the NVIDIA GeForce RTX 4080 SUPER and the RTX 4090.
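Once a distill is loaded, LM Studio can also expose it through its local OpenAI-compatible server, which makes scripting against it straightforward. A minimal sketch, assuming LM Studio's default port 1234 and a placeholder model identifier (check the app for the real one):

```python
# Minimal sketch: query a DeepSeek R1 distill loaded in LM Studio via its
# local OpenAI-compatible endpoint. Port 1234 is LM Studio's default;
# the model identifier below is a placeholder, not a confirmed name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",  # placeholder identifier
    messages=[{"role": "user",
               "content": "Briefly explain what a distilled model is."}],
)
print(resp.choices[0].message.content)
```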
When compared to the RTX 4080 SUPER, the RX 7900 XTX posts up to 34% higher performance with DeepSeek-R1-Distill-Qwen-7B, up to 27% higher performance with DeepSeek-R1-Distill-Llama-8B, and up to 22% higher performance with DeepSeek-R1-Distill-Qwen-14B. Next up, the big face-off between the RX 7900 XTX and the GeForce RTX 4090 with its 24 GB of memory. The RX 7900 XTX is shown to prevail in 3 out of 4 tests, posting up to 13% higher performance with DeepSeek-R1-Distill-Qwen-7B, up to 11% higher performance with DeepSeek-R1-Distill-Llama-8B, and up to 2% higher performance with DeepSeek-R1-Distill-Qwen-14B. It only falls behind the RTX 4090 by 4% with the larger DeepSeek-R1-Distill-Qwen-32B model.
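AMD's numbers are first-party, but tokens-per-second is easy to measure at home: stream a completion and divide the tokens generated by the wall-clock time. A rough sketch against the same local server (counting stream chunks as tokens is an approximation that holds for typical llama.cpp-backed backends):

```python
# Rough tokens/second measurement against a local LM Studio server.
# Counting streamed chunks as tokens is an approximation.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",  # placeholder identifier
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.1f} tokens/s")
```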

Catch the step-by-step guide on getting DeepSeek R1 distilled reasoning models to run on AMD hardware in the source link below.
Source: AMD Community

27 Comments on AMD Details DeepSeek R1 Performance on Radeon RX 7900 XTX, Confirms Ryzen AI Max Memory Sizes

#1
wNotyarD
What? AMD for once in their life getting their timing right to capitalize on something?
#2
Neo_Morpheus
Hmm, been meaning to try this.

Thanks for the link.
#3
hsew
btarunr: The guide confirms that the new Ryzen AI Max "Strix Halo" processors come in hardwired to LPCAMM2 memory configurations of 32 GB, 64 GB, and 128 GB, and there won't be a 16 GB memory option for notebook manufacturers to cheap out with.
I combed the source page for this language or any clarification on the matter; saying it is “hardwired” to LPCAMM2 is a bit counterintuitive. Was it supposed to read LPDDR5 instead?

Either way, the 32 GB mandatory minimum is a welcome sight. I’m a bit surprised (hence the confusion above) that 48 and 96 GB capacities weren’t also mentioned, as those capacities should be possible via LPCAMM2.
#4
wNotyarD
hsew: I combed the source page for this language or any clarification on the matter; saying it is “hardwired” to LPCAMM2 is a bit counterintuitive. Was it supposed to read LPDDR5 instead?

Either way, the 32 GB mandatory minimum is a welcome sight. I’m a bit surprised (hence the confusion above) that 48 and 96 GB capacities weren’t also mentioned, as those capacities should be possible via LPCAMM2.
Afaik, 48G isn't achievable in a quad-channel configuration (4x12G?), but 96G should be (as 4x24G modules are available).
#5
igormp
hsew: I combed the source page for this language or any clarification on the matter; saying it is “hardwired” to LPCAMM2 is a bit counterintuitive. Was it supposed to read LPDDR5 instead?

Either way, the 32 GB mandatory minimum is a welcome sight. I’m a bit surprised (hence the confusion above) that 48 and 96 GB capacities weren’t also mentioned, as those capacities should be possible via LPCAMM2.
LPCAMM2 uses LPDDR5(X) modules still.
wNotyarD: Afaik, 48G isn't achievable in a quad-channel configuration (4x12G?), but 96G should be (as 4x24G modules are available).
IIRC each LPCAMM2 module is 128-bit; for Strix Halo you'll need two of those, so for 48 GB you could go for 24 GB modules.
However, Crucial only lists 32 and 64 GB modules on their page:
www.crucial.com/memory/ddr5/CT64G75C2LP5XG

So it'd mean either 64 or 128 GB for Strix Halo. I'm too lazy to look into other manufacturers.
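Spelling out the capacity math in this sub-thread as a quick sketch (the 32/64 GB parts are the Crucial modules linked above; the 24 GB module is hypothetical):

```python
# Capacity math from this thread: Strix Halo's 256-bit LPDDR5X bus takes
# two 128-bit LPCAMM2 modules, so totals are simply 2 x module size.
# 32/64 GB are the Crucial parts linked above; 24 GB is hypothetical.
for module_gb in (24, 32, 64):
    print(f"2 x {module_gb} GB LPCAMM2 -> {2 * module_gb} GB total")
# -> 48, 64, 128 GB; a 32 GB total doesn't fit this scheme (no 16 GB
#    modules are listed), which supports the soldered-LPDDR5X theory below.
```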
#6
Vayra86
I think an upgrade path for my 7900XT has just opened up right here.

Thanks AMD I guess?
#7
tpuuser256
It's crazy how hardware capable of running decently sized models fast enough is sold for WAY WAY more than its price-to-performance would suggest. It should become much more affordable to run interesting models in the next few decades.
#8
AnotherReader
igormp: LPCAMM2 uses LPDDR5(X) modules still.

IIRC each LPCAMM2 module is 128-bit; for Strix Halo you'll need two of those, so for 48 GB you could go for 24 GB modules.
However, Crucial only lists 32 and 64 GB modules on their page:
www.crucial.com/memory/ddr5/CT64G75C2LP5XG

So it'd mean either 64 or 128 GB for Strix Halo. I'm too lazy to look into other manufacturers.
The 32 GB SKU might be using soldered LPDDR5X; that is the norm for laptops after all.
#9
Solid State Brain
A small niche of enthusiasts has been asking for years for more VRAM on consumer GPUs to run bigger AI models; hopefully the current DeepSeek craze is going to make manufacturers reconsider their stance of just providing the bare minimum needed for running games at the resolution the GPUs are primarily intended to be used with.
#10
AnotherReader
Solid State Brain: A small niche of enthusiasts has been asking for years for more VRAM on consumer GPUs to run bigger AI models; hopefully the current DeepSeek craze is going to make manufacturers reconsider their stance of just providing the bare minimum needed for running games at the resolution the GPUs are primarily intended to be used with.
Honestly, I believe that for inference, Apple's approach is better; the unified DRAM pool allows memory capacities that consumer GPUs just can't match. A lot of people use laptops, so a bigger Strix Halo with a 512-bit bus could have 256 GB of RAM with 76% of a desktop RTX 4080's bandwidth.
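That 76% figure is straightforward bus arithmetic; a quick sketch, assuming LPDDR5X-8533 on the hypothetical 512-bit part and the RTX 4080's published 256-bit GDDR6X at 22.4 Gbps:

```python
# Peak bandwidth = (bus width in bytes) x (per-pin data rate).
# LPDDR5X-8533 on a 512-bit bus is the hypothetical APU; the RTX 4080
# figures (256-bit GDDR6X at 22.4 Gbps) are public specs.
def bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

apu = bandwidth_gb_s(512, 8.533)      # ~546 GB/s
rtx_4080 = bandwidth_gb_s(256, 22.4)  # ~717 GB/s
print(f"{apu:.0f} vs {rtx_4080:.0f} GB/s -> {apu / rtx_4080:.0%}")  # ~76%
```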
#11
Solid State Brain
AnotherReaderHonestly, I believe that for inference, Apple's approach is better; the unified DRAM pool allows memory capacities that consumer GPUs just can't match.
That could be a path forward too with mixture-of-expert (MoE) LLMs similar to DeepSeek V3/R1, but merely providing non-upgradable systems with relatively large amounts of RAM (e.g. 128GB) at mediocre-to-low bandwidth (~250-300 GB/s, still below the level of a low-end discrete GPU) isn't going to help a lot. Memory doesn't just have to be abundant, but fast too.
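To put numbers on that: each generated token has to read roughly every active weight once, so decode speed is bounded by bandwidth divided by the active model bytes. A rough sketch, assuming ~0.6 bytes per weight (Q4-ish) and the ~37B active parameters per token of a DeepSeek V3/R1-style MoE:

```python
# Decode-speed ceiling: tokens/s <= bandwidth / bytes read per token,
# where bytes per token ~ active parameters x bytes per weight.
# 0.6 bytes/weight (Q4-ish) and the 37B active-parameter figure for a
# DeepSeek V3/R1-style MoE are assumptions for illustration.
def max_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_weight: float = 0.6) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_weight)

for bw in (273, 546):  # ~256-bit LPDDR5X-8533 vs. a doubled 512-bit bus
    print(f"{bw} GB/s: dense 70B ~{max_tokens_per_s(bw, 70):.1f} t/s, "
          f"MoE (37B active) ~{max_tokens_per_s(bw, 37):.1f} t/s")
```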
#12
hsew
igormp: LPCAMM2 uses LPDDR5(X) modules still.
Right, that's where the confusion comes from: the phrase "hardwired to LPCAMM2 configurations of [fixed sizes]". The word "hardwired" implies that it is in fact soldered.
#13
AnotherReader
Solid State Brain: That could be a path forward too with mixture-of-expert (MoE) LLMs similar to DeepSeek V3/R1, but merely providing non-upgradable systems with relatively large amounts of RAM (e.g. 128GB) at mediocre-to-low bandwidth (~250-300 GB/s, still below the level of a low-end discrete GPU) isn't going to help a lot. Memory doesn't just have to be abundant, but fast too.
Yes, there's a tradeoff, and for inference, memory bandwidth trumps all. The trend for GPUs is clear though. GDDR leads to low memory capacities; HBM allows exceeding that capacity at infeasible cost. Upgradeable RAM allows the most capacity, but that comes at the expense of bandwidth as well.
#14
mb194dc
Great time to uncancel Navi 41 and 42, then? Bring them to market with 30 and 36 GB of VRAM.
#15
Punkenjoy
AnotherReader: Yes, there's a tradeoff, and for inference, memory bandwidth trumps all. The trend for GPUs is clear though. GDDR leads to low memory capacities; HBM allows exceeding that capacity at infeasible cost. Upgradeable RAM allows the most capacity, but that comes at the expense of bandwidth as well.
The main issue with HBM is that it requires an interposer to sit on and communicate with the main die. That drastically increases the cost, as HBM needs to be on-package, on silicon.

But there is work on 3D DRAM that wouldn't necessarily be HBM in order to increase capacities, though from what I see, it's still a few years in the making:
semiengineering.com/baby-steps-towards-3d-dram/

Note that it looks like they are also working on stacked DRAM that would use the same bus size as GDDR* and would probably be a drop-in solution while we wait.
#16
TPUnique
wNotyarD: What? AMD for once in their life getting their timing right to capitalize on something?
Yeah, upon reading this I was lauding their reactivity... then I remembered that it's probably thanks to the marketing department not being in charge.
#18
Wirko
Punkenjoy: The main issue with HBM is that it requires an interposer to sit on and communicate with the main die. That drastically increases the cost, as HBM needs to be on-package, on silicon.
There are more issues. An HBM memory cell takes up twice as much space as a DDR cell. Then there's TSV stacking, which seems to be incredibly expensive, possibly because there's insufficient manufacturing capacity everywhere.
DRAM dies are also stacked in large capacity server DIMMs. That used to be the case for really, really expensive 128 GB DIMMs and up, but now as larger capacity dies exist, it's probably 256 GB and up. Going by the price, I assume it's TSV stacking.
LPDDR dies are also stacked in some designs, for example Apple's M chips. Probably TSV again because speed matters and cost doesn't.
A case of non-TSV stacked dies (with old style wire bonding instead) would be NAND, for several reasons: lower speed, small number of wires due to 8-bit bus, and requirement for low cost.
Punkenjoy: But there is work on 3D DRAM that wouldn't necessarily be HBM in order to increase capacities, though from what I see, it's still a few years in the making:
semiengineering.com/baby-steps-towards-3d-dram/
Thanks for the link. Semiengineering posted this nice overview of current tech in 2021... and later I occasionally checked and found nothing. Yes, we'll wait some more for 3D. Someone will eventually modify the NAND manufacturing tech so that those capacitors, well, quickly charge and discharge. And when they succeed, they will try everything to compress four bits into one cell.
Punkenjoy: Note that it looks like they are also working on stacked DRAM that would use the same bus size as GDDR* and would probably be a drop-in solution while we wait.
What sort of stacked DRAM do you mean here? Again, due to high speed, it would have to be TSV stacked, so in a different price category.
#19
mkppo
The 395 looks more and more interesting by the day, and I can see it replacing low/mid-end GPUs in the laptop space in the future. Please AMD, release one on the desktop. Or Turin Threadripper. These two are a lot more interesting than the shit these three companies have been spitting out the last couple of years, and I'd love to tweak them out.

Fast forward a few years, and a 16-core with V-Cache + UDNA + CAMM2 should be awesome. HBM remains a pipe dream because prices have risen quite a bit and TSV stacking remains prohibitively expensive.
#20
Sound_Card
mkppo: The 395 looks more and more interesting by the day, and I can see it replacing low/mid-end GPUs in the laptop space in the future. Please AMD, release one on the desktop. Or Turin Threadripper. These two are a lot more interesting than the shit these three companies have been spitting out the last couple of years, and I'd love to tweak them out.

Fast forward a few years, and a 16-core with V-Cache + UDNA + CAMM2 should be awesome. HBM remains a pipe dream because prices have risen quite a bit and TSV stacking remains prohibitively expensive.
I'm positive that AMD and companies like Minisforum will release mini motherboards with the SoC embedded for system builders.
#21
Wirko
mkppo: Please AMD, release one on the desktop.
And its name shall be 10980XG. It would only fit in a TR socket though, with its four channels.
mkppo: CAMM2
We have yet to see what becomes of CAMM2 and LPCAMM. Either of these may become a commodity in a couple of years. Or they may remain a rarity with poor availability, mostly sold through OEMs.
#22
AusWolf
The 7900 XTX being better at AI than the 4090? Good joke! :laugh: Wait... Seriously? :wtf:
#23
hatyii
So if the 7900 XTX is faster for AI than the 4090, and AMD mentions that RDNA3 specifically can run this model well because of hardware advantages over RDNA2, explain to me why the new FSR version is supposed to be exclusive to their new GPUs? I mean, even an RTX 2000 GPU can benefit from DLSS, so I'm just confused about this stuff.
#24
AusWolf
hatyii: So if the 7900 XTX is faster for AI than the 4090, and AMD mentions that RDNA3 specifically can run this model well because of hardware advantages over RDNA2, explain to me why the new FSR version is supposed to be exclusive to their new GPUs? I mean, even an RTX 2000 GPU can benefit from DLSS, so I'm just confused about this stuff.
FSR 4 could be vastly different from DeepSeek in how it runs. RDNA 3's AI accelerators are part of the shader engine. RDNA 4 may be getting dedicated units. Who knows.

Also, DLSS hasn't changed much in its base operation, so it can run on anything with tensor cores. FSR hasn't needed AI cores so far, but FSR 4 does.

My other theory is that Nvidia hasn't touched the RT and tensor cores much since RTX 2000 (judging by performance data). We know very little about what an AI/tensor core actually is and how it works.
#25
alwayssts
mb194dc: Great time to uncancel Navi 41 and 42, then? Bring them to market with 30 and 36 GB of VRAM.
You mean 40/48 GB of RAM? I doubt it was ever GDDR7, but it's possible.

I think N41 (partially) got canned because they know once people have >80 TF and 24 GB (essentially a 4090), most ain't upgrading for a long, long time. Those that wanted that at $1000+ bought a 4090.
Cutting the price of the 4080 from $1200 to $1000 probably also had something to do with it, as I think that's where AMD wanted to compete.
Similar reason for the gap in NV products: why GB203 is limited to <80 TF (one less cluster than half of GB202, plus power-limit locks) and doesn't have a 24 GB option. Gotta milk those upgrades for as long as possible...
Hence both wanted to get one more cycle in before that happened... or maybe they're just able to make it for a larger margin given the move to 3nm and 3 GB GDDR7 (256-bit instead of 384-bit for a 24 GB spec).
Something like a $500 BOM (~GB203/N48 size; 100+ known-good dies per 20k wafer, plus ~$300 of 3 GB GDDR7) makes a lot more sense than making a slightly slower 4090 for ~$1200 MSRP.
They would've needed 12288 sp @ 3640 MHz to match a 4090... that's probably impossible, or at least close to impossible, to yield on 4/5nm for a GPU.
We may see with N48; 3.4 GHz is probably difficult enough to yield within decent power. I say that because if all N48 products can't hit 3.3 GHz+, they've kinda failed; might as well buy a 6800 XT/7800 XT.
I'll be verrryyy curious whether (binned) 3x8-pin designs will be able to hit anywhere around ~3.6 GHz (+/-?), as that may have been the N4 goal for both the (cancelled) large and (non-cancelled) small chips, with 24 Gbps RAM.
I still think something like an 11264 sp+ 3nm design is going to be a lot of people's last stop in this market. People with a 4090 (unless they have to have the best) probably already don't care.
Making ~1920 sp x 6 with 96 ROPs is just sooo much cheaper. It would only require 3900 MHz to match a 4090, which I think is very doable given how current 5nm GPU designs yield against the 2.93/3.24 GHz Apple products.
We don't know how N48 yields against the 3460-3700 MHz Apple products yet, or how much power it uses, but it should be interesting: both clock yields and the power usage for those clocks on the curve.
This could be telling as to who has the better idea on 3nm.
NVIDIA is probably shooting for 12288 sp @ 3780 MHz with 36000 MT/s memory, like Apple's efficient clock on N3B, while AMD could perhaps be shooting for 11520 sp @ ~3.87 GHz with 40000+ MT/s, more similar to Apple's 4050 MHz N3P.
Whatever they do, it'll be a lot cheaper to make than a 4090, or whatever AMD wanted to do with N41... chiplet or monolithic.

At any rate, it's fascinating to see what's possible with this DeepSeek model; it's almost like pure hardware always wins out in the end versus software/marketing bullshit and artificial limitations!
It's amusing to see the hardware limitations exposed when not locked to their ecosystem.

Long-live the Fine Wine of actually well-matched hardware/vram that always prevails in the end.
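A footnote on the shader-times-clock arithmetic in this comment: peak FP32 throughput is just shaders x 2 FLOPs per clock (FMA) x clock speed. A quick sketch of the figures being tossed around, as a loose yardstick only; the 4090 numbers are public specs, the rest are the hypothetical configurations discussed above:

```python
# Peak FP32 throughput = shaders x 2 FLOPs (FMA) x clock. A loose
# yardstick only; real performance depends on far more than peak FP32.
def fp32_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000

print(f"RTX 4090, 16384 sp @ 2.52 GHz: {fp32_tflops(16384, 2.52):.1f} TF")
print(f"12288 sp @ 3.64 GHz: {fp32_tflops(12288, 3.64):.1f} TF")
print(f"11520 sp @ 3.90 GHz: {fp32_tflops(11520, 3.90):.1f} TF")
```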