
AMD & Nexa AI Reveal NexaQuant's Improvement of DeepSeek R1 Distill 4-bit Capabilities

T0@st

News Editor
Joined
Mar 7, 2023
Messages
2,842 (3.74/day)
Location
South East, UK
System Name The TPU Typewriter
Processor AMD Ryzen 5 5600 (non-X)
Motherboard GIGABYTE B550M DS3H Micro ATX
Cooling DeepCool AS500
Memory Kingston Fury Renegade RGB 32 GB (2 x 16 GB) DDR4-3600 CL16
Video Card(s) PowerColor Radeon RX 7800 XT 16 GB Hellhound OC
Storage Samsung 980 Pro 1 TB M.2-2280 PCIe 4.0 X4 NVME SSD
Display(s) Lenovo Legion Y27q-20 27" QHD IPS monitor
Case GameMax Spark M-ATX (re-badged Jonsbo D30)
Audio Device(s) FiiO K7 Desktop DAC/Amp + Philips Fidelio X3 headphones, or ARTTI T10 Planar IEMs
Power Supply ADATA XPG CORE Reactor 650 W 80+ Gold ATX
Mouse Roccat Kone Pro Air
Keyboard Cooler Master MasterKeys Pro L
Software Windows 10 64-bit Home Edition
Nexa AI today announced NexaQuants of two DeepSeek R1 Distills: DeepSeek R1 Distill Qwen 1.5B and DeepSeek R1 Distill Llama 8B. Popular quantization methods, such as the llama.cpp-based Q4_K_M, significantly reduce a large language model's memory footprint, typically at the cost of only a small perplexity loss for dense models. However, even a small perplexity loss can noticeably degrade reasoning capability in models (dense or MoE) that rely on Chain of Thought traces. Nexa AI states that NexaQuants recover this lost reasoning capability (compared to full 16-bit precision) while keeping the 4-bit quantization and retaining its performance advantage. Benchmarks provided by Nexa AI can be seen below.
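As a rough illustration of what 4-bit block quantization does, here is a simplified sketch: weights are grouped into blocks, each stored as 4-bit integers plus one shared scale. This is not the actual Q4_K_M format (which uses super-blocks with per-sub-block scales and minimums); it only shows the basic idea and the rounding error it introduces.

```python
import numpy as np

def quantize_q4(block):
    """Simplified 4-bit block quantization: 4-bit signed ints + one FP16 scale.

    Illustrative only -- real llama.cpp Q4_K_M is more elaborate.
    """
    scale = np.abs(block).max() / 7.0  # map the block into the -8..7 range
    if scale == 0.0:
        return np.zeros_like(block, dtype=np.int8), np.float16(0.0)
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, np.float16(scale)

def dequantize_q4(q, scale):
    """Reconstruct approximate FP32 weights from the 4-bit codes."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32).astype(np.float32)  # one 32-weight block
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.6f}")  # small vs. weight magnitude
```

For this toy block, storage drops from 32 × 16 bits (FP16) to 32 × 4 bits plus one 16-bit scale, at the cost of the rounding error printed above.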

We can see that the Q4_K_M-quantized DeepSeek R1 distills score slightly lower (except on the AIME24 bench with the Llama 8B distill, where the drop is significant) in LLM benchmarks like GPQA and AIME24 compared to their full 16-bit counterparts. Moving to a Q6 or Q8 quantization would be one way to close the gap, but it would make the model slightly slower to run and require more memory. Nexa AI states that NexaQuants use a proprietary quantization method to recover the loss while keeping the quantization at 4 bits, meaning users can theoretically get the best of both worlds: accuracy and speed.
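The memory side of that tradeoff is easy to estimate. The sketch below uses commonly cited effective bits-per-weight figures for llama.cpp quantization types; the exact numbers vary with the tensor mix and should be treated as assumptions for illustration.

```python
PARAMS = 8.03e9  # approximate parameter count of an "8B" model

# Approximate effective bits per weight (assumed figures; actual values
# depend on which tensors use which quantization type).
bpw = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85}

for name, bits in bpw.items():
    gib = PARAMS * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:8s} ~{gib:5.1f} GiB of weights")
```

Under these assumptions, a Q4_K_M 8B model's weights fit comfortably in a 16 GB card with room to spare, while Q6/Q8 variants claw back accuracy at the cost of several more gigabytes.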




You can read more about the NexaQuant DeepSeek R1 Distills over here.



The following NexaQuant DeepSeek R1 Distills are available for download:

How to run NexaQuants on your AMD Ryzen processors or Radeon graphics card
We recommend using LM Studio for all your LLM needs.
  • 1) Download and install LM Studio from lmstudio.ai/ryzenai.
  • 2) Go to the Discover tab and paste the Hugging Face link of one of the NexaQuants above.
  • 3) Wait for the model to finish downloading.
  • 4) Go back to the Chat tab and select the model from the drop-down menu. Make sure "manually choose parameters" is selected.
  • 5) Set GPU offload layers to MAX.
  • 6) Load the model and chat away!
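Beyond the chat tab, LM Studio can also expose a loaded model through an OpenAI-compatible local server. The sketch below builds a request body for that endpoint; the port is LM Studio's usual default, and the model identifier is a placeholder, so treat both as assumptions.

```python
import json

# Assumed default address of LM Studio's OpenAI-compatible local server.
URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "deepseek-r1-distill-llama-8b-nexaquant",  # placeholder name
    "messages": [
        {"role": "user", "content": "Explain 4-bit quantization in one sentence."}
    ],
    "temperature": 0.6,
}

# Serialized request body; send it with any HTTP client, e.g.
#   requests.post(URL, json=payload)
body = json.dumps(payload)
print(body[:60], "...")
```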

According to data provided by Nexa AI, developers can also use the NexaQuant versions of the DeepSeek R1 Distills above for generally improved performance in llama.cpp or GGUF-based applications.

View at TechPowerUp Main Site | Source
 
Joined
Oct 6, 2021
Messages
1,863 (1.46/day)
System Name Raspberry Pi 7 Quantum @ Overclocked.
Something doesn't add up in these tests.. how can the quantized version outperform the original? Aliens?

 
Joined
May 10, 2023
Messages
806 (1.16/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
Something doesn't add up in these tests.. how can the quantized version outperform the original? Aliens?

My bet is that they don't simply compress the existing weights, but rather do some extra quantization-aware training to preserve the model's performance.
They have a blog post from which the above can be inferred:

Basically, while compressing the model, you fine-tune it a bit more on the original dataset so the quantized weights can "recalibrate" to the data.
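The idea the poster describes can be sketched as a toy quantization-aware fine-tuning loop using a straight-through estimator: the forward pass uses fake-quantized weights, while gradients update a full-precision copy. Everything below (the regression setup, step counts, learning rate) is illustrative, not Nexa AI's actual method.

```python
import numpy as np

def fake_quant(w, scale):
    """Round weights to a 4-bit grid and dequantize on the fly."""
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 8))                # toy calibration inputs
w_true = rng.normal(size=8)                  # "original" FP weights
y = x @ w_true                               # outputs of the unquantized model

w = w_true + rng.normal(scale=0.3, size=8)   # degraded starting point
scale = np.abs(w).max() / 7.0

loss_before = np.mean((x @ fake_quant(w, scale) - y) ** 2)
for _ in range(300):
    err = x @ fake_quant(w, scale) - y       # forward pass uses quantized w
    grad = 2.0 * x.T @ err / len(x)          # straight-through estimator:
    w -= 0.01 * grad                         # gradient applied to the FP copy
loss_after = np.mean((x @ fake_quant(w, scale) - y) ** 2)

print(f"loss before: {loss_before:.3f}  after: {loss_after:.3f}")
```

The fine-tuned weights still round to the same 4-bit grid, but they settle where the *quantized* model matches the original outputs, which is how a 4-bit model can recover (and on noisy benchmarks occasionally appear to exceed) full-precision scores.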
 
Joined
Jan 8, 2017
Messages
9,788 (3.25/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Something doesn't add up in these tests.. how can the quantized version outperform the original? Aliens?
Who knows. These benchmarks suck though; there have been many quantized models that are supposedly just as good as their FP16 counterparts, but in practice they're clearly not.
 