Tuesday, February 18th 2025

AMD & Nexa AI Reveal NexaQuant's Improvement of DeepSeek R1 Distill 4-bit Capabilities
Nexa AI today announced NexaQuants of two DeepSeek R1 Distills: DeepSeek R1 Distill Qwen 1.5B and DeepSeek R1 Distill Llama 8B. Popular quantization methods like the llama.cpp-based Q4_K_M allow large language models to significantly reduce their memory footprint, typically at the cost of only a small perplexity loss for dense models. However, even a small perplexity loss can translate into a noticeable hit to reasoning capability for models (dense or MoE) that rely on Chain of Thought traces. Nexa AI states that NexaQuants recover this lost reasoning capability (relative to full 16-bit precision) while keeping the 4-bit quantization and retaining its performance advantage. Benchmarks provided by Nexa AI can be seen below.
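To put the memory-footprint claim in perspective, here is a rough back-of-the-envelope sketch of weight-storage sizes at different precisions (the bits-per-weight figures for the GGUF quant types are approximations used for illustration, not official numbers):

```python
# Rough estimate of weight-storage size for an 8B-parameter model at different precisions.
# Bits-per-weight values are approximate and for illustration only.
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K_M", 4.8)]:
    print(f"8B model @ {label:7s} ~ {weight_gib(8.0, bpw):4.1f} GiB")

# FP16 comes out around 15 GiB of weights, Q4_K_M around 4.5 GiB --
# roughly a 3x reduction, which is why 4-bit quants fit on far more hardware.
```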
Sources:
AMD Community, Nexa AI Blog
We can see that the Q4_K_M-quantized DeepSeek R1 distills score slightly lower in LLM benchmarks such as GPQA and AIME24 than their full 16-bit counterparts (with the exception of the AIME24 result for the Llama 3 8B distill, which is significantly lower). Moving to a Q6 or Q8 quantization would be one way to close this gap, but it would make the model somewhat slower to run and increase its memory requirements. Nexa AI states that NexaQuants use a proprietary quantization method to recover the lost accuracy while keeping the quantization at 4 bits, which means users can theoretically get the best of both worlds: accuracy and speed. You can read more about the NexaQuant DeepSeek R1 Distills over here.

The following NexaQuant DeepSeek R1 Distills are available for download:
- DeepSeek R1 Distill Qwen 1.5B (NexaQuant)
- DeepSeek R1 Distill Llama 8B (NexaQuant)
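If you would rather script against one of these GGUF files directly instead of going through LM Studio (which the steps below cover), a minimal llama-cpp-python sketch might look like the following; the file name is a placeholder for whichever NexaQuant GGUF you actually download:

```python
# Minimal sketch: running a 4-bit NexaQuant GGUF locally with llama-cpp-python.
# The model_path below is a placeholder -- point it at the GGUF file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-8B-NexaQuant.gguf",  # placeholder file name
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU when one is available
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "How many r's are in the word 'strawberry'? Think step by step.",
    }],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```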
How to run NexaQuants on your AMD Ryzen processor or Radeon graphics card
We recommend using LM Studio for all your LLM needs.
- 1) Download and install LM Studio from lmstudio.ai/ryzenai
- 2) Go to the Discover tab and paste the Hugging Face link of one of the NexaQuants above.
- 3) Wait for the model to finish downloading.
- 4) Go back to the chat tab and select the model from the drop-down menu. Make sure "manually choose parameters" is selected.
- 5) Set GPU offload layers to MAX.
- 6) Load the model and chat away! (If you prefer to query the model from code, see the sketch after these steps.)
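Once a model is loaded, LM Studio can also expose it through its local, OpenAI-compatible server (off by default; enable it from LM Studio's server/developer section), which lets you query the NexaQuant distill from your own code. A minimal sketch with the openai Python client is below; the port is LM Studio's default and the model identifier is a placeholder for whatever name LM Studio shows for the loaded model:

```python
# Minimal sketch: querying a model served by LM Studio's local OpenAI-compatible server.
# Default server address is http://localhost:1234/v1; the API key can be any string.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b-nexaquant",  # placeholder -- use the name LM Studio shows
    messages=[{
        "role": "user",
        "content": "Solve for x: 3x + 7 = 25. Show your reasoning.",
    }],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```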
Comments
They have a blog post from which we can infer the above idea:
nexa.ai/blogs/nexaquant
Basically, while you compress your model, you fine-tune it a bit more on the original dataset so the quantized weights can "recalibrate" on the data.
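A minimal sketch of that "recalibrate while you compress" idea, using generic quantization-aware fine-tuning with a straight-through estimator (this is not Nexa AI's proprietary method, and the toy model and data here are made up purely for illustration):

```python
# Toy illustration of quantization-aware fine-tuning: weights are fake-quantized to
# 4 bits in the forward pass, while gradients flow straight through the rounding step,
# so fine-tuning nudges the weights toward values that survive 4-bit rounding well.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_4bit(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max() / 7.0 + 1e-8                 # symmetric 4-bit grid, range [-8, 7]
    w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    return w + (w_q - w).detach()                      # forward: quantized, backward: identity

class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to 4 bits on the fly."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quant_4bit(self.weight), self.bias)

# Toy model and toy "original dataset" -- stand-ins for the real model and training data.
model = nn.Sequential(QuantLinear(16, 32), nn.ReLU(), QuantLinear(32, 16))
x, y = torch.randn(256, 16), torch.randn(256, 16)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)   # fine-tune against the (toy) original data
    loss.backward()
    opt.step()

# After this, the stored weights have shifted to values that lose less accuracy
# when rounded to 4 bits -- they have "recalibrated" to the quantization grid.
```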