
What local LLM-s you use?

Joined
Mar 11, 2008
Messages
1,073 (0.17/day)
Location
Hungary / Budapest
System Name Kincsem
Processor AMD Ryzen 9 9950X
Motherboard ASUS ProArt X870E-CREATOR WIFI
Cooling Be Quiet Dark Rock Pro 5
Memory Kingston Fury KF560C32RSK2-96 (2×48GB 6GHz)
Video Card(s) Sapphire AMD RX 7900 XT Pulse
Storage Samsung 990PRO 2TB + Samsung 980PRO 2TB + FURY Renegade 2TB+ Adata 2TB + WD Ultrastar HC550 16TB
Display(s) Acer QHD 27"@144Hz 1ms + UHD 27"@60Hz
Case Cooler Master CM 690 III
Power Supply Seasonic 1300W 80+ Gold Prime
Mouse Logitech G502 Hero
Keyboard HyperX Alloy Elite RGB
Software Windows 10-64
Benchmark Scores https://valid.x86.fr/9qw7iq https://valid.x86.fr/4d8n02 X570 https://www.techpowerup.com/gpuz/g46uc
Joined
Feb 12, 2025
Messages
4 (4.00/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
I've used many models, from 1B TinyLlama to mixtral:8x22b.
I've found Phi4:14B and Gemma2:27B the most useful.
My 7800XT runs these models at Q4_K at these speeds:
phi4:14B - 42 tokens/s
gemma2:27B - 8.5 tokens/s

For reference:
llama3.3:70B - 1.5 tokens/s, painfully slow
llama3.2:3B - 115 tokens/s, super fast, but super dumb :)

I use ollama for running models; it has a nice API that can be used with its Python module. I run Open WebUI in a QNAP container, which connects to my models via the same API. Another QNAP container runs an nginx reverse proxy to make the WebUI connection secure (HTTP -> HTTPS).
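For anyone curious, the ollama HTTP API is simple enough to hit from plain Python even without the dedicated module. A minimal sketch, using only the standard library; the endpoint and response fields (`eval_count`, `eval_duration`) are from ollama's `/api/generate` schema, and the tokens/s math matches the numbers people quote in this thread:

```python
# Minimal sketch of talking to a local ollama server from Python.
# Assumes the default port (11434); model name is just an example.

import json
import urllib.request

def tokens_per_second(resp: dict) -> float:
    """ollama reports eval_duration in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def ask(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)

# Example with a canned response (no server needed for the math):
sample = {"eval_count": 420, "eval_duration": 10_000_000_000}  # 420 tokens in 10 s
print(f"{tokens_per_second(sample):.1f} tokens/s")  # 42.0 tokens/s
```

With a server running, `ask("phi4", "...")` returns the same fields, so the generation speeds quoted above come straight out of the response.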

I must admit that much bigger online models like GPT-4o, Gemini 2 Pro etc. are obviously superior to local ones, but the local ones get a surprising amount of stuff done and are fun to play with.
 

Easy Rhino

Linux Advocate
Staff member
Joined
Nov 13, 2006
Messages
15,649 (2.35/day)
Location
Mid-Atlantic
System Name Desktop
Processor i5 13600KF
Motherboard AsRock B760M Steel Legend Wifi
Cooling Noctua NH-U9S
Memory 4x 16 Gb Gskill S5 DDR5 @6000
Video Card(s) Gigabyte Gaming OC 6750 XT 12GB
Storage WD_BLACK 4TB SN850x
Display(s) Gigabye M32U
Case Corsair Carbide 400C
Audio Device(s) On Board
Power Supply EVGA Supernova 650 P2
Mouse MX Master 3s
Keyboard Logitech G915 Wireless Clicky
Software Fedora KDE Spin
I am running ollama here on my Fedora 41 desktop with an AMD RX 6750 XT. I was able to pass the environment variable that enables that GPU despite it not being directly supported. I mostly run deepseek-coder-v2 for AI chat inside VSCodium and qwen2.5-coder:1.5b for code completion. I get 75-80 tokens per second, which is pretty good.

edit: i get 83 tps with llama3.2:3b
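For reference, the usual trick for RDNA2 cards that ROCm doesn't officially list (the 6750 XT is gfx1031) is overriding the reported GFX version so it is treated as a supported gfx1030 part. A sketch, assuming the standard Linux install that runs ollama as a systemd service:

```shell
# Add a systemd drop-in for the ollama service
sudo systemctl edit ollama.service
# In the editor, add:
#   [Service]
#   Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"   # present gfx1031 as gfx1030
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

If ollama is started manually instead, exporting `HSA_OVERRIDE_GFX_VERSION=10.3.0` in the shell before `ollama serve` has the same effect.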
 
Made a second round with the same models, but with a fresh start, not opening any other programs.
I am not sure why my system gets this "slow" with AI:
DS 14B Q8: from 42 -> 73 tokens/s
Phi4 Q8: from 42, still 42 tokens/s
DS 32B Q6: from 3.2 to 3.3 tokens/s
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s
I understand that the browser uses some VRAM, but how is it that LM Studio doesn't take priority and offload data not currently in use to system memory?
phi4:14B - 42 tokens/s
Same on mine, even with a fresh start - strange.
Llama 3.3 performance also looks strange.
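The slowdown pattern here is what partial offload looks like: llama.cpp-based runtimes (LM Studio, ollama) split the model's layers between VRAM and system RAM, and generation speed is dominated by whatever runs from RAM. A back-of-envelope sketch; the helper and the example numbers (model size, free VRAM, layer count) are illustrative assumptions, not measured values:

```python
# Rough estimate of how many transformer layers fit in VRAM when a
# model is bigger than the card, assuming layers are roughly equal-sized.

def gpu_layers(model_gb: float, free_vram_gb: float, n_layers: int) -> int:
    """Number of layers that fit on the GPU; the rest run from system RAM."""
    per_layer = model_gb / n_layers
    return min(n_layers, int(free_vram_gb / per_layer))

# e.g. a ~30 GB 70B Q3 model, ~20 GB free VRAM on a 20 GB card, 80 layers:
print(gpu_layers(30, 20, 80))  # 53 -> roughly a third of the model runs from RAM
```

That also explains why a browser eating a GB or two of VRAM can visibly move the numbers: it shifts a few more layers onto the slow side of the split.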
 

Easy Rhino

Linux Advocate
Staff member
Made a second round with the same models, but with a fresh start, not opening any other programs.
I am not sure why my system gets this "slow" with AI:
DS 14B Q8: from 42 -> 73 tokens/s
Phi4 Q8: from 42, still 42 tokens/s
DS 32B Q6: from 3.2 to 3.3 tokens/s
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s
I understand that the browser uses some VRAM, but how is it that LM Studio doesn't take priority and offload data not currently in use to system memory?

Same on mine, even with a fresh start - strange.
Llama 3.3 performance also looks strange.

Try llama3.2:3b and see what you get?
 
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s

Llama 3.3 performance also looks strange
That model needs about 42GB of VRAM to run properly on a PC. A lot of the model gets loaded into system RAM, which is roughly 10x slower, so performance falls off a cliff here. Fluctuations in performance are to be expected, depending on whether an answer came mostly from VRAM or RAM.
Measuring LLM TPS always varies a bit. If a question is complex and requires a long answer, it takes the LLM longer. And if the LLM temperature is set to creative, it will give different answers to the same question, making exact measurement complicated.
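The ~42 GB figure follows directly from parameter count times bits per weight. A quick estimator; the ~20% overhead factor for KV cache and buffers is a rough assumption, not a measured number:

```python
# Rough VRAM requirement for a quantized model:
# weights (params * bits / 8) plus an assumed ~20% for KV cache and buffers.

def model_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# llama3.3:70B at roughly 4.5 effective bits/weight:
print(model_vram_gb(70, 4.5))  # ~47 GB, same ballpark as the 42 GB above
```

The same formula shows why phi4 (14B at Q4) sits comfortably inside a 16 GB card while anything 70B-class spills into system RAM.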
I usually use Phi and Llama for language questions. Phi4 and the smaller Llamas are terrible at my native language.

In recent years I've not been much of a gamer, but with LLMs I am tempted to upgrade my GPU to a 24GB 7900XTX, or to a 32GB version of the 9070XT if that rumor comes true. All current NVIDIA offerings seem to be low bang-for-the-buck options from an LLM perspective, but a GB10-based desktop computer is very interesting.
 
Joined
Jul 19, 2015
Messages
1,019 (0.29/day)
Location
Nova Scotia, Canada
Processor Ryzen 5 5600 @ 4.65GHz CO -30
Motherboard AsRock X370 Taichi
Cooling Cooler Master Hyper 212 Plus
Memory 32GB 4x8 G.SKILL Trident Z 3200 CL14 1.35V
Video Card(s) PCWINMAX RTX 3060 6GB Laptop GPU (80W)
Storage 1TB Kingston NV2
Display(s) LG 25UM57-P @ 75Hz OC
Case Fractal Design Arc XL
Audio Device(s) ATH-M20x
Power Supply Evga SuperNova 1300 G2
Mouse Evga Torq X3
Keyboard Thermaltake Challenger
Software Win 11 Pro 64-Bit
Tried playing around with LM Studio a bit with my 3060 6GB laptop GPU & Instinct MI25 16GB together. It was a lot easier to set up than I expected.

Performance on the smaller model is quite lacking with the MI25, I think because the 3060 can't use its Tensor cores under Vulkan. The 22GB combined VRAM pool makes up for that on larger models, though.

Vulkan (3060 + MI25)
DS Qwen 7B 4Q_K_M ~20t/s
DS Qwen 32B 4Q_K_M ~8t/s

CUDA (3060 only)
DS Qwen 7B 4Q_K_M ~43t/s
DS Qwen 32B 4Q_K_M ~2.1t/s (Obviously way too big for it)
 