
What local LLMs do you use?

Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
System Name Kincsem
Processor AMD Ryzen 9 9950X
Motherboard ASUS ProArt X870E-CREATOR WIFI
Cooling Be Quiet Dark Rock Pro 5
Memory Kingston Fury KF560C32RSK2-96 (2×48GB 6GHz)
Video Card(s) Sapphire AMD RX 7900 XT Pulse
Storage Samsung 990PRO 2TB + Samsung 980PRO 2TB + FURY Renegade 2TB+ Adata 2TB + WD Ultrastar HC550 16TB
Display(s) Acer QHD 27"@144Hz 1ms + UHD 27"@60Hz
Case Cooler Master CM 690 III
Power Supply Seasonic 1300W 80+ Gold Prime
Mouse Logitech G502 Hero
Keyboard HyperX Alloy Elite RGB
Software Windows 10-64
Benchmark Scores https://valid.x86.fr/9qw7iq https://valid.x86.fr/4d8n02 X570 https://www.techpowerup.com/gpuz/g46uc
Joined
Feb 12, 2025
Messages
8 (2.67/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
I've used many models, from 1B TinyLlama to mixtral:8x22b.
I've found phi4:14B and gemma2:27B the most useful.
My 7800 XT runs these models at Q4_K at these speeds:
phi4:14B - 42 tokens/s
gemma2:27B - 8.5 tokens/s

For reference:
llama3.3:70B - 1.5 tokens/s, painfully slow
llama3.2:3B - 115 tokens/s, super fast, but super dumb :)

I use ollama for running models; it has a nice API that can be used with a dedicated Python module. I run Open WebUI in a QNAP container, connected to my models via the same API. Another QNAP container runs an nginx reverse proxy to secure the WebUI connection (HTTP -> HTTPS).
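For anyone curious, a minimal sketch of what talking to that API looks like (model name and prompt are just examples; ollama's HTTP API listens on localhost:11434 by default):

```python
# Minimal sketch: query a local ollama server over its HTTP API.
# Model name and prompt are examples; any model you have pulled works.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi4",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```

Open WebUI talks to the same endpoint, so the models only need to be served once.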

I must admit the much bigger online models like GPT-4o, Gemini 2 Pro, etc. are obviously superior to local ones, but the local ones get a surprising amount of stuff done and are fun to play with.
 

Easy Rhino

Linux Advocate
Staff member
Joined
Nov 13, 2006
Messages
15,649 (2.35/day)
Location
Mid-Atlantic
System Name Desktop
Processor i5 13600KF
Motherboard AsRock B760M Steel Legend Wifi
Cooling Noctua NH-U9S
Memory 4x 16 GB G.Skill S5 DDR5 @6000
Video Card(s) Gigabyte Gaming OC 6750 XT 12GB
Storage WD_BLACK 4TB SN850x
Display(s) Gigabyte M32U
Case Corsair Carbide 400C
Audio Device(s) On Board
Power Supply EVGA Supernova 650 P2
Mouse MX Master 3s
Keyboard Logitech G915 Wireless Clicky
Software Fedora KDE Spin
I am running ollama here on my Fedora 41 desktop with an AMD RX 6750 XT. I was able to pass the environment variable that enables that GPU despite it not being directly supported. I mostly run deepseek-coder-v2 for AI chat inside VSCodium and qwen2.5-coder:1.5b for code completion. I get 75-80 tokens per second, which is pretty good.
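For reference, a sketch of how that kind of override can be passed (the post doesn't name the variable; HSA_OVERRIDE_GFX_VERSION=10.3.0 is the commonly used ROCm workaround for RDNA2 cards like the RX 6750 XT that aren't on the official support list, so treat the value as an assumption):

```python
# Sketch: launch the ollama server with a ROCm GFX-version override so an
# officially unsupported RDNA2 card is treated as a supported gfx1030 part.
import os
import subprocess

env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION="10.3.0")  # assumed value
subprocess.run(["ollama", "serve"], env=env)
```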

Edit: I get 83 tps with llama3.2:3b
 
Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
Made a second round with the same models, but with a fresh start, not opening any other programs.
I am not sure why my system gets this "slow" with AI:
DS 14B Q8: from 42 -> 73 tokens/s
Phi4 Q8: from 42, still 42 tokens/s
DS 32B Q6: from 3.2 to 3.3 tokens/s
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s
I understand that the browser uses some VRAM, but how is it that LM Studio doesn't take priority and offload data not currently in use to system memory?
phi4:14B - 42 tokens/s
Same on mine, even with a fresh start - strange.
Llama 3.3 performance also looks strange.
 

Easy Rhino

Linux Advocate
Staff member
Made a second round with the same models, but with a fresh start, not opening any other programs.
I am not sure why my system gets this "slow" with AI:
DS 14B Q8: from 42 -> 73 tokens/s
Phi4 Q8: from 42, still 42 tokens/s
DS 32B Q6: from 3.2 to 3.3 tokens/s
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s
I understand that the browser uses some VRAM, but how is it that LM Studio doesn't take priority and offload data not currently in use to system memory?

Same on mine, even with a fresh start - strange.
Llama 3.3 performance also looks strange.

Try llama3.2:3b and see what you get?
 
Joined
Feb 12, 2025
Messages
8 (2.67/day)
Location
EU
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s

Llama 3.3 performance also looks strange
That model needs about 42 GB of VRAM to run properly on a PC. A lot of the model gets loaded into system RAM, which is roughly 10x slower, so performance falls off a cliff here. Fluctuations are to be expected, depending on whether the answer came mostly from VRAM or RAM.
Measured LLM TPS always varies a bit anyway. If a question is complex and requires a long answer, it takes the LLM longer. And if the temperature is set to a creative value, the LLM will give different answers to the same question, making exact measurement complicated.
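If you want a more repeatable number than eyeballing the UI, ollama's API reports token counts and timings itself; a small sketch (field names per ollama's REST API; eval_duration is in nanoseconds, and pinning temperature to 0 makes runs more comparable):

```python
# Sketch: compute generation tokens/s from ollama's own timing fields.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi4",
        "prompt": "Explain quantization in one paragraph.",
        "stream": False,
        "options": {"temperature": 0},  # reduce run-to-run variance
    },
).json()

tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # tokens / seconds
print(f"{tps:.1f} tokens/s")
```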
I usually use Phi and Llama for language questions.
Phi4 and the smaller Llamas are terrible at my native language.

In recent years I've not been much of a gamer, but with LLMs I am tempted to upgrade my GPU to a 24GB 7900 XTX, or the 32GB version of the 9070 XT if that rumor comes true. All current NVIDIA offerings seem to be low bang-for-the-buck options from an LLM perspective, but the GB10-based desktop computer is very interesting.
 
Joined
Jul 19, 2015
Messages
1,020 (0.29/day)
Location
Nova Scotia, Canada
Processor Ryzen 5 5600 @ 4.65GHz CO -30
Motherboard AsRock X370 Taichi
Cooling Cooler Master Hyper 212 Plus
Memory 32GB 4x8 G.SKILL Trident Z 3200 CL14 1.35V
Video Card(s) PCWINMAX RTX 3060 6GB Laptop GPU (80W)
Storage 1TB Kingston NV2
Display(s) LG 25UM57-P @ 75Hz OC
Case Fractal Design Arc XL
Audio Device(s) ATH-M20x
Power Supply Evga SuperNova 1300 G2
Mouse Evga Torq X3
Keyboard Thermaltake Challenger
Software Win 11 Pro 64-Bit
Tried playing around with LM Studio a bit, with my 3060 6GB laptop GPU & Instinct MI25 16GB together. It was a lot easier to set up than I expected it to be.

Performance on the smaller model is quite lacking with the MI25, I think because the 3060 can't use its Tensor cores there. The 22GB VRAM pool makes up for that on larger models though.

Vulkan (3060 + MI25)
DS Qwen 7B Q4_K_M: ~20 t/s
DS Qwen 32B Q4_K_M: ~8 t/s

CUDA (3060 only)
DS Qwen 7B Q4_K_M: ~43 t/s
DS Qwen 32B Q4_K_M: ~2.1 t/s (obviously way too big for it)
 
Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
Just learned about agentica-org_DeepScaleR-1.5B-Preview-GGUF - how great is this!
So I downloaded the F32 version, since even that only takes 7.11 GB.
Even at F32 it is still quite fast, at 92 tokens/s.

Also, to the other forum members:
I would love to see some 4080 and 4090 (or newer) stats here!
 
Joined
Feb 21, 2006
Messages
2,287 (0.33/day)
Location
Toronto, Ontario
System Name The Expanse
Processor AMD Ryzen 7 5800X3D
Motherboard Asus Prime X570-Pro BIOS 5013 AM4 AGESA V2 PI 1.2.0.Cc.
Cooling Corsair H150i Pro
Memory 32GB GSkill Trident RGB DDR4-3200 14-14-14-34-1T (B-Die)
Video Card(s) XFX Radeon RX 7900 XTX Magnetic Air (24.12.1)
Storage WD SN850X 2TB / Corsair MP600 1TB / Samsung 860Evo 1TB x2 Raid 0 / Asus NAS AS1004T V2 20TB
Display(s) LG 34GP83A-B 34 Inch 21: 9 UltraGear Curved QHD (3440 x 1440) 1ms Nano IPS 160Hz
Case Fractal Design Meshify S2
Audio Device(s) Creative X-Fi + Logitech Z-5500 + HS80 Wireless
Power Supply Corsair AX850 Titanium
Mouse Corsair Dark Core RGB SE
Keyboard Corsair K100
Software Windows 10 Pro x64 22H2
Benchmark Scores 3800X https://valid.x86.fr/1zr4a5 5800X https://valid.x86.fr/2dey9c 5800X3D https://valid.x86.fr/b7d
(screenshot attachment)
 

johnspack

Here For Good!
Joined
Oct 6, 2007
Messages
6,050 (0.95/day)
Location
Nelson B.C. Canada
System Name System2 Blacknet , System1 Blacknet2
Processor System2 Threadripper 1920x, System1 2699 v3
Motherboard System2 Asrock Fatality x399 Professional Gaming, System1 Asus X99-A
Cooling System2 Noctua NH-U14 TR4-SP3 Dual 140mm fans, System1 AIO
Memory System2 64GBS DDR4 3000, System1 32gbs DDR4 2400
Video Card(s) System2 GTX 980Ti System1 GTX 970
Storage System2 4x SSDs + NVme= 2.250TB 2xStorage Drives=8TB System1 3x SSDs=2TB
Display(s) 1x27" 1440 display 1x 24" 1080 display
Case System2 Some Nzxt case with soundproofing...
Audio Device(s) Asus Xonar U7 MKII
Power Supply System2 EVGA 750 Watt, System1 XFX XTR 750 Watt
Mouse Logitech G900 Chaos Spectrum
Keyboard Ducky
Software Archlinux, Manjaro, Win11 Ent 24h2
Benchmark Scores It's linux baby!
Decided I'd try to play with this under Arch. I'm guessing my Maxwell card can't be used. Had to turn off GPU offloading for it to work, but it's quite slow: it took 24 seconds to answer "is the world round?", heh. I guess I have to download extra LLMs manually and add them?
Heh, nope: go to Hugging Face, click on "use this model", and it will download inside of LM Studio. Pretty slick!
 
Joined
Nov 23, 2023
Messages
40 (0.09/day)
Cydonia-22B-v1.2-GGUF.
Made a second round with the same models, but with a fresh start, not opening any other programs.
I am not sure why my system gets this "slow" with AI:
DS 14B Q8: from 42 -> 73 tokens/s
Phi4 Q8: from 42, still 42 tokens/s
DS 32B Q6: from 3.2 to 3.3 tokens/s
Llama3.3 70B Q3: from 1.9 to 2.6 tokens/s
I understand that the browser uses some VRAM, but how is it that LM Studio doesn't take priority and offload data not currently in use to system memory?

Same on mine, even with a fresh start - strange.
Llama 3.3 performance also looks strange.
You should definitely be getting faster speeds if your VRAM can hold it. I don't know how it works in LMS, but in Kobold you just use --lowvram and offload as many layers as possible. If the Q6 doesn't fit in VRAM, try the smallest Q4 model instead. You should also use the newer i-quants if you've got to go down to the Q3 quants and below.
I'm getting 4.39 T/s generated with fifty layers offloaded and --lowvram on DeepSeek-R1-Distill-Qwen-32B-IQ4_XS; your hardware should be able to do at least this much.
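For the record, launching it that way is basically a one-liner; a sketch (model path and layer count are illustrative, the flags are koboldcpp's):

```python
# Sketch: start koboldcpp with partial GPU offload and --lowvram, which
# keeps the KV cache in system RAM so more model layers fit in VRAM.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "DeepSeek-R1-Distill-Qwen-32B-IQ4_XS.gguf",  # example path
    "--gpulayers", "50",  # number of layers offloaded to the GPU
    "--lowvram",
])
```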

Decided I'd try to play with this under Arch. I'm guessing my Maxwell card can't be used. Had to turn off GPU offloading for it to work, but it's quite slow: it took 24 seconds to answer "is the world round?", heh. I guess I have to download extra LLMs manually and add them?
Heh, nope: go to Hugging Face, click on "use this model", and it will download inside of LM Studio. Pretty slick!
Your cards should be fine, actually. If you've got compute capability 5.0 or higher (Maxwell is 5.2), you can use your cards. Make sure to use --lowvram in Kobold, or whatever the equivalent is in your tool, to get better performance.
 
Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
@Makaveli You are right, it's easier to have a screencap; I just also wanted easy access for others, so I linked the ones I use more often!
(screenshot attachment)

You should definitely be getting faster speeds if your VRAM can hold it. Don't know how it works in LMS, but in Kobold you just use --lowvram and offload as many layers as possible. If the Q6 doesn't fit in VRAM, try out the smallest Q4 model instead. You should also use the newer i-quants if you've gotta go down to the Q3 quants and below.
I'm getting 4.39T/s generated with fifty layers offloaded --lowvram on DeepSeek-R1-Distill-Qwen-32B-IQ4_XS, your hardware should be able to do at least this much.
I never use Q3 quants; Q4 is the lowest, but those are also not that accurate, so Q6 or better is what I prefer.
Those slowdowns happen when I have many browser windows with tabs open, which should not really affect my free VRAM, since those are not in the foreground.
Maybe AMD should spend more in their GPU driver department...
 
Joined
Nov 23, 2023
Messages
40 (0.09/day)
I never use Q3 quants; Q4 is the lowest, but those are also not that accurate, so Q6 or better is what I prefer.
Oh, it says Q3 for the 70B in your post, so I thought it was that. I'm VRAM-constrained and the Q4s perform the best for the size anyway, so I just default to IQ4_XS.

I've seen a couple of crazy people happily running exl2 q2 models of Llama 70B; I'd say it's worth a shot since it's so much faster. Larger models also hold up better at lower quants.

I agree with AMD needing to spend more time on their software, though. The instability sucks hard when you're generating while using an AMD card as your display output. I don't get these problems with NVIDIA or Intel.
 
Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
System Name Kincsem
Processor AMD Ryzen 9 9950X
Motherboard ASUS ProArt X870E-CREATOR WIFI
Cooling Be Quiet Dark Rock Pro 5
Memory Kingston Fury KF560C32RSK2-96 (2×48GB 6GHz)
Video Card(s) Sapphire AMD RX 7900 XT Pulse
Storage Samsung 990PRO 2TB + Samsung 980PRO 2TB + FURY Renegade 2TB+ Adata 2TB + WD Ultrastar HC550 16TB
Display(s) Acer QHD 27"@144Hz 1ms + UHD 27"@60Hz
Case Cooler Master CM 690 III
Power Supply Seasonic 1300W 80+ Gold Prime
Mouse Logitech G502 Hero
Keyboard HyperX Alloy Elite RGB
Software Windows 10-64
Benchmark Scores https://valid.x86.fr/9qw7iq https://valid.x86.fr/4d8n02 X570 https://www.techpowerup.com/gpuz/g46uc
Oh, it says Q3 for the 70B in your post, so I thought it was that. I'm VRAM-constrained and the Q4s perform the best for the size anyway, so I just default to IQ4_XS.

I've seen a couple of crazy people happily running exl2 q2 models of Llama 70B; I'd say it's worth a shot since it's so much faster. Larger models also hold up better at lower quants.

I agree with AMD needing to spend more time on their software, though. The instability sucks hard when you're generating while using an AMD card as your display output. I don't get these problems with NVIDIA or Intel.
Yeah, it says Q3 in the second post - it is a typo, my bad!
At least it was correct in the very first post :D

Made a little test and I don't get it. I was thinking that more layers on the GPU equals better performance, but I got this:
21/80 layers - 1.49 tokens/s
23/80 - 1.52
24/80 - 1.55 -> peak performance
25/80 - 1.42
27/80 - 1.29
28/80 - 1.23 -> max layers that can be loaded on the GPU
How is this a thing?
(screenshot attachment)
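If anyone wants to reproduce that sweep outside LM Studio, here's a rough sketch with llama-cpp-python (model path and prompt are placeholders; LM Studio's internal settings may differ):

```python
# Sketch: time generation speed at different n_gpu_layers values to
# reproduce the peak-then-decline curve above.
import time
from llama_cpp import Llama

PROMPT = "Summarize the history of the x86 architecture."

for n in (21, 23, 24, 25, 27, 28):
    llm = Llama(model_path="llama-3.3-70b-Q4_K_M.gguf",  # placeholder path
                n_gpu_layers=n, verbose=False)
    t0 = time.time()
    out = llm(PROMPT, max_tokens=128)
    tokens = out["usage"]["completion_tokens"]
    print(f"{n}/80 layers: {tokens / (time.time() - t0):.2f} tokens/s")
    del llm  # release the model (and its VRAM) before the next configuration
```

A common explanation for the dip is that the last few layers only "fit" by pushing other buffers out of VRAM, so the driver starts swapping, which costs more than the extra offloaded layers save.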
 
Joined
Nov 23, 2023
Messages
40 (0.09/day)
Yeah, it says Q3 in the second post - it is a typo, my bad!
At least it was correct in the very first post :D

Made a little test and I don't get it. I was thinking that more layers on the GPU equals better performance, but I got this:
21/80 layers - 1.49 tokens/s
23/80 - 1.52
24/80 - 1.55 -> peak performance
25/80 - 1.42
27/80 - 1.29
28/80 - 1.23 -> max layers that can be loaded on the GPU
How is this a thing?
View attachment 384810
It might be because the card's doing swapping or something, if you're also using it for video out. Does LMS not have a "no KV offload" option? I think the 7000 series might be able to do flash attention too, so you could get some better performance if you turn it on.
I feel like you could be getting 2 T/s or so.
 
Joined
Jan 14, 2019
Messages
14,397 (6.47/day)
Location
Midlands, UK
Processor Various Intel and AMD CPUs
Motherboard Micro-ATX and mini-ITX
Cooling Yes
Memory Overclocking is overrated
Video Card(s) Various Nvidia and AMD GPUs
Storage A lot
Display(s) Monitors and TVs
Case It's not about size, but how you use it
Audio Device(s) Speakers and headphones
Power Supply 300 to 750 W, bronze to gold
Mouse Wireless
Keyboard Mechanic
VR HMD Not yet
Software Linux gaming master race
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
 

the54thvoid

Super Intoxicated Moderator
Staff member
Joined
Dec 14, 2009
Messages
13,295 (2.40/day)
Location
Glasgow - home of formal profanity
Processor Ryzen 7800X3D
Motherboard MSI MAG Mortar B650 (wifi)
Cooling be quiet! Dark Rock Pro 4
Memory 32GB Kingston Fury
Video Card(s) Gainward RTX4070ti
Storage Seagate FireCuda 530 M.2 1TB / Samsung 960 Pro M.2 512GB
Display(s) LG 32" 165Hz 1440p GSYNC
Case Asus Prime AP201
Audio Device(s) On Board
Power Supply be quiet! Pure Power M12 850W Gold (ATX3.0)
Software W10
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?

I think for the purposes Google used to serve (ask a question - get an answer, not a sales pitch). I've used ChatGPT for grammar-related queries, tax and wage info, etc. It's useful if you have a valid question.

BUT - this thread is for locally-hosted LLM discussion, not LLM use in general.
 
Joined
Mar 11, 2008
Messages
1,085 (0.18/day)
Location
Hungary / Budapest
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
It is great for checking grammar.
When I don't quite know what to say, it helps me find the way.
When I think of something and wish to test it quickly.
There are quite a few use cases for general users, but if you are here reading this topic, you are likely not a "default user" anyway.

It might be because the card's doing swapping or something if you're also using it for video out. Does LMS not have a no KV offload option? I think the 7000 series might be able to do flash attention too, so you could get some better performance if you turn it on.
I feel like you could be getting 2T/s or so.
I am watching YouTube in the background, but still, it should be a linear toll on performance.
 
Joined
Nov 23, 2023
Messages
40 (0.09/day)
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
AI girlfriends :pimp:
It is great for checking grammar.
When I don't quite know what to say, it helps me find the way.
When I think of something and wish to test it quickly.
There are quite a few use cases for general users, but if you are here reading this topic, you are likely not a "default user" anyway.


I am watching YouTube in the background, but still, it should be a linear toll on performance.
Eh, well, testing it out on my end I'm getting around 1.3T/s with --lowvram and 25 layers offloaded with the IQ4_XS version. Looks like my hunch was wrong. I've got only around 15GB of VRAM I can use right now though, you might get different results.
 
Joined
Jan 14, 2019
Messages
14,397 (6.47/day)
Location
Midlands, UK
I think for purposes Google used to serve (ask a question - get an answer, not a sales pitch). I've used ChatGPT for grammar related queries, tax and wage info etc. It's useful if you have a valid question.

BUT - this thread is for locally-hosted based LLM discussion, not LLM use in general.
That's what I mean. I get the idea of Google AI search and ChatGPT (even though they seem to give inaccurate answers from time to time), but why would you want to run something like that locally?

It is great for checking grammar.
When I don't quite know what to say, it helps me find the way.
How is it better at grammar than already existing solutions that we've been using for decades?

When I think of something and wish to test it quickly.
What "something" do you test with it?

There are quite a few use cases for general users, but if you are here reading this topic, you are likely not a "default user" anyway.
I'd like to believe so. :)

I'm just trying to find a use case for it. I'm not nitpicking, really. :ohwell:
 
Joined
Jun 22, 2012
Messages
322 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
That's what I mean. I get the idea of Google AI search and ChatGPT (even though they seem to give inaccurate answers from time to time), but why would you want to run something like that locally?

For me it's a combination of many things:

- Privacy (e.g. not wanting to give your phone number and other personal details to other companies, or perhaps you want to have the LLM process personal/confidential information that you don't want or can't send online, etc).
- Much faster response times.
- No risk of getting your account banned or worse if you play around with the model or try to get around its "guardrails" ⇒ customizability (e.g. perhaps I don't want a boring-sounding AI assistant, or I want it to be mean, etc.)
- Always available on your PC even when you don't have an Internet connection.

I'm currently moderately satisfied with Mistral-Small-24B-Instruct-2501, which got released a couple of weeks ago.
 
Joined
Jan 14, 2019
Messages
14,397 (6.47/day)
Location
Midlands, UK
For me it's a combination of many things:

- Privacy (e.g. not wanting to give your phone number and other personal details to other companies, or perhaps you want to have the LLM process personal/confidential information that you don't want or can't send online, etc).
That sounds interesting. Can you explain? :)

- No risk of getting your account banned or worse if you play around with the model or try to get around its "guardrails" ⇒ customizability (e.g. perhaps I don't want a boring-sounding AI assistant, or I want it to be mean, etc.)
What do you mean? What account?

- Always available on your PC even when you don't have an Internet connection.
Doesn't it need the internet to get the answers?
 
Joined
Jun 22, 2012
Messages
322 (0.07/day)
Most sufficiently capable cloud-based LLMs require you to create a personal account on their website and often (always?) to give them your phone number. This is supposedly for preventing abuse (service rate limits, criminal or disallowed content generation, etc).

Local LLMs do not need Internet access to respond to your queries. They do their processing locally on your GPU or your CPU (if you're very patient).
 