
What local LLMs do you use?

Joined
Nov 23, 2023
Messages
358 (0.71/day)
IIRC, you cannot 'spread' LLM workloads across separate VRAM segments.
You're mistaken. Layers are not swapped between GPUs by default; each card does its own thing.
I know. I've given up on looking towards AM5 until a new PROMONTORY chipset is released. The standardized PCIe fanout on AM5 is unironically, literally, and un-amusingly braindead.
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.

Everybody who worked on X870 should get fired.
 
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
 
Joined
Mar 11, 2008
Messages
1,211 (0.19/day)
Location
Hungary / Budapest
System Name Kincsem
Processor AMD Ryzen 9 9950X
Motherboard ASUS ProArt X870E-CREATOR WIFI
Cooling Be Quiet Dark Rock Pro 5
Memory Kingston Fury KF560C32RSK2-96 (2×48GB DDR5-6000)
Video Card(s) Sapphire AMD RX 7900 XT Pulse
Storage Samsung 990PRO 2TB + Samsung 980PRO 2TB + FURY Renegade 2TB + Adata 2TB + WD Ultrastar HC550 16TB
Display(s) Acer QHD 27"@144Hz 1ms + UHD 27"@60Hz
Case Cooler Master CM 690 III
Power Supply Seasonic 1300W 80+ Gold Prime
Mouse Logitech G502 Hero
Keyboard HyperX Alloy Elite RGB
Software Windows 10-64
Benchmark Scores https://valid.x86.fr/9qw7iq https://valid.x86.fr/4d8n02 X570 https://www.techpowerup.com/gpuz/g46uc
5800X3D + 7900XTX + 9070XT
Choose your PSU for the max wattage of the planned rig, plus 20-25% headroom.
So for that, maybe consider something larger than 1000W.

The ProArt X670 or X870 is great if you want to use multiple GPUs.
But dividing LLMs between multiple GPUs is not that easy.
 

johnspack

Here For Good!
Joined
Oct 6, 2007
Messages
6,069 (0.95/day)
Location
Nelson B.C. Canada
System Name System2 Blacknet , System1 Blacknet2
Processor System2 Threadripper 1920x, System1 2699 v3
Motherboard System2 Asrock Fatality x399 Professional Gaming, System1 Asus X99-A
Cooling System2 Noctua NH-U14 TR4-SP3 Dual 140mm fans, System1 AIO
Memory System2 64GB DDR4-3000, System1 32GB DDR4-2400
Video Card(s) System2 GTX 980Ti System1 GTX 970
Storage System2 4x SSDs + NVMe = 2.25TB, 2x storage drives = 8TB; System1 3x SSDs = 2TB
Display(s) 1x27" 1440 display 1x 24" 1080 display
Case System2 Some Nzxt case with soundproofing...
Audio Device(s) Asus Xonar U7 MKII
Power Supply System2 EVGA 750 Watt, System1 XFX XTR 750 Watt
Mouse Logitech G900 Chaos Spectrum
Keyboard Ducky
Software Archlinux, Manjaro, Win11 Ent 24h2
Benchmark Scores It's linux baby!
Well, that's nice: Koboldcpp now supports Gemma 3. Running gemma-3-4b-it-bf16 and it's screaming fast. Bigger models still beat up on my system, but that's expected.
 
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
But dividing LLMs between multiple GPUs is not that easy.
What do you mean? Something like ollama does this by default as soon as you load any model that's bigger than the free VRAM of a single GPU. I even posted some benchmark numbers some pages back with my 2x 3090s and some larger models.
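If you want to see or control the split explicitly instead of trusting the defaults, llama.cpp (which ollama wraps) exposes the same layer splitting directly. A minimal sketch, assuming two cards and an example model path (check llama-server --help on your build):

llama-server -m ./models/llama-3.3-70b-q4_k_m.gguf -ngl 99 --split-mode layer --tensor-split 1,1

--split-mode layer hands whole layers to each GPU, so there's no per-token PCIe traffic beyond the activations at the split point, and --tensor-split 1,1 balances the layers 50/50 across the two cards.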
 
Joined
Aug 20, 2007
Messages
22,077 (3.43/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen 9 9950X
Motherboard MSI MAG X670E Tomahawk Wifi
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 128GB (4x 32GB) G.Skill Flare X5 @ DDR5-4000 (Running 1:1:1 w/FCLK)
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel Optane P5800X 800GB boot, + 2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs, 1x 2TB Seagate Exos 3.5"
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64
Joined
Apr 18, 2019
Messages
2,778 (1.27/day)
Location
PNW, USA
System Name Metalia
Processor AMD Ryzen 7 5800X3D
Motherboard Asus TUF Gaming X570-PLUS
Cooling ID Cooling 280mm AIO w/ Arctic P14s
Memory 2x32GB DDR4-3600
Video Card(s) Sapphire Pulse RX 9070 XT
Storage Optane P5801X 400GB, Samsung 990Pro 2TB
Display(s) LG ‎32GS95UV 32" OLED 240/480hz 4K/1080P Dual Mode
Case Geometric Future M8 Dharma
Audio Device(s) Xonar Essence STX
Power Supply Seasonic Focus GX-1000 Gold
Mouse Attack Shark R3 Magnesium - White
Keyboard Keychron K8 Pro - White - Tactile Brown Switch
Software Windows 10 IoT Enterprise LTSC 2021
Lact is a better OC/monitoring tool for AMD IMO. Also supports Intel cards oddly enough.
I have a 10GB B570 that I irresponsibly pressed 'buy it now' on late last month :laugh: (I also still have my RX 7900 GRE Nitro+; that'd fit better as a secondary, but it'd be worse than the XTX.)
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (à la SR-IOV, etc.)?
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I have a 10GB B570 that I irresponsibly pressed 'buy it now' on late last month :laugh: (I also still have my RX 7900 GRE Nitro+; that'd fit better as a secondary, but it'd be worse than the XTX.)
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (à la SR-IOV, etc.)?
They're trash for LLMs; prompt processing takes forever with Vulkan. SD performance should be respectable though, probably faster than any RDNA 2 or older card.
 
Joined
Apr 18, 2019
Messages
2,778 (1.27/day)
Location
PNW, USA
System Name Metalia
Processor AMD Ryzen 7 5800X3D
Motherboard Asus TUF Gaming X570-PLUS
Cooling ID Cooling 280mm AIO w/ Arctic P14s
Memory 2x32GB DDR4-3600
Video Card(s) Sapphire Pulse RX 9070 XT
Storage Optane P5801X 400GB, Samsung 990Pro 2TB
Display(s) LG ‎32GS95UV 32" OLED 240/480hz 4K/1080P Dual Mode
Case Geometric Future M8 Dharma
Audio Device(s) Xonar Essence STX
Power Supply Seasonic Focus GX-1000 Gold
Mouse Attack Shark R3 Magnesium - White
Keyboard Keychron K8 Pro - White - Tactile Brown Switch
Software Windows 10 IoT Enterprise LTSC 2021
SD performance should be respectable though, probably faster than any RDNA 2 or older card.
Relevant. I recently bought a Navi 24 6500XT 8GB for my collection, potentially as (secondary) Display/scaling card. Guess the B570 might get used, instead of sitting.

Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
Looking at the 'retail-available' AI/ML-leaning cards these last few gens (and actual market pricing)...
They kind of expect that people who NEED the expansion will pay for it. IMHO, yet again we got generational feature regression, industry-wide.
 
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
Relevant. I recently bought a Navi 24 6500XT 8GB for my collection, potentially as (secondary) Display/scaling card. Guess the B570 might get used, instead of sitting.
Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
Ehh, don't really care about that personally, but I'm sure there are some X670E boards that could do it.
 
Joined
Apr 18, 2019
Messages
2,778 (1.27/day)
Location
PNW, USA
System Name Metalia
Processor AMD Ryzen 7 5800X3D
Motherboard Asus TUF Gaming X570-PLUS
Cooling ID Cooling 280mm AIO w/ Arctic P14s
Memory 2x32GB DDR4-3600
Video Card(s) Sapphire Pulse RX 9070 XT
Storage Optane P5801X 400GB, Samsung 990Pro 2TB
Display(s) LG ‎32GS95UV 32" OLED 240/480hz 4K/1080P Dual Mode
Case Geometric Future M8 Dharma
Audio Device(s) Xonar Essence STX
Power Supply Seasonic Focus GX-1000 Gold
Mouse Attack Shark R3 Magnesium - White
Keyboard Keychron K8 Pro - White - Tactile Brown Switch
Software Windows 10 IoT Enterprise LTSC 2021
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
A lil off topic, but I'm curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
Merely because I'm aware of the severity there.

Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
Crud. Was hoping to slot it in as a secondary.
Could that be worked around if I can figure out how to directly (and exclusively) pass the card through to a VM running w/in Linux? (In Windows, it'd be a secondary display adapter and scaler card.)
Edit: just needs to be the only card *when installed initially*. Got it.
Ehh, don't really care about that personally, but I'm sure there are some X670E boards that could do it.
Pedantically browsing the oft-circulated AM5 board-list Google Doc, one can find more than a couple of decently expandable mobos.
More often than not, though, they're OoS or no longer retail-available.
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
Crud. Was hoping to slot it in as a secondary.
Could that be worked-around if I can figure out how to directly (and exclusively) pass-through the card to a VM running w/in Linux? (in windows, it'd be a secondary display adapter and scaler card)
It only needs to be the only card during install. After install, you're fine running it as a secondary with --use-ipex.
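For anyone following along, a minimal sketch of that with stable-diffusion-webui (the webui-user launcher names are the stock ones; treat the rest as an assumption about your setup):

set COMMANDLINE_ARGS=--use-ipex        (Windows, in webui-user.bat)
COMMANDLINE_ARGS="--use-ipex" ./webui.sh        (Linux)

Do that first launch with the Arc card as the only GPU so the IPEX packages actually land in the venv; afterwards the flag alone is enough with the card as a secondary.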
 
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
Ehh, don't really care about that personally, but I'm sure there are some X670E boards that could do it.
There aren't many options for AM5 to begin with, sadly.
There are some other options, but they're both even more expensive in my region and have a worse PCIe layout.
A lil off topic, but I'm curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
Merely because I'm aware of the severity there.
I guess Australia has it worse, from what I've seen in the 5000 posts from some Aussies. The major problem is that the minimum wage here is way lower.
To put it into perspective: I paid ~$750 for my X670E ProArt, and a 5090 goes for ~$4k currently.
 

johnspack

Here For Good!
Joined
Oct 6, 2007
Messages
6,069 (0.95/day)
Location
Nelson B.C. Canada
System Name System2 Blacknet , System1 Blacknet2
Processor System2 Threadripper 1920x, System1 2699 v3
Motherboard System2 Asrock Fatality x399 Professional Gaming, System1 Asus X99-A
Cooling System2 Noctua NH-U14 TR4-SP3 Dual 140mm fans, System1 AIO
Memory System2 64GB DDR4-3000, System1 32GB DDR4-2400
Video Card(s) System2 GTX 980Ti System1 GTX 970
Storage System2 4x SSDs + NVMe = 2.25TB, 2x storage drives = 8TB; System1 3x SSDs = 2TB
Display(s) 1x27" 1440 display 1x 24" 1080 display
Case System2 Some Nzxt case with soundproofing...
Audio Device(s) Asus Xonar U7 MKII
Power Supply System2 EVGA 750 Watt, System1 XFX XTR 750 Watt
Mouse Logitech G900 Chaos Spectrum
Keyboard Ducky
Software Archlinux, Manjaro, Win11 Ent 24h2
Benchmark Scores It's linux baby!
Does anyone know the best text/vision model combo to use on a lower-end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
Does anyone know the best text/vision model combo to use on a lower-end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
If it's slow, it's because you're using too big a quant or you're not offloading properly.

Turn on flash attention, set the GPU layers manually, use --lowvram, drop to a lower quant, and repeat until your speed improves.

I seriously hope you're using the bf16 version of the mmproj.
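Something like this as a starting point; a sketch only, since exact flags vary by Koboldcpp version (check python koboldcpp.py --help) and the quant filenames here are examples:

python koboldcpp.py --model gemma-3-12b-it-q4_k_m.gguf --mmproj gemma-12b-mmproj-bf16.gguf --gpulayers 24 --flashattention --usecublas lowvram --contextsize 4096

Lower --gpulayers until it stops OOMing; note that on some builds low-VRAM mode is the lowvram argument to --usecublas rather than a standalone --lowvram flag.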
 
Joined
Feb 12, 2025
Messages
28 (0.47/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
Turns out it's really disappointing for its price: it's 256-bit at 8533 MT/s, so 273GB/s.

Bummer.
Pay $3k to get 273GB/s... I've lost my interest in the DGX Spark.

Does anyone know the best text/vision model combo to use on a lower-end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
I'd personally try it with ollama and gemma3:12b (Q4_K_M) and analyze an image file from the command line:
ollama run gemma3:12B --verbose
>>> describe this image C:\path\to\myimage.jpeg
(or /home/myimage.png etc.)
If performance is satisfactory, slap a GUI on top of ollama: Open WebUI for container folks, or Msty for those who like to install stuff. That way you can analyze image files directly from the UI; Open WebUI even has a small ChromaDB under the hood to speed up file analysis. Super useful IMHO.
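For the container folks, the Open WebUI quickstart is roughly this one-liner (from their README as of writing; adjust the port and volume to taste):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

It finds the ollama instance on the host and gives you image attachments and document RAG in the browser.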

Edit: now I got curious about gemma3:12b image-analysis performance with my 7800XT. Attaching the image it analyzed.

>>> what is in this image C:\temp\SC2-UltronJunk.png
Added image 'C:\temp\SC2-UltronJunk.png'
Here's a breakdown of what's in the image:

* **Character:** A man in a futuristic uniform is standing with his arms crossed. He appears to be a character
from a video game, likely a space-themed one.
* **Background:** The background shows a view through a window, revealing a planet or celestial body with a
reddish hue.
* **Text:** There's text overlayed on the image, which reads: "SUMMARY: In the future, please don't waste our
valuable time on this so-called 'Ultron'."
* **Interface Elements:** There are interface elements visible, including a "Starbase" label and other icons,
suggesting this is a screenshot from a video game.

The overall impression is that this is a screenshot from a space-themed video game, possibly a strategy or
simulation game.

total duration: 11.226313s
load duration: 3.681836s
prompt eval count: 667 token(s)
prompt eval duration: 2.6451341s
prompt eval rate: 252.16 tokens/s
eval count: 179 token(s)
eval duration: 4.8552628s
eval rate: 36.87 tokens/s
 

Attachments

  • SC2-UltronJunk.png (376.7 KB)
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
Pay $3k to get 273GB/s... I've lost my interest in the DGX Spark.
It gets worse: Nvidia's version with 4TB of storage goes for $4k. Asus does have a 1TB model that goes for $3k.
 

johnspack

Here For Good!
Joined
Oct 6, 2007
Messages
6,069 (0.95/day)
Location
Nelson B.C. Canada
System Name System2 Blacknet , System1 Blacknet2
Processor System2 Threadripper 1920x, System1 2699 v3
Motherboard System2 Asrock Fatality x399 Professional Gaming, System1 Asus X99-A
Cooling System2 Noctua NH-U14 TR4-SP3 Dual 140mm fans, System1 AIO
Memory System2 64GB DDR4-3000, System1 32GB DDR4-2400
Video Card(s) System2 GTX 980Ti System1 GTX 970
Storage System2 4x SSDs + NVMe = 2.25TB, 2x storage drives = 8TB; System1 3x SSDs = 2TB
Display(s) 1x27" 1440 display 1x 24" 1080 display
Case System2 Some Nzxt case with soundproofing...
Audio Device(s) Asus Xonar U7 MKII
Power Supply System2 EVGA 750 Watt, System1 XFX XTR 750 Watt
Mouse Logitech G900 Chaos Spectrum
Keyboard Ducky
Software Archlinux, Manjaro, Win11 Ent 24h2
Benchmark Scores It's linux baby!
Yep, using minicpm and the matching minicpm-mmproj-f16 model; many times faster for images. Quite the learning curve...
Now running minicpm ggml-model-f16 with mmproj-model-f16: still really fast, but smarter. It can also do handwriting recognition.
Need to test that a bit more.
 
Joined
Feb 12, 2025
Messages
28 (0.47/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's "only" a $32,000-$40,000 GPU. Hope there will be a distill for regular mortals too.
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's "only" a $32,000-$40,000 GPU. Hope there will be a distill for regular mortals too.
Just use the GGUF when it comes out; MoE models run well on limited VRAM. The H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed; a 17Bx4E might be really usable.
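Newer llama.cpp builds can even pin just the expert tensors to system RAM while the always-active shared layers stay on GPU, which is the cheap way to run big MoEs. A sketch, with a hypothetical GGUF filename (needs a build with -ot / --override-tensor):

llama-server -m Llama-4-Scout-17Bx16E-Q4_K_M.gguf -ngl 99 -ot "exps=CPU"

The -ot pattern is a regex matched against tensor names, so "exps=CPU" catches the MoE expert weights and leaves them in RAM while everything else is offloaded.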
 
Joined
May 10, 2023
Messages
824 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's "only" a $32,000-$40,000 GPU. Hope there will be a distill for regular mortals too.
Even a distill would be too big: at 109B params it would still require over 50GB of VRAM at Q4. If you move down to Q2 or something like that, it becomes doable with just a couple of 24GB GPUs, but at that point you have to wonder whether the quality will be any good.
Just use the GGUF when it comes out; MoE models run well on limited VRAM. The H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed; a 17Bx4E might be really usable.
Yeah, the problem is just getting the selected experts into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
 
Joined
Feb 12, 2025
Messages
28 (0.47/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
I'm pretty sure there will be some sort of reduced-experts distill of Llama 4 for LLM enthusiasts with gaming GPUs.

I have one relativistic-speed spaceship exercise I give LLMs to solve. Most cloud-based >100B models get it right, including the 405B-parameter Llama 3.1. I expected Llama 4 Maverick to solve this easily, but it only got half the exercise right (it thought time flows slower for the observer, not for the crew). Llama 4 Scout was totally off and gave very different answers each time. Even my local 27B Gemma 3 solved the problem more accurately :eek: Sometimes models get (partially) dumber; the same happened with Gemini 1.5 Pro and 2.0 Pro.
1.5 gave a flawless answer, while 2.0 was off by 182% on its first calculation and 598% on its second for that exercise (Edit: Gemini 2.5 Pro Experimental got the answers right again).
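For reference, the bit they keep fumbling is plain special relativity: the crew's elapsed (proper) time is t_crew = t_observer × sqrt(1 − v²/c²), so at v = 0.8c the crew ages only 0.6× as much as the outside observer. It's the moving clock that runs slow, not the observer's.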

The H100 is just 80GB, pretty easy to hit with a couple of 3090s
Well, the H100 has 2.04 TB/s of bandwidth across that 80GB of VRAM; AMD's MI300X is even more bonkers at 5.3 TB/s over 192GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by the PCIe link speed.
 
Joined
Nov 23, 2023
Messages
358 (0.71/day)
Yeah, the problem is just getting the selected experts into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
A Q4 17B should fit pretty easily in even 16GB of VRAM; it shouldn't be a problem. Processing/generation should outweigh load/unload.
Well, the H100 has 2.04 TB/s of bandwidth across that 80GB of VRAM; AMD's MI300X is even more bonkers at 5.3 TB/s over 192GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by the PCIe link speed.
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
 
Joined
Feb 12, 2025
Messages
28 (0.47/day)
Location
EU
Processor AMD 5600X
Motherboard ASUS TUF GAMING B550M-Plus WiFi
Cooling be quiet! Dark Rock 4
Memory G.Skill Ripjaws 2 x 32 GB DDR4-3600 CL18-22-22-42 1.35V F4-3600C18D-64GVK
Video Card(s) Sapphire Pulse RX 7800XT 16GB
Storage Kingston KC3000 2TB + QNAP TBS-464
Display(s) LG 35" LCD 35WN75C-B 3440x1440
Case Kolink Bastion RGB Midi-Tower
Power Supply Enermax Digifanless 550W
Mouse Razer Deathadder v2
Benchmark Scores phi4 - 42.00 tokens/s
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
When it comes to inferencing current models, the bottleneck is VRAM bandwidth; GPU or CPU compute is almost irrelevant. As observed via the ollama ps command, tps drops dramatically when even a few % of the model is forced to run from system RAM, which is ~10x slower than VRAM.
LLMs have billions of parameters, and each inference pass has to load most or all of them. It's like billions of neurons firing and communicating with each other while thinking: it creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. At some point, once memory gets fast enough, compute will need to catch up, but due to their historical design (calculating frames is more compute-intensive than bandwidth-intensive) GPUs today are bandwidth-starved in LLM inference runs.
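A back-of-the-envelope check (my arithmetic, not a benchmark): each generated token has to stream all active weights through memory once, so the ceiling is roughly tps ≈ memory bandwidth / model size in bytes. gemma3:12b at Q4 is about 8GB, so my 7800XT's ~624GB/s puts the ceiling near 78 tps, and the 36.87 tps I measured above fits under that once real-world overheads are counted; the same model on the DGX Spark's 273GB/s would cap around 34 tps.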
 