No, we really don't. DeepSeek has shown pretty convincingly that a 6GB GPU or a Raspberry Pi 4 or 5 is enough to get the job done well. Your "request" isn't very applicable to the general consumer anyway.
DeepSeek hasn't proven anything of the sort. The model of theirs that's actually impressive is 671B params, which requires at least 350GB of VRAM/RAM to run; that's not modest (rough math below).
The models you're talking about, the ones that ran on a 6GB GPU and a Raspberry Pi, are the distilled models, which are based on existing models (Llama and Qwen).
Larger models of the same generation always have better quality than smaller ones.
Of course, as time goes on the smaller models improve as well, but so do their larger counterparts.
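For context, here's where numbers like "350GB" come from -- a back-of-envelope sketch in Python, weights only (the ~4.5 bits/param figure for Q4-style quants is an approximation, and KV cache plus runtime overhead come on top):

    # Rough weights-only memory estimate: params * bytes-per-param.
    # Ignores KV cache, activations and runtime overhead, so real usage is higher.
    def weights_gb(params_billions: float, bits_per_param: float) -> float:
        return params_billions * 1e9 * (bits_per_param / 8) / 1024**3

    for name, params, bits in [
        ("671B @ FP8", 671, 8.0),   # 1 byte per parameter
        ("671B @ Q4",  671, 4.5),   # ~4.5 bits/param is typical for Q4_K-style quants
        ("70B  @ Q4",   70, 4.5),
    ]:
        print(f"{name}: ~{weights_gb(params, bits):.0f} GB")
    # Prints roughly: 671B @ FP8 ~625 GB, 671B @ Q4 ~351 GB, 70B @ Q4 ~37 GB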
Seriously, how many people need AI at all? Hmm? What would they use it for?
(those are rhetorical questions, meaning they do not need answers)
With the above said, I do agree with those rhetorical questions. It's just too much entitlement for something that's a hobby.
If it's not a hobby, then one should have enough money to pony up for professional hardware.
Right now, Apple M chips with unified memory are crushing local LLM development. I think that as AI devs flock to Apple devices, Nvidia will react and release their N1X chips with unified memory, or start offering higher-capacity consumer cards.
Nvidia has that Digits product now.
The DeepSeek 14B model is capable of fitting into the 4090's framebuffer, but it's far inferior to the 32B model that's available (e.g. if you ask it to code a TypeScript website, it will create .jsx files and make a bunch of basic mistakes). IMO the 32B model is better than ChatGPT-4o, but it runs brutally slow and only uses about 60% of the GPU since it runs out of framebuffer -- I would love to be able to run the larger models.
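In case it helps: with a GGUF quant and llama.cpp (here via the llama-cpp-python bindings), you can offload only as many layers as fit in the 24GB and leave the rest on the CPU, which is roughly what that 60% utilisation reflects. A minimal sketch -- the file name and layer count are placeholders, not a recipe:

    # Partial GPU offload sketch (pip install llama-cpp-python).
    # Tune n_gpu_layers down until the model stops overflowing the 24GB framebuffer;
    # whatever isn't offloaded runs on the CPU (slower, but it works).
    from llama_cpp import Llama

    llm = Llama(
        model_path="deepseek-r1-distill-qwen-32b-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=48,  # placeholder; -1 offloads everything if it fits
        n_ctx=4096,       # bigger contexts grow the KV cache and eat more memory
    )
    out = llm("Write a TypeScript login form component.", max_tokens=256)
    print(out["choices"][0]["text"])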
Q4 quants are a thing. The problem comes with the 70B models; those end up requiring 2 GPUs even at Q4.
But even then you only have 24GB, or 32GB if you pony up $2k and can even find a 5090.
If you get a refurb M-chip Mac you can get 64GB of unified memory (~50-55GB usable to load the model into) for $2500, or over 85GB if you can get the 96GB version for $3200.
That's the same money you would spend building a 5090 rig. -- Granted, the models will run a lot slower, but if you're looking at RAM size, the M4 Max might be the best price/performance.
2x3090s should cost less than a 5090 and would give you 48GB, while being way faster than any M4 Max.
For >50GB models, yeah, going for unified memory is the most cost-effective way currently.
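For the 2x3090 route specifically, llama.cpp can spread a ~37GB Q4 70B across both cards. A hedged sketch via the llama-cpp-python bindings -- the file name and the 50/50 split are assumptions, tune to taste:

    # Sketch: splitting a Q4 70B model across two 24GB GPUs.
    from llama_cpp import Llama

    llm = Llama(
        model_path="deepseek-r1-distill-llama-70b-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=-1,          # offload all layers; the split below divides them
        tensor_split=[0.5, 0.5],  # roughly half of the weights per GPU
        n_ctx=4096,
    )
    print(llm("Explain the tradeoff between unified memory and discrete GPUs.",
              max_tokens=200)["choices"][0]["text"])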
If AMD really wanted to take some market share, they could do it with some high-VRAM cards; then people who don't want to pony up for Nvidia would have to put up with AMD's software stack and hopefully improve it.
Strix Halo with 128GB should fill this niche nicely.
Too bad it doesn't look like higher-RAM models will be available; a 192/256GB model would be hella cool.
Call me stupid, but I think I do need an answer, 'cause I haven't the faintest idea what someone would need LLMs for on a home PC.
I may not be your average user, but I use it as a coding assistant most of the time (with a mix of claude/gpt4 as well), and for some academic projects (some related to chatbots, others related to RAG stuff).
Using it as a "helper" while writing academic papers is pretty useful as well.