Yeah, it is discouraging that the free models are all pretty bad. None of them are all that great even at boilerplate. Copilot recognizes what I am trying to do and adjusts on the fly, which is pretty great for getting the boring stuff out of the way.
I have a sub to ChatGPT. I haven't had time to spin up any of the more VRAM-intensive models, but I did try some of the mid-range ones that require, say, north of 8 GB. They get better, but some, like DeepSeek, just aren't it for me. It seems to weight certain answers above others. I will try to explain.
For example, one of my tests is to ask for a batch file, and I specify batch in my prompt. DeepSeek would try to give me PowerShell every single time. Sure, PowerShell is more modern, better documented, and preferred today, and the script itself was fine, but I pretty much give it a -100 because I specified BATCH. It would eventually hand over batch, but only after I insisted a few more times.
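To give an idea of the level of test, it's nothing fancy. This isn't my exact prompt, but think along the lines of a throwaway backup script like:

@echo off
REM made-up illustration, not my actual test prompt:
REM copy newer files from one folder to another
xcopy "C:\projects" "D:\backup\projects" /D /E /Y

Trivial stuff, but instead of something like that, DeepSeek kept reaching for the PowerShell equivalent (a Copy-Item one-liner or similar) no matter how I asked.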
The others, StarCoder specifically, didn't have that "presumptuous" issue, but it started to fall flat when I asked it for more complex things.
I am hoping things improve once I have time to start running the really big models. Llama isn't bad, but it does start to poop out syntax errors when things get hard. It loved putting ";" where it doesn't belong.
That said, I think a lot of these come with a learning curve for the people using them too. You have to know how to "speak" to your model of choice.