I have an M4 Studio with a lot of unified memory and I'm still nowhere near running a 120B model. I'm at like 30B.
Apple or Nvidia is going to have to sell machines with 1.5 TB of RAM before benchmark performance is going to be comparable.
Plus, when you use Claude or OpenAI these days, it's performing Google searches etc. that my local model isn't doing.
I'm running a 400B-parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.
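For a sense of scale, here is the rough weights-only arithmetic (a back-of-envelope sketch; it ignores KV cache and activations, so real requirements are higher):

    # Memory needed just to hold the weights, in gigabytes.
    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(weights_gb(120, 16))  # 120B at FP16/BF16   -> ~240 GB
    print(weights_gb(120, 4))   # 120B at 4-bit quant -> ~60 GB
    print(weights_gb(400, 8))   # 400B at FP8         -> ~400 GB

Long contexts add KV cache on top of that, which is why even the biggest unified-memory configs run out of headroom fast.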
-
I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.
GPT-3.5 has better world knowledge than some 70B models, and even a few larger ones.
Without constantly refreshing the underlying LLM and the expert-system layer, these models would be outdated in months. Language and the underlying reality would shift out from under their representations, and they would rot quickly.
That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amounts of cash for as long as they want these models to remain relevant and useful.
"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.
I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/Claude 4 Sonnet/Opus.
So IMO, larger open-weights models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size. They're just quite small.
I'm more than a bit overwhelmed with what I've got on my plate and have completely missed the boat on, e.g., understanding what MLX is. Really curious for a thought dump if you have some opinionated experience/thoughts here (e.g., it never crossed my mind until now that you might get better results on the NPU than the GPU).
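For what it's worth, MLX is Apple's open-source array/ML framework for Apple silicon; it runs models on the GPU via Metal and, as far as I know, doesn't use the NPU/Neural Engine at all. The mlx-lm package wraps it for LLM inference. A minimal sketch, assuming an Apple-silicon Mac and a quantized model from the mlx-community hub (the model id is just an example):

    # pip install mlx-lm   (Apple silicon only)
    from mlx_lm import load, generate

    # Example 4-bit community conversion; any similar mlx-community repo works.
    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    print(generate(model, tokenizer, prompt="What is unified memory?", verbose=True))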
I should try Kimi K2 too.
You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.
Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking about the type of workflow where you push over a million tokens in a few minutes, over several steps.
That's where the moat, no, the chasm, between local setups and a public API lies.
No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.
That's where local models fold.
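To make the scale concrete, here's a sketch of that kind of pipeline (hypothetical file and step names; the OpenAI Python client is assumed, and any long-context hosted model would do):

    # Each step stuffs a large working set (100k+ tokens) into a single call.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    context = open("repo_dump.txt").read()  # hypothetical dump of code + logs
    steps = ["summarize the codebase", "map the call graph", "propose a refactor plan"]

    for step in steps:
        resp = client.chat.completions.create(
            model="gpt-4.1",  # placeholder; any long-context hosted model
            messages=[
                {"role": "system", "content": "You are one stage in an automated pipeline."},
                {"role": "user", "content": f"{step}\n\n{context}"},
            ],
        )
        context += "\n\n" + resp.choices[0].message.content  # output feeds the next step

A few steps like this and you're past a million tokens in minutes, and that's exactly the load local setups can't keep up with.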
Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.
You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.
I suppose it's an open question whether there is another free lunch coming, or whether the 30B models of a year from now won't be much better than our current ones.
If you asked "What's the best bicycle?", most enthusiasts would say: one you've tried, that works for your use case, etc.
Benchmarks should only be used at the absolute highest level, to prune which models you even try, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public benchmark data, generate a ton of synthetic examples, train on those, repeat).
Even if a model does poorly in all areas (like Llama 4 [0]), there is still a lot the community and industry can learn from an uncompetitive release.
[0] Llama 4 technically has a massive 10M-token context as a differentiator; however, in my experience it is not reliably usable beyond 100k.
Another reason people are 'hyped' for open models is that access to them cannot be taken away or price-gouged at the whim of the provider, and their use cannot be restricted in arbitrary ways, although I'm sure that on the latter point they will have a go at it through regulation.
Grab 'em while you can.
Not their proprietary model, but maybe other open-source models, or the closed-source models of their competitors. That way they can first ensure they are the only player on both sides, and then kneecap their open-source models just enough to drive revenue to their proprietary one.
I have Ollama installed (only a small proportion of their clients would have a large enough GPU for this) and have downloaded DeepSeek and played with it, but I still pay for an OpenAI subscription because I want the speed of a hosted model, to say nothing of luxuries like Codex's diffs/pull-request support, agents on new models, deep research, etc. I use them all at least weekly.
Are you using it every day for programming? If so, roughly how much does it cost you per month? More or less than $100?
Ah, this definitely makes sense! I do this myself and then paste back only the relevant part of the log so as to limit this. I suspect I'm being more conservative than others.
They are fully trying to be a consumer product, developer services be damned. But they can't just get rid of the API, because it's a good incremental source of revenue and, thanks to the Microsoft deal, if they dropped it that usage would simply be served through Azure and all the revenue would end up there. Maintaining their own API is basically just a way to keep a slice of that revenue.
But if they open-sourced everything, it might further sour the relationship with Microsoft, who would lose Azure revenue and might be willing to part ways. It would also mean they compete on consumer product quality, not (directly) model quality. At this point, they could basically put any decent model in their app and maintain the user base; they don't actually need to develop their own.