Comment by Uehreka - Hacker Neue

Uehreka Nov 6, 2025 parent

If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small model).

I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.

I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.

maciejgryka Nov 6, 2025

I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, a general coding agent is not!) it’s possible to train a smaller, specialized model that reaches the performance of a much larger generic model.

Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai, but other people say similar things, e.g. https://research.nvidia.com/labs/lpr/slm-agents/

XenophileJKO Nov 6, 2025

Actually there are ways you might get on device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.

One way is reusing weights in multiple decoder layers. This works and is used in many on-device models.

It is likely that we can get pretty high performance with this method. You can also combine this with low-parameter ways to create overlapped behavior on the same weights; people have done LoRA on top of shared weights.

Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.
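As a rough illustration of the weight-sharing idea above, here is a minimal parameter-count sketch. It assumes a simplified decoder block (four square attention projections plus a two-matrix MLP, no biases or norms) and made-up sizes; `block_params`, `lora_params`, `stack_params` and the config values are hypothetical and not taken from any particular on-device model.

```python
# Rough parameter-count sketch; every size below is a hypothetical example.

def block_params(d_model: int, d_ff: int) -> int:
    """Weights in one simplified decoder block (biases and norms ignored)."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 2 * d_model * d_ff           # up and down projections
    return attention + mlp


def lora_params(d_model: int, rank: int, n_adapted: int = 4) -> int:
    """Weights in LoRA adapters placed on n_adapted square projections."""
    # Each adapter is a pair of matrices: (d_model x rank) and (rank x d_model).
    return n_adapted * 2 * d_model * rank


def stack_params(n_layers: int, n_unique_blocks: int, d_model: int,
                 d_ff: int, lora_rank: int = 0) -> int:
    """Total weights when n_layers reuse n_unique_blocks blocks, plus per-layer LoRA."""
    shared = n_unique_blocks * block_params(d_model, d_ff)
    per_layer = n_layers * lora_params(d_model, lora_rank)
    return shared + per_layer


D_MODEL, D_FF, N_LAYERS = 2048, 8192, 24  # hypothetical config

# Every layer has its own weights.
unshared = stack_params(N_LAYERS, N_LAYERS, D_MODEL, D_FF)
# 24 layers cycle through 6 unique blocks; each layer gets its own rank-16 LoRA.
shared = stack_params(N_LAYERS, 6, D_MODEL, D_FF, lora_rank=16)

print(f"unshared:      {unshared / 1e6:.0f}M weights")
print(f"shared + LoRA: {shared / 1e6:.0f}M weights")
```

This only counts weights. It says nothing about how much quality survives the sharing, which is the open question in this thread.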

Edit: I believe this method is used a bit for models targeted at phones. I don't think we have seen significant work from people targeting, say, a 3090/4090 or similar inference compute size.

martinald Nov 6, 2025

The issue isn't even 'quality' per se (for many tasks a small model would do fine), it's that in "agentic" workflows it _quickly_ runs out of context. Even 32GB VRAM is really very limiting.

And when I say agentic, I mean something even like this - 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user etc. It's just not feasible on most hardware right now - even if the models are 1-2GB, you'll burn thru the rest in context so quickly.
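For a sense of scale, context memory is dominated by the KV cache, which grows linearly with the number of tokens. The sketch below uses the standard size estimate (keys plus values, per layer, per KV head) with a hypothetical small-model config: 28 layers, 8 KV heads, head dimension 128, fp16 values. Those numbers are assumptions for illustration, and activation memory and framework overhead are not counted.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimated KV-cache memory for one sequence of n_tokens."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical config, not a specific released model.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128

GIB = 1024 ** 3
for tokens in (10_000, 32_768, 131_072):
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>7} tokens: {size / GIB:.2f} GiB")
```

The result depends heavily on the attention layout: a model without grouped-query attention has as many KV heads as query heads, which multiplies these figures accordingly.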

HarHarVeryFunny Nov 6, 2025

Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change in the future to a sale or annual licence business model if running locally became possible.

I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this is needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog, is not of much use!).

Aurornis Nov 6, 2025

> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.

That's not a business model choice, though. That's a reality of running SOTA models.

If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.

Majromax Nov 6, 2025

> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.

cubefox Nov 7, 2025

> > If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

> First, they do this; that's why they release models at different price points.

No, those don't deliver the same output. The cheaper models are worse.

> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.

gunalx Nov 7, 2025

But they also squeezed out an 80% cut on O3 at some point, supposedly purely through inference or infra optimization.

anabis Nov 11, 2025

> delivering 97% of the performance at 10% of the cost is a distraction.

Not if you are running RL on that model, and need to do many roll-outs.

Uehreka OP Nov 6, 2025

No, I don’t think it’s a business model thing, I’m saying it may be a technical limitation of LLMs themselves. Like, that there’s no way to “order a la carte” from the training process, you either get the buffet or nothing, no matter how hungry you feel.

ctoth Nov 7, 2025

Unless you're programming a racing sim or maybe a CRUD app for a local Kennel Club, perhaps?

I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!
