That just gave me an idea! I wonder how useful (and for what) a model would be if it was trained using a two-phase approach:
1) Put the training data through an embedding model to create a giant vector index of the entire Internet.
2) Train a transformer LLM, but instead of only utilising its weights, it can also do lookups against the index.
It's like an MoE where one (or more) of the experts is a fuzzy Google search.
The best thing is that adding up-to-date knowledge won’t require retraining the entire model!
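The two phases can be sketched in miniature. This is a toy, assuming a hypothetical `embed()` (here just deterministic hash-based token vectors standing in for a real embedding model) and a brute-force cosine-similarity index standing in for the "giant vector index"; `fuzzy_search_expert` is a made-up name for where a retrieval-augmented transformer would attend over retrieved chunks:

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding width; real models use hundreds to thousands of dims

def embed(text: str) -> np.ndarray:
    """Placeholder for the phase-1 embedding model: hash each token to a
    deterministic random vector and average. A real system would call a
    trained text-embedding model here."""
    vecs = []
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest(), 16) % (2**32)
        vecs.append(np.random.default_rng(seed).standard_normal(DIM))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

class VectorIndex:
    """Brute-force cosine-similarity search over embedded documents."""
    def __init__(self, docs):
        self.docs = list(docs)
        self.matrix = np.stack([embed(d) for d in self.docs])

    def search(self, query: str, k: int = 2):
        sims = self.matrix @ embed(query)  # cosine sims (vectors are unit-norm)
        top = np.argsort(-sims)[:k]
        return [(self.docs[i], float(sims[i])) for i in top]

    def add(self, doc: str):
        # The appeal: adding up-to-date knowledge is an index append,
        # not a retraining run.
        self.docs.append(doc)
        self.matrix = np.vstack([self.matrix, embed(doc)])

def fuzzy_search_expert(query: str, index: VectorIndex) -> str:
    """Phase-2 stand-in: where a retrieval-augmented transformer would
    condition its generation on retrieved chunks, this just returns the
    best-matching chunk."""
    doc, _score = index.search(query, k=1)[0]
    return doc

index = VectorIndex([
    "paris is the capital of france",
    "the transformer architecture relies on attention",
    "mount fuji is the tallest mountain in japan",
])
print(fuzzy_search_expert("what is the capital of france", index))
```

In a real system the nearest-neighbour search would be approximate (an ANN index rather than a dense matrix multiply), and the retrieved chunks would feed into the model's attention rather than being returned verbatim.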
If OpenAI is spending $500B, then someone could get ahead by spending $1B on an approach that improves the model by >0.2%.
I bet there's a group or three that could improve results a lot more than 0.2% with $1B.
The knowledge compressed into an LLM is a byproduct of training, not a goal. Training on internet data teaches the model to talk at all. The knowledge and ability to speak are intertwined.
One possibility is a push toward training exclusively on search-based rewards, so that the model isn't required to compress a large proportion of the internet into its weights. But this is likely to be much slower and come with initial performance costs that frontier model developers will not want to incur.