That just gave me an idea! I wonder how useful (and for what) a model would be if it was trained using a two-phase approach:
1) Put the training data through an embedding model to create a giant vector index of the entire Internet.
2) Train a transformer LLM, but instead of only utilising its weights, it can also do lookups against the index.
It's like an MoE where one (or more) of the experts is a fuzzy Google search.
The best thing is that adding up-to-date knowledge won’t require retraining the entire model!
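The two phases can be sketched in miniature. This is a toy, assuming a hypothetical `embed()` (here just deterministic hash-based token vectors standing in for a real embedding model) and a brute-force cosine-similarity index standing in for the "giant vector index"; `fuzzy_search_expert` is a made-up name for where a retrieval-augmented transformer would attend over retrieved chunks:

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding width; real models use hundreds to thousands of dims

def embed(text: str) -> np.ndarray:
    """Placeholder for the phase-1 embedding model: hash each token to a
    deterministic random vector and average. A real system would call a
    trained text-embedding model here."""
    vecs = []
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest(), 16) % (2**32)
        vecs.append(np.random.default_rng(seed).standard_normal(DIM))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

class VectorIndex:
    """Brute-force cosine-similarity search over embedded documents."""
    def __init__(self, docs):
        self.docs = list(docs)
        self.matrix = np.stack([embed(d) for d in self.docs])

    def search(self, query: str, k: int = 2):
        sims = self.matrix @ embed(query)  # cosine sims (vectors are unit-norm)
        top = np.argsort(-sims)[:k]
        return [(self.docs[i], float(sims[i])) for i in top]

    def add(self, doc: str):
        # The appeal: adding up-to-date knowledge is an index append,
        # not a retraining run.
        self.docs.append(doc)
        self.matrix = np.vstack([self.matrix, embed(doc)])

def fuzzy_search_expert(query: str, index: VectorIndex) -> str:
    """Phase-2 stand-in: where a retrieval-augmented transformer would
    condition its generation on retrieved chunks, this just returns the
    best-matching chunk."""
    doc, _score = index.search(query, k=1)[0]
    return doc

index = VectorIndex([
    "paris is the capital of france",
    "the transformer architecture relies on attention",
    "mount fuji is the tallest mountain in japan",
])
print(fuzzy_search_expert("what is the capital of france", index))
```

In a real system the nearest-neighbour search would be approximate (an ANN index rather than a dense matrix multiply), and the retrieved chunks would feed into the model's attention rather than being returned verbatim.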
If OpenAI is spending $500B, then someone could get ahead by spending $1B on an approach that improves the model by >0.2%.
I bet there's a group or three that could improve results a lot more than 0.2% with $1B.
The knowledge compressed into an LLM is a byproduct of training, not a goal. Training on internet data teaches the model to talk at all. The knowledge and ability to speak are intertwined.
One possibility is a push toward training exclusively on search-based rewards, so that the model isn't required to compress a large proportion of the internet into its weights. But this is likely to be much slower and come with initial performance costs that frontier model developers will not want to incur.