Similarly, I would expect that transformers trained on the same next-word-prediction loss, if the data is at all similar (like human language), would converge to approximately the same representation space. And to represent that same space, the weights are probably similar too. Weights in general seem to occupy low-dimensional spaces.
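To make "approximately the same space" concrete, here's a rough numpy sketch of one way you could test it: fit a linear map between two models' embeddings of the same inputs and check the R². The embedding matrices below are random placeholders (a shared latent factor stands in for "same underlying structure"), not real model outputs.

```python
# Minimal sketch: test whether two embedding spaces agree up to a linear map.
# emb_a / emb_b are synthetic placeholders; in practice they would be two
# different models' embeddings of the *same* inputs.
import numpy as np

rng = np.random.default_rng(0)

n, d1, d2 = 1000, 64, 48
shared = rng.normal(size=(n, 16))            # pretend both models encode the same 16 latent factors
emb_a = shared @ rng.normal(size=(16, d1))   # "model A" embeddings (placeholder)
emb_b = shared @ rng.normal(size=(16, d2))   # "model B" embeddings (placeholder)

def linear_fit_r2(x, y):
    """R^2 of the best least-squares linear map x -> y.
    Close to 1.0 means y's space is (almost) a linear image of x's."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ w
    return 1.0 - resid.var() / y.var()

print("A -> B R^2:", linear_fit_r2(emb_a, emb_b))
print("B -> A R^2:", linear_fit_r2(emb_b, emb_a))
```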
All in all, I don’t think this is that surprising, and I think the theoretical angle should be (or should have been?) to find mathematical proofs, like this paper: https://openreview.net/forum?id=ONfWFluZBI
It's interesting that this result was discovered by JHU, not by groups from OAI/Google/Apple, considering that the latter have probably spent 1000x more resources on "rediscovering" it.
As a really stupid example: the sets of integers less than 2, 8, 5, and 30 can all be embedded in the set of integers less than 50, but that doesn’t require that the set of integers is finite. You can always find a bigger set that embeds the smaller ones.
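The same point in vector terms, as a toy sketch with made-up vectors: any k-dimensional embedding fits isometrically inside any d >= k dimensions just by zero-padding, and that never forces a single finite d that works for everything.

```python
# Toy sketch: zero-padding embeds k-dim points into d >= k dims
# while preserving all pairwise distances exactly.
import numpy as np

def pad_to(vectors, d):
    """Embed k-dim vectors into d-dim space (d >= k) by appending zeros."""
    n, k = vectors.shape
    assert d >= k
    return np.hstack([vectors, np.zeros((n, d - k))])

v = np.random.default_rng(1).normal(size=(5, 3))   # some 3-dim points (placeholder)
w = pad_to(v, 10)                                   # the same points, now inside 10 dims

# distances are unchanged by the embedding
print(np.allclose(np.linalg.norm(v[0] - v[1]), np.linalg.norm(w[0] - w[1])))
```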
It's known that large neural networks can even memorize random data. The number of possible random datasets is unfathomably large, and the weights of networks trained on random data would probably not lie in a low-dimensional subspace.
It's only the interesting-to-humans datasets, as far as I know, that drive the network weights into a low-dimensional subspace.
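As a toy proxy for that claim (no actual training here, just synthetic matrices standing in for weights or representations): data generated from a few latent factors has low effective dimension under PCA, pure noise does not.

```python
# Toy proxy, not actual training: structured data has low effective dimension,
# pure noise does not. In practice you'd apply the same measurement to trained
# network weights/representations rather than raw synthetic matrices.
import numpy as np

def effective_dim(x, var_threshold=0.95):
    """Number of principal components needed to explain `var_threshold`
    of the variance (a crude intrinsic-dimension estimate)."""
    x = x - x.mean(axis=0)
    s = np.linalg.svd(x, compute_uv=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratios, var_threshold) + 1)

rng = np.random.default_rng(2)
n, d = 2000, 256
structured = rng.normal(size=(n, 16)) @ rng.normal(size=(16, d))  # 16 latent factors
structured += 0.01 * rng.normal(size=(n, d))                      # plus a little noise
random_data = rng.normal(size=(n, d))                              # "memorize random data" regime

print("structured:", effective_dim(structured))   # ~16
print("random:    ", effective_dim(random_data))  # much larger, most of the 256 dims
```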
If they all need just 16 dimensions, then if we ever make one that needs 17, we know we are making progress instead of running in circles.
Apparently it doesn't, at least not in our models, with our training, applied to our tasks.
So if we expand one of those three things and notice that a 17th vector makes a difference, then we are making progress.
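Roughly what that test could look like in code, on totally synthetic placeholder data (in practice the rows would be embeddings from the expanded model, training setup, or task):

```python
# Sketch of the proposed "progress test": does a 17th principal component buy
# anything beyond the first 16? Synthetic placeholder data only.
import numpy as np

def explained(x, k):
    """Fraction of total variance captured by the top-k principal components."""
    x = x - x.mean(axis=0)
    s = np.linalg.svd(x, compute_uv=False)
    return np.sum(s[:k]**2) / np.sum(s**2)

rng = np.random.default_rng(3)
emb = rng.normal(size=(5000, 17)) @ rng.normal(size=(17, 512))  # pretend 17 factors now matter

gain = explained(emb, 17) - explained(emb, 16)
print(f"extra variance from dim 17: {gain:.3f}")
# If this gain (or a downstream probe's accuracy gain) sits solidly above the
# noise floor, the 17th direction is doing real work -- the "progress" signal.
```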