- On the contrary, I think it demonstrates an inherent limit to the kind of tasks / datasets that human beings care about.
It's known that large neural networks can memorize even random data (a quick sanity-check sketch is at the end of this comment). The number of possible random datasets is unfathomably large, and the weights of networks trained on random data would probably not concentrate in any low dimensional subspace.
It's only the interesting-to-human datasets, as far as I know, that drive the neural network weights to a low dimensional subspace.
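Here's the kind of quick sanity check I have in mind for the memorization claim. The sizes and the sklearn model are my own choices, not anything from the paper: a small but overparameterized MLP happily fits labels that have no relation to the inputs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # random inputs
y = rng.integers(0, 2, size=200)    # labels with no relation to X at all

# Plenty of capacity relative to 200 points, so it can memorize.
clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=5000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))              # training accuracy near 1.0: pure memorization
```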
- Each fine tune drags the model weights away from the base model in a certain direction.
Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are nearly orthogonal to each other with high probability.
The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.
Another way to say it is that you can compress a fine tune's weight update into a vector of about 40 floats (see the sketch at the end of this comment).
Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?
I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
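For concreteness, here is roughly what I picture "40 floats per fine tune" meaning mechanically. This is not the paper's actual method; the shapes are toy numbers and the shared basis here is just an SVD over stacked weight deltas.

```python
import numpy as np

n_finetunes, n_params, k = 500, 10_000, 40
rng = np.random.default_rng(0)

# Synthetic stand-in: weight deltas (fine-tuned minus base) that secretly
# live in a k-dimensional subspace.
true_basis = rng.normal(size=(k, n_params))
deltas = rng.normal(size=(n_finetunes, k)) @ true_basis

# Fit a shared basis from the stacked deltas (top k right singular vectors).
_, _, vt = np.linalg.svd(deltas, full_matrices=False)
basis = vt[:k]                      # k directions, each as big as the model itself

codes = deltas @ basis.T            # shape (500, 40): the "40 floats" per fine tune
reconstructed = codes @ basis       # mix the shared directions back together
print(np.allclose(reconstructed, deltas))   # True for this synthetic case
```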
- I agree that the authors' decision to use random models from Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up work that trains large models from scratch.
- Where do lock free algorithms fall in this analysis?
- Their VGGT, Dinov3, and segment anything models are pretty impressive.
- The problem becomes complicated once the large discrete objects are not actuated. Even worse if the large discrete objects are not consistently observable because of occlusions or other sensor limitations. And almost impossible if the large discrete objects are actuated by other agents with potentially adversarial goals.
Self-driving cars, an application in which the physics is simple and arguably two dimensional, have taken more than a decade to reach a deployable solution.
- The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206
> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computationally feasible way to estimate the optimizer’s time-averaged trajectory.
They analyze the behavior of RMSProp (essentially Adam without momentum) using their framework and come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their models explain why RMSProp works, in a way that is more satisfying than the usual hand-wavy explanations.
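For reference, the update they are modeling is just the plain RMSProp rule. The hyperparameters below are generic defaults, not the paper's, and the toy loss is mine:

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-2, beta=0.99, eps=1e-8):
    """One RMSProp update: divide the gradient by a running RMS of past gradients."""
    v = beta * v + (1 - beta) * grad**2         # EMA of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)      # per-coordinate adaptive step size
    return w, v

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is just w.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3000):
    w, v = rmsprop_step(w, grad=w, v=v)
print(w)   # both coordinates end up oscillating within roughly lr of the minimum at 0
```

As I read it, that persistent oscillation is exactly what the time-averaging in the quote above is smoothing over.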
- At first I thought this would be just another gradient descent tutorial for beginners, but the article goes quite deep into gradient descent dynamics, looking into third-order approximations of the loss function and eventually motivating a concept called "central flows." Their central flow model was able to predict the loss curves of various training runs across different neural network architectures.
- Can someone explain the bit counting argument in the reinforcement learning part?
I don’t get why a trajectory would provide only one bit of information.
Each step of the trajectory gives at least some information about which state transitions are possible.
An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
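A back-of-the-envelope count of what I mean, assuming the "one bit" refers to a binary success/failure reward at the end of an episode. The state-space size and horizon below are made-up round numbers:

```python
import math

n_states, horizon = 1000, 200

reward_bits = math.log2(2)                  # binary pass/fail at the end: exactly 1 bit
# Each observed step reveals which next state followed the current (state, action),
# worth up to log2(n_states) bits about the transition dynamics.
transition_bits = horizon * math.log2(n_states)

print(reward_bits)       # 1.0
print(transition_bits)   # ~1993 bits from the same trajectory, reward ignored
```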
- These problems seem to have the flavor of interview questions I heard for quant positions.
- Interesting article. It’s actually very strange that the dataset needs to be “big” for the O(n log n) algorithm to beat the O(n). Usually you’d expect the big O analysis to be “wrong” for small datasets.
I expect that in this case, as in all cases, once the datasets become galactically large the O(n) algorithm will start winning again.
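If I wanted to see where the crossover actually lands, I'd time something like the sketch below. The two dedup routines are my own stand-ins, not the article's algorithms; the point is just that constant factors and memory behavior, not the asymptotics, decide the winner at each size.

```python
import random, time

def dedup_sorted(xs):                 # O(n log n): sort, then scan adjacent pairs
    xs = sorted(xs)
    out = xs[:1]
    for a, b in zip(xs, xs[1:]):
        if b != a:
            out.append(b)
    return out

def dedup_hash(xs):                   # O(n): one pass through a hash set
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

for n in (10**3, 10**5, 10**7):
    data = [random.randrange(n) for _ in range(n)]
    for fn in (dedup_sorted, dedup_hash):
        t0 = time.perf_counter()
        fn(data)
        print(n, fn.__name__, f"{time.perf_counter() - t0:.3f}s")
```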
- Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer and grading it pass/fail only provides a sparse reward signal.
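A toy illustration of the difference; the actual reward functions in question aren't specified, so both of these are made up:

```python
def dense_reward(answer: str, reference: str) -> float:
    """Proximity to the reference via token overlap, in [0, 1]."""
    a, r = set(answer.split()), set(reference.split())
    return len(a & r) / max(len(r), 1)

def sparse_reward(answer: str, reference: str) -> float:
    """Parse-and-grade pass/fail: mostly zeros early in training."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

print(dense_reward("x = 41", "x = 42"))    # ~0.67: a wrong answer still gets graded signal
print(sparse_reward("x = 41", "x = 42"))   # 0.0: no signal at all
```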
- I’m not sure that LLMs are solely autocomplete. The next-token prediction objective is only used for pretraining; after that, I thought you apply reinforcement learning.
- Did this end up working? It sounds plausible but it needs some empirical validation.
- Yeah, I try to make sure I add the extern "C". I’m also on x86, so I just pretend that alignment is not an issue, and I think it works.
- CBOR has some stuff that is nice but would be annoying to reimplement, like using more bytes to store large numbers than small ones. If you need a quick multipurpose binary format, CBOR is pretty good. The only alternative I’d roll by hand is to memcpy the bytes of a C struct directly to disk and hope that I won’t encounter a system with different endianness.
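A Python sketch of that "just write the struct bytes" idea, but with the byte order pinned explicitly so the endianness worry goes away. The two-field record layout (an int32 id and a float64 value) is invented for illustration.

```python
import struct

RECORD = struct.Struct("<id")    # "<" = little-endian, i = int32, d = float64

def save(path, record_id, value):
    with open(path, "wb") as f:
        f.write(RECORD.pack(record_id, value))

def load(path):
    with open(path, "rb") as f:
        return RECORD.unpack(f.read(RECORD.size))

save("record.bin", 7, 3.5)
print(load("record.bin"))        # (7, 3.5) regardless of host byte order
```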
- Something seems off with equation (5).
Just imagining Monte Carlo sampling it: the middle expectation will have a bunch of zero samples due to the indicator function, while the right expectation won’t.
I can make the middle expectation as close to zero as I like by making the success threshold sufficiently high.
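A numerical version of what I'm picturing, with a made-up score distribution since I'm not reproducing the paper's actual equation (5):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=1_000_000)          # stand-in for the sampled quantity

for t in (0.5, 0.9, 0.99, 0.999):
    with_indicator = np.mean(scores * (scores > t))   # E[X * 1{X > t}]: mostly zero samples
    plain = np.mean(scores)                           # E[X]: stays near 0.5 regardless of t
    print(f"threshold {t}: {with_indicator:.4f} vs {plain:.4f}")
```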
- The ultimate compression is to send just the user inputs and reconstitute the game state on the other end.
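A toy version of that idea, assuming the simulation is deterministic: a seed plus the per-tick inputs is enough for the receiver to rebuild the exact state. The "game" here is obviously invented.

```python
import random

def simulate(seed, inputs):
    rng = random.Random(seed)                 # same seed -> same "physics" rolls
    x = 0
    for move in inputs:                       # per-tick user input: -1, 0, or +1
        x += move + rng.choice((-1, 0, 1))    # deterministic given seed + inputs
    return x

inputs = [1, 1, 0, -1, 1, 0, 0, 1]            # the only thing that gets transmitted
assert simulate(42, inputs) == simulate(42, inputs)
print(simulate(42, inputs))
```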
- I think you are missing that d, x, and y are variables that get optimized over. Any choice of d lower than the solution to 1) is infeasible. Any d higher than the solution to 1) is suboptimal.
edit: I see now that problem 2) is missing d in the subscript of optimization variables. I think this is a typo.
- Using swipe, no space bar after "kill": Kill maps Jill myself Jill myself
Using swipe, manually pressing the space bar after "kill": Kill mussels Kill mussels Kill mussels