- On the contrary, I think it demonstrates an inherent limit to the kind of tasks / datasets that human beings care about.
It's known that large neural networks can memorize even random data (a quick sanity-check sketch is at the end of this comment). The number of possible random datasets is unfathomably large, and the weights of networks trained on random data would probably not concentrate in any low dimensional subspace.
It's only the interesting-to-human datasets, as far as I know, that drive the neural network weights to a low dimensional subspace.
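Here's the kind of quick sanity check I have in mind for the memorization claim. The sizes and the sklearn model are my own choices, not anything from the paper: a small but overparameterized MLP happily fits labels that have no relation to the inputs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # random inputs
y = rng.integers(0, 2, size=200)    # labels with no relation to X at all

# Plenty of capacity relative to 200 points, so it can memorize.
clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=5000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))              # training accuracy near 1.0: pure memorization
```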
- Each fine tune drags the model weights away from the base model in a certain direction.
Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are nearly orthogonal to each other with high probability.
The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.
Another way to say it is that you can compress a fine tune's weight update into a vector of about 40 floats (see the sketch at the end of this comment).
Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?
I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
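For concreteness, here is roughly what I picture "40 floats per fine tune" meaning mechanically. This is not the paper's actual method; the shapes are toy numbers and the shared basis here is just an SVD over stacked weight deltas.

```python
import numpy as np

n_finetunes, n_params, k = 500, 10_000, 40
rng = np.random.default_rng(0)

# Synthetic stand-in: weight deltas (fine-tuned minus base) that secretly
# live in a k-dimensional subspace.
true_basis = rng.normal(size=(k, n_params))
deltas = rng.normal(size=(n_finetunes, k)) @ true_basis

# Fit a shared basis from the stacked deltas (top k right singular vectors).
_, _, vt = np.linalg.svd(deltas, full_matrices=False)
basis = vt[:k]                      # k directions, each as big as the model itself

codes = deltas @ basis.T            # shape (500, 40): the "40 floats" per fine tune
reconstructed = codes @ basis       # mix the shared directions back together
print(np.allclose(reconstructed, deltas))   # True for this synthetic case
```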
- I agree that the authors' decision to use random models from Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up work that trains large models from scratch.
- Where do lock free algorithms fall in this analysis?
- Their VGGT, Dinov3, and segment anything models are pretty impressive.
- The problem becomes complicated once the large discrete objects are not actuated. Even worse if the large discrete objects are not consistently observable because of occlusions or other sensor limitations. And almost impossible if the large discrete objects are actuated by other agents with potentially adversarial goals.
Self-driving cars, an application in which the physics is simple and arguably two dimensional, have taken more than a decade to reach a deployable solution.
- The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206
> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computationally feasible way to estimate the optimizer’s time-averaged trajectory.
They analyze the behavior of RMSProp (essentially Adam without momentum) using their framework and come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their models explain why RMSProp works, in a way that is more satisfying than the usual hand-wavy explanations.
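For reference, the update they are modeling is just the plain RMSProp rule. The hyperparameters below are generic defaults, not the paper's, and the toy loss is mine:

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-2, beta=0.99, eps=1e-8):
    """One RMSProp update: divide the gradient by a running RMS of past gradients."""
    v = beta * v + (1 - beta) * grad**2         # EMA of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)      # per-coordinate adaptive step size
    return w, v

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is just w.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3000):
    w, v = rmsprop_step(w, grad=w, v=v)
print(w)   # both coordinates end up oscillating within roughly lr of the minimum at 0
```

As I read it, that persistent oscillation is exactly what the time-averaging in the quote above is smoothing over.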
- At first I thought this would be just another gradient descent tutorial for beginners, but the article goes quite deep into gradient descent dynamics, looking into third-order approximations of the loss function and eventually motivating a concept called "central flows." Their central flow model was able to predict the loss curves of various training runs across different neural network architectures.
- Can someone explain the bit counting argument in the reinforcement learning part?
I don’t get why a trajectory would provide only one bit of information.
Each step of the trajectory gives at least some information about which state transitions are possible.
An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
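A back-of-the-envelope count of what I mean, assuming the "one bit" refers to a binary success/failure reward at the end of an episode. The state-space size and horizon below are made-up round numbers:

```python
import math

n_states, horizon = 1000, 200

reward_bits = math.log2(2)                  # binary pass/fail at the end: exactly 1 bit
# Each observed step reveals which next state followed the current (state, action),
# worth up to log2(n_states) bits about the transition dynamics.
transition_bits = horizon * math.log2(n_states)

print(reward_bits)       # 1.0
print(transition_bits)   # ~1993 bits from the same trajectory, reward ignored
```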
- These problems seem to have the flavor of interview questions I heard for quant positions.
- Interesting article. It’s actually very strange that the dataset needs to be “big” for the O(n log n) algorithm to beat the O(n). Usually you’d expect the big O analysis to be “wrong” for small datasets.
I expect that in this case, as in all cases, once the datasets become galactically large the O(n) algorithm will start winning again.
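If I wanted to see where the crossover actually lands, I'd time something like the sketch below. The two dedup routines are my own stand-ins, not the article's algorithms; the point is just that constant factors and memory behavior, not the asymptotics, decide the winner at each size.

```python
import random, time

def dedup_sorted(xs):                 # O(n log n): sort, then scan adjacent pairs
    xs = sorted(xs)
    out = xs[:1]
    for a, b in zip(xs, xs[1:]):
        if b != a:
            out.append(b)
    return out

def dedup_hash(xs):                   # O(n): one pass through a hash set
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

for n in (10**3, 10**5, 10**7):
    data = [random.randrange(n) for _ in range(n)]
    for fn in (dedup_sorted, dedup_hash):
        t0 = time.perf_counter()
        fn(data)
        print(n, fn.__name__, f"{time.perf_counter() - t0:.3f}s")
```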
- Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer and grading it pass/fail only provides a sparse reward signal.
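A toy illustration of the difference; the actual reward functions in question aren't specified, so both of these are made up:

```python
def dense_reward(answer: str, reference: str) -> float:
    """Proximity to the reference via token overlap, in [0, 1]."""
    a, r = set(answer.split()), set(reference.split())
    return len(a & r) / max(len(r), 1)

def sparse_reward(answer: str, reference: str) -> float:
    """Parse-and-grade pass/fail: mostly zeros early in training."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

print(dense_reward("x = 41", "x = 42"))    # ~0.67: a wrong answer still gets graded signal
print(sparse_reward("x = 41", "x = 42"))   # 0.0: no signal at all
```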
- I’m not sure that LLMs are solely autocomplete. The next-token prediction objective is only used for pretraining; after that, I thought you apply reinforcement learning.
- Did this end up working? It sounds plausible but it needs some empirical validation.
- Yeah, I try to make sure I add the extern "C". I’m also on x86, so I just pretend that alignment is not an issue, and I think it works.
- CBOR has some stuff that is nice but would be annoying to reimplement, like using more bytes to store large numbers than small ones. If you need a quick multipurpose binary format, CBOR is pretty good. The only alternative I’d roll by hand is to memcpy the bytes of a C struct directly to disk and hope that I won’t encounter a system with different endianness.
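A Python sketch of that "just write the struct bytes" idea, but with the byte order pinned explicitly so the endianness worry goes away. The two-field record layout (an int32 id and a float64 value) is invented for illustration.

```python
import struct

RECORD = struct.Struct("<id")    # "<" = little-endian, i = int32, d = float64

def save(path, record_id, value):
    with open(path, "wb") as f:
        f.write(RECORD.pack(record_id, value))

def load(path):
    with open(path, "rb") as f:
        return RECORD.unpack(f.read(RECORD.size))

save("record.bin", 7, 3.5)
print(load("record.bin"))        # (7, 3.5) regardless of host byte order
```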
- Something seems off with equation (5).
Just imagining Monte Carlo sampling it: the middle expectation will have a bunch of zero samples due to the indicator function, while the right expectation won’t.
I can make the middle expectation as close to zero as I like by making the success threshold sufficiently high.
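A numerical version of what I'm picturing, with a made-up score distribution since I'm not reproducing the paper's actual equation (5):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=1_000_000)          # stand-in for the sampled quantity

for t in (0.5, 0.9, 0.99, 0.999):
    with_indicator = np.mean(scores * (scores > t))   # E[X * 1{X > t}]: mostly zero samples
    plain = np.mean(scores)                           # E[X]: stays near 0.5 regardless of t
    print(f"threshold {t}: {with_indicator:.4f} vs {plain:.4f}")
```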
- The ultimate compression is to send just the user inputs and reconstitute the game state on the other end.
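A toy version of that idea, assuming the simulation is deterministic: a seed plus the per-tick inputs is enough for the receiver to rebuild the exact state. The "game" here is obviously invented.

```python
import random

def simulate(seed, inputs):
    rng = random.Random(seed)                 # same seed -> same "physics" rolls
    x = 0
    for move in inputs:                       # per-tick user input: -1, 0, or +1
        x += move + rng.choice((-1, 0, 1))    # deterministic given seed + inputs
    return x

inputs = [1, 1, 0, -1, 1, 0, 0, 1]            # the only thing that gets transmitted
assert simulate(42, inputs) == simulate(42, inputs)
print(simulate(42, inputs))
```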
- I think you are missing that d, x, and y are variables that get optimized over. Any choice of d lower than the solution to 1) is infeasible. Any d higher than the solution to 1) is suboptimal.
edit: I see now that problem 2) is missing d in the subscript of optimization variables. I think this is a typo.
- Using swipe, no space bar after "kill": Kill maps Jill myself Jill myself
Using swipe, manually pressing the space bar after "kill": Kill mussels Kill mussels Kill mussels