
cheesecompiler
The reverse is possible too: throwing massive compute at a problem can mask the existence of a simpler, more general solution. General-purpose methods tend to win out over time—but how can we be sure they’re truly the most general if we commit so hard to one paradigm (e.g. LLMs) that we stop exploring the underlying structure?

falcor84
The way I see this from an explore-exploit point of view, it's pretty rational to put the vast majority of your effort into the one action that has shown itself to bring the most reward, while spending a small amount of effort exploring other ones. Then, if and when that action is no longer as fruitful compared to the others, you shift more effort to exploring, now having significant resources from the earlier exploitation to help you explore faster.
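
A minimal epsilon-greedy sketch of that policy (the arms and their payoffs below are invented purely for illustration):

```python
import random

def epsilon_greedy(arms, pulls=10_000, epsilon=0.1):
    """Exploit the best-known arm most of the time; explore occasionally."""
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for _ in range(pulls):
        if random.random() < epsilon or 0 in counts:
            i = random.randrange(len(arms))  # explore
        else:
            # exploit: pick the arm with the highest average reward so far
            i = max(range(len(arms)), key=lambda j: totals[j] / counts[j])
        totals[i] += arms[i]()
        counts[i] += 1
    return counts

# Invented payoffs: one clearly dominant direction, two long shots.
arms = [lambda: random.gauss(1.0, 1.0),
        lambda: random.gauss(0.5, 1.0),
        lambda: random.gauss(0.2, 1.0)]
print(epsilon_greedy(arms))  # pulls concentrate heavily on the first arm
```

If the dominant arm's payoff decays over time, a sliding-window or discounted average makes the switch back to exploration happen more promptly than the all-history averages used here.
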
willvarfar
I think the "bitter lesson" is that, while startup A is trying to tune and optimise what they do in order to be able to train their model with hardware quantity B, there is another startup called C that will be lucky enough to have B*2 hardware (credits etc) and will not try so hard to optimise and will reach the end quicker?

Of course, DeepSeek was forced to take the optimisation approach but got to the end in time to stake a claim. So YMMV.

CS is full of trivial examples of this. You can use an optimized parallel SIMD merge sort to sort a huge list of ten trillion records, or you can sort it just as fast with a bubble sort if you throw more hardware at it.

The real bitter lesson in AI is that we don't really know what we're doing. We're hacking on models looking for architectures that train well but we don't fully understand why they work. Because we don't fully understand it, we can't design anything optimal or know how good a solution can possibly get.

xg15
> You can use an optimized parallel SIMD merge sort to sort a huge list of ten trillion records, or you can sort it just as fast with a bubble sort if you throw more hardware at it.

Well, technically, that's not true: The entire idea behind complexity theory is that there are some tasks that you can't throw more hardware at - at least not for interesting problem sizes or remotely feasible amounts of hardware.
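
A back-of-the-envelope illustration (the throughput figure is an assumption, and constant factors are ignored):

```python
import math

# Comparison counts for sorting n = 10^13 records.
n = 10**13
bubble_ops = n * n            # O(n^2):     ~1e26 comparisons
merge_ops = n * math.log2(n)  # O(n log n): ~4.3e14 comparisons

# Assume one machine doing 10^12 comparisons per second.
rate = 10**12
print(f"bubble sort: {bubble_ops / rate / 3.15e7:,.0f} machine-years")
print(f"merge sort:  {merge_ops / rate:,.0f} machine-seconds")
# bubble sort: ~3.2 million machine-years
# merge sort:  ~432 machine-seconds
```

No feasible amount of extra hardware closes a gap of roughly eleven orders of magnitude.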

I wonder if we'll reach a similar situation in AI where "throw more context/layers/training data at the problem" won't help anymore and people will be forced to care more about understanding again.

svachalek
I think it can be argued that GPT-4.5 was that situation.
jimbokun
And whether that understanding will be done by humans or the AIs themselves.
dan-robertson
Do you have a good reference for SIMD merge sort? The only examples I found are pairwise-merging large numbers of streams, but it seems pretty hard to optimise the late steps where you only have a few streams. I guess you can do some binary-search-in-binary-search to split a merge of two similarly sized arrays into two merges of similarly sized arrays writing to sequential outputs, and so on.

More precisely, I think producing a good, fast merge of ca. 5 lists was a problem I didn't have good answers for, but maybe I was too fixated on a streaming solution and didn't apply enough tricks.
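
One standard form of that binary-search split (a sketch of the "merge path" idea, not something from the thread): find where output rank k splits the two sorted arrays, after which the two halves can be merged independently.

```python
def split_merge(a, b, k):
    """Return (i, j) with i + j == k such that a[:i] and b[:j]
    together hold the k smallest elements of the combined input.
    Binary search on i: O(log min(len(a), k)).
    """
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if a[i] < b[j - 1]:  # a[i] belongs among the first k: take more of a
            lo = i + 1
        else:
            hi = i
    return lo, k - lo

a, b = [1, 3, 5, 7], [2, 4, 6, 8]
k = (len(a) + len(b)) // 2
i, j = split_merge(a, b, k)
# merge(a[:i], b[:j]) fills output[:k]; merge(a[i:], b[j:]) fills output[k:]
```

Applied recursively, this partitions one big merge into p independent merges with disjoint output ranges, one per thread or SIMD lane.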

logicchains
We can be sure via analysis based on computational theory, e.g. https://arxiv.org/abs/2503.03961 and https://arxiv.org/abs/2310.07923 . This lets us know what classes of problems a model is able to solve, and sufficiently deep transformers with chain of thought have been shown to be theoretically capable of solving a very large class of problems.
yorwba
Keep in mind that proofs of transformers being able to solve all problems in some complexity class work by taking a known universal algorithm for that complexity class and encoding it as a transformer. In every such case, you'd be better off using the universal algorithm you started with in the first place.

Maybe the hope is that you won't have to manually map the universal algorithm to your specific problem and can just train the transformer to figure it out instead, but there are few proofs that transformers can solve all problems in some complexity class through training instead of manual construction.

dsr_
A random number generator is guaranteed to eventually produce a correct solution to any problem, but the runtime usually does not meet usability standards.

Also, solution testing is mandatory. Luckily, you can ask an RNG for that, too, as long as you have tests for the testers already written.
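
A generate-and-test sketch of that idea (bogosort-style, purely for illustration):

```python
import random

def rng_solve(xs, max_tries=10**6, seed=0):
    """Propose random permutations until one passes the test.
    Correct whenever it returns; expected tries grow like n!,
    which is where the usability standards fall over.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = rng.sample(xs, len(xs))  # random permutation
        if all(p <= q for p, q in zip(candidate, candidate[1:])):  # the test
            return candidate
    raise TimeoutError("runtime did not meet usability standards")

print(rng_solve([3, 1, 2]))  # [1, 2, 3]
```
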

tsimionescu
Note that these theorems show that there exists a transformer that can solve these problems; they tell you nothing about whether there is any way to train that transformer using gradient descent from some data. And even if you could, they don't tell you how much data, and of what kind, you would need to train them on.
cheesecompiler OP
But this uses the transformer model to justify its own reasoning strength, which might be a blind spot; that was my original point. All the above shows is that transformers can simulate solving a certain set of problems. It doesn't show that they are the best tool for the job.
