If you're going to say “he just sucks” or “this was solved before”, it should be required to point to the “solution” and maybe explain how it works.
IMO the problem with current models is that they don’t learn categorically, like: lions are animals, and animals are alive; goats are animals, so goats are alive too. And if lions have some property like breathing and goats also have it, then other similar things likely have the same property.
Or when playing a game, a human can come up with a strategy like: I’ll level this ability and lean on it early, then I’ll level this other ability that takes more time to ramp up while still using the first one, then switch to this play style once the new ability is ready. This might be formulated entirely from theoretical ideas about the game, and modified as the player gains experience.
With current AI models, as far as I understand, the model sees the whole game as an optimization problem and tries random things until it finds something that makes it win more. This is not as scalable as combining theory and experience the way humans do. For example, a human innately understands that there is a concept of an early game, and that gains made in the early game can compound into a large lead. That is pattern matching as well, but at a higher level.
Theory makes learning more scalable than just trying everything and seeing what works.
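To make the categorical-learning idea concrete, here is a toy sketch (my own illustration, not a description of any existing model): properties shared by observed members of a category are assumed to transfer to other members.

```python
# Toy sketch of category-based property transfer (illustrative only).
taxonomy = {"lion": "animal", "goat": "animal", "zebra": "animal", "oak": "plant"}
observed = {"lion": {"breathes", "is alive"}, "goat": {"breathes", "is alive"}}

def likely_properties(entity: str) -> set[str]:
    """Guess properties of `entity` from observed members of its category."""
    category = taxonomy.get(entity)
    peers = [e for e in observed if taxonomy.get(e) == category]
    if not peers:
        return set()
    # A property shared by every observed peer is assumed to generalize.
    return set.intersection(*(observed[p] for p in peers))

print(likely_properties("zebra"))  # {'breathes', 'is alive'} (order may vary)
```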
This comment, with the exception of the random claim of "he is just bad at this", reads like a thinly veiled appeal to authority. I mean, you're complaining about people pointing out prior work, reviewing the approach, and benchmarking the output.
I'm not sure you are aware, but those items (bibliographical review, problem statement, proposal, comparison/benchmarks) are the very basic structure of an academic paper, which each and every academic paper on any technical subject is required to present in order to be publishable.
I get that there must be a positive feedback element to it, but pay attention to your own claim: "He is trying to solve this to advance the field." How can you tell whether this really advances the field if you want to shield it from any review or comparison? Otherwise what's the point? To go on and claim that ${RANDOM_CELEB} parachuted into a field and succeeded at first try where all so-called researchers and experts failed?
Lastly, "he is just bad at this". You know who is bad at research topics? Researchers specialized in said topic. Their job is literally to figure out something they don't know. Why do you think someone who just started is any different?
A serious attempt at video/vision would involve some probabilistic latent space that can be noised in ways that make sense for games in general. I think veo3 proves that AI can generalize 2D and even 3D games; generating a video under prompt constraints is basically playing a game. I think you could prompt veo3 to play any game for a few seconds and it will generally make sense even though it is not fine-tuned.
See here for example:
[1] https://arxiv.org/pdf/2410.18072
[2] https://arxiv.org/pdf/2411.02914v1
[3] https://openai.com/index/video-generation-models-as-world-si...
But even if you knew nothing about this topic, the observation that you simply couldn't store the necessary amount of video data in a model such that it could simply regurgitate it should give you a big clue as to what is happening.
In the same way that keeping a dream journal is basically doing investigative journalism, or talking to yourself is equivalent to making new friends, maybe.
The difference is that while they may both produce similar, "plausible" output, one does so as a result of processes that exist in relation to an external reality.
It doesn't. And you said it yourself:
> generating a video under prompt constraints is basically playing a game.
No. It's neither generating a game (that people can play) nor is it playing a game (it's generating a video).
Since it's not a model of the world in any sense of the word, there are issues with even the most basic object permanence. E.g. here's veo3 generating a GTA-style video. Oh look, the car spins 360 and ends up on a completely different street than the one it was driving down previously: https://www.youtube.com/watch?v=ja2PVllZcsI
Also, prompting doesn't work as you imply it does.
Besides static puzzles (like a maze or a jigsaw), I don't believe this analogy holds. A model working with prompt constraints that aren't evolving or being added over the course of "navigating" its own generation needs to process zero new information that it didn't come up with itself. Playing a game is different from other generation because it's primarily about reacting to input whose precise timing and spatial details you didn't know in advance, even if you can learn that they fall within a known set of higher-order rules. Obviously, the more finite/deterministic/predictably probabilistic the video game's solution space, the more it can be inferred from the initial state (i.e., it reduces to the same type of problem as generating a video from a prompt), which is why models are still able to play video games. But as GP pointed out, transfer is negative in such cases: the overarching rules are not predictable enough across disparate genres.
> I think you could prompt veo3 to play any game for a few seconds
I'm curious what your threshold for what constitutes "play any game" is in this claim? If I wrote a script that maps button combinations to average pixel color of a portion of the screen buffer, by what metric(s) would veo3 be "playing" the game more or better than that script "for a few seconds"?
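For the sake of argument, here is roughly what such a trivial script could look like; a sketch of my own, where the button names, the patch of screen, and the mapping are all arbitrary:

```python
import numpy as np

BUTTONS = ["NOOP", "LEFT", "RIGHT", "FIRE"]

def act(frame: np.ndarray) -> str:
    """Map an HxWx3 uint8 screen buffer to a button press.

    The 'policy' is just the average pixel value of the bottom strip of
    the screen, bucketed into one of four buttons.
    """
    patch = frame[-40:, :, :]          # a portion of the screen buffer
    brightness = patch.mean()          # average pixel colour, 0..255
    return BUTTONS[int(brightness) % len(BUTTONS)]
```

Wire `act` into any emulator loop and buttons will get pressed in response to what's on screen "for a few seconds", which is exactly the bar the question is probing.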
edit: removing knee-jerk reaction language
I am just saying we have proof that it can understand complex worlds and sets of rules, and then abide by them. It doesn't know how to use a controller and it doesn't know how to explore the game physics on its own, but those steps are much easier to implement based on how coding agents are able to iterate and explore solutions.
If I were to hand you a version of a 2D platformer (let's say Mario) where the gimmick is that you're actually playing the Fourier transform of the normal game, it would be hopeless. You might not ever catch on that the images on screen are completely isomorphic to a game you're quite familiar with and possibly even good at.
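The gimmick is easy to mock up, for anyone curious; a rough sketch, assuming frames arrive as grayscale NumPy arrays:

```python
import numpy as np

def fourier_view(frame: np.ndarray) -> np.ndarray:
    """Show the log-magnitude spectrum of a frame instead of the frame itself.

    This is roughly what the player in the thought experiment would see
    (illustrative sketch only; the phase half of the information is dropped
    from the display, even though the underlying transform is invertible).
    """
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    magnitude = np.log1p(np.abs(spectrum))
    return (255 * magnitude / (magnitude.max() + 1e-9)).astype(np.uint8)
```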
But some range of spatial-transform gimmicks is cleanly intuitive. We've seen this with games like VVVVVV and Braid.
So the general rule seems to be that intelligence is transferable to situations that are isomorphic up to certain "natural" transforms, but not to "matching any possible embedding of the same game in a different representation".
Our failure to produce anything more than hyper-specialists forces us to question what exactly is meant by the ability to generalize, other than just "mimicking an ability humans seem to have".
Except that's of course superficial nonsense. Position space isn't an accident of evolution, one of many possible encodings of spatial data. It's an extremely special encoding: the physical laws are local in position space. What happens on the moon does not impact what happens when I eat breakfast much. But points arbitrarily far apart in momentum space do interact. Locality of action is a very, very deep physical principle, and it's absolutely central to our ability to reason about the world at all, to break it apart into independent pieces.
So I strongly reject your example. It makes no sense to present the pictures of a video game in Fourier space. It's highly unnatural for very profound reasons. Our difficulty stems entirely from the fact that our vision system is built for interpreting a world with local rules and laws.
I also don't see any reason an AI could easily transfer between the two representations. Starting from scratch, it could train on the Fourier-space data, but that's more akin to using different eyes than to transfer.
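The non-locality of the Fourier picture is easy to check directly (a toy demonstration of the point, nothing more): flip one pixel in position space and every Fourier coefficient moves.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.random((64, 64))

poked = frame.copy()
poked[10, 10] += 1.0                       # a purely local change

delta = np.fft.fft2(poked) - np.fft.fft2(frame)
# The FFT of a single-pixel bump is a complex exponential with magnitude 1
# at every frequency, so no Fourier coefficient is left untouched.
print(np.allclose(np.abs(delta), 1.0))     # True
```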
This is a problem because we are approaching AI from an angle of no a priori assumptions about the variations on the pattern that it should be able to generalize to. We just imagine that there's some magic way to recognize any isomorphic representation and transfer our knowledge to the new variables, when in reality we can only recognize a domain as familiar when it differs in a narrow set of ways, like being upside down or drawn on a bent surface. The set of possible variations on a 2D platformer we can generalize well enough to just pick up and play is a tiny subset of all the ways you could map the pixels on the screen to something else without technically losing information.
We could probably make an AI that bakes in the sort of assumptions that let it easily generalize what it learns to Fourier-space representations of the same data, but then it probably wouldn't be good at generalizing the same sorts of things we are good at generalizing.
My point (hypothesis really) is that the ability to "generalize in general" is a fiction. We can't do it either. But the sort of things we can generalize are exactly the sort that tend to occur in nature anyway so we don't notice the blind spot in what we can't do because it never comes up.
Not Zelda. That game is highly nonlinear and its measurable goals (triforce pieces) are long-term objectives that take a lot of gameplay to obtain. As far as I’m aware, no AI has been able to make even modest progress without any prior knowledge of the game itself.
Yet many humans can successfully play and complete the first dungeon without any outside help. While completing the full game is a challenge that takes dedication, many people achieved it long before having access to the internet and its spoiler resources.
So why is this? Why are humans so much better at Zelda than AIs? I believe that transfer knowledge has a lot to do with it. For starters, Link is approximately human (technically Hylian, but they are considered a race of humans, not a separate species) which means his method of sensing and interacting with his world will be instantly familiar to humans. He’s not at all like an earthworm or an insect in that regard.
Secondly, many of the objects Link interacts with are familiar to most modern humans today: swords, shields, keys, arrows, money, bombs, boomerangs, a ladder, a raft, a letter, a bottle of medicine, etc. Since these objects in-game have real world analogues, players will already understand their function without having to figure it out. Even the triforce itself functions similarly to a jigsaw puzzle, making it obvious what the player’s final objective should be. Furthermore, many players would be familiar with the tropes of heroic myths from many cultures which the Zelda plot closely adheres to (undertake a quest of personal growth, defeat the nemesis, rescue the princess).
All of this cultural knowledge is something we take for granted when we sit down to play Zelda for the first time. We’re able to transfer it to the game without any effort whatsoever, something I have yet to witness an AI achieve (train an AI on a general cultural corpus containing all of the background cultural information above and get it to transfer that knowledge into gameplay as effectively as an unspoiled Zelda beginner).
As for the Fourier transform, I don’t know. I do know that the Legend of Zelda has been successfully completed while playing entirely blindfolded. Of course, this wasn’t with Fourier transformed sound, though since the blindfolded run relies on sound cues I imagine a player could adjust to the Fourier transformed sound effects.
It sounds like the "best" AI without constraints would just be something like a replay of a record speedrun rather than a smaller set of heuristics for getting through a game, though the latter is clearly much more important with unseen content.
John Carmack founded Keen Technologies in 2022 and has been working seriously on AI since 2019. From his experience in the video game industry, he knows a thing or two about linear algebra and GPUs, that is, the underlying maths and the underlying hardware.
So, for all intents and purposes, he is an "AI guy" now.
He has built an AI system that fails to do X.
That does not mean there isn't an AI system that can do X. Especially considering that a lot is happening in AI, as you say.
Anyway, Carmack knows a lot about optimizing computations on modern hardware. In practice, that happens to be also necessary for AI. However, it is not __sufficient__ for AI.
Perhaps you have put your finger on the fatal flaw ...
You are holding the burden of proof here...
Maybe this is formulated a bit harshly, but let us respect the logic here.
God I hate sounding like this. I swear I'm not too good for John Carmack, as he's infinitely smarter than me. But I just find it a bit weird.
I'm not against his discovery, just against the vibe and framing of the op.
One phenomenon that laid this bare for me, in a substantive way, was noticing an increasing number of reverent comments re: Geohot in odd places here, which are just as quickly answered by people with a sense of how he actually works, as opposed to the keywords he associates himself with. But that only happens here AFAIK.
Yapping, or inducing people to yap about me, is unfortunately much more salient to my expected mindshare than the work I do.
It's getting claustrophobic intellectually, as a result.
Example from the last week is the phrase "context engineering": the Shopify CEO says he likes it better than prompt engineering, Karpathy QTs to affirm, SimonW writes it up as a fait accompli. Now I have to rework my site to not use "prompt engineering" and have a Take™ on "context engineering". Because of a couple tweets + a blog reverberating over 2-3 days.
Nothing against Carmack, or anyone else named, at all; in the context engineering case, they're just sharing their thoughts in realtime. (I don't wanna get rolled up into a downvote brigade because it seems like I'm affirming the loose assertion that Carmack is "not an AI guy", or that I'm criticizing anyone's conduct at all.)
EDIT: The context engineering example was not in reference to another post at the time of writing; now one is at the top of the front page.
The difference here is that your example shows a trivial statement and a change period of 3 days, whereas what Carmack is doing is taking years.
Not sure why justanotherjoe is a credible resource on who is and isn’t an expert in some new dialectic and euphemism for machine state management. You’re that nobody to me :shrug:
Yann LeCun is an AI guy and has simplified it as “not much more than physical statistics.”
A whole lot of AI is decades-old info theory books applied to modern computers.
Either a mem value is or isn’t what’s expected. Either an entire matrix of values is or isn’t what’s expected. Store the results of some such rules. There’s your model.
The words are made up and arbitrary because human existence is arbitrary. You’re being sold on a bridge to nowhere.
That's just what I think anyway.
And like I get it, it’s fun to complain about the obnoxious and irrational AGI people. But the discussion about how people are using these things in their everyday lives is way more interesting.
I'm wondering whether anyone has tested the same model in two situations:
1) Bring it to superhuman level in game A and then present game B, which is similar to A, to it.
2) Present B to it without presenting A.
If 1) is not significantly better than 2) then maybe it is not carrying much "knowledge", or maybe we simply did not program it correctly.
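A sketch of that comparison, with `make_model`, `train`, and `evaluate` as caller-supplied stand-ins (hypothetical names, just to pin down the protocol):

```python
def transfer_experiment(make_model, train, evaluate, game_a, game_b, steps):
    """Compare learning game B with and without prior mastery of game A.

    `make_model`, `train`, and `evaluate` are caller-supplied stand-ins,
    not tied to any particular RL library.
    """
    # Condition 1: bring a model to a strong level on A, then learn B.
    seasoned = make_model()
    train(seasoned, game_a, steps)
    train(seasoned, game_b, steps)

    # Condition 2: learn B from scratch with a fresh model.
    fresh = make_model()
    train(fresh, game_b, steps)

    # Positive transfer would show up as a clear gap in favour of `seasoned`.
    return evaluate(seasoned, game_b), evaluate(fresh, game_b)
```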
Doesn't seem unreasonable that the same holds in a gaming setting, that one should train on many variations of each level. Change the lengths of halls connecting rooms, change the appearance of each room, change power-up locations etc, and maybe even remove passages connecting rooms.
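As a sketch of what "many variations of each level" could mean in practice (the level fields and ranges here are made up for illustration):

```python
import random

def randomize_level(base: dict, rng: random.Random) -> dict:
    """Produce one variation of a level description (illustrative only)."""
    level = dict(base)
    level["hall_lengths"] = [max(1, h + rng.randint(-2, 2))
                             for h in base["hall_lengths"]]
    level["room_palette"] = rng.choice(["stone", "brick", "ice", "lava"])
    level["powerup_rooms"] = rng.sample(base["rooms"],
                                        k=len(base["powerup_rooms"]))
    # Occasionally wall off a passage so the layout itself changes.
    if base["passages"] and rng.random() < 0.2:
        removed = rng.choice(base["passages"])
        level["passages"] = [p for p in base["passages"] if p != removed]
    return level
```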
I guess it's a totally different level of control: instead of immediately choosing a certain button to press, you need to set longer-term goals. "Press whatever sequence over this time window I need in order to end up closer to this result."
There is some kind of nested, multidimensional thing to train on here, instead of immediate limited choices.
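One common way to phrase that nesting (a generic sketch of the idea, not a claim about what any particular system does): a high-level policy picks a longer-term goal every so often, and a low-level policy picks the button presses that chase it.

```python
import random

class HierarchicalController:
    """Toy two-level controller: goals on top, button presses underneath."""

    def __init__(self, goals, buttons, horizon=30):
        self.goals, self.buttons, self.horizon = goals, buttons, horizon
        self.current_goal, self.steps_left = None, 0

    def act(self, observation):
        # High level: re-plan a longer-term goal every `horizon` steps.
        if self.steps_left == 0:
            self.current_goal = self.pick_goal(observation)
            self.steps_left = self.horizon
        self.steps_left -= 1
        # Low level: choose the immediate button press given the goal.
        return self.pick_button(observation, self.current_goal)

    def pick_goal(self, observation):
        return random.choice(self.goals)    # placeholder for a learned policy

    def pick_button(self, observation, goal):
        return random.choice(self.buttons)  # placeholder for a learned policy
```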
Of course, this is because I have spent a lot of time TRAINING to play chess and basically none training to play Go.
I am good on guitar because I started training young but can't play the flute or piano to save my life.
Most complicated skills have basically no transfer or carry over other than knowing how to train on a new skill.
If I give you a chess set with dwarf-themed pieces and different colored squares, you could play immediately.
A lot of intelligence is just pattern matching and being quick about it.
Current AI only does one of those (pattern matching, not evolution), and the prospect of simulating evolution is kind of bleak, given that I don’t think we can simulate a full living cell from scratch yet. Building a world model requires life (or something that has undergone a similar evolutionary survivorship path), not something that mimics life.
But producing something useful is a totally different thing from producing resilience in physical reality. That takes a world model, and I guess my suspicion is that an entity can’t build a world model without a long history of surviving in that world.
Put another way, you can never replicate what it’s like to burn your hand on the fire using only words. You could have a million people tell a child about what fire is like, the dangers of it, the power of it, the pain of it. But they will never develop an innate understanding of it that helps them navigate the real world.
Until they stick their hand in the fire. Then they know.
We train the models on what are basically shadows, and they learn how to pattern match the shadows.
But the shadows are only depictions of the real world, and the LLMs never learn about that.
To mitigate this you have to include the other categories in your finetune training dataset so it doesn't lose the existing knowledge. Otherwise, the backpropagation and training will favour weights that reflect the new data.
In the game example, having the weights optimized for game A doesn't help with game B. It would be interesting to see whether training on both game A and game B helps it understand concepts in both.
Similarly with programming languages: it would be interesting to see whether, when trained on multiple languages, it can extract concepts like if statements and while loops.
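A sketch of the mitigation described above, often called rehearsal or replay (the 30% mixing ratio is an arbitrary choice for illustration):

```python
import random

def build_finetune_set(new_examples, old_examples, old_fraction=0.3, seed=0):
    """Mix a slice of the original data back into the finetuning set so the
    model keeps seeing the categories it already knows (rehearsal/replay)."""
    rng = random.Random(seed)
    n_old = int(len(new_examples) * old_fraction / (1 - old_fraction))
    kept_old = rng.sample(old_examples, k=min(n_old, len(old_examples)))
    mixed = list(new_examples) + kept_old
    rng.shuffle(mixed)
    return mixed
```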
IIUC from observations with multilingual LLMs, you need to have the different things you are supporting together in the training set. Then the current approach is able to identify similar concepts/patterns. It's not really learning these concepts; it's learning that certain words often go together, or that a word in one language is similar to one in another.
It would be interesting to study multilingual LLMs for their understanding of those languages in the case where the two languages are similar (e.g. Scottish and Irish Gaelic; Dutch and Afrikaans; etc.), are in the same language family (French, Spanish, Portuguese), or are in different language families (Italian, Japanese, Swahili), etc.
Supposedly it gets worse at both A and B. That's their problem statement, essentially. Current SOTA models don't behave like humans would. If you took a human that's really good at A and B, chances are they're gonna pick up C much quicker than a random person off the street who hasn't even seen Atari before. With SOTA models, the random "person" does better at C than the A/B master.
AI has beat the best human players in Chess, Go, Mahjong, Texas hold'em, Dota, Starcraft, etc. It would be really, really surprising that some Atari game is the holy grail of human performance that AI cannot beat.
In other words, the Starcraft AIs that win do so by microing every single unit in the entire game at the same time, which is pretty clever, but if you reduce them to interfacing with the game in the same way a human does, they start losing.
One of my pet peeves when we talk about the various chess engines is: yes, given a board state they can output the next set of moves to beat any human, but can they teach someone else to play chess? I'm not trying to activate some kinda "gotcha" here, just getting at what it actually means to "know how to play chess". We'd expect any human who claimed to know how to play to be able to teach any other human pretty trivially.
Where can I read about these experiments?
The only thing I've seen approximating generalization has appeared in symbolic AI cases with genetic programming. It's arguably dumb luck of the mutation operator, but oftentimes a solution is found that does work for the general case - and it is possible to prove a general solution was found with a symbolic approach.
On a less quality-of-life-focused note, I don’t believe that the models he uses for this research are capable of more. Is it really that revealing?
The original paper "Playing Atari with Deep Reinforcement Learning" (2013) from Deepmind describes how agents can play Atari games, but these agents would have to be specifically trained on every individual game using millions of frames. To accomplish this, simulators were run in parallel, and much faster than in real-time.
Also, additional trickery was added to extract a reward signal from the games, and there is some minor cheating on supplying inputs.
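A skeleton of that per-game setup as described above (a rough sketch only; the env interface and helpers are stand-ins, and the replay buffer and target network are omitted):

```python
import random
import numpy as np

def train_single_game(make_env, q_values, update, num_envs=16,
                      total_frames=10_000_000, epsilon=0.1, num_actions=4):
    """One agent, one Atari game, many parallel simulators, clipped rewards.

    `make_env()` returns an object with reset() -> obs and
    step(action) -> (obs, reward, done); `q_values(obs)` returns an array of
    action values; `update(...)` performs the learning step. All three are
    caller-supplied stand-ins, not a real library API.
    """
    envs = [make_env() for _ in range(num_envs)]      # parallel simulators
    observations = [env.reset() for env in envs]

    for _ in range(total_frames // num_envs):
        for i, env in enumerate(envs):
            # Epsilon-greedy: mostly exploit learned values, sometimes explore.
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = int(np.argmax(q_values(observations[i])))
            obs, reward, done = env.step(action)
            clipped = float(np.sign(reward))          # the reward-signal "trickery"
            update(observations[i], action, clipped, obs, done)
            observations[i] = env.reset() if done else obs
```

Nothing in this loop is shared across games; every title gets its own run of millions of frames.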
What Carmack (and others before him) is interested in is learning in a real-life setting, similar to how humans learn.
It’s apparently much easier to scare the masses with visions of ASI than to build a general intelligence that can pick up a new 2D video game faster than a human being.