Which is precisely why Richard Sutton doesn't think LLMs will evolve to AGI[0]. LLMs are based on mimicry, not experience, so it's more likely (according to Sutton) that AGI will be based on some form of RL (reinforcement learning) and not neural networks (LLMs).
More specifically, LLMs don't have goals and consequences of actions, which is the foundation for intelligence. So, to your point, the idea of a "skill" is more akin to a reference manual than to a skill-building exercise that can be applied to developing an instrument, task, solution, etc.
He is right that non-RL'd LLMs are just mimicry, but the field has already moved beyond that.
While that might be true, it fundamentally means it's never going to replicate human intelligence or provide superintelligence.
Many people would argue that's a good thing
At the very end of an extremely long and sophisticated process, the final mapping is softmax transformed and the distribution sampled. That is one operation among hundreds of billions leading up to it.
It’s like saying a Jeopardy player is a random word generating machine — they see a question and generate “what is” followed by a random word, random because there is some uncertainty in their mind even in the final moment. That is technically true, but incomplete, and entirely misses the point.
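For the curious, that final step really is tiny compared to everything before it. A minimal sketch, assuming a vector of logits handed over by the rest of the network (the vocabulary size and values here are made up for illustration):

```python
# Toy sketch of the last step only: softmax-transform the final logits and
# sample one token id from the resulting distribution.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    scaled = logits / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))

# Pretend 5-token vocabulary; real models spend hundreds of billions of
# operations producing these logits before this single cheap sampling step.
print(sample_next_token(np.array([2.0, 1.0, 0.5, -1.0, -3.0])))
```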
That might be true, but we're talking about the fundamentals of the concept. His argument is that you're never going to reach AGI/superintelligence through an evolution of the current concepts (mimicry), even with fine-tuning and adaptations - it'll likely be something different (and likely based on some RL technique). At least we have NO history to suggest this will be the case (hence his argument for "the bitter lesson").
But this is easier said than done. Current models require vastly more learning events than humans do, making direct human supervision infeasible. One strategy (sketched roughly below) is to train models to stand in for human supervisors, so they can bear the bulk of the supervision. This is tricky, but has proven more effective than direct supervision.
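To make that concrete, here is a rough, hypothetical sketch of the idea; the features, sizes, and choice of judge model are all stand-ins. Humans label a small sample, a cheap judge model is fit on those labels, and the judge then scores the much larger stream of learning events:

```python
# Hypothetical sketch: scale supervision by training a "judge" on a small
# human-labelled set and letting it score the bulk of the outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small, expensive set of human-labelled examples (toy feature vectors).
human_features = rng.normal(size=(500, 16))
human_labels = (human_features[:, 0] > 0).astype(int)

judge = LogisticRegression().fit(human_features, human_labels)

# Much larger stream of unlabelled outputs: the judge, not a human, scores these.
bulk_features = rng.normal(size=(100_000, 16))
bulk_scores = judge.predict_proba(bulk_features)[:, 1]   # supervision signal
print(bulk_scores[:5])
```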
But, in my experience, AIs don't specifically struggle with the "qualitative" side of things per se. In fact, they're great at things like word choice, color theory, etc. Rather, they struggle to understand continuity and consequence, and to combine disparate sources of input. They also suck at differentiating fact from fabrication. To speculate wildly, it feels like they're missing the RL of living in the "real world". In order to eat, sleep and breathe, you must operate within the bounds of physics and society and live forever with the consequences of an ever-growing history of choices.
Which eventually forces you to take a step back and start questioning basic assumptions until (hopefully) you get a spark of realization of the flaws in your original plan, and then recalibrate based on that new understanding and tackle it totally differently.
But instead I watch Claude struggling to find a directory it expects to see and running random npm commands until it comes to the conclusion that, somehow, node_modules was corrupted mysteriously and therefore it needs to wipe everything node related and manually rebuild the project config by vague memory.
Because no big deal, if it’s wrong it’s the human's problem to untangle and Anthropic gets paid either way so why not try?
While we might agree that language is foundational to what it is to be human, it's myopic to think it's the only thing. LLMs are based on training sets of language (period).
Coding is an interesting example because as we change levels of abstraction from the syntax of a specific function to, say, the architecture of a software system, the ability to measure verifiable correctness declines. As a result, RL-tuned LLMs are better at creating syntactically correct functions but struggle as the abstraction layer increases.
In other fields, it is very difficult to verify correctness. What is good art? Here, LLMs and their ilk can still produce good output, but it becomes hard to produce "superhuman" output, because in nonverifiable domains their capability is dependent on mimicry; it is RL that gives the AI the ability to perform at superhuman levels. With RL, rather than merely fitting its parameters to a set of extant data, it can follow the scent of a ground-truth signal of excellence. No scent, no outperformance.
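To illustrate the "scent" point: in a verifiable domain like code, the ground-truth signal can literally be computed. A toy example (the function name and test cases are invented, not from any real training setup):

```python
# Toy verifiable reward: 1.0 if the generated function passes the tests, else 0.0.
# "What is good art?" admits no such mechanical check.
def verifiable_reward(candidate_source: str) -> float:
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)          # load the generated function
        sort_fn = namespace["sort_list"]
        tests = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5], [5, 5])]
        return 1.0 if all(sort_fn(list(x)) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0

generated = "def sort_list(xs):\n    return sorted(xs)"
print(verifiable_reward(generated))   # 1.0 -> usable as an RL reward signal
```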
> More specifically, LLMs don't have goals and consequences of actions, which is the foundation for intelligence.
Citation?
And I associate that part with AGI being able to do cutting-edge research and explore new ideas like humans can. When that seems to “happen” with LLMs, it’s been more debatable (e.g. there was an existing paper that the LLM was able to tap into).
I guess another example would be getting an AGI to do RL in realtime to get really good at a video game with completely different mechanics, the same way a human could. Today, that wouldn’t really happen unless it was able to pre-train on something similar.
ChatGPT broke open the dam of massive budgets for AI/ML, and LLMs will probably be a puzzle piece of AGI. But otherwise?
I mean, it should be clear that we have so much work to do, like RL (which, by the way, now happens at massive scale because you thumbs-up or thumbs-down responses every day), thinking, Mixture of Experts, tool calling and, super super critically: architecture.
Compute is a hard upper limit too.
And the math isn't done either. Context-length handling has advanced, and we've also seen other approaches like diffusion-based models.
Whenever you hear the leading experts talking, they mention world models.
We are still in a phase where there are plenty of very obvious ideas people need to try out.
But the quality of Whisper alone, LLMs as an interface, and tool calling can solve problems in robotics and elsewhere that no one was able to solve this easily before.
You may disagree with this take, but it's not uninformed. Many LLMs use self-supervised pretraining followed by RL-based fine-tuning, but that's essentially it - it's fine-tuning.
Also, how do you think the most successful RL models have worked? AlphaGo/AlphaZero both use neural networks for their policy and value networks, which are the central mechanism of those models.
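In other words, "RL vs. neural networks" is a false split. A rough sketch of the shape of an AlphaZero-style policy/value network (layer sizes and structure below are placeholders, not the real architecture):

```python
# Sketch: an RL agent whose policy and value functions are themselves neural nets.
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, board_features: int = 64, num_moves: int = 362):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(board_features, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, num_moves)   # distribution over moves
        self.value_head = nn.Linear(128, 1)            # expected game outcome

    def forward(self, board: torch.Tensor):
        h = self.trunk(board)
        policy = torch.softmax(self.policy_head(h), dim=-1)
        value = torch.tanh(self.value_head(h))         # in [-1, 1]: lose..win
        return policy, value

policy, value = PolicyValueNet()(torch.randn(1, 64))
print(policy.shape, value.item())
```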
On the other hand, LLMs have a programmatic context with consistent storage and the ability to have perfect recall; they just don't always generate the expected output in practice, as the cost to go through ALL context is prohibitive in terms of power and time.
Skills... or really just context insertion, are simply a way to manually prioritize their output generation. LLM "thinking mode" is the same, for what it's worth - it really is just reprioritizing context - so not "starting from scratch" per se.
When you start thinking about it that way, it makes sense - and it helps you use these tools more effectively, too.
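In crude terms, "using a skill" amounts to something like the sketch below. The skill registry and prompt format are invented for illustration, not how any particular vendor implements it:

```python
# Illustration only: a "skill" as text prepended to the prompt so the model
# prioritizes it during generation.
SKILLS = {
    "rest-api": "When asked to call a REST API, use curl with -H for headers ...",
    "image-files": "For image conversion and resizing, shell out to ImageMagick ...",
}

def build_prompt(user_request: str, active_skills: list[str]) -> str:
    skill_context = "\n\n".join(SKILLS[name] for name in active_skills)
    return f"{skill_context}\n\nUser request: {user_request}"

print(build_prompt("resize logo.png to 64x64", ["image-files"]))
```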
I’d been re-teaching Claude to craft REST API calls with curl every morning for months before I realized that skills would let me delegate that to cheaper models, reuse cached-token queries, and save my context window for my actual problem-space CONTEXT.
what the fuck, there is absolutely no way this was cheaper or more productive than just learning to use curl and writing curl calls yourself. Curl isn't even hard! And if you learn to use it, you get WAY better at working with HTTP!
You're kneecapping yourself to expend more effort than it would take to just write the calls, helping to train a bot to do the job you should be doing
You are bad at reading comprehension. My comment meant I can tell Claude “update Jira with that test outcome in a comment” and Claude can eventually figure that out with just a key and curl, but that’s way too low level.
What I linked to literally explains that, with code and a blog post.
Not really. It's a consequential issue. No matter how big or small the context window is, LLMs simply do not have the concept of goals and consequences. Thus, it's difficult for them to acquire dynamic and evolving "skills" like humans do.
Of course OpenAI and Anthropic want to be able to reuse the same servers/memory for multiple users, otherwise it would be too expensive.
Could we have "personal" single-tenant setups? Where the LLM incorporates every previous conversation?
Not OP, but this is the part that I take issue with. I want to forget what tools are there and have the LLM figure out on its own which tool to use. Having to remember to add special words to encourage it to use specific tools (required a lot of the time, especially with esoteric tools) is annoying. I’m not saying this renders the whole thing “useless” because it’s good to have some idea of what you’re doing to guide the LLM anyway, but I wish it could do better here.
ooh, it does call make when I ask it to compile, and it's able to call a couple of other popular tools without my having to refer to them by name. If I ask it to resize an image, it'll call ImageMagick or run ffmpeg, and I don't need to refer to ffmpeg by name.
so at the end of the day, it seems they are their training data. Better write a popular blog post about your one-off MCP and the tools it exposes, and maybe the next version of the LLM will have your blog post in its training data and will automatically know how to use it without having to be told.
I installed ImageMagick on Windows.
Created a ".claude/skills/Image Files/" folder
Put an empty SKILLS.md file in it
and told Claude Code to fill in the SKILLS.md file itself with the path to the binaries.
and it created all the instructions itself including examples and troubleshooting
and in my project prompted
"@image.png is my base icon file, create all the .ico files for this project using your image skill"
and it all went smoothly
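For anyone curious what the steps above end up doing under the hood, the generated skill is roughly a wrapper around an ImageMagick call like the one below. The exact flags Claude wrote into SKILLS.md may differ; this is an assumed example with placeholder paths:

```python
# Assumed example: turn a PNG into a multi-resolution .ico via ImageMagick 7's
# "magick" CLI. File names and icon sizes are placeholders.
import subprocess

def png_to_ico(src: str = "image.png", dst: str = "app.ico") -> None:
    subprocess.run(
        ["magick", src, "-define", "icon:auto-resize=256,128,64,48,32,16", dst],
        check=True,
    )

png_to_ico()
```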
For folks for whom this seems elusive, it's worth learning how the internals actually work; it helps a great deal in deciding how to structure things in general and then, over time, as the parent comment said, for specific individual cases.
The description is equivalent to your short term memory.
The skill is like your long term memory which is retrieved if needed.
These should both be considered as part of the AI agent. Not external things.
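One way to picture that split, as a very rough sketch (the matching logic and file layout are stand-ins, not Anthropic's implementation): short descriptions stay resident, and a skill's full body is only pulled in when a request appears to need it.

```python
# Descriptions always in context ("short term"); full skill bodies loaded on
# demand ("long term"). Matching is a naive placeholder.
SKILL_DESCRIPTIONS = {
    "image files": "Convert and resize images with ImageMagick.",
    "rest api": "Call REST APIs with curl.",
}

def load_skill_body(name: str) -> str:
    # Stand-in: a real agent would read the skill's markdown file from disk here.
    return f"(full instructions for the '{name}' skill, loaded on demand)"

def assemble_context(user_request: str) -> str:
    resident = "\n".join(f"- {k}: {v}" for k, v in SKILL_DESCRIPTIONS.items())
    needed = [k for k in SKILL_DESCRIPTIONS if k in user_request.lower()]
    retrieved = "\n\n".join(load_skill_body(k) for k in needed)
    return f"Available skills:\n{resident}\n\n{retrieved}\n\n{user_request}"

print(assemble_context("use the image files skill to make favicons"))
```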
You probably mean "starting from square one" but yeah I get you