Or is there a more subtle issue that prevents this or makes it hard?
Is there something fundamentally impossible about having a model recognize that counting the Rs in 'strawberry' is a string-search operation and then, in some sandbox, execute something like:
% echo "strawberry" | tr -dc "r" | wc -c
3
It seems agents do this already, but regular GPT-style environments seem to lack it? Anyway, let me refresh my page, as I am sure some new model architecture is dropping while I type this. ;)
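For what it's worth, here is a minimal sketch of what that routing could look like, in Python with entirely made-up function names and a toy pattern matcher (not any real agent framework's API): detect that the question is a letter-counting task and hand it to an exact string operation instead of answering from token statistics.

    import re

    def count_letter(word: str, letter: str) -> int:
        # The exact "sandboxed tool": plain character counting.
        return word.lower().count(letter.lower())

    def answer(question: str) -> str:
        # Toy detector: if the question looks like 'how many <letter>s in <word>',
        # dispatch to the exact tool; otherwise fall back to the model.
        m = re.search(r"how many ['\"]?(\w)['\"]?s? .* ['\"]?(\w+)['\"]?", question, re.I)
        if m:
            return str(count_letter(m.group(2), m.group(1)))
        return "(fall back to the language model)"

    print(answer('How many "r"s are in "strawberry"?'))  # prints 3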
Start with Jacob.
Jacob’s son → call him A.
A’s son → call him B.
B’s son → call him C.
C’s son → call him D (this is “the son of Jacob’s son’s son’s son”).
Now the question asks for the paternal great-great-grandfather of D:
D’s father → C
D’s grandfather → B
D’s great-grandfather → A
D’s great-great-grandfather → Jacob
Answer: Jacob
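If it helps, the chain is trivial to sanity-check mechanically. A toy sketch in Python, using the placeholder names A–D from above (obviously not how an LLM produces the answer):

    # Child -> father mapping for the chain Jacob -> A -> B -> C -> D
    # (each arrow is "father of").
    fathers = {"A": "Jacob", "B": "A", "C": "B", "D": "C"}

    def paternal_ancestor(person: str, generations: int) -> str:
        # Walk up the paternal line the given number of generations.
        for _ in range(generations):
            person = fathers[person]
        return person

    # D is the son of Jacob's son's son's son; his great-great-grandfather
    # is four generations up the paternal line.
    print(paternal_ancestor("D", 4))  # prints Jacob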
Also, next time you should at least bother to copy-paste your question into any recent LLM, since they can all solve it without issue. But hallucinations like this are common with non-reasoning HN users.
Don’t think so. Humans solve that puzzle in a very different way from how LLMs ”reason” about it.
(DeepThink did wonder if it was supposed to be him afterwards or if it was a trick.)
Adding a second question like ”Is Abraham included in the family tree?” still makes it regress into mentioning Isaac, Judah, Joseph, 12 sons and whatnot.
There may be additional feedback loops, but fundamentally that is what it is doing. Sure, it will show you the steps it takes to arrive at a conclusion, but it is just predicting the steps, the conclusion, and the likely validity of both from its training data; it is not actually evaluating the logic or the truthiness of the output.
If you don’t believe me, ask your ”reasoning” LLM this question: What’s the name of the paternal great-great-grandfather of the son of Jacob’s son’s son’s son?