It's not strange at all. I am playing with lambda calculus and combinatory logic right now as a base for mathematics (my interest is in understanding rigorous thinking). You can express any computation using just the S and K combinators; the price is that the computations will be rather slow. To speed things up, you can add extra combinators and reduction rules (a good example is the clapp() function in https://github.com/tromp/AIT/blob/master/uni.c).
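To make that concrete, here is a minimal sketch of the kind of reducer I have in mind, in Haskell. The term representation and the extra I rule are my own illustration of the idea, not what uni.c actually does:

    -- Minimal SK reducer with one extra "speed-up" combinator, I.
    data Term = S | K | I | App Term Term deriving (Show, Eq)

    -- One leftmost-outermost reduction step; Nothing means normal form.
    step :: Term -> Maybe Term
    step (App (App K x) _)         = Just x                          -- K x y   -> x
    step (App (App (App S x) y) z) = Just (App (App x z) (App y z))  -- S x y z -> x z (y z)
    step (App I x)                 = Just x                          -- extra rule: I x -> x
    step (App f a)                 = case step f of
                                       Just f' -> Just (App f' a)
                                       Nothing -> App f <$> step a
    step _                         = Nothing

    -- Reduce to normal form.
    nf :: Term -> Term
    nf t = maybe t nf (step t)

    -- The extra rule is consistent because I is definable as S K K:
    --   S K K x -> K x (K x) -> x
    -- so "I x -> x" merely shortcuts two base steps into one.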
Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get a wrong result. But if an inconsistent rule is complicated enough to fire only infrequently, you will still get the correct result most of the time.
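To illustrate with the sketch above (reusing Term, step and nf), here is a hypothetical shortcut that is inconsistent, with a hand trace; both are made up for illustration:

    -- A tempting but inconsistent shortcut, bolted onto the sketch above.
    badStep :: Term -> Maybe Term
    badStep (App (App S K) x) = Just x    -- claims "S K x -> x", wrong in general
    badStep t                 = step t

    badNf :: Term -> Term
    badNf t = maybe t badNf (badStep t)

    -- The consistent shortcut would be S K x -> S K K (the identity),
    -- because S K x y -> K y (x y) -> y for every x and y.
    -- Taking x = K exposes the difference:
    --   nf:     S K K a b  ->  K a (K a) b  ->  a b
    --   badNf:  S K K      ->  K,  and then  K a b  ->  a
    -- Because the pattern S K x is rare, a reducer with the bad clause still
    -- returns the correct normal form for most inputs.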
Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules applied to a somewhat fuzzily known set of combinators (programs), each represented by a token (with the term itself represented by the embedding vector). However, the rules learned are not necessarily consistent (just as in the source data), so you get the occasional logical error (I don't want to call it hallucination, because it is a different phenomenon from the nondeterminism and extrapolation of LLMs).
This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well, due to the logical principle of explosion. It also clearly cannot be completely fixed with more training data.
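For reference, the principle of explosion is the fact that a single contradiction proves every proposition. In Lean 4 it is a one-liner:

    -- Ex falso quodlibet: from P and ¬P, any Q follows.
    example (P Q : Prop) (hp : P) (hnp : ¬ P) : Q := absurd hp hnp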
(There is also an analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... The pre-rigorous stage corresponds to a somewhat random set of likely inconsistent logical rules, the rigorous stage to a small set of obviously consistent rules (like only S and K), and the post-rigorous stage to a large set of rules that have been vetted for consistency.)
What is the "solution" to this? Well, I think that during training you somehow need to make sure that the rules the transformer learns are logically consistent, at least for the strictly logical fragment of human language relevant to logical and programming problems. That is admittedly not an easy task (I doubt it is even possible within the NN framework).