I am not saying that LLMs are better than your analysis suggests, but rather that average humans are worse. (Well-trained humans will continue to be better alone than LLMs alone for some time. But compare an LLM to an 18-year-old.)
So it is not that LLMs can't be better at tasks; it is that they have specific limits that are hard to discern. Pattern matching on the entire world of data is an opaque tool: we cannot easily perceive where the walls are, or where it falls completely off the rails.
Since it is not true intelligence, but at times a good mimic, we will continue to struggle with unexpected failures, as it just doesn't have an understanding of the task it is given.
Fallacy of affirming the consequent.
At least 3/4 of humans identify with a religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs. Dogma is almost identical to the doubling-down on hallucinations that LLMs produce.
I think what this shows about intelligence in general is that, without grounding in physical reality, it tends to hallucinate from some statistical model of reality and to confabulate further ungrounded statements, absent strong and active efforts to ground each statement in reality. LLMs have the disadvantage of having no real-time grounding in most instantiations (Gato and related robotics projects excepted). This is not so much a problem with transformers as with the lack of feedback tokens in most LLMs. Pretraining on ground-truth texts can give an excellent prior probability of next tokens, and I think feedback, either in the weights (continuous fine-tuning) or as real-world feedback tokens appended in response to outputs, can get transformers to hallucinate less in the long run (e.g. after responding to feedback when OOD).
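To make the feedback-as-tokens idea concrete, here is a minimal toy sketch. Everything in it is hypothetical: `call_model` is a stand-in for whatever LLM API you use, and the grounding check is a trivial dictionary lookup rather than a real sensor, tool, or retrieval step. The only point is that the verdict from outside the model re-enters it as ordinary tokens in the next context.

```python
# Toy sketch of "real-world feedback as tokens": the model's output is checked
# against an external source of ground truth, and the verdict is appended to
# the context before the next generation attempt.

GROUND_TRUTH = {"capital of France": "Paris"}  # stand-in for any grounding oracle

def call_model(context: str) -> str:
    # Hypothetical model call; replace with a real LLM API.
    # It "hallucinates" until it sees feedback, just to keep the sketch runnable.
    return "Lyon" if "FEEDBACK" not in context else "Paris"

def grounded_answer(question: str, max_rounds: int = 3) -> str:
    context = f"Q: {question}\nA:"
    answer = ""
    for _ in range(max_rounds):
        answer = call_model(context).strip()
        truth = GROUND_TRUTH.get(question)
        if truth is None or answer == truth:
            return answer  # no oracle available, or the answer is grounded
        # Feedback enters the model the same way everything else does: as tokens.
        context += f" {answer}\nFEEDBACK: that is incorrect.\nA:"
    return answer

print(grounded_answer("capital of France"))
```

Continuous fine-tuning would instead push that same signal into the weights; the sketch only shows the cheaper in-context variant.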
Comparing humans to transformers is actually an instance of the phenomenon; we have an incomplete model of "intelligence" and we posit that humans have it but our model is only partially grounded. We assume humans ~100% have intelligence, are unsure of which animals might be intelligent, and are arguing about whether it's even well-typed to talk about transformer/LLM intelligence.
Non-religious people are not exempt. Everyone has a worldview (or prior commitments, if you like) through which they understand the world. If you encounter something that contradicts your worldview directly, even repeatedly, you are far more likely to "hallucinate" an understanding of the experience that allows your worldview to stay intact than to change your worldview.
I posit that humans are unable to function without what amounts to a religion of some sort -- be it secular humanism, nihilism, Christianity, or something else. When one is deposed at a societal level, another rushes in to fill the void. We're wired to understand reality through definite answers to big questions, whatever those answers may be.
So too are humans, it turns out.
We can understand concepts from the rules alone; LLMs must train on millions of examples. A human can play a game of chess from reading the instruction manual without ever witnessing a single game. This is distinctly different from pattern-matching AI.
All of these claims, based on benchmarks, don't hold up in the real world on real-world tasks, which is strongly supportive of the statistical model: it will be capable of answering patterns it was extensively trained on, but it quickly breaks down when you step outside that distribution.
o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated; unless you are using it on material about which you are extremely knowledgeable, you won't know.
LLMs are probability machines, which means they will mostly produce content that aligns with the common distribution of the data. They don't analyze what is correct, only what are probable completions of your text given common word distributions. But scaled up to an incomprehensible number of combinatorial patterns, this does create a convincing mimic of intelligence, and it does have its uses.
But importantly, it diverges from the behaviors we would see in true intelligence in ways that make it inadequate for many of the kinds of tasks we are hoping to apply it to, chief among them the significantly unpredictable behavior. There is just no way to know which query/prompt will require operating over concepts outside the training set.
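As a toy illustration of the "probability machine" point and the off-distribution breakdown: the probability table below is made up for the example (a real LLM computes such a distribution with a neural network over a huge vocabulary), but the generation step really is just sampling the next token from a conditional distribution.

```python
# Toy "probability machine": given a context, pick the next token by sampling
# from a probability distribution. The numbers are invented for illustration.
import random

NEXT_TOKEN_PROBS = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "quantized": 0.1},
    ("cat", "sat"): {"on": 0.8, "quietly": 0.2},
}

def sample_next(context: tuple[str, str]) -> str:
    # Off-distribution contexts have no learned options; fall back to a junk token.
    dist = NEXT_TOKEN_PROBS.get(context, {"<unk>": 1.0})
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

print(sample_next(("the", "cat")))   # usually "sat": the most probable completion
print(sample_next(("dog", "flew")))  # outside the "training data": <unk>
```

When the context matches patterns seen during training, the completion looks sensible; when it doesn't, the machine still produces something, just with nothing behind it.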