Oh, and I agree so much. I just shared a quick first observation from a real-world testing scenario (BTW, I re-ran Sonnet 4.5 with the same prompt and not much changed). I keep seeing LLM providers optimize for benchmarks, but then I cannot reproduce their results in my own projects.
I will keep trying, because Claude 4 is generally a very strong line of models. Anthropic sat on the AI coding throne for months before OpenAI dethroned them with GPT-5 and Codex CLI (and now GPT-5-Codex).
And sure, I do want to keep them competing so they push each other to get even better.
1. Different LLMs require different prompts and context.
2. They ignore LLMs' non-determinism; you should run the experiment several times (see the sketch after this list).
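To make point 2 concrete, here is a minimal sketch of what I mean by repeating the run; `call_model` and `score` are hypothetical stand-ins for whatever provider SDK and grading you actually use, not any real API:

```python
import statistics

N_RUNS = 5  # assumed repeat count; pick whatever your budget allows


def call_model(prompt: str) -> str:
    """Stand-in for your provider's SDK call (Claude, GPT-5, etc.) -- replace it."""
    return "dummy output"


def score(output: str) -> float:
    """Stand-in grading function, e.g. did the generated code pass your tests -- replace it."""
    return 1.0 if output else 0.0


def evaluate(prompt: str, runs: int = N_RUNS) -> None:
    # Run the identical prompt several times so one lucky or unlucky
    # sample does not decide the comparison between models.
    scores = [score(call_model(prompt)) for _ in range(runs)]
    print(
        f"runs={runs} "
        f"mean={statistics.mean(scores):.2f} "
        f"min={min(scores):.2f} max={max(scores):.2f} "
        f"stdev={statistics.stdev(scores):.2f}"
    )


if __name__ == "__main__":
    evaluate("your real-world task prompt here")
```

Nothing fancy, just report mean and spread per model instead of a single run, and the spread is often big enough to change which model "wins".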