It is also out of date as it does not include 5.2 Codex.
Per my point about steerability being compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence - GPT 5.2 is about 30% ahead of Opus 4.5 there. No wonder Claude Code needs those harness features so the user can manually rein in its instruction following.
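For what it's worth, the "30% ahead" figure is a relative gap, not a difference in percentage points. A quick sketch of the arithmetic, using only the two scores quoted above (the function name `score_gap` is just illustrative):

```python
def score_gap(baseline: float, other: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative gap in percent)
    of `other` over `baseline`."""
    absolute = other - baseline
    relative = (other - baseline) / baseline * 100
    return absolute, relative


# Scores as quoted in the comment above: Opus 4.5 at 58%, GPT 5.2 at 75%.
opus_score, gpt_score = 58.0, 75.0
abs_gap, rel_gap = score_gap(opus_score, gpt_score)
print(f"{abs_gap:.0f} points absolute, {rel_gap:.1f}% relative")
# prints "17 points absolute, 29.3% relative"
```

So the gap is 17 points on the benchmark's own scale, which works out to roughly 29% relative to the Opus 4.5 score.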
> See: https://artificialanalysis.ai
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
https://x.com/METR_Evals/status/2002203627377574113
> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
What an insane take for anybody who uses these models daily.
LM Arena shows Claude Opus 4.5 on top
In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kinds of coding tasks these models are being RL-trained on. Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development, etc., in different languages?
See: https://artificialanalysis.ai