I find it’s predictive of relative performance in other tasks I use LLMs for. Claude is the best. The only shortcoming is its peculiar verbosity.
Definitely superior to anything OpenAI has and miles beyond the “open weights” alternatives like Llama.
For example, even the new 3.5 Sonnet can't solve this reliably:
> Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?
In fact, not only is its solution wrong, but it also can't figure out why it's wrong when you ask it to self-check.
In contrast, GPT-4o consistently gives the correct response.
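For reference, this is the classic wolf, goat, and cabbage puzzle with the roles shuffled (the cacodemon plays the goat), so the expected answer can be checked mechanically. Here is a minimal sketch of a brute-force search in Python. The names `COMPANIONS`, `FORBIDDEN`, `is_safe`, and `solve` are my own choices for illustration, not from any library:

```python
from collections import deque

COMPANIONS = ("bunny", "cacodemon", "scientist")
# Pairs that must not be left together without the Doom Slayer.
FORBIDDEN = [{"bunny", "cacodemon"}, {"cacodemon", "scientist"}]


def is_safe(slayer_on_deimos, on_deimos):
    """True if no forbidden pair is on the side the Slayer is not on."""
    on_phobos = frozenset(COMPANIONS) - on_deimos
    unattended = on_phobos if slayer_on_deimos else on_deimos
    return not any(pair <= unattended for pair in FORBIDDEN)


def solve():
    """Breadth-first search over (Slayer's side, companions on Deimos)."""
    start = (False, frozenset())
    goal = (True, frozenset(COMPANIONS))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (slayer_on_deimos, on_deimos), path = queue.popleft()
        if (slayer_on_deimos, on_deimos) == goal:
            return path
        # Companions on the Slayer's current side; None means he goes alone.
        if slayer_on_deimos:
            same_side = on_deimos
        else:
            same_side = frozenset(COMPANIONS) - on_deimos
        for passenger in [None, *sorted(same_side)]:
            new_deimos = on_deimos
            if passenger is not None:
                # Toggle which moon the passenger is on.
                new_deimos = on_deimos ^ {passenger}
            state = (not slayer_on_deimos, new_deimos)
            if state not in seen and is_safe(*state):
                seen.add(state)
                destination = "Phobos" if slayer_on_deimos else "Deimos"
                queue.append((state, path + [(passenger, destination)]))
    return None  # No safe sequence exists.


if __name__ == "__main__":
    for passenger, destination in solve():
        print(f"Teleport to {destination} with {passenger or 'nobody'}")
```

The shortest answer takes seven teleports, and the cacodemon has to go first because it is the only companion that can't be left with either of the others. It then has to be brought back once in the middle, which is the step models tend to miss.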