This Claude preview lasted from Friday to Monday, so I was less worried about major model changes. Just to be safe, I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date).
The only thing I published that I ran against the preview model was the Claude code interpreter example.
I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...
https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...
The result of requesting an ‘extravagantly detailed’ version is quite impressive in its effort, if not quite its execution:
https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...