Comment by whywhywhywhy

whywhywhywhy Sep 30, 2025

Kinda pointless listening to the opinions of people who've used previews, because it's not gonna be the same model you'll experience once it gets downgraded to be viable under mass use. The benchmarks influencers use are all in the training data now and tested internally, so any sort of testing like pelicans on bikes is just PR at this point.

benterix Sep 30, 2025

Yeah I remember these GPT-5 demos from influencers like "it practically created a whole 3D modeller for me" and then once we got the real thing it sometimes looked like a dumbed down version of the previous iteration.

simonw Sep 30, 2025

I learned that lesson from GPT-5, where the preview was weeks long and the models kept changing during that period.

This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.

The only thing I published that I ran against the preview model was the Claude code interpreter example.

I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...

belter 5 days ago

Testing this, it's way more aggressive than the previous model about throttling back, and about message token lengths. It constantly stops in the middle of an action if it's not a simple request. I presume you did not have resource limitations during the preview?

simonw 5 days ago

No, the preview was effectively unlimited usage (for two days).

whywhywhywhy OP Sep 30, 2025

Yesterday someone posted an example of the same prompt but with the pelican changed to a human, and it was basically trash; the example you've posted actually looks good, all things considered. So yeah, I do think it's something they train on, the same way they train on things in the benchmarks.

simonw Sep 30, 2025

The easy way to tell is to try it yourself - run "Generate an SVG of a pelican riding a bicycle" and then try "Generate an SVG of an otter riding a skateboard" and see if the quality of the images seems similar.
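A minimal sketch of that side-by-side check, for anyone who wants to script it. The `generate` argument is a hypothetical stand-in for whatever function you write to send a prompt to the model and return its text response; no particular API is assumed, and the helper names are made up for illustration. Judging whether the two images are of similar quality is still done by eye.

```python
from typing import Callable, Dict

# The two prompts from the comment above, built from one template so the
# wording is identical apart from the subject and the vehicle.
PROMPT_TEMPLATE = "Generate an SVG of {subject} riding {vehicle}"


def build_prompts() -> Dict[str, str]:
    """Return the well-known prompt and a control prompt, keyed by label."""
    return {
        "pelican": PROMPT_TEMPLATE.format(subject="a pelican", vehicle="a bicycle"),
        "otter": PROMPT_TEMPLATE.format(subject="an otter", vehicle="a skateboard"),
    }


def run_comparison(generate: Callable[[str], str]) -> Dict[str, str]:
    """Send each prompt through `generate` (hypothetical, user-supplied)
    and return the raw responses keyed by label."""
    return {label: generate(prompt) for label, prompt in build_prompts().items()}


def looks_like_svg(text: str) -> bool:
    """Crude sanity check that a response contains an SVG element.
    This says nothing about how good the drawing is."""
    return "<svg" in text and "</svg>" in text
```

Save each response to a `.svg` file and open the pair in a browser; if the pelican is dramatically better than the otter, that would be a hint of targeted training.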

solarwindy 2 days ago

How about a narwhal spacewalking from the ISS, with Earth visible below (specifically the Niger delta)?

https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...

Requesting an ‘extravagantly detailed’ version is quite impressive in the effort, if not quite the execution:

https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...

fragmede Sep 30, 2025

Well, if they produced a really really really good image for pelicans on bicycles and nothing else, then their cheating would be obvious, so it makes sense to cheat just a little bit, across the board (if we want to assume they're cheating).
