I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/

It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).

It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:

  Checkout https://github.com/simonw/llm and run the tests with
  
  pip install -e '.[test]'
  pytest

I then had it iterate on a pretty complex database refactoring task, described in my post.

whywhywhywhy
Kinda pointless listening to the opinions of people who've used previews, because it's not gonna be the same model you'll experience once it gets downgraded to be viable under mass use. The benchmarks influencers use are all in the training data now and tested internally, so any sort of testing like pelicans on bikes is just PR at this point.
benterix
Yeah, I remember these GPT-5 demos from influencers, like "it practically created a whole 3D modeller for me", and then once we got the real thing it sometimes looked like a dumbed-down version of the previous iteration.
simonw OP
I learned that lesson from GPT-5, where the preview was weeks long and the models kept changing during that period.

This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.

The only thing I published that I ran against the preview model was the Claude code interpreter example.

I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...

belter
Testing this, it's way more aggressive than the previous model about throttling back, and about message token lengths. It constantly stops in the middle of an action if it's not a simple request. I presume you did not have resource limitations during the preview?
simonw OP
No, the preview was effectively unlimited usage (for two days).
whywhywhywhy
Yesterday someone posted an example of the same prompt but changed to a human, and it was basically trash; the example you've posted actually looks good, all things considered. So yeah, I do think it's something they train on, the same way they train on things in the benchmarks.
simonw OP
The easy way to tell is to try it yourself - run "Generate an SVG of a pelican riding a bicycle" and then try "Generate an SVG of an otter riding a skateboard" and see if the quality of the images seems similar.
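If you want to script that comparison rather than do it by hand, here's a minimal sketch using the Python API of my llm library (https://llm.datasette.io/). The model ID is an assumption - run "llm models" to see what your install actually provides:

  # Sketch: run both prompts and save the SVGs for a side-by-side look.
  # Assumes the llm-anthropic plugin is installed; adjust the model ID.
  import llm

  model = llm.get_model("claude-sonnet-4.5")  # hypothetical alias
  prompts = {
      "pelican-bicycle.svg": "Generate an SVG of a pelican riding a bicycle",
      "otter-skateboard.svg": "Generate an SVG of an otter riding a skateboard",
  }
  for filename, prompt in prompts.items():
      with open(filename, "w") as f:
          f.write(model.prompt(prompt).text())
      print("wrote", filename)
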
solarwindy
How about a narwhal spacewalking from the ISS, with Earth visible below (specifically the Niger delta)?

https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...

Requesting an ‘extravagantly detailed’ version yields something quite impressive in effort, if not quite in execution:

https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...

fragmede
Well, if they produced a really, really good image for pelicans on bicycles and nothing else, then their cheating would be obvious, so it makes sense to cheat just a little bit, across the board (if we want to assume they're cheating).
icelancer
Huh. Doesn't work over here.

---

SYSTEM:

Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.

The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?

USER: Correct. You run them using your code tooling.

SYSTEM:

Realized inability to execute requested Python commands. The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:

- I have the repl tool which runs JavaScript code in a browser environment
- I have artifacts for creating code/documents
- I have web_search and web_fetch

The user wants me to:

- Checkout https://github.com/simonw/llm
- Run pip install -e '.[test]'
- Run pytest

But I don't have a tool that can:

- Clone git repositories
- Run shell commands like pip or pytest
- Execute Python code in a command-line environment

The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.

I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?

icelancer
When pressed:

---

My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.

Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.

What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.

EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.

simonw OP
Yes! Forgot to mention that, and it's weird that you can have one but not both enabled. I wrote more about that here: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
sophiebits
You need to enable the new code interpreter mode: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
mrheosuper
Interesting: enable those settings and the Claude on claude.ai becomes Claude Code, and it tries to run everything in the Claude container like it owns the machine. I don't want that.
magicalist
> I told it to Give me a zip file of everything you have done so far—you can explore the contents of the file it made me in this Gist.

For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?

simonw OP
I was impressed. Integrating it into my project would require me to spend a bunch more time reviewing it, but it did a better job than I was expecting. The tests it wrote were a particularly useful exploration of the problem space: https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b...
Snuggly73
Ignoring the tests, the first change was adding a single parent id column, and the second "more complex" refactoring added a few more hash columns to the table (after you'd specified that you wanted them, i.e. it was not an open-ended question).

It's a very impressive model, but I think we have different views on what is complex.

ethanpil
Had a similar issue - wanted to get all the files from the response without too much work, so I opened a new tab and vibe-coded this in about 4 minutes. Tested it on exactly one case, a previous Sonnet 4.5 response, and it worked well.

https://github.com/ethanpil/claude-files-creator

deevus
I was worried for a minute that the implementation wasn't production ready. Thankfully, Claude mentioned it right at the end.
fragmede
Your notes on 4.5 were very interesting, but you asked it a question that only you/someone who already knows the code could ask. I don't, so I asked it at a higher level: Claude, add tree-structured conversations to https://github.com/simonw/llm. Claude responded with a whole design doc, starting with a database schema change (using the same column name, even!): https://claude.ai/share/f8f0d02a-3bc1-4b48-b8c7-aa75d6f55021 As I don't know your code, that design doc looks cromulent, but you'd have to read it for yourself to decide how well it did with that higher level of ask.
gloosx
I am curious how the sandbox handles potentially malicious code. For example, what would happen if someone tried to run something like a crypto miner or a DDoS script?
simonw OP
Code running in the Claude environment can't make outbound network connections to anything more than a tiny list of domains (GitHub, PyPI, NPM) so DDoS targets are limited.

If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
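That's also easy to probe from inside the environment. A rough sketch - the candidate list here is a guess, not an official allowlist:

  # Probe outbound connectivity from inside the sandbox.
  # The domain list is a guess at the allowlist, not documentation.
  import socket

  candidates = [
      "github.com",
      "pypi.org",
      "registry.npmjs.org",
      "example.com",  # expected to be blocked
  ]
  for host in candidates:
      try:
          socket.create_connection((host, 443), timeout=5).close()
          print(host, "reachable")
      except OSError as exc:
          print(host, "blocked:", exc)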

gloosx
This would be interesting to test in a controlled way; I'm curious what the per-run CPU/time/memory quotas, isolation method, and syscall filtering look like.

Would appreciate pointers to any documentation or tests.
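Absent docs, pasting something like this into the interpreter would be a starting point (standard Linux introspection only; limits enforced via cgroups or seccomp may not show up here):

  # Inspect whatever limits the container exposes. These are standard
  # POSIX rlimits; quotas enforced elsewhere won't appear in this list.
  import os
  import resource

  print("CPU cores visible:", os.cpu_count())
  for name in ("RLIMIT_CPU", "RLIMIT_AS", "RLIMIT_NPROC", "RLIMIT_NOFILE"):
      soft, hard = resource.getrlimit(getattr(resource, name))
      print(name, "soft:", soft, "hard:", hard)

  # cgroup v2 memory limit, if the file is exposed
  try:
      with open("/sys/fs/cgroup/memory.max") as f:
          print("cgroup memory.max:", f.read().strip())
  except OSError:
      print("cgroup memory.max: not readable")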

luckydata
New models are always magical; let's see how it feels after the cost-cutting measures get implemented in 2-3 months.
startupsfail
safety/security patches
mclau153
Could it pip install pygame and make a game playable in browser?
lilouartz
That's cool. Didn't know it was capable of doing that!
kshacker
Sorry, I downvoted this by mistake. Got confused between collapse and the down arrow (which I rarely use). Does anyone know how to reverse a downvote? I do not see an option for it.
Fishkins
You should see an "unvote" or "undown" link to the right of the timestamp (i.e. the opposite side from where the vote arrows were). It's fairly subtle.
kurtis_reed
Why did you have access to a preview?
simonw OP
I get access to previews from OpenAI, Anthropic and Gemini pretty often. They're usually accompanied by an NDA and an embargo date - in this case the embargo was 10am Pacific this morning.

I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.

dzhiurgis
Soooo that leaves xAI that had conditions
Redster
Simonw is a cheerful and straightforward AI journalist who likes to show and not just tell. He has done a good job aggregating and documenting the progress of LLM tools and models. As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.

Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.

tootie
Simon has a popular blog, but he's also co-creator of Django and very well-known in the Python community.
michaelt
> As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.

And the wisest part is if he writes something they don't like, they can cut off that advanced access.

As is the longstanding tradition in games journalism, travel journalism, and suchlike.

simonw OP
If they do that I'll go back to writing about them after they ship. Not a big loss for me at all.
tripzilch
I get it, you would trust yourself if you said that, but it doesn't really matter whether you say it or not. What counts for your ongoing credibility is whether you preface every future blog post with whether you got special access, a special deal, sponsorship, or none of those things.

You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.

And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.

knowsuchagency
I like Simon, but he's not a journalist. A journalist would not have gone to OpenAI to glaze the GPT-5 release with Theo. I don't say this to discount Simon -- I appreciate his writing and analysis but a journalist, he isn't.
simonw OP
I don't call myself a journalist, partly because no publication is paying me to do any of this!

If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.

Redster
That's a fair point. I feel like he's more than a blogger, and I'm not sure of the best term!
LudwigNagasena
An influencer.
asadotzler
AI blogger seems more appropriate than journalist.
Are you aware of any "AI journalists"? Because simonw does great work, so perhaps blogger is what people should aspire towards?
simonw OP
I actually talk to journalists on the AI beat quite often - I've had good conversations with them at publications including The Economist, the NY Times, the Washington Post, and Ars Technica.

They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.

I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...

grim_io
Not sure what an AI journalist is supposed to be or do, but the lack of one does not automatically promote someone who isn't one into the position.
landl0rd
Kylie Robison recently moved to Wired and is a solid "AI journalist".
rapfaria
His "pelican riding a bicycle" tests are now a classic and AI shops are benchmaxxing for it
simonw OP
They need to benchmaxxx a whole lot harder; the illustrations still all universally suck!
I fully expect a model to output an SVG made up of 1000x1000 rectangles (i.e. pixels) representing a raster image of a beautifully hand-drawn pelican riding a bicycle any day now :)
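(It's only a few lines, too. A toy sketch with Pillow - the input filename is made up:)

  # Toy version of the joke: wrap a raster as one SVG <rect> per pixel.
  # Requires Pillow; "pelican.png" is a hypothetical input image.
  from PIL import Image

  img = Image.open("pelican.png").convert("RGB")
  w, h = img.size
  rects = "".join(
      f'<rect x="{x}" y="{y}" width="1" height="1" fill="rgb{img.getpixel((x, y))}"/>'
      for y in range(h)
      for x in range(w)
  )
  with open("pelican-pixels.svg", "w") as f:
      f.write(f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">{rects}</svg>')
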
simonw OP
I got an amazing result from ChatGPT a while back - an SVG with a perfect illustration of a pelican riding a bicycle.

It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
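That kind of fake is at least easy to detect. A quick sniff test (a sketch; it assumes the bitmap is smuggled in via an <image> element with a base64 data URI and double-quoted attributes, which is what the trick usually looks like):

  # Flag an "SVG" that is really a bitmap in a wrapper: look for an
  # <image> element carrying a base64 data URI.
  import re
  import sys

  svg = open(sys.argv[1]).read()
  if re.search(r'<image[^>]*href="data:image/(?:png|jpe?g);base64,', svg):
      print("Embedded raster found: this SVG is a bitmap in a wrapper.")
  else:
      print("No embedded raster detected.")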

sixeyes
I've got such pixelated rectangle SVGs a few times.

Also with Cursor: "write me a script that outputs X as an SVG" has given me rectangles a few times.

astrange
If they were testing that it'd work more often.

Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.

Sharlin
Well, I definitely hope they aren't trying to teach LLMs directions between locations, given what an idiotic use of compute and parameter space that would be. We already have excellent AIs for route planning. What they ought to optimize for is, of course, finally teaching them to say they don't know, or just automatically opting to call a route-planning API if the user asks for directions.
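The route-planning-API option is just ordinary tool use. A minimal sketch with the Anthropic Python SDK - the get_directions tool and its schema are invented for illustration, and the model ID may differ from your account's:

  # Routing as a tool: the model asks for directions instead of
  # inventing them. Tool name/schema are hypothetical; dispatch the
  # request to a real routing service (OSRM, Google, etc.) yourself.
  import anthropic

  client = anthropic.Anthropic()
  tools = [{
      "name": "get_directions",  # hypothetical tool
      "description": "Turn-by-turn directions between two named places.",
      "input_schema": {
          "type": "object",
          "properties": {
              "origin": {"type": "string"},
              "destination": {"type": "string"},
          },
          "required": ["origin", "destination"],
      },
  }]
  response = client.messages.create(
      model="claude-sonnet-4-5",  # assumption: adjust to a current ID
      max_tokens=1024,
      tools=tools,
      messages=[{"role": "user", "content": "Directions from Soho to Camden?"}],
  )
  for block in response.content:
      if block.type == "tool_use":
          print("Model requested:", block.name, block.input)  # dispatch here
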
minimaxir
Simon tends to write up reports of new LLM releases (reports that command great community respect), and that's much easier to do with lead time, if the provider is able to set up a preview endpoint.
criddell
I believe the criticism is that he's reporting on a pre-release LLM, which isn't the same as the one you and I will be using a few weeks from now, after they've downgraded it enough to work at scale.
lossolo
The same reason YouTube reviewers and influencers get access to hardware or games before release. In this case, the person is a passionate blogger.
runjake
simonw is Simon Willison, who’s well known for a number of things. But these days, he’s well known for his AI centric blog and his tools. The AI companies give him early access to stuff.

https://simonwillison.net/

kissgyorgy
If you want to keep up with AI progress and model updates, simonw is the man to follow!
lomase
They are an AI evangelist who told me I can replace any technical book ever created with an LLM.

They are a nice person.

rhizome
You are correct, sir!