This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.
The only thing I published that I ran against the preview model was the Claude code interpreter example.
I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...
https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...
The result of requesting an ‘extravagantly detailed’ version is quite impressive in its effort, if not quite its execution:
https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...
---
SYSTEM:
Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.
The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?
USER: Correct. You run them using your code tooling.
SYSTEM:
Realized inability to execute requested Python commands.

The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:
- I have the repl tool which runs JavaScript code in a browser environment
- I have artifacts for creating code/documents
- I have web_search and web_fetch
The user wants me to:
- Checkout https://github.com/simonw/llm
- Run pip install -e '.[test]'
- Run pytest
But I don't have a tool that can:
- Clone git repositories
- Run shell commands like pip or pytest
- Execute Python code in a command-line environment
The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.

I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?
---
My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.
Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.
What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.
EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.
For those who don't have time to dig into the gist, did it work and do a good job? I assume it at least nominally worked or you would have mentioned that, but any other thoughts on the solution it produced?
It's a very impressive model, but I think we have different views on what is complex.
If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.
Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.
And the wisest part is if he writes something they don't like, they can cut off that advanced access.
As is the longstanding tradition in games journalism, travel journalism, and suchlike.
You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.
And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.
If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.
They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.
I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
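If you want to spot this kind of thing yourself, here's a rough sketch of the check I mean (pelican.svg is just a stand-in name for whatever file you downloaded):

    # Rough check: is this SVG genuine vector art, or a raster image
    # embedded as base64 data inside an <image> element?
    from pathlib import Path

    svg = Path("pelican.svg").read_text()

    if "<image" in svg and "base64," in svg:
        print("Embedded raster image found - not real vector art")
    else:
        print("No embedded raster data detected")

A genuinely hand-drawn SVG should be all <path>, <circle> and friends, with no data: URIs in sight.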
Other things you can ask for that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.
They are a nice person.
We end up banning that sort of account, so if you would please review https://news.ycombinator.com/newsguidelines.html and stick to the rules from now on, that'd be good.
It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).
It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:

    Checkout https://github.com/simonw/llm, run pip install -e '.[test]', then run pytest
I then had it iterate on a pretty complex database refactoring task, described in my post.