Preferences

These benchmarks remain remarkably weak predictors of real-world work. If you're using this for day-to-day work, the eval that really matters is how the model handles a ten-step task. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...

I wonder if the 1M token context length is coming along for the ride too?


data-ottawa
Anecdotally, this new Sonnet model is massively falling apart on my tool-call-based workflows.

I’m having to handhold it through analysis tasks.

At one point it wrote a Python script that took the files it needed to investigate, iterated through them, ran `print(f"{i}. {file}")`, and then printed "Ready to investigate files…" And that's all the script did.
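Roughly, a reconstruction of that do-nothing script (the file names here are placeholders, not the actual files from my workflow):

```python
# Hypothetical reconstruction of the script described above;
# file names are made-up placeholders.
files = ["report_a.csv", "report_b.csv", "report_c.csv"]

# It enumerated the files it was supposed to analyze...
for i, file in enumerate(files, 1):
    print(f"{i}. {file}")

# ...then announced it was ready, and stopped. No analysis ever ran.
print("Ready to investigate files...")
```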

I have no idea what’s going on with those benchmarks if this is real world use.
