Preferences

These benchmarks remain remarkably weak predictors of real-world work. If you're using this for day-to-day work, the eval that really matters is how the model handles a ten-step task. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...

I wonder if the 1M token context length is coming along for the ride too?


data-ottawa
Anecdotally, this new Sonnet model is massively falling apart on my tool-call-based workflows.

I’m having to handhold it through analysis tasks.

At one point it wrote a Python script that took the files it needed to investigate, iterated through them, ran `print(f"{i}. {file}")`, and then printed "Ready to investigate files…" And that's all the script did.
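Roughly, a reconstruction of that do-nothing script (the file names here are placeholders, not the actual files from my workflow):

```python
# Hypothetical reconstruction of the script described above;
# file names are made-up placeholders.
files = ["report_a.csv", "report_b.csv", "report_c.csv"]

# It enumerated the files it was supposed to analyze...
for i, file in enumerate(files, 1):
    print(f"{i}. {file}")

# ...then announced it was ready, and stopped. No analysis ever ran.
print("Ready to investigate files...")
```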

I have no idea what’s going on with those benchmarks if this is real world use.
