Anecdotally, this new Sonnet model is massively falling apart on my tool-call-based workflows.
I’m having to handhold it through analysis tasks.
At one point it wrote a Python script that took the files it needed to investigate, iterated through them, ran `print(f"{i}. {file}")`, and then printed "Ready to investigate files…" And that's all the script did.
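For anyone curious, the script it produced was roughly shaped like this (a reconstruction from memory; the file names here are placeholders, not what was actually in my workspace):

```python
# Hypothetical reconstruction of the do-nothing script described above.
# Placeholder file list -- the real one came from my project directory.
files = ["report_q1.csv", "report_q2.csv", "notes.txt"]

def list_files(files):
    """Return the numbered listing the script printed."""
    return [f"{i}. {file}" for i, file in enumerate(files, 1)]

for line in list_files(files):
    print(line)

print("Ready to investigate files...")
# ...and that's where it stopped: nothing is ever opened or analyzed.
```

It enumerates the files, announces it's ready, and then exits without reading a single one of them.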
I have no idea what's going on with those benchmarks if this is real-world use.
I wonder if the 1M-token context length is along for this ride too?