Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench scores barely improved, yet the model is somehow a much bigger improvement on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was tools and scaffolding rather than the model itself.