Preferences

Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE results are barely an improvement yet somehow the model is a more significant improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.

Unless the main area of improvement was tools and scaffolding rather than the model itself.


This item has no comments currently.