Comment by shanecp - Hacker Neue

shanecp Sep 30, 2025 parent

What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points to check against. It's similar to 'lab conditions'; you won't get that output in real-world situations.

Bjorkbat Sep 30, 2025

Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench results are barely an improvement, yet somehow the model is a much bigger improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.

Unless the main area of improvement was tools and scaffolding rather than the model itself.
