Building something from scratch where there are plenty of public examples on GitHub seems to be the easiest case. Put these agents on a real existing codebase and ask them to fix a bug and they become useless.
I think this would vary a lot between "real" codebases. I have had a lot of success with somewhat stricter frameworks that have typed interfaces, require well-defined unit tests, and use modules which encapsulate a lot of logic.
Basically like Java Spring Boot or NestJS type projects.
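To make that concrete, here is a rough sketch of the kind of structure I mean: a typed interface, a small module that hides its logic behind it, and a unit test that pins the behavior down. All the names (`PriceRule`, `PercentDiscount`, `CheckoutService`) are made up for illustration and are not from any real project or framework.

```typescript
// Hypothetical example: every name here is invented for illustration.

// A typed interface: the only contract callers (or an agent) need to know.
interface PriceRule {
  apply(subtotalCents: number): number;
}

// The rule's logic is encapsulated; invalid input is rejected up front.
class PercentDiscount implements PriceRule {
  private readonly percent: number;

  constructor(percent: number) {
    if (percent < 0 || percent > 100) {
      throw new RangeError("percent must be between 0 and 100");
    }
    this.percent = percent;
  }

  apply(subtotalCents: number): number {
    return Math.round(subtotalCents * (1 - this.percent / 100));
  }
}

// The service depends on the interface, not on concrete rule classes.
class CheckoutService {
  private readonly rules: PriceRule[];

  constructor(rules: PriceRule[]) {
    this.rules = rules;
  }

  // Applies each rule in order to the running amount.
  total(subtotalCents: number): number {
    return this.rules.reduce((amount, rule) => rule.apply(amount), subtotalCents);
  }
}

// A well-defined unit test: a narrow, checkable target for any bug fix.
function testCheckoutAppliesRulesInOrder(): void {
  const service = new CheckoutService([
    new PercentDiscount(10),
    new PercentDiscount(50),
  ]);
  const total = service.total(10000);
  if (total !== 4500) {
    throw new Error(`expected 4500, got ${total}`);
  }
}

testCheckoutAppliesRulesInOrder();
```

With this shape, a bug report boils down to "make this test pass without changing the interface", which in my experience is a much easier ask than "fix the checkout".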
Sure, it takes some creative prompting, and a lot of turns to get it to settle on the proper coordinate system for the whole thing, but it goes ahead and does it.
This has taken me two days so far. Unfortunately, the scope of the thing is now so large that the quality is starting to degrade rapidly.