It's kind of a human problem too. That the full testing suite takes X hours to run is not fun on its own, but it also makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed against the full testing suite, so for some changes it takes 2 hours before you know for certain that nothing else broke. How quickly do you lose interest, and at what point do you give up and either improve the testing suite, or just skip that feature/implement it some other way?

Now say you're Robot A, working on the same task. The robot doesn't care if each change takes 2 hours before the result appears on its screen, the context is exactly the same, and it's still "a helpful assistant" 48 hours later, still trying to get the feature put together without breaking anything.

If you're feeling brave, you start Robots B and C at the same time.


abdullin
This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go about your own work. Come back later to review the results and pick the good ones.
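For illustration, here is a minimal sketch of that fan-out-and-review shape in Python. `run_agent_task` is a hypothetical placeholder for however one of those «robotic» tasks gets launched; Codex itself is driven through its own UI, so none of this is the actual product API:

```python
# Minimal sketch of the "fan out, walk away, come back and review" pattern.
# run_agent_task is a hypothetical stand-in for launching one Codex-style task.
from concurrent.futures import ThreadPoolExecutor

def run_agent_task(prompt: str) -> str:
    # Placeholder: imagine this blocks for an hour or two and returns a diff/PR link.
    return f"(diff produced for: {prompt})"

tasks = [
    "Implement feature X without breaking the existing billing tests",
    "Implement feature X behind a feature flag",
    "Implement feature X by extending the existing import pipeline",
]

# Launch Robots A, B and C at the same time and go do something else.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_agent_task, tasks))

# Come back later: read each candidate and pick the good one(s).
for i, result in enumerate(results, 1):
    print(f"--- candidate {i} ---\n{result}")
```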
diggan OP
Well, they're demonstrating it somewhat; it's more of a prototype today. The first tell is the low time limit: I think the longest task for me has been 15 minutes before it gives up. The second tell is that it still uses a chat UI, which is simple to implement and familiar, but also kind of lazy. There should be a better UX, especially with the new variations they just added. Off the top of my head, some graph-like UX might have been better.
abdullin
I guess it depends on the case and the approach.

It works really nicely with the following approach (distilled from experiences reported by multiple companies):

(1) Augment the codebase with explanatory texts that describe individual modules, interfaces and interactions (something that is needed for the humans anyway)

(2) Provide an Agent.MD that describes the approach/style/process the AI agent must take. It should also describe how to run all tests.

(3) Break the task down into smaller features. For each feature, first ask for a detailed implementation plan (because it is easier to review a plan than 1000 lines of changes spread across a dozen files)

(4) Review the plan and ask for improvements if needed. When ready, ask it to draft an actual pull request

(5) The system will automatically run all available tests/linting/rules before writing the final PR. Verify, and provide feedback if some polish is needed.

(6) Launch multiple instances of the "write me an implementation plan" and "implement this plan" tasks, and pick the one that looks best (a rough sketch of this fan-out follows below)

This is very similar to git-driven development of large codebases by distributed teams.
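As a rough sketch of steps (3), (4) and (6), here is what generating several candidate plans for human review could look like using the OpenAI Python SDK. The model name, file names and prompts are placeholders; the actual Codex product runs these steps inside its own hosted environment:

```python
# Sketch: fan out "write me an implementation plan" N times, then review by hand.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

agent_md = open("Agent.MD").read()  # step (2): style/process/test instructions
feature = "Add CSV export to the reporting module"  # placeholder feature

# Step (6): ask for several independent implementation plans.
plans = []
for _ in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": agent_md},
            {"role": "user", "content": f"Write a detailed implementation plan for: {feature}"},
        ],
    )
    plans.append(resp.choices[0].message.content)

# Steps (3)/(4): a human reviews the plans and picks one before any code is written.
for i, plan in enumerate(plans, 1):
    print(f"=== plan {i} ===\n{plan}\n")
```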

Edit: added newlines

diggan OP
> distilled from experiences reported by multiple companies

Distilled from my experience, I'd still say that the UX is lacking, as sequential chat just isn't the right format. I agree with Karpathy that we haven't found the right way of interacting with these OSes yet.

Even with what you describe, variations were implemented in a rush. Once you've iterated on one variation you cannot, for example, iterate on another variant at the same time.

abdullin
Yes. I believe the experience will get better. Plus, more AI vendors will catch up with OpenAI and offer similar experiences in their products.

It will just take a few months.

TeMPOraL
Worked in such a codebase for about 5 years.

No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and looks at you weird when you suggest something needs to be done.

diggan OP
There are a few of us around, but not a lot, I agree. It really is an uphill battle trying to get development teams to design and implement test suites the same way they do other, "more important" code.
