For instance, instead of a web designer AI, start with an agent to generate tests for a human building a web component. Then add an agent to generate components for a human building a design system. Then add an agent to generate a design system using those agents for a human building a web page. Then add an agent to build entire page layouts using a design system for a human building a website.
Even if there’s a 20% failure rate that needs human intervention, that’s still 5x developer productivity: a human who only has to step in on one task in five can oversee five times the work. When the failure rate gets low enough, move up the stack.
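To make the first rung concrete, here is roughly the kind of test such an agent might emit; the <fancy-button> element, its attribute and event names, and the Vitest setup are all hypothetical, not any particular tool's output.

    // Hypothetical agent-generated test for a <fancy-button> web component.
    // Assumes Vitest with a jsdom environment; all names are made up.
    import { expect, test } from "vitest";
    import "./fancy-button"; // assumed to register the custom element

    test("renders its label and fires a click event", () => {
      const button = document.createElement("fancy-button");
      button.setAttribute("label", "Save");
      document.body.appendChild(button);

      // The label attribute should end up in the rendered shadow DOM.
      expect(button.shadowRoot?.textContent).toContain("Save");

      // Clicking the inner button should re-dispatch a custom event.
      let clicked = false;
      button.addEventListener("fancy-click", () => { clicked = true; });
      button.shadowRoot?.querySelector("button")?.click();
      expect(clicked).toBe(true);
    });

A human still reviews and fixes these, but generating test scaffolding is exactly the kind of narrow, checkable task where a high failure rate is tolerable.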
So for basic CRUD-style web apps it is great. But then again, so is a template.
But the minute you are dealing with newer libraries or less popular languages, e.g. Rust or Scala, it just falls apart: for me it constantly hallucinates methods, imports, etc.
Their ability to handle complex logic breaks down too, and once you hit their limits, it's pretty much a hard wall.
Another issue with complex projects is that LLMs will not tell you what you don't know. They will happily go about writing crappy code if you ask them for a crappy solution; they won't recommend a better path forward unless explicitly prompted.
That said, I had Claude generate most of a tile-based 2D pixel-art rendering engine[1] for me, but again, once things got complicated I had to start hand-fixing the code because Claude was no longer able to make improvements.
I've seen these failure modes across multiple problem domains, from CSS (alternating between two broken styles, neither of which came close to fixing the issue), to backend code, to rendering code (trying to get character sprites positioned correctly on the tiles).
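For context, the core loop of a tile-plus-sprite renderer like this is small; here is a rough sketch of its shape, assuming an HTML canvas, a 16px tile size, and made-up names rather than the actual engine code.

    // Rough sketch of tile-then-sprite rendering; all names are hypothetical.
    const TILE = 16; // assumed pixel-art tile size

    interface Sprite {
      sheet: HTMLImageElement; // source spritesheet
      sx: number; sy: number;  // source position in the sheet (pixels)
      tx: number; ty: number;  // position on the map (tile coordinates)
    }

    function drawMap(
      ctx: CanvasRenderingContext2D,
      tileset: HTMLImageElement,
      map: number[][], // tile indices, row-major
      sprites: Sprite[],
    ) {
      ctx.imageSmoothingEnabled = false; // keep pixel art crisp
      const perRow = Math.floor(tileset.width / TILE);

      // Pass 1: ground tiles.
      for (let y = 0; y < map.length; y++) {
        for (let x = 0; x < map[y].length; x++) {
          const i = map[y][x];
          ctx.drawImage(
            tileset,
            (i % perRow) * TILE, Math.floor(i / perRow) * TILE, TILE, TILE,
            x * TILE, y * TILE, TILE, TILE,
          );
        }
      }

      // Pass 2: sprites, sorted back to front so lower characters
      // overlap upper ones. Draw ordering and sub-tile offsets are
      // exactly where rendering artifacts tend to creep in.
      for (const s of [...sprites].sort((a, b) => a.ty - b.ty)) {
        ctx.drawImage(s.sheet, s.sx, s.sy, TILE, TILE,
                      s.tx * TILE, s.ty * TILE, TILE, TILE);
      }
    }

Claude produces code shaped like this readily; the trouble starts once layering, animation, and camera logic interact and it can no longer improve the design.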
[1] https://www.generativestorytelling.ai/town/index.html (note the many rendering artifacts). I've realized I'm going to need to rewrite a lot of how rendering happens to resolve them. Claude wrote 80% of the original code, but by the time I'm done fixing everything, maybe only 30% or so of it will remain.