One problem I run into with LLM code generation on large projects is that at some point the LLM hits a problem it just cannot fix, no matter how it is prompted. This manifests in a number of ways. Sometimes it bounces back and forth between two invalid solutions; other times it fixes one issue while breaking something else in another part of the code.

Another issue with complex projects is that LLMs will not tell you what you don't know. They will happily go about designing crappy code if you ask them for a crappy solution, and they won't recommend a better path forward unless explicitly prompted.

That said, I had Claude generate most of a tile-based 2D pixel art rendering engine[1] for me, but again, once things got complicated I had to start hand-fixing the code because Claude was no longer able to make improvements.

I've seen these failure modes across multiple problem domains, from CSS (alternating between two broken styles, neither of which came close to fixing the issue) to backend code, to rendering code (trying to position character sprites correctly on the tiles).

[1] https://www.generativestorytelling.ai/town/index.html notice the tons of rendering artifacts. I've realized I'm going to need to rewrite a lot of how rendering happens to resolve them. Claude wrote 80% of the original code but by the time I'm done fixing everything maybe only 30% or so of Claude's code will remain.


Same. I was writing my own language compiler with MLIR/C++ and GPT was OK-ish for diving into the space initially, but it ran out of steam pretty quickly, and the recommendations were so off at one point (invented MLIR features, invented libraries, incorrect understanding of the framework, etc.) that I had to go back to the drawing board, RTFM, and basically do everything I would have done without GPT to begin with. I've seen similar issues in other problem domains as well, just like you. It doesn't surprise me, though.

I've observed this too. I'm sceptical of the all-in-one builders; I think the most likely route to get there is for LLMs to eat the smaller tasks as part of a developer workflow, with humans wiring them together, then to expand with specialised agents to move up the stack.

For instance, instead of a web designer AI, start with an agent to generate tests for a human building a web component. Then add an agent to generate components for a human building a design system. Then add an agent to generate a design system using those agents for a human building a web page. Then add an agent to build entire page layouts using a design system for a human building a website.
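To make the shape of that workflow concrete, here is a minimal sketch, assuming hypothetical testAgent, componentAgent, and humanReview pieces (none of these are real APIs), of two narrow agents wired together with a human gating each hand-off:

    // Minimal sketch of "small agents, human wiring" (all names hypothetical).
    type Review<T> = { approved: boolean; value: T };

    interface Agent<In, Out> {
      run(input: In): Promise<Out>;
    }

    // Narrow, single-purpose agents; in practice each would wrap an LLM call.
    declare const testAgent: Agent<{ spec: string }, { testSource: string }>;
    declare const componentAgent: Agent<
      { spec: string; testSource: string },
      { componentSource: string }
    >;

    // The human is an explicit step: approve the artifact or take over manually.
    declare function humanReview<T>(artifact: T): Promise<Review<T>>;

    async function buildComponent(spec: string): Promise<string | null> {
      // The first agent drafts the tests; a human signs off before anything else runs.
      const tests = await humanReview(await testAgent.run({ spec }));
      if (!tests.approved) return null; // fall back to doing the task by hand

      // The next agent up the stack writes the component against the approved tests.
      const component = await humanReview(
        await componentAgent.run({ spec, testSource: tests.value.testSource })
      );
      return component.approved ? component.value.componentSource : null;
    }

Each additional agent would slot in the same way, one level further up the stack.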

Even if there's a 20% failure rate that needs human intervention, that's still roughly 5x developer productivity: the agents handle the other 80% end to end, so a developer only spends hands-on time on one task in five. When the failure rate gets low enough, move up the stack.

I've found that getting the AI to write unit tests is almost more useless than getting it to write the code. If I'm writing a test suite, the code is non-trivial, and the edge cases are something I need to think about deeply to make sure I've really covered them, which is absolutely not something an LLM will do. And most of the time it's only by actually writing the tests that I figure out all of the possible edge cases; if I just handed the job off to an LLM, I'm very confident my defect rate would balloon significantly.

I've found that LLMs are only as good as the code they are trained on.

So for basic CRUD-style web apps it's great. But then again, so is a template.

But the minute you are dealing with newer libraries or less popular languages, e.g. Rust or Scala, it just falls apart; for me it constantly hallucinates methods, imports, etc.

I spent weeks trying to get GPT-4 to get me through some gnarly Shapeless (Scala) issues. After that failure, I realized how real the limitations are. They really cannot produce original work, and with niche languages they hallucinate all the time, to the point of being completely unusable.
Hallucination from code-generating AI is worse than code written by a beginner programmer, because at least a beginner can analyse their mistakes and figure out what's actually wrong.
They can do surprising things if prompted correctly.

But their ability to do complex logic falls apart, and their limits are pretty much a hard wall once you hit them.

And so another level of abstraction in software development is created, but this time with an unknown level of accuracy. Call me old-school, but I like a debuggable, explainable, and essentially provably reliable result. When a good developer can code while keeping the whole problem accurately in their head, the code is worth its wait (deliberate sic, thank you) in gold.
Which is to say, the job of programming is safe, for now.
