This is a common take, but it hasn't been my experience. LLMs produce results that vary from expert-level all the way down to slightly better than Markov chains. The average result might equal a junior developer's, and the worst case doesn't happen that often, but the fact that it happens from time to time makes them completely unreliable for a lot of tasks.
Junior developers are much more consistent. Sure, you will find the occasional developer who would delete the test file rather than fix the tests, but either they will learn their lesson after seeing your WTH face or you can fire them. Can't do that with LLMs.
- Language
- Total LOC
- Subject matter expertise required
- Total dependency chain
- Subjective score (audited randomly)
And we can start doing some analysis. Otherwise we're pissing into ten kinds of winds.
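To make that concrete, here is a minimal sketch of what one record per task could look like; the field names and 1-5 scales are my own assumptions, not any existing benchmark's schema, and Rust is used only because it comes up elsewhere in the thread:

```rust
// Hypothetical per-task record for this kind of analysis.
// Field names and scales are assumptions, not an established schema.
struct TaskRecord {
    language: String,             // e.g. "Rust", "HTML/CSS"
    total_loc: u32,               // rough size of the change or project
    expertise_required: u8,       // 1-5: subject matter expertise needed
    dependency_chain: u32,        // how long the dependency chain is
    subjective_score: Option<u8>, // 1-5, filled in only for randomly audited tasks
}

fn main() {
    // One made-up example row.
    let example = TaskRecord {
        language: "Rust".to_string(),
        total_loc: 1_200,
        expertise_required: 3,
        dependency_chain: 40,
        subjective_score: Some(4),
    };
    println!("{} task, {} LOC", example.language, example.total_loc);
}
```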
My own subjective experience: they're earth-shattering at webapps in HTML and CSS (because I'm terrible and slow at that), annoyingly good but usually a bit wrong at planning and optimization in Rust, and horribly lost at systems design or at debugging a reasonably large Rust system.
Besides one point: junior developers can learn from their egregious mistakes; LLMs can't, no matter how strongly worded you are in their system prompt.
In a functional work environment, you will build trust with your coworkers little by little. The pale equivalent with LLMs is improving system prompts and writing more and more AI directives that may or may not be followed.
If I were tutoring a junior developer and he accidentally deleted the whole source tree or did something equally egregious, that would be a milestone learning point in his career, and he would never do it again. But if the LLM does it accidentally, it will be apologetic, yet after the next context window clear it has the same chance of doing it again.
I think if you set an LLM off to do something and it makes an "egregious mistake" in the implementation, and then you adjust the system prompt to explicitly guard against that (or steer it toward a different implementation) and restart from scratch, yet it makes the exact same "egregious mistake", then you need to try a different model/tool than the one you've been using.
It's common with smaller models, or bigger models that are heavily quantized, that they aren't great at following system/developer prompts, but that really shouldn't happen with the available SOTA models; I haven't had something ignored like that in years.
But is this like steel production or piloting (a few highly trained experts stay in the loop), or more like warehouse work (lots of automation removed most of the skill from driving, inventory work, etc.)?
Or rather, it's more like a contractor. If I don't like the job they did, I don't give them the next job.
If you think LLMs are “better programmers than you,” well, I have some disappointing news for you that might take you a while to accept.