Comment by epolanski - Hacker Neue

epolanski Dec 9, 2025

No I'm not, I'm just sick of these edgy takes where AI does not improve productivity when it obviously does.

Even if you limit your AI experience to finding information online through deep research, it's such a time saver and productivity booster that it makes a big difference.

The list of things it can do for you is massive, even if you don't have it write a single line of code.

Yet the counter argument is like "bu..but..my colleague is pushing slop and it's not good at writing code for me". Come on, then use it for things it's good at, not things where you don't find it satisfactory.

lunar_mycroft Dec 9, 2025

It "obviously" does based on what, exactly? For most devs (and it appears you, based on your comments) the answer is "their own subjective impressions", but that METR study (https://arxiv.org/pdf/2507.09089) should have completely killed any illusions that that is a reliable metric (note: this argument works regardless of how much LLMs have improved since the study period, because it's about how accurate devs' impressions are, not how good the LLMs actually were).

johnsmith1840 Dec 9, 2025

It's a good study. I also believe it is not an easy skill to learn. I would not say I have 10x output, but easily 20%.

When I was early in my use of it I would say I sped up 4x, but now after using it heavily for a long time, some days it's 20%, other days -20%.

With this technology it's very difficult to know when you're in one or the other.

The real thing to note is that when you "feel" lazy while using AI you are almost certainly in the -20% category. I've had days of not thinking where I had to revert all the code from that day because AI jacked it up so much.

To get that speed up you need to be truly focused 100% or risk death by a thousand cuts.

keeda Dec 9, 2025

Yes, self-reported productivity is unreliable, but there have been other, larger, more rigorous, empirical studies on real-world tasks which we should be talking about instead. The majority of them consistently show a productivity boost. A thread that mentions and briefly discusses some of those:

https://www.hackerneue.com/item?id=45379452

lunar_mycroft Dec 9, 2025

Some (partial) counter points:

- I think given publicly available metrics, it's clear that this isn't translating into more products/apps getting shipped. That could be because devs are now running into other bottlenecks, but it could also indicate that there's something wrong with these studies.

- Most devs who say AI speeds them up assert numbers much higher than what those studies have shown. Much of the hype around these tools is built on those higher estimates.

- I won't claim to have read every study, but of the ones I have checked in the past, the more the methodology impressed me the less effect it showed.

- Prior to LLMs, it was near universally accepted wisdom that you couldn't really measure developer productivity directly.

- Review is imperfect, and LLMs produce worse code on average than human developers. That should result in somewhat lowered code quality with LLM usage (although that might be an acceptable trade off for some). The fact that some of these studies didn't find that is another thing that suggests there are shortcomings in said studies.

keeda Dec 9, 2025

> - Most devs who say AI speeds them up assert numbers much higher than what those studies have shown.

I am not sure how much is just programmers saying "10x" because that is the meme, but if at all realistic numbers are mentioned, I see people claiming 20 - 50%, which lines up with the studies above. E.g. https://www.hackerneue.com/item?id=45800710 and https://www.hackerneue.com/item?id=46197037

> - Prior to LLMs, it was near universally accepted wisdom that you couldn't really measure developer productivity directly.

Absolutely, and all the largest studies I've looked at mention this clearly and explain how they try to address it.

> Review is imperfect, and LLMs produce worse code on average than human developers.

Wait, I'm not sure that can be asserted at all. Anecdotally that's not my experience, and the largest study in the link above explicitly discusses it and finds that proxies for quality (like approval rates) indicate more improvement than a decline. The Stanford video accounts for code churn (possibly due to fixing AI-created mistakes) and still finds a clear productivity boost.

My current hypothesis, based on the DORA and DX 2025 reports, is that quality is largely a function of your quality control processes (tests, CI/CD etc.)

That said, I would be very interested in studies you found interesting. I'm always looking for more empirical evidence!

dns_snek Dec 10, 2025

> I see people claiming 20 - 50%, which lines up with the studies above

Most of those studies either measure productivity using useless metrics like lines of code or number of PRs, or are studies whose participants are working for organizations that are heavily invested in future success of AI.

One of my older comments addressing a similar list of studies: https://www.hackerneue.com/item?id=45324157

keeda Dec 10, 2025

As mentioned in the thread I linked, they acknowledge the productivity puzzle and try to control for it in their studies. It's worth reading them in detail; I feel like many of them did a decent job controlling for many factors.

For instance, when measuring the number of PRs, they ensure that each one goes through the same review process whether AI-assisted or not, ensuring these PRs meet the same quality standards as human-written ones.

Furthermore, they did this as a randomized controlled trial comparing engineers without AI to those with AI (in most cases, the same ones over time!), which does control for a lot of the issues with using PRs in isolation as a holistic view of productivity.

>... whose participants are working for organizations that are heavily invested in future success of AI.

That seems pretty ad hom, unless you want to claim they are faking the data, along with their co-authors from premier institutes like NBER, MIT, UPenn, Princeton, etc.

And here's the kicker: they all converge on a similar range of productivity boost, such as the Stanford study:

> https://www.youtube.com/watch?v=tbDDYKRFjhk (from Stanford, not an RCT, but the largest scale with actual commits from 100K developers across 600+ companies, and tries to account for reworking AI output. Same guys behind the "ghost engineers" story.)

The preponderance of evidence paints a very clear picture. The alternative hypothesis is that ALL these institutes and companies are colluding. Occam's razor and all that.

lunar_mycroft Dec 10, 2025

> if at all realistic numbers are mentioned, I see people claiming 20 - 50%

IME most people claim small integer multiples, 2-5x.

> all the largest studies I've looked at mention this clearly and explain how they try to address it.

Yes, but I think pre-AI virtually everyone reading this would have been very skeptical about their ability to do so.

> My current hypothesis, based on the DORA and DX 2025 reports, is that quality is largely a function of your quality control processes (tests, CI/CD etc.)

This is pretty obviously incorrect, IMO. To see why, let's pretend it's 2021 and LLMs haven't come out yet. Someone is suggesting no longer using experienced (and expensive) first world developers to write code. Instead, they suggest hiring several barely trained boot camp devs (from low cost of living parts of the world so they're dirt cheap) for every current dev and having the latter just do review. They claim that this won't impact quality because of the aforementioned review and their QA process. Do you think that's a realistic assessment? And if, on the off chance, you think it is, why didn't this happen on a larger scale pre-LLM?

The resolution here is that while quality control is clearly important, it's imperfect, ergo the quality of the code before passing through that process still matters. Pass worse code in, and you'll get worse code out. As such, any team using the method described above might produce more code, but it would be worse code.
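A toy sketch of that argument, assuming (purely for illustration) that the quality control process catches a fixed fraction of defects no matter how many arrive. The function name, the catch rate, and the defect counts are all hypothetical and not taken from any study:

```python
def escaped_defects(defects_in: float, catch_rate: float) -> float:
    """Defects that survive review, assuming review catches a fixed fraction."""
    if not 0.0 <= catch_rate <= 1.0:
        raise ValueError("catch_rate must be between 0 and 1")
    return defects_in * (1.0 - catch_rate)


# Hypothetical inputs: defects per 1,000 lines before any review.
careful_input = 10.0
sloppy_input = 30.0
catch_rate = 0.8  # the same imperfect process is applied to both

careful_out = escaped_defects(careful_input, catch_rate)
sloppy_out = escaped_defects(sloppy_input, catch_rate)

# Unless catch_rate is exactly 1.0, more defects in means more defects out.
print(f"careful: {careful_out:.1f}, sloppy: {sloppy_out:.1f}")
```

Under this simplified model the output quality only becomes independent of the input quality when the catch rate is a perfect 1.0. A real process could behave differently, for example if reviewers get more thorough (or more fatigued) as the input gets worse.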

> the largest study in the link above explicitly discusses it and finds that proxies for quality (like approval rates) indicate more improvement than a decline

Right, but my point is that that's a sanity check failure. The fact that shoving worse code at your quality control system will lower the quality of the code coming out the other side is IMO very well established, as is the fact that LLM generated code is still worse than human generated (where the human knows how to write the code in question, which they should if they're going to be responsible for it). It follows that more LLM code generation will result in worse code, and if a study finds the opposite it's very likely that it made some mistake.

As an analogy, when a physics experiment appeared to find that neutrinos travel faster than the speed of light in a vacuum, the correct conclusion was that there had almost certainly been a problem with the experiment, not that neutrinos actually travel faster than the speed of light. That was indeed the explanation. (Note that I'm not claiming that "quality control processes cannot completely eliminate the effect of input code quality" and "LLM generated code is worse than human generated code" are as well established as relativity.)


hu3 Dec 9, 2025

not OP but I have a hard metric for you.

AI multiplied the amount of code I committed last month by 5x and it's exactly the code I would have written manually. Because I review every line.

model: Claude Sonnet 3.5/4.5 in VSCode GitHub Copilot. (GPT Codex and Gemini are good too)

lunar_mycroft Dec 9, 2025

I have no reason to think you're lying about the first part (although I'd point out there are several ways that metric could be misleading, and approximately every piece of evidence available suggests it doesn't generalize), but the second part is very fishy. There's really no way for you to know whether or not you'd have written the same code or effectively the same code after reviewing existing code, especially when that review must be fairly cursory (because in order to get the speed up you claim, you must be spending much less time reviewing the code than it would have taken to write). Effectively, what you've done is moved the subjectivity from "how much does this speed me up?" to "is the output the same as if I had done it manually?"
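The arithmetic behind the "fairly cursory" point can be sketched as follows, assuming the same total working hours in both months and that time per unit of code scales inversely with output. The function name is hypothetical; the 5x figure is the one claimed in the parent comment:

```python
def max_time_fraction(speedup: float) -> float:
    """Upper bound on time spent per unit of code (prompting, review, fixes),
    as a fraction of the time it would take to write that code by hand."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# With a claimed 5x increase in committed code over the same hours,
# everything done per unit of code has to fit in this share of the
# manual writing time.
budget = max_time_fraction(5.0)
print(f"at most {budget:.0%} of the manual writing time")
```

Since prompting and fixing also come out of that budget, the share left for review alone is smaller still.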

hu3 Dec 9, 2025

> There's really no way for you to know whether or not you'd have written the same code or effectively the same code after reviewing existing code.

There is in my case because it's just CRUD code. The pattern looks exactly like the code I wrote the month prior.

And this is what LLMs excel at, in my experience. "Given these examples, extrapolate to these other cases."

Libidinalecon Dec 10, 2025

I am not even a software engineer but from using the models so much I think you are confined to a specific niche that happens to be well represented in the training data so you have a distorted perspective on the general usefulness of language models.

For some things LLMs are like magic. For other things LLMs are maddeningly useless.

The irony to me is that anyone who says something like "you don't know how to use the LLM" actually hasn't explored the models enough to understand their strengths/weaknesses and how random and arbitrary the strengths and weaknesses are.

Their use cases happen to line up with the strengths of the model, and they think it is something special they are doing themselves when it is not.

douglasisshiny Dec 10, 2025

>No I'm not, I'm just sick of these edgy takes where AI does not improve productivity when it obviously does.

Feel free to cite said data you've seen supporting this argument.
