There are a ton of models out there, run in a ton of different ways, usable with different harnesses, and people use different workflows. There are just so many variables involved that I don't think it's fair or accurate for anyone to claim "This is obviously better" or "This is obviously impossible".
I've been in situations where I hit my head against some hard-to-find bug for days, then I put "AI" (but which one? No one knows) on it and it solves it in 20 minutes. I've also asked "AI" to do trivial work that it still somehow fucked up, even though I could probably have asked a non-programmer friend to do it and they'd have managed.
The variance is huge, and the fact that system/developer/user prompts matter a lot for the responses you get makes it even harder to fairly compare things like this without the actual chat logs in front of you.
this strikes me as a very important thing to reflect on. when the automobile was invented, was the apparent benefit so incredibly variable?
Yes, lots of people were very vocally against horseless carriages, as they were called at the time. Safety and public-nuisance concerns were widespread; the cars were noisy, fast, smoky, and unreliable. Old newspapers are filled with opinions about this, from people afraid of horseless carriages spooking others' horses and so on. The UK restricted the adoption of cars at one point, and one canton in Switzerland even banned cars for a couple of decades.
Horseless carriages were commonly ridiculed as being just for "reckless rich hobbyists" and the like.
I think the major difference is that cars produced immediate, visible externalities, so it was easy for opposition to focus on public safety in public spaces. In contrast, AI's externalities are less physically visible, though they are just as important as the ones cars introduced, maybe even more so.
but if it empirically works, does it matter if the "intelligence" doesn't "understand" it?
Does a chess engine "understand" the moves it makes?
It's a useless philosophical discussion.
Late 2025 models very rarely hallucinate nonexistent core library functionality - and they run inside coding agent harnesses, so if they DO, they notice that the code doesn't work and fix it.
Agentic LLMs will notice if something is crap and won't compile, retry, use the tools they have available to figure out the correct approach, edit, and try again.
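For what it's worth, that loop is simple enough to sketch. Below is a minimal, hypothetical version in Python - generate_patch, apply_patch, and the `make test` build command are stand-ins for whatever model call, file edit, and test command a real harness uses, not any particular tool's API - just to show the notice-failure-and-retry cycle being described:

```python
# Minimal sketch of a coding-agent retry loop (hypothetical interfaces throughout).
import subprocess

def run_build() -> tuple[bool, str]:
    """Run the project's build/tests and return (ok, output). 'make test' is a placeholder command."""
    proc = subprocess.run(["make", "test"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def generate_patch(task: str, feedback: str) -> str:
    """Placeholder for the model call that proposes an edit given the error output."""
    raise NotImplementedError("call your model of choice here")

def apply_patch(patch: str) -> None:
    """Placeholder for writing the proposed edit to disk."""
    raise NotImplementedError("apply the edit to the working tree here")

def agent_loop(task: str, max_attempts: int = 5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(task, feedback)  # model proposes code
        apply_patch(patch)                      # harness applies it
        ok, output = run_build()                # harness compiles/tests it
        if ok:
            return True                         # it works, stop here
        feedback = output                       # otherwise feed the error back and retry
    return False
```

The whole point is the feedback edge: a hallucinated function shows up as a build/test failure, gets fed back into the next attempt, and usually gets corrected within a retry or two.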
That is just not true, assuming you have a modicum of competence (which I assume you do). AIs suck at all these tasks; they are not even as good as an inexperienced human.