mentalgear
Meanwhile, this morning I asked Claude 4 to write a simple EXIF normalizer. After two rounds of prompting it to double-check its code, I still had to point out that it makes no sense to load the entire image for re-orienting when the EXIF orientation is fine in the first place.

Vibe vs. reality: anyone actually working in the space daily can attest to how brittle these systems are.

Maybe this changes in SWE with more automated tests in verifiable simulators, but the real world is far too complex to simulate in its vastness.
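
(For context, the early exit I had to point out is a one-liner. A rough sketch of what I meant, assuming Pillow; `normalize_exif` and the naming are my own:)

    from PIL import Image, ImageOps

    ORIENTATION = 0x0112  # standard EXIF orientation tag (274)

    def normalize_exif(src: str, dst: str) -> None:
        with Image.open(src) as img:  # open() is lazy: no pixel data decoded yet
            if img.getexif().get(ORIENTATION, 1) == 1:
                return  # already upright: skip decoding/re-encoding entirely
            ImageOps.exif_transpose(img).save(dst)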


diggan
> Meanwhile

What do you mean, "meanwhile"? That's exactly (among other things) the kind of stuff he's talking about: the various frictions and how you need to approach them.

> anyone actually working in the space

Is this trying to say that Karpathy doesn't "actually work" with LLMs or in the ML space?

I feel like your whole comment is just reacting to the title of the YouTube video, rather than actually thinking and reflecting on the content itself.

demaga
I'm pretty sure "actually work" part refers to SWE space rather than LLM/ML space
Seanambers
Seems to me that this is just another level of throwing compute at the problem.

Same way programs were much more efficient before, and now they are "bloated" with packages, abstractions, slow implementations of algorithms, and scaffolding.

The concept of what is good software development might be changing as well.

LLMs might not write the best code, but they sure can write a lot of it.

ApeWithCompiler
A manager in our company introduced Gemini as a chat bot coupled to our documentation.

> It failed to write out our company name. The rest was riddled with hallucinations too, hardly worth mentioning.

I wish this were just rage bait aimed at others, but what should my feelings be? After all, this is the tool that's sold to me, the one I'm expected to work with.

gorbachev
We had exactly the opposite experience. CoPilot was able to answer questions accurately and reformatted the existing documentation to fit the context of users' questions, which made the information much easier to understand.

Code examples, which we offer as a sort of reference implementation, were also adapted to fit the specific questions without much issue. Granted, these aren't whole applications, but 10-25 line examples of doing API setup/calls.

We didn't, of course, just send users' questions directly to CoPilot. Instead there's a bit of prompt magic behind the scenes that tweaks the context so that CoPilot can produce better quality results.
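
Nothing fancy, roughly this shape (a simplified sketch, not our production code; the wording and function name are illustrative):

    def build_prompt(question: str, doc_snippets: list[str]) -> str:
        # Wrap the user's question with the relevant documentation so the
        # model answers from our docs rather than guessing from training data.
        context = "\n\n".join(doc_snippets)
        return (
            "Answer the question using only the documentation below. "
            "If the docs don't cover it, say so.\n\n"
            f"--- DOCUMENTATION ---\n{context}\n\n"
            f"--- QUESTION ---\n{question}"
        )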

ramon156
The real question is how long it'll take until they're not brittle.
Or will they ever be reliable? Your question is already making an assumption.
vFunct
It's perfectly reliable for the things you know it to be reliable at, such as operations that fit within its context window.

Don't ask LLMs to "Write me Microsoft Excel".

Instead, ask it to "Write a directory tree view for the Open File dialog box in Excel".

Break your projects down into the smallest chunks you can for the LLMs. The more specific you are, the more reliable it's going to be.

The rest of this year is going to be companies figuring out how to break down large tasks into smaller tasks for LLM consumption.

diggan
They're already reliable if you change the way you approach them. These probabilistic token generators will probably never be "reliable" if you expect them to always output exactly what you had in mind, 100% of the time, without iterating in user-space (the prompts).
I also think they might never become reliable.
There is a bar below which they are reliable.

"Write a Python script that adds three numbers together".

Is that bar going up? I think it probably is, although not as fast/far as some believe. I also think that "unreliable" can still be "useful".

diggan
But what does that mean? If you tell the LLM "Say just 'hi' without any extra words or explanations", do you not get "hi" back from it?
dist-epoch
I remember when people were saying here on HN that AIs would never be able to generate pictures of hands with just 5 fingers because they just "don't have common sense".
yahoozoo
“Treat it like a junior developer” … 5 years later … “Treat it like a junior developer”
agile-gift0262
from time import sleep
from datetime import timedelta

while True:
    print("This model that just came out changes everything. It's flawless. It doesn't have any of the issues the model from 6 months ago had. We are 1 year away from AGI and becoming jobless")
    sleep(timedelta(days=180).total_seconds())
TeMPOraL
Usable LLMs are 3 years old at this point. ChatGPT, not GitHub Copilot, is the marker.
LtWorf
Usable for fun yes.
hombre_fatal
On the other hand, posts like this are like watching someone type Ask Jeeves search queries into Google 20 years ago and then gesture at how Google sucks, while everyone else in the room has figured out how to be productive with it and cringes at his "boomer" queries.

If you're still struggling to make LLMs useful for you by now, you should probably ask someone. Don't let other noobs on HN +1'ing you hold you back.

mirrorlake
Perhaps consider making some tutorials, then, and share your wealth of knowledge rather than calling people stupid.
hombre_fatal
I wouldn't want to rob them of the satisfaction of self-reliance.
sensanaty
There's also those instances where Microsoft unleashed Copilot on the .NET repo, and it resulted in the most hilariously terrible PRs that required the maintainers to basically tell Copilot every single step it should take to fix the issue. They were basically writing the PRs themselves at that point, except doing it through an intermediary that was much dumber, slower and less practical than them.

And don't get me started on my own experiences with these things. And no, I'm not a luddite; I've tried my damnedest and have followed all the cutting-edge advice you see posted on HN and elsewhere.

Time and time again, the reality of these tools falls flat on its face while people like Andrej hype things up as if we're 5 minutes away from Claude becoming Skynet or whatever, or, as he puts it, before we enter the world of "Software 3.0" (coincidentally totally unrelated to Web 3.0 and the grift we had to endure there, I'm sure).

To intercept the common arguments,

- no I'm not saying LLMs are useless or have no usecases

- yes there's a possibility if you extrapolate by current trends (https://xkcd.com/605/) that they indeed will be Skynet

- yes I've tried the latest and greatest model released 7 minutes ago to the best of my ability

- yes I've tried giving it prompts so detailed a literal infant could follow along and accomplish the task

- yes I've fiddled with providing it more/less context

- yes I've tried keeping it to a single chat rather than multiple chats, as well as vice versa

- yes I've tried Claude Code, Gemini Pro 2.5 With Deep Research, Roocode, Cursor, Junie, etc.

- yes I've tried having 50 different "agents" running and only choosing the best output from the lot.

I'm sure there's a new gotcha being written up as we speak, probably something along the lines of "Well for me it doubled my productivity!" and that's great, I'm genuinely happy for you if that's the case, but for me and my team who have been trying diligently to use these tools for anything that wasn't a microscopic toy project, it has fallen apart time and time again.

The idea of an application UI or god forbid an entire fucking Operating System being run via these bullshit generators is just laughable to me, it's like I'm living on a different planet.

diggan
You're not the first, nor the last person, to have a seemingly vastly different experience than me and others.

So I'm curious, what am I doing differently from what you did/do when you try them out?

This is maybe a bit out there, but would you be up for sending me a screen recording of exactly what you're doing? Or maybe even a video call sharing your screen? I'm not working in the space and have no products or services to sell; I'm only curious why this gap seemingly exists between you and me, and my only motive would be to understand whether I'm the one who is missing something, or whether there are more effective ways to help people understand how they can use LLMs and what they can use them for.

My email is on my profile if you're up for it. Invitation open for others in the same boat as parent too.

bsenftner
I'm a greybeard, 45+ years coding, including active AI work in the mid '80s, and I've used it where it applied throughout my entire career. That career being media and animation production backends, where the work is at both the technical and creative edge.

I currently have an AI-integrated office suite, with attorneys, professional writers, and political activists using the system. It is office software: word processing, spreadsheets, project management, and about two dozen types of AI agents that act as virtual co-workers.

No, my users are not programmers, but I do have interns: college students with anywhere from 3 to 10 years of experience writing software.

I see the same AI-use problems with my users and my interns. My office system bends over backwards to address this, but people are people: they do not realize that the AI does not know what they are talking about. They will frequently ask questions with no preamble, no introduction to the subject. They will change topics without bothering to start a new session or tell the AI the topic is now different. There is a huge number of things they do, often with escalating frustration evident in their prompts, that all stem from the same basic issue: the LLM was not given the context to understand the subject at hand, and the user, like many people, goes further when explaining, past the point of confusion, adding new confusion.

I see this over and over. It frustrates users to the point of anger, yet if they communicated with a human in the same manner, they'd have a verbal fight almost instantly.

The problem is one of communication. ...and for a huge number of you, I just lost you. You've not been taught to understand the power of communication, so you do not respect the subject. How to communicate is practically everything when it comes to human collaboration. It is how one orders one's mind, how one collaborates with others, AND how one gets AI to respond in the manner one desires.

But our current software development industry, and by extension all of STEM, has been shortchanged by never being taught how to communicate effectively, not at all. Presentations and how to sell are not effective communication; that's persuasion, about 5% of what it takes to convey understanding in others, which is what then unblocks resistance to change.

sensanaty
So AI is simultaneously going to take over everyone's job and do literally everything, including somehow being used as application UI... but you have to talk to it like a moody teenager at their first job lest you get nothing but garbage? I have to put just as much (and usually more) effort into talking to this non-deterministic black box as I would with an intern who joined a week ago, just to get anything usable out of it?

Yeah, I'd rather just type things out myself, and continue communicating with my fellow humans rather than expending my limited time on this earth appeasing a bullshit generator that's apparently going to make us all jobless Soon™

TeMPOraL
> But you have to talk to it like a moody teenager at their first job lest you get nothing but garbage?

No, you have to talk to it like to an adult human being.

If one's doing so and still gets garbage results from SOTA LLMs, that to me is a strong indication they also cannot communicate effectively with other human beings. It's literally the same skill. Such an individual is probably the kind of clueless person we all learn to isolate and navigate around, because contrary to their beliefs, they're not the center of the world, and we cannot actually read their mind.

bsenftner
Consider that these AIs are trained on human communications; they mirror that communication. They are literally damaged-document repair models: they use what they are given to generate a response, statistically. The fact that a question generates text that looks like an answer is an exploited coincidence.

It's a perspective shift few seem to have considered: if one wants an expert software developer from their AI, they need to create an expert software developer's context by using expert developer terminology that is present in the training data.
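
As a toy illustration of the difference (a sketch, assuming an OpenAI-style chat API; the prompts and model name are my own, not a recipe):

    from openai import OpenAI

    client = OpenAI()
    question = "How do I make my site faster?"

    # Bare question: no signal about the level of expertise expected.
    bare = [{"role": "user", "content": question}]

    # Expert framing: domain terminology steers generation toward the parts
    # of the training data where experts write.
    expert = [
        {"role": "system", "content":
            "You are a senior web performance engineer. Answer in terms of "
            "TTFB, the critical rendering path, cache-control headers, and "
            "CDN edge behavior, with concrete trade-offs."},
        {"role": "user", "content": question},
    ]

    for messages in (bare, expert):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        print(reply.choices[0].message.content[:300], "\n---")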

One can take this to an extreme, and it works: read the source code of an open source project and get an idea of both the developer and their coding style. Write prompts that mimic both the developer and their project, and you'll find the AI's context can now discuss that project in surprising detail. This is because the project is in the training data; if the project is also popular, there are additional sites of tutorials and people discussing its use, so a foundational model ends up knowing quite a bit, if one knows how to construct the context with that information.

This is, of course, tricky because of hallucination, but that can be minimized. It is also why we will all become conversant in AI context management if we continue writing software that incorporates AIs. I expect context management is what was really meant by prompt engineering. Communicating within engineering disciplines has always been difficult.

diggan
But parent explicitly mentioned:

> - yes I've tried giving it prompts so detailed a literal infant could follow along and accomplish the task

Which you are saying might still have missed the mark in the end, regardless?

bsenftner
I'd like to see the prompt. I suspect that "literal infant" is expected to be a software developer without preamble. The initial sentence to an LLM carries far more relevance than the rest; it sets the context stage for understanding what follows. If there is no introduction to the subject at hand, the response will be just like anyone's when fed a wall of words: confusion as to what all this is about.
diggan
You and me both :) But I always try to read the comments here with the most charitable interpretation I can come up with.
I've got a working theory that models perform differently when used in different timezones... as in, during US working hours they don't work as well due to high load. When used at 'off-peak' hours, not only are they (obviously) snappier, but the outputs appear to be of a higher standard. I've thought this for a while, but I'm now noticing it with Claude 4 [thinking] recently. Textbook case of anecdata, of course.
diggan
Interesting thought, if nothing else. Unless I misunderstand, it would be easy to run a study to see if this is true: use the API to send the same but slightly different prompt (so as to avoid the caches) which has a definite answer, then run that once per hour for a week and see if the accuracy oscillates or not.
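Something like this (a sketch, assuming an OpenAI-compatible API; the model and probe question are arbitrary choices, any task with a checkable answer would do):

    import time
    import uuid
    from openai import OpenAI

    client = OpenAI()
    QUESTION = "What is 17 * 23? Reply with only the number."  # known answer: 391

    for _ in range(24 * 7):  # hourly, for a week
        # A random nonce makes each prompt unique, sidestepping provider caches.
        prompt = f"[probe {uuid.uuid4().hex}] {QUESTION}"
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = reply.choices[0].message.content.strip()
        print(time.strftime("%Y-%m-%d %H:%M"), answer == "391", answer)
        time.sleep(3600)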
Yes, good idea, although it appears we would also have to account for the possibility of providers nerfing their models. I've read that others also think models are being quantized after a while to cut costs.
jim180
Same! I noticed, a couple of months ago, that the same prompt failed in the morning and then, later that day, when starting from scratch with identical prompts, the results were much better.
To add to this, I ran into a lot of issues too, and similar when using Cursor... until I started creating a mega-list of rules for it to follow that attaches to the prompts. Then outputs improved (but fell off after the context window got too large). At that stage I used a prompt to summarize, then continued with a new context, roughly as sketched below.
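
(A sketch of that loop; `chat` is a stand-in for whatever model call you use, and the token budget is a guess:)

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    MAX_TOKENS = 100_000  # rough budget before output quality falls off

    def history_tokens(history: list[str]) -> int:
        return sum(len(enc.encode(msg)) for msg in history)

    def maybe_compact(history: list[str], chat) -> list[str]:
        # When the context gets too large, ask the model to summarize the
        # session, then continue in a fresh context seeded with the summary.
        if history_tokens(history) < MAX_TOKENS:
            return history
        summary = chat("Summarize this session: decisions made, current "
                       "state of the code, and open tasks.\n\n" + "\n".join(history))
        return [summary]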
I think part of the problem is that code quality is somewhat subjective and developers are of different skill levels.

If you're fine with things kinda working okay, and you're not the best developer yourself, then you probably think coding agents work really, really well, because the slop they produce isn't much worse than your own. In fact, I know a mid-level dev who believes agentic AIs write better code than he does.

If you're very critical of code quality then it's much tougher... This is even more true in complex codebases where simply following some existing pattern to add a new feature isn't going to cut it.

The degree to which it helps any individual developer will vary, and perhaps it's not that useful for yourself. For me over the last few months the tech has got to the point where I use it and trust it to write a fair percentage of my code. Unit tests are an example where I find it does a really good job.

sensanaty
Listen, I won't pretend to be The God Emperor Of Writing Code or anything of the sort, I'm realistically quite mediocre/dead average in the grand scheme of things.

But literally yesterday, with Claude Code running Opus 4 (aka the latest and greatest, to intercept the "dId YoU tRy X" comment), which has full access to my entire Vue codebase at work, which has dedicated rules files I pass to it, which can see the fucking `.vue` file extension on every file in the codebase, after prompting it to "generate this Vue component that does X, Y and Z", it spat out React code at me.

You don't have to be Bjarne Stroustrup to get annoyed at this kinda stuff, and it happens constantly for a billion tiny things on the daily. The biggest pushers of AI have finally started admitting that it's not literally perfect, but am I really supposed to pretend that this workflow of having AIs generate dozens of PRs where a single one is somewhat acceptable is somehow efficient or good?

It's great for random one-offs, sure, but is that really deserving of this much insane, blind hype?

diggan
> If you're very critical of code quality then it's much tougher

I'm not so sure. Among developers I know to be sloppy producers of shit code, I'm hearing from some who have no luck with LLMs and some who have lots of luck with them.

On the other side, those who really think about the design/architecture and are very strict (which is the group I'd probably put myself into, but who wouldn't?) are split in a similar way.

I don't have any concrete proof, but I'm guessing "expectations + workflow" differences would explain the vast difference in perception of usefulness.

Unironically, your comment mirrors my opinion as of last month.

Since then, I gave it another try last week and was quite literally mind-blown by how much it has improved in the context of vibe coding (Claude Code). It actually improved so much that I thought "I would like to try that on my production codebase" (mostly because I want it to fail, because that's my job, ffs), but alas, that's not allowed at my day job.

From the limited experience I could gather over the last week, as a software dev with over 10 years of experience (along with another 5-10 doing it as a hobby before employment), I can say that I expect our industry to get absolutely destroyed within the next 5 years.

The skill ceiling for devs is going to get mostly squashed for 90% of devs, and this will inevitably destroy our collective bargaining position. Including for the last 10%, because the competition for those positions will be even more fierce.

It's already starting, even if it's currently very misguided and mostly down to short-sightedness.

But considering the trajectory, and looking at how naive current LLM coding tools are... once the industry adjusts and better tooling is pioneered, it's gonna get brutal.

And it's most certainly not limited to software engineering. Pretty much all desk jobs will get hemorrhaged as soon as an LLM player basically replaces SAP with entirely new tooling.

Frankly, I expect this to go bad, very very quickly. But I'm still hoping for a good ending.
