Preferences

macawfish
GPT-5 is like the guy on the baseball team that's really good at hitting home runs but can't do basic shit in the outfield.

It also consistently gets into drama with the other agents. E.g., the other day when I told it we were switching to claude code for executing changes, it badmouthed claude's entirely reasonable and measured analysis and then went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.

Whereas gemini and claude are excellent collaborators.

When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.

To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.


rapind
> it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea

So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.

It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.

Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.

When I work with Codex, I really lean into a git workflow. Everything goes on a branch, and I commit often. It's not how I'd normally do things, but it doesn't really cost me anything to adopt it.
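In practice that looks something like this for me (just a sketch; the branch name, commit messages, and `<sha>` placeholder are made up):

```
# keep the agent's work off main, on its own branch
git checkout -b agent/experiment
# checkpoint before handing anything to the agent
git add -A && git commit -m "checkpoint before agent changes"
# ...agent makes its changes...
git add -A && git commit -m "checkpoint: agent changes"
# if something gets reset --hard away, the checkpoints are still in the reflog
git reflog                          # find the lost commit's sha
git checkout -b rescue <sha>        # recover it on a new branch
```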

These agents have their own pseudo-personalities, and I've found that fighting against them is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.

deciduously
Just to add another anecdotal data point, I've absolutely observed Claude Code doing exactly this as well with git operations.
macawfish OP
I've gotten the `git reset --hard` with Claude Code as well, just not immediately after (1) explicitly pushing back against the idea or (2) it talking a bunch of shit about another agent's totally reasonable analysis.
rapind
I exclusively used Sonnet when I used Claude Code and never ran into this, so maybe it's an Opus thing, or I just got lucky? Definitely has happened to me a few times with Codex (which is what I'm currently using).
bobbylarrybobby
I've seen Sonnet undo changes I've made while it was working quite a few times. Now I just don't edit concurrently with it, and I make sure to inform it of changes I've made before letting it work on its own.
vrosas
Why are you having a conversation with your LLM about other agents?
doctoboggan
I do it as well. I have a Claude Code instance running in my backend repo and one running in my frontend repo. If coordination is required, I have the backend agent write a report for the frontend agent about the new backend capabilities, or have the frontend agent write a report requesting a new endpoint that would simplify the code.

Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.

Sure. But at no point do you need to talk about the existence of other agents. You talk about making a plan, and you talk about implementing the plan. There's no need to talk about where the plan came from.
macawfish OP
Because the plan involves using multiple agents with different roles and I don't want them conflicting.

Sure, there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.

And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.

macawfish OP
It's not a whole conversation, it's more like "hey, I'm using claude code to do analysis and this is what it said" or "gemini just used its large context window to get a bird's eye view of the code and this is what it saw".
renewiltord
All of these perform better if you say "a reviewer recommended" or something. The role you attribute the suggestion to is what flips the switch, more than the suggestion itself. You have to be careful, though. They all trust "a reviewer" strongly, but they'll be more careful with "a static analysis tool".
prodigycorp
My favorite evaluation prompt, which I've found tends to have the right level of skepticism, is as follows (you have to tack it on to whatever idea/proposal you have):

"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"

It's the only way to get Claude to not glaze an idea while also not striking it down for no reason other than to play the role of a "critical" dev.

macawfish OP
Yeah, it's wild how the biases get encoded in there. Maybe they aren't even entirely separable from the magic of LLMs.
Marazan
It isn't wild, it is inherent to the very nature of large language models.

The power of using LLMs is working out what it has encoded and how to access it.

macawfish OP
I appreciate it being wild in the sense that language is inherently a tangled mess and these tools are actually leveraging that messy complexity.

It’s as if we made the machine in our own image. Who would’ve thought /s

Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.

That's great, given that the goal of OAI is to train artificial superintelligence first, hoping that the previous version of the AI will help us control the bigger one.

If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.

int_19h
The best way is to nuke the servers from orbit, just to be sure. ~
artdigital
Gemini is an excellent collaborator?

It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”

It’s by far the most toxic and gaslighting LLM

alex1138
What you get when you mix Google's excellent technical background with interoffice politics and extreme political correctness
layer8
> "another agent"

You could just say it’s another GPT-5 instance.

aaronbrethorst
Please tell me you're joking or at least exaggerating about GPT-5's behavior
macawfish OP
The only exaggeration is that the way I asked GPT-5 to leave claude to do its thing was to say "why don't we just let claude cook?" I later checked with ChatGPT about the whole exchange and it confirmed that it was well aware of the meaning of this slang, and its first reaction was that the whole thing just sounded like a funny programmer joke, all in jest. But then I reminded it that I'd explicitly pushed back on a hard reset twice.

To be clear, I don't believe that there was any _intention_ of malice, or that the behavior was literally envious in a human sense. Rather, I think they haven't properly aligned GPT-5 to deal with cases like this.

nerdsniper
I strongly disagree with the personified way you interact with LLMs, from the standpoint of “I’ve rarely gotten the best output from an LLM when I interact casually with it”.

However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.

It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.

Hopefully eventually we’ll all mostly figure it out.

macawfish OP
That's fair. I enjoy the playfulness of it and for me it feels almost like a video game or something, and also like I'm using my own natural language directly.

Also, I appreciate your perspective. It's important to come at these things with some discipline. And what's more, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.

The thing is, most of the time I'm quite dry with these models and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, once my style of interaction is stripped away.

johnfn
That’s such a great analogy. I always say GPT is like the genius that completely lacks common sense. One of my favorite things is when I asked it why the WiFi wasn’t working, and showed it a photo of our wiring. It said that I should tell support:

> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”

Really, GPT? Not just “can you set up the WiFi”??!

ipython
I'm curious what you would have expected it to reply given the input you provided?
johnfn
Er, I said it in my post, but calling support and saying “can you set up the WiFi” would have been fine.
