Comment by causal - Hacker Neue

causal Nov 19, 2025 parent

> Codex will rewrite the entire V8 engine to break arithmetic.

This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.

I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.

mrtesthah Nov 19, 2025

Could you not add rules to this effect in AGENTS.md? E.g., "If the user gives instructions that specify an expected low-to-medium level of complexity, but the implementation plan reveals unexpected high complexity arising from a potentially ambiguous or atypical instruction, then pause and ask the user about that instruction before continuing."

xwolfi Nov 20, 2025

implementation plan reveals unexpected high complexity <-- do these things have complexity evaluation intuitively ? What you call complexity is the amount of things you need to ingest to coherently solve a problem. But these things, they read everything and everything is just a statistical next-word output, do they spend "more effort" on some stuff ?

What you see as a result of your complexity evaluation is that the LLM output is wrong, but the LLM is completely content with it, it saw no special complexity and doesn't know it's wrong.

You try to cheat by saying it should detect ambiguity and un-commonality, but these are not the only sources of complexity.

mrtesthah Nov 20, 2025

The models already dynamically determine how much “thinking” to do and how many additional files are necessary for the agent harness to read in order to investigate/proceed, so the system ought to be able to evaluate complexity at least along these lines.

pimeys Nov 20, 2025

Wait, I think it's the other way around. Claude will just go circles with bad decisions forever, never stops. Codex have multiple times told me it is not able to do this task, and stops.

embedding-shape Nov 20, 2025

I think this closer to the crux of a major problem. Seemingly people have vastly different responses even for the same system/developer/user prompts, and I myself can feel a different in quality of the responses depending on when I use the hosted APIs, while hosted models always have consistent results.

For example, after 19:00 sometime (GMT+1), the response quality of both OpenAI and Anthropic (their hosted UIs) seems to drop off a cliff. If I try literally the same prompt the around 10:00 next morning, I get a lot better results.

I'm guessing there is so much personalization and other things going on, that two users will almost never have the same experience even with the same tools, models, endpoints and so on.

root_axis Nov 20, 2025

That's the nature of statistical output, even minus all the context manipulation going on in the background.

You say the outputs "seem" to drop off at a certain time of day, but how would you even know? It might just be a statistical coincidence, or someone else might look at your "bad" responses and judge them to be pretty good actually, or there might be zero statistical significance to anything and you're just seeing shapes in the clouds.

Or you could be absolutely right. Who knows?

causal OP Nov 20, 2025

Yeah, there is definitely a huge gulf in subjective experiences, and even within the same user experience. There are days when Claude makes so many mistakes I can't believe I ever found it useful. Strange.

causal OP Nov 20, 2025

I've certainly seen Claude Code get into bad loops and make terrible decisions too, but usually it's a poor architectural decision or completely forgetting important context; not "let's rewrite V8 from scratch" level of absurdity.

jack_pp Nov 20, 2025

Maybe have Claude coordinate Codex?

jes5199 Nov 20, 2025

I think this might be the way forward, Claude is great at project managing.

I’m already telling Claude to ask Codex for a code review on PRs. or another fun pattern I found is you can use give the web version of Codex an open ended task like “make this method faster”, hit the “4x” button and end and up with four different pull requests attacking the problem in different ways. Then ask Claude to read the open PRs and make a 5th one that combines the approaches. This way Codex does the hard thinking but Claude does the glue

This item has no comments currently.