Preferences

CuriouslyC
The Anthropic models have been vibe-coding tuned. They're beasts at simple python/ts programs, but they definitely fall apart with scientific/difficult code and large codebases. I don't expect that to change with the new Sonnet.

patates
In my experience Gemini 2.5 Pro is the star when it comes to complex codebases. Give it a single XML file from repomix, and make sure to use the version on AI Studio.
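
For anyone who hasn't tried it: repomix is a CLI that packs a whole repo into one file for exactly this kind of use. The basic invocation is something like the following (flag names from memory; check repomix --help for the current options):

    npx repomix --style xml --output repomix-output.xml

You then attach that single file to a prompt in AI Studio.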
garciasn
In my experience, G2.5P can handle so much more context, and it comes up with an awesome execution plan; CC then implements that plan far better than anything G2.5P would produce on its own. So I give G2.5P the relevant code and data underneath, ask it to develop an execution plan, and then feed the result to CC to do the actual code writing.

This has been outstanding for the AI-assisted development I've been doing as of late.
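
A minimal sketch of that hand-off, assuming the new google-genai Python SDK and a repo already packed by repomix (the filenames, task string, and prompt wording are placeholders, not a tested recipe):

    # Step 1: ask Gemini 2.5 Pro for an execution plan over the packed repo.
    from google import genai

    client = genai.Client()  # reads the API key from the environment

    packed_repo = open("repomix-output.xml").read()  # e.g. produced by repomix

    prompt = (
        "Here is my codebase. Write a detailed, file-by-file execution plan "
        "for the task below. Do not write the implementation itself.\n\n"
        "TASK: <your task here>\n\n" + packed_repo
    )

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
    )

    # Step 2: save the plan and hand it to Claude Code to implement,
    # e.g.: claude "Implement PLAN.md exactly as written."
    open("PLAN.md", "w").write(response.text)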

XenophileJKO
I would believe this. In regular conversational use with the Gemini family of models, I've noticed they regularly have issues with context blending, i.e. confusing what you said versus what they said, and the causality between them.

I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.

+1, but I've recently been experimenting with gpt-5-high for the plan part and it's scary good sometimes.
Gemini 2.5 Pro = Long context king, image input king

GPT-5 = Overengineering/complexity/"enterprise" king

Claude = "Get straightforward shit done efficiently" king

CuriouslyC OP
On the plus side, GPT-5 is very malleable, so you CAN prompt it away from that, whereas it's very hard to prompt Claude into grinding through genuinely hard code: even with a nearly file-by-file breakdown of a task, it'll occasionally run into an obstacle, give up, and produce a mock or stub implementation, basically diverging from the entire plan to do its own version.
Absolutely; sometimes you want, or indeed need, such complexity. Some people work in settings where they'd want it all of the time. IMHO, most people, most of the time, don't really want it, and don't want to have to prompt against it every time. That's why I think it's still very useful to build up experience with all three frontier models, so you can choose according to the situation.
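
For what it's worth, a standing instruction along these lines (wording illustrative, not a tested recipe) is usually enough to steer GPT-5 away from the enterprise patterns:

    You are writing code for a small team. Prefer the simplest design that
    satisfies the stated requirements: no extra abstraction layers, no
    speculative configurability, no new dependencies unless explicitly asked.
    If a 20-line function works, do not turn it into a class hierarchy.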
int_19h
I think a lot of it has to do with the super long context that it has. In extended sessions and/or large codebases, the context can fill up surprisingly quickly.

That said, one thing I do dislike about Gemini is how fond it is of second-guessing the user. This usually manifests as small, unrelated "cleaner code" changes made as part of a larger task, but I've seen cases where the model's reasoning literally said something like "the user very clearly told me to do X, but there's no way that's right; they must have meant Y and probably just mistakenly said X, so I'll do Y now".

One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses the Gemini APIs. For Python, at least, there's the legacy google-generativeai API and the new google-genai API, which have fairly significant differences even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase it will often try to use the old API (even if you pre-write the imports and some examples!). That then of course breaks the type checker, and when Gemini sees the errors, 90% of the time it goes: "oh, it must be failing because the user made an error in that import; I know it's supposed to be generativeai, not genai, so let me correct that."
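
For anyone who hasn't hit this: the two SDKs really do look confusingly similar at the import level, which is exactly the kind of surface difference a model will "helpfully" revert. Roughly (model names illustrative):

    # Legacy SDK: pip install google-generativeai
    import google.generativeai as legacy_genai

    legacy_genai.configure(api_key="...")
    model = legacy_genai.GenerativeModel("gemini-1.5-pro")
    print(model.generate_content("hello").text)

    # New SDK: pip install google-genai
    from google import genai

    client = genai.Client(api_key="...")
    resp = client.models.generate_content(model="gemini-2.5-pro", contents="hello")
    print(resp.text)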

CuriouslyC OP
Yup. In fact, every deep research tool on the market is just a wrapper for Gemini; their "secret sauce" is just how they partition/pack the codebase to feed it into Gemini.
Workaccount2
It's mostly because it is so damn good with long contexts. It can stay on the ball even at 150k tokens, whereas other models really wilt around 50-75k.
epolanski
They are very good with C too, but it helps that there are gazillions of lines of C out there.
sixothree
You definitely need some context management like Serena.
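
For context, Serena is an open-source MCP server that gives the agent symbol-level code navigation (find references, targeted edits) instead of raw file dumps. Registering it with Claude Code looks roughly like this (command shape from Serena's README as I recall it; check the repo for the current invocation):

    claude mcp add serena -- uvx --from git+https://github.com/oraios/serena serena start-mcp-server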
CuriouslyC OP
Even with Serena and detailed plans crafted by Gemini that lay out file-by-file changes, Claude will sometimes go off the rails. Claude is very task-completion driven, and it's willing to relax the constraints of the task in order to complete it in the face of even slight adversity. I can't tell you the number of times I've had Claude try to install a Python computational library, hit an error, and then either try to hand-roll the algorithm (in PYTHON) or just return a hard-coded or mock result. The worst part is that Claude will then tell you it completed the task as instructed in the final summary; "Claude lying" is a meme for a reason.
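
One partial mitigation (illustrative wording; no guarantee it always holds) is to put explicit anti-fallback rules in the project's CLAUDE.md:

    Never substitute a mock, stub, or hard-coded result for a real
    implementation. If a dependency fails to install or a step in the plan
    is blocked, STOP and report the error instead of working around it.
    Do not report the task as complete unless every step actually succeeded.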
sixothree
I have to agree with pretty much all of this. Specifically, I've had Claude fail at creating a database migration using the tooling, then go on to write the migration by hand. My only reaction to anyone doing this, be it human or computer, is "You did WHAT!?".
