It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough that I don't even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
It may be cheaper but it's much, much slower, which is a total flow killer in my experience.
Putting the latest Gemini CLI through some tough code tasks (C++) for my project, I found it to be slower than even Codex, but the quality was good.
The problem I have is skepticism. Gemini 2.5 Pro was amazing on release; I couldn't stop talking about it. Then, after a few months, it became worthless in my workflows. I suspect Google (and other vendors) pull this bait and switch with every release.
Let me see the benchmarks in 3 months.
That said, I haven't had a good experience with Claude Code for the reason you described. Maybe it's Cursor (or similar IDE) that makes the difference.
In Claude, on the other hand, MCP connections really do seem to ‘just work’.
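For anyone unfamiliar, an MCP connection here just means the client launching and talking to a small tool server over the Model Context Protocol. A rough sketch of such a server, assuming the official mcp Python SDK and its FastMCP helper (server and tool names below are illustrative), looks something like this:

    # Minimal MCP tool server sketch (assumes the official `mcp` Python SDK).
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo-tools")  # illustrative server name

    @mcp.tool()
    def add(a: int, b: int) -> int:
        """Add two integers and return the sum."""
        return a + b

    if __name__ == "__main__":
        # Serves over stdio by default; desktop clients typically launch it as a subprocess.
        mcp.run()

Whether that ‘just works’ then mostly comes down to how smoothly the client registers and launches the process.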
My point is, although the model itself may have performed well in benchmarks, I feel like other tools are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
I did not bother verifying the other claims.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
What do you mean by "standard eval harness"?
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
That'd be a bad idea: models are often trained for specific tools (like GPT Codex being trained for Codex, and Sonnet being trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, since they all work differently.
Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will start to be gamed instead.
Evals are hard.
My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.
Gemini is very good at pointing out flaws that are subtle and not noticeable at first or second glance.
It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.
GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (it's hard to tell how much of that is the Codex-specific harness vs. the model). I look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.