It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough that I don't even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
It may be cheaper but it's much, much slower, which is a total flow killer in my experience.
Putting the latest Gemini CLI through some tough code tasks (C++) for my project, I found it to be slower than even Codex, but the quality was good.
The problem I have is skepticism. Gemini 2.5 Pro was amazing on release; I couldn't stop talking about it. Then, after a few months, it became worthless in my workflows. I suspect Google (and other vendors) pull this bait and switch with every release.
Let me see the benchmarks in 3 months.
That said, I haven't had a good experience with Claude Code for the reason you described. Maybe it's Cursor (or similar IDE) that makes the difference.
In Claude, on the other hand, MCP connections really do seem to ‘just work’.
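For anyone unfamiliar, an MCP connection here just means the client launching and talking to a small tool server over the Model Context Protocol. A rough sketch of such a server, assuming the official mcp Python SDK and its FastMCP helper (server and tool names below are illustrative), looks something like this:

    # Minimal MCP tool server sketch (assumes the official `mcp` Python SDK).
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo-tools")  # illustrative server name

    @mcp.tool()
    def add(a: int, b: int) -> int:
        """Add two integers and return the sum."""
        return a + b

    if __name__ == "__main__":
        # Serves over stdio by default; desktop clients typically launch it as a subprocess.
        mcp.run()

Whether that ‘just works’ then mostly comes down to how smoothly the client registers and launches the process.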
My point is, although the model itself may have performed well in benchmarks, I feel like other tools are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
I did not bother verifying the other claims.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
What do you mean by "standard eval harness"?
Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
That'd be a bad idea: models are often trained for specific tools (like GPT Codex being trained for Codex, and Sonnet being trained with Claude Code in mind), and vice versa, the tools are built with a specific model in mind, since they all work differently.
Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will start to be gamed instead.
Evals are hard.
My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.
Gemini is very good at pointing out flaws that are subtle and not noticeable at first or second glance.
It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.
GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (it's hard to tell how much of that is the Codex-specific harness vs. the model). I look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.