
Geometric mean of MMMLU + GPQA-Diamond + SimpleQA + LiveCodeBench (see the sketch after this list):

- Gemini 3.0 Pro: 84.8
- DeepSeek 3.2: 83.6
- GPT-5.1: 69.2
- Claude Opus 4.5: 67.4
- Kimi-K2 (1.2T): 42.0
- Mistral Large 3 (675B): 41.9
- DeepSeek-3.1 (670B): 39.7
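
For context: a geometric mean is the n-th root of the product of the n scores, so a single weak benchmark drags the aggregate down more than it would under an arithmetic mean. A minimal Python sketch of the aggregation (the per-benchmark numbers below are hypothetical; the source lists only the aggregate figures):

```python
import math

def geometric_mean(scores: list[float]) -> float:
    """n-th root of the product of n scores."""
    return math.prod(scores) ** (1 / len(scores))

# Hypothetical per-benchmark scores for a single model; the list above
# reports only the resulting aggregate, not the per-benchmark breakdown.
scores = {
    "MMMLU": 90.0,
    "GPQA-Diamond": 88.0,
    "SimpleQA": 75.0,
    "LiveCodeBench": 87.0,
}
print(f"{geometric_mean(list(scores.values())):.1f}")  # 84.8
```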

The 14B, 8B, and 3B models are SOTA, though, and do not have Chinese censorship like Qwen3.


How is there such a gap between Gemini 3 and GPT 5.1/Opus 4.5? What is Gemini 3 crushing the others on?
Could be optimized for benchmarks, but Gemini 3 has been stellar for my tasks so far.

Maybe an architectural leap?

I believe it is the system instructions that make the difference for Gemini. I use Gemini in AI Studio with my own system prompts to get it to do what I need, which is not possible with gemini.google.com's Gems.
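
For reference, setting a system instruction programmatically (rather than through Gems) looks roughly like this with the google-generativeai Python SDK; the model id and prompt strings are illustrative, not from the source:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# system_instruction plays the role of the AI Studio "system prompt".
# The model name here is illustrative only.
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="Answer tersely, in plain text, citing sources when asked.",
)

response = model.generate_content("Explain what a geometric mean is in one sentence.")
print(response.text)
```
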
Gamed tests?
I always joke that Google pays a dedicated developer to spend their full time just making pelicans on bicycles look good. They certainly have the cash to do it.
