
Geometric mean of MMMLU + GPQA-Diamond + SimpleQA + LiveCodeBench (see the sketch after this list):

- Gemini 3.0 Pro: 84.8
- DeepSeek 3.2: 83.6
- GPT-5.1: 69.2
- Claude Opus 4.5: 67.4
- Kimi-K2 (1.2T): 42.0
- Mistral Large 3 (675B): 41.9
- DeepSeek-3.1 (670B): 39.7
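
For context: a geometric mean is the n-th root of the product of the n scores, so a single weak benchmark drags the aggregate down more than it would under an arithmetic mean. A minimal Python sketch of the aggregation (the per-benchmark numbers below are hypothetical; the source lists only the aggregate figures):

```python
import math

def geometric_mean(scores: list[float]) -> float:
    """n-th root of the product of n scores."""
    return math.prod(scores) ** (1 / len(scores))

# Hypothetical per-benchmark scores for a single model; the list above
# reports only the resulting aggregate, not the per-benchmark breakdown.
scores = {
    "MMMLU": 90.0,
    "GPQA-Diamond": 88.0,
    "SimpleQA": 75.0,
    "LiveCodeBench": 87.0,
}
print(f"{geometric_mean(list(scores.values())):.1f}")  # 84.8
```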

The 14B, 8B, and 3B models are SOTA, though, and do not have Chinese censorship like Qwen3.


How is there such a gap between Gemini 3 and GPT 5.1/Opus 4.5? What is Gemini 3 crushing the others on?
Could be optimized for benchmarks, but Gemini 3 has been stellar for my tasks so far.

Maybe an architectural leap?

I believe it is the system instructions that make the difference for Gemini. I use Gemini in AI Studio with my own system prompts to get it to do what I need, which is not possible with gemini.google.com's Gems.
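
For reference, setting a system instruction programmatically (rather than through Gems) looks roughly like this with the google-generativeai Python SDK; the model id and prompt strings are illustrative, not from the source:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# system_instruction plays the role of the AI Studio "system prompt".
# The model name here is illustrative only.
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="Answer tersely, in plain text, citing sources when asked.",
)

response = model.generate_content("Explain what a geometric mean is in one sentence.")
print(response.text)
```
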
Gamed tests?
I always joke that Google pays a dedicated developer to spend their full time just making pelicans on bicycles look good. They certainly have the cash to do it.
