- Says the CEO of MySpace.
- https://epoch.ai/benchmarks/simplebench
Just do it.
I use a service where I have access to all SOTA models and many open-source models, so I change models within chats, using MCPs. E.g. I start a chat with opus making a search with the perplexity and grok deepsearch MCPs and google search, the next query is with gpt 5 thinking Xhigh, the next one with gemini 3 pro, all in the same conversation. It's fantastic! I can't imagine what it would be like to be locked into using one (or two) companies again. I have nothing to do with the guys who run it (the hosts of the podcast This day in AI), though if you're interested have a look at the simtheory.ai discord.
I don't know how people who use only one service manage...
- Also
https://artificialanalysis.ai/evaluations/omniscience
Prepare to be amazed
- This one is more powerful than openai models, including gpt 5.2 (which is worse on various benchmarks than 5.1, and that's where 5.2 was using XHIGH, whilst the others were on high, e.g.: https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )
- Alright so we have more benchmarks including hallucinations and flash doesn't do well with that, though generally it beats gemini 3 pro and GPT 5.1 thinking and gpt 5.2 thinking xhigh (but then, sonnet, grok, opus, gemini and 5.1 beat 5.2 xhigh) - everything. Crazy.
- Interesting.
Another take-
"Two F1 fan surgeons found a way to visit Ferrari headquarters as a business trip."
- And the Xhigh version is only available via API, not chatgpt.
- I hate the guy, however grok scores high on arc-2 so it would be silly to not at least rank it.
- Why no grok 4.1 reasoning?
- You don't have to abandon privacy when using AI - use a service that accesses enterprise APIs, which have good privacy policies. I use the service from the guys who create the This day in AI podcast, called simtheory.ai - we have access to all of the sota models, so we can flip between any model, including lots of open source ones, within one chat or across multiple chats and compare the same query, using various MCPs and lots of other features. If you're interested have a look at the simtheory.ai discord (I have no connection to the service or to the creators)
- I would but his right "eyebrow" is too distracting
- This is all based on the LLM architecture that likely can't reach AGI.
If they aren't developing in parallel an alternative architecture that can reach AGI, then when one or more companies develop such a new model, OpenAI are toast and all those juicy contracts are kaput.
- RIP privacy.
Hello Stasi Google and its full personalised file on XorNot.
Google knows when you're about to sneeze.
- PSA Don't use chrome.
- Here it makes a text based video editor that works:
- Used an AI to populate some of 5.1 thinking's results.
Benchmark | Description | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes
---------------------|----------------------|--------------|--------------------|------------------------------------------
Humanity's Last Exam | Academic reasoning | 37.5% | 52% | GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2 | Visual abstraction | 31.1% | 28% | GPT-5.1 multimodal improves grid reasoning
GPQA Diamond | PhD-tier Q&A | 91.9% | 61% | GPT-5.1 strong in physics (72%)
AIME 2025 | Olympiad math | 95.0% | 48% | GPT-5.1 solves 7/15 proofs correctly
MathArena Apex | Competition math | 23.4% | 82% | GPT-5.1 handles 90% advanced calculus
MMMU-Pro | Multimodal reasoning | 81.0% | 76% | GPT-5.1 excels visual math (85%)
ScreenSpot-Pro | UI understanding | 72.7% | 55% | Element detection 70%, navigation 40%
CharXiv Reasoning | Chart analysis | 81.4% | 69.5% | N/A
- Used an AI to populate some of 5.1 thinking's results.
Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 | GPT-5.1 Thinking
---------------------------|--------------|----------------|-------------------|---------|------------------
Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% | 52%
ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% | 28%
GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% | 61%
AIME 2025 | 95.0% | 88.0% | 87.0% | 94.0% | 48%
MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% | 82%
MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% | 76%
ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% | 55%
CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% | N/A
OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 | N/A
Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% | N/A
LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 | N/A
Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% | N/A
SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% | N/A
τ2-bench | 85.4% | 54.9% | 84.7% | 80.2% | N/A
Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 | N/A
FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% | N/A
SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% | N/A
MMMLU | 91.8% | 89.5% | 89.1% | 91.0% | N/A
Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% | N/A
MRCR v2 (8-needle) | 77.0% | 58.0% | 47.1% | 61.6% | N/A
Argh, it doesn't come out right in HN
Might be using flash over haiku as my model for MCP research/transcription/minor tasks now, though (will test, of course)