Profile: scrollop - Hacker Neue

Yes! Gemini 3 Pro is significantly slower than Opus (surprisingly), and I prefer Opus' output.

Might be using Flash over Haiku for my MCP research/transcriber/minor tasks model now, though (will test, of course).

Says the CEO of MySpace.

Just do it.

I use a service where I have access to all SOTA models and many open source models, so I change models within chats, using MCPs. E.g. start a chat with Opus making a search with the Perplexity and Grok DeepSearch MCPs and Google search; the next query is with GPT 5 thinking Xhigh, the next one with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine what it would be like to be locked into using one (or two) companies again. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI), though if you're interested have a look in the simtheory.ai Discord.

I don't know how people who use one service can manage...

scrollop 5 days ago

Looking at this they are:

https://artificialanalysis.ai/evaluations/omniscience

https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582

scrollop 5 days ago

Also

https://artificialanalysis.ai/evaluations/omniscience

Prepare to be amazed

scrollop 5 days ago

This one is more powerful than the OpenAI models, including GPT 5.2 (which is worse on various benchmarks than 5.1 which is worse than 5.1, and that's where 5.2 was using XHIGH, whilst the others were on high, e.g.: https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )

https://epoch.ai/benchmarks/simplebench

scrollop 5 days ago

Alright, so we have more benchmarks, including hallucinations, and Flash doesn't do well on that one, though generally it beats everything: Gemini 3 Pro, GPT 5.1 thinking and GPT 5.2 thinking xhigh (but then, Sonnet, Grok, Opus, Gemini and 5.1 beat 5.2 xhigh). Crazy.

https://artificialanalysis.ai/evaluations/omniscience

scrollop Dec 13, 2025

Interesting.

Another take-

"Two F1 fan surgeons found a way to visit Ferrari headquarters as a business trip."

scrollop Dec 13, 2025

And the Xhigh version is only available via API, not chatgpt.

scrollop Dec 12, 2025

I hate the guy, however grok scores high on arc-2 so it would be silly to not at least rank it.

scrollop Dec 11, 2025

Why no grok 4.1 reasoning?

scrollop Dec 9, 2025

You don't have to abandon privacy when using an AI: use a service that accesses enterprise APIs, which have good privacy policies. I use the service from the guys who create the This Day in AI podcast, called simtheory.ai. We have access to all of the SOTA models, so we can flip between any model (including lots of open source ones) within one chat, or compare the same query across multiple chats, using various MCPs and lots of other features. If you're interested, have a look at the simtheory.ai Discord (I have no connection to the service or to the creators).

scrollop Dec 6, 2025

I would but his right "eyebrow" is too distracting

scrollop Dec 2, 2025

This is all based on the LLM architecture that likely can't reach AGI.

If they aren't developing, in parallel, an alternative architecture that can reach AGI, then when one or more companies develop such a new model, OpenAI are toast and all those juicy contracts are kaput.

scrollop Dec 2, 2025

RIP privacy.

Hello Stasi Google and its full personalised file on XorNot.

Google knows when you're about to sneeze.

scrollop Nov 22, 2025

PSA Don't use chrome.

scrollop Nov 18, 2025

Here it makes a text based video editor that works:

https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797

scrollop Nov 18, 2025

Used an AI to populate some of 5.1 thinking's results.

Benchmark            | Description          | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes
---------------------|----------------------|--------------|--------------------|------------------------------------------
Humanity's Last Exam | Academic reasoning   | 37.5%        | 52%                | GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2            | Visual abstraction   | 31.1%        | 28%                | GPT-5.1 multimodal improves grid reasoning
GPQA Diamond         | PhD-tier Q&A         | 91.9%        | 61%                | GPT-5.1 strong in physics (72%)
AIME 2025            | Olympiad math        | 95.0%        | 48%                | GPT-5.1 solves 7/15 proofs correctly
MathArena Apex       | Competition math     | 23.4%        | 82%                | GPT-5.1 handles 90% advanced calculus
MMMU-Pro             | Multimodal reasoning | 81.0%        | 76%                | GPT-5.1 excels visual math (85%)
ScreenSpot-Pro       | UI understanding     | 72.7%        | 55%                | Element detection 70%, navigation 40%
CharXiv Reasoning    | Chart analysis       | 81.4%        | 69.5%              | N/A

scrollop Nov 18, 2025

Used an AI to populate some of 5.1 thinking's results.

---------------------------|--------------|----------------|-------------------|---------|------------------
Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% | 52%
ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% | 28%
GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% | 61%
AIME 2025 | 95.0% | 88.0% | 87.0% | 94.0% | 48%
MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% | 82%
MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% | 76%
ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% | 55%
CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% | N/A
OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 | N/A
Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% | N/A
LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 | N/A
Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% | N/A
SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% | N/A
t2-bench | 85.4% | 54.9% | 84.7% | 80.2% | N/A
Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 | N/A
FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% | N/A
SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% | N/A
MMLU | 91.8% | 89.5% | 89.1% | 91.0% | N/A
Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% | N/A
MRCR v2 (8-needle) | 77.0% | 58.0% | 47.1% | 61.6% | N/A

Argh, it doesn't come out right in HN.