Preferences


I’m amazed by how much Gemini 3 flash hallucinates; it performs poorly in that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), opus 4.5 and 4.5 haiku are.

Can someone explain how Gemini 3 pro/flash then do so well then in the overall Omniscience: Knowledge and Hallucination Benchmark?

Hallucination rate is hallucination/(hallucination+partial+ignored), while omniscience is correct-hallucination.

One hypothesis is that gemini 3 flash refuses to answer when unsuure less often than other models, but when sure is also more likely to be correct. This is consistent with it having the best accuracy score.

I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.

> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant

This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% others), it's not going to be in the most desirable quadrant by definition.

For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.

I'm confused about the "Accuracy vs Cost" section. Why is Gemini 3 Pro so cheap? It's basically the cheapest model in the graph (sans Llama 4 and Mistral Large 3) by a wide margin, even compared to Gemini 3 Flash. Is that an error?
It's not an error, Gemini 3 Pro is just somehow able to complete the benchmark while using way fewer tokens than any other model. Gemini 3 Flash is way cheaper per token, but it also tends to generate a ton of reasoning tokens to get to its answer.

They have a similar chart that compares results across all their benchmarks vs. cost and 3 Flash is about half as expensive as 3 Pro there despite being four times cheaper per token.

This item has no comments currently.