> If they specifically tried to cheat at this benchmark it would be obvious and they would be called out
I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”
Most people trust a benchmark if it seems to be a reasonable test of something they assume may be relevant to them. They may not actually want a pelican on a bicycle, but they do want an LLM that is capable of producing one.