A competitive geoguesser clearly got there through memorization and copious internet searching. So comparing knowledge retained in the trained model to knowledge retained in the brain feels surprisingly fair.
Conversely, the model sharing, “I found the photo by crawling Instagram and used an email MCP to ask the user where they took it. It’s in Austria” is unimpressive.
So, independent of whether it actually improves performance, the cheating/not-cheating distinction raises an interesting question: what do we consider to be the cohesive essence of the model?
For example, RAG against a comprehensive local filesystem would also feel like cheating to me, like a human geoguessing in a library filled with encyclopedias. But the fact that I find vanilla O3 impressive suggests I somehow have an opaque (and quite poorly informed) opinion of the model boundary, where it’s a legitimate victory if the model was birthed with that knowledge baked in, but that’s it.