I wonder how this would apply to vision models? I tried a few toy spot-the-difference examples with single images, and both Claude and Gemini seem to do pretty well at finding the differences. An example image: https://www.pinterest.com/pin/127578601938412480/
They seem to struggle more when you flip the image around (finding fewer differences, and sometimes hallucinating ones that aren't there).
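For anyone who wants to poke at this themselves, here's a rough sketch of the kind of test I mean (the model name, file path, and prompt are placeholders, not exactly what I ran): rotate the puzzle 180° with Pillow and send both versions to Claude via the Anthropic messages API, then eyeball the two answers.

```python
# Rough sketch: send the original and a 180-degree-rotated copy of a
# spot-the-difference image to a vision model and compare the answers.
# Requires `pip install anthropic pillow` and ANTHROPIC_API_KEY in the env.
import base64
import io

from PIL import Image
import anthropic

client = anthropic.Anthropic()


def encode_jpeg(img: Image.Image) -> str:
    # Serialize the image to base64-encoded JPEG for the API payload.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode()


def spot_differences(img: Image.Image) -> str:
    # Ask the model to enumerate the differences in the puzzle image.
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any vision-capable model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": encode_jpeg(img)}},
                {"type": "text",
                 "text": "This is a spot-the-difference puzzle. List every "
                         "difference between the two halves of the image."},
            ],
        }],
    )
    return response.content[0].text


# Hypothetical local copy of the puzzle image linked above.
original = Image.open("spot_the_difference.jpg")
# "Flipping the image around" interpreted here as a 180-degree rotation.
flipped = original.transpose(Image.Transpose.ROTATE_180)

print("--- original ---")
print(spot_differences(original))
print("--- flipped ---")
print(spot_differences(flipped))
```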