Sounds a bit silly to write it out, but the diagram did a great job removing ambiguity when you expect someone to be laying on the ground in a tight place looking backwards, upside down.
Also feels important to note that in the theatre, there is stage-right and stage-left, jargon to disambiguate even though the jargon expects you to know the meaning to understand it.
I guess car people use “driver side” and passenger side”, but the same car might be sold in mirror image versions
The mistake is in the prompting (not enough information). The AI did the best it could
"What's the biggest known planet" "Jupiter" "NO I MEANT IN THE UNIVERSE!"
Heck, humans are so flawed, they'll put the things in the wrong eye socket even knowing full well exactly where they should go - something a computer literally couldn't do.
In other examples, almost every single person has had the experience of saying, "turn right", "oh I meant left sorry, I knew it was right too, I don't know why I said left". Even the most sophisticated humans have made this error. A computer would never.
Humans are deeply flawed and after pre-selection require expensive training to perform complex tasks at a never perfect success rate.
So the understanding that AI and HI are different entities altogether with only a subset of communication protocols between them will become more and more obvious, like some comments here are already implicitly telling.
I dunno... I feel pretty confident 99% percent of people would do the same thing, and put the strawberry in the eye socket to our left, the viewer's.
You really have to be trained explicitly to put yourself in the subject's shoes, and very few people are. To me, the model is correctly following the instructions most people will mean.
And it's not even incorrect. "The left x" is linguistically ambiguous. If you say "the left flower", it's obviously the flower to our left. So when you say "the left eye socket", the eye socket to our left is a valid interpretation. If they had said their or its left eye socket, then it's more arguable that it must be from the subject's side. But that's not the case in this example.
Does it still use the viewer's perspective if the prompt specifies "Put a strawberry in the _patient's left eye_"? If it does, then you're onto something. Otherwise I completely disagree with this.
Same language, opposite meaning because of a particular noun + context.
I think the only thing obvious here is that there is no obvious solution other than adding lots of clarification to your prompt.
Also context matters, if you're talking to someone you would say "right shoulder" for _their_ right since you know it's an observer with different vantage point. Talking about a scene in a photo "the right shoulder" to me would more often mean right portion of the photo even if it was the person's left shoulder.
>> All five of the edits are implemented correctly
This is a GREAT example of the (not so) subtle mistakes AI will make in image generation, or code creation, or your future knee surgery. The model placed the specified items in the eye sockets based on the viewers left/right; when we talk relative in this scenario we usually (always?) mean from the perspective of the target or "owner". Doctors make this mistake too (they typically mark the correct side with a sharpie while the patient is still alert) but I'd be more concerned if we're "outsourcing" decision making without adequate oversight.
https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nan...