Remember that in image editing, the source image itself is a huge part of the prompt, and that's often where the ambiguity comes from. The model may clearly understand your instruction to change the color of a shirt but struggle to work out the shirt's boundaries. I was recently struggling to use AI to edit an image where the model kept treating the hat as the wearer's hair. My guess is that this bias comes from the model having been trained on more faces without hats than with them.
Something similar has long been the case with text models: people write vague instructions and are then dissatisfied when the model does not correctly guess their intentions. With image models it's even harder for the model to guess right without enough detail.
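To make the point about detail concrete, here is a hypothetical sketch (the prompts and the shirt-editing scenario are invented for illustration, not from any specific tool) contrasting a vague edit instruction with one that spells out the target region and what must stay untouched:

```python
# Hypothetical example: the same edit request written two ways.
# The vague prompt leaves the model to guess the shirt's boundaries;
# the detailed one names the region and lists what must not change.

vague_prompt = "Change the shirt to red."

detailed_prompt = (
    "Change the person's shirt to solid red. "
    "The shirt covers the torso and both sleeves, ending at the wrists. "
    "Do not alter the collar, the hat, the hair, or the background."
)

# A quick sanity check that the detailed prompt actually constrains more:
# it explicitly fences off the regions the model tends to confuse.
for protected in ("hat", "hair", "background"):
    assert protected in detailed_prompt
```

Neither prompt guarantees a correct mask, but the detailed version gives the model far less room to confuse the hat with hair in the first place.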