I'm surprised because it's such a simple task: any human who is a bit diligent could figure it out, since they're given both the original and the modified version.
However, it feels a bit like counting letters, so maybe it can be solved with post-training. We'll know in 3 to 6 months whether it was easy for the labs to "fix" this.
In my daily use of LLMs, I regularly get overly optimistic answers because they fail to consider information that might be absent (which is even harder when it's out of context).
If we want models that detect absences, we need training objectives that expect absence, and maybe even input encodings that represent "this might have been here."
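To make that concrete, here's a minimal sketch of what such an objective could look like as a data-generation step (entirely my own toy construction; `make_absence_example`, the `p_keep` rate, and the -1 "nothing missing" label are all hypothetical): delete a random line from a document and train the model, given both versions, to say which line is gone, including a "nothing was deleted" case so it can't just assume something is always missing.

```python
import random

def make_absence_example(lines, p_keep=0.2):
    """Build one (original, modified, target) pair for a toy
    'detect the absence' objective.

    With probability p_keep nothing is deleted and the target is -1,
    so the model must also learn to answer "nothing is missing"
    instead of always guessing a deletion. Otherwise one line is
    removed and the target is its index in the original.
    """
    if not lines or random.random() < p_keep:
        return list(lines), list(lines), -1
    idx = random.randrange(len(lines))
    return list(lines), lines[:idx] + lines[idx + 1:], idx

# The model would see (original, modified) and predict `target`.
original, modified, target = make_absence_example(["a = 1", "b = 2", "c = 3"])
```

The "nothing was deleted" class is the part that actually encodes the expectation of absence: without it, the objective only teaches the model to localize deletions it's told exist, not to notice that something might be missing in the first place.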