For the attention mechanism, there isn't much difference between
Original: {shared prefix} {removed part} {shared suffix}
Modified: {shared prefix} {shared suffix}

and

Original: {shared prefix} {shared suffix}
Modified: {shared prefix} {added part} {shared suffix}
I think you could implement an algorithm for this in RASP (a language for manually programming transformers) roughly like this:

1. The first layer uses attention to the "Original:" and "Modified:" tokens to determine whether the current token is in the original or the modified part.
2. The second layer has one head attend equally to all original tokens, which averages their values, and another head attends equally to all modified tokens, averaging them as well. The averages are combined by computing their difference.
3. The third layer attends to tokens that are similar to this difference, which would be the ones in the {removed part}/{added part}.
The only ordering-dependent part is whether you compute the difference as original_average - modified_average or the other way around.
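To make the idea concrete, here is a toy NumPy sketch of the three steps. This is not actual RASP and not taken from the paper; the vocabulary, random embeddings, and function names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random placeholder embeddings for a tiny made-up vocabulary.
vocab = {w: rng.normal(size=64) for w in
         ["Original:", "Modified:", "the", "red", "cat", "sat"]}

def embed(tokens):
    return np.stack([vocab[t] for t in tokens])

def find_changed_token(tokens):
    # Step 1: decide, per token, whether it sits in the original or the
    # modified segment by looking back to the nearest marker token.
    segment, current = [], None
    for t in tokens:
        if t == "Original:":
            current = "orig"
        elif t == "Modified:":
            current = "mod"
        segment.append(current)

    x = embed(tokens)
    orig = np.array([s == "orig" and t != "Original:" for s, t in zip(segment, tokens)])
    mod  = np.array([s == "mod"  and t != "Modified:" for s, t in zip(segment, tokens)])

    # Step 2: uniform attention over a segment is just the mean of its values;
    # combine the two heads by taking the difference of the means.
    diff = x[orig].mean(axis=0) - x[mod].mean(axis=0)

    # Step 3: attend to tokens that are similar to the difference vector;
    # the removed token should score highest.
    scores = x @ diff
    return tokens[int(np.argmax(scores))]

# Removal case: "red" is present in the original but not in the modified copy.
tokens = ["Original:", "the", "red", "cat", "sat",
          "Modified:", "the", "cat", "sat"]
print(find_changed_token(tokens))  # very likely prints "red"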
If a model can detect additions but not removals, that would show that it is capable of learning this or a similar algorithm in principle, but wasn't trained on enough removal-style data to develop the necessary circuitry.
Whether additional channels were recognized in training usually didn't matter for the experiments and models I worked with before 2022, and when it did matter, it certainly wasn't because of color. Then again, the work I was doing was on known classes (plus some additional confusers) for object detection and classification, where color pretty much didn't matter in the first place.
The authors posit that the poor performance arises because the attention mechanism of Transformers cannot attend to the removed tokens: there are no keys for them!
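To spell out the "no keys" point with a toy example (my own illustration, not the paper's code): standard attention builds keys and values only from tokens that are actually in the input, so there is literally nothing for a query to select once the token of interest has been removed.

```python
import numpy as np

def attention(q, K, V):
    # Standard scaled dot-product attention for a single query.
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(1)
d = 8
# The modified sequence after a removal: "red" is simply not there,
# so no key or value is ever built for it.
modified_tokens = ["the", "cat", "sat"]
K = rng.normal(size=(len(modified_tokens), d))  # one key per *present* token
V = rng.normal(size=(len(modified_tokens), d))  # one value per *present* token
q = rng.normal(size=d)                          # a query looking for the change
out = attention(q, K, V)  # can only mix the three tokens that exist
```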
Thank you for sharing on HN.