As far as I can tell, the paper covers text documents only. Therefore your example doesn't quite apply.
It is well known that LLMs still have a ways to go when it comes to processing images the way they process text or audio.
I don't think there's any good-performing multimodal model that accepts raw image pixels directly. Most vision capabilities are hacks or engineered in: an image undergoes several processing steps, and each processor's outputs are fed to the transformer as tokens. This may all happen within one network, but there are non-transformer networks involved (rough sketch of the pattern after the list). Examples of preprocessing:
* OCR
* CNNs (2D pattern recognizers) applied at different zooms, angles, slices, etc.
* Others maybe too?
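To illustrate the general pattern I mean (not any specific model's actual architecture), here's a minimal, hypothetical PyTorch sketch: a small CNN acts as the non-transformer preprocessor, its feature grid is flattened, and a linear projection turns each grid cell into an "image token" in the language model's embedding space. The class name and layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class CNNImageTokenizer(nn.Module):
    """Toy vision tower: pixels -> CNN feature grid -> tokens for a transformer.

    Hypothetical sketch of the preprocess-then-tokenize pattern, not a real model.
    """
    def __init__(self, d_model=512):
        super().__init__()
        # Non-transformer preprocessor: a small conv stack that downsamples
        # the image into a coarse spatial feature grid.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Project each grid cell into the transformer's embedding dimension,
        # so every cell becomes one "image token".
        self.proj = nn.Linear(256, d_model)

    def forward(self, images):                      # images: (batch, 3, H, W)
        feats = self.backbone(images)               # (batch, 256, H/8, W/8)
        tokens = feats.flatten(2).transpose(1, 2)   # (batch, H/8 * W/8, 256)
        return self.proj(tokens)                    # (batch, n_tokens, d_model)

if __name__ == "__main__":
    tokenizer = CNNImageTokenizer(d_model=512)
    image_tokens = tokenizer(torch.randn(1, 3, 224, 224))
    # These embeddings would be concatenated with the text-token embeddings
    # before being fed to the transformer.
    print(image_tokens.shape)  # torch.Size([1, 784, 512])
```

The point is that the transformer itself never sees pixels, only the token embeddings produced by whatever vision front end sits in front of it.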