You're correct! Feature extractors lose fidelity and have finite attention, just like us. But we can reduce/compress the "essence" of an image, paragraph, song, etc. into some combination of underlying features.
Think of a 4096x4096 pixel white image.
To hold this image in mind, does your memory load tens of millions of bits? Thankfully no! What if we add a big red circle which spans the image? Or write the chorus of All Star inside it? Ezpz! The number of "features" involved is comically small.
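As a toy illustration of how little information is actually in that image (using plain zlib compression as a crude stand-in for what a learned feature extractor does), here's a hypothetical snippet:

```python
import zlib
import numpy as np

# A 4096x4096 single-channel white image: ~16.7 million pixels.
white = np.full((4096, 4096), 255, dtype=np.uint8)

raw_bytes = white.tobytes()
compressed = zlib.compress(raw_bytes)

print(len(raw_bytes))    # 16_777_216 bytes raw
print(len(compressed))   # orders of magnitude smaller: it's almost all redundancy
```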
Same thing for AI models. They discover the concept of letters, the sound of B-flats, image symmetry, turns of phrase, the conceptual distance between a "woman" and a "queen", etc. These are all natural patterns common to the data they see. They can thus (like us!) reduce complicated input into a (fixed-size) smear of these learned, related features.
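To make the "woman"/"queen" bit concrete, here's a toy sketch with hand-made 3-dimensional vectors (purely illustrative; real models learn these in hundreds or thousands of dimensions):

```python
import numpy as np

# Toy, hand-made "embeddings" -- not from a real model. The idea: related
# concepts land near each other, and some directions (like a rough gender
# axis) stay consistent across word pairs.
emb = {
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.9, 0.1, 1.0]),
    "king":  np.array([0.7, 0.9, 0.0]),
    "queen": np.array([0.7, 0.9, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman lands near queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))   # ~1.0 in this toy setup
```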
Or is my understanding completely off? Perhaps it's "translating" the image to text, outputting a sequence of text tokens as it scans the image regions, and then the text query (e.g. "what's funny about this") uses that translation as its context? Presumably, this is how the model handles audio input.
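For what it's worth, the common alternative to the "translate to text first" idea is to project image features straight into the same embedding space the text tokens live in (roughly LLaVA-style), so the image never has to pass through words at all. A rough, hypothetical sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of the "shared embedding space" approach -- not a
# description of any specific production model; all dimensions are made up.
d_vision = 1024   # width of the vision encoder's patch features
d_model  = 4096   # width of the language model's token embeddings

class ImageToTokens(nn.Module):
    """Project vision-encoder patch features into the LLM's embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_features):    # (num_patches, d_vision)
        return self.proj(patch_features)  # (num_patches, d_model)

# Pretend a frozen vision encoder produced 576 patch features for one image.
patch_features = torch.randn(576, d_vision)
image_embeds = ImageToTokens()(patch_features)   # "soft tokens", not text

# The prompt's text embeddings ("what's funny about this") are concatenated
# with the image embeddings and fed to the transformer together -- no
# intermediate caption step required.
text_embeds = torch.randn(12, d_model)
sequence = torch.cat([image_embeds, text_embeds], dim=0)
print(sequence.shape)   # (588, 4096)
```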