You're correct! Feature extractors lose fidelity and have finite attention, just like us. But we can reduce/compress the "essence" of an image, paragraph, song, etc. into some combination of underlying features.
Think of a 4096x4096 pixel white image.
To hold this image in mind, does your memory load tens of millions of bits? Thankfully no! What if we add a big red circle which spans the image? Or write the chorus of All Star inside it? Ezpz! The number of "features" involved is comically small.
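As a toy illustration of how little information is actually in that image (using plain zlib compression as a crude stand-in for what a learned feature extractor does), here's a hypothetical snippet:

```python
import zlib
import numpy as np

# A 4096x4096 single-channel white image: ~16.7 million pixels.
white = np.full((4096, 4096), 255, dtype=np.uint8)

raw_bytes = white.tobytes()
compressed = zlib.compress(raw_bytes)

print(len(raw_bytes))    # 16_777_216 bytes raw
print(len(compressed))   # orders of magnitude smaller: it's almost all redundancy
```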
Same thing for AI models. They discover the concept of letters, the sound of B-flats, image symmetry, turns of phrase, the conceptual distance between a "woman" and a "queen", etc. These are all natural patterns common to the data they see. They can thus (like us!) reduce complicated input into a (fixed-size) smear of these learned, related features.
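To make the "woman"/"queen" bit concrete, here's a toy sketch with hand-made 3-dimensional vectors (purely illustrative; real models learn these in hundreds or thousands of dimensions):

```python
import numpy as np

# Toy, hand-made "embeddings" -- not from a real model. The idea: related
# concepts land near each other, and some directions (like a rough gender
# axis) stay consistent across word pairs.
emb = {
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.9, 0.1, 1.0]),
    "king":  np.array([0.7, 0.9, 0.0]),
    "queen": np.array([0.7, 0.9, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman lands near queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))   # ~1.0 in this toy setup
```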
Or is my understanding completely off? Perhaps it's "translating" the image to text, outputting a sequence of text tokens as it scans the image regions, and then the text query (e.g. "what's funny about this") uses that translation as its context? Presumably, this is how the model handles audio input.
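For what it's worth, the common alternative to the "translate to text first" idea is to project image features straight into the same embedding space the text tokens live in (roughly LLaVA-style), so the image never has to pass through words at all. A rough, hypothetical sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of the "shared embedding space" approach -- not a
# description of any specific production model; all dimensions are made up.
d_vision = 1024   # width of the vision encoder's patch features
d_model  = 4096   # width of the language model's token embeddings

class ImageToTokens(nn.Module):
    """Project vision-encoder patch features into the LLM's embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_features):    # (num_patches, d_vision)
        return self.proj(patch_features)  # (num_patches, d_model)

# Pretend a frozen vision encoder produced 576 patch features for one image.
patch_features = torch.randn(576, d_vision)
image_embeds = ImageToTokens()(patch_features)   # "soft tokens", not text

# The prompt's text embeddings ("what's funny about this") are concatenated
# with the image embeddings and fed to the transformer together -- no
# intermediate caption step required.
text_embeds = torch.randn(12, d_model)
sequence = torch.cat([image_embeds, text_embeds], dim=0)
print(sequence.shape)   # (588, 4096)
```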