wrp OP

About 20 years ago, I was working on a project to build a vector model over a corpus of math texts. What ultimately killed it was that I couldn't figure out a way to automatically reduce the equations to consistent, searchable text. This paper and others I've read ignore that issue. Anyone know how it is actually handled?

threeducks
Vision-language models are quite good at converting images to LaTeX or any other kind of representation these days. Here is a demo where you can upload a screenshot of a page with an equation and ask the neural network to, e.g., "Transcribe this equation as LaTeX.":

https://huggingface.co/spaces/Qwen/Qwen2-VL

You can also run a smaller model locally if you have enough VRAM, for example Qwen2.5-VL-7B-Instruct:

https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file#usin...

It also works reasonably well with handwritten equations.
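
If you want to script the local route, it is only a few lines. Here is a rough sketch adapted from the Qwen2.5-VL README, assuming a recent transformers release plus the qwen_vl_utils helper package; "equation.png" is a placeholder path:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "equation.png"},  # placeholder path
            {"type": "text", "text": "Transcribe this equation as LaTeX."},
        ],
    }]

    # Build the chat prompt and pack the image into model inputs.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)

    # Generate, then strip the prompt tokens and decode only the answer.
    out = model.generate(**inputs, max_new_tokens=256)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])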

For searching for similar equations, you can probably embed each equation as a high-dimensional vector and then search for the closest vectors. Here is a ranking of text embedding models:

https://huggingface.co/spaces/mteb/leaderboard
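
A minimal sketch with the sentence-transformers package; the model name here is just an example, so pick something current from the leaderboard:

    from sentence_transformers import SentenceTransformer, util

    # Example model only; substitute a stronger one from the MTEB leaderboard.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = [r"\frac{a}{b} + c", r"e^{i\pi} + 1 = 0", r"\sum_{n=1}^\infty \frac{1}{n^2}"]
    query = r"c + \frac{a}{b}"

    # Embed every equation once, embed the query, and take the
    # nearest neighbor by cosine similarity.
    corpus_emb = model.encode(corpus)
    query_emb = model.encode(query)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    best = int(scores.argmax())
    print(corpus[best], float(scores[best]))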

Or, if you want something more deterministic, parse the LaTeX equation into an abstract syntax tree, for which there are plenty of similarity measures.
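
For instance, a sketch assuming SymPy (its LaTeX parser needs antlr4-python3-runtime) and the zss tree-edit-distance package:

    from sympy.parsing.latex import parse_latex
    from zss import Node, simple_distance

    def to_tree(expr):
        # Leaves (symbols, numbers) are labeled by their value,
        # inner nodes by their operator (Add, Mul, Pow, ...).
        if not expr.args:
            return Node(str(expr))
        node = Node(expr.func.__name__)
        for arg in expr.args:
            node.addkid(to_tree(arg))
        return node

    a = to_tree(parse_latex(r"\frac{x^2 + 1}{y}"))
    b = to_tree(parse_latex(r"\frac{1 + x^2}{y}"))

    # Distance is 0 here: SymPy canonicalizes argument order, so two
    # different LaTeX spellings parse to the same tree.
    print(simple_distance(a, b))

As a side effect, the parse step already normalizes a lot of the surface variation in the LaTeX itself.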

wrp OP
I think LaTeX representations are not unique. Is there a schema for normalizing them?
lanstin
In practice, communities of research mathematicians develop shared LaTeX styles, much as they converge on how to name variables and on which proofs can be taken for granted. Collaborations are one way the LaTeX gets synchronized, as are professor/grad-student relationships.

So when you ingest all the LaTeX, you get the semantics, LaTeX conventions, and variable naming of each school of math for free.

meowkit
Not sure I follow your question. Pretty sure they tokenize the LaTeX, which should be searchable if the unrendered LaTeX source is available.
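
As a toy illustration (a regex split is my own assumption here, not necessarily what the paper does):

    import re

    # Split LaTeX source into control sequences, braces/scripts, and
    # single characters, which could then feed a search index.
    def tokenize(tex: str):
        return re.findall(r"\\[A-Za-z]+|\\.|[{}^_]|[^\s{}^_\\]", tex)

    print(tokenize(r"\frac{a}{b} + c^2"))
    # ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', 'c', '^', '2']
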
wrp OP
We were doing OCR on printed text. I tried to get access to the Zentralblatt MATH corpus, but no go.
