wrp OP

About 20 years ago, I was working on a project to build a vector model over a corpus of math texts. What ultimately killed it was that I couldn't figure out a way to automatically reduce the equations to consistent, searchable text. This paper and others I've read ignore that issue. Anyone know how it is actually handled?

threeducks
Vision-language models are quite good at converting images to LaTeX or any other kind of representation these days. Here is a demo where you can upload a screenshot of a page with an equation and ask the neural network to, e.g., "Transcribe this equation as LaTeX.":

https://huggingface.co/spaces/Qwen/Qwen2-VL

You can also run a smaller model locally if you have enough VRAM, for example Qwen2.5-VL-7B-Instruct:

https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file#usin...

It also works reasonably well with handwritten equations.
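
If you want to script the local route, it is only a few lines. Here is a rough sketch adapted from the Qwen2.5-VL README, assuming a recent transformers release plus the qwen_vl_utils helper package; "equation.png" is a placeholder path:

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "equation.png"},  # placeholder path
            {"type": "text", "text": "Transcribe this equation as LaTeX."},
        ],
    }]

    # Build the chat prompt and pack the image into model inputs.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)

    # Generate, then strip the prompt tokens and decode only the answer.
    out = model.generate(**inputs, max_new_tokens=256)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])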

For searching for similar equations, you can probably embed each equation as a high-dimensional vector and then search for the closest vectors. Here is a ranking of text embedding models:

https://huggingface.co/spaces/mteb/leaderboard
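
A minimal sketch with the sentence-transformers package; the model name here is just an example, so pick something current from the leaderboard:

    from sentence_transformers import SentenceTransformer, util

    # Example model only; substitute a stronger one from the MTEB leaderboard.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = [r"\frac{a}{b} + c", r"e^{i\pi} + 1 = 0", r"\sum_{n=1}^\infty \frac{1}{n^2}"]
    query = r"c + \frac{a}{b}"

    # Embed every equation once, embed the query, and take the
    # nearest neighbor by cosine similarity.
    corpus_emb = model.encode(corpus)
    query_emb = model.encode(query)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    best = int(scores.argmax())
    print(corpus[best], float(scores[best]))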

Or, if you want something more deterministic, parse the LaTeX equation into an abstract syntax tree, for which there are plenty of similarity measures.
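
For instance, a sketch assuming SymPy (its LaTeX parser needs antlr4-python3-runtime) and the zss tree-edit-distance package:

    from sympy.parsing.latex import parse_latex
    from zss import Node, simple_distance

    def to_tree(expr):
        # Leaves (symbols, numbers) are labeled by their value,
        # inner nodes by their operator (Add, Mul, Pow, ...).
        if not expr.args:
            return Node(str(expr))
        node = Node(expr.func.__name__)
        for arg in expr.args:
            node.addkid(to_tree(arg))
        return node

    a = to_tree(parse_latex(r"\frac{x^2 + 1}{y}"))
    b = to_tree(parse_latex(r"\frac{1 + x^2}{y}"))

    # Distance is 0 here: SymPy canonicalizes argument order, so two
    # different LaTeX spellings parse to the same tree.
    print(simple_distance(a, b))

As a side effect, the parse step already normalizes a lot of the surface variation in the LaTeX itself.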

wrp OP
I think LaTeX representations are not unique. Is there a schema for normalizing them?
lanstin
In practice, communities of research mathematicians develop shared LaTeX styles, much as they converge on how to name variables and on which proofs can be taken for granted. Collaborations are one way the LaTeX gets synchronized, as are professor/grad-student relationships.

So when you ingest all the LaTeX, you get the semantics, LaTeX conventions, and variable naming of each school of math for free.

meowkit
Not sure I follow your question. Pretty sure they tokenize the LaTeX, which should be searchable if the unrendered LaTeX source is available.
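
As a toy illustration (a regex split is my own assumption here, not necessarily what the paper does):

    import re

    # Split LaTeX source into control sequences, braces/scripts, and
    # single characters, which could then feed a search index.
    def tokenize(tex: str):
        return re.findall(r"\\[A-Za-z]+|\\.|[{}^_]|[^\s{}^_\\]", tex)

    print(tokenize(r"\frac{a}{b} + c^2"))
    # ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', 'c', '^', '2']
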
wrp OP
We were doing OCR on printed text. I tried to get access to the Zentralblatt MATH corpus, but no go.
