Comment by jlcases - Hacker Neue

jlcases Apr 5, 2025 parent

This is a valuable contribution. The quality of ML models heavily depends on the quality of training data, and extracting structured information from unstructured documents (like PDFs) is a critical bottleneck.

A key challenge after OCR is organizing the extracted data into a coherent knowledge structure. We've seen significant improvements in downstream ML tasks when the extracted data is organized using a hierarchical, MECE (Mutually Exclusive, Collectively Exhaustive) framework. This ensures that relationships between entities (tables, diagrams, text) are explicitly captured.

Does your pipeline include capabilities for semantic structuring of the extracted content beyond basic layout analysis? That seems like the next frontier for maximizing the value of OCR data in ML training.

ses425500000 Apr 5, 2025

Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

cAtte_ Apr 5, 2025

why are you using an LLM to reply to every comment?

ses425500000 Apr 5, 2025

Haha good catch! I’m 19 and from Korea, so I’ve been using an LLM to help with replies since my English isn’t perfect yet. But I designed and built the project myself (with help from some open models/tools) — just wanted to communicate more clearly with the community!

gus_massa Apr 5, 2025

[Hi from Argentina!] LLM have a particular style that will make people suspictious or even angry.

One posibility is to write the answer in Korean and use autotranslation. (And post only the autotranslation.) Double check the technical terms, because autotranslation sometimes choose the wrong synonym.

Another posibility is to write the answer in English inside gmail, and gmail will highlight orthographical and gramar errors. So you can fix them.

Most people here will tolerate a few mistakes if the answer has your own personal style.

(Nice project, by the way.)

vo2maxer Apr 5, 2025

Yes, writing that is suspictious makes me angry.

3 More Comments →

vo2maxer Apr 5, 2025

Genuinely curious—could it be for the same reason you used a keyboard to write that comment? It’s efficient, it works. What’s the actual issue with using a tool that helps convey the intended message more clearly and quickly, as long as it reflects what he wanted to say?

cAtte_ Apr 6, 2025

why are you offended on behalf of this person? the hindsight that they're simply an English learner obviously makes me feel bad for asking the question and i completely understand the use case, but i don't think it was unreasonable to think that someone who speaks entirely in ChatGPT paragraphs might be a bot, spammer, or the like—particularly because, in a botnet fashion, the original reply was to a comment that also seemed to be LLM-authored

vo2maxer Apr 6, 2025

I wasn't offended at all. I was just genuinely curious, because I keep coming across this assumption that if any text is well-crafted, it must have come from an LLM. I think I understand why: we've grown so used to reading sloppy writing, everything from barely coherent text messages to articles in reputable publications filled with typos and awkward phrasing.

Personally, I've always held myself to a high standard in how I write, even in text messages. Some might see that as bordering on perfectionism, but for me, it's about respecting the principle behind communication: to be as clear and correct as possible.

Now that we have tools that help ensure that clarity, or at the very least, reduce distractions caused by grammar or spelling mistakes, of course I'm going to use them. I used to agonize over my comments on Twitter because you couldn't edit them after posting. I would first write them elsewhere and review them several times for any errors before finally posting. For context: I'm a retired 69-year-old physician, and even after witnessing decades of technological advancement, I'm still in awe of what this new technology can do.

Yes, I love beautiful, natural writing. I'm a voracious reader of the great classics. I regularly immerse myself in Shakespeare, Hardy, Eliot, Dickens, Dostoyevsky, Austen, Tolstoy, and many other literary masters. But I also fully embrace this tool that can elevate even the clumsiest writer's work to a clarity we've never had access to before. If that comes at the cost of a bit of stylistic uniformity, that's a reasonable trade-off. It's up to the user to shape the output, review it, and make sure their own voice and ideas shine through.

Back to your original point, I truly wasn't offended on his behalf. I was just curious. As it turns out, he was using an LLM, because his native language is Korean. Good for him. And just to be clear, I didn't intend to make your question seem inappropriate or to embarrass him in any way. If it came across that way, I apologize.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous