Comment by NoMoreNicksLeft

NoMoreNicksLeft 3 days ago

A proper epub consists of multiple HTML and CSS files, not to mention the correct font files (TTF or OTF). OCR can't recover those. I've found other books like that, where someone didn't even bother to remove the page numbers from the OCRed text, and it's just a subpar reading experience.
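
As a rough illustration of the page-number cleanup mentioned above, here is a minimal Python sketch. It assumes the stray page numbers sit on lines of their own (for example "12", "- 13 -", or "Page 14"); the function name `strip_page_numbers` and the pattern are hypothetical, not taken from any particular tool.

```python
import re

# Assumption: a leftover page number is a line holding only 1-4 digits,
# optionally prefixed with "Page" or wrapped in hyphens such as "- 13 -".
PAGE_NUMBER_LINE = re.compile(r"^\s*(?:page\s+)?[-\s]*\d{1,4}[-\s]*$", re.IGNORECASE)


def strip_page_numbers(text: str) -> str:
    """Drop lines that look like bare page numbers from OCRed text."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "He opened the door.\n12\nThe room was empty.\n- 13 -\nPage 14\nIn 1984 he left."
    print(strip_page_numbers(sample))
```

A heuristic like this would also delete a legitimate line that happens to contain only a number (a year on its own line, say), so the result still needs a human pass.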

UltraSane 3 days ago

For novels where you just want the text, it is fine, and you can reformat the text as you like.

NoMoreNicksLeft OP 3 days ago

I can always retypeset the text... but I'm not a professional editor/typesetter. You often lose the parts/phrases/words the author wanted emphasized with italics and bold. Blockquotes can be gone. In the worst offenders, even the paragraph indentation is lost. I couldn't recreate that if I tried. Lord forbid there's a list/table/figure (even some of the fiction I've read has those... weirdo science fiction novels, after all). I've gotten pretty good at fixing epubs with Calibre's editor, and some of these are salvageable. I just finished one where they split the chapters wrong (not at the chapter headings, but somewhere in between). And I'll often go get a high-res cover image off the publisher's website; they like to use bad, sized-for-favicon scans from a random Google image search for some reason.

But for me, the bad OCR ebooks can be painful to read.
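
For the wrongly split chapters described above, one possible repair is to concatenate the body markup and re-split it at the headings. The Python sketch below assumes every chapter opens with an `<h1>` or `<h2>` element; `split_at_headings` is a hypothetical helper and not part of Calibre. It needs Python 3.9 or later.

```python
import re

# Assumption: each chapter begins with an <h1> or <h2> element.
# The lookahead matches the empty position just before the tag,
# so the heading stays attached to the chapter that follows it.
HEADING_START = re.compile(r"(?=<h[12][\s>])", re.IGNORECASE)


def split_at_headings(body_html: str) -> list[str]:
    """Split concatenated chapter markup into one chunk per heading."""
    parts = HEADING_START.split(body_html)
    # Discard empty chunks, e.g. when the markup starts with a heading.
    return [part for part in parts if part.strip()]


if __name__ == "__main__":
    merged = '<p>Front matter</p><h2>One</h2><p>a</p><h2 class="x">Two</h2><p>b</p>'
    for chunk in split_at_headings(merged):
        print(chunk)
```

Regex splitting is fragile on real-world markup; a proper XHTML parser would be safer for anything beyond a quick fix.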

UltraSane 3 days ago

The best OCR tool I have ever used is Editable Text and Images in Adobe Acrobat. It can replace text in place with a dynamically generated vector font that matches the original font. It is really impressive.
