Comment by cahaya - Hacker Neue

cahaya Jul 18, 2025 parent

I can confirm. When trying convert simple Word sentences and tables to e.g. Markdown/HTML from a Word XML you need a PhD in XML edge cases and nested garbage.

paulbjensen Jul 18, 2025

I wonder if this tool by MSFT is able to handle that:

https://github.com/microsoft/markitdown

I was amazed when I realised that Word docs were just zip files and you could poke around in the xml files embedded inside of them.

I almost implemented a working React -> Word document renderer back in 2017, but it didn't have support for creating the xml tags with : inside of them (which OOXML documents use).

favorited Jul 18, 2025

Even though markitdown is a Microsoft project, it's just a thin wrapper around a bunch of 3rd party Python packages. For example, to go from docx to Markdown, it uses mammoth to convert docx to HTML[0], then uses markdownify to convert the HTML into Markdown[1].

[0]https://github.com/microsoft/markitdown/blob/da7bcea527ed04c... [1]https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...

strongpigeon Jul 18, 2025

Technically, they're a bit more than just zip files (they're OPC containers [0]), but if you're hand editing the file content it doesn't really matter.

[0] Open Package Convention: https://en.wikipedia.org/wiki/Open_Packaging_Conventions

superjan Jul 18, 2025

Well, it is not pretty to see how the sausage gets made, but extracting formatted text from docx is absolutely doable, no PhD involved. Source: I have done it as a little sidequest because it was useful to audit a set of word documents.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous