I can confirm. When trying to convert simple Word sentences and tables from Word XML to, e.g., Markdown or HTML, you need a PhD in XML edge cases and nested garbage.
I wonder if this tool by MSFT is able to handle that:
https://github.com/microsoft/markitdown
I was amazed when I realised that Word docs were just zip files and you could poke around in the xml files embedded inside of them.
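For anyone who wants to poke around the same way, here is a minimal sketch using only Python's standard library. The `build_sample_docx` helper and its contents are hypothetical stand-ins so the snippet runs without a real file; I have not checked that Word would open the result. With a real document you would pass its bytes to `list_parts` and `read_part` instead.

```python
import io
import zipfile


def build_sample_docx() -> bytes:
    """Build a tiny docx-like package in memory (illustration only)."""
    parts = {
        "[Content_Types].xml": (
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
            '<Default Extension="xml" ContentType="application/xml"/>'
            "</Types>"
        ),
        "word/document.xml": (
            '<w:document xmlns:w="http://schemas.openxmlformats.org/'
            'wordprocessingml/2006/main">'
            "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>"
            "</w:document>"
        ),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of all files stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


def read_part(docx_bytes: bytes, name: str) -> str:
    """Return one embedded file decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.read(name).decode("utf-8")


sample = build_sample_docx()
print(list_parts(sample))
print(read_part(sample, "word/document.xml"))
```

In a real .docx the main body usually lives in `word/document.xml`, next to parts for styles, numbering, and relationships.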
I almost implemented a working React -> Word document renderer back in 2017, but it didn't support creating XML tags with a colon in the name (namespace-prefixed tags such as w:p, which OOXML documents use).
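The colon in those tag names is an XML namespace prefix. As a rough sketch of how such tags can be produced (in Python rather than React, and not what the renderer above did), `xml.etree.ElementTree` accepts names in `{namespace}tag` form and writes the prefix on output. The `w` and `make_paragraph` helpers are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag: str) -> str:
    """Return the fully qualified name ElementTree expects, e.g. {ns}p."""
    return f"{{{W_NS}}}{tag}"


def make_paragraph(text: str, bold: bool = False) -> ET.Element:
    """Build a w:p element containing a single run with optional bold."""
    p = ET.Element(w("p"))
    r = ET.SubElement(p, w("r"))
    if bold:
        rpr = ET.SubElement(r, w("rPr"))
        ET.SubElement(rpr, w("b"))
    t = ET.SubElement(r, w("t"))
    t.text = text
    return p


xml = ET.tostring(make_paragraph("Hello", bold=True), encoding="unicode")
print(xml)
```

The serialized output uses `w:p`, `w:r`, and `w:t`, with the namespace declared on the root element.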
Even though markitdown is a Microsoft project, it's just a thin wrapper around a bunch of 3rd party Python packages. For example, to go from docx to Markdown, it uses mammoth to convert docx to HTML[0], then uses markdownify to convert the HTML into Markdown[1].
[0] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...
[1] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...
Technically, they're a bit more than just zip files (they're OPC containers [0]), but if you're hand editing the file content it doesn't really matter.
[0] Open Packaging Conventions: https://en.wikipedia.org/wiki/Open_Packaging_Conventions
Well, it is not pretty to see how the sausage gets made, but extracting formatted text from docx is absolutely doable, no PhD involved. Source: I did it as a little side quest because it was useful for auditing a set of Word documents.
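To give a flavour of what that can look like, here is a minimal sketch, not the code from that side quest. It walks paragraphs (`w:p`) and runs (`w:r`) in the XML of `word/document.xml` and marks bold and italic runs Markdown-style. It assumes formatting is set directly on the run; it ignores styles, tables, lists, and hyperlinks. `SAMPLE` and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical stand-in for the contents of word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t xml:space="preserve">Plain </w:t></w:r>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def _is_on(rpr, tag: str) -> bool:
    """True if a toggle such as w:b is present and not switched off."""
    el = rpr.find(W + tag) if rpr is not None else None
    if el is None:
        return False
    return el.get(W + "val", "true") not in ("0", "false", "off")


def runs_to_markdown(document_xml: str) -> list[str]:
    """Return one Markdown-ish string per paragraph."""
    root = ET.fromstring(document_xml)
    lines = []
    for p in root.iter(W + "p"):
        parts = []
        for r in p.iter(W + "r"):
            text = "".join(t.text or "" for t in r.iter(W + "t"))
            if not text:
                continue
            rpr = r.find(W + "rPr")
            if _is_on(rpr, "b"):
                text = f"**{text}**"
            if _is_on(rpr, "i"):
                text = f"*{text}*"
            parts.append(text)
        lines.append("".join(parts))
    return lines


print(runs_to_markdown(SAMPLE))  # ['Plain **bold**']
```

The edge cases mentioned upthread (formatting inherited from styles, nested tables, tracked changes) are exactly what this leaves out.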