Hello,
We (Docstoc Inc, Los Angeles) are using Aspose Products to convert documents to different formats such as Docx, Html, Pdf etc.
We have a flow where our internal system generates a valid MS Word document (.docx format, not using Aspose), and we convert this document to HTML using Aspose Words .NET 14.7.0. We then let people customize this HTML document in a limited WYSIWYG editor and convert the HTML back to MS Word.
The flow is:
1. Docx to Html
2. Html to Docx
We are getting an “Aspose.Words.FileCorruptedException - The document appears to be corrupted and cannot be loaded” exception when trying to convert the HTML (generated by Aspose Words .NET) back to Docx format.
I created a sample console app to find out what was breaking the HTML to DOCX conversion and found that the HTML file had some elements with inline styles such as -aw-headerfooter-type: header-primary; -aw-different-first-page: true; etc. Here is an example:
I noticed that these styles were not defined anywhere in the Html file, so I stripped out every inline style declaration starting with ‘-aw-’ using a regular expression, and the Html file then converted to Docx properly. I assume these styles are used to preserve formatting information in the HTML somehow, but they were causing the FileCorruptedException.
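To make the workaround concrete, here is a minimal sketch of the stripping step. It is written in Python for brevity and is not the attached .NET code; the function name `strip_aw_styles` and both patterns are illustrative assumptions, and the same regular expressions should carry over to .NET's Regex class with little change. It only touches inline `style` attributes and drops the attribute entirely when nothing is left in it.

```python
import re

# One CSS declaration whose property name starts with "-aw-",
# including its terminating semicolon if present.
# The lookbehind avoids matching "-aw-" in the middle of another property name.
AW_DECLARATION = re.compile(r"\s*(?<![\w-])-aw-[\w-]+\s*:[^;]*;?", re.IGNORECASE)

# An inline style attribute, quoted with either double or single quotes.
STYLE_ATTR = re.compile(r"""\sstyle\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def strip_aw_styles(html: str) -> str:
    """Remove every '-aw-*' declaration from inline style attributes."""

    def clean(match: re.Match) -> str:
        quote, css = match.group(1), match.group(2)
        remaining = AW_DECLARATION.sub("", css).strip()
        if not remaining:
            # Nothing left: drop the whole style attribute.
            return ""
        return f" style={quote}{remaining}{quote}"

    return STYLE_ATTR.sub(clean, html)


if __name__ == "__main__":
    sample = (
        '<div style="-aw-headerfooter-type: header-primary; '
        '-aw-different-first-page: true;">Header</div>'
    )
    print(strip_aw_styles(sample))
```

This is a regex-based sketch, so it assumes reasonably well-formed attributes; an HTML parser would be more robust if the editor output is messy.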
I am attaching the sample html and the code so you guys can also verify the issue. We are using .NET 4.5, and the library version is 14.7.0, which is the latest at the moment.
Thanks,
Cihan