Remove Non-Printable Invisible Characters of Paragraphs with 0 Line Height Spacing during Word DOCX to HTML Conversion using Java

Hi,

Greetings.

We use Aspose.Words for Java to convert DOCX files to HTML.

When we convert the attached DOCX file to HTML, the converted HTML file contains some junk characters that were not there in the original file.

Please let us know how we can resolve or mitigate this issue.

Please use the attached zip file, which has the original docx file and the converted html file.
Error File.zip (17.5 KB)

@EdwinPearson,

We have logged this problem in our issue tracking system with ID WORDSNET-21853. We will look further into the details of this problem and keep you updated on the status of the fix. We apologize for the inconvenience.

@EdwinPearson,

Regarding WORDSNET-21853, we have completed the analysis of this issue and decided to close it with “Not a Bug” status. Please see the analysis details below:

The “junk characters” are in fact stored in the source Word document in invisible paragraphs. Those paragraphs are not rendered in MS Word, because their line spacing is set to zero. In HTML, however, line spacing (“line-height”) cannot be zero, and the paragraphs become visible. The same effect is observed in HTML documents generated by MS Word. We are going to close this issue as “Not a Bug” not only because Aspose.Words copies MS Word’s behavior in this case, but also because zero line spacing is an uncommon corner case. For example, MS Word’s user interface doesn’t allow setting line spacing to zero.

As a workaround, you can remove paragraphs with zero line spacing before saving the Word DOCX document to HTML:

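import com.aspose.words.Document;
import com.aspose.words.Paragraph;

Document doc = new Document("C:\\Temp\\Error File\\Error File.docx");

// Iterate over a snapshot (toArray) so that removing nodes does not disturb the loop.
for (Paragraph paragraph : doc.getFirstSection().getBody().getParagraphs().toArray())
    if (paragraph.getParagraphFormat().getLineSpacing() == 0)
        paragraph.remove(); // drop the invisible zero-line-spacing paragraph

// The .html extension tells Aspose.Words to save in HTML format.
doc.save("C:\\Temp\\Error File\\awjava-21.2 workaround.html");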
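
One thing to watch with this kind of workaround is that it removes items from a collection while walking it, which can skip an item when two hidden paragraphs sit next to each other. The sketch below illustrates the filtering logic in plain Java, without Aspose.Words. The `ZeroSpacingFilter` class, the `Para` type and the `removeZeroSpacing` method are hypothetical names made up for this illustration and are not part of the Aspose.Words API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ZeroSpacingFilter {

    // Hypothetical stand-in for a document paragraph: its text plus line spacing in points.
    static final class Para {
        final String text;
        final double lineSpacing;

        Para(String text, double lineSpacing) {
            this.text = text;
            this.lineSpacing = lineSpacing;
        }
    }

    // Removes every paragraph whose line spacing is exactly zero and returns
    // how many were removed. Iterator.remove() is used so that the list can be
    // modified safely while it is being traversed.
    static int removeZeroSpacing(List<Para> paragraphs) {
        int removed = 0;
        for (Iterator<Para> it = paragraphs.iterator(); it.hasNext(); ) {
            if (it.next().lineSpacing == 0) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Para> body = new ArrayList<>();
        body.add(new Para("Visible heading", 12.0));
        body.add(new Para("hidden junk 1", 0.0));
        body.add(new Para("hidden junk 2", 0.0)); // adjacent hidden paragraphs
        body.add(new Para("Visible text", 13.8));

        int removed = removeZeroSpacing(body);
        System.out.println(removed + " removed, " + body.size() + " kept");
    }
}
```

The same idea carries over to a real document: collect or snapshot the paragraphs first, then remove the ones with zero line spacing, so that adjacent hidden paragraphs are all handled.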