Consecutive White Spaces Collapsed when converting from HTML to Word Document

Hey,


Background and Use case:
I am using version 13.5.0 of the Aspose.Words for Java API. My goal is to convert an HTML file to a Word document and maintain consecutive white space characters (i.e. spaces, paragraphs). It seems that consecutive line breaks (i.e. <br>) and non-breaking space entities (&nbsp;) are preserved and encoded appropriately in the exported Word document.

Consistent with the HTML specification, it appears that consecutive spaces (i.e. encoded as 0x20) and empty paragraphs are being collapsed to a single white space character when exported to a Word document. I am wondering whether it is possible to control whether these white space characters, especially spaces, are preserved in the exported document.

Thank you for any help!

Hi Chase,


Thanks for your interest in Aspose.Words.

Aspose.Words mimics the behaviour of Microsoft Word. For you, this means that if you convert your input HTML file into a Word document using Aspose.Words, the output will appear exactly as if the conversion had been done by Microsoft Word. However, could you please attach your input HTML file and the output Word document showing the undesired behaviour here for testing? I will investigate the issue on my side and provide you with more information.

Best regards,

Thank you!


I have attached a zip file that contains:
  • The Input HTML File
  • The actual word document that is generated
  • A word document that demonstrates the expectation


Hi Chase,


Thanks for the additional information. This is the expected behaviour: as shown in the attached screenshot, even web browsers don’t render consecutive white space characters and empty <p> tags the way you’re expecting. If we can help you with anything else, please feel free to ask.

Best regards,
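
For anyone who runs into the same behaviour: since the original post reports that non-breaking space entities do survive the conversion, one possible workaround is to pre-process the HTML before handing it to the converter, turning every extra space in a run into &nbsp; and putting a &nbsp; inside empty paragraphs. The sketch below is only an illustration under that assumption. The class and method names (WhitespacePreserver, preserveSpaces, fillEmptyParagraphs) are hypothetical and are not part of Aspose.Words, and the scan is deliberately naive: it does not special-case <pre>, <script>, comments, or attribute values that contain '>'.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.Words: rewrites HTML so that
// white space an HTML consumer would normally collapse is kept as &nbsp;.
final class WhitespacePreserver {

    // Matches <p> or <p ...attributes...> followed only by white space and </p>.
    private static final Pattern EMPTY_PARAGRAPH =
            Pattern.compile("<p((?:\\s[^>]*)?)>\\s*</p>", Pattern.CASE_INSENSITIVE);

    // Puts a non-breaking space inside empty paragraphs so they are not dropped.
    static String fillEmptyParagraphs(String html) {
        return EMPTY_PARAGRAPH.matcher(html).replaceAll("<p$1>&nbsp;</p>");
    }

    // Keeps the first space of each run and replaces the following ones
    // with &nbsp;. Spaces inside tags (between '<' and '>') are left alone.
    static String preserveSpaces(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        boolean previousWasSpace = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            }
            if (!inTag && c == ' ' && previousWasSpace) {
                out.append("&nbsp;");
            } else {
                out.append(c);
            }
            previousWasSpace = !inTag && c == ' ';
        }
        return out.toString();
    }

    // Convenience wrapper: empty paragraphs first, then runs of spaces.
    static String preprocess(String html) {
        return preserveSpaces(fillEmptyParagraphs(html));
    }
}
```

The string returned by preprocess(...) would then go through whatever loading and saving code you already use for the conversion. Whether the result matches the expected document depends on how the converter treats &nbsp;, so it is worth checking against the sample files first.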