Preserving extra spaces between words when loading an HTML document

Hi,

I’m trying to load an HTML document and save it as a Word docx file. The HTML may contain extra spaces between words that I’d like to preserve in the final docx document. However, consecutive spaces between words are merged into one.

I found an old thread that apparently mentions the same problem. I see a related feature was introduced in the 14.4 version. Is there a way to enable it?

Best,
Michal

@michaln

To preserve extra spaces between words when loading an HTML document and saving it as a DOCX file, you can utilize the HtmlLoadOptions class provided by Aspose.Words. This class allows you to specify how the HTML content should be processed.

Here’s a code example demonstrating how to set up HtmlLoadOptions to maintain the formatting, including extra spaces:

import com.aspose.words.Document;
import com.aspose.words.HtmlLoadOptions;
import com.aspose.words.SaveFormat;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PreserveSpacesExample {
    public static void main(String[] args) throws Exception {
        // Sample HTML content with extra spaces
        final String HTML = "<html>   <body>   This   is   a   test.   </body>   </html>";

        // Create HtmlLoadOptions
        HtmlLoadOptions loadOptions = new HtmlLoadOptions();
        // You can set additional options here if needed

        // Load the HTML document
        Document doc = new Document(new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8)), loadOptions);

        // Save the document as DOCX
        doc.save("PreservedSpaces.docx", SaveFormat.DOCX);
    }
}

In this example, HtmlLoadOptions is instantiated with its default settings, and you can customize it further if necessary. The HTML content is then loaded into a Document object and saved as a DOCX file. Note that consecutive spaces in ordinary HTML text are normally collapsed into one, as in a browser, so the default options alone may not keep the extra spaces.

If you are using a version of Aspose.Words prior to 14.4, consider upgrading to take advantage of the features that support better handling of whitespace in HTML documents [1].

Sources:
[1]: Aspose.Words for Java 14.4 Release Notes

@michaln Could you please attach your input HTML along with your current and expected output documents? We will check the issue and provide you with more information.

Hi @alexey.noskov,

I’m using the following HTML input:

<html><head></head><body><h1>Text  with  double  spaces</h1><p>This  text  contains  double  spaces  between  words.</p></body></html>

My code:

var html = "<html><head></head><body><h1>Text  with  double  spaces</h1><p>This  text  contains  double  spaces  between  words.</p></body></html>";

var options = new LoadOptions();
options.setLoadFormat(LoadFormat.HTML);
options.setEncoding(StandardCharsets.UTF_8);

new Document(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), options)
  .save("html2docx.docx", SaveFormat.DOCX);

The output docx document doesn’t contain double spaces between words.

@michaln The behavior is correct and expected: it matches MS Word and browser behavior. Here are the output documents produced by Aspose.Words and MS Word.
Aspose.Words: out.docx (7.6 KB)
MS Word: ms.docx (12.3 KB)
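
For readers who still need the extra spaces in the output, one possible workaround (not suggested in this thread, so treat it as an assumption) is to preprocess the HTML string before loading it, turning every space that follows another space into a non-breaking space entity, which HTML does not collapse. The sketch below uses only the Java standard library; the class and method names are hypothetical, and the scan is deliberately naive: it skips anything between '<' and '>' and does not understand comments, script or style content, or attribute values that contain '>'.

final class SpacePreserver {
    // Hypothetical helper: in text outside tags, keeps the first space of a run
    // and replaces each following space with the &nbsp; entity.
    static String preserveSpaces(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;      // true while between '<' and '>'
        boolean prevSpace = false;  // true if the previous text character was a space
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            if (!inTag && c == ' ' && prevSpace) {
                out.append("&nbsp;");
            } else {
                out.append(c);
            }
            prevSpace = !inTag && c == ' ';
            if (c == '>') {
                inTag = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <p>This &nbsp;text</p>
        System.out.println(preserveSpaces("<p>This  text</p>"));
    }
}

The returned string can then be passed to the Document constructor in the same way as the original HTML in the code above. Keep in mind that non-breaking spaces also prevent line breaks at those positions, so the result is not identical to ordinary spaces in Word.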


Thank you for the confirmation @alexey.noskov!
