I’m trying to load an HTML document and save it as a Word DOCX file. The HTML may contain extra spaces between words that I’d like to preserve in the final DOCX document. However, consecutive spaces between words are collapsed into one.
I found an old thread that apparently mentions the same problem. I see a related feature was introduced in the 14.4 version. Is there a way to enable it?
To preserve extra spaces between words when loading an HTML document and saving it as a DOCX file, you can utilize the HtmlLoadOptions class provided by Aspose.Words. This class allows you to specify how the HTML content should be processed.
Here’s a code example demonstrating how to set up HtmlLoadOptions to maintain the formatting, including extra spaces:
import com.aspose.words.Document;
import com.aspose.words.HtmlLoadOptions;
import com.aspose.words.SaveFormat;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PreserveSpacesExample {
    public static void main(String[] args) throws Exception {
        // Sample HTML content with extra spaces
        final String HTML = "<html> <body> This is a test. </body> </html>";

        // Create HtmlLoadOptions
        HtmlLoadOptions loadOptions = new HtmlLoadOptions();
        // You can set additional options here if needed

        // Load the HTML document
        Document doc = new Document(new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8)), loadOptions);

        // Save the document as DOCX
        doc.save("PreservedSpaces.docx", SaveFormat.DOCX);
    }
}
In this example, HtmlLoadOptions is instantiated with its default settings and can be customized further if necessary. The HTML content is loaded into a Document object and then saved as a DOCX file. Note that the example sets no whitespace-specific option, so on its own it does not change how consecutive spaces are handled.
If you are using a version of Aspose.Words prior to 14.4, consider upgrading to take advantage of the features that support better handling of whitespace in HTML documents [1].
@michaln Could you please attach your input HTML along with your current and expected output documents? We will check the issue and provide more information.
<html><head></head><body><h1>Text with double spaces</h1><p>This text contains double spaces between words.</p></body></html>
My code:
var html = "<html><head></head><body><h1>Text with double spaces</h1><p>This text contains double spaces between words.</p></body></html>";
var options = new LoadOptions();
options.setLoadFormat(LoadFormat.HTML);
options.setEncoding(StandardCharsets.UTF_8);
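// Encode the string with the same charset that is declared in the load options
new Document(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), options)
    .save("html2docx.docx", SaveFormat.DOCX);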
The output docx document doesn’t contain double spaces between words.
@michaln The behavior is correct and expected: it matches MS Word and browser behavior, since HTML rendering collapses consecutive whitespace into a single space. Here are the output documents produced by Aspose.Words and MS Word.
Aspose.Words: out.docx (7.6 KB)
MS Word: ms.docx (12.3 KB)
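One possible workaround, not taken from this thread: non-breaking spaces are not collapsed by HTML whitespace processing, so the input can be preprocessed so that each run of extra spaces keeps one regular space and turns the rest into &nbsp; entities. The sketch below is plain Java with no Aspose.Words calls. The class name SpacePreserver and the method preserveSpaces are hypothetical, and the regex-based approach is an assumption that only suits simple markup: it does not understand comments, scripts, or pre blocks, and it also rewrites whitespace that sits between block elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpacePreserver {
    // Matches either a whole tag (group 1) or a run of two or more spaces (group 2).
    // Matching tags first keeps spaces inside attribute values untouched.
    private static final Pattern TAG_OR_SPACES = Pattern.compile("(<[^>]*>)|( {2,})");

    // Hypothetical helper: keeps the first space of each run and replaces
    // every following space with a non-breaking space entity.
    public static String preserveSpaces(String html) {
        Matcher m = TAG_OR_SPACES.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged
            out.append(html, last, m.start());
            if (m.group(1) != null) {
                // A tag: copy it verbatim
                out.append(m.group(1));
            } else {
                // A run of spaces: one regular space, then &nbsp; for the rest
                out.append(' ');
                for (int i = 1; i < m.group(2).length(); i++) {
                    out.append("&nbsp;");
                }
            }
            last = m.end();
        }
        // Copy whatever follows the last match
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p class=\"a  b\">This  text</p>";
        System.out.println(preserveSpaces(html));
    }
}
```

The returned string could then be passed to the Document constructor in the same way as the html variable in the code above. Keep in mind that the resulting document would contain non-breaking space characters rather than regular spaces, which affects line wrapping and text search.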