I’m trying to load an HTML document and save it as a Word docx file. The HTML may contain extra spaces between words I’d like to preserve in the final docx document. However, all spaces between words are merged into one.
I found an old thread that apparently mentions the same problem. I see a related feature was introduced in the 14.4 version. Is there a way to enable it?
To preserve extra spaces between words when loading an HTML document and saving it as a DOCX file, you can utilize the HtmlLoadOptions class provided by Aspose.Words. This class allows you to specify how the HTML content should be processed.
Here’s a code example demonstrating how to set up HtmlLoadOptions to maintain the formatting, including extra spaces:
import com.aspose.words.Document;
import com.aspose.words.HtmlLoadOptions;
import com.aspose.words.SaveFormat;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PreserveSpacesExample {
    public static void main(String[] args) throws Exception {
        // Sample HTML content with extra spaces
        final String HTML = "<html> <body> This is a test. </body> </html>";

        // Create HtmlLoadOptions
        HtmlLoadOptions loadOptions = new HtmlLoadOptions();
        // You can set additional options here if needed

        // Load the HTML document
        Document doc = new Document(new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8)), loadOptions);

        // Save the document as DOCX
        doc.save("PreservedSpaces.docx", SaveFormat.DOCX);
    }
}
In this example, the HtmlLoadOptions is instantiated, and you can customize it further if necessary. The HTML content is then loaded into a Document object, and finally, it is saved as a DOCX file. This approach should help in preserving the extra spaces as they appear in the original HTML.
If you are using a version of Aspose.Words prior to 14.4, consider upgrading to take advantage of the features that support better handling of whitespace in HTML documents [1].
@michaln Could you please attach your input HTML, your current and expected output documents? We will check the issue and provide you more information.
<html><head></head><body><h1>Text with double spaces</h1><p>This text contains double spaces between words.</p></body></html>
My code:
var html = "<html><head></head><body><h1>Text with double spaces</h1><p>This text contains double spaces between words.</p></body></html>";
var options = new LoadOptions();
options.setLoadFormat(LoadFormat.HTML);
options.setEncoding(StandardCharsets.UTF_8);
// Encode the string with the same charset that is declared in the load options
new Document(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), options)
        .save("html2docx.docx", SaveFormat.DOCX);
The output docx document doesn’t contain double spaces between words.
@michaln The behavior is correct and expected: it matches MS Word and browser behavior, which collapse consecutive spaces in HTML text into one. Here are the output documents produced by Aspose.Words and MS Word.
Aspose.Words: out.docx (7.6 KB)
MS Word: ms.docx (12.3 KB)
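Since consecutive spaces are collapsed by normal HTML whitespace handling, one possible workaround is to rewrite the HTML before loading it so that every second space in a run becomes a `&nbsp;` entity. The sketch below is not an Aspose.Words feature; `SpacePreserver` and `preserveSpaces` are hypothetical names introduced only for illustration. It assumes non-breaking spaces are acceptable in the resulting DOCX (they affect line wrapping) and that no `>` character appears inside attribute values, comments, or script blocks.

```java
final class SpacePreserver {

    // Hypothetical helper: replaces every second space in a run of spaces
    // with "&nbsp;" so that HTML whitespace collapsing cannot merge them.
    // Markup between '<' and '>' is copied through unchanged.
    static String preserveSpaces(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        boolean inTag = false;          // true while between '<' and '>'
        boolean prevPlainSpace = false; // true if the last emitted char was a literal space

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                out.append(c);
                if (c == '>') {
                    inTag = false;
                }
                prevPlainSpace = false;
            } else if (c == '<') {
                inTag = true;
                out.append(c);
                prevPlainSpace = false;
            } else if (c == ' ') {
                if (prevPlainSpace) {
                    // A second consecutive space: emit a non-breaking space instead
                    out.append("&nbsp;");
                    prevPlainSpace = false;
                } else {
                    out.append(' ');
                    prevPlainSpace = true;
                }
            } else {
                out.append(c);
                prevPlainSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>This  text  contains  double  spaces.</p>";
        System.out.println(preserveSpaces(html));
    }
}
```

The rewritten string could then be passed to the `Document` constructor in place of the original HTML, as in the code posted above. Whether the result matches the expected output still needs to be checked against the actual input document.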