Aspose.Words for Java fails to preserve HTML formatting when converting HTML to PDF

Here’s the sample code I use to convert HTML to PDF with Aspose.Words for Java:

Document doc = new Document("xx.html");
doc.save("xxxx.pdf");

After conversion, the PDF file does not keep all of the styles of the HTML file. Examples are shown below.
1 The background color of a div becomes discontinuous.
The div background in the .html file:

in .pdf file

2 Table format
in .html file

in .pdf file

Am I missing any configurations?

@suhjt First of all, please note that Aspose.Words is designed to work with MS Word documents. The object models of HTML documents and MS Word documents are quite different, so it is not always possible to provide 100% fidelity when converting from one format to the other.

  1. There is no direct analog of DIV elements in MS Word documents, so DIVs are converted to paragraphs in the Aspose.Words DOM. This might cause the difference.
    You can try setting the HtmlLoadOptions.BlockImportMode property to BlockImportMode.PRESERVE. This might help to resolve the issue.
    Also, please attach your problematic HTML file here; it is impossible to analyze the problem using screenshots.

  2. Could you please attach your HTML and output PDF documents here for our reference? We will check the issue and provide you with more information.

@alexey.noskov Thanks for your answer.

  1. Seems like HtmlLoadOptions.BlockImportMode is a property of Aspose.Words for .NET.
  2. I’ve simplified the HTML file, but it gets more weird. I’ve attached the HTML file and PDF file here.
    test.zip (61.0 KB)

@suhjt

  1. The option is available in both the .NET and Java versions of Aspose.Words, as well as in the C++ and Python versions. It is available starting from version 22.4 of Aspose.Words.
HtmlLoadOptions opt = new HtmlLoadOptions();
opt.setBlockImportMode(BlockImportMode.PRESERVE);
Document doc = new Document("xx.html", opt);
  2. We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

    Issue ID(s): WORDSNET-25157
    

    You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

The problem can be partially worked around by fitting the tables to the window:

Document doc = new Document("C:\\Temp\\in.html");

Iterable<Table> tables = doc.getChildNodes(NodeType.TABLE, true);
for (Table t : tables)
{
    t.autoFit(AutoFitBehavior.AUTO_FIT_TO_WINDOW);
}

doc.save("C:\\Temp\\out.pdf");

But the layout is still not perfect: out.pdf (42.0 KB)

@alexey.noskov
I was using version 21.12, which is why I didn’t find this setting.
I’ll give it a try and see if it works or if there is any workaround.
Thanks a lot.

@suhjt We have completed the analysis of the issue. Aspose.Words ignores the <style> element of the HTML document because its “type” attribute value contains an extra double quote character: <style type=""text/css">. If you remove that character, Aspose.Words will import the document correctly. Since browsers and MS Word are able to import the source document correctly, we should fix this bug too.
The issue is scheduled for development in 23.5 (May 2023) version of Aspose.Words. We will be sure to let you know once it is resolved.
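
Until the fix is released, the extra quote can be stripped from the HTML before it is loaded. The following is only a minimal sketch of that idea: StyleTypeFixer and fixStyleType are hypothetical names, not part of Aspose.Words, and the regular expression assumes the defect looks exactly like the <style type=""text/css"> case described above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not an Aspose.Words class): repairs a doubled quote in a style tag's type attribute. */
final class StyleTypeFixer {
    // Group 1: the opening of a <style> tag up to and including "type=".
    // Then two consecutive double quotes (the defect).
    // Group 2: the attribute value followed by its closing quote, e.g. text/css".
    // The value must be non-empty and free of spaces and '=' so that a
    // legitimately empty attribute such as type="" is left untouched.
    private static final Pattern BROKEN_TYPE = Pattern.compile(
            "(<style\\b[^>]*?\\btype\\s*=\\s*)\"\"([^\"<>\\s=]+\")",
            Pattern.CASE_INSENSITIVE);

    private StyleTypeFixer() {
    }

    /** Returns the HTML with type=""value" rewritten to type="value" inside style tags. */
    static String fixStyleType(String html) {
        return BROKEN_TYPE.matcher(html).replaceAll("$1\"$2");
    }
}
```

The repaired string could then be handed to Aspose.Words, for example by wrapping its bytes in a ByteArrayInputStream and passing that stream to the Document constructor instead of the file name.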

The issues you found earlier (filed as WORDSNET-25157) have been fixed in the Aspose.Words for Java 23.5 update.