Set Base Uri to correctly load Images in HTML file and convert to Word or PDF using Java | HtmlLoadOptions

Hi Aspose team,

I use the following example code to convert HTML to PDF, and it works on my PC:

HtmlLoadOptions options = new HtmlLoadOptions();
Document doc = new com.aspose.words.Document("C:\\Users\\jing.luo\\test.html", options);
doc.save("C:\\Users\\jing.luo\\test.pdf");

However, when I convert the HTML string to a byte stream and load that into the Document, the image fails to load.

InputStream inputStream = new ByteArrayInputStream(htmlStr.getBytes(StandardCharsets.UTF_8));
System.out.println(htmlStr);
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
HtmlLoadOptions options = new HtmlLoadOptions();
Document doc = new Document(inputStream, options);
doc.save(baos, SaveFormat.PDF);

I printed this HTML string, saved it as a new HTML file, and put it in my resource folder. When I open it in a web browser, the image is displayed correctly.
Capture_Browser.PNG (89.8 KB)

I have also attached the HTML file for your reference.
test.zip (2.8 KB)

All the picture resources are in the same folder as the HTML file.

Please advise.

Thanks

@jing.luo,

For relative links, you must provide a correct base URI, either in the HTML document via the <base> element:

<html>
    <head>
        <base href="https://www.example.com">
    </head>
    <body>
        <p><img class="shrink-logo" src="/exampleLogo.png"></p> 
    </body>
</html> 

Or in code via HtmlLoadOptions.BaseUri (in this case the <base> element is not needed):

HtmlLoadOptions options = new HtmlLoadOptions();
options.setBaseUri("https://www.example.com");
Document doc = new Document("in.html", options);

Hope, this helps.

Thanks for your reply, but in my case all images come from our resource package folder (src/main/resources/pdf/) rather than from an external website.

So how should I set the base URI?

@jing.luo,

Unfortunately, we cannot even see any image in test.html when viewing it with web browsers on our end. This is because the resources folder is missing on our end. Can you please provide your resources folder (containing the image) so that we will be able to view correct output in web browser? Please ZIP the whole folder while preserving the directory structure (hierarchy) and share it here for further testing.

@awais.hafeez
Here is the ZIP folder (keeping the same structure):
src.zip (103.1 KB)

The file structure is this:
Capture.PNG (66.0 KB)

We use a Velocity template to generate htmlStr.
Here is the htmlStr:
test.zip (2.8 KB)

The code to generate the pdf is this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
InputStream inputStream = new ByteArrayInputStream(htmlStr.getBytes(StandardCharsets.UTF_8));
HtmlLoadOptions options = new HtmlLoadOptions();
options.setBaseUri("/src/main/resources/pdf");
Document doc = new Document(inputStream, options);
doc.save(baos, SaveFormat.PDF);

However, the image still does not show.
Please advise.
Thank you!

@jing.luo,

Please provide the complete path, like this:

HtmlLoadOptions options = new HtmlLoadOptions();
options.setBaseUri("E:\\Temp\\TEST\\resources\\pdf\\");
Document doc = new Document("E:\\Temp\\TEST\\test.html", options);
doc.save("E:\\Temp\\TEST\\20.4.pdf"); 

Hope, this helps.
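
If the image folder is only known relative to the project root, one way to obtain a complete path is to resolve it at run time. The following is a minimal sketch using only the JDK; the class and method names (BaseUriSketch, absoluteBase) are hypothetical and not part of Aspose.Words, and it assumes the images exist as ordinary files on disk rather than inside a packaged JAR.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Aspose.Words.
class BaseUriSketch {
    // Resolves a project-relative folder against a root directory and returns
    // an absolute path string that ends with the platform separator.
    static String absoluteBase(String projectRoot, String relativeFolder) {
        Path base = Paths.get(projectRoot).resolve(relativeFolder).toAbsolutePath().normalize();
        String text = base.toString();
        String separator = base.getFileSystem().getSeparator();
        return text.endsWith(separator) ? text : text + separator;
    }

    public static void main(String[] args) {
        // "user.dir" is the working directory the JVM was started from
        // (assumed here to be the project root).
        String base = absoluteBase(System.getProperty("user.dir"), "src/main/resources/pdf");
        System.out.println(base);
        // The resulting string is what would be passed to options.setBaseUri(...).
    }
}
```

The trailing separator mirrors the complete path used in the reply above.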

@awais.hafeez, thank you for your reply

I set the base URI, but the styling is off. Could you check why this happens?
Capture.PNG (130.2 KB)

Here is the pdf
JLuo RFI Test 4-23_03-Sep-19-Pemrission_05-07-2020 (1).pdf (69.7 KB)

@jing.luo,

We are working on your query and will get back to you soon.

@jing.luo,

As an alternative to the BaseUri property, you can fix the relative paths of image URLs as follows:

private static class HandleResourceLoadingCallback implements IResourceLoadingCallback {
    public int resourceLoading(ResourceLoadingArgs args) {
        if (args.getResourceType() == ResourceType.IMAGE) {
            // Keep only the file name from the original image URI.
            String[] splits = args.getOriginalUri().split("/");
            String imageName = splits[splits.length - 1];

            // Point the loader at the local folder that holds the images.
            args.setUri("E:\\Temp\\212204\\set 2\\src\\src\\main\\resources\\pdf\\" + imageName);
        }

        // Let Aspose.Words load the resource from the (possibly updated) URI.
        return ResourceLoadingAction.DEFAULT;
    }
}

HtmlLoadOptions options = new HtmlLoadOptions();
options.setResourceLoadingCallback(new HandleResourceLoadingCallback());

Document doc = new Document("E:\\Temp\\212204\\set 2\\test\\test.html", options);
doc.save("E:\\Temp\\212204\\set 2\\test\\20.5.pdf");
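
The path handling inside the callback can be tried on its own, without Aspose.Words. The sketch below uses only the JDK; the class and method names (ImageUriSketch, fileName, rebase) are hypothetical. It uses lastIndexOf instead of split, which gives the same last segment for ordinary URIs but does not throw for an input such as "/".

```java
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Aspose.Words.
class ImageUriSketch {
    // Returns everything after the last '/' in the URI,
    // or the whole string when it contains no '/'.
    static String fileName(String originalUri) {
        return originalUri.substring(originalUri.lastIndexOf('/') + 1);
    }

    // Builds a local path by joining the image folder and the file name.
    static String rebase(String imageFolder, String originalUri) {
        return Paths.get(imageFolder, fileName(originalUri)).toString();
    }

    public static void main(String[] args) {
        // The folder below is a placeholder, not a path from the thread.
        System.out.println(rebase("resources/pdf", "../images/logo.png"));
    }
}
```

In the callback, the result of rebase would take the place of the concatenated string passed to args.setUri.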

We tested this scenario and have managed to reproduce the same problem on our end. For the sake of correction, we have logged this problem in our issue tracking system. The ID of this issue is WORDSNET-20534. We will further look into the details of this problem and will keep you updated on the status of correction. We apologize for your inconvenience.

@jing.luo,

We have completed the analysis of WORDSNET-20534 and the root cause has been identified. As a workaround you can change “vertical-align: baseline” to “vertical-align: top”. Please let us know if it is acceptable for you to use “vertical-align:top” instead of “vertical-align:baseline” on table cells in the HTML document? Thanks for your cooperation.
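
When the HTML is produced as a string (as with the Velocity template mentioned earlier), the workaround could be applied before loading. This is a minimal sketch using only the JDK; the class and method names (VerticalAlignSketch, baselineToTop) are hypothetical, and it assumes it is acceptable to replace every occurrence in the document, not only those on table cells.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Aspose.Words.
class VerticalAlignSketch {
    // Matches "vertical-align: baseline" with optional whitespace around the colon.
    private static final Pattern BASELINE =
            Pattern.compile("vertical-align\\s*:\\s*baseline", Pattern.CASE_INSENSITIVE);

    // Rewrites every matching declaration to "vertical-align: top".
    static String baselineToTop(String html) {
        return BASELINE.matcher(html).replaceAll("vertical-align: top");
    }

    public static void main(String[] args) {
        String html = "<td style=\"vertical-align:baseline\">x</td>";
        System.out.println(baselineToTop(html));
    }
}
```

The returned string would then be converted to bytes and loaded in the same way as htmlStr.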

Hi awais,

I have tried "vertical-align: top" as you requested, but the styling is still not acceptable. We found that the root cause is a default border that has been inserted into the page.

Do you know how to remove the auto border?

I have attached the screenshot and the pdf / html for your reference.
Capture.PNG (232.0 KB)
test.pdf (93.9 KB)
test.zip (4.5 KB)

@jing.luo,

You are right; we also observe almost 0 margins on the Left and Right sides when viewing your “test.html” with web browsers on our end. But, when converting this HTML to PDF format by using your Java code, we see Aspose.Words adds default margins of 1 inch on the Left and 1 inch on the Right sides. For the sake of any correction, we have logged the following issue in our bug tracking system.

WORDSNET-20833: Set Proper Left Right Margins during HTML to PDF Conversion

We will further look into the details of this problem and will keep you updated on the status of the linked issue. We apologize for your inconvenience.

@awais.hafeez,

Thank you for your response. Is there any way to remove this margin during PDF generation through some configuration or setting in Aspose.Words?
I saw some attributes in the page setup, such as leftborder, and an autoSpace attribute in the ParagraphFormat settings. There seem to be options for changing the layout, but I don't know how to configure them to make the output look right.

@jing.luo,

For this particular scenario, please try the following Java code:

InputStream inputStream = new ByteArrayInputStream(Files.readAllBytes(Paths.get("E:\\Temp\\TEST\\test.html")));
HtmlLoadOptions options = new HtmlLoadOptions();
options.setEncoding(StandardCharsets.UTF_8);
options.setBaseUri("https://api.kyc.com/o360-api/cc-api/O360commonext");
com.aspose.words.Document doc = new com.aspose.words.Document(inputStream, options);
for (Section sec : doc.getSections()) {
    sec.getPageSetup().setLeftMargin(0);
    sec.getPageSetup().setRightMargin(0);
}
doc.updateTableLayout();
doc.save("E:\\Temp\\TEST\\awjava-20.7.pdf", SaveFormat.PDF);

Thanks @awais.hafeez, that actually works! We now have only one remaining issue.

You can see that in our original HTML there is space between “CONTENT DETAILS” and “COUNTERPARTY DETAILS”; however, in the final PDF the space is missing. Do you know how to restore this space?

I have attached screenshots for your reference.
HTML_Capture.PNG (133.9 KB)
PDF_Capture.PNG (199.1 KB)

@jing.luo,

To address this problem, we have logged the following issue in our bug tracking system.

WORDSNET-20835: Preserve spacing between two Tables during HTML to PDF conversion

We will further look into the details of this problem and will keep you updated on the status of the linked issues. We apologize for your inconvenience.

@jing.luo,

Regarding WORDSNET-20833, please note that Aspose.Words tries to mimic the behavior of MS Word and uses the MS Word document model. Please set the appropriate properties after importing the HTML into the Aspose.Words DOM.

Document doc = new Document("test.html");
doc.getFirstSection().getPageSetup().setLeftMargin(20);
doc.getFirstSection().getPageSetup().setRightMargin(20);
doc.updateTableLayout();
doc.save("test.pdf");
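
For picking margin values, it may help to convert between inches and points. The sketch below assumes, as the PageSetup API documentation describes, that margin values are given in points, and uses the standard 72 points per inch. The class and method names (MarginUnitsSketch, inchesToPoints, pointsToInches) are hypothetical.

```java
// Hypothetical helper for illustration only; not part of Aspose.Words.
class MarginUnitsSketch {
    // One inch is 72 typographic points.
    static final double POINTS_PER_INCH = 72.0;

    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    static double pointsToInches(double points) {
        return points / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // The 1 inch default margin mentioned earlier, expressed in points.
        System.out.println(inchesToPoints(1.0));
        // The value 20 used in the code above, expressed in inches.
        System.out.println(pointsToInches(20));
    }
}
```

Under that assumption, a margin of 20 is a little over a quarter of an inch.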