Converting html with embedded images to pdf using Aspose Words

Hi,

We recently noticed that, when loading HTML files that reference external resources (images), the library tries to fetch the images from the internet, which takes a long time when there is no internet connectivity.
So we want to disable that behaviour, i.e. never load any remote resources when converting from HTML to PDF using Aspose Words.

Following the documentation available on the Aspose forums, we went with the ResourceLoadingAction approach to disable this.

LoadOptions loadOptions = new LoadOptions(loadFormat, null, null);
loadOptions.setResourceLoadingCallback(resourceLoadingArgs -> {
    return ResourceLoadingAction.SKIP;
});

With this, the conversion time did improve in environments without internet connectivity, but where we do have internet connectivity, the image still seems to be loaded into the PDF.

I then tried the other option mentioned in the documentation, i.e. setting the web request timeout to 0.

HtmlLoadOptions htmlLoadOptions = new HtmlLoadOptions();
htmlLoadOptions.setWebRequestTimeout(0);

With this, the PDF no longer includes the image, whether or not there is internet connectivity, and it also seems to be a lot faster than the above method.

Could you please confirm these behaviours and comment on the performance difference between the two?
Also, is there any other way to explicitly stop Aspose Words from loading any remote resources via configuration?

Note: We are using Aspose Words version 22.6.

Also attaching an example html file we were testing with.
test-sample.zip (1.4 KB)

Thanks,

@hunair If you want to avoid HTTP requests completely, the best option is to implement the IResourceLoadingCallback interface. For example:

// Load the HTML with a callback that runs before each resource is loaded.
LoadOptions opts = new LoadOptions();
opts.setResourceLoadingCallback(new HandleResourceLoadingCallback());
Document doc = new Document("C:\\Temp\\input.html", opts);
doc.save("C:\\Temp\\output.pdf", SaveFormat.PDF);

// Skips resources whose URI starts with "http" (HTTP/HTTPS); everything else loads normally.
static class HandleResourceLoadingCallback implements IResourceLoadingCallback {
    public ResourceLoadingAction resourceLoading(ResourceLoadingArgs args) {
        return args.getOriginalUri().startsWith("http") ? ResourceLoadingAction.SKIP : ResourceLoadingAction.DEFAULT;
    }
}
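
The `startsWith("http")` check above is case-sensitive and would also match a local relative path that merely begins with "http". A slightly stricter check could look like the following minimal sketch. The class name `RemoteUriFilter` and the method `isRemote` are hypothetical helpers introduced here for illustration and are not part of the Aspose API; the sketch only treats `http://` and `https://` URIs as remote, which is an assumption about what you want to block.

```java
import java.util.Locale;

// Hypothetical helper (not part of Aspose.Words): classifies a resource URI as remote or not.
final class RemoteUriFilter {
    private RemoteUriFilter() {}

    // Returns true only for URIs with an explicit http:// or https:// scheme, ignoring case.
    // Null values, local paths, relative paths and data: URIs are treated as non-remote.
    static boolean isRemote(String uri) {
        if (uri == null) {
            return false;
        }
        String normalized = uri.trim().toLowerCase(Locale.ROOT);
        return normalized.startsWith("http://") || normalized.startsWith("https://");
    }
}
```

Inside the callback, the decision would then read `RemoteUriFilter.isRemote(args.getOriginalUri()) ? ResourceLoadingAction.SKIP : ResourceLoadingAction.DEFAULT`.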

Hi,
How is this piece of code different from the first code snippet I posted? There I am returning ResourceLoadingAction.SKIP in all cases.
When we tried this, it still loaded HTTP URLs; the WebRequestTimeout option seems to be the only one that works for us.
Also, does this configuration change when the Document created above is appended to another, empty document and we call document.save on that document?

@hunair The code that I posted is very similar to your original code; I'm just adding a filter so that only HTTP/HTTPS requests are skipped. This is the best way to avoid loading resources. When you implement this interface, your code is executed before the Aspose API loads each resource. So, if you SKIP the load process, the Aspose API will not look for that resource.

Setting the WebRequestTimeout to 0 will not prevent the HTTP call (in fact, a second call will be made since the first one fails), but it will avoid waiting for the result. Therefore, in terms of avoiding loading external resources for a document, implementing the IResourceLoadingCallback is the best approach.

Now, I tested the scenario that you mentioned (which was not specified in the original question), and you are right. When the document is appended to another document, the resource is loaded when that document is saved as PDF. I will raise a ticket to the development team to investigate this. As a workaround, you can still set the WebRequestTimeout property to 0.

Thanks @eduardo.canal for the reply.
Could you please confirm whether we might miss loading other local assets when we SKIP all resources? Mainly, I wanted to understand what other kinds of resource loading exist apart from HTTP/HTTPS requests, given that you selectively check for URIs starting with "http".

Btw, although my original question did not mention appending to a document, our use case has an empty document into which this document gets inserted, which is probably why the first code snippet did not work for us and the second snippet (WebRequestTimeout) did.

To get around the issue with the first snippet, which uses IResourceLoadingCallback, I tried adding the following code on the empty root document and it seems to work.

// Empty root document
this.document = new Document();
// Setting property directly on root document
this.document.setResourceLoadingCallback(resourceLoadingArgs -> ResourceLoadingAction.SKIP);

// Configuring load options when creating the document with the resource
LoadOptions loadOptions = new LoadOptions(loadFormat, null, null);
loadOptions.setResourceLoadingCallback(resourceLoadingArgs -> ResourceLoadingAction.SKIP);
Document documentToAppend = new Document(inputStream, loadOptions);

// Appending to root document.
this.document.appendDocument(documentToAppend, ImportFormatMode.KEEP_SOURCE_FORMATTING);

// Saving to PDF
this.document.save(outputStream, SaveFormat.PDF);

Is this an appropriate way to do this?

Also, when we ran some performance tests, we found that the WebRequestTimeout method seems to be faster than the code above. In both cases, the external resource is not loaded. Any idea why we see this behaviour when WebRequestTimeout is supposed to be slower?

Thanks,

A document may contain references to resources stored in your local environment.

Is this an appropriate way to do this?

Yes, that is the best approach.

Also, when we ran some performance tests, we found that the WebRequestTimeout method seems to be faster than the code above. In both cases, the external resource is not loaded. Any idea why we see this behaviour when WebRequestTimeout is supposed to be slower?

Here in the support forum, we do not provide performance test data, as results depend on the application and the environment where the API is executed. However, skipping resource loading through the interface implementation should generally be faster and, in your particular scenario, should also reduce network traffic. Nonetheless, I will raise your concerns with the development team.
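
Since the timings depend on the environment, it may help to measure both approaches the same way on your own machine. Below is a minimal sketch of a timing helper, assuming wall-clock time over a few repeated runs is a good enough comparison. `ConversionTimer` and `medianMillis` are hypothetical names introduced for illustration; they are not part of Aspose.Words, and the task you pass in is whatever conversion code you want to compare.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Aspose.Words): times a task over several runs.
final class ConversionTimer {
    private ConversionTimer() {}

    // Runs the task warmupRuns times without measuring, then measuredRuns times with
    // System.nanoTime, and returns the median measured duration in milliseconds.
    static double medianMillis(Callable<?> task, int warmupRuns, int measuredRuns) throws Exception {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be >= 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.call();
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.call();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        int mid = measuredRuns / 2;
        // Odd count: middle element; even count: mean of the two middle elements.
        double medianNanos = (measuredRuns % 2 == 1)
                ? nanos[mid]
                : (nanos[mid - 1] + nanos[mid]) / 2.0;
        return medianNanos / 1_000_000.0;
    }
}
```

For example, wrap each conversion variant (callback-based and WebRequestTimeout-based) in its own `Callable` that loads, appends and saves the document, then compare the two medians. Using the median rather than the mean keeps a single slow run, such as one network stall, from dominating the result.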

Thank you @eduardo.canal for all the feedback.
Yes, it would be great if you could cross-check our solution for the appended document from my previous comment with the development team, in terms of any potential performance impact. Thanks.


@hunair

When you load the document with the following configuration:

LoadOptions loadOptions = new LoadOptions(LoadFormat.HTML, null, null);
loadOptions.setResourceLoadingCallback(resourceLoadingArgs -> {
    return ResourceLoadingAction.SKIP;
});

Document doc = new Document("C:\\Temp\\in.html", loadOptions);

With these options, external images and other resources are not loaded at load time. The images are read into the document as linked images and are meant to be loaded on demand. Because you specified an IResourceLoadingCallback in the document's load options, the same callback is used when the document is converted to PDF, so resource loading is skipped there as well.
However, when you append the loaded document to an empty document, the target document's resource loading callback is used. If none is specified, the resources are loaded normally. So, to skip resource loading in the target document, you should also set a resource loading callback on the target document, just as you do in your code:

// Empty root document
this.document = new Document();
// Setting property directly on root document
this.document.setResourceLoadingCallback(resourceLoadingArgs -> ResourceLoadingAction.SKIP);

I have tested the same scenario, but I cannot see the WebRequestTimeout approach working faster than the ResourceLoadingAction.SKIP approach. On my side, the IResourceLoadingCallback approach is faster than the WebRequestTimeout approach.

Hi @alexey.noskov
Thank you for looking into this and for the feedback.
