Conversion of HTML into PDF or PPTX converts all content as images

h0las · April 19, 2023, 11:02pm

Hi! We’ve been using Aspose in our C# and .NET projects for a long time.
Now I want to convert an HTML page into PDF and PPTX.

I found some pretty simple code:

        // Requires: using Aspose.Pdf;
        var options = new HtmlLoadOptions();
        var htmlDoc = new Document("doc.html", options);
        // Save the loaded HTML document as PDF and as PPTX
        htmlDoc.Save("output.pdf", Aspose.Pdf.SaveFormat.Pdf);
        htmlDoc.Save("output.pptx", Aspose.Pdf.SaveFormat.Pptx);

I’ve attached an example of my doc.html file (it’s actually a PPTX file converted to HTML by Aspose) and the resulting output PDF/PPTX files.

The problem is that all the content of my HTML file (text, titles, etc.) is turned into images during conversion, so I can’t even copy text from the resulting PDF/PPTX file. I want to keep text as text, images as images, etc. in the resulting PDF/PPTX files. Is that possible? Are there any settings/configs for it?

samples.7z (5.8 MB)

carlos.molina · April 20, 2023, 2:20pm

@h0las,

Sadly, there is no way to tell the rasterization process that text on top of an image is not part of the image, so it treats everything as part of the original image.

h0las · April 23, 2023, 10:01pm

@carlos.molina Yes, but there is something worth noting. I tried my HTML to PDF conversion with this Aspose online converter: https://products.aspose.app/words/conversion/html-to-pdf
This converter gave me exactly the result I expected! I could copy text from the online converter’s output. So why does this converter recognize text as text while my code (first message above) doesn’t?
That’s why I started to think there might be special configs/params that I missed.

You can check the results from the code and from the Aspose online converter; I’ve attached them below.
conversion-difference.7z (5.3 MB)
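
The online converter’s URL sits under products.aspose.app/words, so it is presumably built on Aspose.Words rather than Aspose.Pdf. That is an assumption, not something confirmed in this thread. A minimal sketch of the same HTML to PDF conversion done through Aspose.Words, assuming the Aspose.Words for .NET package is referenced and doc.html is the same input file:

```csharp
// Sketch only: assumes the Aspose.Words for .NET package is installed.
// Whether the online converter actually uses this exact call is an assumption.
using Aspose.Words;

class HtmlToPdfViaWords
{
    static void Main()
    {
        // Load the HTML file; Aspose.Words detects the format from the content.
        var doc = new Document("doc.html");

        // Save as PDF. Text in the HTML is expected to stay selectable text,
        // as in the online converter's output.
        doc.Save("output_words.pdf", SaveFormat.Pdf);
    }
}
```

This only covers the PDF side; the HTML to PPTX conversion from the first message is not addressed by it.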

carlos.molina · April 24, 2023, 2:31pm

@h0las,

The source appears to be different, since the online converter’s output does not have the trial watermark. Can you provide the actual HTML document you used for the online tool?

h0las · April 24, 2023, 6:36pm

@carlos.molina Yes, here is my original HTML. I used it for both conversions: via my code and via the online converter. The files with the watermark are from my test PoC project.

There is also the result of the HTML => PDF conversion via code without the watermark. I’m still unable to copy text from it (unlike the online converter’s result).
output_from_code.pdf (4.9 MB)

source-html.7z (807.5 KB)

carlos.molina · April 24, 2023, 7:07pm

@h0las,

I was able to replicate the issue. I will gather all this information and create a ticket for the dev team.

carlos.molina · April 24, 2023, 7:22pm

@h0las
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54453

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

h0las · April 24, 2023, 8:37pm

Thank you for the fast response!
Is this ticket only about the HTML => PDF conversion?
The thing is, I also need an HTML => PPTX conversion that works properly (keeping text as text instead of converting all content to images).
I attached an example of the result in the first message. The source HTML is the same as for the PDF conversion.

carlos.molina · April 24, 2023, 8:39pm

@h0las,

I do not think I can add this issue to the same ticket. Let me ask, and I will find out whether I need to create a separate ticket for it or can keep just the one.