PDF to HTML Conversion: Text remains visible in background image

Hi,
I am trying to convert the attached PDF, “EnglishForm.pdf”, to HTML using Aspose. After conversion, the text of the PDF is rendered in the HTML, but the same text also remains in the background image. I expect the text not to be present in the background image.
I am attaching the input PDF, “EnglishForm.pdf”, and the HTML output, “aspose_output.zip”.

Can anyone please help me with this?

aspose_output.zip (67.3 KB)
EnglishForm.pdf (175.1 KB)

@jahidul.hasan

We tested the scenario with the latest version, Aspose.PDF for .NET 22.9, and could not reproduce the issue. Please use Aspose.PDF for .NET 22.9. We have attached the output HTML to this post for your reference.
22.9.zip (419 Bytes)

Hi @tahir.manzoor,
I have updated to the version you mentioned, but the problem persists. Here is my problem in more detail:

  1. I am attaching a PDF named “English_ocr_form.pdf”.
  2. When I convert that PDF to HTML using Aspose.PDF for .NET, I get the folder attached here as “English_ocr_form_files.zip”.
  3. Inside it, in the folder “English_ocr_form_files”, there is “img_03.svg”, which actually uses “img_01.png” and “img_02.png”.
  4. I am fine with “img_01.png”, which represents the background image without text.
  5. The problem is “img_02.png”, which is an image of the text.
  6. As a result, “img_03.svg” is an image that has the text in it.
  7. For the HTML conversion, I set both the “SaveShadowedTextsAsTransparentTexts” and “SaveTransparentTexts” properties to true, and this does extract the text from the PDF.
  8. The problem now: “img_03.svg” is set as the background and it is an image containing the text, so the extracted raw text overlaps it, which causes a problem for me.
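
Item 3 can be checked on other outputs by listing the raster images an SVG refers to. The sketch below is plain Python post-processing, not part of Aspose.PDF. It assumes the SVG references its PNGs through standard `<image>` elements with an `href` or `xlink:href` attribute; the function name `list_image_refs` and the sample SVG are hypothetical stand-ins, not the real content of “img_03.svg”.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def list_image_refs(svg_text):
    """Return the href of every <image> element in an SVG document."""
    root = ET.fromstring(svg_text)
    refs = []
    for elem in root.iter():
        # Tags are namespace-qualified, e.g. "{http://www.w3.org/2000/svg}image".
        if elem.tag.rsplit("}", 1)[-1] == "image":
            href = elem.get(XLINK_HREF) or elem.get("href")
            if href:
                refs.append(href)
    return refs


# Hypothetical stand-in for img_03.svg; the real file may be structured differently.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="100">
  <image xlink:href="img_01.png" width="100" height="100"/>
  <image xlink:href="img_02.png" width="100" height="100"/>
</svg>"""

print(list_image_refs(SAMPLE_SVG))
```

For a real output, the file content would be read from “img_03.svg” and passed to the function in place of the sample string.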

My expectation: is there any way or configuration to make “img_03.svg” contain only the background image, with no text in it? That would mean “img_02.png” is not created.
FYI, the PDF may contain text under an image.

Thanks.

English_ocr_form.pdf (175.1 KB)
English_ocr_form_files.zip (1014.1 KB)

@jahidul.hasan

Could you please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing? We will investigate this issue further and provide you more information on it.

@tahir.manzoor
Hi,
As you suggested, I am attaching a console application with a PDF file named “English_ocr_form.pdf” in the “Resouces” folder.
After running, the application converts the PDF to HTML and generates the necessary files in the “OutputFiles” directory.
In the “English_ocr_form_files” folder, you can see an image named “img_02.png”, which contains the text. So when “img_03.svg” is set as the background, it actually holds all the text in the image.

So, this image is our problem. We don’t want our background image to contain any text, since the text is already parsed and present in the HTML.
Because of this image, the HTML text overlaps the text in the image.
So, is there any way to avoid this text image, “img_02.png”, in a generic manner that works for all PDFs? We only want the background image that contains no text (“img_01.png”).

Note: We have seen that for some PDFs, an image like “img_02.png” is not generated. We need to know how to identify and remove it.
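
As a possible stopgap for the "identify and remove" step, the SVG could be post-processed after conversion. This is a minimal Python sketch outside Aspose.PDF, not a confirmed solution: it assumes the SVG pulls in the text image through a standard `<image>` element with an `href` or `xlink:href` attribute, and the helper `drop_image_ref` and the sample SVG are hypothetical. Which file is the text image still has to be decided separately for each PDF.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def drop_image_ref(svg_text, file_name):
    """Return SVG text with every <image> that points at file_name removed."""
    # Register prefixes so the serialized output keeps readable names.
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(svg_text)
    # Snapshot the element lists so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.rsplit("}", 1)[-1] != "image":
                continue
            href = child.get("{%s}href" % XLINK_NS) or child.get("href") or ""
            # Compare only the file name, ignoring any directory prefix.
            if href.rsplit("/", 1)[-1] == file_name:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for img_03.svg; the real file may be structured differently.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="100">
  <image xlink:href="img_01.png" width="100" height="100"/>
  <image xlink:href="img_02.png" width="100" height="100"/>
</svg>"""

cleaned = drop_image_ref(SAMPLE_SVG, "img_02.png")
print(cleaned)
```

If the SVG contains no matching `<image>` element, as in the PDFs where no text image is generated, the function returns the document with nothing removed.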
Application Download URL: AsposePdfToHtmlConversion.zip

@jahidul.hasan

We have logged this problem in our issue tracking system as PDFNET-52750. We will inform you once there is an update available on it.

We apologize for the inconvenience.

Hi @tahir.manzoor,
Thanks for your response. I tried to work with the Delete method of the Aspose ArtifactCollection. Unfortunately, the artifact does not get deleted even though the method executes. I used a 1-based index. Can you please help me with this?

Delete Method

@jahidul.hasan

Your issue has been logged for investigation. We will inform you once there is an update available on it.
