PDF with pages that are both searchable and non-searchable

BenjaminA · December 11, 2018, 2:02pm

Hi.

Question 1.
I have a PDF in which some pages are searchable and others are non-searchable (scanned images). Is there any way to convert only the non-searchable pages and make them searchable (using Tesseract)?

Question 2.
Is there any way to identify a PDF page that contains both text and images?

Farhan.Raza · December 11, 2018, 9:25pm

@benjamin.a

Thank you for contacting support.

Aspose.PDF for .NET allows you to find out whether a PDF file contains only text or only images. You can also find out whether it contains both or neither, as explained in Find whether PDF file contains images or text only.

Likewise, you can test for the existence of images or text at page level by restricting the extractor to a page range with the StartPage and EndPage properties.

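// Requires: using Aspose.Pdf.Facades; using System.IO;
PdfExtractor ext = new PdfExtractor();
ext.BindPdf("Aspose.pdf");
ext.StartPage = 2;   // first page to inspect
ext.EndPage = 5;     // last page to inspect
ext.ExtractText();
// Read the extracted text back; little or no content suggests these pages have no text layer.
MemoryStream text = new MemoryStream();
ext.GetText(text);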
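
For Question 2, a per-page check may be easier to read when written against the Document object model rather than the PdfExtractor facade. The following is only a minimal sketch under a few assumptions: it uses TextAbsorber and the page's image resources instead of the facade shown above, it treats a page as "searchable" when any non-whitespace text can be extracted, and it counts only images listed directly in the page resources (images nested inside form XObjects would be missed). The class name PageClassifier is hypothetical.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper: reports, for each page, whether it has text and/or images.
class PageClassifier
{
    static void Main()
    {
        // "Aspose.pdf" is the sample file name used in the snippet above.
        using (Document doc = new Document("Aspose.pdf"))
        {
            // Page numbers in Aspose.PDF start at 1, not 0.
            for (int i = 1; i <= doc.Pages.Count; i++)
            {
                Page page = doc.Pages[i];

                // Extract whatever text layer the page has.
                TextAbsorber absorber = new TextAbsorber();
                page.Accept(absorber);
                bool hasText = !string.IsNullOrWhiteSpace(absorber.Text);

                // Assumption: only images referenced directly by the page are counted.
                bool hasImages = page.Resources.Images.Count > 0;

                Console.WriteLine($"Page {i}: text={hasText}, images={hasImages}");
            }
        }
    }
}
```

A page reported with text=False and images=True is a likely candidate for OCR, while text=True and images=True indicates a page containing both.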

Once you have determined which pages are non-searchable, you can store their page numbers in a List, Dictionary, or similar collection and then split the PDF file so that a separate document contains only those pages. You can then make those pages searchable with Tesseract and concatenate or insert them back into the original PDF document with the PageCollection.Insert method, as explained in Manipulate Page in a PDF File.
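
The round trip described above could be sketched roughly as follows. This is not an official sample. It assumes that a page with no extractable text is non-searchable, that the OCR step happens outside this code (Tesseract works on images, so the exported pages would first have to be rendered to images), and that the OCR result is a PDF whose pages are in the same order as the exported ones. The class name, method names, and file names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical sketch of the export / OCR / re-insert workflow.
class OcrRoundTrip
{
    // Step 1: record pages without extractable text and copy them to a separate PDF.
    static List<int> ExportNonSearchablePages(string sourcePath, string exportPath)
    {
        var pageNumbers = new List<int>();
        using (Document source = new Document(sourcePath))
        using (Document export = new Document())
        {
            for (int i = 1; i <= source.Pages.Count; i++)   // pages are 1-based
            {
                var absorber = new TextAbsorber();
                source.Pages[i].Accept(absorber);
                if (string.IsNullOrWhiteSpace(absorber.Text))
                {
                    pageNumbers.Add(i);
                    export.Pages.Add(source.Pages[i]);
                }
            }
            // Only write the export file when there is something to OCR.
            if (pageNumbers.Count > 0)
                export.Save(exportPath);
        }
        return pageNumbers;
    }

    // Step 2 (after the external OCR step): swap each original page for its OCR'd version.
    static void ReplacePages(string sourcePath, string ocrPath,
                             List<int> pageNumbers, string resultPath)
    {
        using (Document source = new Document(sourcePath))
        using (Document ocr = new Document(ocrPath))
        {
            for (int k = 0; k < pageNumbers.Count; k++)
            {
                int target = pageNumbers[k];
                // Insert the OCR'd page at the original position, which shifts the
                // old page to target + 1, then delete the old page.
                source.Pages.Insert(target, ocr.Pages[k + 1]);
                source.Pages.Delete(target + 1);
            }
            source.Save(resultPath);
        }
    }

    static void Main()
    {
        // All file names below are placeholders.
        List<int> pages = ExportNonSearchablePages("input.pdf", "to_ocr.pdf");
        Console.WriteLine("Pages needing OCR: " + string.Join(", ", pages));

        // ... run the external OCR step here to produce "ocr_done.pdf" ...

        if (pages.Count > 0)
            ReplacePages("input.pdf", "ocr_done.pdf", pages, "searchable.pdf");
    }
}
```

Inserting before deleting keeps the total page count unchanged after each swap, so the recorded page numbers stay valid for the rest of the loop.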

We hope this will be helpful. Please feel free to contact us if you need any further assistance.