Identifying Scanned PDFs and Creating Searchable PDFs from Them

Hello,

I am evaluating your product Aspose.PDF on the .NET platform for one of our clients. We need to identify which PDFs are plain scanned PDFs and which already have a text layer. For the PDFs that do not have a text layer, we want to create searchable PDFs. The feature is similar to the one in Adobe Acrobat; please refer to the attached screenshot.


Please send us a possible solution with Aspose.PDF as early as possible.
A .NET code example would be preferred.

Thanks

Hello There,


Thanks for contacting support.

We are looking into your request and will update you with a definite response shortly.

Best Regards,


Hi,

Thanks for contacting support.

Please visit the following link for information on how to find whether a PDF file contains images or text only.
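
As a rough illustration of such a check (not necessarily the approach the linked article takes), the sketch below treats a PDF as having a text layer if any page yields non-whitespace text through TextAbsorber from the Aspose.Pdf.Text namespace. The class and method names (TextLayerCheck, HasTextLayer) are hypothetical, and the rule itself is an assumption: a scanned PDF that carries a stamped header or page number as real text would also be reported as having a text layer.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TextLayerCheck
{
    // Hypothetical helper: returns true when at least one page
    // yields non-whitespace extractable text.
    public static bool HasTextLayer(string pdfPath)
    {
        using (Document doc = new Document(pdfPath))
        {
            foreach (Page page in doc.Pages)
            {
                // Use a fresh absorber per page so we can stop at the first hit.
                TextAbsorber absorber = new TextAbsorber();
                page.Accept(absorber);

                if (!string.IsNullOrWhiteSpace(absorber.Text))
                {
                    return true;
                }
            }
        }

        // No page produced text: most likely an image-only (scanned) PDF.
        return false;
    }
}
```

Files for which HasTextLayer returns false would then be candidates for OCR conversion.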

Furthermore, to convert a non-searchable PDF file (a scanned-image PDF) to a searchable PDF document, please try the following code snippet, which uses Tesseract.

If you face any issues, please share the sample PDF files so that we can look into the matter further.

using System;
using System.Diagnostics;
using System.IO;

public class PDFtoSearchablePDF
{
    public static void Main()
    {
        string inputPdfPath = @"D:/Downloads/input.pdf";
        string outputPdfPath = @"E:/Data/pdf_searchable.pdf";

        Aspose.Pdf.Document doc = new Aspose.Pdf.Document(inputPdfPath);
        doc.Convert(CallBackGetHocr);
        doc.Save(outputPdfPath);
    }

    static string CallBackGetHocr(System.Drawing.Image img)
    {
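        string dir = @"E:\Data";
        string imagePath = Path.Combine(dir, "ocrtest.jpg");
        // Tesseract takes an output base name and appends the extension itself:
        // recent versions write out.hocr, older 3.x releases wrote out.html.
        string outputBase = Path.Combine(dir, "out");
        string outputHocrPath = outputBase + ".hocr";

        img.Save(imagePath);
        // Remove any result left over from a previous image so it is not returned by mistake.
        File.Delete(outputHocrPath);

        ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        // The trailing "hocr" config makes Tesseract emit hOCR instead of plain text.
        info.Arguments = $"\"{imagePath}\" \"{outputBase}\" hocr";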

        using (Process p = new Process())
        {
            p.StartInfo = info;
            p.Start();
            p.WaitForExit();
        }

        if (File.Exists(outputHocrPath))
        {
            using (StreamReader streamReader = new StreamReader(outputHocrPath))
            {
                string text = streamReader.ReadToEnd();
                return text;
            }
        }
        else
        {
            Console.WriteLine("OCR process failed. Check Tesseract installation and paths.");
            return string.Empty;
        }
    }
}