Make a PDF file searchable

I am looking for .NET code that performs the same operation as described here:

https://www.techwalla.com/articles/how-to-make-a-pdf-searchable

Thank you!

@webmaster6b8e5

Thanks for contacting support.

Aspose.Pdf for .NET offers various features for searching objects, elements and text in a PDF document. You can search for and extract text, images, annotations and tables from a PDF document using Aspose.Pdf for .NET. Please check the following article(s) in our API documentation, which demonstrate the basic usage of these features:

As for OCR operations, you can extract images from the PDF by following the instructions in the shared article and then perform OCR on them using our other API, Aspose.OCR for .NET. To perform OCR on an image, please see the “Performing OCR on an Image” article in our API documentation. If you need any further assistance, please feel free to contact us.

Hi,

I am not talking about extracting information, but about adjusting the property described in the article to make a PDF searchable. Is that possible? If so, how?

@webmaster6b8e5

Thanks for writing back.

There is no direct way in the API to make a PDF searchable by adjusting a property. However, you can convert a non-searchable PDF into a searchable PDF document using the following code snippet.

using System.IO;
using Aspose.Pdf;

private static void CreateSearchablePDF(string dataDir)
{
    // Load the non-searchable PDF.
    Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
    // Recognize text in the PDF's images through the hOCR callback.
    doc.Convert(CallBackGetHocr);
    doc.Save(@"E:\Data\test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
    // Save the image to disk so that Tesseract can read it.
    string dir = @"E:\Data\";
    img.Save(dir + "ocrtest.jpg");

    // Tesseract V3.02: run the console application and request hOCR output.
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    info.Arguments = @"E:\Data\ocrtest.jpg E:\Data\out hocr";
    using (System.Diagnostics.Process p = new System.Diagnostics.Process())
    {
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();
    }

    // Read the hOCR result written by Tesseract and return it to the API.
    using (StreamReader streamReader = new StreamReader(@"E:\Data\out.html"))
    {
        return streamReader.ReadToEnd();
    }
}

The logic above recognizes text in the PDF's images. For recognition, you can use any external OCR engine that supports the hOCR standard (hOCR - Wikipedia). We used the free Google Tesseract OCR in the code snippet above. Please install it on your computer from tesseract-ocr · GitHub; after that, you will have the tesseract.exe console application.
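Please note that the name of the file read back in the callback may depend on the installed Tesseract version. The snippet above was written against V3.02, which writes the hOCR result to out.html; to the best of our knowledge, later releases write out.hocr instead. The following is only a minimal sketch of one way to handle both cases. ReadHocrOutput is a hypothetical helper name, not part of Aspose.Pdf or Tesseract, and the two extensions it checks are an assumption about Tesseract's behavior.

```csharp
using System.IO;

// Hypothetical helper (not an Aspose.Pdf or Tesseract API).
// Returns the hOCR text that Tesseract wrote for the given output base path,
// e.g. @"E:\Data\out" for the files out.hocr or out.html.
static string ReadHocrOutput(string outputBase)
{
    // Assumption: newer Tesseract builds write <base>.hocr, V3.02 writes <base>.html.
    foreach (string extension in new[] { ".hocr", ".html" })
    {
        string path = outputBase + extension;
        if (File.Exists(path))
        {
            return File.ReadAllText(path);
        }
    }

    // Neither file exists, so Tesseract most likely failed or wrote elsewhere.
    throw new FileNotFoundException("No hOCR output found for " + outputBase);
}
```

With such a helper, the last lines of the callback could be replaced by return ReadHocrOutput(@"E:\Data\out"); so that the same code works with either output file name.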

If the suggested approach still does not fit your requirements, please share your sample PDF document so that we can log an investigation ticket accordingly.