Thanks for your patience.
I have tested the scenario using the code logic you shared earlier, but with the Aspose.Pdf namespace instead of Aspose.Pdf.Generator. As per my observations, the resultant file does not contain proper content. Another approach is to use Aspose.Pdf with Tesseract-OCR. With it, all the content is rendered in the PDF file, but I am afraid not all of the content is searchable. For the sake of correction, I have logged it as PDFNET-43177 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.
For your reference, I have also attached the output generated with the following code snippet: input_searchable.pdf (375.8 KB)
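Document doc = new Document(@"C:\pdftest\Code\input.pdf");
// Run OCR on each page image through the callback, then save the result.
// The output file name is assumed from the attachment mentioned above.
doc.Convert(CallBackGetHocr);
doc.Save(@"C:\pdftest\Code\input_searchable.pdf");

static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"C:\pdftest\Code\";
    img.Save(dir + "ocrtest.jpg");
    // Call Tesseract to produce hOCR output for the saved page image.
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    info.Arguments = @"C:\pdftest\Code\ocrtest.jpg C:\pdftest\Code\out hocr";
    System.Diagnostics.Process p = new System.Diagnostics.Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit(); // the hOCR file is complete only after Tesseract exits
    System.IO.StreamReader streamReader = new System.IO.StreamReader(@"C:\pdftest\Code\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}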
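If it helps, the Tesseract call inside the callback could be made a little more defensive. The following is only a minimal sketch and not part of the Aspose.Pdf API: HocrRunner, BuildArguments and RunTesseractHocr are hypothetical names introduced here for illustration. It quotes the paths so that folders containing spaces work, checks the exit code, and looks for either out.hocr or out.html, on the assumption that the extension of the hOCR file depends on the installed Tesseract version.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of Aspose.Pdf or Tesseract.
static class HocrRunner
{
    // Builds the tesseract.exe argument string, quoting both paths
    // so that directories containing spaces are passed correctly.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return $"\"{imagePath}\" \"{outputBase}\" hocr";
    }

    // Runs Tesseract on the image and returns the hOCR markup as a string.
    public static string RunTesseractHocr(string tesseractExe, string imagePath, string outputBase)
    {
        var info = new ProcessStartInfo(tesseractExe)
        {
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            // Wait until Tesseract has finished writing its output file.
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }

        // Assumption: the output is either <base>.hocr or <base>.html,
        // depending on the Tesseract version in use.
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (File.Exists(candidate))
                return File.ReadAllText(candidate);
        }
        throw new FileNotFoundException("No hOCR output found for " + outputBase);
    }
}
```

With such a helper, the body of CallBackGetHocr would reduce to saving the image and returning HocrRunner.RunTesseractHocr(...) with the same three paths used in the snippet above. Please note that an output file left over from a previous run would still be picked up, so you may want to delete it before each call.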