Aspose PDF with OCR and DPI

ChangepondAspose · August 25, 2022, 11:42am

Hi, we have a requirement to OCR a PDF so that its text becomes editable and users can easily modify or copy the content (similar to the Edit PDF option in Adobe Reader, which makes a PDF editable). Is there an equivalent provision in the Aspose.PDF assembly? We tried Aspose.OCR, but it appears to process only the images and create a new PDF by copying all the images into it. Could you please help us with how to proceed using the Aspose.PDF assembly?

We also need to cap the PDF DPI at a maximum of 300. I can't find a way to set the DPI through the Aspose.PDF assembly; could you please provide further details on this as well?

Thanks,
Jagadeeshwaran M

asad.ali · August 25, 2022, 8:27pm

@ChangepondAspose

To specify the image resolution in a PDF, you can check the different optimization strategies that are applied during the optimization process. You can also try Aspose.PDF with Tesseract to create a searchable PDF. Please check this post for the code snippet. Feel free to let us know if you notice any issues.
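
A minimal sketch of capping image resolution at 300 DPI during optimization, assuming the `Aspose.Pdf.Optimization` API (`OptimizationOptions`, `ImageCompressionOptions.ResizeImages`, `MaxResolution`). The file paths and the image quality value are placeholders. A PDF has no single document-wide DPI, so this approach downsamples the embedded images rather than setting one global value.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Optimization;

class CapImageResolution
{
    static void Main()
    {
        // Placeholder paths; replace with your own files.
        var doc = new Document(@"C:\temp\input.pdf");

        var options = new OptimizationOptions();
        // Image compression must be enabled for the resize settings to apply.
        options.ImageCompressionOptions.CompressImages = true;
        options.ImageCompressionOptions.ImageQuality = 75; // assumed value, tune as needed
        // Downsample any image whose resolution exceeds MaxResolution (in DPI).
        options.ImageCompressionOptions.ResizeImages = true;
        options.ImageCompressionOptions.MaxResolution = 300;

        doc.OptimizeResources(options);
        doc.Save(@"C:\temp\output_300dpi.pdf");
    }
}
```

Images already at or below 300 DPI should be left at their original resolution, so the setting acts as an upper limit rather than a target.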

ChangepondAspose · August 26, 2022, 1:54pm

I have used your code snippet, but it throws an error on the line “StreamReader streamReader = new StreamReader(@"E:\data\out.html");” because the file out.html could not be found. Could you please help us with this?

asad.ali · August 26, 2022, 7:11pm

@ChangepondAspose

We apologize for the inconvenience. The issue is probably caused by a wrong file name. The file that the StreamReader opens should have the same name as the output name passed as an argument to the Tesseract process.

Please note that the OCR feature in Aspose.PDF was developed to support any OCR engine (via a callback), but so far we have tested it only with Tesseract. Please check the code snippet below:

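// C# Code
using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

class Program
{
    static void Main(string[] args)
    {
        var doc = new Document("c:/temp/test_10.pdf");
        // Convert calls the callback for each page image and embeds the returned HOCR as hidden text
        doc.Convert(CallBackGetHocr);
        doc.Save("C:/temp/output_10.pdf");
    }
    //********************* CallBackGetHocr method ***********************//
    static string CallBackGetHocr(System.Drawing.Image img)
    {
        string dir = @"C:\temp";
        // Path.Combine inserts the separator, so the image lands in C:\temp\ocrtest.jpg
        string imagePath = Path.Combine(dir, "ocrtest.jpg");
        string outputBase = Path.Combine(dir, "out");
        img.Save(imagePath);
        ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        // Arguments: <input image> <output base name> hocr (Tesseract appends the extension itself)
        info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
        Process p = new Process();
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();
        // Read the file Tesseract actually wrote; older Tesseract releases may write out.html instead
        return File.ReadAllText(outputBase + ".hocr");
    }
}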

At the moment, Aspose.PDF supports only the HOCR format as input for embedding a hidden text layer on scanned PDF pages. Furthermore, if you only have an HOCR file and want to embed it in the PDF, you can use the code snippet below:

// C# Code
using (var pdf = new Aspose.Pdf.Document(dataDir + @"Scanned.pdf"))
{
 pdf.Convert((image) =>
 {
  return File.ReadAllText(dataDir + @"sample.hocr");
 });
 pdf.Save(dataDir + "test_searchable.pdf");
}

ChangepondAspose · September 7, 2022, 9:34am

Thanks a lot. The code now works fine, but it does not meet our requirement. The text search does not highlight all the occurrences in the PDF at once; instead, we have to use Find Next repeatedly to highlight each match. Is there any other option in Aspose.PDF that supports this?

asad.ali · September 7, 2022, 5:43pm

@ChangepondAspose

We are afraid that Aspose.PDF for .NET does not have such a feature. However, would you kindly share your sample source and expected output PDF files for our reference? We will investigate the feasibility of your requirement and share our feedback with you.