Presales PDF Technical questions

ecarley · March 17, 2022, 11:58pm

Sales told me to post my “technical” questions here. I do not think any of them are very technical, but here they are.

Does Aspose have a PDF viewer that can be used to display and interact with PDF files?
What OCR engine is used by Aspose? Is the OCR engine using Tesseract, another OCR vendor engine (e.g. Kofax, ABBYY, etc.), or something created by Aspose?
How can I test the OCR accuracy without installing the SDK trial version?
Does the OCR identify regions or table data?
Does the PDF Data extraction use relational positioning in the searchable text?
Does the PDF OCR use relational positioning in its results file?

asad.ali · March 18, 2022, 11:53am

@ecarley

No, Aspose.PDF does not offer any viewer or control to view the PDF. Instead, you can use it in the code behind to create and manipulate PDF documents.

As for OCR on scanned PDFs, Aspose.PDF does not include an OCR engine of its own. It lets you plug in a third-party OCR engine such as Tesseract to extract text from scanned PDFs, so any further OCR features depend on the engine you use. Please note that the OCR feature in Aspose.PDF was designed to work with any OCR engine (via a callback), but so far we have tested it only with Tesseract. Please check the code snippet below:

// C# Code
using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

static void Main(string[] args)
{
 var doc = new Document("c:/temp/test_10.pdf");
 // The callback receives a page image and returns hOCR text for it
 doc.Convert(CallBackGetHocr);
 doc.Save("C:/temp/output_10.pdf");
}
//********************* CallBackGetHocr method ***********************//
static string CallBackGetHocr(System.Drawing.Image img)
{
 string dir = @"C:\temp";
 // Path.Combine inserts the separator that plain concatenation would miss
 string imagePath = Path.Combine(dir, "ocrtest.jpg");
 string outputBase = Path.Combine(dir, "out");
 img.Save(imagePath);
 ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = ProcessWindowStyle.Hidden;
 // Arguments: <input image> <output base name> hocr (Tesseract appends ".hocr")
 info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
 using (Process p = new Process())
 {
  p.StartInfo = info;
  p.Start();
  p.WaitForExit();
 }
 // Read the hOCR file produced by Tesseract and hand it back to Aspose.PDF
 return File.ReadAllText(outputBase + ".hocr");
}

At the moment, Aspose.PDF supports only the hOCR format as input for embedding a hidden text layer in scanned PDF pages. Furthermore, if you already have an hOCR file and want to embed it in the PDF, you can use the code snippet below:

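// C# Code
// dataDir must end with a path separator because the file names are appended directly
using (var pdf = new Aspose.Pdf.Document(dataDir + @"Scanned.pdf"))
{
 // Return the existing hOCR text instead of running an OCR engine
 pdf.Convert((image) =>
 {
  return File.ReadAllText(dataDir + @"sample.hocr");
 });
 pdf.Save(dataDir + "test_searchable.pdf");
}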
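
Since hOCR is the only input format mentioned above, it may help to see what such a file carries. Tesseract's hOCR output marks each recognized word with a span of class ocrx_word whose title attribute holds a pixel bounding box ("bbox left top right bottom"). The following is only a rough sketch for inspecting those boxes: HocrWord and HocrSketch are hypothetical names, not part of Aspose.PDF or Tesseract, and the regular expression assumes the class attribute appears before the title attribute, as in Tesseract's output. A real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper type, not part of Aspose.PDF or Tesseract.
public class HocrWord
{
    public string Text;
    public int Left, Top, Right, Bottom;
}

public static class HocrSketch
{
    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    // Assumes class comes before title inside the tag.
    private static readonly Regex WordPattern = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.Singleline);

    public static List<HocrWord> ExtractWords(string hocr)
    {
        var words = new List<HocrWord>();
        foreach (Match m in WordPattern.Matches(hocr))
        {
            // Drop nested markup such as <strong> or <em>, then decode entities.
            string text = Regex.Replace(m.Groups[5].Value, "<[^>]+>", "");
            text = WebUtility.HtmlDecode(text).Trim();
            words.Add(new HocrWord
            {
                Text = text,
                Left = int.Parse(m.Groups[1].Value),
                Top = int.Parse(m.Groups[2].Value),
                Right = int.Parse(m.Groups[3].Value),
                Bottom = int.Parse(m.Groups[4].Value)
            });
        }
        return words;
    }

    public static void Main()
    {
        string sample =
            "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>";
        foreach (HocrWord w in ExtractWords(sample))
        {
            // prints: Hello at (36,92)-(96,116)
            Console.WriteLine($"{w.Text} at ({w.Left},{w.Top})-({w.Right},{w.Bottom})");
        }
    }
}
```

Running ExtractWords on the out.hocr file from the first snippet is one way to check what positional data your OCR engine produced before the text is embedded in the PDF.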