Hello there,
I am evaluating Aspose.PDF and Aspose.OCR licenses and am having trouble extracting text from a scanned PDF.
Business scenario: We generally receive scanned invoices as PDFs from vendors, and we need OCR to extract the values from them.
I am trying to replicate the scenario here by scanning one of the example PDFs from the Aspose.PDF project downloaded from GitHub. Please find the attached “Scanned.pdf”.
Aspose.OCR does not support the PDF file format, so as a workaround I am doing the following:
1. Convert each PDF page to JPEG in memory.
2. Use Aspose.OCR to extract the text from the converted JPEG.
However, the output after these steps is incorrect and contains junk values. Please find the attached screenshot of the output (the OcrEngine's text).
Here is the code snippet:
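// Namespaces this snippet relies on
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.OCR;

Aspose.OCR.License license = new Aspose.OCR.License();
license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");
// Create an instance of Document to load the PDF
Document pdfDocument = new Document(@"C:\PDFs\Scanned.pdf");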
// Create an instance of OcrEngine for recognition
OcrEngine ocrEngine = new OcrEngine();
// Iterate over the pages of the PDF (page indexes are 1-based)
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    // Create a MemoryStream to hold the image temporarily
    using (MemoryStream imageStream = new MemoryStream())
    {
        // Create a Resolution object (300 DPI) and pass it to the device;
        // the original snippet created it but never used it
        Resolution resolution = new Resolution(300);
        JpegDevice jpegDevice = new JpegDevice(resolution);
        // Convert the current page and save the image to the stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
        imageStream.Position = 0;
        // Set the Image property of OcrEngine to the stream from the previous step
        ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);
        // Perform the OCR operation on one page at a time
        if (ocrEngine.Process())
        {
            Console.WriteLine(ocrEngine.Text);
        }
    }
}
Looking forward to your assistance with this.
Regards,
Ajay