Reading text from Scanned PDF

AjayPrasad · February 25, 2016, 3:43pm

Hello there,

I am evaluating Aspose.PDF and Aspose.OCR and am having some trouble extracting text from a scanned PDF.

Business scenario: We regularly receive scanned invoices as PDFs from vendors, and we need OCR to extract the values from those invoices.

I am trying to replicate the scenario here by scanning one of the example PDFs found from the Aspose.PDF project downloaded from GitHub. Please find the attached “Scanned.pdf”.

Aspose.OCR does not support the PDF file format, so as a workaround I am trying the following:

1. Convert the PDF to JPEG in memory.

2. Use Aspose.OCR to get the text from the converted JPEG.

However, the output after these steps is incorrect and contains junk values. Please find the attached screenshot of the output, "OCREngine’s text".

Here is the code snippet

Aspose.OCR.License license = new Aspose.OCR.License();
license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

// Create an instance of Document to load the PDF
Document pdfDocument = new Document(@"C:\PDFs\Scanned.pdf");

// Create an instance of OcrEngine for recognition
OcrEngine ocrEngine = new OcrEngine();

// Iterate over the pages of the PDF
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    // Create a MemoryStream to hold the image temporarily
    using (MemoryStream imageStream = new MemoryStream())
    {
        // Create Resolution object
        Resolution resolution = new Resolution(300);
        JpegDevice jpegDevice = new JpegDevice();

        // Convert a particular page and save the image to the stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
        imageStream.Position = 0;

        // Set the Image property of OcrEngine to the stream obtained from the previous step
        ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

        // Perform the OCR operation on one page at a time
        if (ocrEngine.Process())
        {
            Console.WriteLine(ocrEngine.Text);
        }
    }
}

Looking forward to your assistance on this.

Regards,

Ajay

ikram.haq · February 26, 2016, 3:49am

Hi Ajay,

Thank you for your inquiry and for providing the sample code and files.

Please add the following lines of code to your sample, and you should get correct results instead of junk values. The direction of the text is important when performing an OCR operation.

Feel free to contact us if you have any questions or comments.

CODE:

using (var image = Aspose.Imaging.Image.Load(@"Scanned.tif"))
{
    image.RotateFlip(Aspose.Imaging.RotateFlipType.Rotate90FlipNone);
    image.Save();
}

AjayPrasad · February 26, 2016, 2:47pm

Thanks. That helped a bit. I still see jumbled and misspelt words. Refer to the attachment.

Here is what I did.

1. Read PDF

2. Convert to Jpeg in Memory

3. Rotate the image by 90 degrees (per your suggestion)

4. Get the OCR engine to process the image.

Code snippet of the complete method. Let me know if I have missed anything.

public static void Run()
{
    Aspose.OCR.License license = new Aspose.OCR.License();
    license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

    // Create an instance of Document to load the PDF
    Document pdfDocument = new Document(@"C:\PDFs\Scanned.pdf");

    // Create an instance of OcrEngine for recognition
    OcrEngine ocrEngine = new OcrEngine();

    // Iterate over the pages of the PDF
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        // Create a MemoryStream to hold the image temporarily
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Create Resolution object
            Resolution resolution = new Resolution(300);
            JpegDevice jpegDevice = new JpegDevice();

            // Convert a particular page and save the image to the stream
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            imageStream.Position = 0;

            // Rotate the image by 90 degrees, as suggested
            using (var image = Aspose.Imaging.Image.Load(imageStream))
            {
                image.RotateFlip(Aspose.Imaging.RotateFlipType.Rotate90FlipNone);
                image.Save();
            }

            // Set the Image property of OcrEngine to the stream obtained from the previous step
            ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

            // Perform the OCR operation on one page at a time
            if (ocrEngine.Process())
            {
                Console.WriteLine(ocrEngine.Text);
            }
        }
    }
}

AjayPrasad · February 26, 2016, 3:11pm

Following your suggestion, if the image is rotated in code, the junk values still appear (see the attachment in my previous post). However, if the scanned PDF already has the text upright (facing us), the results are much better apart from the spelling mistakes. Attachment: RotatedTheImageInPDF.JPG

Code Snippet:

public static void Run()
{
    Aspose.OCR.License license = new Aspose.OCR.License();
    license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

    // Create an instance of Document to load the PDF
    Document pdfDocument = new Document(@"C:\PDFs\ScannedForAspose.pdf");

    // Create an instance of OcrEngine for recognition
    OcrEngine ocrEngine = new OcrEngine();

    // Iterate over the pages of the PDF
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        // Create a MemoryStream to hold the image temporarily
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Create Resolution object
            Resolution resolution = new Resolution(300);
            JpegDevice jpegDevice = new JpegDevice();

            // Convert a particular page and save the image to the stream
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            imageStream.Position = 0;

            ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

            // Perform the OCR operation on one page at a time
            if (ocrEngine.Process())
            {
                Console.WriteLine(ocrEngine.Text);
            }
        }
    }
}

ikram.haq · February 29, 2016, 5:10am

Hi Ajay,

Thank you for writing back with the sample files and results.

We tested the scenario on our end and found that the issue persists. It has been logged in our issue tracking system with ID IMAGING-35226. Our product team will look into it further, and we will update you via this thread.

Please also share the sample PDF that contains the text facing us. This will help us with the investigation.

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.