Reading text from Scanned PDF

Hello there,


I am in the process of evaluating Aspose.PDF and Aspose.OCR licenses, and I am having a little trouble extracting text from a scanned PDF.

Business scenario: We generally receive scanned invoices as PDFs from vendors. We need OCR to extract the values from the scanned invoices.

I am trying to replicate the scenario here by scanning one of the example PDFs from the Aspose.PDF project downloaded from GitHub. Please find the attached “Scanned.pdf”.

Aspose.OCR does not support the PDF file format, so as a workaround I am trying to:
1. Convert each PDF page to JPEG in memory.
2. Use Aspose.OCR to get the text from the converted JPEG.

However, the output after the above steps is incorrect and contains junk values. Please find the attached screenshot of the output, "OCREngine’s text".

Here is the code snippet:

// Namespaces this snippet relies on
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.OCR;

Aspose.OCR.License license = new Aspose.OCR.License();
license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

// Create an instance of Document to load the PDF
Document pdfDocument = new Document(@"C:\PDFs\Scanned.pdf");

// Create an instance of OcrEngine for recognition
OcrEngine ocrEngine = new OcrEngine();

// Iterate over the pages of the PDF (page indices are 1-based)
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    // Create a MemoryStream to hold the image temporarily
    using (MemoryStream imageStream = new MemoryStream())
    {
        // Create a Resolution object (300 DPI)
        Resolution resolution = new Resolution(300);

        // Pass the resolution to the device; the original snippet created it but never used it
        JpegDevice jpegDevice = new JpegDevice(resolution);

        // Convert a particular page and save the image to the stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
        imageStream.Position = 0;

        // Set the Image property of OcrEngine to the stream obtained from the previous step
        ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

        // Perform the OCR operation on one page at a time
        if (ocrEngine.Process())
        {
            Console.WriteLine(ocrEngine.Text);
        }
    }
}

Looking forward to your assistance with this.

Regards,
Ajay


Hi Ajay,

Thank you for your inquiry and for providing the sample code and files.

Please add the following lines of code to your sample and you should get correct results instead of junk values. The direction of the text is important when performing the OCR operation.

Feel free to contact us if you have any queries or comments.

CODE:

using (var image = Aspose.Imaging.Image.Load(@"Scanned.tif"))
{
    // Rotate the image by 90 degrees so that the text is upright
    image.RotateFlip(Aspose.Imaging.RotateFlipType.Rotate90FlipNone);
    image.Save();
}
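
For the in-memory workflow described above, the same rotation could be applied to a stream rather than a file on disk. The following is only a minimal sketch under some assumptions: `RotationSketch` and `RotateJpeg` are hypothetical names introduced for illustration, the rotated image is written to a second stream with `JpegOptions` instead of being saved in place, and the source stream is assumed to remain usable while the image is loaded from it.

```csharp
using System.IO;
using Aspose.Imaging;
using Aspose.Imaging.ImageOptions;

public static class RotationSketch
{
    // Hypothetical helper (not part of any Aspose API): returns a new stream
    // holding the JPEG read from 'source', rotated by the given amount.
    public static MemoryStream RotateJpeg(Stream source, RotateFlipType rotation)
    {
        source.Position = 0; // read the input from the beginning
        var rotated = new MemoryStream();
        using (Image image = Image.Load(source))
        {
            image.RotateFlip(rotation);
            // Write the result to a separate stream instead of saving in place
            image.Save(rotated, new JpegOptions());
        }
        rotated.Position = 0; // rewind so the next reader starts at byte 0
        return rotated;
    }
}
```

The returned stream could then be handed to the OCR engine in place of the original one, for example via `ImageStream.FromStream(RotationSketch.RotateJpeg(imageStream, RotateFlipType.Rotate90FlipNone), ImageStreamFormat.Jpg)`. Which `RotateFlipType` value is appropriate depends on how the page was scanned.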

Thanks. That helped a bit. I still see jumbled and misspelt words. Refer to the attachment.


Here is what I did.
1. Read PDF
2. Convert to Jpeg in Memory
3. Rotate the image by 90 degrees (per your suggestion)
4. Let the OCR engine process the image.

Code snippet of the complete method. Let me know if I have missed anything.


public static void Run()
{
    Aspose.OCR.License license = new Aspose.OCR.License();
    license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

    // Create an instance of Document to load the PDF
    Document pdfDocument = new Document(@"C:\PDFs\Scanned.pdf");

    // Create an instance of OcrEngine for recognition
    OcrEngine ocrEngine = new OcrEngine();

    // Iterate over the pages of the PDF (page indices are 1-based)
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        // Create a MemoryStream to hold the image temporarily
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Create a Resolution object (300 DPI)
            Resolution resolution = new Resolution(300);

            // Pass the resolution to the device; the original snippet created it but never used it
            JpegDevice jpegDevice = new JpegDevice(resolution);

            // Convert a particular page and save the image to the stream
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            imageStream.Position = 0;

            // Rotate the image by 90 degrees, as suggested
            using (var image = Aspose.Imaging.Image.Load(imageStream))
            {
                image.RotateFlip(Aspose.Imaging.RotateFlipType.Rotate90FlipNone);
                image.Save();
            }

            // Rewind the stream again so that the OCR engine reads it from the start
            imageStream.Position = 0;

            // Set the Image property of OcrEngine to the stream obtained from the previous step
            ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

            // Perform the OCR operation on one page at a time
            if (ocrEngine.Process())
            {
                Console.WriteLine(ocrEngine.Text);
            }
        }
    }
}


Following your suggestion, if the image is rotated in code, the junk values are still present (see the attachment in my previous post). However, if the scanned PDF already has the text upright (facing the reader), the results are much better, apart from some spelling mistakes. Attachment: <<RotatedTheImageInPDF.JPG>>


Code Snippet:

public static void Run()
{
    Aspose.OCR.License license = new Aspose.OCR.License();
    license.SetLicense(@"C:\Licenses\Aspose.OCR.lic");

    // Create an instance of Document to load the PDF
    Document pdfDocument = new Document(@"C:\PDFs\ScannedForAspose.pdf");

    // Create an instance of OcrEngine for recognition
    OcrEngine ocrEngine = new OcrEngine();

    // Iterate over the pages of the PDF (page indices are 1-based)
    for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
    {
        // Create a MemoryStream to hold the image temporarily
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Create a Resolution object (300 DPI)
            Resolution resolution = new Resolution(300);

            // Pass the resolution to the device; the original snippet created it but never used it
            JpegDevice jpegDevice = new JpegDevice(resolution);

            // Convert a particular page and save the image to the stream
            jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            imageStream.Position = 0;
            ocrEngine.Image = ImageStream.FromStream(imageStream, ImageStreamFormat.Jpg);

            // Perform the OCR operation on one page at a time
            if (ocrEngine.Process())
            {
                Console.WriteLine(ocrEngine.Text);
            }
        }
    }
}

Hi Ajay,

Thank you for writing back to us and for sharing the sample files and results.

We have tested the scenario on our end and found that the issue persists. It has been logged in our issue tracking system with ID IMAGING-35226. Our product team will look into it further, and we will update you via this thread.

Could you also share the sample PDF in which the text is upright (facing the reader)? This will help us with the investigation.

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.