Quality and performance issues in performing OCR on PDF Documents

Muzna_Tariq · June 9, 2017, 6:44am

I am facing the following quality and performance issues in the converted text when I perform OCR on PDF documents using two Aspose APIs: Aspose.Pdf converts the PDF pages to images, and Aspose.OCR performs OCR on the converted images.

Conversion takes too long. OCR on a 253 KB document took 6 minutes when tested in a Visual Studio 2013 project running on Windows 10 (x64).

Spacing in the converted text is incorrect.

Symbols appear in the text in place of words.

I am using the following code to perform OCR, taken from the Recognition documentation:

// Create an instance of Document to load the PDF
var pdfDocument = new Aspose.Pdf.Document("D:/input file.pdf");

// Create an instance of OcrEngine for recognition
var ocrEngine = new Aspose.OCR.OcrEngine();

// Iterate over the pages of the PDF (page indices are 1-based)
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    // Create a MemoryStream to hold the image temporarily
    using (var imageStream = new System.IO.MemoryStream())
    {
        // Create a Resolution object with the DPI value
        var resolution = new Aspose.Pdf.Devices.Resolution(300);

        // Create a JPEG device with the specified resolution and quality,
        // where quality ranges over [0-100] and 100 is the maximum
        var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(resolution, 100);

        // Convert the current page and save the image to the stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);

        // Rewind the stream so it can be read from the start
        imageStream.Position = 0;

        // Set the Image property of OcrEngine to the stream from the previous step
        ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);

        // Perform OCR on one page at a time and append the recognized text
        if (ocrEngine.Process())
        {
            System.IO.File.AppendAllText("output file.txt", ocrEngine.Text.ToString());
        }
    }
}

I have attached two documents: the input file on which OCR was performed and the output file containing the result of the conversion.

Thanks.

ikram.haq · June 9, 2017, 1:35pm

Hi Muzna,

Thank you for your inquiry and for providing the details.

The execution time is high because of the PDF-to-image conversion and the complexity of the images. We have also investigated the issue on our end. During the investigation, we found that the images extracted from the PDF are not of good quality: the text is blurry, and the pages contain handwriting and tabular data. Please note that the current implementation does not support handwriting recognition or extracting data from tables.
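To see how the time splits between page rendering and recognition, each stage can be timed separately. The following is only a minimal sketch: StageTimer is a hypothetical helper, not part of any Aspose API, and the placeholder lambdas stand in for the real calls (for example jpegDevice.Process and ocrEngine.Process from the code in the question).

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Aspose: times a single stage of work.
public static class StageTimer
{
    // Runs the stage, returns its result, and reports the elapsed time.
    public static T Time<T>(Func<T> stage, out TimeSpan elapsed)
    {
        var watch = Stopwatch.StartNew();
        T result = stage();
        watch.Stop();
        elapsed = watch.Elapsed;
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        TimeSpan renderTime, ocrTime;

        // Placeholder for the rendering stage (assumption): swap in the
        // real PDF-to-image call and return the image bytes or stream.
        byte[] image = StageTimer.Time(() => new byte[1024], out renderTime);

        // Placeholder for the recognition stage (assumption): swap in the
        // real OCR call and return its success flag.
        bool ok = StageTimer.Time(() => image.Length > 0, out ocrTime);

        Console.WriteLine("Render: {0} ms, OCR: {1} ms (success: {2})",
            renderTime.TotalMilliseconds, ocrTime.TotalMilliseconds, ok);
    }
}
```

Summing the two timings per page over the whole document should show which stage accounts for most of the total time.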

ikram.haq · July 25, 2017, 10:00am

@Muzna_Tariq,

This is to update you that the issue of reading data in tabular format has been logged in our system with ID OCRNET-2941. The issue ID has been linked to this forum thread, and you will be notified automatically here once an update is available.