I am facing the following quality and performance issues in the converted text when I perform OCR on PDF documents using two Aspose APIs: Aspose.Pdf converts the PDF pages to images, and Aspose.OCR performs OCR on the converted images.
Conversion takes too long. OCR on a 253 KB document took 6 minutes when tested in a Visual Studio 2013 project running on Windows 10 (x64).
Spacing in the converted text is incorrect.
Various symbols appear in the text instead of words.
I am using the following code to perform OCR, based on the Recognition | Documentation page:
//Create an instance of Document to load the PDF
var pdfDocument = new Aspose.Pdf.Document("D:/input file.pdf");
//Create an instance of OcrEngine for recognition
var ocrEngine = new Aspose.OCR.OcrEngine();
//Iterate over the pages of PDF
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
    //Creating a MemoryStream to hold the image temporarily
    using (var imageStream = new System.IO.MemoryStream())
    {
        //Create Resolution object with DPI value
        var resolution = new Aspose.Pdf.Devices.Resolution(300);
        //Create JPEG device with the specified Resolution and Quality
        //where Quality [0-100], 100 is Maximum
        var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(resolution, 100);
        //Convert a particular page and save the image to stream
        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
        imageStream.Position = 0;
        //Set Image property of OcrEngine to the stream obtained from previous step
        ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);
        //Perform OCR operation on one page at a time
        if (ocrEngine.Process())
        {
            System.IO.File.AppendAllText("output file.txt", ocrEngine.Text.ToString());
        }
    }
}
I have attached two documents: the input file on which OCR is performed and the output file containing the result of the conversion.
Thanks.