Hi,
When I run OCR on a 39-page scanned PDF file, I get the following error at around page 19.
Aspose.OCR.OcrException
HResult=0x80131500
Message=Error occurred during recognition.
Source=Aspose.OCR
StackTrace:
at Aspose.OCR.OcrEngine.()
at Aspose.OCR.OcrEngine.Process()
at Aspose.OCR.Examples.CSharp.PerformingandManagingOCR.PerformOCROnPDF.Run() in C:\Users\Bruker\source\repos\Aspose.OCR-for-.NET-master\Examples\CSharp\PerformingandManagingOCR\PerformOCROnPDF.cs:line 76
at Aspose.OCR.Examples.CSharp.RunExamples.Main(String[] args) in C:\Users\Bruker\source\repos\Aspose.OCR-for-.NET-master\Examples\CSharp\RunExamples.cs:line 44
Inner Exception 1:
ArgumentOutOfRangeException: Recognition block bottom edge exceeds image border.
Parameter name: recognition block
I started from the example provided in the git repository and have edited it to try to get satisfactory results. The code currently looks like this:
public class PerformOCROnPDF
{
public static void Run()
{
// ExStart:PerformOCROnPDF
// The path to the documents directory.
string dataDir = RunExamples.GetDataDir_OCR();
Console.WriteLine(dataDir);
//Create an instance of Document to load the PDF
var pdfDocument = new Aspose.Pdf.Document(dataDir + "Sample.pdf");
//Keep repeating the OCR run until one hour has passed
var endTime = DateTime.Now.AddHours(1);
//Create an instance of OcrEngine for recognition
var ocrEngine = new Aspose.OCR.OcrEngine();
var path = dataDir + "result39pagesWithBlankSeparators.txt";
var filters = new Aspose.OCR.CorrectionFilters();
filters.Add(new Aspose.OCR.Filters.RemoveNoiseFilter());
//filters.Add(new Aspose.OCR.Filters.MedianFilter());
//filters.Add(new Aspose.OCR.Filters.GaussBlurFilter());
ocrEngine.Config.CorrectionFilters = filters;
//ocrEngine.Config.DetectTextRegions = true;
ocrEngine.Config.RemoveNonText = true;
ocrEngine.Config.AdjustRotation = AdjustRotationMode.Automatic;
//ocrEngine.Config.DoSpellingCorrection = true;
while (DateTime.Now < endTime)
{
using (var tw = new StreamWriter(path, File.Exists(path)))
{
var st = DateTime.Now;
tw.WriteLine("**** Start OCR processing ****");
tw.WriteLine("**** Started at: " + st.ToShortTimeString());
tw.WriteLine("**** " + ocrEngine.Config.ToString() + " ****");
foreach (Aspose.OCR.Filter f in ocrEngine.Config.CorrectionFilters.Filters)
{
tw.WriteLine("**** " + f.ToString() + " ****");
}
//Iterate over the pages of PDF
for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
{
tw.WriteLine("*********************************");
tw.WriteLine("pdfDocument Page " + pageCount);
tw.WriteLine("*********************************");
//Creating a MemoryStream to hold the image temporarily
using (var imageStream = new System.IO.MemoryStream())
{
//Create Resolution object with DPI value
var resolution = new Aspose.Pdf.Devices.Resolution(150);
//Create PageSize object with A4 size
var pagesize = new Aspose.Pdf.PageSize(Aspose.Pdf.PageSize.A4.Width, Aspose.Pdf.PageSize.A4.Height);
//Create JPEG device with the specified page size and resolution
var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(pagesize, resolution);
//Rotate page. Only use this if you know the rotation angle of the page
//pdfDocument.Pages[pageCount].Rotate = Pdf.Rotation.on90;
//Convert a particular page and save the image to stream
jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
imageStream.Position = 0;
//Set Image property of OcrEngine to the stream obtained from previous step
ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);
//Perform OCR operation on one page at a time
if (ocrEngine.Process())
{
tw.WriteLine(ocrEngine.Text);
}
}
tw.WriteLine("**** Elapsed time: " + (DateTime.Now - st).ToString());
}
}
}
// ExEnd:PerformOCROnPDF
}
}
Is there anything I should change to get this to work? Or is this a known bug when handling scanned documents, or a known limitation of the trial license?
Please let me know what you recommend. I am currently evaluating which provider I can use for splitting large PDF files on blank separator pages.
Regards,
Torgeir