Unhandled exception Aspose.OCR.Exception


#1

Hi,

When I try to perform OCR on a 39-page scanned PDF file, I get the following error at around page 19:
Aspose.OCR.OcrException
HResult=0x80131500
Message=Error occurred during recognition.
Source=Aspose.OCR
StackTrace:
at Aspose.OCR.OcrEngine.()
at Aspose.OCR.OcrEngine.Process()
at Aspose.OCR.Examples.CSharp.PerformingandManagingOCR.PerformOCROnPDF.Run() in C:\Users\Bruker\source\repos\Aspose.OCR-for-.NET-master\Examples\CSharp\PerformingandManagingOCR\PerformOCROnPDF.cs:line 76
at Aspose.OCR.Examples.CSharp.RunExamples.Main(String[] args) in C:\Users\Bruker\source\repos\Aspose.OCR-for-.NET-master\Examples\CSharp\RunExamples.cs:line 44

Inner Exception 1:
ArgumentOutOfRangeException: Recognition block bottom edge exceeds image border.
Parameter name: recognition block

I started from the example provided in the Git repository and edited it to try to get satisfactory results. The code currently looks like this:
public class PerformOCROnPDF
{
    public static void Run()
    {
        // ExStart:PerformOCROnPDF
        // The path to the documents directory.
        string dataDir = RunExamples.GetDataDir_OCR();
        Console.WriteLine(dataDir);
        //Create an instance of Document to load the PDF
        var pdfDocument = new Aspose.Pdf.Document(dataDir + "Sample.pdf");

        //Create an instance of OcrEngine for recognition
        var endTime = DateTime.Now.AddHours(1);
        var ocrEngine = new Aspose.OCR.OcrEngine();
        var path = dataDir + "result39pagesWithBlankSeparators.txt";
        var filters = new Aspose.OCR.CorrectionFilters();
        filters.Add(new Aspose.OCR.Filters.RemoveNoiseFilter());
        //filters.Add(new Aspose.OCR.Filters.MedianFilter());
        //filters.Add(new Aspose.OCR.Filters.GaussBlurFilter());            
        ocrEngine.Config.CorrectionFilters = filters;
        //ocrEngine.Config.DetectTextRegions = true;
        ocrEngine.Config.RemoveNonText = true;
        ocrEngine.Config.AdjustRotation = AdjustRotationMode.Automatic;
        //ocrEngine.Config.DoSpellingCorrection = true;

        while (DateTime.Now < endTime)
        {
            using (var tw = new StreamWriter(path, File.Exists(path)))
            {
                var st = DateTime.Now;
                tw.WriteLine("**** Start OCRprocessing ****");
                tw.WriteLine("**** Started at: " + st.ToShortTimeString());
                tw.WriteLine("**** " + ocrEngine.Config.ToString() + " ****");
                foreach (Aspose.OCR.Filter f in ocrEngine.Config.CorrectionFilters.Filters)
                {
                    tw.WriteLine("**** " + f.ToString() + " ****");
                }
                //Iterate over the pages of PDF
                for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
                {
                    tw.WriteLine("*********************************");
                    tw.WriteLine("pdfDocument Page " + pageCount);
                    tw.WriteLine("*********************************");
                    //Creating a MemoryStream to hold the image temporarily
                    using (var imageStream = new System.IO.MemoryStream())
                    {
                        //Create Resolution object with DPI value
                        var resolution = new Aspose.Pdf.Devices.Resolution(150);

                        //Create PageSize object with A4 size
                        var pagesize = new Aspose.Pdf.PageSize(Aspose.Pdf.PageSize.A4.Width, Aspose.Pdf.PageSize.A4.Height);

                        //Create JPEG device with specified attributes (Width, Height, Resolution, Quality)
                        //where Quality [0-100], 100 is Maximum
                        var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(pagesize, resolution);

                        //Rotate page. Only use this if you know the rotation angle of the page
                        //pdfDocument.Pages[pageCount].Rotate = Pdf.Rotation.on90;
                        
                        //Convert a particular page and save the image to stream
                        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);

                        imageStream.Position = 0;

                        //Set Image property of OcrEngine to the stream obtained from previous step
                        ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);

                        //Perform OCR operation on one page at a time
                        if (ocrEngine.Process())
                        {                              
                            tw.WriteLine(ocrEngine.Text);
                        }
                    }
                    tw.WriteLine("**** Elapsed time: " + (DateTime.Now - st).ToString());
                }
            }
        }
        // ExEnd:PerformOCROnPDF
    }
}

Is there anything I should change to get it to work? Or is this a known bug when handling scanned documents, or a known limitation of the trial license?

Please let me know your recommendations. I am currently evaluating which provider I can use for splitting large PDF files on blank separator pages.

Regards,
Torgeir


#2

@brmykleb

Would you kindly share the sample PDF document that you are using? We will test the scenario in our environment and address it accordingly.


#3

Current code:
public class PerformOCROnPDF
{
    public static void Run()
    {
        // ExStart:PerformOCROnPDF
        // The path to the documents directory.
        string dataDir = RunExamples.GetDataDir_OCR();
        Console.WriteLine(dataDir);
        //Create an instance of Document to load the PDF
        var pdfDocument = new Aspose.Pdf.Document(dataDir + "Sample.pdf");

        //Create an instance of OcrEngine for recognition
        var endTime = DateTime.Now.AddHours(1);
        var ocrEngine = new Aspose.OCR.OcrEngine();
        var path = dataDir + "result39pagesWithBlankSeparators.txt";
        var filters = new Aspose.OCR.CorrectionFilters();
        filters.Add(new Aspose.OCR.Filters.RemoveNoiseFilter());
        //filters.Add(new Aspose.OCR.Filters.MedianFilter());
        //filters.Add(new Aspose.OCR.Filters.GaussBlurFilter());            
        ocrEngine.Config.CorrectionFilters = filters;
        //ocrEngine.Config.DetectTextRegions = true;
        ocrEngine.Config.RemoveNonText = true;
        ocrEngine.Config.AdjustRotation = AdjustRotationMode.Automatic;
        //ocrEngine.Config.DoSpellingCorrection = true;            
        var iteration = 0;
        while (iteration < 2)
        {
            using (var tw = new StreamWriter(path, File.Exists(path)))
            {
                var st = DateTime.Now;
                tw.WriteLine("**** Start OCRprocessing ****");
                tw.WriteLine("**** Started at: " + st.ToShortTimeString());
                tw.WriteLine("**** " + ocrEngine.Config.ToString() + " ****");
                foreach (Aspose.OCR.Filter f in ocrEngine.Config.CorrectionFilters.Filters)
                {
                    tw.WriteLine("**** " + f.ToString() + " ****");
                }
                //Iterate over the pages of PDF
                for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
                {
                    tw.WriteLine("*********************************");
                    tw.WriteLine("pdfDocument Page " + pageCount);
                    tw.WriteLine("*********************************");
                    //Creating a MemoryStream to hold the image temporarily
                    using (var imageStream = new System.IO.MemoryStream())
                    {                          
                            
                        //Create Resolution object with DPI value
                        var resolution = new Aspose.Pdf.Devices.Resolution(300);

                        //Create PageSize object with A4 size
                        //var pagesize = new Aspose.Pdf.PageSize(Aspose.Pdf.PageSize.A4.Width, Aspose.Pdf.PageSize.A4.Height);

                        //Create JPEG device with specified attributes (Width, Height, Resolution, Quality)
                        //where Quality [0-100], 100 is Maximum
                        //var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(pagesize, resolution);
                        var jpegDevice = new Aspose.Pdf.Devices.JpegDevice(resolution, 100);

                        //Rotate page. Only use this if you know the rotation angle of the page
                        //pdfDocument.Pages[pageCount].Rotate = Pdf.Rotation.on90;
                        
                        //Convert a particular page and save the image to stream
                        jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);

                        imageStream.Position = 0;

                        //Set Image property of OcrEngine to the stream obtained from previous step
                        ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imageStream, Aspose.OCR.ImageStreamFormat.Jpg);

                        //Perform OCR operation on one page at a time
                        try
                        {
                            if (ocrEngine.Process())
                            {
                                tw.WriteLine(ocrEngine.Text);
                            }
                        }
                        catch (OcrException e)
                        {
                            tw.WriteLine("OCRException thrown: " + e.Message);
                            ocrEngine.Dispose();
                            ocrEngine = new Aspose.OCR.OcrEngine();
                            ocrEngine.Config.CorrectionFilters = filters;
                            //ocrEngine.Config.DetectTextRegions = true;
                            ocrEngine.Config.RemoveNonText = true;
                            ocrEngine.Config.AdjustRotation = AdjustRotationMode.Automatic;
                            ocrEngine.LanguageContainer.ResetToDefaults();
                            continue;
                        }
                    }
                    tw.WriteLine("**** Elapsed time: " + (DateTime.Now - st).ToString());
                }
            }
            iteration++;
        }
        // ExEnd:PerformOCROnPDF
    }
}
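
The catch block above writes only e.Message, so the log shows "Error occurred during recognition." and not the inner ArgumentOutOfRangeException from the first post. Below is a minimal sketch of a helper that writes the whole exception chain, using only standard .NET types. The names ExceptionLog and Describe are hypothetical and not part of Aspose.OCR.

```csharp
using System;
using System.Text;

public static class ExceptionLog
{
    // Builds one entry per exception in the chain: the outer exception first,
    // then each InnerException, indented by two spaces per nesting level.
    public static string Describe(Exception ex)
    {
        var sb = new StringBuilder();
        var depth = 0;
        for (var current = ex; current != null; current = current.InnerException)
        {
            if (depth > 0)
            {
                sb.AppendLine();
            }
            sb.Append(new string(' ', depth * 2));
            sb.Append(current.GetType().FullName);
            sb.Append(": ");
            sb.Append(current.Message);
            depth++;
        }
        return sb.ToString();
    }
}
```

In the catch block, tw.WriteLine("OCRException thrown: " + ExceptionLog.Describe(e)); would then record the inner exception next to the page number.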

Test result (ocrEngine.Text):
soon as plasslble by means of
snorelltilnk(s) mellylsurlmen!I and on blIhalf ot our phrllelpals !I herebli! rlserve the llght to revelrt orll the
mader in case of any abnormai sholrtllollllI.
l
l
llilleml[r II llyllo SGS Gllyllup (who Glllllto w sllirlt)
lloClo.01011tol.18
'n

[
[

I
n

**** Elapsed time: 00:48:13.1184516


pdfDocument Page 32


OCRException thrown: Error occurred during recognition.


pdfDocument Page 33


I
**** Elapsed time: 00:48:27.1277581


pdfDocument Page 34


K m.am M AP nm
y
Vessel
llylgo
Weillyllhe

rhe sllyllles illydicilated llalllallm were received frallyl the vessel.
–{
I
I
-]
rhe ilabove sllyllples for ollylr retent ion lly!Id received frallyl vessel will be retilained for 3 llyllnths unless
otherllyli se infollyi
o
o
mmammam (IIIIILING mpoRT)
oo Inspection Noo ATITIT

  • Terllyllinilall?ort : -
  • Dilate r -
  • Sloipl ing r -
    SITIIILING IIIIIP0RT
    oo
    o
    o
    o
    o

-{
G
-low-.-.-D…y–
xmam: 2Ol4-01-Ol
I Fur loillding
e
oor unloallding
I
I

mlm/mlm
**** Elapsed time: 00:49:37.3968001


pdfDocument Page 35


OCRException thrown: Error occurred during recognition.


pdfDocument Page 36


I
I
I
**** Elapsed time: 00:49:51.9248218


pdfDocument Page 37


n
]
T

m w–.i T —
r - - – - - - —
T
I I
–y
]
o
-] ‘`’’ –
l

la lllli)lly uppaa-c M -

t— - -
I
wca iahlylll .,-.- - Pi E IIIIIIIIII u-ollyllt llillowovw ooo-a’w -
'I
mmb)r of ah) ssS s,oup (SIIIII)a) llllly})mlllly II Sullaaallrllilly)
ollm1 0I011T-213. 1 R

— y


I
'l
l
**** Elapsed time: 00:51:13.9103875


pdfDocument Page 38


OCRException thrown: Error occurred during recognition.


pdfDocument Page 39


u
]
l-- ‘I
5tr!T!TyuFenee’?!P!2 Tabe-derpo }
n Thn snmpyns yndicated MyM nre latt lr your custody Ducanlra to lrstns mnr our cyynntu An lor your raunst.
Y t e ne ey, teet !!r;.ater:e:2 r’;.; 2%2 2e eeundera '`
Dntu Swt s)whych rhn proaducar or wnr al lhn product nrunt rrkn nvahblu m yw nnu nny oth aplyrtn
]
Your umnnture ynd4tns thnt you hnve rene ntle ununlltood thyb noll. yf yau hnvn nny qunskno. Plonsn cayy rhn
SSS oll yneycatne nbovn.
n Thn Wanrpyus ynui-Iod ! - '-re nscyved from IIIN volnllila m tly vessny:
omer No. :
T - — =-
Sampiil10 Raport p 1/yth ynslimon a M sampyyng akr14 a M prempying
Snmpyyng Mntl*I - 4-- =~- el
– l
I–


m - --4 r —u mm u—u ------ w- w-----= w- - -w- -
c–w-.way ,-mm-m-. -w–I—

wm*r ot tw sss sroip (socInta aunarnta ua su-utnw)
osarm1=)1171}.1R

v - = 9 -
I
—}
l

I
**** Elapsed time: 00:55:31.4618636

Sample.pdf (6.1 MB)


#4

@brmykleb

Could you kindly share the .txt file that was generated at your end? Please attach it to your post in ZIP format.


#5

Here is the text file. It has been appended to over several tests; the two iterations from the last test are at the bottom.

result39pagesWithBlankSeparators.zip (95.7 KB)


#6

@brmykleb

Thank you for sharing the requested file.

We are checking this and will get back to you soon.


#7

Any news on this one?

It might be a coincidence, but I have noticed that in the last two iterations in the .txt file, whenever OcrEngine threw an error, the page processed just before it had a black field at the bottom.
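
To check this against the log rather than by eye, something like the following could list the failing pages, so that each preceding page can be inspected in the PDF. This is only a sketch: it assumes the log format written by the code earlier in the thread (a "pdfDocument Page N" line per page and an "OCRException thrown" line on failure), and the names FailureScan and FailingPages are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FailureScan
{
    // Returns the page numbers whose log section contains an
    // "OCRException thrown" line, in the order they appear in the log.
    public static List<int> FailingPages(IEnumerable<string> logLines)
    {
        var pageHeader = new Regex(@"^\s*pdfDocument Page (\d+)\s*$");
        var failing = new List<int>();
        var currentPage = -1; // -1 means no page header has been seen yet

        foreach (var line in logLines)
        {
            var match = pageHeader.Match(line);
            if (match.Success)
            {
                currentPage = int.Parse(match.Groups[1].Value);
            }
            else if (currentPage > 0 && line.Contains("OCRException thrown"))
            {
                failing.Add(currentPage);
            }
        }
        return failing;
    }
}
```

Calling FailureScan.FailingPages(System.IO.File.ReadLines(path)) on the result file gives the failing page numbers; because the file holds several iterations, the same page can appear more than once.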

Regards,
Torgeir


#8

@brmykleb

We tested the scenario in our environment with both Aspose.OCR for .NET 17.11 and 19.9. We observed the same exception with version 17.11 and a different exception with version 19.9. We have logged an issue as OCR-811 in our issue tracking system for further investigation of this scenario.

We will investigate the issue in detail and keep you posted on the status of its correction. Please be patient and spare us a little time.

We are sorry for the inconvenience.