RecognizePDF Result Is Null

tjbarber573 · March 30, 2022, 7:15pm

I’m looking for a solution that can convert some old scanned tables into Excel sheets. I’m trying Aspose.OCR, but the recognition result comes back null and the save call then errors out. I have a temporary license. Here is the test code I’m using.

var output = @"C:\.NET Applications\OCRTest\OCRTest\";
var pdf = @"C:\.NET Applications\OCRTest\OCRTest\test4.pdf";

Aspose.OCR.License license = new Aspose.OCR.License();

license.SetLicense("Aspose.OCR.NET.lic");

var api = new AsposeOcr();

var settings = new DocumentRecognitionSettings();
settings.StartPage = 0;
settings.PagesNumber = 1;
settings.LinesFiltration = true;
settings.DetectAreas = true;

var res = api.RecognizePdf(pdf, settings);

AsposeOcr.SaveMultipageDocument(output + "test2.xlsx", SaveFormat.Xlsx, res);

Any ideas on what the issue could be? Is there a way to find out whether this PDF is even readable? Other OCR programs have at least been able to return some results.

asad.ali · March 31, 2022, 1:22pm

@tjbarber573

Would you please share your sample PDF document for our reference? We will test the scenario in our environment and address it accordingly.

asad.ali · April 1, 2022, 8:01pm

@tjbarber573

Thanks for sharing the sample PDF in a private message. We have checked it and found that it has a complex layout and table structure. We need to investigate how a similar layout can be achieved after performing OCR and exporting the results to Excel format. For this purpose, an investigation ticket, OCRNET-489, has been logged in our issue management system. We will look into it further and let you know as soon as the ticket is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.

tjbarber573 · April 1, 2022, 8:30pm

I appreciate your help! Thank you for your time.

asad.ali · April 8, 2022, 7:38pm

@tjbarber573

We fixed the bug that caused the error, but the fix will only be available in release 22.4, which is due in about 2 weeks.
We have added the ability to better recognize the structure of table documents, so you can try this feature (see the code examples).
We plan to improve Excel file creation, so in the next release (22.4) you should get a better Excel file.

List<RecognitionResult> res = api.RecognizePdf(
    @"test4.pdf",
    new DocumentRecognitionSettings { DetectAreasMode = DetectAreasMode.COMBINE });

or

List<RecognitionResult> res = api.RecognizePdf(
    @"test4.pdf",
    new DocumentRecognitionSettings { DetectAreasMode = DetectAreasMode.PHOTO });

tjbarber573 · April 25, 2022, 2:24pm

Hello!

I’ve updated the package and am currently testing. Right now it seems to hang when trying to convert the PDF. Is there a problem with how I am converting the PDF? This is the code I’m using to test.

var output = @"C:\.NET Applications\OCRTest\OCRTest\";
var pdf = @"C:\.NET Applications\OCRTest\OCRTest\test4.pdf";

// Aspose
Aspose.OCR.License license = new Aspose.OCR.License();

license.SetLicense("Aspose.OCR.NET.lic");

var api = new AsposeOcr();

var settings = new DocumentRecognitionSettings();
settings.StartPage = 0;
settings.PagesNumber = 1;
settings.LinesFiltration = true;
settings.DetectAreas = true;
settings.DetectAreasMode = DetectAreasMode.COMBINE;

var res = api.RecognizePdf(pdf, settings);

AsposeOcr.SaveMultipageDocument(output + "test2.xlsx", SaveFormat.Xlsx, res);

asad.ali · April 25, 2022, 10:48pm

@tjbarber573

We were able to reproduce the same behavior in our environment using version 22.4. Therefore, a separate issue, OCRNET-501, has been logged in our issue tracking system. We will check it in further detail and let you know once it is resolved.

We apologize for the inconvenience.

asad.ali · May 4, 2022, 7:46pm

@tjbarber573

We have investigated the earlier logged ticket. For this file, please try the following setting:

LinesFiltration = false;

This file contains many small lines, so the line filtering algorithm, and with it the whole recognition process, takes a long time.
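
Applied to the test code posted earlier in this thread, the suggestion might look like the sketch below. It uses only the calls and settings that already appear in the thread; the file names are the ones from the original post. The null/empty check before saving is an added assumption about how a failed recognition shows up (based on the null result reported at the start of the thread), not documented library behavior.

```csharp
using System;
using System.Collections.Generic;
using Aspose.OCR;

class Program
{
    static void Main()
    {
        // Apply the license file, as in the original test code.
        var license = new Aspose.OCR.License();
        license.SetLicense("Aspose.OCR.NET.lic");

        var api = new AsposeOcr();

        // Same settings as before, but with line filtration switched off
        // so the many small lines in this file are not processed one by one.
        var settings = new DocumentRecognitionSettings
        {
            StartPage = 0,
            PagesNumber = 1,
            LinesFiltration = false,
            DetectAreasMode = DetectAreasMode.COMBINE
        };

        List<RecognitionResult> res = api.RecognizePdf("test4.pdf", settings);

        // Assumption: a failed recognition surfaces as a null or empty list,
        // so check before handing the result to the exporter.
        if (res == null || res.Count == 0)
        {
            Console.WriteLine("No recognition results were returned.");
            return;
        }

        AsposeOcr.SaveMultipageDocument("test2.xlsx", SaveFormat.Xlsx, res);
    }
}
```

If the call still takes too long with line filtration off, keeping PagesNumber at 1 while testing limits the work to a single page.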

tjbarber573 · May 16, 2022, 4:12pm

Hello! I will try that. However, the trial period has since expired. Is there a way to get another trial license to see if this works?

Thank you

asad.ali · May 16, 2022, 8:34pm

@tjbarber573

Sure, you can post your request for a trial license extension in our Purchase Forum, and you will be assisted accordingly.