Aspose.OCR: Extracting text from non-searchable PDFs not working properly

hasanirmak · April 15, 2024, 12:16pm

Hello support team

We want to convert non-searchable PDFs into searchable PDFs with the help of Aspose.OCR (see Sample_Doc.pdf in the attachment).

In our solution, the output is completely black after the conversion with Aspose.OCR. The content of the PDF is destroyed and no longer usable for the customer (see not_wanted_result.pdf in the attachment).

I also ran the same PDF through the Aspose Cloud solution at “OCR Online. Convert PDF to Searchable PDF”, and only part of the content was recognized (see Aspose_Cloud_result.jpg in the attachment).

I also tested the same PDF with the sample application from “GitHub - aspose-ocr/Aspose.OCR-for-.NET: Aspose.OCR for .NET examples, plugins and showcase projects” and got no text at all, although the conversion completed without an error (see Aspose_App_Result.jpg in the attachment).

Are you aware of such a case?
If the content cannot be read and converted, what do we have to do to catch the error or the empty result, so that we leave the original as it is and do not pass on a corrupt document?

Customer_Attachments.zip (1.0 MB)

Thanks in advance for a possible solution or answer
Best regards
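
The fallback asked about in this post (keep the original when OCR fails or yields nothing) could be sketched as follows. This is only an illustration under stated assumptions: `OcrFallback`, `ConvertOrKeepOriginal` and the `convert` delegate are hypothetical names, not part of Aspose.OCR. The delegate stands in for whatever code performs the recognition and writes the searchable PDF.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class OcrFallback
{
    // Hypothetical helper, not an Aspose.OCR API.
    // `convert` stands in for the real OCR call: it writes the converted
    // document into the supplied stream and returns the text of each page.
    public static byte[] ConvertOrKeepOriginal(
        byte[] original,
        Func<byte[], Stream, IReadOnlyList<string>> convert)
    {
        try
        {
            using var output = new MemoryStream();
            IReadOnlyList<string> pageTexts = convert(original, output);

            // Treat "no pages", "only blank text" and "nothing written" as failure.
            bool hasText = pageTexts != null
                && pageTexts.Any(t => !string.IsNullOrWhiteSpace(t));
            bool hasOutput = output.Length > 0;

            return hasText && hasOutput ? output.ToArray() : original;
        }
        catch (Exception)
        {
            // OCR threw: hand back the untouched original (log the error in real code).
            return original;
        }
    }
}
```

With a wrapper like this, the caller always receives a usable document: either the converted one or the unchanged input.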

asad.ali · April 15, 2024, 11:52pm

@hasanirmak

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-827

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

hasanirmak · April 29, 2024, 10:50am

Hello Support Team
Thank you for opening the ticket.
Has anything changed in the meantime or have you been able to reproduce the problem?

Best Regards

asad.ali · April 29, 2024, 11:14pm

@hasanirmak

The issue is resolved when using the code below with version 24.4. The results are also attached:

// `filter` (a PreprocessingFilter) and `api` (an AsposeOcr instance) are created earlier (not shown).
OcrInput input = new OcrInput(InputType.PDF, filter);
string imgPath = @"D:\imgs\ISSUES\NET827\Sample_Doc.pdf";
input.Add(imgPath);
List<RecognitionResult> result = api.Recognize(input, new RecognitionSettings
{
  DetectAreasMode = DetectAreasMode.TABLE
});

result.zip (610.4 KB)

hasanirmak · July 4, 2024, 3:22pm

Hi again

Thanks for the suggestion. This setting solved my problem with non-searchable PDFs, but after upgrading the Aspose components I can no longer open the converted images (tif, jpg, bmp, png, …).
As soon as I try to open a converted image with Acrobat, I get the message “There was an error opening this document. This file can not be opened because it has no pages”, although the file is not 0 KB.
Error while opening convertet image file.gif (26.4 KB)

In the code, even with the following recognition settings, I see no OCR result for images. I always get 0 and cannot open the converted document.

Programcode example used for converting.gif (175.5 KB)

Sample code and some images are attached.
Image samples.zip (1.1 MB)

Thanks in advance for any help

asad.ali · July 4, 2024, 11:33pm

@hasanirmak

We are checking it and will get back to you shortly.

hasanirmak · July 5, 2024, 7:54am

Hi

I think I have found the problem. It is the start page value in the Input.Add method. Can you tell me what the correct start index for this method is?

var pagecount = doc.Pages.Count();

var documentSettings = OcrHelper.SetDocumentRecognitionSettings(ocrImageSettings, "IMAGE");
var outputFileContent = new MemoryStream();

AsposeOcr api = new AsposeOcr();

OcrInput input = new OcrInput(InputType.PDF);

input.Add(originalFileContent, 1 , pagecount );

I am using 1 as the start page for PDFs, but this does not work for images, and I get no OCR result for them.

Can you please tell me how to determine the correct start page so that it works for both PDFs and all image types?
Or does it also depend on which settings we use for image recognition?

  public static RecognitionSettings SetDocumentRecognitionSettings(OcrModel ocrSettings, string docType)
  {
      RecognitionSettings settings = new RecognitionSettings();

      if (docType == "PDF")
      {
          settings.DetectAreasMode = ocrSettings.DetectAreasMode; //TABLE
      }
      else if (docType == "IMAGE")
      {
          settings.DetectAreasMode = ocrSettings.DetectAreasMode; // TABLE
      }
      else
      {

          if (ocrSettings.AllowedCharacters.IsActive)
              settings.AllowedCharacters = ocrSettings.AllowedCharacters.Allowed; //ALL
          settings.AutomaticColorInversion = ocrSettings.AutomaticColorInversion; //false
          settings.AllowedSymbols = string.Equals(ocrSettings.AllowedSymbols, "null") ? null : ocrSettings.AllowedSymbols;
          settings.DetectAreasMode = ocrSettings.DetectAreasMode; //TABLE
          settings.IgnoredSymbols = string.Equals(ocrSettings.IgnoredSymbols, "null") ? null : ocrSettings.IgnoredSymbols;
          settings.Language = ocrSettings.Language; //ExtLatin
          settings.LinesFiltration = ocrSettings.LinesFiltration; //false
          settings.RecognizeSingleLine = ocrSettings.RecognizeSingleLine; //false
          if (ocrSettings.ThreadsCount > 0)
              settings.ThreadsCount = ocrSettings.ThreadsCount; // 0

          settings.UpscaleSmallFont = ocrSettings.UpscaleSmallFont;//false
          foreach (var recVarOCR in ocrSettings.RecognitionAreas) //null
          {
              if (recVarOCR.IsActive)
              {
                  settings.RecognitionAreas.Add(new Rectangle(recVarOCR.X, recVarOCR.Y, recVarOCR.Width, recVarOCR.Height));
              }
              else
              {
                  settings.RecognitionAreas = null;
              }
          }
      }

      return settings;
  }

If I don't use the start page and page count, as in your suggested code, I get a NullReferenceException.
code_snippet.gif (58.4 KB)

Thanks in advance for any suggestions

hasanirmak · July 5, 2024, 2:23pm

asad.ali:

OcrInput input = new OcrInput(InputType.PDF, filter);
string imgPath = @"D:\imgs\ISSUES\NET827\Sample_Doc.pdf";
input.Add(imgPath);
List<RecognitionResult> result = api.Recognize(input, new RecognitionSettings
{
  DetectAreasMode = DetectAreasMode.TABLE
});

I realized that your suggested code also does not work with multipage PDFs, so I changed it to:


 AsposeOcr api = new AsposeOcr();
 OcrInput input = new OcrInput(InputType.PDF);
 input.Add(originalFileContent, 1 , pageCount);

 // Recognize image
 List<RecognitionResult> resultTest = api.Recognize(input, new RecognitionSettings
 {
     DetectAreasMode = DetectAreasMode.TABLE
 });

With this recognition setting, I can only convert multipage non-searchable PDFs that contain mostly text. It no longer works for non-searchable PDFs that contain an image.

The question now is how to get a result for multipage PDFs that contain both text and images.

Also, when I try to convert the multipage PDF from the Aspose example, I get the following message:

`Microsoft.ML.OnnxRuntime.OnnxRuntimeException: '[ErrorCode:RuntimeException] Non-zero status code returned while running ConvInteger node. Name:'Conv_0_quant' Status Message: bad allocation'`

Can you tell me whether this has something to do with the recognition settings? Or why do I get this message and cannot convert this example?

Thanks in advance for your help

asad.ali · July 5, 2024, 10:10pm

@hasanirmak

Please allow us to investigate the scenario and we will get back to you as soon as we have some feedback to share.

hasanirmak · July 8, 2024, 8:48am

Hello Asad
Thanks in advance for analysing.

I just want to mention the things that do not work:

- The multipage PDF in the Aspose GitHub example (link mentioned above) generates an OnnxRuntimeException.
- I now differentiate between single-page and multipage input for images as well. With both InputType.SingleImage and InputType.TIFF, recognizing the text with the sample code below always fails with the error: {“Value does not fall within the expected range.”}

        public static MemoryStream ProcessOcrImage(AsposeSettings setting, MemoryStream originalFileContent, Document doc, BaseDocument org_doc)
        {
            var outputFileContent = new MemoryStream();
            OcrInput input = new OcrInput(InputType.SingleImage);
            try
            {

                var pageCount = doc.Pages.Count();
                AsposeOcr api = new AsposeOcr();
                
                if (pageCount == 1)
                {
                    if (setting.OCRSettings.AutoSkew)
                    {
                       
                        PreprocessingFilter filters = new PreprocessingFilter
                        {
                            PreprocessingFilter.AutoSkew()
                        };
                        input = new OcrInput(InputType.SingleImage, filters);
                    }
                    else
                    {
                        input = new OcrInput(InputType.SingleImage);
                    }
                    
                    input.Add(originalFileContent);
                }
                else
                {
                    if (setting.OCRSettings.AutoSkew)
                    {
                       
                        PreprocessingFilter filters = new PreprocessingFilter
                        {
                            PreprocessingFilter.AutoSkew()
                        };
                        input = new OcrInput(InputType.TIFF, filters);
                    }
                    else
                    {
                        input = new OcrInput(InputType.TIFF);
                    }
                    
                    input.Add(originalFileContent, 1, pageCount);
                }

                // Recognize image      -->> causes Error
                List<RecognitionResult> ocrResult = api.Recognize(input, new RecognitionSettings
                {
                    DetectAreasMode = DetectAreasMode.COMBINE
                });


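                // Note: if more than one result is returned, neither branch below runs
                // and the still-empty outputFileContent is returned.
                if (ocrResult.Count == 1)
                {
                    org_doc.OCRResult = ocrResult[0].RecognitionText;
                    AsposeOcr.SaveMultipageDocument(outputFileContent, OCR.SaveFormat.Pdf, ocrResult);
                }
                else if (ocrResult.Count < 1)
                {
                    return originalFileContent;
                }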

            }
            catch (Exception ex)
            {
                Logger.Error($"ProcessOcrImage failed:{ex.Message}", ex);
                throw; // rethrow without resetting the stack trace
            }
            finally
            {
                Logger.Info("End of ProcessOcrImage method.");
            }
            return outputFileContent;

        }
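
The code above picks between InputType.SingleImage and InputType.TIFF by page count. As an alternative, the kind of file could be detected from its leading bytes before the input type is chosen. The following is a minimal sketch, assuming such detection is acceptable here: `FileKind` and `FileSniffer` are hypothetical names, not Aspose types, and only the well-known signatures of PDF, TIFF, PNG, JPEG and BMP are checked.

```csharp
using System;

public enum FileKind { Unknown, Pdf, Tiff, OtherImage }

// Hypothetical helper, not part of Aspose.OCR.
public static class FileSniffer
{
    // Classifies a file by the signature at the start of its content.
    public static FileKind Detect(ReadOnlySpan<byte> head)
    {
        if (StartsWith(head, 0x25, 0x50, 0x44, 0x46))            // "%PDF"
            return FileKind.Pdf;

        if (StartsWith(head, 0x49, 0x49, 0x2A, 0x00)             // TIFF, little-endian
            || StartsWith(head, 0x4D, 0x4D, 0x00, 0x2A))         // TIFF, big-endian
            return FileKind.Tiff;

        if (StartsWith(head, 0x89, 0x50, 0x4E, 0x47)             // PNG
            || StartsWith(head, 0xFF, 0xD8, 0xFF)                // JPEG
            || StartsWith(head, 0x42, 0x4D))                     // BMP ("BM")
            return FileKind.OtherImage;

        return FileKind.Unknown;
    }

    private static bool StartsWith(ReadOnlySpan<byte> data, params byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

The caller would then map `Pdf`, `Tiff` and `OtherImage` to InputType.PDF, InputType.TIFF and InputType.SingleImage. Whether that mapping resolves the error reported here is not confirmed in this thread.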

asad.ali · July 8, 2024, 9:56pm

@hasanirmak

Please check the link in the private message we just sent you to download the project.

The problem was caused by:

- a conflict between different Aspose packages
- the OCR settings

case "RecognitionAreas": // If not null, it should be defined as a complex object with x,y,h
    setting.OCRSettings.RecognitionAreas = new List<Rectangle>();
    break;

You were setting this to null. We will fix this in the next release to allow null for this setting, but for now you must set it to an empty collection.
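
Until null is accepted, the configured areas could be normalized before they are assigned, so that the setting always receives a collection. The following is a minimal sketch, not Aspose code: `AreaConfig` and `AreaSettings` are hypothetical stand-ins for the configuration objects used in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one configured recognition area (not an Aspose type).
public sealed record AreaConfig(bool IsActive, int X, int Y, int Width, int Height);

public static class AreaSettings
{
    // Returns only the active areas. A null or empty configuration
    // yields an empty list, never null.
    public static List<AreaConfig> ActiveAreasOrEmpty(IEnumerable<AreaConfig> configured) =>
        (configured ?? Enumerable.Empty<AreaConfig>())
            .Where(area => area.IsActive)
            .ToList();
}
```

Each returned entry could then be converted to a Rectangle and added to settings.RecognitionAreas, as the settings code earlier in the thread does for active areas.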

Regarding the OnnxRuntime error, we can only advise using ThreadsCount = 1 in the settings. It works more stably.

We are afraid that we cannot identify it as an Aspose.OCR error. Can you please explain a bit more?