We want to convert non-searchable PDFs into searchable PDFs with the help of Aspose.OCR (see Sample_Doc.pdf in the attachment).
In our solution, the PDF comes out of the Aspose.OCR conversion with blacked-out content: it is completely destroyed and no longer usable for the customer (see not_wanted_result.pdf in the attachment).
I checked the same PDF with the Aspose Cloud solution at “OCR Online. Convert PDF to Searchable PDF”, and only part of the content was recognized (see Aspose_Cloud_result.jpg in the attachment).
Are you aware of such a case?
If Aspose.OCR cannot read and convert the content, how can we detect the error or the empty result, so that we leave the original as it is instead of passing on a corrupt document?
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): OCRNET-827
Thanks for the suggestion. This setting solved my problem with non-searchable PDFs, but since upgrading the Aspose components I can no longer open the converted images (tif, jpg, bmp, png,…).
As soon as I try to open a converted image with Acrobat, I get the message “There was an error opening this document. This file can not be opened because it has no pages”, although the file is not 0 KB (see attachment: Error while opening convertet image file.gif).
In the code, even with the following recognition settings, I get no OCR result for images (always 0) and cannot open the converted document.
I think I have found the problem: it is the startPage value in the Input.Add method. Can you tell me what the correct start index for this method is?
```
var pagecount = doc.Pages.Count();
var documentSettings = OcrHelper.SetDocumentRecognitionSettings(ocrImageSettings, "IMAGE");
var outputFileContent = new MemoryStream();
AsposeOcr api = new AsposeOcr();
OcrInput input = new OcrInput(InputType.PDF);
input.Add(originalFileContent, 1, pagecount);
```
I am using 1 as the start page for PDFs, but this does not work for images: I get no OCR result for them.
Can you please tell me how to determine the correct start page so that it works for both PDFs and all image types?
Or does it also depend on which setting we use for image recognition?
I realized that your suggested code also does not work with multipage PDFs, so I changed it to:
```
AsposeOcr api = new AsposeOcr();
OcrInput input = new OcrInput(InputType.PDF);
input.Add(originalFileContent, 1, pageCount);
// Recognize image
List<RecognitionResult> resultTest = api.Recognize(input, new RecognitionSettings
{
    DetectAreasMode = DetectAreasMode.TABLE
});
```
With this recognition setting, I can only convert multipage non-searchable PDFs that consist mostly of text; non-searchable PDFs that contain an image no longer work at all.
The question now is how to get a result for multipage PDFs that contain both text and images.
And after trying to convert the multipage PDF given in the Aspose example, I got the following message:
`Microsoft.ML.OnnxRuntime.OnnxRuntimeException: '[ErrorCode:RuntimeException] Non-zero status code returned while running ConvInteger node. Name:'Conv_0_quant' Status Message: bad allocation'`
Can you tell me whether this has something to do with the recognition setting? Otherwise, why do I get this message, and why can't I convert this example?
I just want to mention the things that do not work:

- The multipage PDF in the Aspose Git example (link mentioned above) generates an OnnxRuntimeException.
- I now also differentiate between single-page and multipage input for images. With both InputType.SingleImage and InputType.TIFF, recognizing the text with this sample code always fails with the error: {“Value does not fall within the expected range.”}

```
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static MemoryStream ProcessOcrImage(AsposeSettings setting, MemoryStream originalFileContent, Document doc, BaseDocument org_doc)
{
    var outputFileContent = new MemoryStream();
    OcrInput input = new OcrInput(InputType.SingleImage);
    try
    {
        var pageCount = doc.Pages.Count();
        AsposeOcr api = new AsposeOcr();
        if (pageCount == 1)
        {
            // Single page: treat the content as one image.
            if (setting.OCRSettings.AutoSkew)
            {
                PreprocessingFilter filters = new PreprocessingFilter
                {
                    PreprocessingFilter.AutoSkew()
                };
                input = new OcrInput(InputType.SingleImage, filters);
            }
            else
            {
                input = new OcrInput(InputType.SingleImage);
            }
            input.Add(originalFileContent);
        }
        else
        {
            // Several pages: treat the content as a multipage TIFF.
            if (setting.OCRSettings.AutoSkew)
            {
                PreprocessingFilter filters = new PreprocessingFilter
                {
                    PreprocessingFilter.AutoSkew()
                };
                input = new OcrInput(InputType.TIFF, filters);
            }
            else
            {
                input = new OcrInput(InputType.TIFF);
            }
            input.Add(originalFileContent, 1, pageCount);
        }

        // Recognize image -->> this call causes the error
        List<RecognitionResult> ocrResult = api.Recognize(input, new RecognitionSettings
        {
            DetectAreasMode = DetectAreasMode.COMBINE
        });

        if (ocrResult.Count < 1)
        {
            // Nothing was recognized: hand back the original unchanged.
            return originalFileContent;
        }

        // Multipage input yields one result per page, so keep the text of every page
        // and save whenever there is at least one result (not only when there is exactly one).
        org_doc.OCRResult = string.Join(Environment.NewLine, ocrResult.Select(r => r.RecognitionText));
        AsposeOcr.SaveMultipageDocument(outputFileContent, OCR.SaveFormat.Pdf, ocrResult);
    }
    catch (Exception ex)
    {
        Logger.Error($"ProcessOcrImage failed:{ex.Message}", ex);
        throw; // rethrow without resetting the stack trace
    }
    finally
    {
        Logger.Info("End of ProcessOcrImage method.");
    }
    return outputFileContent;
}
```
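
The "leave the original as it is" fallback asked about earlier can also be separated from the OCR call itself. The following is only a sketch that uses no Aspose types: `OcrFallback`, `HasUsableText` and `ChooseOutput` are hypothetical names, and it assumes the recognized text of each page is available as a plain string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class OcrFallback
{
    // True when at least one page produced text other than whitespace.
    public static bool HasUsableText(IEnumerable<string> pageTexts)
    {
        return pageTexts != null && pageTexts.Any(text => !string.IsNullOrWhiteSpace(text));
    }

    // Returns the converted stream only when OCR found text and output was written;
    // otherwise the untouched original is returned.
    public static MemoryStream ChooseOutput(MemoryStream original, MemoryStream converted, IEnumerable<string> pageTexts)
    {
        bool usable = converted != null && converted.Length > 0 && HasUsableText(pageTexts);
        return usable ? converted : original;
    }
}
```

Checking the recognized text as well as the result count also covers the case where recognition returns results whose text is empty.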
Please check the link in the private message we just sent you to download the project.
The problems were:

- a conflict between different Aspose packages
- the OCR settings

```
case "RecognitionAreas": // If not null, it should be defined as a complex object with x,y,h
    setting.OCRSettings.RecognitionAreas = new List<Rectangle>();
    break;
```
You were setting this to null. We will allow null for this setting in the next release, but for now you must set it to an empty collection.
Regarding the onnxruntime error, we can only advise using ThreadCount = 1 in the settings; this works more stably.
We are afraid that we cannot identify it as an Aspose.OCR error. Can you please explain a bit more?
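
The empty-collection workaround can be applied in one place before the settings are used. This is a minimal sketch: `SettingsDefaults.OrEmpty` is a hypothetical helper, not part of Aspose.OCR, and it is generic so that it compiles without any Aspose or drawing package.

```csharp
using System.Collections.Generic;

public static class SettingsDefaults
{
    // Hypothetical helper: returns the given list, or a new empty list when it is null.
    public static List<T> OrEmpty<T>(List<T> value)
    {
        return value ?? new List<T>();
    }
}
```

In the code from this thread it could be called as `setting.OCRSettings.RecognitionAreas = SettingsDefaults.OrEmpty(setting.OCRSettings.RecognitionAreas);`, so a null value never reaches the recognizer.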