AsposeOCRPdf Java version 23.6 does nothing

raviteja · June 14, 2023, 8:13pm

Hi,

I am new to Aspose OCR for extracting text from image PDFs. When I use version 22.2, I get the following error: Exception in thread “main” java.lang.NoClassDefFoundError: ai/onnxruntime/OrtException

I cannot find the JAR that provides this class.

When I switch to version 23.6, the class DocumentRecognitionSettings is shown as deprecated. What is the correct Java code for version 23.6 to do the conversion? With the following code, version 23.6 does nothing and returns no results. For testing, I used the same PDF that ships with the Aspose examples. Please let me know what is wrong here.

    String file = "multi_page_1.pdf";

    // Create api instance
    AsposeOCRPdf api = new AsposeOCRPdf();

    // Set recognition options
    DocumentRecognitionSettings settings = new DocumentRecognitionSettings(0, 1);
    settings.setDetectAreas(false);

    // Get result list
    ArrayList<RecognitionResult> result = api.RecognizePdf(file, settings);
    //System.out.println(api.GetPagesNumber(file));

    // Print result
    for (RecognitionResult r : result) {
        printResult(r);
    }

    // ExEnd:1
    System.out.println("OCRRecognizePdf: execution complete");

asad.ali · June 15, 2023, 12:07am

@raviteja

To resolve the onnxruntime error, add the following dependency to your pom.xml file:

<!-- https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime -->
<dependency>
   <groupId>com.microsoft.onnxruntime</groupId>
   <artifactId>onnxruntime</artifactId>
   <version>1.10.0</version>
</dependency>

Furthermore, can you please share your sample file for our reference so that we can test the scenario in our environment and address it accordingly?
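
A quick way to confirm that the onnxruntime dependency is actually visible at run time is to try loading the class named in the `NoClassDefFoundError`. The following is a minimal sketch using only the standard library; the class name `OnnxRuntimeCheck` and the helper `isOnClasspath` are made up for illustration and are not part of Aspose.OCR or ONNX Runtime.

```java
public class OnnxRuntimeCheck {
    // Returns true if the named class can be found by this class's class loader.
    // The class is located but not initialized (second argument is false).
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OnnxRuntimeCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message: ai/onnxruntime/OrtException
        String cls = "ai.onnxruntime.OrtException";
        System.out.println(cls + " on classpath: " + isOnClasspath(cls));
    }
}
```

If this prints `false` when run with the same classpath as the OCR program, the dependency is probably not being resolved or packaged, and the build configuration is the place to look.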

raviteja · June 15, 2023, 12:53am

Sample_Scanned_PDF.pdf (866.5 KB)
This is the document I am trying. In fact, nothing happens with any document on version 23.6.

asad.ali · June 15, 2023, 12:08pm

@raviteja

We are checking it and will get back to you shortly.

asad.ali · June 15, 2023, 12:19pm

@raviteja

Please check the code snippet below. We used it with version 23.6 of the API and got results in our environment:

AsposeOCR api = new AsposeOCR();

// Apply the license before running recognition
License lic = new License();
lic.setLicense("D:\\ASPOSE\\JAVA\\aspose.ocr-for-java\\testproject\\resources\\licenses\\Aspose.OCR.Product.Family.lic");

// Add the PDF to an OcrInput container of type PDF
String file = "D://Sample_Scanned_PDF.pdf";
RecognitionSettings set = new RecognitionSettings();
OcrInput input = new OcrInput(InputType.PDF);
input.add(file);

// Run recognition on the input
ArrayList<RecognitionResult> res = api.Recognize(input, set);

// Print the text of the first result, then save all results as PDF documents
System.out.println("TEXT:\n" + res.get(0).recognitionText);
AsposeOCR.SaveMultipageDocument("D://java.pdf", Format.Pdf, res);
AsposeOCR.SaveMultipageDocument("D://java1.pdf", Format.PdfNoImg, res);

java.pdf (3.1 MB)
java1.pdf (8.6 KB)

raviteja · June 15, 2023, 8:54pm

Asad,

The sample PDF works fine with the new code snippet when I move it to the server. No issues.

But when I run the actual PDF (which I cannot share, per company policy), it throws the error below. We don't need special characters or anything else; we just need to read the page number printed on the page. How can we overcome this error?

Exception in thread “main” java.lang.IllegalArgumentException: Empty data for hough transform.
at com.aspose.ocr.e0cd0c7d77.f(Unknown Source)
at com.aspose.ocr.g.edf(Unknown Source)
at com.aspose.ocr.edf.f(Unknown Source)
at com.aspose.ocr.PreprocessingFilter.f(Unknown Source)
at com.aspose.ocr.t.f(Unknown Source)
at com.aspose.ocr.u.f(Unknown Source)
at com.aspose.ocr.s.e0cd0c6d16(Unknown Source)
at com.aspose.ocr.s.f(Unknown Source)
at com.aspose.ocr.AsposeOCR.Recognize(Unknown Source)

asad.ali · June 15, 2023, 9:26pm

@raviteja

Please make sure to use the onnxruntime dependency suggested earlier. Alternatively, you can download the JAR and reference it directly in your project. If the issue still persists, please let us know.

raviteja · June 15, 2023, 9:27pm

I have added it. The sample PDF works fine, but when I use the actual document I get the new Hough transform error.

asad.ali · June 15, 2023, 9:29pm

@raviteja

We are afraid we cannot comment further on the issue with the original PDF without being able to replicate it in our environment. If it is confidential, you can share it in a private message. We assure you that we use files only for investigation purposes and erase them from our systems after resolving the issue.

You can send a private message using the option in the post editor, as shown in the screenshot: image.png (18.1 KB)

asad.ali · June 16, 2023, 12:10am

@raviteja

We were able to reproduce the issue in our environment while testing the scenario with your PDF document.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRJAVA-325

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

raviteja · June 16, 2023, 1:02am

Asad,

Yes, we do have paid support, but it has never helped with prioritization. Is there any other way to achieve this with a different code snippet? We are under a tight time constraint.

asad.ali · June 16, 2023, 11:44am

@raviteja

The ticket has just been logged. As soon as we complete our investigation, we will share a workaround or a fix. We have recorded your concerns to expedite the investigation. Please note that paid support does not guarantee an immediate solution, but it does speed up the investigation.

Issues logged under paid support take precedence over issues logged under the free support model. Nevertheless, we will inform you once we have an update on the fix. Please give us a little time.

We are sorry for the inconvenience.
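
While a fix is pending, one generic way to keep a long run going when a single page throws, as in the `IllegalArgumentException` stack trace above, is to recognize pages one at a time and record the failures instead of aborting. The following is only a sketch of that pattern under stated assumptions: `recognizePage` is a hypothetical callback standing in for whatever per-page OCR call is available, and nothing here is part of the Aspose.OCR API or a confirmed workaround for OCRJAVA-325.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class PageByPageOcr {
    /** Text of the pages that succeeded and error messages of the pages that failed. */
    public static final class Outcome {
        public final Map<Integer, String> textByPage = new LinkedHashMap<>();
        public final Map<Integer, String> errorByPage = new LinkedHashMap<>();
    }

    // recognizePage is a hypothetical stand-in for the real per-page OCR call.
    public static Outcome recognizeAll(int pageCount, IntFunction<String> recognizePage) {
        Outcome outcome = new Outcome();
        for (int page = 0; page < pageCount; page++) {
            try {
                outcome.textByPage.put(page, recognizePage.apply(page));
            } catch (RuntimeException e) {
                // Record the failure and continue with the next page.
                outcome.errorByPage.put(page, e.getMessage());
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        // Fake recognizer for demonstration: page 1 fails, the others succeed.
        Outcome o = recognizeAll(3, page -> {
            if (page == 1) {
                throw new IllegalArgumentException("Empty data for hough transform.");
            }
            return "text of page " + page;
        });
        System.out.println("Recognized pages: " + o.textByPage.keySet());
        System.out.println("Failed pages: " + o.errorByPage);
    }
}
```

This does not remove the underlying error; it only isolates it, which also makes it easier to report which pages trigger it.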

asad.ali · July 20, 2023, 11:46am

@raviteja

Would you please try version 23.7 of the API and let us know if the issue still persists?

raviteja · July 21, 2023, 4:33pm

Unfortunately, Aspose is not working out for us: it takes too long on a 90-page document and finally fails with a heap space error after 1 hour.

Other tools such as Tesseract extracted the same file in 7 minutes. Can these performance issues be fixed?

asad.ali · July 21, 2023, 10:14pm

@raviteja

Can you please share the complete code snippet and the sample file for our reference? We will definitely test the scenario in our environment and try to fix the performance issues.
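
For a heap space error on a long document, two generic things worth checking are the heap ceiling the JVM was started with and whether all pages are being held in memory at once. The sketch below prints the maximum heap and splits a 90-page document into fixed-size batches of `{startPage, pageCount}`. It is an illustration under assumptions, not a confirmed fix: the class `PageBatches` and method `split` are made up here, the batch size of 20 is arbitrary, and whether the library can recognize a page range per call should be checked against its documentation.

```java
import java.util.ArrayList;
import java.util.List;

public class PageBatches {
    /** Splits pages [0, totalPages) into {startPage, pageCount} pairs of at most batchSize pages. */
    public static List<int[]> split(int totalPages, int batchSize) {
        if (totalPages < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and batchSize must be > 0");
        }
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < totalPages; start += batchSize) {
            batches.add(new int[] {start, Math.min(batchSize, totalPages - start)});
        }
        return batches;
    }

    public static void main(String[] args) {
        // Heap ceiling of the running JVM (set with the -Xmx option).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMb + " MB");

        for (int[] batch : split(90, 20)) {
            // A real run would recognize this page range here, write the text out,
            // and let the results go out of scope before starting the next batch.
            System.out.println("start=" + batch[0] + " count=" + batch[1]);
        }
    }
}
```

Processing in batches keeps only one batch of results alive at a time, which can lower peak memory use if the per-page results are what fills the heap.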