Support for mixed languages

Anishc · June 12, 2019, 3:58am

Dear Team,

We are evaluating the Aspose library for one of our projects. We have a few issues with the OCR library and would appreciate your input on them.

API Versions used:
“Aspose.OCR” Version=“17.11.0”
“Aspose.PDF” Version=“19.5.0”

Requirement:
We need to extract text from PDF documents that contain scanned images. A document can contain multiple languages (English, Spanish, Portuguese).

Issue:
When we try to specify more than one language pack, the library throws this error:

“Unfortunately work with more than one languages is not supported yet! Use Clear() method to remove current language.”

We are following the steps described at https://docs.aspose.com/ocr/net/languages/
Because documents reach the application from multiple sources, we cannot determine in advance which language a document is in. Our idea is therefore to include multiple language packs and let the API detect the correct language.

Question:
Is Aspose capable of identifying the language of the input document automatically and selecting the corresponding language pack? If not, is there a recommended approach for this kind of scenario?

Request your timely help in this matter.

Thanks & Regards
Anish

asad.ali · June 12, 2019, 6:13pm

@Anishc

Thanks for contacting support.

Aspose.OCR for .NET offers the IRecognizedTextPartInfo interface, which holds useful information about each recognized text part. Each part has its own style, font, text size, color, language and other attributes, so you can use it to determine the language of the recognized text. If you face any issue, please share your sample input files along with a complete sample code snippet. We will test the scenario in our environment and address it accordingly.

Anishc · June 12, 2019, 11:48pm

@asad.ali Thanks for the details.

We tried the approach with IRecognizedTextPartInfo, but the language is always returned as NULL.
We are attaching the code snippet for you to verify. Please let us know what needs to be done to identify the language correctly.
ParsePDF_Images.zip (1.0 KB)

asad.ali · June 13, 2019, 5:01pm

@Anishc

Thanks for sharing the sample code snippet.

Would you please also share the input image file with which you tested this code snippet, so that we can test the scenario accordingly?

Anishc · June 13, 2019, 6:59pm

@asad.ali
Please find the sample document attached.
Scan_Test_14_52_13-06-2019.pdf (106.0 KB)

asad.ali · June 14, 2019, 4:43pm

@Anishc

Thanks for providing the requested details.

We have tested the scenario in our environment and observed a similar issue: the API was unable to detect the language of the text blocks it found. An issue has therefore been logged as OCR-689 in our issue tracking system. We will look further into the details and keep you posted on the status of the fix. Please be patient and allow us a little time.

We are sorry for the inconvenience.

Anishc · June 17, 2019, 2:32pm

Thanks for the analysis. Do you have any update on how long the resolution will take? We are evaluating Aspose under a temporary license for a few of our use cases, and I am afraid our evaluation period expires on 2019-07-11.

asad.ali · June 17, 2019, 11:59pm

@Anishc

We regret that we cannot share an ETA for resolution of the logged ticket. Aspose.OCR for .NET has not been revised since version 17.11, and we intend to release a new revision of this API in the future. Because of other high-priority tasks and component integration, this may take some time. However, as soon as we make significant progress towards resolving the logged ticket, we will let you know. Please allow us a little time.

If your free license expires, you can request an extension in the Aspose.Purchase forum.
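
Since the error message quoted earlier in the thread indicates that only one language can be loaded at a time, one possible interim workaround is to run recognition once per candidate language and keep the output that looks most like that language. The sketch below is in Python and deliberately does not call Aspose.OCR: `recognize` is a hypothetical callback standing in for whatever single-language OCR call is available, and `STOPWORDS`, `score_text` and `recognize_best` are illustrative names, not part of any Aspose API. The stopword lists are tiny and only for illustration.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Assumption: very small, illustrative stopword lists. A real setup would
# need larger lists (or a proper language-identification library).
STOPWORDS: Dict[str, Set[str]] = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "for", "with"},
    "spanish": {"el", "la", "los", "las", "y", "de", "que", "en", "es", "para", "con"},
    "portuguese": {"o", "a", "os", "as", "e", "de", "que", "em", "é", "para", "com", "não"},
}


def score_text(text: str, language: str) -> float:
    """Return the fraction of words in `text` that are stopwords of `language`."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in STOPWORDS[language])
    return hits / len(words)


def recognize_best(
    recognize: Callable[[str], str],
    languages: Iterable[str],
) -> Tuple[str, str, float]:
    """Run the (hypothetical) single-language OCR callback once per language
    and return (language, text, score) for the highest-scoring result."""
    best: Tuple[str, str, float] = ("", "", -1.0)
    for lang in languages:
        text = recognize(lang)  # one OCR pass with only this language loaded
        score = score_text(text, lang)
        if score > best[2]:
            best = (lang, text, score)
    return best
```

In use, `recognize` would wrap the actual OCR call so that it loads exactly one language pack, processes the page, and returns the recognized text. This costs one OCR pass per candidate language per page, and it picks a single language per page, so it does not help with pages that mix languages within the same text.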