Support for mixed languages


#1

Dear Team,

We are evaluating the Aspose library for one of our projects. We have a few issues with the OCR library and would appreciate your input on them.

API Versions used:
"Aspose.OCR" Version="17.11.0"
"Aspose.PDF" Version="19.5.0"

Requirement:
We need to extract text from PDF documents that contain scanned images. A document can contain multiple languages (English, Spanish, Portuguese).

Issue:
When we try to specify more than one language pack, it throws the following error:

“Unfortunately work with more than one languages is not supported yet! Use Clear() method to remove current language.”

We are following the steps described at https://docs.aspose.com/display/ocrnet/Managing+OCR#ManagingOCR-WorkingwithDifferentLanguages
Because the documents reach the application from multiple sources, we cannot determine in advance which language a document is in. So the idea is to include multiple language packs and let the API detect the correct language.

Question:
Is Aspose capable of identifying the language of the input document automatically and selecting the corresponding language pack? Or is there a recommended approach for this kind of scenario?

Request your timely help in this matter.

Thanks & Regards
Anish


#2

@Anishc

Thanks for contacting support.

Aspose.OCR for .NET offers the IRecognizedTextPartInfo interface, which holds useful information about a recognized text part. Each part has its own style, font, text size, color, language and other attributes. You can use it to determine the language of the identified text. In case you face any issue, please share your sample input files along with a complete sample code snippet. We will test the scenario in our environment and address it accordingly.


#3

@asad.ali Thanks for the details.

We tried the approach with IRecognizedTextPartInfo, but the language is always returned as NULL.
We are attaching the code snippet for you to verify. Please let us know what needs to be done to identify the language correctly.
ParsePDF_Images.zip (1.0 KB)


#4

@Anishc

Thanks for sharing the sample code snippet.

Would you please also share the input image file with which you tested this code snippet, so that we can test the scenario accordingly?


#5

@asad.ali
Please find the sample document attached.
Scan_Test_14_52_13-06-2019.pdf (106.0 KB)


#6

@Anishc

Thanks for providing the requested details.

We have tested the scenario in our environment and observed a similar issue: the API was unable to detect the language of the text blocks it found. Therefore, an issue has been logged as OCR-689 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of its correction. Please be patient and spare us a little time.

We are sorry for the inconvenience.
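
For readers facing the same one-language-at-a-time limitation quoted in the error message above, one possible workaround is to run recognition once per candidate language and keep the result that looks most like that language. The sketch below is only an illustration of that idea and does not use the Aspose API: `recognize` is a hypothetical callable standing in for whatever OCR call you use (loading a single language pack, processing the image, and returning the text), and the stopword lists and the scoring heuristic are assumptions made for this example, not something the library provides.

```python
from typing import Callable, Optional, Tuple

# Tiny illustrative stopword lists (an assumption for this sketch);
# a real system would use much larger lists or a proper language detector.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "spanish": {"el", "la", "los", "y", "de", "que", "en", "es", "para"},
    "portuguese": {"o", "a", "os", "e", "de", "que", "em", "não", "para", "uma"},
}


def stopword_score(text: str, language: str) -> float:
    """Return the fraction of words in `text` that are stopwords of `language`."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,;:!?") in STOPWORDS[language])
    return hits / len(words)


def recognize_best(
    image: bytes,
    recognize: Callable[[bytes, str], str],
) -> Tuple[Optional[str], str]:
    """Run the hypothetical `recognize(image, language)` once per language
    and return (language, text) for the highest-scoring result."""
    best_language: Optional[str] = None
    best_text = ""
    best_score = -1.0
    for language in STOPWORDS:
        # One language per run, mirroring the single-language restriction.
        text = recognize(image, language)
        score = stopword_score(text, language)
        if score > best_score:
            best_language, best_text, best_score = language, text, score
    return best_language, best_text
```

The trade-off is that each page is processed once per language, so this costs roughly three passes for English, Spanish and Portuguese, and the heuristic can be unreliable on very short text.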