What is the best approach? (new to Aspose)

Hi There

I need to perform OCR on input PDF files that are either:

  1. Image based.
  2. Indexed - but for some reason “unreadable”.

The output should be an indexed PDF file.

Any “best practice” approach to this?
Any aspects concerning the PDF file I should be aware of?

/Bruno

Hi Bruno,


Thank you for considering Aspose APIs.

First, let me give you an overview of the Aspose.OCR APIs for better understanding.

  • Aspose.OCR API can read characters from images.
  • Support for JPG, JPEG, PNG, GIF, BMP and TIFF image file formats for OCR.
  • Support for English, French and Spanish.
  • Read popular fonts including Arial, Times New Roman, Courier New, Verdana, Tahoma and Calibri.
  • Support for regular, bold and italic font styles.
  • Scan the whole image or any part of the image.
  • Scan rotated images.
  • Can apply different noise removal filters before image recognition.
  • Allow second guess of a symbol.

As stated above, the current implementation of the Aspose.OCR APIs works with images only. To process PDF files, you therefore have to first convert them to images using the Aspose.Pdf APIs and then perform the OCR operation with the Aspose.OCR APIs. Moreover, the results of the OCR operation are in plain text format, so if you wish to convert the text to PDF, you have to use the Aspose.Pdf APIs for that step as well.
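
To make the order of the steps concrete, here is a minimal Python sketch of the workflow described above. It is only an outline under stated assumptions: `ocr_pdf`, `rebuild_pdf`, `render_pages`, `recognize` and `write_pdf` are hypothetical names chosen for illustration and are not Aspose API calls. The actual PDF-to-image rendering, recognition and PDF writing would be supplied by whichever Aspose.Pdf and Aspose.OCR calls apply in your platform.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# NOTE: every name in this sketch is a hypothetical placeholder.
# None of these functions are part of Aspose.Pdf or Aspose.OCR.

PageImage = bytes  # one rendered page in a supported image format (e.g. PNG, TIFF)


@dataclass
class PageText:
    page_number: int
    text: str


def ocr_pdf(
    pdf_path: str,
    render_pages: Callable[[str], Iterable[PageImage]],
    recognize: Callable[[PageImage], str],
) -> List[PageText]:
    """Steps 1 and 2: render each PDF page to an image, then OCR each image."""
    results = []
    for number, image in enumerate(render_pages(pdf_path), start=1):
        results.append(PageText(number, recognize(image)))
    return results


def rebuild_pdf(
    pages: List[PageText],
    write_pdf: Callable[[str, List[str]], None],
    out_path: str,
) -> None:
    """Step 3: hand the plain text, one entry per page, to a PDF writer."""
    write_pdf(out_path, [page.text for page in pages])
```

Passing the three operations in as callables keeps the flow independent of any particular library, so each step can be replaced or tested separately. Note that, as described above, the OCR result is plain text, so a PDF rebuilt this way contains the recognized text rather than the original page layout.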

Please feel free to write back in case you have further questions or concerns.

Hi Babar

Thank you for your insights.

Is there anything to be aware of with regard to languages? The documents I need to process are in Danish and use the special Danish characters. Would that cause any problems?

/Bruno

Hi Bruno,


I am afraid the Aspose.OCR APIs do not currently support the Danish language. I have attached the appropriate ticket (OCR-33762) from our database to this thread so that you are notified automatically once the required feature is available for public use. Please note that this feature is on the Aspose.OCR roadmap, scheduled for the last quarter of 2015.

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.