Make PDF image searchable using Aspose only (without Tesseract and co)

mike1986 · July 21, 2017, 10:18pm

You provide an example of making an image PDF searchable using Aspose.Pdf and Tesseract OCR, but can we do this using only Aspose products (using Aspose.OCR instead of Tesseract)? If yes, can you give me a code example for evaluation purposes?

codewarior · July 22, 2017, 12:02am

@mike1986,

Thanks for contacting support.

As per your requirement, you can first convert the PDF pages to JPEG images and then perform OCR on the image files. Note that with this approach the text formatting is not preserved; the text is recognized as plain text.

In case you have any further query or you encounter any issue while using our APIs, please feel free to contact us.

mike1986 · July 22, 2017, 1:39pm

Exception in thread "main" class com.aspose.pdf.internal.ms.System.z9: At most 4 text fragments can be added in evaluation mode.

So if I understand correctly, a license is needed to test this functionality?

codewarior · July 24, 2017, 8:16am

@mike1986,

Thanks for sharing the details.

The error above appears because you are probably using the API in evaluation mode. Please note that the trial / evaluation version is limited to manipulating 4 objects.

Therefore, in order to test the APIs without any limitations, please request a 30-day temporary license.

mike1986 · July 24, 2017, 1:03pm

@codewarior
Thanks for your answer.
Is it correct that an hOCR function is in development at Aspose?
Can you give me a release or pre-release date, please?

imran.rafique · July 24, 2017, 11:29pm

@mike1986,

Our colleague Nayyer has already shared a solution as per your requirement. You can convert PDF pages to images with Aspose.Pdf API, and then retrieve text from the images with Aspose.OCR API. Kindly let us know if the proposed solution is different than your requirement.

Besides this, if your source PDF contains images, you can extract these images from the PDF file and then pass them to the Aspose.OCR API to retrieve the text.

Best Regards,
Imran Rafique

ikram.haq · July 25, 2017, 11:40am

@mike1986,

This is to update you that our product team is currently working on restructuring the existing APIs and improving the overall quality of OCR results and of the features already supported. The hOCR feature has been logged in our system with ID OCRNET-2945.

Because of the complexity of this feature, it can take some time, and at the moment we are not in a position to share any reliable ETA. However, we will update you once our product team adds this feature to its roadmap. We are sorry for the inconvenience.
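
For context, hOCR is an open HTML-based format that stores each recognized word together with its bounding box, which is what allows an invisible text layer to be positioned over a scanned page. The following is a minimal sketch in plain Java of what such output looks like. It does not use any Aspose API, and the `HocrSketch` class, the `Word` type and the `toHocr` method are hypothetical names introduced only for illustration. It emits a page fragment only; a complete hOCR file would also wrap this in an HTML document and usually group words into lines.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: builds an hOCR-style page fragment from words and boxes.
// None of these names come from Aspose; they are assumptions for this sketch.
final class HocrSketch {

    /** One recognized word and its pixel bounding box (left, top, right, bottom). */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    /** Escapes characters that would otherwise break the HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("'", "&#39;");
    }

    /** Renders one page as an ocr_page div containing one ocrx_word span per word. */
    static String toHocr(List<Word> words, int pageWidth, int pageHeight) {
        StringBuilder sb = new StringBuilder();
        sb.append("<div class='ocr_page' title='bbox 0 0 ")
          .append(pageWidth).append(' ').append(pageHeight).append("'>\n");
        for (Word w : words) {
            sb.append("  <span class='ocrx_word' title='bbox ")
              .append(w.x0).append(' ').append(w.y0).append(' ')
              .append(w.x1).append(' ').append(w.y1).append("'>")
              .append(escape(w.text))
              .append("</span>\n");
        }
        sb.append("</div>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample data standing in for the result of an OCR engine.
        List<Word> words = Arrays.asList(
                new Word("Hello", 10, 20, 80, 40),
                new Word("PDF", 90, 20, 130, 40));
        System.out.print(toHocr(words, 600, 800));
    }
}
```

Plain-text OCR output, as in the convert-to-image approach suggested earlier in this thread, lacks these coordinates, which is why it cannot by itself place searchable text at the right position on the page.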

mike1986 · January 19, 2018, 3:11pm

Hi, I'm happy to read in the Aspose.OCR for Java 17.6 - Release Notes that hOCR is ready, but I can't find any documentation about it. Could you send me a link to the docs, please?

ikram.haq · January 19, 2018, 6:12pm

@mike1986,

We have asked for details on this feature from our product team as details are internal to the API. We will share the information with you as soon as it is available.

mike1986 · January 23, 2018, 8:19am

Thank you for your reply.
Could you make sure that I have this information in the next few days please?
My manager is due to purchase a Total license for Java this week to start an urgent development. Knowing that the function exists, it would be a shame to have to build a first version on an external API such as Tesseract for lack of documentation, only to switch to the Aspose hOCR function later.

ikram.haq · January 23, 2018, 3:07pm

@mike1986,

Thank you for your patience. This is to update you that the hOCR feature is not supported yet. Because of the complexity of this feature, it will take time. Due to a mistake, incorrect information was updated and published. We apologize for the confusion.

We are very sorry for the inconvenience caused.

mike1986 · August 14, 2018, 2:25pm

Can you tell me whether an approximate release date for this feature is available yet?

I have now been waiting a year for this feature to be released.

asad.ali · August 14, 2018, 6:30pm

@mike1986

We are checking the details regarding ticket ID OCRNET-2945 and will update you in a while.

asad.ali · August 14, 2018, 9:51pm

@mike1986

I am afraid that no significant progress has been made towards resolving this issue, due to its complex implementation and previously pending issues in the queue. I regret to share that the logged ticket will take more time to resolve and is not expected to be implemented this year. However, we will surely let you know as soon as we have further updates to share in this regard. Please be patient and spare us a little time.

We are sorry for the inconvenience.