Scaling performance issues inserting hOCR into PDF document

frimbingpickering · October 4, 2023, 12:59pm

Hi Aspose forum,

There seem to be performance scaling issues when using Aspose.PDF to add hOCR to a PDF that has no selectable text layer. The idea is to take an hOCR string obtained by running OCR on the original PDF and use it to create a PDF with selectable text.

More specifically, the hOCR hook in Aspose seems to take disproportionately (apparently exponentially) longer to execute as the hOCR string grows.
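
To put a number on how the run time grows, one option is to time the hOCR step for a few increasing page counts and fit a slope on a log-log scale. The following is only a minimal sketch in Python and is independent of Aspose: `estimate_growth_exponent` is a hypothetical helper, and the timings have to be measured separately. A result near 1 suggests linear scaling and near 2 quadratic; if the value keeps climbing as larger inputs are added, growth is faster than any fixed polynomial.

```python
import math


def estimate_growth_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) against log(sizes).

    sizes   -- input sizes, e.g. page counts or hOCR string lengths (all > 0)
    seconds -- measured run time for each size (all > 0)
    """
    if len(sizes) != len(seconds) or len(sizes) < 2:
        raise ValueError("need at least two (size, seconds) pairs")

    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least squares for the line y = a + k * x; k is the exponent.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For example, timings that quadruple each time the page count doubles would give an exponent of about 2.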

The workaround I have found so far is the following process, which relies on the fact that adding hOCR to single-page PDFs works fine:

1. Split the multi-page PDF document into many single-page documents
2. Generate per-page hOCR
3. Add the hOCR to every single-page PDF using Aspose
4. Use Aspose to merge (assemble) all the single-page PDF documents back into one PDF
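
If the hOCR already exists as one multi-page string, the per-page hOCR step could also be done by splitting that string rather than re-running OCR. Below is a minimal sketch in Python, independent of Aspose. It assumes Tesseract-style output that is well-formed XHTML, with each page in an element carrying the class `ocr_page` directly under `<body>`. `split_hocr_by_page` is a hypothetical helper name, and the DOCTYPE line is not carried over into the per-page output.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def _local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def split_hocr_by_page(hocr):
    """Return one hOCR document string per 'ocr_page' element.

    Assumes the input is well-formed XHTML (as Tesseract emits) and that
    the page elements are direct children of <body>.
    """
    # Serialize XHTML elements without an 'ns0:' prefix (global setting).
    ET.register_namespace("", XHTML_NS)

    root = ET.fromstring(hocr)
    body = next((c for c in root if _local_name(c.tag) == "body"), None)
    if body is None:
        raise ValueError("hOCR document has no <body> element")

    pages = [el for el in body if "ocr_page" in el.get("class", "").split()]

    documents = []
    for page in pages:
        # Rebuild <html> with the original <head> and a <body> holding one page.
        single = ET.Element(root.tag, root.attrib)
        for child in root:
            if child is body:
                ET.SubElement(single, body.tag, body.attrib).append(page)
            else:
                single.append(child)
        documents.append(ET.tostring(single, encoding="unicode"))
    return documents
```

Each returned string could then be paired with the matching single-page PDF in step 3. The order of the list follows the order of the pages in the input hOCR.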

As you can see, this process is unnecessarily complex and expensive: it could all be done in one step if adding larger hOCR strings from multi-page documents scaled better. I suspect it may be an algorithmic issue in how hOCR data from many pages is handled at once.

I have attached a .NET Framework console application, AsposeHocrIssue.zip (3.4 MB), which can be used to reproduce the issue. The license is not included and will need to be added as AsposeHocr.Properties.Resources.Aspose_Total_NET. When run, the application gets stuck while attempting to add hOCR to the sample PDF.

The input sample PDF and hOCR files are included in the ‘Properties’ directory. The hOCR file was generated by running Tesseract OCR directly on the sample PDF.

Thanks for your support. I hope these performance issues can be solved.

asad.ali · October 4, 2023, 8:12pm

@frimbingpickering

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55630

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.