Hi team,
We noticed that memory usage grows uncontrollably when converting searchable PDFs. We have a 20-page PDF file of 5.43 MB. To reduce memory usage, we split the PDF into single pages and then convert each page to a searchable PDF. One of the pages is a floor plan of a house, similar to this image: 11.png (35.3 KB). When converting this page, memory usage becomes unusually high.
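The split step is done roughly like this (a minimal sketch, assuming the same Aspose.PDF for Java Document/PageCollection API that the conversion code below suggests; the input path and output naming are illustrative):

import com.aspose.pdf.Document;
import com.aspose.pdf.Page;

// Split the 20-page source PDF into single-page PDFs
// ("source.pdf" and the output file names are illustrative).
try (Document source = new Document("source.pdf")) {
    int pageNumber = 1;
    for (Page page : source.getPages()) {
        try (Document singlePage = new Document()) {
            singlePage.getPages().add(page);
            singlePage.save("page_" + pageNumber + ".pdf");
        }
        pageNumber++;
    }
}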
The following is the sample code:
// OCR callback: we skip OCR entirely and return an empty HOCR result for every page image.
Document.CallBackGetHocr cbgh = bufferedImage -> {
    try {
        Logger.Log("return empty HOCR XML");
        return EMPTY_HOCR_XML;
    } catch (Exception e) {
        Logger.Error(e);
        return EMPTY_HOCR_XML;
    } finally {
        // Release the page image as soon as we are done with it.
        bufferedImage.flush();
        bufferedImage = null;
        System.gc();
    }
};

// Convert each single-page PDF to a searchable PDF and save it in place.
for (int i = 0; i < singlePageFilePathList.size(); i++) {
    String singlePageFilePath = singlePageFilePathList.get(i);
    try (Document doc = new Document(singlePageFilePath)) {
        doc.convert(cbgh);
        doc.save(singlePageFilePath);
    } catch (Exception e) {
        Logger.Error(e);
    }
}
We have a machine with 2 GB of free memory, but the 2 GB is used up before the log line Logger.Log("return empty HOCR XML"); ever appears. On a machine with more free memory, the log line does appear, but it is printed more than 5,000 times, so we interrupted the process. We saved each BufferedImage as a TIFF file, and the images in the more than 5,000 TIFF files are all identical:
tiff.png (18.8 KB)
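For reference, the TIFF files were dumped from inside the callback roughly like this (a simplified sketch; the counter, the output directory, and the use of ImageIO's built-in TIFF writer, which requires Java 9 or newer, are illustrative assumptions):

import javax.imageio.ImageIO;
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

// Debug variant of the callback that writes every page image the library
// hands us to disk (counter and output directory are illustrative).
new File("dump").mkdirs();
AtomicInteger dumpCounter = new AtomicInteger();
Document.CallBackGetHocr debugCbgh = bufferedImage -> {
    try {
        File out = new File("dump", dumpCounter.incrementAndGet() + ".tiff");
        ImageIO.write(bufferedImage, "TIFF", out);
        return EMPTY_HOCR_XML;
    } catch (Exception e) {
        Logger.Error(e);
        return EMPTY_HOCR_XML;
    } finally {
        bufferedImage.flush();
    }
};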
Maybe that page really does contain that many identical borders, but is there any way to reduce the memory usage?
Thanks