Hi team,
We noticed that memory usage grows uncontrollably when converting searchable PDFs. We have a 20-page PDF file of 5.43 MB. To reduce memory usage, we split the PDF into single pages and then convert each page to a searchable PDF. One of the pages is a floor plan of a house, similar to this image: 11.png (35.3 KB). When converting this page, memory usage becomes unusually high.
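The split step is done roughly like this (a minimal sketch, assuming the same Aspose.PDF for Java Document/PageCollection API that the conversion code below suggests; the input path and output naming are illustrative):

import com.aspose.pdf.Document;
import com.aspose.pdf.Page;

// Split the 20-page source PDF into single-page PDFs
// ("source.pdf" and the output file names are illustrative).
try (Document source = new Document("source.pdf")) {
    int pageNumber = 1;
    for (Page page : source.getPages()) {
        try (Document singlePage = new Document()) {
            singlePage.getPages().add(page);
            singlePage.save("page_" + pageNumber + ".pdf");
        }
        pageNumber++;
    }
}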
The following is the sample code:
// OCR callback: we skip OCR entirely and return an empty HOCR result for every page image.
Document.CallBackGetHocr cbgh = bufferedImage -> {
    try {
        Logger.Log("return empty HOCR XML");
        return EMPTY_HOCR_XML;
    } catch (Exception e) {
        Logger.Error(e);
        return EMPTY_HOCR_XML;
    } finally {
        // Release the page image as soon as we are done with it.
        bufferedImage.flush();
        bufferedImage = null;
        System.gc();
    }
};

// Convert each single-page PDF to a searchable PDF and save it in place.
for (int i = 0; i < singlePageFilePathList.size(); i++) {
    String singlePageFilePath = singlePageFilePathList.get(i);
    try (Document doc = new Document(singlePageFilePath)) {
        doc.convert(cbgh);
        doc.save(singlePageFilePath);
    } catch (Exception e) {
        Logger.Error(e);
    }
}
We have a machine with 2 GB of free memory, but the 2 GB is used up before the log line Logger.Log("return empty HOCR XML"); ever appears. On a machine with more free memory, the log line does appear, but it is printed more than 5,000 times, so we interrupted the process. We saved each BufferedImage as a TIFF file, and the images in the more than 5,000 TIFF files are all identical:
tiff.png (18.8 KB)
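For reference, the TIFF files were dumped from inside the callback roughly like this (a simplified sketch; the counter, the output directory, and the use of ImageIO's built-in TIFF writer, which requires Java 9 or newer, are illustrative assumptions):

import javax.imageio.ImageIO;
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

// Debug variant of the callback that writes every page image the library
// hands us to disk (counter and output directory are illustrative).
new File("dump").mkdirs();
AtomicInteger dumpCounter = new AtomicInteger();
Document.CallBackGetHocr debugCbgh = bufferedImage -> {
    try {
        File out = new File("dump", dumpCounter.incrementAndGet() + ".tiff");
        ImageIO.write(bufferedImage, "TIFF", out);
        return EMPTY_HOCR_XML;
    } catch (Exception e) {
        Logger.Error(e);
        return EMPTY_HOCR_XML;
    } finally {
        bufferedImage.flush();
    }
};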
Maybe that page really does contain that many identical borders, but is there any way to reduce the memory usage?
Thanks