Large amount of RAM/time needed to convert a scanned PDF with text layer to PDF/A-2a

Hello,

I am converting a scanned PDF with a text layer to PDF/A-2a. I noticed that converting this PDF needs about 1.5 GB of RAM to succeed and takes about 22 minutes. The font used for the text layer is not embedded and is not present on the system, so it is replaced with another font (which is embedded) in order to produce a valid PDF/A-2a.

Given that the original file is only about 41 MB, I was surprised that around 1.5 GB of RAM was needed for the conversion.

Sample code:

try (Document pdf = new Document("scanned-100-input-with-text.pdf")) {
	PdfFormatConversionOptions conversionOptions = new PdfFormatConversionOptions(PdfFormat.PDF_A_2A);

	conversionOptions.setAlignText(true);
	conversionOptions.setAlignStrategy(PdfFormatConversionOptions.SegmentAlignStrategy.RestoreSegmentBounds);

	conversionOptions.getFontEmbeddingOptions().setUseDefaultSubstitution(true);

	ByteArrayOutputStream conversionLog = new ByteArrayOutputStream();
	conversionOptions.setLogStream(conversionLog);

	pdf.convert(conversionOptions);

	pdf.save("scanned-100-output.pdf", new PdfSaveOptions());
}

If we go a step further and substitute certain missing fonts with similar ones before the actual conversion, the memory needed rises to about 2.5 GB of RAM. We perform these substitutions as follows:

// fontSubstitutions maps an original (non-embedded) font name to the name of its replacement
TextFragmentAbsorber absorber = new TextFragmentAbsorber(new TextEditOptions(TextEditOptions.FontReplace.RemoveUnusedFonts));
pdf.getPages().accept(absorber);

TextFragmentCollection textFragments = absorber.getTextFragments();

for (Iterator<TextFragment> iterator = textFragments.iterator(); iterator.hasNext(); ) {
	TextFragment textFragment = iterator.next();

	Font font = textFragment.getTextState().getFont();
	if (!font.isEmbedded() && fontSubstitutions.containsKey(font.getFontName())) {
		Font substituteFont = FontRepository.findFont(fontSubstitutions.get(font.getFontName()));
		textFragment.getTextState().setFont(substituteFont);
	}
}
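For completeness, `fontSubstitutions` above is just a plain map from an original font name to the replacement font name we want to use. A minimal sketch of how such a table could be built (the font names here are purely hypothetical examples, not the ones from our actual documents):

```java
import java.util.HashMap;
import java.util.Map;

class FontSubstitutions {
	// Hypothetical substitution table: non-embedded font name -> replacement font name.
	// The actual entries depend on the fonts found in the scanned document.
	static Map<String, String> build() {
		Map<String, String> fontSubstitutions = new HashMap<>();
		fontSubstitutions.put("GlyphLessFont", "Arial");
		fontSubstitutions.put("Courier", "Courier New");
		return fontSubstitutions;
	}
}
```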

Also, when I tried converting a version of the scanned document without the text layer, only about 200 MB of RAM was needed, and the conversion finished in a couple of seconds.

So I am wondering: is there a way to reduce the amount of RAM/time needed to convert such scanned documents with text?
We limit the RAM for each conversion, so is there also a way to estimate how much RAM a given conversion will need?
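For context, this is roughly how we measure the peak heap usage of a conversion, using the standard `java.lang.management` beans (the Aspose conversion call itself is elided here):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

class PeakHeap {
	// Sums the peak used bytes across all heap memory pools.
	static long peakHeapBytes() {
		long peak = 0;
		for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
			if (pool.getType() == MemoryType.HEAP) {
				peak += pool.getPeakUsage().getUsed();
			}
		}
		return peak;
	}

	public static void main(String[] args) {
		// Reset the peak counters, run the conversion, then read the peaks.
		for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
			pool.resetPeakUsage();
		}
		// ... run pdf.convert(conversionOptions) and pdf.save(...) here ...
		System.out.println("Peak heap: " + (PeakHeap.peakHeapBytes() / (1024 * 1024)) + " MB");
	}
}
```

Note that this only reflects JVM heap; the actual process RSS can be somewhat higher.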

I have uploaded the mentioned documents here: input-files.zip - Google Drive

Using Aspose PDF Java 23.10 on Ubuntu 18.04 and Java 8.

Thank you!

@t.dobreva

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-43262

You can also obtain Paid Support services if you need support on a priority basis, along with direct access to our Paid Support management team.