Unable to extract text from a PDF created with Microsoft Print to PDF

tjuuldlr · March 22, 2024, 6:22am

I have two PDFs that look exactly the same to the eye. One is the original PDF and the other is the original PDF “printed” to a new PDF using “Microsoft Print to PDF”.

I am trying to use Aspose TextAbsorber to extract all the text from the documents.
When I run it on the original PDF, the result is fine.
When I run the same code on the PDF created by “Microsoft Print to PDF”, nothing comes out. I have tried several variations without any luck. In some cases, the TextAbsorber does return something, but it is gibberish.

Does anybody have an explanation of what the problem is, and of what actually happens when you use “Microsoft Print to PDF”?

Regards, Thomas…

ilyazhuykov · March 22, 2024, 7:32am

@tjuuldlr
Could you please share an example of your code and the documents from before and after using “Microsoft Print to PDF” so we can investigate your issue?

tjuuldlr · March 22, 2024, 11:49am

I have uploaded the two documents:

Original document.pdf (445.8 KB)
MS Print to PDF document.pdf (514.8 KB)

This is my code:

    import com.aspose.pdf.Document;
    import com.aspose.pdf.TextAbsorber;

    Document pdfDocument = new Document(pathToPdf.toAbsolutePath().toString());
    TextAbsorber textAbsorber = new TextAbsorber();
    // Run the absorber over every page before reading the result.
    pdfDocument.getPages().accept(textAbsorber);
    String extractedText = textAbsorber.getText();

    log.info(extractedText);

ilyazhuykov · March 22, 2024, 12:14pm

@tjuuldlr
It seems that Microsoft Print to PDF applies a peculiar transformation to the document that converts all content to paths, so no text is left to extract.
In the original, there are actual text fragments that can be read by TextFragmentAbsorber.

document_before_print.png (84.0 KB)
document_after_print.png (83.8 KB)
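
To illustrate the difference between real text and content drawn as paths, here is a minimal sketch that counts text-showing operators (`Tj`, `TJ`, `'`, `"`) and path-painting operators (`f`, `S`, `B` and their variants) in a page content stream. It is not part of the Aspose API: the class `ContentStreamProbe` and the method `countOperators` are hypothetical names, and the sketch assumes you already have the content stream as decoded (uncompressed) text. It also simplifies the PDF syntax by ignoring comments and inline images, so treat the counts as a rough indication only.

```java
import java.util.Set;

// Hypothetical helper, not an Aspose class.
class ContentStreamProbe {
    // Operators that draw text, per the PDF content stream syntax.
    private static final Set<String> TEXT_OPS = Set.of("Tj", "TJ", "'", "\"");
    // Operators that fill or stroke a path.
    private static final Set<String> PATH_PAINT_OPS =
            Set.of("f", "F", "f*", "S", "s", "B", "B*", "b", "b*");

    /** Returns a two-element array: {text-showing operators, path-painting operators}. */
    static int[] countOperators(String decodedStream) {
        StringBuilder stripped = new StringBuilder();
        int depth = 0; // nesting depth inside a (...) string literal
        for (int i = 0; i < decodedStream.length(); i++) {
            char c = decodedStream.charAt(i);
            if (depth > 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                continue; // drop string contents so they are not read as operators
            }
            if (c == '(') {
                depth = 1;
                stripped.append(' ');
            } else if (c == '[' || c == ']' || c == '<' || c == '>') {
                stripped.append(' '); // treat array and hex-string delimiters as separators
            } else {
                stripped.append(c);
            }
        }
        int text = 0;
        int paths = 0;
        for (String token : stripped.toString().trim().split("\\s+")) {
            if (TEXT_OPS.contains(token)) {
                text++;
            } else if (PATH_PAINT_OPS.contains(token)) {
                paths++;
            }
        }
        return new int[] {text, paths};
    }

    public static void main(String[] args) {
        int[] withText = countOperators("BT /F1 12 Tf 72 700 Td (Hello) Tj ET");
        int[] pathsOnly = countOperators("0 0 m 10 0 l 10 10 l h f");
        System.out.println("text page:  text=" + withText[0] + ", paths=" + withText[1]);
        System.out.println("paths page: text=" + pathsOnly[0] + ", paths=" + pathsOnly[1]);
    }
}
```

If a page reports no text-showing operators but many path-painting operators, that is consistent with the text having been converted to outlines, in which case a text absorber has nothing to read.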