PDF text extraction adds an extra carriage return (2 total) when there is only 1

brissonp · November 1, 2024, 2:42pm

Hi, I am using Aspose.PDF version 24.9 to extract text from a PDF. For certain types of PDF, the extracted text contains two carriage returns where the original PDF has only one.

The attached document reproduces the issue. The extracted text is as follows:

John Smith

John Doe

I would expect:

John Smith
John Doe

The code used to extract the text is as follows:

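        import com.aspose.pdf.Document;
        import com.aspose.pdf.TextAbsorber;
        import com.aspose.pdf.TextExtractionOptions;

        // Load the PDF and configure the absorber for "Pure" formatting mode.
        Document doc = new Document(dataDir + "asposeSupport_3.pdf");
        TextAbsorber textAbsorber = new TextAbsorber();
        TextExtractionOptions options =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textAbsorber.setExtractionOptions(options);
        // Visit every page and collect the extracted text.
        doc.getPages().accept(textAbsorber);
        String content = textAbsorber.getText();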

asposeSupport_3.pdf (135.5 KB)

Thanks

asad.ali · November 1, 2024, 9:39pm

@brissonp

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44463

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
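
Until PDFJAVA-44463 is resolved, one possible workaround is to post-process the extracted string rather than change the extraction itself. The sketch below is plain Java and does not call any Aspose API. It assumes the extra line break always shows up as an empty (or whitespace-only) line, and that the document has no intentional blank lines worth keeping. `ExtractedTextCleanup` and `collapseBlankLines` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class ExtractedTextCleanup {

    // Hypothetical helper: drops empty or whitespace-only lines so that
    // doubled line breaks collapse into a single "\n".
    // Note: this also removes blank lines that were intentional in the source.
    static String collapseBlankLines(String text) {
        // "\\R" matches any line-break sequence (\r\n, \n, \r, ...), Java 8+.
        return Arrays.stream(text.split("\\R"))
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by textAbsorber.getText().
        String content = "John Smith\r\n\r\nJohn Doe";
        System.out.println(collapseBlankLines(content));
    }
}
```

In the original snippet this would be applied to the result of `textAbsorber.getText()`. Because the helper normalizes every line break to `\n` and drops any trailing line break, it may not suit documents where paragraph spacing matters.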