PDF to Text conversion differs between Linux and Windows

brissonp · October 7, 2024, 1:15pm

Hi,

We are extracting text from a PDF, and the output format differs between Windows and Linux. The exact output format is important for us, so even a difference in spacing matters. I validated that on both Windows and Linux, Aspose finds the fonts used by the PDF.

Here are the two different outputs for the attached file:

Windows scraping:

Test Document\n\n\nList:\nItem 1 \nItem 2 \nItem 3\nItem 4\n\n\n\n\n\n\n

Linux scraping:

Test Document\n\n\nList:\n Item 1 \n Item 2 \n Item 3\n Item 4\n\n\n\n\n\n\n
HTML version (because HTML collapses multiple spaces into one):
Test Document\n\n\nList:\n Item 1 \n Item 2 \n Item 3\n Item 4\n\n\n\n\n\n\n

Thanks
Aspose_Support.pdf (58.1 KB)
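
Because spacing differences like the ones above are easy to lose when text is pasted into HTML, it can help to make whitespace visible before comparing the two outputs. The following is a minimal sketch using only the Java standard library; the class and method names (`WhitespaceDiff`, `visualize`, `firstDifference`) are hypothetical helpers, not part of Aspose.PDF, and the sample strings are shortened versions of the outputs quoted in this post.

```java
// Hypothetical helper, not part of Aspose.PDF: makes whitespace visible
// so that two extraction results can be compared character by character.
final class WhitespaceDiff {

    /** Replaces spaces, tabs, and line breaks with visible markers. */
    static String visualize(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case ' ':  sb.append('\u00B7'); break; // middle dot marks a space
                case '\t': sb.append("\\t"); break;
                case '\r': sb.append("\\r"); break;
                case '\n': sb.append("\\n"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns the index of the first differing character, or -1 if equal. */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Shortened samples based on the outputs quoted above.
        String windows = "List:\nItem 1 \nItem 2 \n";
        String linux = "List:\n Item 1 \n Item 2 \n";
        System.out.println(visualize(windows));
        System.out.println(visualize(linux));
        System.out.println("first difference at index "
                + firstDifference(windows, linux));
    }
}
```

Running the same comparison on the full `textAbsorber.getText()` result from each platform should show exactly how many spaces precede each list item.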

asad.ali · October 7, 2024, 7:20pm

@brissonp

If possible, could you please share which API version you are using, along with a sample code snippet, so that we can proceed with the investigation? Also, please share the complete version information of the Linux OS.
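
One possible way to gather the requested environment details is to print the standard JVM system properties on each machine. This is only a sketch: `EnvironmentReport` is a hypothetical class name, it does not report the Aspose.PDF version, and `os.version` typically gives the kernel version on Linux rather than the distribution release, so distribution details would still need to be collected separately.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects standard JVM system properties that
// describe the OS and Java runtime, for inclusion in a support request.
final class EnvironmentReport {

    static Map<String, String> collect() {
        String[] keys = {
            "os.name", "os.version", "os.arch",
            "java.version", "java.vendor", "file.encoding"
        };
        Map<String, String> info = new LinkedHashMap<>();
        for (String key : keys) {
            // Fall back to "unknown" if a property is not set on this JVM.
            info.put(key, System.getProperty(key, "unknown"));
        }
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```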

brissonp · October 10, 2024, 2:59pm

Hi

Using Aspose.PDF 24.9

And here is the code snippet

        // Load the PDF and extract its text in "Pure" formatting mode.
        Document doc = new Document(filePath);
        TextAbsorber textAbsorber = new TextAbsorber();
        TextExtractionOptions options =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textAbsorber.setExtractionOptions(options);
        // Visit every page, then read the accumulated text.
        doc.getPages().accept(textAbsorber);
        content = textAbsorber.getText();

asad.ali · October 10, 2024, 7:03pm

@brissonp

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-58333

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.