Hi,
We are extracting text from a PDF, and the output differs between Windows and Linux. The exact output format matters to us, so even a difference in spacing is significant. I verified that Aspose finds the “fonts” used by the PDF on both Windows and Linux.
Here are the two different outputs for the attached file:
Windows scraping:
Test Document\n\n\nList:\nItem 1 \nItem 2 \nItem 3\nItem 4\n\n\n\n\n\n\n
Linux scraping:
Test Document\n\n\nList:\n Item 1 \n Item 2 \n Item 3\n Item 4\n\n\n\n\n\n\n
HTML version (because HTML collapses multiple spaces into one):
Test Document\n\n\nList:\n Item 1 \n Item 2 \n Item 3\n Item 4\n\n\n\n\n\n\n
Thanks
Aspose_Support.pdf (58.1 KB)
@brissonp
If possible, please share which API version you are using, along with a sample code snippet, so that we can investigate further. Also, please share the complete version information of your Linux OS.
Hi
We are using Aspose.PDF 24.9.
Here is the code snippet:
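import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import com.aspose.pdf.TextExtractionOptions;

// Load the PDF from filePath.
Document doc = new Document(filePath);
TextAbsorber textAbsorber = new TextAbsorber();
// Extract using the Pure formatting mode.
TextExtractionOptions options = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
textAbsorber.setExtractionOptions(options);
// Run the absorber over every page, then read the collected text.
doc.getPages().accept(textAbsorber);
content = textAbsorber.getText();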
@brissonp
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-58333
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
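Until a fix is delivered, one possible stopgap is to post-process the extracted text so that both platforms produce the same string. The sketch below is not part of Aspose.PDF: `WhitespaceNormalizer` and `stripLeadingWhitespace` are hypothetical names, and it assumes that leading spaces and tabs on each line carry no meaning for your use case and that lines end in `\n`, as in the outputs posted above. It uses only the Java standard library and would be applied to the result of `textAbsorber.getText()`.

```java
import java.util.regex.Pattern;

public class WhitespaceNormalizer {
    // Matches a run of spaces or tabs at the start of any line.
    // MULTILINE makes ^ match after each line terminator, not just at the start of input.
    private static final Pattern LEADING_WHITESPACE =
            Pattern.compile("^[ \\t]+", Pattern.MULTILINE);

    // Hypothetical helper: removes leading spaces/tabs from every line,
    // leaving trailing spaces and blank lines untouched.
    public static String stripLeadingWhitespace(String text) {
        return LEADING_WHITESPACE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The two outputs reported in this thread.
        String windows = "Test Document\n\n\nList:\nItem 1 \nItem 2 \nItem 3\nItem 4\n\n\n\n\n\n\n";
        String linux = "Test Document\n\n\nList:\n Item 1 \n Item 2 \n Item 3\n Item 4\n\n\n\n\n\n\n";

        // After normalization the Linux output matches the Windows output.
        System.out.println(stripLeadingWhitespace(linux).equals(windows)); // prints true
    }
}
```

If the indentation itself is significant in your documents, this approach would discard it, so it only suits cases where the Windows output is the desired format.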