Free Support Forum - aspose.com

Garbage text returned during PDF extraction

Converting the HTML file in the attached bug.html.zip (3.3 KB) to a PDF with Safari 14.1.2 or Firefox 92.0, and then extracting the text content from that PDF, produces a range of garbage characters.

Expected
Clean text extraction from the PDF.

Actual
Clean text and lots of surrounding garbage characters.
Screen Shot 2021-09-14 at 11.26.50 am.png (7.2 KB)

Code

// Apply the Aspose.PDF licence from an in-memory string.
License pdfLicence = new License();
try {
  pdfLicence.setLicense(new ByteArrayInputStream(
          LICENCE_DATA.getBytes(StandardCharsets.UTF_8)));
} catch (Exception ex) {
  // Without a valid licence there is no point continuing.
  System.out.println(ex);
  return;
}
// Load the browser-generated PDF and extract the text from all pages.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFilename);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);
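
To pin down exactly which characters are unexpected, the extracted string could be scanned code point by code point. The following is only a minimal sketch using the Java standard library. The class name ExtractedTextInspector and its methods are hypothetical and not part of Aspose.PDF, and it assumes the garbage consists of control, private-use, unassigned, or replacement characters, which may not hold for these PDFs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of the Aspose.PDF API.
public class ExtractedTextInspector {

    // Assumption: "garbage" means control characters (other than tab, LF, CR),
    // private-use or unassigned code points, lone surrogates, or U+FFFD.
    public static boolean isSuspicious(int codePoint) {
        if (codePoint == '\n' || codePoint == '\r' || codePoint == '\t') {
            return false;
        }
        if (codePoint == 0xFFFD) {
            return true;
        }
        int type = Character.getType(codePoint);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE;
    }

    // Lists every suspicious code point with its char offset in the string.
    public static List<String> describeSuspicious(String text) {
        List<String> findings = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int codePoint = text.codePointAt(index);
            if (isSuspicious(codePoint)) {
                findings.add(String.format("offset %d: U+%04X", index, codePoint));
            }
            // Advance by one or two chars depending on the code point width.
            index += Character.charCount(codePoint);
        }
        return findings;
    }

    public static void main(String[] args) {
        String sample = "Hello\u0001 world\uE000";
        describeSuspicious(sample).forEach(System.out::println);
    }
}
```

Passing extractedText from the snippet above to describeSuspicious would give a list of offsets and code points that could be attached to the report alongside the screenshot.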

@bcrowhurst

Your screenshot differs from the output we are getting, so could you please share your PDF files so that we can investigate?

Please see the attached bug-firefox.pdf (70.7 KB) and bug-safari.pdf (75.5 KB).

@bcrowhurst

A ticket with ID PDFJAVA-40867 has been created in our issue tracking system to investigate the issue further on our end. This thread has been linked to the ticket so that you will be notified once the issue is fixed.