Garbage text returned during PDF extraction

bcrowhurst · September 14, 2021, 1:27am

Converting the attached bug.html.zip (3.3 KB) to a PDF with Safari 14.1.2 or Firefox 92.0 and then extracting the text content produces a range of garbage characters.

Expected
Clean text extraction from the PDF.

Actual
Clean text and lots of surrounding garbage characters.
Screen Shot 2021-09-14 at 11.26.50 am.png (7.2 KB)

Code

// Apply the Aspose.PDF licence; bail out if it cannot be loaded
License pdfLicence = new License();
try {
  pdfLicence.setLicense(new ByteArrayInputStream(
          LICENCE_DATA.getBytes(StandardCharsets.UTF_8)));
} catch (Exception ex) {
  System.out.println(ex);
  return;
}
// Open the browser-generated PDF and extract the text of all pages
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFilename);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

mudassir.fayyaz · September 14, 2021, 2:10pm

@bcrowhurst

Your screenshot differs from the output we are getting. Could you please share your PDF file so that we can investigate?

bcrowhurst · September 14, 2021, 11:05pm

Please see the attached bug-firefox.pdf (70.7 KB) and bug-safari.pdf (75.5 KB).

mudassir.fayyaz · September 15, 2021, 12:59pm

@bcrowhurst

A ticket with ID PDFJAVA-40867 has been created in our issue tracking system to investigate the issue further on our end. This thread has been linked to the ticket so that you will be notified once the issue is fixed.

asad.ali · March 12, 2022, 11:59pm

@bcrowhurst

We have investigated the earlier logged ticket. The garbage text is caused by broken fonts in the document: some glyphs look like one character on the page, for example the letter “A”, but are mapped to a different character underneath, for example “#”. Sometimes this is done deliberately to obscure the content and protect it from being copied.
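
To illustrate the mismatch described above, here is a minimal sketch in plain Java. It does not use Aspose.PDF or read a real font; the glyph IDs and both mapping tables are hypothetical and only model the idea that a text extractor returns whatever character the font's mapping reports, regardless of what the glyph looks like on the page.

```java
import java.util.Map;

public class BrokenFontDemo {

    // Translates glyph IDs to text using the font's glyph-to-character table.
    // The IDs and tables used here are made up for illustration only.
    static String decode(int[] glyphIds, Map<Integer, String> toCharacter) {
        StringBuilder text = new StringBuilder();
        for (int id : glyphIds) {
            // Unmapped glyphs become the Unicode replacement character
            text.append(toCharacter.getOrDefault(id, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Three glyphs that are drawn on the page as "A", "B", "C"
        int[] glyphs = {1, 2, 3};

        // A healthy font maps each glyph to the character it looks like
        Map<Integer, String> healthy = Map.of(1, "A", 2, "B", 3, "C");
        // A broken (or deliberately obfuscated) font maps them to something else
        Map<Integer, String> broken = Map.of(1, "#", 2, "%", 3, "&");

        System.out.println(decode(glyphs, healthy)); // prints ABC
        System.out.println(decode(glyphs, broken));  // prints #%&
    }
}
```

The page renders identically in both cases, which is why the PDF looks fine in a viewer while the extracted text is garbage.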

You can verify this by copying the text from the PDF in Adobe Acrobat: you will get the same result.
The only way to recover the original text from such documents is to convert the pages into images and run them through an OCR engine.

To do this, create a callback that recognizes the text in the PDF page images. Use an external OCR engine that supports the hOCR standard (hOCR - Wikipedia), for example the free Tesseract OCR from Google (Tesseract (software) - Wikipedia).
Java Examples for com.aspose.pdf.Document.CallBackGetHocr
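
As a rough sketch of the OCR side of such a callback, the helper below writes a page image to a temporary PNG and runs the Tesseract command-line tool with its `hocr` config, reading the hOCR markup from standard output. It uses only the JDK and assumes a `tesseract` executable is installed; the class and method names are hypothetical. How the helper is wired into Aspose.PDF is an assumption based on the linked examples: the body of a `Document.CallBackGetHocr` implementation would call `recognize(...)` on the image it receives and return the resulting string.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import javax.imageio.ImageIO;

public class TesseractHocr {

    // Builds the command line: tesseract <image> stdout hocr
    // "stdout" makes Tesseract print the result instead of writing a file,
    // and "hocr" selects hOCR output.
    static List<String> buildCommand(String tesseractPath, File image) {
        return List.of(tesseractPath, image.getAbsolutePath(), "stdout", "hocr");
    }

    // Runs Tesseract on one page image and returns the hOCR markup.
    // Assumption: this is what an Aspose.PDF hOCR callback would return.
    static String recognize(BufferedImage page, String tesseractPath)
            throws IOException, InterruptedException {
        File tmp = File.createTempFile("page", ".png");
        try {
            ImageIO.write(page, "png", tmp);
            Process process = new ProcessBuilder(buildCommand(tesseractPath, tmp))
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore progress messages
                    .start();
            String hocr = new String(process.getInputStream().readAllBytes(),
                    StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return hocr;
        } finally {
            tmp.delete(); // best-effort cleanup of the temporary image
        }
    }
}
```

Calling the external program keeps the sketch free of extra Java dependencies; a Java binding for Tesseract would work just as well if you already use one.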

Alternatively, you can convert the HTML to PDF with Aspose.PDF itself, provided the fonts on the system are not broken in the way described above.

// Convert the HTML file to PDF with Aspose.PDF
HtmlLoadOptions htmloptions = new HtmlLoadOptions();
Document doc = new Document(dataDir+"bug.html/bug.html", htmloptions);
doc.save(dataDir+"Output_aspose_pdf_22_2_.pdf");

// Extract the text from the PDF generated above
String id = "Output_aspose_pdf_22_2_";
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(dataDir + id + ".pdf");
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

Another solution is to use an alternative PDF print plugin in the browser that converts the HTML to PDF. This is the document saved from our Firefox browser: (bug-firefox_my.pdf)

bug-firefox_my.pdf (166.9 KB)
Output_aspose_pdf_22_2_.pdf (542.7 KB)