@bcrowhurst
We have investigated the earlier logged ticket. The garbage text appears because the document uses broken fonts: some glyphs look like the letter “A” but are mapped to a different character, such as “#”. This is sometimes done deliberately to encode the content and protect it from copying.
You can verify this by copying the text in Adobe Acrobat; you will get the same result.
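To illustrate why the copied text differs from what is shown on screen, here is a minimal toy model in plain Java. It does not use the Aspose.PDF API; the class `BrokenFontDemo`, the method `extract`, and the sample mapping are hypothetical names and values chosen only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a font with a broken glyph-to-character mapping (not the Aspose.PDF API).
class BrokenFontDemo {
    // visibleText: the shapes a reader sees on the page.
    // glyphToCode: the character code the font stores for each shape (hypothetical data).
    static String extract(String visibleText, Map<Character, Character> glyphToCode) {
        StringBuilder out = new StringBuilder();
        for (char shape : visibleText.toCharArray()) {
            // Copy/extract returns the stored code, not the drawn shape.
            out.append(glyphToCode.getOrDefault(shape, shape));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> broken = new LinkedHashMap<>();
        broken.put('A', '#'); // looks like "A", reports "#"
        broken.put('B', '%'); // looks like "B", reports "%"
        System.out.println(extract("ABBA", broken)); // prints "#%%#"
    }
}
```

Any tool that reads the stored codes, including a text extractor or a copy operation in a viewer, will see the same garbage, which is why only the rendered image still carries the original text.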
The only way to recover the original text from such documents is to convert the pages to images and run them through an OCR engine.
To do this, create a callback that recognizes the text in the PDF page images. You can use an external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR), for example the free Google Tesseract OCR (http://en.wikipedia.org/wiki/Tesseract_(software)). See the CallBackGetHocr examples:
https://www.javatips.net/api/com.aspose.pdf.document.callbackgethocr
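As a rough sketch of the OCR step such a callback could delegate to, the helper below writes a page image to a temporary PNG and runs the Tesseract command-line tool with its `hocr` output config. It assumes that `tesseract` is installed and on the PATH, and that it writes `<outputBase>.hocr` (older releases may use a different extension). The class and method names are hypothetical, and the Aspose.PDF wiring itself is not shown here.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper: runs the Tesseract CLI and returns hOCR markup for one page image.
class TesseractHocr {
    // Builds the command line: tesseract <image> <outputBase> hocr
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    // Assumption: "tesseract" is on the PATH and writes <outputBase>.hocr.
    static String recognize(BufferedImage pageImage) throws IOException, InterruptedException {
        // Save the page image so the external tool can read it.
        Path png = Files.createTempFile("page", ".png");
        ImageIO.write(pageImage, "png", png.toFile());

        Path outputBase = Files.createTempFile("ocr", "");
        Process process = new ProcessBuilder(buildCommand(png, outputBase))
                .inheritIO()
                .start();
        if (process.waitFor() != 0) {
            throw new IOException("tesseract failed for " + png);
        }
        // Read the hOCR result produced next to the output base name.
        return Files.readString(Path.of(outputBase + ".hocr"), StandardCharsets.UTF_8);
    }
}
```

A callback like the ones on the linked page would call `TesseractHocr.recognize` for each page image and return the resulting hOCR string. The temporary files are left in place here for brevity; production code would delete them.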
Alternatively, you can use the Aspose.PDF HTML -> PDF converter, provided the fonts on the system are not broken as described above.
// Convert the HTML file to PDF
HtmlLoadOptions htmloptions = new HtmlLoadOptions();
Document doc = new Document(dataDir + "bug.html/bug.html", htmloptions);
doc.save(dataDir + "Output_aspose_pdf_22_2_.pdf");

// Reopen the generated PDF and extract its text to check that it is readable
String id = "Output_aspose_pdf_22_2_";
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(dataDir + id + ".pdf");
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);
Another solution is to use an alternative PDF print plugin in the browser where the HTML is converted to PDF. This is the document saved from our Firefox browser: (bug-firefox_my.pdf)
bug-firefox_my.pdf (166.9 KB)
Output_aspose_pdf_22_2_.pdf (542.7 KB)