When parsing the attached bug.pdf (205.5 KB), we see content wrap onto a new line for no apparent reason. Note that the lines that wrap aren't even the longest in the file; however, they are always in the email To, From, CC, or Subject block.
The PDF was generated from the attached bug.html.zip (3.3 KB) file via the Chrome 93.0.4577.63 PDF print functionality and parsed with Aspose 21.8.
Expected
...
To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd
...
Actual
...
To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo
hn (Jane); Cohen, Todd
...
Code
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import com.aspose.pdf.License;

// Apply the license; LICENCE_DATA is the license content as a String.
License pdfLicence = new License();
try {
    pdfLicence.setLicense(new ByteArrayInputStream(
            LICENCE_DATA.getBytes(StandardCharsets.UTF_8)));
} catch (Exception ex) {
    System.out.println(ex);
    return;
}

// Load the PDF and extract the text of all pages.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFilename);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);
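
In case it helps with reproducing this, here is a minimal sketch of how the spurious break can be located programmatically by comparing the Expected and Actual text above. It uses only the Java standard library and does not call Aspose at all; the class name WrapFinder and the method firstInsertedBreak are hypothetical names made up for this illustration.

```java
public class WrapFinder {

    /**
     * Returns the index in `expected` at which `actual` contains a line break
     * that `expected` does not have. Returns -1 if no such break is found
     * before either text ends, or if the texts differ in some other way first.
     */
    public static int firstInsertedBreak(String expected, String actual) {
        int i = 0; // position in expected
        int j = 0; // position in actual
        while (i < expected.length() && j < actual.length()) {
            char e = expected.charAt(i);
            char a = actual.charAt(j);
            if (e == a) {
                i++;
                j++;
            } else if (a == '\n' || a == '\r') {
                // actual has a line break where expected continues the line
                return i;
            } else {
                // some other difference; not the wrapping problem
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The To: line from the report, as expected and as actually extracted.
        String expected = "To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd";
        String actual = "To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo\nhn (Jane); Cohen, Todd";

        int at = firstInsertedBreak(expected, actual);
        if (at >= 0) {
            // Show where the line was split, using " | " as a marker.
            System.out.println(expected.substring(0, at) + " | " + expected.substring(at));
        } else {
            System.out.println("No inserted line break found");
        }
    }
}
```

In real use, the extractedText value from the snippet above would be compared line by line against the known-good header text, which should make it easy to confirm whether the break always lands mid-word in the To, From, CC, or Subject lines.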