Text wrapping appearing when parsing PDF

bcrowhurst · September 14, 2021, 12:46am

When parsing the attached bug.pdf (205.5 KB), we see content wrap onto a new line for no apparent reason. Note that the lines that wrap aren’t even the longest in the file; however, they are always in the email To, From, CC, or Subject block.

The PDF was generated from the attached bug.html.zip (3.3 KB) file using the PDF print functionality in Chrome 93.0.4577.63 and parsed with Aspose 21.8.

Expected

...
To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd
...

Actual

...
To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo
hn (Jane); Cohen, Todd
...
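
For anyone comparing output like the Expected and Actual samples above, the following is a minimal sketch of a check that the two strings differ only by inserted line breaks. It uses only the Java standard library; the class name WrapCheck and its helper methods are hypothetical and are not part of Aspose.

```java
public class WrapCheck {
    // Hypothetical helper: removes every line break (\R matches any
    // line-break sequence in Java 8+ regular expressions).
    static String stripLineBreaks(String text) {
        return text.replaceAll("\\R", "");
    }

    // True when the strings are not identical, but become identical
    // once all line breaks are removed from both.
    static boolean differsOnlyByLineBreaks(String expected, String actual) {
        return !expected.equals(actual)
                && stripLineBreaks(expected).equals(stripLineBreaks(actual));
    }

    public static void main(String[] args) {
        String expected =
                "To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd";
        String actual =
                "To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo\nhn (Jane); Cohen, Todd";
        System.out.println(differsOnlyByLineBreaks(expected, actual)); // prints "true"
    }
}
```

A result of true suggests that no characters were lost or reordered during extraction, only that an extra line break was inserted.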

Code

// Apply the Aspose license from an in-memory string before parsing.
License pdfLicence = new License();
try {
  pdfLicence.setLicense(new ByteArrayInputStream(
          LICENCE_DATA.getBytes(StandardCharsets.UTF_8)));
} catch (Exception ex) {
  System.out.println(ex);
  return;
}
// Load the PDF and extract the text of all pages with the default options.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFilename);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

mudassir.fayyaz · September 14, 2021, 2:29pm

@bcrowhurst

Please try the following code, which sets the text formatting mode to Raw, and share your feedback.

// Use the Raw text formatting mode instead of the default.
TextExtractionOptions options = new TextExtractionOptions(TextFormattingMode.Raw);
Document pdfDocument = new Document("bug.pdf");
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
// Apply the extraction options before visiting the pages.
textAbsorber.setExtractionOptions(options);
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

bcrowhurst · September 14, 2021, 11:09pm

Thanks, that solution looks to have resolved the issue.

mudassir.fayyaz · September 15, 2021, 1:26pm

@bcrowhurst

It’s good to know that the suggested option works on your end.