Free Support Forum - aspose.com

Text wrapping appearing when parsing PDF

When parsing the attached bug.pdf (205.5 KB), we see content wrap onto a new line for no apparent reason. Note that the lines that wrap aren’t even the longest in the file; however, they are always in the email To, From, CC, or Subject block.

The PDF is generated from the attached bug.html.zip (3.3 KB) file via the PDF print functionality in Chrome 93.0.4577.63 and parsed with Aspose 21.8.

Expected

...
To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd
...

Actual

...
To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo
hn (Jane); Cohen, Todd
...

Code

// Apply the license before loading any document.
License pdfLicence = new License();
try {
  pdfLicence.setLicense(new ByteArrayInputStream(
          LICENCE_DATA.getBytes(StandardCharsets.UTF_8)));
} catch (Exception ex) {
  System.out.println(ex);
  return;
}
// Load the PDF and extract the text of all pages with a default TextAbsorber.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(pdfFilename);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

@bcrowhurst

Please try the following code and share your feedback.

TextExtractionOptions options = new TextExtractionOptions(TextFormattingMode.Raw);
Document pdfDocument = new Document("bug.pdf");
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
textAbsorber.setExtractionOptions(options);
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
System.out.println(extractedText);

Thanks, that solution appears to have resolved the issue.

@bcrowhurst

It’s good to know that the suggested option worked on your end.
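
For anyone who wants to confirm the behaviour on their own files, here is a minimal sketch of a check that the expected header line comes out as a single unbroken line. It uses only plain Java and does not call Aspose. The class name WrapCheck and the helper containsLine are hypothetical names made up for this illustration, and the sample strings are taken from the Expected and Actual output in the original post.

```java
import java.util.Arrays;

class WrapCheck {
    // Hypothetical helper (not part of Aspose): returns true only if
    // expectedLine appears as one complete line of extractedText.
    static boolean containsLine(String extractedText, String expectedLine) {
        return Arrays.stream(extractedText.split("\\R")) // split on any line break
                .map(String::trim)                        // ignore surrounding whitespace
                .anyMatch(line -> line.equals(expectedLine));
    }

    public static void main(String[] args) {
        String expected =
                "To: Schnapps, Bob; Polan, Roosevalet; Bert, John (Jane); Cohen, Todd";
        // The "Actual" output from the original post, with the unwanted break.
        String wrapped =
                "To: Schnapps, Bob; Polan, Roosevalet; Bert, Jo\nhn (Jane); Cohen, Todd\n";
        // The "Expected" output, with the header on one line.
        String intact = expected + "\n";

        System.out.println(containsLine(wrapped, expected)); // false
        System.out.println(containsLine(intact, expected));  // true
    }
}
```

In practice you would pass the string returned by textAbsorber.getText() as the first argument instead of the sample strings.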