Hi there,
we noticed that in this version, on some PDF documents, paragraph.getText() from the ParagraphAbsorber markup returns too few whitespace characters, whereas the segments and fragments are fine (example code below). I will attach a PDF so that you can reproduce the issue.
// Snippet from our extraction code; `val` and `log` come from Lombok
// (@Slf4j on the enclosing class), `page` is a com.aspose.pdf.Page.
val textBoxes = new ArrayList<String>();
try {
    val paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(page);
    for (val markup : paragraphAbsorber.getPageMarkups()) {
        for (val section : markup.getSections()) {
            for (val paragraph : section.getParagraphs()) {
                // String text = paragraph.getText(); // this misses some whitespace for unknown reasons
                // Workaround: rebuild the paragraph text from the individual segments instead.
                StringBuilder sb = new StringBuilder();
                for (val fragment : paragraph.getFragments()) {
                    for (val segment : fragment.getSegments()) {
                        sb.append(' ').append(segment.getText()).append(' ');
                    }
                    sb.append(' ');
                }
                String text = sb.toString();
                // Normalize: strip control characters (except \r, \n, \t) and collapse runs of spaces.
                text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]+", " ");
                text = text.replaceAll(" {2,}", " ");
                text = text.trim();
                textBoxes.add(text);
            }
        }
    }
    return textBoxes;
} catch (Exception e) {
    log.warn("error parsing page {}. returning {} textboxes", page.getNumber(), textBoxes.size(), e);
    return textBoxes;
}
Info about the system environment:
----Java System Properties----------------
java.vm.name: OpenJDK 64-Bit Server VM
java.vm.vendor: Ubuntu
java.vm.version: 11.0.8+10-post-Ubuntu-0ubuntu120.04
java.runtime.name: OpenJDK Runtime Environment
java.runtime.version: 11.0.8+10-post-Ubuntu-0ubuntu120.04
os.name: Linux
os.arch: amd64
java.io.tmpdir: /tmp
file.encoding: UTF-8
sun.io.unicode.encoding: UnicodeLittle
sun.cpu.endian: little
Available processors (cores): 12
Free memory (MBytes): 600
Maximum memory (MBytes): 3916 MB
Total memory available to JVM (MBytes): 867 MB
File system root: /
Total: 477500 MB; used: 182761 MB; available: 294738 MB
Aspose lib versions:
Aspose.Pdf for Java : 20.8
Greetings from Berlin
Helge Rennicke
Autonomous Vehicle Implementation Predictions.pdf (807.0 KB)