Aspose.Pdf.Java 20.8 ParagraphAbsorber paragraph.getText() returns too few whitespaces

rennicke · October 5, 2020, 8:01am

hi there,

We noticed that in this version, on some PDF documents, the ParagraphAbsorber's paragraph.getText() returns too few whitespaces, whereas the segments and fragments are fine (example code below). I will attach a PDF so that you can reproduce the issue.

val textBoxes = new ArrayList<String>();
try {
    val paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(page);
    for (val markup : paragraphAbsorber.getPageMarkups()) {
        for (val section : markup.getSections()) {
            for (val paragraph : section.getParagraphs()) {
                StringBuilder sb = new StringBuilder();
                // String text = paragraph.getText(); // this seems to be missing some whitespace for unknown reasons
                // workaround: rebuild the text from the segments of each fragment
                for (val fragment : paragraph.getFragments()) {
                    for (val segment : fragment.getSegments()) {
                        sb.append(" ").append(segment.getText()).append(" ");
                    }
                    sb.append(" ");
                }
                String text = sb.toString();
                // replace control characters other than CR, LF and TAB with a space
                text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]+", " ");
                // collapse every run of spaces into a single space
                text = text.replaceAll(" {2,}", " ");
                text = text.trim();
                textBoxes.add(text);
            }
        }
    }
    return textBoxes;
} catch (Exception e) {
    log.warn("error parsing page {}. returning {} textboxes", page.getNumber(), textBoxes.size(), e);
    return textBoxes;
}
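
In case it helps with reproducing, the whitespace handling in the workaround above can be isolated from Aspose entirely. This is only a minimal sketch, assuming the segment texts have already been collected into a list of strings; the class name SegmentJoiner and the method joinSegments are hypothetical and not part of Aspose.PDF:

```java
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: it only mirrors the
// string handling of the workaround, without touching the PDF API.
final class SegmentJoiner {

    // Joins already-extracted segment texts with spaces and normalizes whitespace.
    static String joinSegments(List<String> segmentTexts) {
        StringBuilder sb = new StringBuilder();
        for (String segmentText : segmentTexts) {
            sb.append(segmentText).append(' ');
        }
        String text = sb.toString();
        // Replace control characters other than CR, LF and TAB with a space.
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]+", " ");
        // Collapse every run of spaces into a single space.
        text = text.replaceAll(" {2,}", " ");
        // Drop leading and trailing whitespace.
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(joinSegments(List.of("Autonomous ", " Vehicle", "Implementation")));
    }
}
```

Note that this always puts a space between segments, so it can also split words that the PDF stores as several segments; it is a stopgap until paragraph.getText() behaves correctly again.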

info about system environment:
----Java System Properties----------------
java.vm.name: OpenJDK 64-Bit Server VM
java.vm.vendor: Ubuntu
java.vm.version: 11.0.8+10-post-Ubuntu-0ubuntu120.04
java.runtime.name: OpenJDK Runtime Environment
java.runtime.version: 11.0.8+10-post-Ubuntu-0ubuntu120.04
os.name: Linux
os.arch: amd64
java.io.tmpdir: /tmp
file.encoding: UTF-8
sun.io.unicode.encoding: UnicodeLittle
sun.cpu.endian: little
Available processors (cores): 12
Free memory (bytes): 600 MB
Maximum memory (MBytes): 3916 MB
Total memory available to JVM (MBytes): 867 MB
File system root: /
Total: 477500 MB; used: 182761 MB; available: 294738 MB
Aspose lib versions:
Aspose.Pdf for Java : 20.8

Greets from Berlin
Helge Rennicke

Autonomous Vehicle Implementation Predictions.pdf (807.0 KB)

asad.ali · October 5, 2020, 7:28pm

@rennicke

We were able to reproduce the issue in our environment with Aspose.PDF for Java 20.9 and have logged it as PDFJAVA-39832 in our issue tracking system for further investigation. We will look into the details and keep you posted on the status of the fix. Please bear with us in the meantime.

We are sorry for the inconvenience.