Paragraph Absorber with MultiColumn Sections not working

sraab1998 · September 7, 2022, 11:24am

Hi,
I am currently looking for a new framework to extract text from PDF files. We need to process multi-column files, hopefully with only two columns. I followed one of the code examples, but the paragraphs come out in the wrong order, so I am unable to find where the last section of the first column overflows into the first section of the second column.

// "page" is the com.aspose.pdf.Page being processed (obtained elsewhere).
ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.visit(page);

for (PageMarkup markup : absorber.getPageMarkups()) {
    int i = 0; // section index on this page
    markup.setMulticolumnParagraphsAllowed(true);

    for (MarkupSection section : markup.getSections()) {
        int j = 0; // paragraph index within this section

        for (MarkupParagraph paragraph : section.getParagraphs()) {
            StringBuilder paragraphText = new StringBuilder();

            // Each line is a list of text fragments; join them with spaces.
            for (java.util.List<TextFragment> line : paragraph.getLines()) {
                for (TextFragment fragment : line) {
                    paragraphText.append(fragment.getText()).append(" ");
                }
                paragraphText.append("\r\n");
            }
            paragraphText.append("\r\n");

            System.out.println("Paragraph " + j + " of section " + i + " on page " + markup.getNumber());
            System.out.println(paragraphText.toString());

            j++;
        }
        i++;
    }
}

tahir.manzoor · September 7, 2022, 1:55pm

@sraab1998

Could you please share your input PDF file here for testing? We will investigate the issue and provide you with more information.

sraab1998 · September 7, 2022, 2:10pm

avb_pb.pdf (242.8 KB)

tahir.manzoor · September 7, 2022, 5:25pm

@sraab1998

We have logged this problem in our issue tracking system as PDFJAVA-41977. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.
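
The reading-order problem described in this thread can be illustrated with a small, library-independent sketch. It is not an Aspose-provided fix and does not use the Aspose.PDF API. It assumes that a left edge and a top edge can be obtained for each extracted paragraph, and that the page has exactly two columns separated at a known x position. The class `ColumnReadingOrder`, the type `Box`, and the parameter `columnSplitX` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: puts paragraph boxes from a
// two-column page into reading order (left column first, then right column).
final class ColumnReadingOrder {

    // Minimal stand-in for an extracted paragraph: its text plus the left and
    // top edges of its bounding box in PDF coordinates (y grows upwards).
    static final class Box {
        final String text;
        final double left;
        final double top;

        Box(String text, double left, double top) {
            this.text = text;
            this.left = left;
            this.top = top;
        }
    }

    // Returns a sorted copy: every box whose left edge is below columnSplitX
    // comes first (top to bottom), followed by the remaining boxes (top to bottom).
    static List<Box> sort(List<Box> boxes, double columnSplitX) {
        List<Box> ordered = new ArrayList<>(boxes);
        ordered.sort((a, b) -> {
            boolean aRight = a.left >= columnSplitX;
            boolean bRight = b.left >= columnSplitX;
            if (aRight != bRight) {
                return aRight ? 1 : -1;          // left column before right column
            }
            return Double.compare(b.top, a.top); // larger y is higher on the page
        });
        return ordered;
    }

    public static void main(String[] args) {
        // Assumed sample coordinates; a real page would supply its own values.
        List<Box> boxes = Arrays.asList(
                new Box("R1", 320, 700),
                new Box("L2", 50, 400),
                new Box("L1", 50, 700),
                new Box("R2", 320, 400));
        for (Box box : sort(boxes, 300)) {
            System.out.println(box.text);
        }
    }
}
```

With this ordering, the last paragraph of the left column is immediately followed by the first paragraph of the right column, which is where text overflowing from one column to the next would be expected. Pages with full-width headings or more than two columns would need a more careful grouping step than this single split.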