Hi,
I am trying to extract paragraphs from a PDF using ParagraphAbsorber. However, if a paragraph flows from one column into the next, or continues across a page boundary, Aspose.PDF treats it as two separate paragraphs when it should be a single paragraph.
Document pddoc = new Document(new FileInputStream("C:/Users/aswarna/Documents/FlowingErrosInParagraphIdentification.pdf"));
// Pages are 1-indexed in Aspose.PDF
for (int i = 1; i <= pddoc.getPages().size(); i++) {
    Page pdPage = pddoc.getPages().get_Item(i);
    // A fresh absorber is created for each page
    ParagraphAbsorber paraAbsorber1 = new ParagraphAbsorber();
    paraAbsorber1.visit(pdPage);
    for (PageMarkup markup : paraAbsorber1.getPageMarkups()) {
        for (MarkupSection ms : markup.getSections()) {
            for (MarkupParagraph p1 : ms.getParagraphs()) {
                // Print each detected paragraph, separated by blank lines
                System.out.println(p1.getText());
                System.out.println("\n\n\n");
            }
        }
    }
}
In the attached PDF, the lower-left paragraph in the first column and the top paragraph in the second column should be one paragraph.
Likewise, the bottom-right paragraph on the first page and the first paragraph on the second page should be one paragraph.
Is there a way to achieve this in Aspose.PDF?
issueContent.zip (232.0 KB)
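To make the expected result concrete, here is a minimal sketch of a post-processing step that joins paragraph texts after extraction. It does not use any Aspose.PDF API. The class `ParagraphMerger` and its methods are hypothetical names, and the rule it applies (a paragraph that does not end in `.`, `!`, `?` or `:` is assumed to continue in the next one) is only an assumed heuristic that will misfire on headings, list items and similar text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: merges paragraph texts that
// appear to have been split at a column or page boundary.
final class ParagraphMerger {

    // Assumed heuristic: text that does not end in terminal punctuation
    // is treated as continuing in the next paragraph.
    static boolean looksUnfinished(String text) {
        String t = text.trim();
        if (t.isEmpty()) {
            return false;
        }
        char last = t.charAt(t.length() - 1);
        return ".!?:".indexOf(last) < 0;
    }

    // Takes paragraph texts in reading order and returns a new list in which
    // each unfinished paragraph is joined to its successor with one space.
    static List<String> merge(List<String> paragraphs) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = null;
        for (String p : paragraphs) {
            String text = p.trim();
            if (text.isEmpty()) {
                continue; // skip empty fragments
            }
            if (current != null && looksUnfinished(current.toString())) {
                current.append(' ').append(text);
            } else {
                if (current != null) {
                    merged.add(current.toString());
                }
                current = new StringBuilder(text);
            }
        }
        if (current != null) {
            merged.add(current.toString());
        }
        return merged;
    }
}
```

To apply it across columns and pages, the values returned by `p1.getText()` would first be collected into a single list for the whole document, in reading order, and then passed to `merge` once, rather than being printed page by page.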