Hi,
I am evaluating the Aspose.Pdf for Java library to determine whether it provides the functionality I need before including it in my application. I need to traverse a PDF document paragraph by paragraph and extract the text together with its formatting values (font name, size, styles, etc.). I also need to identify different types of objects, such as tables, headers and footers. In addition, I need to remove text from a paragraph and add new text to it with the proper formatting.
I tried ParagraphAbsorber on a sample file (as explained here) and encountered some problems.
- Iterating through MarkupSections goes from the bottom to the top of the page.

    import com.aspose.pdf.*;
    import java.util.List;

    Document document = new Document("sample.pdf");
    ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
    // Analyze the first page only (page numbering starts at 1)
    paragraphAbsorber.visit(document.getPages().get_Item(1));

    for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
        int i = 1; // section counter
        for (MarkupSection section : page.getSections()) {
            int j = 1; // paragraph counter within the section
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                StringBuilder paragraphText = new StringBuilder();
                // Join the fragments of each line, one line of output per line of text
                for (List<TextFragment> line : paragraph.getLines()) {
                    for (TextFragment textFragment : line) {
                        paragraphText.append(textFragment.getText());
                    }
                    paragraphText.append("\n");
                }
                paragraphText.append("\n");
                System.out.println(String.format("Paragraph %d of section %d on page %d:", j, i, page.getNumber()));
                System.out.println(paragraphText.toString());
                j++;
            }
            i++;
        }
    }

This is the output:
Paragraph 1 of section 1 on page 1:
February 20, 1999
Paragraph 1 of section 2 on page 1:
Robert Maron
Paragraph 2 of section 2 on page 1:
Grzegorz Grudzinski´
Paragraph 1 of section 3 on page 1:
Sample PDF Document
If you check the attached file, you will notice the text is printed bottom to top.
- Setting the font style to either Bold or Italic is not applied in the saved file (see screenshot bold.png).

    import com.aspose.pdf.*;
    import java.util.List;

    Document document = new Document("sample.pdf");
    ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(document.getPages().get_Item(1));

    for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : page.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                for (List<TextFragment> line : paragraph.getLines()) {
                    for (TextFragment textFragment : line) {
                        // Try to make every fragment on the page bold
                        textFragment.getTextState().setFontStyle(FontStyles.Bold);
                    }
                }
            }
        }
    }
    document.save("sample-saved.pdf", SaveFormat.Pdf);

- Clearing the list of TextFragments and adding a new TextFragment does not change the text in the saved file (see screenshot text_change.png).

    import com.aspose.pdf.*;
    import java.util.List;

    Document document = new Document("sample.pdf");
    ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
    paragraphAbsorber.visit(document.getPages().get_Item(1));

    for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
        for (MarkupSection section : page.getSections()) {
            for (MarkupParagraph paragraph : section.getParagraphs()) {
                for (List<TextFragment> line : paragraph.getLines()) {
                    // Try to replace the whole line with a single new fragment
                    line.clear();
                    line.add(new TextFragment("New text"));
                }
            }
        }
    }
    document.save("sample-saved.pdf", SaveFormat.Pdf);

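For the first issue, a stopgap I could probably live with is re-sorting the sections myself before printing them. The sketch below is plain Java and does not call Aspose at all: Block, its top field and the coordinates in main are hypothetical stand-ins, and I am assuming that the top edge of each section's bounding rectangle can be read from the library. Since PDF page coordinates normally start at the bottom-left corner, a larger y value means higher on the page, so sorting by y in descending order should give top-to-bottom order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class ReadingOrder {

    /** Hypothetical stand-in for a section: its text plus the y coordinate of its top edge. */
    static final class Block {
        final String text;
        final double top;

        Block(String text, double top) {
            this.text = text;
            this.top = top;
        }
    }

    /**
     * Returns a new list sorted top to bottom, leaving the input list untouched.
     * topEdge extracts the y coordinate of an item's top edge; larger y is
     * assumed to mean higher on the page (bottom-left origin).
     */
    static <T> List<T> topToBottom(List<T> items, ToDoubleFunction<T> topEdge) {
        List<T> sorted = new ArrayList<>(items);
        // List.sort is stable, so items with the same top edge keep their relative order.
        sorted.sort(Comparator.comparingDouble(topEdge).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Same bottom-to-top order as in my output; the y values are made up.
        List<Block> asReturned = Arrays.asList(
                new Block("February 20, 1999", 540.0),
                new Block("Robert Maron / Grzegorz Grudzinski", 600.0),
                new Block("Sample PDF Document", 700.0));

        for (Block block : topToBottom(asReturned, b -> b.top)) {
            System.out.println(block.text);
        }
    }
}
```

This only changes the order in which I read the sections. It does not explain why they are returned bottom to top in the first place, and it would not handle multi-column pages.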
Can you help me with these issues?
I’m using Aspose.Pdf for Java 18.3.
Thanks,
Zeljko