ParagraphAbsorber functionality

Hi,

I am testing the Aspose.Pdf for Java library to determine whether it provides the functionality I need before including it in my application. I need to traverse a PDF document paragraph by paragraph and extract the text with its formatting values (font name, size, styles, etc.). I would also need to identify different types of objects, such as tables, headers, and footers. In addition, I would need to remove text from a paragraph and add new text with the proper formatting.

I tried ParagraphAbsorber with a sample file (as explained here) and ran into some problems.

  1. Iterating through MarkupSections goes from the bottom to the top of the page.

     Document document = new Document("sample.pdf");
     ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
     paragraphAbsorber.visit(document.getPages().get_Item(1));

     for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
         int i = 1; // section counter
         for (MarkupSection section : page.getSections()) {
             int j = 1; // paragraph counter within the section
             for (MarkupParagraph paragraph : section.getParagraphs()) {
                 StringBuilder paragraphText = new StringBuilder();
                 for (List<TextFragment> line : paragraph.getLines()) {
                     for (TextFragment textFragment : line) {
                         paragraphText.append(textFragment.getText());
                     }
                     paragraphText.append("\n");
                 }
                 paragraphText.append("\n");

                 System.out.println(String.format("Paragraph %d of section %d on page %d:", j, i, page.getNumber()));
                 System.out.println(paragraphText.toString());
                 j++;
             }
             i++;
         }
     }
    

This is the output:

Paragraph 1 of section 1 on page 1:
February 20, 1999

Paragraph 1 of section 2 on page 1:
Robert Maron

Paragraph 2 of section 2 on page 1:
Grzegorz Grudzinski´

Paragraph 1 of section 3 on page 1:
Sample PDF Document

If you check the attached file, you will notice the text is printed bottom to top.

  2. Setting the font style to either Bold or Italic isn’t applied in the saved file (check screenshot bold.png (46.9 KB))

     Document document = new Document("sample.pdf");
     ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
     paragraphAbsorber.visit(document.getPages().get_Item(1));

     for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
         for (MarkupSection section : page.getSections()) {
             for (MarkupParagraph paragraph : section.getParagraphs()) {
                 for (List<TextFragment> line : paragraph.getLines()) {
                     for (TextFragment textFragment : line) {
                         textFragment.getTextState().setFontStyle(FontStyles.Bold);
                     }
                 }
             }
         }
     }

     document.save("sample-saved.pdf", SaveFormat.Pdf);

  3. Clearing the TextFragment list and adding a new TextFragment doesn’t change the text in the saved file (check screenshot text_change.png (46.2 KB))

     Document document = new Document("sample.pdf");
     ParagraphAbsorber paragraphAbsorber = new ParagraphAbsorber();
     paragraphAbsorber.visit(document.getPages().get_Item(1));

     for (PageMarkup page : paragraphAbsorber.getPageMarkups()) {
         for (MarkupSection section : page.getSections()) {
             for (MarkupParagraph paragraph : section.getParagraphs()) {
                 for (List<TextFragment> line : paragraph.getLines()) {
                     line.clear();
                     line.add(new TextFragment("New text"));
                 }
             }
         }
     }

     document.save("sample-saved.pdf", SaveFormat.Pdf);
    

Can you help me with these issues?

I’m using Aspose.Pdf for Java 18.3.

Thanks,
Zeljko

@Zeljko,

Kindly send us your source PDF document. We will investigate your scenario in our environment, and share our findings with you.

Hi Imran,

Here’s the file: sample.pdf.zip (25.9 KB)

Thanks,
Zeljko

@Zeljko,

We managed to replicate the issues and have logged them as follows:

PDFJAVA-37654: Input PDF - an incorrect sequence of the retrieved paragraphs
PDFJAVA-37655: Input PDF - cannot change the font style of the text

We have linked your post to these tickets and will keep you informed regarding any available updates.

To replace the text, please modify the code as follows:

for (List<TextFragment> line : paragraph.getLines()) 
{
    TextFragment fragment = line.get(0);
    line.clear();
    fragment.setText("New text");
    line.add(fragment);
}

Hi Imran,

I’ve just tested the code you provided and noticed an issue. It looks like the text line has not been fully cleared: part of the previous text is left over. Please see the screenshot Screenshot from 2018-04-19 11-14-04.png (38.7 KB)

I’ve used the same sample file I already attached in the previous post. Here’s the resulting pdf: sample-saved.pdf.zip (25.8 KB)

Regards,
Zeljko

@Zeljko,

We are sorry for the inconvenience caused. Please modify the code as follows:

for (List<TextFragment> line : paragraph.getLines()) {
    for (int i = 0; i < line.size(); i++) {
        // Put the new text in the first fragment and blank out the rest of the line
        if (i == 0) {
            line.get(i).setText("New text");
        } else {
            line.get(i).setText("");
        }
    }
}

This is the output PDF: NewText_sample-saved.pdf (52.6 KB)
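
The three attempts in this thread can be compared with a small model in plain Java. This is only a sketch under stated assumptions: `Fragment`, `render`, and `replaceLineText` are hypothetical stand-ins, not Aspose.PDF classes, and the model assumes that the lists returned by `getLines()` merely reference fragments owned by the page. That assumption is consistent with the behaviour reported above, but the thread does not confirm how the library works internally.

```java
import java.util.ArrayList;
import java.util.List;

public class LineReplaceSketch {

    /** Hypothetical stand-in for TextFragment: just a mutable piece of text. */
    static final class Fragment {
        private String text;
        Fragment(String text) { this.text = text; }
        String getText() { return text; }
        void setText(String text) { this.text = text; }
    }

    /** Joins the fragments the "page" owns, i.e. what would end up in the saved file. */
    static String render(List<Fragment> pageContent) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : pageContent) {
            sb.append(f.getText());
        }
        return sb.toString();
    }

    /** Puts the new text in the first fragment and blanks out the remaining ones. */
    static void replaceLineText(List<Fragment> line, String newText) {
        for (int i = 0; i < line.size(); i++) {
            line.get(i).setText(i == 0 ? newText : "");
        }
    }

    public static void main(String[] args) {
        // Assumed model: the page owns three fragments that form one line.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Sample "));
        page.add(new Fragment("PDF "));
        page.add(new Fragment("Document"));

        // Attempt 1: clear the line list and add a new fragment.
        // Only the list changes; the page's own fragments are untouched.
        List<Fragment> line = new ArrayList<>(page);
        line.clear();
        line.add(new Fragment("New text"));
        System.out.println(render(page)); // Sample PDF Document

        // Attempt 2: change only the first fragment; the others are left over.
        line = new ArrayList<>(page);
        line.get(0).setText("New text");
        System.out.println(render(page)); // New textPDF Document

        // Attempt 3: change the first fragment and blank out the rest.
        replaceLineText(line, "New text");
        System.out.println(render(page)); // New text
    }
}
```

In this model, only changes made to the existing fragment objects reach the rendered text, which matches the leftover text seen in the second attempt.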

@Zeljko

Thanks for your patience.

In reference to the earlier logged issue (filed as PDFJAVA-37665), the font “Times Roman” (not to be confused with “Times New Roman”) should exist in the standard font directories, or the path to the font should be specified with the following function:

Document.addLocalFontPath(... path to Times Roman font ...);

Please use the suggested approach with Aspose.PDF for Java 18.4, and if you still face any issue, feel free to let us know.

The issues you have found earlier (filed as PDFJAVA-37654) have been fixed in this update.
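
For anyone who cannot upgrade right away, one possible stopgap for the ordering issue is to walk the sections in reverse. The sketch below is plain Java and rests on an assumption this thread does not confirm for other documents: that the sections come back in exactly reversed order, as in the sample output reported earlier, while paragraphs inside a section are already in the right order. `reversedCopy` is a hypothetical helper, and the strings stand in for `MarkupSection` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReverseSectionsSketch {

    /** Returns a new list with the items in reverse order; the input is left untouched. */
    static <T> List<T> reversedCopy(Iterable<? extends T> items) {
        List<T> copy = new ArrayList<>();
        for (T item : items) {
            copy.add(item);
        }
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Strings stand in for sections, listed in the bottom-to-top order reported in this thread.
        List<String> sections = Arrays.asList(
                "February 20, 1999",
                "Robert Maron",
                "Sample PDF Document");

        // Walking the reversed copy visits the title first and the date last.
        for (String section : reversedCopy(sections)) {
            System.out.println(section);
        }
    }
}
```

In the original loop this would amount to iterating over `reversedCopy(page.getSections())` instead of `page.getSections()`. It should be removed once the fixed version is in use, otherwise the order would be flipped again.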