Extract text paragraph-wise

Hello,


My requirement is to extract text from a PDF document, in particular from a multi-column document such as a newspaper article.

I am putting my requirements here:
1. Read and extract the text in the order of columns.
2. Convert every paragraph in the extracted text into a single line of text.
3. Save the lines of text in a string array.
4. Write the string array to a text file.

Can you please provide the correct approach for these steps?

I am using the TextParagraphAbsorber(Rectangle[] rectangles) class to read the text paragraph-wise, but I cannot find a way to obtain the value for the rectangles argument.

How can I get this rectangle array for every paragraph?

Thank you in advance!

Best regards,
Navnath


Hi Navnath,


Thanks for contacting support.

I am afraid Aspose.PDF for Java does not currently support extracting text based on columns. We have already logged this requirement as PDFJAVA-35762 in our issue tracking system. Please share your input document so that we can take your scenario into account while implementing this feature. As soon as the feature becomes available, we will let you know.

We are sorry for this inconvenience.

Hello Shahbaz,


Thank you for the reply.
Actually, there is an example, ExtractTextBasedOnColumns.java, in the examples directory.
I used that example for my requirement, and it extracts the text by column correctly. I am not sure why you say this is not implemented in Aspose.PDF for Java.

Now, my next requirement is to identify every new text paragraph.
I want to convert each text paragraph into a single line of text.

Is this possible with Aspose.pdf Java?

I am attaching my input file with this post.

Thank you in advance!
Navnath


Hi Navnath,


Thanks for the acknowledgement.

The API had some issues handling columns in certain PDF files, so the issue was logged in our bug tracking system. Regarding your requirement, I am testing the scenario further with the document you shared and will get back to you soon.

nnkumbhar212:
Now, my next requirement is to identify every new text paragraph.
I want to convert each text paragraph into a single line of text.

Hi Navnath,

The ticket ID shared earlier, PDFJAVA-35762, covers extracting text paragraph by paragraph (rather than extracting the content of the complete document). Once this feature is implemented, we will let you know.

We are sorry for this inconvenience.

@nnkumbhar212

Thanks for your patience.

We are pleased to inform you that the feature request logged earlier as PDFJAVA-35762 has been fulfilled in Aspose.PDF for Java 18.1. We have implemented new functionality for finding sections and paragraphs in the text of PDF document pages. The following code snippets illustrate ParagraphAbsorber usage:

Sample #1 - Drawing borders around the sections and paragraphs of text on a PDF page:

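    public void PDFJAVA_35762()
    {
        // initLicense() and the `version` variable used below are not shown in this post;
        // they are assumed to be defined elsewhere in the test class.
        initLicense();
        System.out.println("Is licensed = " + Document.isLicensed());

        String myDir = "E:/LocalTesting/";

        Document doc = new Document(myDir + "amblatt2013-10-05.pdf");
        // Work on the second page of the document.
        Page page = doc.getPages().get_Item(2);

        // Let the absorber detect sections and paragraphs on that page.
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.visit(page);

        // Only one page was visited, so the first markup belongs to it.
        PageMarkup markup = absorber.getPageMarkups().get(0);
        for (MarkupSection section : markup.getSections())
        {
            // Outline the section, then each paragraph inside it.
            drawRectangleOnPageTest(section.getRectangle(), page);
            for (MarkupParagraph paragraph : section.getParagraphs())
            {
                drawPolygonOnPageTest(paragraph.getPoints(), page);
            }
        }
        doc.save(myDir + "amblatt2013-10-05_sections&paragraphs" + version + ".pdf");
    }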
    
    // Strokes a green outline (line width 2) around the given rectangle.
    private void drawRectangleOnPageTest(Rectangle rectangle, Page page)
    {
        page.getContents().add(new Operator.GSave());
        page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
        page.getContents().add(new Operator.SetRGBColorStroke(0, 1, 0));
        page.getContents().add(new Operator.SetLineWidth(2));
        page.getContents().add(
            new Operator.Re(rectangle.getLLX(),
                rectangle.getLLY(),
                rectangle.getWidth(),
                rectangle.getHeight()));
        page.getContents().add(new Operator.ClosePathStroke());
        page.getContents().add(new Operator.GRestore());
    }
    
    // Strokes a blue outline (line width 1) through the given polygon points.
    private void drawPolygonOnPageTest(Point[] polygon, Page page)
    {
        page.getContents().add(new Operator.GSave());
        page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
        page.getContents().add(new Operator.SetRGBColorStroke(0, 0, 1));
        page.getContents().add(new Operator.SetLineWidth(1));
        page.getContents().add(new Operator.MoveTo(polygon[0].getX(), polygon[0].getY()));
        for (int i = 1; i < polygon.length; i++)
        {
            page.getContents().add(new Operator.LineTo(polygon[i].getX(), polygon[i].getY()));
        }
        page.getContents().add(new Operator.LineTo(polygon[0].getX(), polygon[0].getY()));
        page.getContents().add(new Operator.ClosePathStroke());
        page.getContents().add(new Operator.GRestore());
    }
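
If the sections or paragraphs are not returned in the reading order you need for a multi-column page, their bounding boxes could be sorted into column order before the text is collected. The following is only a rough sketch in plain Java, not part of the Aspose.PDF API: the Box class, the inColumnOrder method and the tolerance parameter are hypothetical names introduced for illustration, and the sketch assumes that boxes in the same column share roughly the same left edge. You would fill each Box from values such as section.getRectangle().getLLX().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnOrder {

    /** Hypothetical stand-in for a bounding box in PDF coordinates (y grows upward). */
    static final class Box {
        final String label;
        final double llx; // left edge
        final double ury; // top edge

        Box(String label, double llx, double ury) {
            this.label = label;
            this.llx = llx;
            this.ury = ury;
        }
    }

    /**
     * Orders boxes column by column (left to right), top to bottom inside each column.
     * A box starts a new column when its left edge lies more than `tolerance`
     * to the right of the left edge that opened the current column.
     */
    static List<Box> inColumnOrder(List<Box> boxes, double tolerance) {
        List<Box> byLeftEdge = new ArrayList<>(boxes);
        byLeftEdge.sort(Comparator.comparingDouble((Box b) -> b.llx));

        List<List<Box>> columns = new ArrayList<>();
        double columnStart = 0;
        for (Box box : byLeftEdge) {
            if (columns.isEmpty() || box.llx - columnStart > tolerance) {
                columns.add(new ArrayList<>());
                columnStart = box.llx;
            }
            columns.get(columns.size() - 1).add(box);
        }

        List<Box> ordered = new ArrayList<>();
        for (List<Box> column : columns) {
            // Higher top edge means closer to the top of the page.
            column.sort(Comparator.comparingDouble((Box b) -> b.ury).reversed());
            ordered.addAll(column);
        }
        return ordered;
    }
}
```

Please note that a heading spanning several columns would simply be placed in the leftmost column by this sketch, so such pages may need extra handling.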

Sample #2 - Iterating through the paragraphs collection and getting the text of each paragraph:

        String myDir = "E:/LocalTesting/";

        Document doc = new Document(myDir + "amblatt2013-10-05.pdf");

        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.visit(doc);


        for ( PageMarkup markup : absorber.getPageMarkups())
        {
            int i = 1;
            
            for (MarkupSection section : markup.getSections())
            {
                int j = 1;
                for (MarkupParagraph paragraph : section.getParagraphs())
                {
                    StringBuilder paragraphText = new StringBuilder();

                    for(List<TextFragment> line : paragraph.getLines())
                    {
                        for(TextFragment fragment : line)
                        {
                            paragraphText.append(fragment.getText());
                        }
                        paragraphText.append("\r\n");
                    }
                    paragraphText.append("\r\n");
                    
                    System.out.println("Paragraph {"+j+"} of section {"+i+"} on page {"+markup.getNumber()+"}:");
                    System.out.println(paragraphText.toString());

                    j++;
                }
                i++;
            }
        }
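
To turn each paragraph into a single line, keep the lines in a string array and write them to a text file, the text built in Sample #2 could be post-processed with plain JDK classes. The following is a minimal sketch under stated assumptions, not an Aspose.PDF feature: the class SingleLineParagraphs and its methods are hypothetical names, and the sketch assumes you collect paragraphText.toString() into a List<String> instead of printing it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SingleLineParagraphs {

    /** Replaces every run of whitespace, including "\r\n", with one space. */
    static String toSingleLine(String paragraphText) {
        return paragraphText.replaceAll("\\s+", " ").trim();
    }

    /** Converts paragraph texts to single lines and skips paragraphs that end up empty. */
    static String[] toLines(List<String> paragraphTexts) {
        List<String> lines = new ArrayList<>();
        for (String text : paragraphTexts) {
            String line = toSingleLine(text);
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines.toArray(new String[0]);
    }

    /** Writes one paragraph per line to the target file in UTF-8. */
    static void writeLines(String[] lines, Path target) throws IOException {
        Files.write(target, Arrays.asList(lines), StandardCharsets.UTF_8);
    }
}
```

For example, after the loops in Sample #2 you could call SingleLineParagraphs.toLines(collectedTexts) and pass the result to writeLines together with a path such as Paths.get(myDir + "paragraphs.txt"), where collectedTexts and the file name are again only illustrative. Words hyphenated across line ends are not rejoined by this sketch.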

Please try the functionality using the code snippets above. If you face any issue, please provide the details along with a sample PDF document, and we will test the scenario in our environment and address it accordingly.

PS: We would really appreciate it if you could share the JDK version you are using in your environment.