Extract text from PDF by paragraphs

Kusumanchi.Rajesh · April 20, 2016, 8:26am

Hi Aspose,

I want to read a PDF file paragraph by paragraph, including the body, headers, and footers.

Can Aspose do this?

Thanks,

Rajesh

codewarior · April 21, 2016, 10:30am

Hi Rajesh,

Thanks for your interest in our APIs.

I am afraid that extracting PDF file contents paragraph by paragraph is currently not supported. However, I have logged this requirement as PDFNEWJAVA-35762 in our issue tracking system. We will look further into the details of this requirement and keep you posted on its status.

Furthermore, you can extract the complete contents of a page, but extracting contents from the header or footer section only is currently not supported. I have logged this requirement as PDFNEWJAVA-35763 in our issue tracking system so that it can be implemented.

Please be patient and spare us a little time. We are sorry for the inconvenience.

asad.ali · September 19, 2017, 12:25pm

@Kusumanchi.Rajesh

Thanks for your patience.

Please note that the distinction between headers/footers and main content exists only at the PDF generation stage. Once the document has been saved, that separation is lost, so we cannot retrieve headers or footers from a document that has just been opened.

In an opened document, there is only content located at certain coordinates. The TextAbsorber, ImagePlacementAbsorber, and TableAbsorber classes can be used to extract it. Content can also be grouped into “marked-content elements”. If headers or footers are grouped into such elements, we may be able to extract this content as images:

Document document = new Document(myDir+ "sample.pdf");
PdfExtractor pe = new PdfExtractor();
//Specify the folder to save extracted images
pe.extractMarkedContentAsImages(document.getPages().get_Item(1), myDir + "MarkedContentElementFolder");
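Since an opened document only exposes content at coordinates, one possible workaround is to classify already-extracted text by its vertical position on the page. The following is only a minimal sketch in plain Java and is not part of the Aspose.PDF API: the PageRegionSplitter class, the PositionedText type, and the band heights are hypothetical, and the y-coordinates are assumed to have been obtained beforehand from whatever extraction API you use. It assumes the usual PDF coordinate system with the origin at the bottom-left corner of the page.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not an Aspose.PDF class. */
final class PageRegionSplitter {

    enum Region { HEADER, BODY, FOOTER }

    /** A piece of text together with its y-coordinate in points (origin at bottom-left). */
    static final class PositionedText {
        final String text;
        final double y;

        PositionedText(String text, double y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Classifies a y-coordinate: anything within headerBand points of the top edge
     * counts as header, anything within footerBand points of the bottom edge as footer.
     */
    static Region classify(double y, double pageHeight, double headerBand, double footerBand) {
        if (y >= pageHeight - headerBand) {
            return Region.HEADER;
        }
        if (y <= footerBand) {
            return Region.FOOTER;
        }
        return Region.BODY;
    }

    /** Groups the texts by region, preserving their input order within each region. */
    static Map<Region, List<String>> split(List<PositionedText> items, double pageHeight,
                                           double headerBand, double footerBand) {
        Map<Region, List<String>> result = new EnumMap<>(Region.class);
        for (Region region : Region.values()) {
            result.put(region, new ArrayList<>());
        }
        for (PositionedText item : items) {
            result.get(classify(item.y, pageHeight, headerBand, footerBand)).add(item.text);
        }
        return result;
    }
}
```

The band heights have to be tuned per document layout, and this heuristic will misclassify body text that happens to sit inside a band, so treat it as a starting point rather than a reliable header/footer detector.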

Furthermore, regarding the other logged ticket, PDFJAVA-35762, we will let you know as soon as we have definite updates on its resolution. Please be patient and spare us a little time.

We are sorry for the inconvenience.

asad.ali · February 11, 2018, 10:07pm

@Kusumanchi.Rajesh

Thanks for your patience.

We are pleased to inform you that the earlier logged feature request PDFJAVA-35762 has been fulfilled in Aspose.PDF for Java 18.1. We have implemented new functionality for searching sections and paragraphs in the text of PDF document pages. The following code snippets illustrate ParagraphAbsorber usage:

Sample #1 - Drawing borders around the sections and paragraphs of text on a PDF page:

public void PDFJAVA_35762()
{
    initLicense();
    System.out.println("Is licensed = " + Document.isLicensed());

    String myDir = "E:/LocalTesting/";

    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");
    Page page = doc.getPages().get_Item(2);

    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.visit(page);

    PageMarkup markup = absorber.getPageMarkups().get(0);
    for (MarkupSection section : markup.getSections())
    {
        drawRectangleOnPageTest(section.getRectangle(), page);
        for (MarkupParagraph paragraph : section.getParagraphs())
        {
            drawPolygonOnPageTest(paragraph.getPoints(), page);
        }
    }
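    // Note: `version` is not declared in this snippet; it is presumably a field of the surrounding test class.
    doc.save(myDir + "amblatt2013-10-05_sections&paragraphs" + version + ".pdf");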
}
    
private void drawRectangleOnPageTest(Rectangle rectangle, Page page)
{
    page.getContents().add(new Operator.GSave());
    page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.getContents().add(new Operator.SetRGBColorStroke(0, 1, 0));
    page.getContents().add(new Operator.SetLineWidth(2));
    page.getContents().add(
        new Operator.Re(rectangle.getLLX(),
            rectangle.getLLY(),
            rectangle.getWidth(),
            rectangle.getHeight()));
    page.getContents().add(new Operator.ClosePathStroke());
    page.getContents().add(new Operator.GRestore());
}

private void drawPolygonOnPageTest(Point[] polygon, Page page)
{
    page.getContents().add(new Operator.GSave());
    page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.getContents().add(new Operator.SetRGBColorStroke(0, 0, 1));
    page.getContents().add(new Operator.SetLineWidth(1));
    page.getContents().add(new Operator.MoveTo(polygon[0].getX(), polygon[0].getY()));
    for (int i = 1; i < polygon.length; i++)
    {
        page.getContents().add(new Operator.LineTo(polygon[i].getX(), polygon[i].getY()));
    }
    page.getContents().add(new Operator.LineTo(polygon[0].getX(), polygon[0].getY()));
    page.getContents().add(new Operator.ClosePathStroke());
    page.getContents().add(new Operator.GRestore());
}

Sample #2 - Iterating through the paragraphs collection and getting the text of each paragraph:

String myDir = "E:/LocalTesting/";

Document doc = new Document(myDir + "amblatt2013-10-05.pdf");

ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.visit(doc);


for (PageMarkup markup : absorber.getPageMarkups())
{
    int i = 1;

    for (MarkupSection section : markup.getSections())
    {
        int j = 1;
        for (MarkupParagraph paragraph : section.getParagraphs())
        {
            StringBuilder paragraphText = new StringBuilder();

            for (List<TextFragment> line : paragraph.getLines())
            {
                for (TextFragment fragment : line)
                {
                    paragraphText.append(fragment.getText());
                }
                paragraphText.append("\r\n");
            }
            paragraphText.append("\r\n");

            System.out.println("Paragraph {" + j + "} of section {" + i + "} on page {" + markup.getNumber() + "}:");
            System.out.println(paragraphText.toString());

            j++;
        }
        i++;
    }
}
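Sample #2 appends a line break after every line, so each paragraph keeps the line wrapping of the PDF layout. If you would rather have each paragraph as one flowing string, the line texts could be joined with single spaces instead. The sketch below is plain Java and not part of the Aspose.PDF API: the ParagraphJoiner class is hypothetical, and it assumes you have already collected the text of each line of a paragraph into a list of strings (for example, by concatenating the fragments of each line as in Sample #2).

```java
import java.util.List;

/** Hypothetical helper for illustration only; not an Aspose.PDF class. */
final class ParagraphJoiner {

    /**
     * Joins the lines of one paragraph into a single string.
     * Each line is trimmed, empty lines are skipped, and the
     * remaining lines are separated by exactly one space.
     */
    static String join(List<String> lines) {
        StringBuilder joined = new StringBuilder();
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // ignore lines that contain only whitespace
            }
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(line);
        }
        return joined.toString();
    }
}
```

Note that this does not rejoin words that were hyphenated at a line end; whether to do so depends on the document, because a trailing hyphen may also belong to a genuine compound word.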

Please try the functionality using the suggested code snippets. If you face any issue, please provide the details along with a sample PDF document. We will test the scenario in our environment and address it accordingly.

PS: It would be appreciated if you could share the JDK version you are using in your environment.