Extract text from a PDF along with its formatting

Hi,


Please find the attached PDF. We need to read the formatting (i.e. indentation and line spacing) from the PDF document, as well as bold, italic and underline.

We are extracting the text with the code below. Is there any method to find the formatting as well?

License license = new License();
license.setLicense("Aspose.Pdf.lic");
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("Input_pdf.pdf");
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

pdfDocument.getPages().accept(textAbsorber);

String extractedText = textAbsorber.getText();
System.out.println(extractedText);

Could you please help us with this?

Thanks,
Rajesh

Hi Rajesh,

Thanks for using our APIs.

To meet your requirements, please try the following code snippet. However, I am afraid the API does not currently support determining whether text is bold or italic. We have logged this requirement as PDFJAVA-36125 in our issue tracking system. We will look further into the details and keep you updated on its status. Please be patient and allow us a little time. We are sorry for the inconvenience.

// Open document
Document pdfDocument = new Document("c:/pdftest/Input_pdf.pdf");

// Create a TextFragmentAbsorber whose regular expression matches every run of non-whitespace characters (one fragment per word)
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[\\S]+");

// Set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);

// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into a collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the fragments
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    System.out.println("Text: " + textFragment.getText());
    System.out.println("Position: " + textFragment.getPosition());
    System.out.println("XIndent: " + textFragment.getPosition().getXIndent());
    System.out.println("YIndent: " + textFragment.getPosition().getYIndent());
    System.out.println("Font - Name: " + textFragment.getTextState().getFont().getFontName());
    System.out.println("Font - IsAccessible: " + textFragment.getTextState().getFont().isAccessible());
    System.out.println("Font - IsEmbedded: " + textFragment.getTextState().getFont().isEmbedded());
    System.out.println("Font - IsSubset: " + textFragment.getTextState().getFont().isSubset());
    System.out.println("Font Size: " + textFragment.getTextState().getFontSize());
    System.out.println("Foreground Color: " + textFragment.getTextState().getForegroundColor());
    System.out.println("Is Underline: " + textFragment.getTextState().isUnderline());
    System.out.println("Line Spacing: " + textFragment.getTextState().getLineSpacing());
}

Hi Nayyer Shahbaz,


Thanks for your reply.

The TextFragmentCollection reads the text word by word, but our requirement is to read it paragraph by paragraph.

Is there any way to read the text paragraph by paragraph? We want to know the indentation and line spacing of each paragraph.

Could you please help us with this?

Thanks,
Rajesh

Hi Rajesh,


Thanks for sharing the details and sorry for the delayed response.

I am afraid Aspose.Pdf for Java does not currently support extracting or reading PDF file contents paragraph by paragraph. However, we have already logged this requirement as PDFNEWJAVA-35762 in our issue tracking system. We will look further into the details of this requirement and keep you updated on its status. Please be patient and allow us a little time. We are sorry for the inconvenience.
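
Until such a feature is available, one possible workaround is to rebuild lines yourself from the word-level positions printed by the earlier snippet (XIndent and YIndent). The following is only a minimal sketch in plain Java, not part of the Aspose API: the `Word` and `Line` classes, the `groupIntoLines` and `lineSpacings` helpers and the tolerance value are all hypothetical names introduced for illustration. It assumes that coordinates are in PDF user space, where y grows upwards, and that words on the same line share nearly the same y value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LineGrouper {

    /** Hypothetical stand-in for one extracted word: its text and position in points. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** One reconstructed line: vertical position, left indent and joined text. */
    static final class Line {
        final double y;
        final double indent;
        final String text;

        Line(double y, double indent, String text) {
            this.y = y;
            this.indent = indent;
            this.text = text;
        }
    }

    /** Groups words whose y values differ by at most the tolerance into lines, top to bottom. */
    static List<Line> groupIntoLines(List<Word> words, double tolerance) {
        List<Word> sorted = new ArrayList<>(words);
        // y grows upwards in PDF user space, so descending y means top of page first
        sorted.sort(Comparator.comparingDouble((Word w) -> w.y).reversed());

        List<List<Word>> buckets = new ArrayList<>();
        for (Word w : sorted) {
            List<Word> last = buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);
            if (last != null && Math.abs(last.get(0).y - w.y) <= tolerance) {
                last.add(w);
            } else {
                List<Word> fresh = new ArrayList<>();
                fresh.add(w);
                buckets.add(fresh);
            }
        }

        List<Line> lines = new ArrayList<>();
        for (List<Word> bucket : buckets) {
            // Within one line, order the words from left to right
            bucket.sort(Comparator.comparingDouble((Word w) -> w.x));
            StringBuilder text = new StringBuilder();
            for (Word w : bucket) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(w.text);
            }
            // The leftmost word supplies the line's indent and vertical position
            lines.add(new Line(bucket.get(0).y, bucket.get(0).x, text.toString()));
        }
        return lines;
    }

    /** Vertical distance between each pair of consecutive lines. */
    static List<Double> lineSpacings(List<Line> lines) {
        List<Double> gaps = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            gaps.add(lines.get(i - 1).y - lines.get(i).y);
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("world", 110.0, 700.5),
                new Word("Hello", 72.0, 700.0),
                new Word("Second", 90.0, 686.0),
                new Word("line", 130.0, 686.0));
        List<Line> lines = groupIntoLines(words, 2.0);
        for (Line line : lines) {
            System.out.println("indent=" + line.indent + " y=" + line.y + " text=" + line.text);
        }
        System.out.println("spacings=" + lineSpacings(lines));
    }
}
```

In real use, the `Word` values would be filled from each fragment's text and position. A jump in spacing or indent between consecutive lines could then serve as a rough hint for a paragraph break, although multi-column layouts would need extra handling.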

@Kusumanchi.Rajesh

Thanks for your patience.

We are pleased to inform you that the earlier logged feature request PDFJAVA-35762 has been fulfilled in Aspose.PDF for Java 18.1. We have implemented new functionality for finding sections and paragraphs in the text of PDF document pages. The following code snippets illustrate ParagraphAbsorber usage:

Sample #1 - Drawing borders around sections and paragraphs of text on a PDF page:

public void PDFJAVA_35762()
{
    // initLicense() and the version variable used below are not defined in this snippet; supply your own
    initLicense();
    System.out.println("Is licensed = " + Document.isLicensed());

    String myDir = "E:/LocalTesting/";

    Document doc = new Document(myDir + "amblatt2013-10-05.pdf");
    Page page = doc.getPages().get_Item(2);

    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.visit(page);

    PageMarkup markup = absorber.getPageMarkups().get(0);
    for (MarkupSection section : markup.getSections())
    {
        // Outline the section, then each paragraph inside it
        drawRectangleOnPageTest(section.getRectangle(), page);
        for (MarkupParagraph paragraph : section.getParagraphs())
        {
            drawPolygonOnPageTest(paragraph.getPoints(), page);
        }
    }
    doc.save(myDir + "amblatt2013-10-05_sections&paragraphs" + version + ".pdf");
}
    
private void drawRectangleOnPageTest(Rectangle rectangle, Page page)
{
    page.getContents().add(new Operator.GSave());
    page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.getContents().add(new Operator.SetRGBColorStroke(0, 1, 0));
    page.getContents().add(new Operator.SetLineWidth(2));
    page.getContents().add(
        new Operator.Re(rectangle.getLLX(),
            rectangle.getLLY(),
            rectangle.getWidth(),
            rectangle.getHeight()));
    page.getContents().add(new Operator.ClosePathStroke());
    page.getContents().add(new Operator.GRestore());
}

private void drawPolygonOnPageTest(Point[] polygon, Page page)
{
    page.getContents().add(new Operator.GSave());
    page.getContents().add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.getContents().add(new Operator.SetRGBColorStroke(0, 0, 1));
    page.getContents().add(new Operator.SetLineWidth(1));
    page.getContents().add(new Operator.MoveTo(polygon[0].getX(), polygon[0].getY()));
    for (int i = 1; i < polygon.length; i++)
    {
        page.getContents().add(new Operator.LineTo(polygon[i].getX(), polygon[i].getY()));
    }
    page.getContents().add(new Operator.LineTo(polygon[0].getX(), polygon[0].getY()));
    page.getContents().add(new Operator.ClosePathStroke());
    page.getContents().add(new Operator.GRestore());
}

Sample #2 - Iterating through the paragraph collection and getting the text of each paragraph:

String myDir = "E:/LocalTesting/";

Document doc = new Document(myDir + "amblatt2013-10-05.pdf");

ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.visit(doc);


for (PageMarkup markup : absorber.getPageMarkups())
{
    int i = 1;

    for (MarkupSection section : markup.getSections())
    {
        int j = 1;
        for (MarkupParagraph paragraph : section.getParagraphs())
        {
            StringBuilder paragraphText = new StringBuilder();

            for (List<TextFragment> line : paragraph.getLines())
            {
                for (TextFragment fragment : line)
                {
                    paragraphText.append(fragment.getText());
                }
                paragraphText.append("\r\n");
            }
            paragraphText.append("\r\n");

            System.out.println("Paragraph {" + j + "} of section {" + i + "} on page {" + markup.getNumber() + "}:");
            System.out.println(paragraphText.toString());

            j++;
        }
        i++;
    }
}

Please try the functionality using the suggested code snippets. If you face any issue, please provide details along with a sample PDF document. We will test the scenario in our environment and address it accordingly.

PS: We would really appreciate it if you could share the JDK version you are using in your environment.

@codewarior
Have these changes been rolled out?
I need this API as well.

@ganesh.sv

Thanks for contacting support.

The earlier ticket concerned an issue where the font style could not be determined, and it was closed because it was not a bug. A new enhancement request, PDFJAVA-38101, has been logged in our system to implement a getter for the FontStyle enumeration. We have linked the ticket to this thread so that you will be notified once the enhancement is available. Please be patient and allow us a little time.

We are sorry for the inconvenience.

How can I use ParagraphAbsorber with TextSearchOptions?

I am processing a PDF document with a header, a footer and text content in 3 columns/threads per page. I want to extract only the text (without the header and footer), paragraph by paragraph.

TextAbsorber can remove the header and footer using TextSearchOptions, but it cannot read paragraphs across columns/threads.

ParagraphAbsorber can read paragraphs from the 3 columns per page, but it cannot remove the header and footer.

@avin.patel

Could you please share the code snippet in which you use TextAbsorber and TextSearchOptions to discard the header and footer from the extracted text? Also, please share a sample PDF document for our reference. We will test the scenario in our environment and address it accordingly.
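
In the meantime, one generic approach that does not rely on TextSearchOptions is to filter the extracted paragraphs by their vertical position and drop anything that falls inside a header or footer band. The sketch below is plain Java and only illustrates the idea: the `Block` class, the `bodyOnly` helper and the margin values are hypothetical. In practice, the bounds of each block might be taken from something like `section.getRectangle()` or `paragraph.getPoints()`, as used in the earlier samples in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderFooterFilter {

    /** Hypothetical stand-in for a paragraph: its text plus bottom (lly) and top (ury) edges in points. */
    static final class Block {
        final String text;
        final double lly;
        final double ury;

        Block(String text, double lly, double ury) {
            this.text = text;
            this.lly = lly;
            this.ury = ury;
        }
    }

    /**
     * Keeps only the blocks that lie entirely between the footer band and the header band.
     * Assumes PDF user space, where y = 0 is the bottom of the page.
     */
    static List<Block> bodyOnly(List<Block> blocks, double pageHeight,
                                double headerMargin, double footerMargin) {
        double bodyTop = pageHeight - headerMargin;
        List<Block> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.lly >= footerMargin && b.ury <= bodyTop) {
                kept.add(b);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("Journal title (header)", 800.0, 820.0),
                new Block("First body paragraph", 400.0, 700.0),
                new Block("Page 3 (footer)", 20.0, 40.0));
        // Illustrative values: an A4-sized page (842 pt high) with 50 pt header and footer bands
        for (Block b : bodyOnly(blocks, 842.0, 50.0, 50.0)) {
            System.out.println(b.text);
        }
    }
}
```

The margin values would have to be tuned to the layout of the actual document, which is one more reason a sample PDF would help.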

@ganesh.sv

Regarding PDFJAVA-38101, please use the code snippet below with version 21.5 of the API:

// Open document
Document pdfDocument = new Document("c:/pdftest/Input_pdf.pdf");

// Create a TextFragmentAbsorber whose regular expression matches every run of non-whitespace characters (one fragment per word)
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[\\S]+");

// Set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);

// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the fragments
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    // Display whether the text is bold or italic
    if (textFragment.getTextState().getFontStyle() == FontStyles.Bold)
        System.out.println("Text (" + textFragment.getText() + ") is Bold");
    else if (textFragment.getTextState().getFontStyle() == FontStyles.Italic)
        System.out.println("Text (" + textFragment.getText() + ") is Italic");
}
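
Note that the `==` comparison above only matches text that is exactly bold or exactly italic. If the style value turns out to be a combination of bit flags (an assumption, not something confirmed in this thread), text that is both bold and italic would match neither branch. The plain Java sketch below shows the bitwise alternative with hypothetical local constants; please check the actual `FontStyles` values in your version of the API before relying on it.

```java
public class FontStyleFlags {

    // Hypothetical flag values for illustration only; they are not taken from the Aspose API
    static final int REGULAR = 0;
    static final int BOLD = 1;
    static final int ITALIC = 2;

    /** True when the bold bit is set, regardless of any other bits. */
    static boolean isBold(int style) {
        return (style & BOLD) != 0;
    }

    /** True when the italic bit is set, regardless of any other bits. */
    static boolean isItalic(int style) {
        return (style & ITALIC) != 0;
    }

    /** Human-readable description of a style value. */
    static String describe(int style) {
        if (isBold(style) && isItalic(style)) {
            return "bold italic";
        }
        if (isBold(style)) {
            return "bold";
        }
        if (isItalic(style)) {
            return "italic";
        }
        return "regular";
    }

    public static void main(String[] args) {
        int[] samples = {REGULAR, BOLD, ITALIC, BOLD | ITALIC};
        for (int style : samples) {
            System.out.println(style + " is " + describe(style));
        }
    }
}
```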

The issues you have found earlier (filed as PDFJAVA-38101) have been fixed in Aspose.PDF for Java 21.7.