Unable to extract only text paragraphs (without tables) from a PDF using Aspose.PDF for Java

shashimn · May 28, 2019, 5:24am

Hello Aspose team, I have a PDF file that contains both text and tables. I need to extract only the text, without the tables, but I have not been able to do so. I am using Java 8 and the Aspose.PDF 19.2 JAR, with the code below:

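  import java.io.FileInputStream;
  import java.io.InputStream;
  import com.aspose.pdf.MarkupParagraph;
  import com.aspose.pdf.MarkupSection;
  import com.aspose.pdf.PageMarkup;
  import com.aspose.pdf.ParagraphAbsorber;
  import com.aspose.pdf.TextFragment;

  InputStream fis = new FileInputStream(filename);
  com.aspose.pdf.Document pdfDoc = new com.aspose.pdf.Document(fis);
  // Collect the paragraph markup of every page in the document
  ParagraphAbsorber pa = new ParagraphAbsorber();
  pa.visit(pdfDoc);
  StringBuilder sbUnstructuredText = new StringBuilder();
  for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
      for (MarkupParagraph mp : ms.getParagraphs()) {
        logger.debug("markup paragraph: " + mp.getText());
        for (TextFragment tfragment : mp.getFragments()) {
          sbUnstructuredText.append(tfragment.getText());
          logger.debug("textFragment: " + tfragment.getText());
          // One fragment per line ("\n" is the newline escape; "/n" appends a literal slash and n)
          sbUnstructuredText.append("\n");
        }
        // Blank line between paragraphs
        sbUnstructuredText.append("\n");
      }
    }
  }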

The code above also extracts the table contents, which I do not need. If there is another way to get only the text, please let me know.
I have attached the sample PDF I used.
attachment_UshurData_new.pdf (265.4 KB)

Regards,
Shashikant.

Farhan.Raza · May 28, 2019, 10:29am

@shashimn

Thank you for contacting support.

You can remove the text from each cell of the first table on the first page with the code snippet below and save the output, for example to a stream. You can then load that output into an instance of the Document class and extract the text as per your requirements.

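import com.aspose.pdf.AbsorbedCell;
import com.aspose.pdf.AbsorbedRow;
import com.aspose.pdf.Rectangle;
import com.aspose.pdf.TableAbsorber;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.facades.PdfAnnotationEditor;

// dataDir is the directory that holds the input PDF (defined elsewhere)
PdfAnnotationEditor editor = new PdfAnnotationEditor();
editor.bindPdf(dataDir + "attachment_UshurData_new.pdf");

// Create TableAbsorber object to find tables
TableAbsorber absorber = new TableAbsorber();

// Visit the first page with the absorber (page indexes start at 1)
absorber.visit(editor.getDocument().getPages().get_Item(1));

// Get the rectangle of the first table (not used below)
Rectangle rect = absorber.getTableList().get_Item(0).getRectangle();

// Clear the text of every cell in the first table
for (AbsorbedRow row : absorber.getTableList().get_Item(0).getRowList()) {
    for (AbsorbedCell cell : row.getCellList()) {
        for (Object fragment : cell.getTextFragments()) {
            ((TextFragment) fragment).setText("");
        }
    }
}
editor.save(dataDir + "tableContents_deleted.pdf");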
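
As an alternative to editing the PDF, the same goal can be reached by skipping every fragment whose position overlaps a table rectangle. The sketch below shows only that filtering step in plain Java, without any Aspose calls. `Box`, `Fragment`, and `TableTextFilter` are hypothetical stand-ins introduced for illustration; in real code the coordinates would presumably come from the fragment and table rectangles reported by the library.

```java
import java.util.List;

/** Hypothetical stand-in for a PDF rectangle: lower-left and upper-right corners, in points. */
final class Box {
    final double llx, lly, urx, ury;

    Box(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /** True when this box and the other one share any area. */
    boolean intersects(Box other) {
        return llx < other.urx && other.llx < urx
            && lly < other.ury && other.lly < ury;
    }
}

/** Hypothetical stand-in for an extracted text fragment: its text plus its position. */
final class Fragment {
    final String text;
    final Box box;

    Fragment(String text, Box box) {
        this.text = text;
        this.box = box;
    }
}

final class TableTextFilter {
    /** Returns the text of every fragment that does not overlap any table rectangle. */
    static String textOutsideTables(List<Fragment> fragments, List<Box> tables) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            boolean inTable = false;
            for (Box t : tables) {
                if (f.box.intersects(t)) {
                    inTable = true;
                    break;
                }
            }
            if (!inTable) {
                sb.append(f.text).append('\n');
            }
        }
        return sb.toString();
    }
}
```

This leaves the source file untouched, at the cost of having to collect a rectangle for every table on every page first.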

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

shashimn · May 28, 2019, 2:27pm

Thanks for the response. I also got a hint on this forum thread.

Farhan.Raza · May 28, 2019, 11:47pm

@shashimn

You are right. Many requirements have already been discussed in the forums and documentation, so browsing them can often help. Please keep using our API, and if you have any further queries, feel free to ask.

avin.patel · August 29, 2019, 7:08pm

How can I remove header and footer text when using the ParagraphAbsorber sample above?

Thanks

Farhan.Raza · August 29, 2019, 10:49pm

@avin.patel

Thank you for contacting support.

Please always create a separate topic for each separate requirement; you can refer to other topics there if they are related.

Moreover, please note that there is no specific marker that distinguishes a header or footer from other text or page contents. However, as a workaround, you can extract the text from a particular page region and then replace it with an empty string.
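
The region idea can be sketched without any library calls: treat a band at the top of the page as the header, a band at the bottom as the footer, and keep only the fragments that lie between them. This is a minimal illustration, not Aspose code. `PlacedText` and `BodyTextFilter` are hypothetical names, and the band heights are assumptions that would need tuning for each document layout.

```java
import java.util.List;

/** Hypothetical stand-in for a text fragment: its text and vertical extent on the page, in points. */
final class PlacedText {
    final String text;
    final double bottomY;
    final double topY;

    PlacedText(String text, double bottomY, double topY) {
        this.text = text;
        this.bottomY = bottomY;
        this.topY = topY;
    }
}

final class BodyTextFilter {
    /**
     * Keeps the fragments that lie entirely between the footer band and the header band.
     * PDF y coordinates grow upward, so the header sits near pageHeight and the footer near 0.
     */
    static String bodyText(List<PlacedText> fragments, double pageHeight,
                           double headerHeight, double footerHeight) {
        double bodyTop = pageHeight - headerHeight;
        double bodyBottom = footerHeight;
        StringBuilder sb = new StringBuilder();
        for (PlacedText f : fragments) {
            if (f.bottomY >= bodyBottom && f.topY <= bodyTop) {
                sb.append(f.text).append('\n');
            }
        }
        return sb.toString();
    }
}
```

For a 792-point-tall page with an assumed 60-point header band and 50-point footer band, only fragments lying between y = 50 and y = 732 are kept.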

You may also refer to the solution shared in this topic and modify the code a little, as per your requirements. Please feel free to contact us if you need any further assistance.