How to read PDF file and extract text, tables, cells etc via aspose-pdf?


#21

@HAREEM_HCL_COM

Thanks for writing back.

It is quite possible that your PDF document contains nested tables, which would explain the behavior you are seeing. Please note that each PDF document has its own structure and complexity, and the API may produce different output depending on it. When the API does not return the correct or desired output, we usually investigate the scenario with a sample PDF to find the cause.

You may also try a similar code snippet with other PDF documents that contain tables. If you notice the same behavior with every type of document, please share the complete code snippet you are using at your end. We will test it with our sample files and share our feedback with you.


#22

Hi Team,
I need to fetch a specific table and iterate through its header and data row by row. However, the data is not returned by row: each cell is iterated and printed one after the other, so I cannot tell which cells belong to a single row.


#23

@HAREEM_HCL_COM

The code snippet shared earlier extracts the data as a sequence of rows. For example, please try running the following code snippet to extract data row by row and print it to the console:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
// Find the tables on the first page (page numbering starts at 1)
absorber.visit(doc.getPages().get_Item(1));
String tempTable = "";
for (AbsorbedTable table : absorber.getTableList()) {
    for (AbsorbedRow row : table.getRowList()) {
        // Start a fresh line of text for every row
        tempTable = "";
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                tempTable += tf.getText();
            }
        }
        // Print the collected text of the row on one line
        System.out.println(tempTable);
    }
}
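
Note that the loop above appends the cell texts of a row without any separator, so the cells of one row run together in the printed line. As a minimal sketch of one way to keep them apart, assuming the text of each cell has already been collected into plain Java lists: the class RowJoiner, its methods and the sample data below are hypothetical and are not part of Aspose.PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: it works on cell text
// that has already been extracted into plain lists of strings.
class RowJoiner {

    // Joins the cell texts of one row with a separator so cell boundaries stay visible.
    static String joinRow(List<String> cells, String separator) {
        return String.join(separator, cells);
    }

    // Produces one formatted line per row of the table.
    static List<String> formatRows(List<List<String>> table, String separator) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : table) {
            lines.add(joinRow(row, separator));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for an absorbed table.
        List<List<String>> table = Arrays.asList(
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Bolt", "12"));
        for (String line : formatRows(table, " | ")) {
            System.out.println(line);
        }
    }
}
```

In the Aspose loop, the same effect could be had by collecting the text of each cell into a list per row and joining that list before printing.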

The code snippet below, by contrast, extracts data column-wise from a table:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
// Work with the first table found on the page
AbsorbedTable absorbedTable = absorber.getTableList().get_Item(0);
int absorbedRows = absorbedTable.getRowList().size();
int headerNo = 0;   // index of the column currently being collected
String cellData = "";
for (int i = 0; i < absorbedRows; i++) {
    if (headerNo < absorbedTable.getRowList().get_Item(i).getCellList().size()) {
        for (TextFragment tf : absorbedTable.getRowList().get_Item(i).getCellList().get_Item(headerNo).getTextFragments()) {
            cellData += " " + tf.getText();
        }
    } else {
        // Stop as soon as a row has no cell at the current column index
        break;
    }
    if (i == absorbedRows - 1) {
        // Last row reached: print the column, then restart from the first row for the next column
        System.out.println("Header/Col #: " + headerNo + " => " + cellData);
        headerNo++;
        i = -1;
        cellData = "";
    }
}
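
The column-wise traversal above restarts a single loop by resetting its index. The same idea can be sketched with two nested loops, assuming the cell text has already been collected into plain Java lists. The class ColumnReader, its methods and the sample data are hypothetical and are not part of Aspose.PDF. Unlike the loop above, which stops at the first row that lacks the current column, this sketch simply skips such rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: it reads a table column by column
// from rows of cell text that have already been extracted.
class ColumnReader {

    // Returns the widest row size, i.e. the number of columns to visit.
    static int columnCount(List<List<String>> rows) {
        int width = 0;
        for (List<String> row : rows) {
            width = Math.max(width, row.size());
        }
        return width;
    }

    // Collects the cells at index col from every row that has such a cell.
    static List<String> column(List<List<String>> rows, int col) {
        List<String> cells = new ArrayList<>();
        for (List<String> row : rows) {
            if (col < row.size()) {
                cells.add(row.get(col));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample data; the last row is deliberately shorter.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Bolt", "12"),
                Arrays.asList("Nut"));
        for (int col = 0; col < columnCount(rows); col++) {
            System.out.println("Header/Col #: " + col + " => " + String.join(" ", column(rows, col)));
        }
    }
}
```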

Please test these code snippets and modify them as per your requirements. If you face any issue, please feel free to let us know.


#24

Hi @asad.ali

Can you show me how to remove the tabular content from a PDF and save the result as a new PDF file?

Also, can Aspose remove tabular content from an RTF file?

Thanks.


#25

@dasadla

Thanks for your inquiry.

You can remove table content from a PDF by setting the text of its text fragments to an empty string, as in the code below:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
for (AbsorbedTable table : absorber.getTableList()) {
    for (AbsorbedRow row : table.getRowList()) {
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                // Blank out the text of every fragment inside the table
                tf.setText("");
            }
        }
    }
}
// Save the result as a new PDF (the output file name is only an example)
doc.save(dataDir + "TableContentRemoved.pdf");

In case you face any issue while accomplishing your requirements, please share your sample PDF document with us. We will test the scenario in our environment and address it accordingly.

The RTF file format is supported by the Aspose.Words API, and we will update you soon regarding removing table content from an .rtf file.


#26

@dasadla,

Yes, you can remove all tables or any single table from a Word document (RTF file) by using the following Aspose.Words for Java code:

Document doc = new Document("E:\\temp\\tables.rtf");

// Remove a single table (here, the first one in the document)
Table tab = (Table) doc.getChildNodes(NodeType.TABLE, true).get(0);
tab.remove();

// Alternatively, remove all tables
//doc.getChildNodes(NodeType.TABLE, true).clear();

doc.save("E:\\Temp\\awjava-19.4.pdf");

Hope this helps.


#27

file 3.pdf (48.7 KB)

I need to read the attached file with all of its content, and read the table row by row with its headers.


#28

@vsantosh

Would you kindly share which API you are using i.e. Aspose.PDF for .NET or Java? We will test the scenario accordingly and share our feedback.