How to read PDF file and extract text, tables, cells etc via aspose-pdf?

asad.ali · October 29, 2018, 10:46am

Thanks for getting back to us.

We have tested both of your methods, using ParagraphAbsorber and TableAbsorber with Aspose.PDF for Java 18.9, and did not notice any issue. All text from the PDF document was extracted. Furthermore, with this particular PDF document, the ParagraphAbsorber class is able to extract all of the text, whereas TableAbsorber can only be used to extract table data.

Please check the following code snippet and the extracted text output for your reference, which we used in our environment to test the scenario:

// Imports needed: com.aspose.pdf.* and java.util.List
Document doc = new Document(dataDir + "8.pdf");
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            // Each paragraph line is a list of text fragments
            for (List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                sb.append("\r\n"); // end of line
            }
            sb.append("\r\n"); // blank line between paragraphs
            System.out.println(sb);
        }
    }
}
		
try {
    TableAbsorber absorber = new TableAbsorber();
    for (Page pg : doc.getPages()) {
        absorber.visit(pg);
        for (AbsorbedTable table : absorber.getTableList()) {
            for (AbsorbedRow row : table.getRowList()) {
                for (AbsorbedCell cell : row.getCellList()) {
                    // Bounding rectangle of the cell
                    System.out.println(cell.getRectangle());
                    for (TextFragment tf : cell.getTextFragments()) {
                        for (TextSegment ts : tf.getSegments()) {
                            System.out.println(ts.getText());
                        }
                    }
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

outputtext.zip (1.6 KB)

Furthermore, it is always recommended to use the latest version because it contains more fixes and enhancements. Please use the latest version of the API with a valid license. In case you do not have a valid license, you can get a 30-day temporary license from our website. Please feel free to let us know if you face any issue.

yichunxia · October 31, 2018, 9:42am

Hi Asad,
I appreciate your reply.
I finally got it working, and I have some other questions that need your support:

Is there any method to get only the text NOT in tables, so that I can handle the text-only and table-only cases separately? I tried ParagraphAbsorber and TextAbsorber; both of them printed out all the text contents.
All the line breaks and spaces are lost, and all sentences and characters are run together, for example:
CONSIGNEE
Line2 U.S.A
ADDRESS #3

was converted to: CONSIGNEELine2 U.S.AADDRESS #3

What do the coordinates mean? Are they pixels measured from the top-left corner of the file, or something else?

Thanks
Tony

asad.ali · October 31, 2018, 6:29pm

@yichunxia

Thanks for getting back to us.

It is good to know that things have started working at your side.

I am afraid Aspose.PDF does not offer such a method at the moment. However, an investigation ticket has already been logged as PDFJAVA-38108 in our issue tracking system. We will further investigate the feasibility of such a feature and let you know as soon as we have updates regarding its availability. Please give us a little time.

Would you please share the respective input PDF document with us? Furthermore, please also specify which way you are using to extract text from it (e.g. TableAbsorber, TextFragmentAbsorber, etc.). We will test the scenario in our environment and address it accordingly.

The coordinate system in the Aspose.PDF API follows the standard coordinates of the PDF Specification, in which (0,0) is the bottom-left corner. In other words, the starting point in the PDF is the bottom-left corner (see attachment PDF_Rectangle.png, 2.8 KB). Furthermore, the basic unit of measurement in Aspose.PDF is the point, where 72 points = 1 inch.
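
To make the coordinate answer concrete, here is a minimal sketch in plain Java; it does not call Aspose.PDF at all. The class and method names are made up for illustration, and the 792-point page height is an assumption (a US Letter page, 11 in × 72). It converts a length in points to inches and flips a y value measured from the bottom edge into one measured from the top edge.

```java
// Illustration only: plain arithmetic, no Aspose.PDF calls.
// PdfCoordinates and its methods are hypothetical helper names.
final class PdfCoordinates {
    static final double POINTS_PER_INCH = 72.0;

    private PdfCoordinates() {}

    /** Converts a length in PDF points to inches (72 points = 1 inch). */
    static double pointsToInches(double points) {
        return points / POINTS_PER_INCH;
    }

    /**
     * Converts a y value measured up from the bottom edge (PDF convention)
     * into a y value measured down from the top edge.
     * pageHeight must be in the same unit as yFromBottom (points).
     */
    static double toTopDownY(double yFromBottom, double pageHeight) {
        return pageHeight - yFromBottom;
    }

    public static void main(String[] args) {
        double pageHeight = 792.0; // assumed US Letter height: 11 in * 72
        System.out.println(pointsToInches(144.0));         // prints 2.0
        System.out.println(toTopDownY(720.0, pageHeight)); // prints 72.0
    }
}
```

So a rectangle edge reported at y = 720 on such a page would sit 72 points, i.e. one inch, below the top edge.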

In case of further inquiry, please feel free to let us know.

yichunxia · November 1, 2018, 2:28am

Hi Asad,
Thanks for your explanation. Please find attached the sample code with its output and the sample PDF file.
You can see that the space between QINGDAO and JOBOFONE, and other spaces, are lost.

Thanks for your continuous support.
sample1.zip (125.1 KB)

Tony

asad.ali · November 1, 2018, 12:11pm

@yichunxia

Thanks for sharing requested details.

Please update your code snippet for ParagraphAbsorber usage as follows in order to get correct text formatting:

Document doc = new Document(dataDir + "8.pdf");
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
StringBuilder sb =new StringBuilder();
for (PageMarkup pm:pa.getPageMarkups()){
 for (MarkupSection ms:pm.getSections()){
   for (MarkupParagraph mp:ms.getParagraphs()){
     sb.append(mp.getText());
     sb.append('\n');
   }
 }
 sb.append('\n');
}
System.out.println(sb);

In case of any further assistance, please feel free to let us know.

asad.ali · December 2, 2018, 8:33pm

@yichunxia

Please use the following code snippet with Aspose.PDF for Java 18.11 in order to meet your requirements:

Document doc = new Document(dataDir + "8.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

TableAbsorber absorber = new TableAbsorber();
PageCollection pc = doc.getPages();
String tempTable;
String text;
for (Page pg : pc) {
    text = "";
    absorber.visit(pg);
    textFragmentAbsorber.visit(pg);

    for (TextFragment tf : textFragmentAbsorber.getTextFragments()) {
        text += tf.getText();
    }

    for (AbsorbedTable table : absorber.getTableList()) {
      tempTable = "";
      for (AbsorbedRow row : table.getRowList()) {
         for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
              tempTable += tf.getText();
            }
         }
      }

      text = text.replace(tempTable, "");
    }
    
    // Get the text only 
    //System.out.println(text);
    
    // OR get the text as text fragments
    
    for (TextFragment tf : textFragmentAbsorber.getTextFragments()) {
      if (text.startsWith(tf.getText())) {
         text = text.substring(tf.getText().length(), text.length());
         System.out.print(tf.getText());
      }
    }
}

In case of any further assistance, please feel free to let us know.
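
For readers following along, the idea behind the snippet above can be shown on plain strings, without Aspose.PDF: concatenate the page text, concatenate each table's text, then remove the table text from the page text. The sketch below is only an illustration; the class name, method name and sample strings are made up. One consequence of this approach is that it only removes a table when its concatenated text appears in the page text as one contiguous run in the same order.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helper, no Aspose.PDF calls.
final class TableTextFilter {
    private TableTextFilter() {}

    /**
     * Returns pageText with every occurrence of each table's
     * concatenated text removed (same idea as text.replace(tempTable, "")).
     */
    static String removeTableText(String pageText, List<String> tableTexts) {
        String remaining = pageText;
        for (String tableText : tableTexts) {
            if (!tableText.isEmpty()) {
                remaining = remaining.replace(tableText, "");
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Made-up page text: a heading, one table's cells, a closing line
        String pageText = "CONSIGNEE" + "ItemQtyBolt4" + "Thanks";
        List<String> tables = Arrays.asList("ItemQtyBolt4");
        System.out.println(removeTableText(pageText, tables)); // prints CONSIGNEEThanks
    }
}
```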

HAREEM_HCL_COM · March 28, 2019, 12:05pm

Hi Team,(Java)
I need to extract content from a PDF document.
We converted an AutoCAD drawing (.dwg) file into a PDF document and are trying to read the values.
I tried the code snippet shared in this thread, but it did not work.
Please share some code.
Our PDF documents contain tables with content in them. They do not have any plain text.
All content is inside tables.
Document cannot be shared.
Please assist.

Regards,
Mamtha.A.C.D.

HAREEM_HCL_COM · March 28, 2019, 1:13pm

Hi Team,(Java)
Unable to extract table content from PDF documents, using below code.
for (AbsorbedTable table : absorber.getTableList()) {
    tempTable = "";
    for (AbsorbedRow row : table.getRowList()) {
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                tempTable += tf.getText();
            }
        }
    }…

Table contents are retrieved only as paragraphs.

I need to extract content from PDF document.
This is a different scenario.
Able to retrieve Paragraphs, and not tables.
Document cannot be shared.
Please assist.

Regards,
Mamtha.A.C.D.

asad.ali · March 28, 2019, 5:36pm

@HAREEM_HCL_COM

Before using the code snippet shared above, please make sure that your PDF document (which is obtained after converting .dwg) contains text or annotations inside it. In case it contains only image(s), you need to extract text from the images using an OCR operation. You may extract images from the PDF using Aspose.PDF for Java and later perform OCR on the images using Aspose.OCR for Java.

In case you still face any issue, please share your sample document with us. Please note that we need a sample document to investigate the scenario and replicate the issue in our environment. We assure you that we do not disclose your sample files to anyone and they are used only for testing purposes. As soon as the scenario investigation is completed, the files are removed from the system.

You may share the file privately in a private message by clicking the username and pressing the blue ‘Message’ button.

HAREEM_HCL_COM · March 29, 2019, 6:53am

Thanks for the update.
Hi Team,
The PDF document which we use has PDF contents only; it was not converted from any other file format.
So it is purely a PDF document, containing paragraphs, tables and images.
The code below returns all the data from the document: both paragraph and table data.
Code ----------------------------------------------
Document pdfDocument = new Document(inputFileNameAndPath);

System.out.println("ParaGraph content");
ParagraphAbsorber paraAbsober = new ParagraphAbsorber();
paraAbsober.visit(pdfDocument);
for (PageMarkup pm : paraAbsober.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph markupParagraph : ms.getParagraphs()) {
            System.out.println("Para --" + markupParagraph.getText());
        }
    }
}

I need to fetch only the table content as key-value pairs, whereas the paragraph absorber returns it column-wise as a single String.

So I need to iterate over the table by heading, rows and columns to build the output.
Please assist.

asad.ali · March 29, 2019, 5:34pm

@HAREEM_HCL_COM

Thanks for getting back to us.

The code snippet to extract table content has already been shared in this thread.

Please share a sample key-value pair example showing the way in which you want to extract table content. Also, as requested earlier, please share your sample PDF document. In case you do not want to share your original PDF file, you may create a sample PDF document with dummy content and share it with us. This will help us understand the scenario better and address it accordingly.

HAREEM_HCL_COM · April 5, 2019, 4:07pm

Hi Team,
My concern is that I have to fetch data from tables and set the values in a Collection.
However, using the AbsorbedTable code above, no data is retrieved. If I use ParagraphAbsorber, it returns the whole document, and I cannot extract only the table data.
It returns all data.
How can I filter only the table data? Or I need code to retrieve only table data.
I may have to fetch and replace the values of cells in another table in the same document.

Regards,
Mamtha.A.C.D.

asad.ali · April 5, 2019, 9:43pm

@HAREEM_HCL_COM

There is no specific way to extract only table data from the content extracted by ParagraphAbsorber, because ParagraphAbsorber extracts only text and the output may differ for different PDF documents.

The recommended and correct way to extract table data is to use TableAbsorber, and for some reason it is not working for your document. That is why we requested you to share the sample PDF file, so that we can investigate the issue and proceed to rectify this behavior of the API.

HAREEM_HCL_COM · April 8, 2019, 1:30pm

Thank you all.
I am able to get table data from the PDF. My initial concern is resolved.

Now, when it returns the table, I see it is returned in a very confusing manner.
It returns the data of 2 tables together.
It returns the table from the bottom of the page, and combines the data of 2 tables together.
For instance,
for table one, it returns each cell's data in the iteration below, treating each cell of the table as an AbsorbedCell.

for (AbsorbedCell cell : row.getCellList()) {
    for (TextFragment tf : cell.getTextFragments()) {
        for (TextSegment ts : tf.getSegments()) {
            String data = ts.getText(); // from each cell
        }
    }
}

In the other case, the entire 2 tables are treated as one single AbsorbedCell, and on iterating through the AbsorbedCell, each cell is treated as a TextFragment.

Why such a difference?
In one case, each cell is considered an AbsorbedCell.
In the other case, each cell is considered a TextFragment.
Please help me.

Also,
I need the name of the table, or at least a differentiation between table headers and table data. Is this possible with Aspose.PDF?

asad.ali · April 8, 2019, 6:51pm

@HAREEM_HCL_COM

Thanks for writing back.

It is quite possible that your PDF document has nested tables inside it, which is why the API is showing such behavior. Please note that each PDF document has its own type of structure and complexity, and the API may show different outputs depending upon it. In case the API is not returning correct or desired output, we investigate the scenario with a sample PDF to find the reasons for the issue.

You may also try a similar code snippet with other PDF documents that contain tables, and in case you notice similar behavior with every type of document, please share the complete code snippet which you are using at your end. We will test it with our sample files and share our feedback with you.

HAREEM_HCL_COM · April 9, 2019, 6:51am

Hi Team,
I need to fetch a specific table and iterate through its header and data, row by row. However, it does not return data by row; each cell's data is iterated and printed one after the other, so I cannot make out which set of data belongs to a single row.

asad.ali · April 9, 2019, 6:38pm

@HAREEM_HCL_COM

Earlier shared code snippet extracts the data in a sequence of rows. For example, please try to run following code snippet to extract data row by row and print it in a console output:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
String tempTable = "";
for (AbsorbedTable table : absorber.getTableList()) {
 for (AbsorbedRow row : table.getRowList()) {
   tempTable = "";
   for (AbsorbedCell cell : row.getCellList()) {
       for (TextFragment tf : cell.getTextFragments()) {
            tempTable += tf.getText();
       }
   }
   System.out.println(tempTable);
  }
}

Whereas, below code snippet can be used to extract data column-wise from a table:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
AbsorbedTable absorbedTable = absorber.getTableList().get_Item(0);
int absorbedRows = absorbedTable.getRowList().size();
int headerNo = 0;
String cellData = "";
for(int i =0; i < absorbedRows; i++){
 if(headerNo < absorbedTable.getRowList().get_Item(i).getCellList().size()) {
  for (TextFragment tf : absorbedTable.getRowList().get_Item(i).getCellList().get_Item(headerNo).getTextFragments()) {
      cellData += " " + tf.getText();
  }
 }
 else
 {
  break;
 }
 if(i == absorbedRows - 1) {
   System.out.println("Header/Col #: " + headerNo + " => " + cellData);
   headerNo++;
   i = -1;
   cellData = "";
  }
}

You can surely test these code snippets and modify them as per your requirements. In case you face any issue, please feel free to let us know.
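
As a follow-up on the key-value question raised earlier in this thread, below is a minimal sketch of one way to turn row-wise cell text into header-keyed records. It is plain Java and does not call Aspose.PDF: it assumes you have already collected each row's cell text into a List of Strings (for example inside the row loop above), and it assumes the first row is the header. That is an assumption about your document, not something this thread shows the API reporting. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical helper, no Aspose.PDF calls.
final class TableRecords {
    private TableRecords() {}

    /**
     * Treats rows.get(0) as the header and maps every later row
     * to header -> cell text. Cells beyond the shorter of the two
     * rows are ignored, so ragged rows do not throw.
     */
    static List<Map<String, String>> toRecords(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) {
            return records;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>(); // keeps column order
            int cells = Math.min(header.size(), row.size());
            for (int i = 0; i < cells; i++) {
                record.put(header.get(i), row.get(i));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for extracted cell text
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Qty"),
                Arrays.asList("Bolt", "4"),
                Arrays.asList("Nut", "12"));
        System.out.println(toRecords(rows)); // prints [{Item=Bolt, Qty=4}, {Item=Nut, Qty=12}]
    }
}
```

A LinkedHashMap is used so that each record keeps the column order of the header row; duplicate header names would overwrite each other and would need separate handling.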

dasadla · April 10, 2019, 3:00am

Hi @asad.ali

Can you show me how to remove the tabular content in a pdf and turn it into a new pdf file?

Also, can aspose remove tabular content in rtf file?

Thanks.

asad.ali · April 10, 2019, 1:26pm

@dasadla

Thanks for your inquiry.

You can remove table content from a PDF by setting the text fragments to an empty string, for example as in the code below:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
for (AbsorbedTable table : absorber.getTableList()) {
    for (AbsorbedRow row : table.getRowList()) {
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                // Blank out the text of every fragment inside the table
                tf.setText("");
            }
        }
    }
}
// Save the result as a new PDF (the output file name is only an example)
doc.save(dataDir + "TableContentRemoved.pdf");

In case you face any issue while accomplishing your requirements, please share your sample PDF document with us. We will test the scenario in our environment and address it accordingly.

The RTF file format is supported by the Aspose.Words API, and we will update you soon regarding removing table content from an .rtf file.

awais.hafeez · April 10, 2019, 2:20pm

@dasadla,

Yes, you can remove all Tables or any single Table from Word document (RTF file) by using the following Aspose.Words for Java code:

Document doc = new Document("E:\\temp\\tables.rtf");

// Remove a single Table (here, the first one in the document)
Table tab = (Table) doc.getChildNodes(NodeType.TABLE, true).get(0);
 tab.remove();

// Remove all Tables
//doc.getChildNodes(NodeType.TABLE, true).clear();

doc.save("E:\\Temp\\awjava-19.4.pdf");

Hope this helps.