How to read a PDF file and extract text, tables, cells, etc. via aspose-pdf?

yichunxia · October 25, 2018, 5:35am

Guys,
I’d like to read a PDF file the way we do with aspose-word for doc files. I want to get the lines of the text paragraphs with their text contents, and the tables, rows, columns and cells with the text contents of each cell. I also want the coordinate relations of the cells, like cell(0,0), cell(0,1)…

The only information I found is the TableAbsorber example. I am not quite sure whether this is the correct approach for me, and the example actually throws an NPE when visiting the document page:

absorber.visit(doc.getPages().get_Item(1));

Exception in thread "main" java.lang.NullPointerException
at com.aspose.pdf.TableAbsorber.m5(Unknown Source)
at com.aspose.pdf.TableAbsorber.visit(Unknown Source)
at com.tonysoft.utest.aspose.Doc2Pdf.readPdf(Doc2Pdf.java:48)
at com.tonysoft.utest.aspose.Doc2Pdf.main(Doc2Pdf.java:40)

Please help me out: is this doable? Are there any examples or documents?

Appreciate
Tony

asad.ali · October 25, 2018, 12:16pm

@yichunxia

Thanks for contacting support.

Aspose.PDF for Java supports the features of text extraction, table extraction as well as extracting text by paragraphs. If you could please share your sample PDF document from which you want to extract content and some details about expected results, we will be able to assist you accordingly by sharing some sample code snippet(s).

yichunxia · October 26, 2018, 4:56am

sample.pdf (177.6 KB)
Please find the file attached. I want to read all the information about the tables, cells and text contents. If aspose-pdf can provide the coordinates of each cell, like the top-left and bottom-right coordinates, that would be perfect.
Another question: if my PDF file contains no table at all, what happens if I try to use TableAbsorber to extract table cells? Conversely, what happens if I use the paragraph absorber to extract a table in a PDF?
Is there any way to detect what a page contains: a table, plain text paragraphs, or something else?

Appreciate.
Tony

asad.ali · October 26, 2018, 3:53pm

@yichunxia

Thanks for sharing sample PDF document.

We have checked the file which you have shared and would like to share with you that you can extract table content as well as coordinates of cells using following code snippet:

// Requires: import com.aspose.pdf.*;
// dataDir is the folder that contains sample.pdf.
try {
    Document doc = new Document(dataDir + "sample.pdf");
    TableAbsorber absorber = new TableAbsorber();
    for (Page pg : doc.getPages()) {
        // Detect the tables on this page.
        absorber.visit(pg);
        for (AbsorbedTable table : absorber.getTableList()) {
            for (AbsorbedRow row : table.getRowList()) {
                for (AbsorbedCell cell : row.getCellList()) {
                    // Print the cell rectangle, then the text inside the cell.
                    System.out.println(cell.getRectangle());
                    for (TextFragment tf : cell.getTextFragments()) {
                        for (TextSegment ts : tf.getSegments()) {
                            System.out.println(ts.getText());
                        }
                    }
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

In case your PDF does not contain any table, you cannot use TableAbsorber, as it will not be able to extract any data. Instead, you should use TextFragmentAbsorber, TextAbsorber or ParagraphAbsorber according to your requirements.

You can check whether a PDF contains table or plain text by using TableAbsorber and checking count of absorbed table (e.g. absorber.getTableList().size()). If size is more than zero, it means document contains table(s) inside it. In case of any further assistance, please feel free to let us know.

yichunxia · October 29, 2018, 12:29am

error.png (19.4 KB)
Hi Asad,
Thanks for your quick response. I got a compile error in my environment; I am trialing aspose-pdf 18.7 with JDK 1.8.
Can you help me out with that?
Thanks
Tony

yichunxia · October 29, 2018, 12:43am

sample.zip (122.3 KB)

Hi Asad,
I tried your code, but I cannot extract any text content from the PDF. Could you have a look? Attached are the sample code and the PDF file I used.
I resolved the previous problem in the sample code; please also review whether it is the correct approach.

Appreciate.
Tony

asad.ali · October 29, 2018, 10:46am

@yichunxia

Thanks for getting back to us.

We have tested both of your methods, using ParagraphAbsorber and TableAbsorber, with Aspose.PDF for Java 18.9 and did not notice any issue. All text from the PDF document was extracted. Furthermore, with this particular PDF document, the ParagraphAbsorber class is able to extract all of the text, whereas TableAbsorber can only be used to extract table data.

Please check following code snippet and extracted text output for your reference, which we have used in our environment to test the scenario:

// Requires: import com.aspose.pdf.*; import java.util.List;
Document doc = new Document(dataDir + "8.pdf");

// 1) Extract all text, paragraph by paragraph.
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
for (PageMarkup pm : pa.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph mp : ms.getParagraphs()) {
            StringBuilder sb = new StringBuilder();
            for (List<TextFragment> tflist : mp.getLines()) {
                for (TextFragment tf : tflist) {
                    sb.append(tf.getText());
                }
                sb.append("\r\n"); // end of line
            }
            sb.append("\r\n"); // blank line between paragraphs
            System.out.println(sb);
        }
    }
}

// 2) Extract the table cells together with their rectangles.
try {
    TableAbsorber absorber = new TableAbsorber();
    for (Page pg : doc.getPages()) {
        absorber.visit(pg);
        for (AbsorbedTable table : absorber.getTableList()) {
            for (AbsorbedRow row : table.getRowList()) {
                for (AbsorbedCell cell : row.getCellList()) {
                    System.out.println(cell.getRectangle());
                    for (TextFragment tf : cell.getTextFragments()) {
                        for (TextSegment ts : tf.getSegments()) {
                            System.out.println(ts.getText());
                        }
                    }
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

outputtext.zip (1.6 KB)

Furthermore, it is always recommended to use the latest version because it contains more fixes and enhancements. Please use the latest version of the API with a valid license. In case you do not have a valid license, you can get a 30-day temporary license from our website. Please feel free to let us know if you face any issue.

yichunxia · October 31, 2018, 9:42am

Hi Asad,
Thanks for your reply.
I finally got it working, but I have some other questions that need your support:

Is there any method to get only the text that is NOT in tables, so that I can handle the text-only and table-only cases separately? I tried ParagraphAbsorber and TextAbsorber, and both of them printed out all the text contents.
All the line breaks and spaces are lost, and all sentences and characters run together, for example:
CONSIGNEE
Line2 U.S.A
ADDRESS #3

was converted to: CONSIGNEELine2 U.S.AADDRESS #3

What do the coordinates mean? Pixels measured from the top-left corner of the file, or something else?

Thanks
Tony

asad.ali · October 31, 2018, 6:29pm

@yichunxia

Thanks for getting back to us.

It is good to know that things have started working at your side.

I am afraid Aspose.PDF does not offer a method at the moment to extract only the text outside tables. However, an investigation ticket has already been logged as PDFJAVA-38108 in our issue tracking system. We will further investigate the feasibility of such a feature and, as soon as we have some updates regarding its availability, we will let you know. Please spare us a little time.

Regarding the lost line breaks and spaces, would you please share the respective input PDF document with us? Furthermore, please also specify which way you are using to extract text from it (e.g. TableAbsorber, TextFragmentAbsorber, etc.). We will test the scenario in our environment and address it accordingly.

The coordinate system in Aspose.PDF API follows the standard coordinates as per the PDF Specification, in which (0,0) means bottom-left. In other words, the origin of a PDF page is its bottom-left corner: PDF_Rectangle.png (2.8 KB). Furthermore, the basic unit of measurement in Aspose.PDF is the point, where 72 points = 1 inch.
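
To make the units concrete, here is a minimal sketch in plain Java (no Aspose.PDF calls) that converts points to inches or pixels and flips a y value from the bottom-left origin to a top-left origin. The class name PdfCoordinates, the 96 DPI figure, the 792-point page height and the sample rectangle are assumptions chosen for illustration only.

```java
// Hypothetical helper written for this sketch; it is not part of Aspose.PDF.
final class PdfCoordinates {
    static final double POINTS_PER_INCH = 72.0;

    // 72 points make one inch.
    static double pointsToInches(double points) {
        return points / POINTS_PER_INCH;
    }

    // Pixel size depends on the rendering resolution (dots per inch).
    static double pointsToPixels(double points, double dpi) {
        return points * dpi / POINTS_PER_INCH;
    }

    // Turn a y measured upward from the bottom edge into a y measured
    // downward from the top edge of a page of the given height.
    static double flipY(double yFromBottom, double pageHeight) {
        return pageHeight - yFromBottom;
    }

    public static void main(String[] args) {
        double pageHeight = 792.0; // assumed page height in points (11 inches)
        // Assumed cell rectangle: lower-left (72, 648), upper-right (216, 720).
        double llx = 72.0, lly = 648.0, urx = 216.0, ury = 720.0;

        // In a top-left system the top edge of the cell comes from ury.
        double top = flipY(ury, pageHeight);
        double bottom = flipY(lly, pageHeight);

        System.out.println("left (in)   = " + pointsToInches(llx));
        System.out.println("width (in)  = " + pointsToInches(urx - llx));
        System.out.println("top (pt)    = " + top);
        System.out.println("bottom (pt) = " + bottom);
        System.out.println("top (px at 96 DPI) = " + pointsToPixels(top, 96.0));
    }
}
```

Note that the upper edge of a rectangle (ury) becomes the smaller y value after the flip, so the top-left corner in a screen-style system is (llx, pageHeight - ury).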

In case of further inquiry, please feel free to let us know.

yichunxia · November 1, 2018, 2:28am

Hi Asad,
Thanks for your explanation. Please find attached the sample code with its output and the sample PDF file.
You can see that the space between QINGDAO and JOBOFONE, and other spaces, are lost.

Thanks for your continuous support.
sample1.zip (125.1 KB)

Tony

asad.ali · November 1, 2018, 12:11pm

@yichunxia

Thanks for sharing requested details.

Please update your code snippet for ParagraphAbsorber usage as follows in order to get correct text formatting:

Document doc = new Document(dataDir + "8.pdf");
ParagraphAbsorber pa = new ParagraphAbsorber();
pa.visit(doc);
StringBuilder sb =new StringBuilder();
for (PageMarkup pm:pa.getPageMarkups()){
 for (MarkupSection ms:pm.getSections()){
   for (MarkupParagraph mp:ms.getParagraphs()){
     sb.append(mp.getText());
     sb.append('\n');
   }
 }
 sb.append('\n');
}
System.out.println(sb);

In case of any further assistance, please feel free to let us know.

asad.ali · December 2, 2018, 8:33pm

@yichunxia

Please use the following code snippet with Aspose.PDF for Java 18.11 in order to meet your requirements:

Document doc = new Document(dataDir + "8.pdf");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

TableAbsorber absorber = new TableAbsorber();
PageCollection pc = doc.getPages();
String tempTable;
String text;
for (Page pg : pc) {
    text = "";
    absorber.visit(pg);
    textFragmentAbsorber.visit(pg);

    for (TextFragment tf : textFragmentAbsorber.getTextFragments()) {
        text += tf.getText();
    }

    for (AbsorbedTable table : absorber.getTableList()) {
      tempTable = "";
      for (AbsorbedRow row : table.getRowList()) {
         for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
              tempTable += tf.getText();
            }
         }
      }

      text = text.replace(tempTable, "");
    }
    
    // Get the text only 
    //System.out.println(text);
    
    // OR get the text as text fragments
    
    for (TextFragment tf : textFragmentAbsorber.getTextFragments()) {
      if (text.startsWith(tf.getText())) {
         text = text.substring(tf.getText().length(), text.length());
         System.out.print(tf.getText());
      }
    }
}

In case of any further assistance, please feel free to let us know.
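
The string replacement above only removes table text when that text appears in the page text as one identical, contiguous run. An alternative idea, sketched here under assumptions, is to compare rectangles instead: keep a fragment only if its box does not lie inside any cell box. The Box and Fragment classes below are hypothetical stand-ins written for this sketch, not Aspose.PDF types. In real code the cell boxes would come from cell.getRectangle() as shown earlier in this thread, the fragment boxes are assumed to be available from the text fragments, and the 1-point tolerance is a guess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: filters text fragments by geometry, without Aspose.PDF.
final class TableTextFilter {

    // Axis-aligned box in PDF points, origin at the bottom-left of the page.
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        // True when 'inner' lies entirely within this box, give or take 'tol' points.
        boolean contains(Box inner, double tol) {
            return inner.llx >= llx - tol && inner.lly >= lly - tol
                    && inner.urx <= urx + tol && inner.ury <= ury + tol;
        }
    }

    // A piece of text together with its bounding box.
    static final class Fragment {
        final String text;
        final Box box;

        Fragment(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    // Returns the fragments that are not inside any of the given cell boxes.
    static List<Fragment> outsideCells(List<Fragment> fragments, List<Box> cellBoxes, double tol) {
        List<Fragment> result = new ArrayList<>();
        for (Fragment f : fragments) {
            boolean inCell = false;
            for (Box cell : cellBoxes) {
                if (cell.contains(f.box, tol)) {
                    inCell = true;
                    break;
                }
            }
            if (!inCell) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: one heading above a table and one cell text.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Heading", new Box(72, 700, 200, 715)),
                new Fragment("QINGDAO", new Box(80, 500, 150, 512)));
        List<Box> cellBoxes = Arrays.asList(new Box(72, 490, 300, 520));

        for (Fragment f : outsideCells(fragments, cellBoxes, 1.0)) {
            System.out.println(f.text);
        }
    }
}
```

The same loop with the condition inverted would give the table-only text, so both cases can be handled separately from one pass over the fragments.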

HAREEM_HCL_COM · March 28, 2019, 12:05pm

Hi Team, (Java)
I need to extract content from a PDF document.
We converted an AutoCAD drawing (.dwg) file into a PDF document and are trying to read the values.
I tried the code snippet shared in this thread, but it did not work.
Please share some code.
Our PDF documents contain tables with content in them. They do not have any plain text.
All content is inside tables.
The document cannot be shared.
Please assist.

Regards,
Mamtha.A.C.D.

HAREEM_HCL_COM · March 28, 2019, 1:13pm

Hi Team, (Java)
I am unable to extract table content from PDF documents using the code below.
for (AbsorbedTable table : absorber.getTableList()) {
    tempTable = "";
    for (AbsorbedRow row : table.getRowList()) {
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                tempTable += tf.getText();
            }
        }
    }…

The table contents are retrieved as paragraphs.

I need to extract content from a PDF document.
This is a different scenario.
I am able to retrieve paragraphs, but not tables.
The document cannot be shared.
Please assist.

Regards,
Mamtha.A.C.D.

asad.ali · March 28, 2019, 5:36pm

@HAREEM_HCL_COM

Before using above shared code snippet, please make sure that your PDF document (which is obtained after converting .dwg) contains text or annotations inside it. In case it contains only image(s), you need to extract text from images using OCR operation. You may extract images from PDF using Aspose.PDF for Java and later perform OCR on image using Aspose.OCR for Java.

In case you still face any issue, please share your sample document with us. Please note that we need a sample document to investigate the scenario and replicate the issue in our environment. We assure you that we do not disclose your sample files to anyone and they are used only for testing purposes. As soon as the scenario investigation is completed, the files are removed from the system.

You may share the file privately in a private message by clicking on the username and pressing the blue ‘Message’ button.

HAREEM_HCL_COM · March 29, 2019, 6:53am

Thanks for the update.
Hi Team,
The PDF document which we use has PDF contents only; it was not converted from any other file format.
So it is purely a PDF document, containing paragraphs, tables and images.
The code below returns all the data from the document, both paragraph and table data.
Code ----------------------------------------------
Document pdfDocument = new Document(inputFileNameAndPath);

System.out.println("ParaGraph content");
ParagraphAbsorber paraAbsober = new ParagraphAbsorber();
paraAbsober.visit(pdfDocument);
for (PageMarkup pm : paraAbsober.getPageMarkups()) {
    for (MarkupSection ms : pm.getSections()) {
        for (MarkupParagraph markupParagraph : ms.getParagraphs()) {
            System.out.println("Para --" + markupParagraph.getText());
        }
    }
}

I need to fetch only the table content, as key-value pairs, whereas ParagraphAbsorber returns the content column-wise as a single String.

So I need to iterate over the table by heading, rows and columns to build the output.
Please assist.

asad.ali · March 29, 2019, 5:34pm

@HAREEM_HCL_COM

Thanks for getting back to us.

The code snippet to extract table content has already been shared in this thread.

Please share a sample key-value pair example showing how you want the table content to be extracted. Also, as requested earlier, please share your sample PDF document. In case you do not want to share your original PDF file, you may create a sample PDF document with dummy content and share it with us. This will help us understand the scenario better and address it accordingly.
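
One possible shape for such a key-value result, as a sketch only: treat the first row of a table as the header and map every following row to header text and cell text. It works on plain lists of strings, so filling those lists from the cell texts returned by TableAbsorber is left out. The class name TableRowsToMaps, the first-row-is-header rule and the sample values are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns rows of cell texts into one map per data row.
final class TableRowsToMaps {

    // rows.get(0) is assumed to be the header row; every later row is data.
    static List<Map<String, String>> toRecords(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) {
            return records;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            // LinkedHashMap keeps the column order of the header.
            Map<String, String> record = new LinkedHashMap<>();
            for (int col = 0; col < header.size(); col++) {
                // Rows shorter than the header get an empty value.
                String value = col < row.size() ? row.get(col) : "";
                record.put(header.get(col), value);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample table: one header row and two data rows.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("CONSIGNEE", "PORT"),
                Arrays.asList("Line2 U.S.A", "QINGDAO"),
                Arrays.asList("ADDRESS #3"));

        for (Map<String, String> record : toRecords(rows)) {
            System.out.println(record);
        }
    }
}
```

If two header cells carry the same text, the later column overwrites the earlier one in this sketch, so real tables may need a different key scheme.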

HAREEM_HCL_COM · April 5, 2019, 4:07pm

Hi Team,
My concern is that I have to fetch data from tables and set the values into a Collection.
However, using the AbsorbedTable code above, no data is retrieved. If I use ParagraphAbsorber, it returns the whole document, and I cannot extract only the table data.
It returns all data.
How can I filter only the table data? Otherwise, I need code to retrieve only table data.
I may have to fetch and replace the values of cells in another table in the same document.

Regards,
Mamtha.A.C.D.

asad.ali · April 5, 2019, 9:43pm

@HAREEM_HCL_COM

There is no specific way to extract only table data from the content extracted by ParagraphAbsorber, because ParagraphAbsorber extracts only text and the output may differ for different PDF documents.

The recommended and correct way to extract table data is to use TableAbsorber, which, for some reason, is not working for your document. That is why we requested you to share the sample PDF file, so that we can investigate the issue and proceed to rectify this behavior of the API.

HAREEM_HCL_COM · April 8, 2019, 1:30pm

Thank you all.
I am able to get table data from the PDF. My initial concern is resolved.

Now, when it returns the tables, I see that it returns them in a very confusing manner.
It returns the data of 2 tables together:
it returns the tables starting from the bottom of the page, and combines the data of the 2 tables.
For instance,
from table one, it returns the data of each cell in the iteration below, treating each cell of the table as an AbsorbedCell.

for (AbsorbedCell cell : row.getCellList()) {
    for (TextFragment tf : cell.getTextFragments()) {
        for (TextSegment ts : tf.getSegments()) {
            String data = ts.getText(); // from each cell
        }
    }
}

In the other case, the entire 2 tables are treated as one single AbsorbedCell, and on iterating through the AbsorbedCell, each cell is treated as a TextFragment.

Why such a difference?
In one case, each cell is considered an AbsorbedCell.
In the other case, each cell is considered a TextFragment.
Please help me.

Also,
I need the name of the table, or at least a differentiation between table headers and table data. Is this possible with Aspose.PDF?