How to read PDF file and extract text, tables, cells etc via aspose-pdf?

asad.ali · April 8, 2019, 6:51pm

Thanks for writing back.

It is quite possible that your PDF document contains nested tables, which is why the API is showing such behavior. Please note that each PDF document has its own structure and complexity, and the API may produce different output depending on it. When the API does not return the correct or desired output, we usually investigate the scenario with a sample PDF to find the cause of the issue.

You may also try a similar code snippet with other PDF documents that contain tables. If you notice similar behavior with every type of document, please share the complete code snippet that you are using at your end. We will test it with our sample files and share our feedback with you.

HAREEM_HCL_COM · April 9, 2019, 6:51am

Hi Team,
I need to fetch a specific table and iterate through its header and data, row by row. However, the data is not returned by row: each cell is iterated and printed one after the other, so I cannot make out which data belongs to a single row.

asad.ali · April 9, 2019, 6:38pm

@HAREEM_HCL_COM

The code snippet shared earlier extracts the data as a sequence of rows. For example, please try running the following code snippet to extract data row by row and print it to the console:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
// Detect the tables on the first page
absorber.visit(doc.getPages().get_Item(1));
String tempTable = "";
for (AbsorbedTable table : absorber.getTableList()) {
    for (AbsorbedRow row : table.getRowList()) {
        // Start a fresh line of text for every row
        tempTable = "";
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                tempTable += tf.getText();
            }
        }
        // Print the text of all cells in this row on one line
        System.out.println(tempTable);
    }
}
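
The row-by-row snippet above appends the text of each cell without any separator, so neighbouring cell values can run together on the printed line. A minimal sketch of one way to keep them apart is shown here. It uses plain lists of strings in place of the Aspose row and cell objects, and the class name RowFormatter and its methods are hypothetical names introduced only for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain strings stand in for the text of absorbed cells.
final class RowFormatter {

    // Joins the cell texts of one row with a visible separator.
    static String formatRow(List<String> cells, String separator) {
        return String.join(separator, cells);
    }

    // Formats every row of a table, producing one output line per row.
    static List<String> formatTable(List<List<String>> rows, String separator) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(formatRow(row, separator));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data (made up for this sketch): a header row and one data row.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Bolt", "12"));
        for (String line : formatTable(rows, " | ")) {
            System.out.println(line);
        }
    }
}
```

With the Aspose snippet, the same idea would mean collecting the text of each cell into a list first and joining the list once per row, instead of appending every fragment to a single string.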

The code snippet below, in turn, can be used to extract data column-wise from a table:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
// Work with the first table found on the page
AbsorbedTable absorbedTable = absorber.getTableList().get_Item(0);
int absorbedRows = absorbedTable.getRowList().size();
int headerNo = 0; // index of the column currently being collected
String cellData = "";
for (int i = 0; i < absorbedRows; i++) {
    if (headerNo < absorbedTable.getRowList().get_Item(i).getCellList().size()) {
        for (TextFragment tf : absorbedTable.getRowList().get_Item(i).getCellList().get_Item(headerNo).getTextFragments()) {
            cellData += " " + tf.getText();
        }
    } else {
        // This row has no cell at the current column index: stop
        break;
    }
    if (i == absorbedRows - 1) {
        // Last row reached: print the column, then restart at the first row for the next column
        System.out.println("Header/Col #: " + headerNo + " => " + cellData);
        headerNo++;
        i = -1; // becomes 0 again after the loop increment
        cellData = "";
    }
}
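
The column-wise loop above walks all rows for one column and then resets its counter to start over for the next column. The same regrouping can be sketched with two nested loops, which may be easier to follow. This sketch again uses plain lists of strings instead of the Aspose objects, and ColumnReader is a hypothetical name. One difference to be aware of: the loop above stops completely at the first row that is shorter than the current column index, whereas this sketch simply skips such rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain strings stand in for the text of absorbed cells.
final class ColumnReader {

    // Width of the widest row, i.e. the number of columns to visit.
    static int columnCount(List<List<String>> rows) {
        int max = 0;
        for (List<String> row : rows) {
            max = Math.max(max, row.size());
        }
        return max;
    }

    // Regroups row-major cell texts by column; rows too short for a column are skipped.
    static List<List<String>> byColumn(List<List<String>> rows) {
        List<List<String>> columns = new ArrayList<>();
        int count = columnCount(rows);
        for (int col = 0; col < count; col++) {
            List<String> column = new ArrayList<>();
            for (List<String> row : rows) {
                if (col < row.size()) {
                    column.add(row.get(col));
                }
            }
            columns.add(column);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Sample data (made up for this sketch); the second row has only two cells.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Qty", "Unit"),
                Arrays.asList("Bolt", "12"),
                Arrays.asList("Nut", "30", "pcs"));
        List<List<String>> columns = byColumn(rows);
        for (int col = 0; col < columns.size(); col++) {
            System.out.println("Header/Col #: " + col + " => " + String.join(" ", columns.get(col)));
        }
    }
}
```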

You can test these code snippets and modify them as per your requirements. In case you face any issue, please feel free to let us know.

dasadla · April 10, 2019, 3:00am

Hi @asad.ali

Can you show me how to remove the tabular content from a PDF and save the result as a new PDF file?

Also, can Aspose remove tabular content from an RTF file?

Thanks.

asad.ali · April 10, 2019, 1:26pm

@dasadla

Thanks for your inquiry.

You can remove table content from a PDF by setting the text of its text fragments to an empty string, as in the code below:

Document doc = new Document(dataDir + "TableWithRepeatingHeader.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.visit(doc.getPages().get_Item(1));
for (AbsorbedTable table : absorber.getTableList()) {
    for (AbsorbedRow row : table.getRowList()) {
        for (AbsorbedCell cell : row.getCellList()) {
            for (TextFragment tf : cell.getTextFragments()) {
                // Blank out the text of every fragment inside the cell
                tf.setText("");
            }
        }
    }
}
// Save the result as a new PDF file (the output file name is only an example)
doc.save(dataDir + "TableContentRemoved.pdf");

In case you face any issue while accomplishing your requirements, please share your sample PDF document with us. We will test the scenario in our environment and address it accordingly.

The RTF file format is supported by the Aspose.Words API, and we will update you soon regarding removing table content from an .rtf file.

awais.hafeez · April 10, 2019, 2:20pm

@dasadla,

Yes, you can remove all Tables or any single Table from a Word document (RTF file) by using the following Aspose.Words for Java code:

Document doc = new Document("E:\\temp\\tables.rtf");

// Remove a single Table (here, the first one in the document)
Table tab = (Table) doc.getChildNodes(NodeType.TABLE, true).get(0);
tab.remove();

// Remove all Tables
//doc.getChildNodes(NodeType.TABLE, true).clear();

doc.save("E:\\Temp\\awjava-19.4.pdf");

Hope this helps.

vsantosh · November 6, 2019, 10:19am

file 3.pdf (48.7 KB)

I need to read the file with all of its content, including the table row by row with headers.

asad.ali · November 6, 2019, 5:22pm

@vsantosh

Would you kindly share which API you are using i.e. Aspose.PDF for .NET or Java? We will test the scenario accordingly and share our feedback.

amitchakravarthy · May 4, 2020, 9:27am

Hi Asad

We are getting the error "Caused by: java.lang.ClassNotFoundException: com.aspose.pdf.internal.ms.System.Collections.Generic.IGenericList cannot be found" when running the project in OSGi, but the same code works when we run it as a single Java file.
Also, the table list size is returned as 0 for a table created without borders, but it works for a table with borders.

asad.ali · May 4, 2020, 6:27pm

@amitchakravarthy

Would you please share your sample PDF document along with the complete sample code snippet? We will test the scenario in our environment and address it accordingly.

amitchakravarthy · May 6, 2020, 9:18am

Hi Asad,

In the MANIFEST.MF inside the aspose.pdf-19.11.jar file, the packages com.aspose.pdf.internal and com.aspose.pdf.engine are marked as private. This prevents us from using these packages in OSGi: access to the IGenericList class is not permitted, and we get a NoClassDefFoundError.

Directive in the MANIFEST.MF:

Private-Package: com.aspose.pdf.internal,com.aspose.pdf.engine
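
To check how a jar declares a package, its manifest can be read with the standard java.util.jar.Manifest class. The sketch below is a minimal example under stated assumptions: the class ManifestPackageCheck and its methods are hypothetical names, the manifest text is passed in as a string rather than read from the real jar, and the comma split is naive (it would mis-split quoted version ranges that themselves contain commas). In OSGi, only packages listed in Export-Package are visible to other bundles, so a package that appears only under Private-Package cannot be loaded from outside the bundle.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical helper for inspecting package headers in a manifest.
final class ManifestPackageCheck {

    // Parses manifest text; each header line must end with a newline.
    static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if pkg appears in the comma-separated value of the given main header.
    static boolean listedIn(Manifest manifest, String header, String pkg) {
        String value = manifest.getMainAttributes().getValue(header);
        if (value == null) {
            return false;
        }
        // Naive split: assumes no quoted parameter values containing commas.
        for (String entry : value.split(",")) {
            // Drop parameters such as ";version=..." before comparing names.
            String name = entry.split(";")[0].trim();
            if (name.equals(pkg)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Manifest manifest = parse("Manifest-Version: 1.0\n"
                + "Private-Package: com.aspose.pdf.internal,com.aspose.pdf.engine\n");
        String pkg = "com.aspose.pdf.internal";
        System.out.println("private:  " + listedIn(manifest, "Private-Package", pkg));
        System.out.println("exported: " + listedIn(manifest, "Export-Package", pkg));
    }
}
```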

public static void getTableObsorber(PageCollection col) {
	TableAbsorber absorber = new TableAbsorber();
	Page page = col.get_Item(1);
	absorber.visit(page);
	IGenericList<AbsorbedTable> l = absorber.getTableList();
	System.out.println("size" + absorber.getTableList().size());
	for (AbsorbedTable table : l) {
		for (AbsorbedRow row : table.getRowList()) {
			for (AbsorbedCell cell : row.getCellList()) {
				System.out.println(cell.getRectangle());
				for (TextFragment tf : cell.getTextFragments()) {
					for (TextSegment ts : tf.getSegments()) {
						System.out.println(ts.getText());
						LinkAnnotation linkAnnotation = new LinkAnnotation(page, cell.getRectangle());
						GoToRemoteAction remoteAction = new GoToRemoteAction("multiheader2.pdf",
								new XYZExplicitDestination(2, 0, cell.getRectangle().getHeight(), 0));
						linkAnnotation.setAction(remoteAction);
						page.getAnnotations().add(linkAnnotation);
					}
				}
			}
		}
	}
}

Caused by: java.lang.NoClassDefFoundError: com/aspose/pdf/internal/ms/System/Collections/Generic/IGenericList
at com.dummy.pro.util.GenerateTocUtil.getTableObsorber(GenerateTocUtil.java:240)
at com.dummy.pro.bo.test.testServiceImpl.generateToc(testServiceImpl.java:1521)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:205)
at com.sun.proxy.$Proxy205.generateToc(Unknown Source)
… 46 more
Caused by: java.lang.ClassNotFoundException: com.aspose.pdf.internal.ms.System.Collections.Generic.IGenericList cannot be found by com.dummy.pro_10.2.0.0
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:501)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:421)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:412)
at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
… 55 more

ctd-toc-1-1.pdf (62.3 KB)

asad.ali · May 6, 2020, 8:05pm

@amitchakravarthy

We are afraid that private/internal components of the API cannot be exposed as they are obfuscated. However, we will surely investigate OSGi support, and we have logged an investigation ticket as PDFJAVA-39391 in our issue tracking system for this purpose. We will look into it further and keep you posted on the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

aspose.notifier · June 23, 2020, 7:38pm

The issues you have found earlier (filed as PDFJAVA-39391) have been fixed in Aspose.PDF for Java 20.6.