Can various data objects be read sequentially from a PDF file?

ruhongcai · October 9, 2018, 2:08am

Hi,

Is there any way to read a given PDF sequentially from beginning to end by “unit/element/component”?
The returned result could be any List of Object, where each Object is a “Text”, “Table”, “Image”, etc. component of the PDF file.

If so, please provide the sample code.

Many thanks!

Ruhong

asad.ali · October 9, 2018, 11:44am

@ruhongcai

Thanks for contacting support.

Would you please share some more details, such as which environment you are working in, e.g. Java or .NET? It would also be helpful if you could share a sample PDF document along with the expected output file or a screenshot. We will test the scenario in our environment and address it accordingly.

ruhongcai · October 9, 2018, 3:07pm

Hi,

Thanks for the response.
The sample file is attached; we run Java on Windows/Unix.

We expect Aspose.PDF to return the following for the attached sample_pdf,
in order:
Page1:
Header
leftside (“Printed… confidential”)
Table
text
image
text
image
table- bottom (“Comments:…”)
text (table name— “4.2 Export to Zip”)
table
bottom-line (“Effective template: XX-xXx-000-0007, Revision 3.0”)

Page2
Page3
Page4
Page5
…

sample_pdf.pdf (4.0 MB)

Thanks!

Ruhong

asad.ali · October 9, 2018, 8:18pm

@ruhongcai

Thanks for sharing further details.

At the moment, you can extract images and text from a PDF document using the Aspose.PDF API. However, we are trying to apply DSR (document structure recognition) in order to get these elements, and a ticket has been logged in our issue tracking system for this as PDFJAVA-38058. We have linked the ticket with this thread so that you will receive a notification once the feature is available. Please spare us a little time.

We are sorry for the inconvenience.

ruhongcai · October 9, 2018, 8:25pm

Hi,

Thanks for the information. When you “extract images and text from PDF document using Aspose.PDF API”, does it return them in the same order as in the PDF file?

How about tables and cells?

Please let me know how much of this requirement I can cover with the current API. Many thanks! Sample code is appreciated.

Ruhong

asad.ali · October 9, 2018, 8:37pm

@ruhongcai

You can extract Text, Images, Annotations and Table data from a PDF document using Aspose.PDF for Java. While iterating through the Page Collection of a document, you will get the respective data in sequence. You can find code examples of the basic functionality for extracting this data at the given links. In case you need any further assistance, please feel free to let us know.

ruhongcai · October 9, 2018, 8:47pm

Thanks!

Ruhong

ruhongcai · October 9, 2018, 11:15pm

Hi,

From the API docs, the examples mostly just extract Text, Images, Annotations and Tables individually;
what I need is to dynamically read all elements from a Page and do some comparison work.

Table and Image both extend com.aspose.pdf.BaseParagraph,
however TextElement extends com.aspose.pdf.Element.
So how can I get a Collection of xxx from a Page, where xxx = <Text, Image, Table, etc.>,
in sequence?

Regarding “While iterating through Page Collection of a document, you will get respective data in sequence”: could you show sample code for how to do this?

Many thanks!

Ruhong

asad.ali · October 10, 2018, 10:13am

@ruhongcai

Thanks for writing back.

Please note that you can iterate through the Page Collection as follows:

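import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.PageCollection;

public class IteratePages {
    public static void main(String[] args) {
        // Folder that holds the input file; adjust to your environment
        String dataDir = "";
        Document pdfDocument = new Document(dataDir + "05.pdf");
        PageCollection pageCollection = pdfDocument.getPages();
        for (Page page : pageCollection) {
            System.out.println(page.getNumber());
            // Extract text from the current page
            ///// Code for extracting text
            // Extract images from the current page
            ///// Code for extracting images
            // .......
        }
    }
}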

As shared earlier, there is currently no way to extract the content in the particular sequence you have mentioned. However, we are working on implementing a similar feature, i.e. DSR, and you will be notified once it is available. Please spare us a little time.
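
In the meantime, one possible workaround is to extract text, images and tables separately, wrap each result in a common type, and sort the combined list by position. The following is only a minimal sketch in plain Java, not Aspose.PDF code: PageElement, ReadingOrder and inReadingOrder are hypothetical names, and it assumes each extracted item can report its page number and the top-left corner of its bounding rectangle in default PDF coordinates (y grows upward from the bottom of the page). The coordinates in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical wrapper so that text, images and tables can share one list.
class PageElement {
    final String kind;     // e.g. "Text", "Table", "Image"
    final int pageNumber;  // 1-based page number
    final double left;     // left edge of the bounding box, in points
    final double top;      // top edge; larger values are higher on the page
    final Object payload;  // the original extracted object, whatever its type

    PageElement(String kind, int pageNumber, double left, double top, Object payload) {
        this.kind = kind;
        this.pageNumber = pageNumber;
        this.left = left;
        this.top = top;
        this.payload = payload;
    }

    @Override
    public String toString() {
        return "p" + pageNumber + " " + kind + " (top=" + top + ")";
    }
}

class ReadingOrder {
    // Page first, then top-to-bottom (descending y), then left-to-right.
    static final Comparator<PageElement> BY_POSITION = (a, b) -> {
        if (a.pageNumber != b.pageNumber) {
            return Integer.compare(a.pageNumber, b.pageNumber);
        }
        if (a.top != b.top) {
            return Double.compare(b.top, a.top);
        }
        return Double.compare(a.left, b.left);
    };

    // Returns a sorted copy; the input list is left unchanged.
    static List<PageElement> inReadingOrder(List<PageElement> extracted) {
        List<PageElement> sorted = new ArrayList<>(extracted);
        sorted.sort(BY_POSITION);
        return sorted;
    }

    public static void main(String[] args) {
        // Stand-ins for the results of separate text, image and table extraction.
        List<PageElement> extracted = Arrays.asList(
                new PageElement("Image", 1, 72, 400, null),
                new PageElement("Text", 1, 72, 520, "4.2 Export to Zip"),
                new PageElement("Table", 1, 72, 650, null),
                new PageElement("Text", 2, 72, 700, "..."),
                new PageElement("Text", 1, 72, 760, "Header"));
        for (PageElement e : inReadingOrder(extracted)) {
            System.out.println(e);
        }
    }
}
```

This simple ordering would not handle multi-column layouts, rotated pages, or items on the same row whose top edges differ slightly; a tolerance on the y comparison would likely be needed for real documents.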

We are sorry for the inconvenience.

ruhongcai · October 10, 2018, 4:10pm

Thanks!

Ruhong