Is there any way to read a given PDF sequentially, from beginning to end, by “unit/element/component”?
The return value could be a List of Object, where each Object is a “Text”, “Table”, “Image”, etc. component of the PDF file.
Would you please share some more details, such as which environment you are working in, e.g. Java or .NET? It would also be helpful if you could share a sample PDF document along with the expected output file or a screenshot. We will test the scenario in our environment and address it accordingly.
At the moment, you can extract images and text from PDF document using Aspose.PDF API. However, we are trying to apply DSR (document structure recognition) in order to get these elements, and a ticket has been logged in our issue tracking system for this as PDFJAVA-38058. We have linked the ticket with this thread so that you will receive a notification once the feature is available. Please spare us a little time.
Thanks for the information. When you “extract images and text from PDF document using Aspose.PDF API”, does it return them in the same order as they appear in the PDF file?
What about tables and cells?
Please let me know what I can currently use to meet these requirements. Many thanks! Sample code would be appreciated.
You can extract Text, Images, Annotations and Table data from a PDF document using Aspose.PDF for Java. While iterating through Page Collection of a document, you will get respective data in sequence. You can find code examples of the basic functionality for extracting the mentioned data from a PDF document at the given links. If you need any further assistance, please feel free to let us know.
From the API documentation, it mostly just extracts Text, Images, Annotations and Tables individually.
What I need is to dynamically read all elements from a Page and do some comparison work.
Table and Image both extend com.aspose.pdf.BaseParagraph,
however TextElement extends com.aspose.pdf.Element.
So how can I get a Collection of xxx from a Page, where xxx = <Text, Image, Table, etc.>,
in sequence?
“While iterating through Page Collection of a document, you will get respective data in sequence”: could you show sample code for how to do this?
Please note that you can iterate through the Page Collection as follows:
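import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.PageCollection;

// dataDir is the folder that holds the input file
Document pdfDocument = new Document(dataDir + "05.pdf");
PageCollection pageCollection = pdfDocument.getPages();
for (Page page : pageCollection) {
    // Pages are visited in document order
    System.out.println(page.getNumber());
    // Extract Text from current page
    //   ... code for extracting text
    // Extract Images from current page
    //   ... code for extracting images
    // .......
}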
As shared earlier, there is currently no way to extract the content in the particular sequence you have mentioned. However, we are working on implementing a similar feature, i.e. DSR, and you will be notified once it is available. Please spare us a little time.
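Until such a feature exists, one possible workaround is to extract each kind of element separately and merge the results yourself. The sketch below is only an illustration under stated assumptions and is not part of the Aspose.PDF API: PageElement and inReadingOrder are hypothetical names, and it assumes you can record, for every extracted item, its page number and the position of its upper-left corner in PDF coordinates (origin at the bottom-left of the page). It sorts by page, then top to bottom, then left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {

    // Hypothetical wrapper (not an Aspose.PDF type): one extracted item plus its position.
    static final class PageElement {
        final String kind;     // "Text", "Image", "Table", ...
        final int pageNumber;  // 1-based page number
        final double top;      // y of the upper edge; larger means higher on the page
        final double left;     // x of the left edge
        final Object payload;  // the extracted object itself (assumed to be supplied by the caller)

        PageElement(String kind, int pageNumber, double top, double left, Object payload) {
            this.kind = kind;
            this.pageNumber = pageNumber;
            this.top = top;
            this.left = left;
            this.payload = payload;
        }
    }

    // Returns a new list ordered by page, then top-to-bottom, then left-to-right.
    static List<PageElement> inReadingOrder(List<PageElement> elements) {
        List<PageElement> sorted = new ArrayList<>(elements);
        sorted.sort(Comparator
                .comparingInt((PageElement e) -> e.pageNumber)
                .thenComparing(Comparator.comparingDouble((PageElement e) -> e.top).reversed())
                .thenComparingDouble(e -> e.left));
        return sorted;
    }

    public static void main(String[] args) {
        // Items as they might arrive from separate text, image and table extraction passes.
        List<PageElement> extracted = new ArrayList<>();
        extracted.add(new PageElement("Image", 1, 400.0, 72.0, "logo"));
        extracted.add(new PageElement("Text", 2, 700.0, 72.0, "Second page heading"));
        extracted.add(new PageElement("Text", 1, 700.0, 72.0, "First page heading"));
        extracted.add(new PageElement("Table", 1, 550.0, 72.0, "price table"));

        for (PageElement e : inReadingOrder(extracted)) {
            System.out.println(e.pageNumber + " " + e.kind);
        }
        // Prints: 1 Text, 1 Table, 1 Image, 2 Text (one per line)
    }
}
```

Wrapping every item in one common type also sidesteps the fact that Table, Image and TextElement do not share a single convenient base class. Note that a plain top-to-bottom sort is only an approximation: it will not give the right order for multi-column layouts or for elements that overlap vertically.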