Extract PDF Content (Text, Image, Table, Form, etc.) Top Down in C# .NET

nnguyen9644 · March 25, 2022, 3:43pm

Hi, I’m trying to parse the content of a PDF file page by page, in a top-down fashion. I’m rather new to Aspose.PDF, so I would love some guidance on this.

Basically, I want to traverse the document page by page. For each page, I want to go from top to bottom, check whether each element is text, an image, a form, etc., and then decide what to do with it.

So far, I iterate page by page with Document.Pages, then use ParagraphAbsorber, TableAbsorber, Document.Pages.Resources.Images, and Document.Form to detect paragraphs, tables, images, and forms respectively on each page. However, I’m not sure how to determine the order of the elements on each page. I would really appreciate any help with this. Thank you!

asad.ali · March 25, 2022, 8:12pm

@nnguyen9644

Unfortunately, the API has no dedicated method for extracting PDF content in sequence. However, you can check the position of each extracted element on the page, if that is enough for you to infer the sequence. Otherwise, please share a sample PDF for our reference, and we will log an investigation ticket to analyze your requirements further and share the ticket ID with you.
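For illustration, the position-based idea could be sketched roughly as below. This is not an Aspose.PDF API: `PageElement` and `ReadingOrder.TopDown` are hypothetical names, and the sketch assumes you have already copied each extracted element's bounding box (lower-left and upper-right corners, in PDF points, with the origin at the bottom-left of the page) into such an object. The `rowHeight` snapping value is also an assumption you would need to tune.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: one extracted element plus its bounding box in
// PDF user space (origin at the bottom-left corner of the page).
public sealed class PageElement
{
    public string Kind { get; }
    public double Llx { get; }
    public double Lly { get; }
    public double Urx { get; }
    public double Ury { get; }

    public PageElement(string kind, double llx, double lly, double urx, double ury)
    {
        Kind = kind; Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }
}

public static class ReadingOrder
{
    // Orders elements top to bottom, then left to right.
    // Because Y grows upward in PDF coordinates, a larger top edge (Ury)
    // means the element is higher on the page. Top edges are snapped to a
    // grid of rowHeight points so that elements with nearly equal tops
    // usually land in the same row and are then ordered by their left edge.
    public static List<PageElement> TopDown(IEnumerable<PageElement> elements, double rowHeight = 2.0)
    {
        return elements
            .OrderByDescending(e => Math.Round(e.Ury / rowHeight))
            .ThenBy(e => e.Llx)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up boxes standing in for absorbed content on one page.
        var elements = new[]
        {
            new PageElement("Image", 50, 400, 250, 550),
            new PageElement("Table", 50, 100, 500, 380),
            new PageElement("Text",  50, 700, 300, 720),
        };

        var ordered = ReadingOrder.TopDown(elements);
        Console.WriteLine(string.Join(", ", ordered.Select(e => e.Kind)));
        // prints: Text, Image, Table
    }
}
```

A simple sort like this only approximates reading order. Multi-column layouts, overlapping elements, or rotated pages would need extra handling, such as grouping elements into columns before sorting.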

nnguyen9644 · March 28, 2022, 4:07pm

Thanks for your quick response! I can try determining the order of the extracted content on each page by its position.
Besides that, I don’t have a specific sample PDF, but the file I’m testing my code with is this: sample.pdf (69.8 KB)
Thank you!

asad.ali · March 28, 2022, 9:04pm

@nnguyen9644

We have logged a ticket as PDFNET-51574 in our issue tracking system for this feature investigation. We will look into the details and let you know if we find a feasible way to achieve what you require. Please be patient and allow us some time.

We are sorry for the inconvenience.