Extract content from PDF with formatting

We are looking for a solution that can extract the content of a PDF along with the formatting and styles applied to it (for example: Heading-1, bullets, tables, images).

Please let us know whether the commercial version of Aspose.PDF can retrieve this information.

@pradeep.verma3

Aspose.PDF offers a way to extract tables and images from PDF documents. As far as bullets and headings are concerned, it depends on how they were added to the document. Would you please explain your requirements in a bit more detail? Do you only want to extract the information from the PDF document and store it somewhere, or do you need to generate another copy of the PDF with similar formatting and style?

Please share a sample source PDF along with your expected output document. We will check the relevant details on our end and share our feedback with you accordingly.

Thank you for your response.

Our requirement is to get the contents of the PDF along with their formatting. The PDF file to be parsed will be created from a DOCX file that has various styles and formatting applied.

We will build another process that uses the hierarchy of the content inside the PDF to render it on another platform (not as a generated file), so we also need the sequence of the contents.
For example, if Heading-1 contains paragraphs and images, and Heading-2 contains list paragraphs, then we need to collect all of these in the same sequence in which they appear in the PDF file.

I have attached a sample DOCX file with the original formatting, and the PDF file converted from that same DOCX.

The expected output should look like this:
Heading-1: Sample Heading With Size One
List Paragraph: This is list 1
List Paragraph: This is list 2
Image: image001.png
Heading-2: Sample Heading With Size Two
List Paragraph: List-1 of H-2
List Paragraph: List-2 of H-2
Image: image002.jpg
Heading-3: Sample Heading With Size Three
List Paragraph: List-1 of H-3
List Paragraph: List-2 of H-3
Rectangle: Shape-1
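
To make the required structure concrete, here is a minimal Python sketch of how the sequence above could be grouped under its headings on our side. The `Section` record and the `group_by_heading` helper are hypothetical names used only for illustration; they are not part of Aspose.PDF or any other library, and the sketch assumes the labelled lines are already available in document order.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record: one heading plus the items that follow it.
@dataclass
class Section:
    level: int
    title: str
    items: List[Tuple[str, str]] = field(default_factory=list)

HEADING = re.compile(r"^Heading-(\d+)$")

def group_by_heading(lines: List[str]) -> List[Section]:
    """Group 'Kind: text' lines under the most recent heading, keeping order."""
    sections: List[Section] = []
    for line in lines:
        kind, _, text = line.partition(": ")
        match = HEADING.match(kind)
        if match:
            sections.append(Section(int(match.group(1)), text))
        elif sections:
            # Items that appear before any heading are ignored in this sketch.
            sections[-1].items.append((kind, text))
    return sections

expected = [
    "Heading-1: Sample Heading With Size One",
    "List Paragraph: This is list 1",
    "List Paragraph: This is list 2",
    "Image: image001.png",
    "Heading-2: Sample Heading With Size Two",
    "List Paragraph: List-1 of H-2",
    "List Paragraph: List-2 of H-2",
    "Image: image002.jpg",
    "Heading-3: Sample Heading With Size Three",
    "List Paragraph: List-1 of H-3",
    "List Paragraph: List-2 of H-3",
    "Rectangle: Shape-1",
]

sections = group_by_heading(expected)
for section in sections:
    print(section.level, section.title, section.items)
```

The important point is that both the heading level and the order of the items under each heading are preserved.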

sample.zip (651.1 KB)

Please advise on how to get this content back from the PDF.

@pradeep.verma3

Thanks for further elaborating your requirements.

We have checked the files you shared, and we are afraid that extracting content heading by heading is not possible with Aspose.PDF. The reason is that a PDF stores content differently from the source DOCX: an existing PDF document contains no separate element for a header, footer, or heading. Such elements can only be defined and specified at the time the PDF is generated.

However, Aspose.PDF does offer the capability to extract text (every heading, header, and footer is recognized as plain text only), images, paragraphs, and tables from an existing PDF document. Classifying the extracted content by its parent heading would not be possible. In case you need further information, please feel free to let us know.
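
Because the PDF itself carries no heading markers, any heading classification has to be inferred from the extracted text. As a rough illustration only, the Python sketch below guesses heading levels from font size. It is not an Aspose.PDF feature or API: the `Fragment` record, the `label_by_font_size` helper, and the font sizes in the example are all assumptions made for this sketch, and it presumes that each extracted text fragment comes with its font size and in reading order.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record for one extracted text fragment.
@dataclass
class Fragment:
    text: str
    font_size: float  # in points; assumed to be reported by the extractor

def label_by_font_size(fragments: List[Fragment]) -> List[Tuple[str, str]]:
    """Heuristic: the most common font size is body text, larger sizes are headings."""
    if not fragments:
        return []
    sizes = [f.font_size for f in fragments]
    body_size = Counter(sizes).most_common(1)[0][0]
    # Largest size becomes Heading-1, the next largest Heading-2, and so on.
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {size: index + 1 for index, size in enumerate(heading_sizes)}
    labelled = []
    for f in fragments:
        if f.font_size in level_of:
            labelled.append((f"Heading-{level_of[f.font_size]}", f.text))
        else:
            labelled.append(("Paragraph", f.text))
    return labelled

# Illustrative input only; the font sizes are invented for this example.
fragments = [
    Fragment("Sample Heading With Size One", 16.0),
    Fragment("This is list 1", 11.0),
    Fragment("This is list 2", 11.0),
    Fragment("Sample Heading With Size Two", 13.0),
    Fragment("List-1 of H-2", 11.0),
    Fragment("List-2 of H-2", 11.0),
]

labels = label_by_font_size(fragments)
for kind, text in labels:
    print(f"{kind}: {text}")
```

A heuristic like this cannot tell a list paragraph from a normal paragraph, and it will mislabel documents where headings differ from body text only by weight or color, so the results would need to be verified against your own files.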