Extract all Header Text and Level from PDF Document

ln22 · April 30, 2025, 9:22pm

Please provide code to parse all header text from an existing PDF document.

asad.ali · May 1, 2025, 12:09pm

There are no such entities as headers and footers in the PDF file format. The header and footer classes are provided in the API only for generation purposes: you can use them to specify a header or footer when generating a PDF document. Once the PDF file is saved, the header and footer become part of the rest of the content. They do not exist as separate entities, so they cannot be extracted from an existing document.

ln22 · May 1, 2025, 2:13pm

Sorry if my question was not clear. I am asking about section headers within the main body text itself: the title of each section and its level, like Heading1, Heading2, Heading3, where Heading2 is nested within Heading1 and Heading3 within Heading2. Can you think of any way to reliably extract section headers/titles from PDFs with Aspose PDF?
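
To make the nesting concrete, here is a minimal sketch in plain Python of the structure being asked for. It assumes the headings have already been detected somehow and are available as (level, title) pairs in document order. The function name `build_outline` and the dict layout are hypothetical and are not part of any Aspose API.

```python
def build_outline(headings):
    """Nest (level, title) pairs into a tree of dicts.

    Assumes levels are integers >= 1 (1 = Heading1, 2 = Heading2, ...)
    and that the pairs arrive in document order.
    """
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent heading
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Pop until the top of the stack is a shallower heading: that is the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


outline = build_outline([
    (1, "Heading1 title"),
    (2, "Heading2 title"),
    (3, "Heading3 title"),
])
```

Here `outline` holds one Heading1 node whose `children` list contains the Heading2 node, which in turn contains the Heading3 node.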

asad.ali · May 1, 2025, 7:26pm

@ln22

We need to investigate whether this is possible using the API. Can you please share a sample PDF for our reference so that we can log an investigation ticket and share the ID with you?

ln22 · May 1, 2025, 7:29pm

EU_proposed_AI_regulation_40_Pages.pdf (977.1 KB)

Examples of headers:

EXPLANATORY MEMORANDUM
1. CONTEXT OF THE PROPOSAL
1.1. Reasons for and objectives of the proposal
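
The three sample headings above suggest one possible heuristic based on text alone: an unnumbered all-caps line is a top-level title, and each dotted number component adds one level. The sketch below is plain Python operating on text lines that are assumed to have been extracted from the PDF already. It is not an Aspose API, the name `heading_level` is hypothetical, and the rules are guessed from these three examples only, so other documents would likely need different patterns.

```python
import re

# Leading dotted section number with a trailing dot, such as "1." or "1.1."
# (convention assumed from the sample headings).
_NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.\s+\S")


def heading_level(line):
    """Return a guessed heading level for a text line, or None if it does not look like a heading."""
    text = line.strip()
    if not text:
        return None
    match = _NUMBERING.match(text)
    if match:
        # "1." gives level 2 and "1.1." gives level 3: one level per number
        # component, offset by one because unnumbered all-caps titles are level 1.
        return match.group(1).count(".") + 2
    letters = [c for c in text if c.isalpha()]
    # Short lines whose letters are all upper case are treated as level-1 titles.
    if letters and all(c.isupper() for c in letters) and len(text) <= 80:
        return 1
    return None


samples = [
    "EXPLANATORY MEMORANDUM",
    "1. CONTEXT OF THE PROPOSAL",
    "1.1. Reasons for and objectives of the proposal",
]
levels = [heading_level(s) for s in samples]
```

A text-only rule like this will misfire on numbered list items and all-caps body lines, so in practice it would probably need to be combined with layout cues such as font size or weight.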

asad.ali · May 2, 2025, 7:25am

@ln22

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFPYTHON-396

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.