Extract PDF paragraph headers and their content

sanjaybk · March 28, 2022, 5:09pm

Hello,

I’m trying to extract PDF paragraph or section headers together with their contents. Can you please help me extract the section/paragraph headers and their contents from the attached document (page 3)?

Also, I have attached a screenshot of an Excel file. That is my desired output, where each header appears alongside its content.

Thank you.

Etract.JPG (203.0 KB)
Testing_Headers.pdf (300.4 KB)

asad.ali · March 28, 2022, 8:10pm

@sanjaybk

Do you have predefined heading values? For example, there is a heading “Local and TEFRA Approvals” in the shared PDF. Is its value already known, so that you only need to extract its respective paragraph? Or do you also need to determine the headings in the PDF automatically and extract their respective content?

sanjaybk · March 28, 2022, 8:39pm

@asad.ali,

Good question. Sorry, I wouldn’t know the headers at all. I need to determine the headings and extract their paragraph(s) automatically.

asad.ali · March 29, 2022, 12:22am

@sanjaybk

It is not possible to distinguish headings from other text content in the PDF. Please note that headings only exist as such at the time of PDF generation; once the document is generated, they become part of the content just like any other text item. So, we are afraid that it would not be feasible to determine headings automatically in an existing PDF. In case you have further inquiries, please feel free to let us know.

sanjaybk · March 29, 2022, 1:29pm

@asad.ali,
Thanks. If I know the header names beforehand, can I then extract their contents/paragraphs?

asad.ali · March 29, 2022, 9:12pm

@sanjaybk

In that case, you can extract the complete text from the page using TextAbsorber and perform string operations to separate the paragraphs based on the headings.
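
As a rough illustration of the string-operations step only, here is a minimal Python sketch. It assumes the page text has already been extracted (for example with TextAbsorber) into a plain string, and that each known heading appears on a line of its own. The function name split_by_headings and the sample text are hypothetical and are not part of the Aspose.PDF API; only the heading “Local and TEFRA Approvals” comes from this thread.

```python
import re


def split_by_headings(page_text, headings):
    """Map each known heading to the text between it and the next heading.

    Hypothetical helper: page_text is assumed to be the already-extracted
    text of one page, and headings is the list of known heading strings.
    """
    found = []
    for heading in headings:
        # Match the first occurrence of the heading on a line by itself.
        pattern = r"^[ \t]*" + re.escape(heading) + r"[ \t\r]*$"
        match = re.search(pattern, page_text, flags=re.MULTILINE)
        if match:
            found.append((match.start(), match.end(), heading))

    # Order headings by where they occur in the text.
    found.sort()

    sections = {}
    for i, (_start, end, heading) in enumerate(found):
        # A section runs until the next found heading or the end of the text.
        next_start = found[i + 1][0] if i + 1 < len(found) else len(page_text)
        body = page_text[end:next_start]
        # Collapse line breaks and repeated spaces into single spaces.
        sections[heading] = " ".join(body.split())
    return sections


if __name__ == "__main__":
    # Made-up sample text standing in for the extracted page text.
    sample = (
        "Local and TEFRA Approvals\n"
        "First line of the paragraph\n"
        "second line of the paragraph.\n"
    )
    for name, content in split_by_headings(sample, ["Local and TEFRA Approvals"]).items():
        print(name, "->", content)
```

Headings that are not found in the text are simply skipped. Each resulting heading/content pair could then be written out as one row of the desired spreadsheet.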