Extract text from PDF section wise using Aspose.PDF

Dilip0527 · December 29, 2020, 10:23am

Hi, I have a PDF document and I need to extract its content in the following ways:

Extract the TOC alone
Extract the whole text or content from every section in the PDF

Can you guide me how to do that?

Regards
Dilip Kumar K

asad.ali · December 30, 2020, 9:31pm

There is no distinction between TOC elements and other links inside an existing PDF document, as they are only defined at the time of PDF generation. Furthermore, you can extract text on the basis of regular expressions using Aspose.PDF. Would you please share your sample PDF document with us so that we can check it further and respond accordingly?

Dilip0527 · December 31, 2020, 5:25am

Symbiance-001.pdf (9.6 MB)

Hi Asad,

Please find the attached PDF document.

asad.ali · December 31, 2020, 8:13pm

@Dilip0527

Thanks for sharing the document.

Would you please also explain a bit more about which sections in particular you want to extract? Do you mean pages by sections?

Dilip0527 · January 2, 2021, 4:29am

Yes Asad, I will explain. By section-wise content I mean, for example, the Title Page section in the document I shared in my previous response: I need to extract the complete content that is present in the title page alone. Likewise, I am expecting the same for all sections.

Thanks
Dilip Kumar K

asad.ali · January 4, 2021, 4:28pm

@Dilip0527

There are no such sections defined in an existing PDF. Please note that headings, titles, headers and footers are specified only at the time of PDF generation; once the PDF is saved, they become simple text elements. So, in your case, you can extract all text from the PDF document and perform some string operations on the extracted text to find your desired section. Please check the following sample code snippet, where text is extracted for the first section, i.e. TITLE PAGE:

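using Aspose.Pdf;
using Aspose.Pdf.Text;

// Extract the text of every page into a single string.
Document doc = new Document(dataDir + "Symbiance-001.pdf");
TextAbsorber absorber = new TextAbsorber();
doc.Pages.Accept(absorber);
string extractedText = absorber.Text;

// Take the text between the "TITLE PAGE" heading and the next heading.
// Both headings are assumed to be present in the extracted text.
int start = extractedText.IndexOf("TITLE PAGE") + "TITLE PAGE".Length;
int end = extractedText.IndexOf("CLINICAL PROTOCOL SYNOPSIS", start);
string titlePageText = extractedText.Substring(start, end - start).Trim();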
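
The same string-based approach could be generalized to every section. The following is only a minimal sketch under stated assumptions: `SectionSplitter` and `Split` are hypothetical names (not part of Aspose.PDF), the heading strings are known in advance and supplied in document order, and each heading appears in the extracted text exactly as written. It works on plain text, so it can be applied to whatever string the text extraction produced.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.PDF: splits already-extracted text
// into sections using an ordered list of known heading strings.
public static class SectionSplitter
{
    public static Dictionary<string, string> Split(string text, IList<string> headings)
    {
        var sections = new Dictionary<string, string>();
        int searchFrom = 0;

        for (int i = 0; i < headings.Count; i++)
        {
            // Locate the current heading at or after the previous section's body.
            int headingPos = text.IndexOf(headings[i], searchFrom, StringComparison.Ordinal);
            if (headingPos < 0)
            {
                continue; // Heading not found: skip it instead of throwing.
            }

            int bodyStart = headingPos + headings[i].Length;

            // The section ends at the next heading that can be found,
            // or at the end of the text if none of the later headings exist.
            int bodyEnd = text.Length;
            for (int j = i + 1; j < headings.Count; j++)
            {
                int nextPos = text.IndexOf(headings[j], bodyStart, StringComparison.Ordinal);
                if (nextPos >= 0)
                {
                    bodyEnd = nextPos;
                    break;
                }
            }

            // Trim removes the line breaks that surround the section body.
            sections[headings[i]] = text.Substring(bodyStart, bodyEnd - bodyStart).Trim();
            searchFrom = bodyStart;
        }

        return sections;
    }
}
```

For example, calling `SectionSplitter.Split(extractedText, new List<string> { "TITLE PAGE", "CLINICAL PROTOCOL SYNOPSIS" })` would return one entry per heading that was found. Note that the first occurrence of each heading wins, so if the same heading text also appears earlier in the document (for instance in a table of contents), the search start position would need to be adjusted.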