Extract text from the header, footer and body

karine_87 · November 19, 2014, 7:51am

Hello,

We are using aspose-pdf-9.5.2.jar with the code below to extract the text of a PDF file:

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(this.path);
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();

But I can't find how to extract the text of the header on the first page, the footer on the last page, and the body of the file.
Can you advise, please?

codewarior · November 20, 2014, 6:42am

Hi Karine,

Thanks for using our APIs.

To get the text from the header/footer area of a PDF file, please try the following code snippet.

[Java]

com.aspose.pdf.facades.PdfContentEditor pce = new com.aspose.pdf.facades.PdfContentEditor();

pce.bindPdf("input.pdf");

com.aspose.pdf.facades.StampInfo[] infos = pce.getStamps(1);

// Java arrays are zero-based, so iterate from 0 to length - 1

for (int counter = 0; counter < infos.length; counter++)

{ System.out.println(infos[counter].getText()); }

However, if you need to get the text of a header/footer added with Adobe Acrobat (not stamps added by Aspose software), you may consider using the page.getArtifacts() method to read the header and footer artifacts on the page.

[Java]

// bind source PDF file

com.aspose.pdf.Document doc = new com.aspose.pdf.Document("input.pdf");

// get artifacts collection of first page

com.aspose.pdf.ArtifactCollection artifact = doc.getPages().get_Item(1).getArtifacts();

java.util.Iterator fi = artifact.iterator();

while (fi.hasNext())

{

// call next() only once per iteration; every call advances the iterator

com.aspose.pdf.Artifact current = (com.aspose.pdf.Artifact) fi.next();

if (current.getSubtype() == Artifact.ArtifactSubtype.Header || current.getSubtype() == Artifact.ArtifactSubtype.Footer)

// print text of artifact

System.out.println(current.getText());

}

codewarior · November 20, 2014, 6:48am

Hi Karine,

If you need to extract the contents of a PDF file without the contents of the header/footer, please try the following code snippet.

[Java]

//open document

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("input.pdf");

//create TextAbsorber object to extract text

com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

textAbsorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(

pdfDocument.getPages().get_Item(1).getRect().getLLX(),

pdfDocument.getPages().get_Item(1).getRect().getLLY() + 20,

pdfDocument.getPages().get_Item(1).getRect().getURX(),

pdfDocument.getPages().get_Item(1).getRect().getURY() - 20));

//accept the absorber for all the pages

pdfDocument.getPages().accept(textAbsorber);

//get the extracted text

String extractedText = textAbsorber.getText();

//create a writer and open the file

BufferedWriter writer = new BufferedWriter(new FileWriter(new java.io.File("c:/pdftest/ExtractedText.txt")));

//write extracted contents

writer.write (extractedText);

//Close writer

writer.close();

karine_87 · December 29, 2014, 10:20am

Hi Nayyer,
Thank you for your response.
Your code to extract the contents of the PDF file without the header/footer contents works well.
But I am still having problems with the header/footer.
The issue is that the headers and footers of my file were added neither with Adobe Acrobat nor with Aspose software.
For example, in the attached PDF file I need to extract
the header as: "Guide [TAG] Category management Administration Initialization"
the footer as: "© EVER TEAM 2013 – EverSuite 5.1 – Ref: ES510 TAG.V1 3".
I also tried the rectangle approach with TextAbsorber, but I couldn't find the correct X and Y values to use.

I would appreciate it if you could help me.
Thank you,
Karine

codewarior · December 30, 2014, 12:35am

Hi Karine,

Thanks for sharing the details.

I have tested the scenario and I am able to reproduce the same problem. To have it corrected, I have logged it in our issue tracking system as PDFNEWJAVA-34642. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for your inconvenience.
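
In the meantime, if you are still experimenting with the rectangle approach, the arithmetic below may help with picking coordinates. It is only a sketch in plain Java with no Aspose dependency: the PageBands and Box names are hypothetical, and the band heights (50 and 40 points here) are assumptions you would need to tune for your file. PDF user space has its origin at the lower-left corner of the page, so the header band sits at the top (high Y values) and the footer band at the bottom (low Y values).

```java
/** Hypothetical helper: splits a page box into header, body and footer bands. */
final class PageBands {

    /** Rectangle in PDF user space: origin at the lower-left corner, units in points. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        @Override
        public String toString() {
            return "[" + llx + ", " + lly + ", " + urx + ", " + ury + "]";
        }
    }

    /** Top strip of the page, 'height' points tall. */
    static Box header(Box page, double height) {
        return new Box(page.llx, page.ury - height, page.urx, page.ury);
    }

    /** Bottom strip of the page, 'height' points tall. */
    static Box footer(Box page, double height) {
        return new Box(page.llx, page.lly, page.urx, page.lly + height);
    }

    /** Everything between the footer strip and the header strip. */
    static Box body(Box page, double headerHeight, double footerHeight) {
        return new Box(page.llx, page.lly + footerHeight,
                       page.urx, page.ury - headerHeight);
    }

    public static void main(String[] args) {
        // Roughly an A4 page in points; a real page box would come from the document.
        Box page = new Box(0, 0, 595, 842);
        System.out.println("header: " + header(page, 50));
        System.out.println("body:   " + body(page, 50, 40));
        System.out.println("footer: " + footer(page, 40));
    }
}
```

The four values of each Box could then be passed to the com.aspose.pdf.Rectangle constructor in the same order as in the earlier snippet (LLX, LLY, URX, URY), widening or narrowing the band heights until only the header or footer text comes back.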

ratnakar · June 18, 2018, 6:53am

Hi Nayyer,

Can you please let me know how to extract the header and footer from a PDF using C#?
I am waiting for a quick response.

Regards,
Ratnakar

ratnakar · June 18, 2018, 6:55am

Hi All,

Does anybody know how to extract the header and footer from a PDF?

Regards,
Ratnakar

asad.ali · June 18, 2018, 12:23pm

@ratnakar

Thanks for contacting support.

Please note that header/footer contents cannot be extracted separately from a PDF document. Once a header/footer is added during PDF generation and the document is saved, it is merged with the other content of the PDF document, and there is no separate header or footer. However, with reference to the ticket logged above, i.e. PDFJAVA-34642, further investigation is still pending. As soon as we have some news on the investigation progress, we will let you know. Please spare us a little time.

We are sorry for the inconvenience.

ratnakar · June 18, 2018, 1:03pm

Thanks, Asad, for the information!

Can we ignore the header and footer while extracting content from a PDF?
Or at least delete the header and footer?

Here is the code I am using to extract data:

TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);

This is the code to extract data from the PDF.

asad.ali · June 18, 2018, 7:44pm

@ratnakar

Thanks for your inquiry.

As shared earlier, once the text is extracted from a PDF document, there is no differentiation between header/footer and main (body) content. However, as soon as we have some information in this regard after investigating the earlier logged ticket, we will let you know. Please be patient and spare us a little time.

We are sorry for the inconvenience.
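
As a possible workaround in the meantime, one heuristic is to extract the text page by page and drop lines that recur on most pages, since running headers and footers usually repeat. The sketch below is plain Java (the same logic ports to C#) and does not use any Aspose API: RepeatedLineFilter, stripRepeatedLines and the minShare threshold are hypothetical names, and it assumes you already hold the text of each page as a separate string. Digit runs are normalised so that a footer with a changing page number still counts as the same line.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: removes lines that repeat across most pages. */
final class RepeatedLineFilter {

    /** Trim the line and replace digit runs with '#', so "Ref 3" and "Ref 4" match. */
    static String key(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /**
     * Returns the page texts with every line removed whose normalised form
     * occurs on at least minShare (0..1) of all pages.
     */
    static List<String> stripRepeatedLines(List<String> pages, double minShare) {
        // With fewer than two pages nothing can be called "repeated".
        if (pages.size() < 2) {
            return new ArrayList<>(pages);
        }

        // Count on how many distinct pages each normalised line occurs.
        Map<String, Integer> pageCount = new HashMap<>();
        for (String page : pages) {
            Set<String> seenOnPage = new HashSet<>();
            for (String line : page.split("\\R")) {
                String k = key(line);
                if (!k.isEmpty() && seenOnPage.add(k)) {
                    pageCount.merge(k, 1, Integer::sum);
                }
            }
        }

        // Rebuild each page, keeping only lines below the threshold.
        List<String> result = new ArrayList<>();
        for (String page : pages) {
            StringBuilder kept = new StringBuilder();
            for (String line : page.split("\\R")) {
                Integer n = pageCount.get(key(line));
                boolean repeated = n != null && n >= minShare * pages.size();
                if (!repeated) {
                    kept.append(line).append('\n');
                }
            }
            result.add(kept.toString());
        }
        return result;
    }
}
```

Note that this also drops genuine body lines that happen to repeat across pages, and it cannot catch a header that appears on only one page, so please treat it as a starting point rather than a reliable separation of header, footer and body.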